Vehicle Forecast: Sales Expected to Exceed 17 million SAAR in September

By | ai, bigdata, machinelearning


The automakers will report September vehicle sales on Tuesday, October 3rd.

Note: There were 26 selling days in September 2017, there were 25 in September 2016.

From WardsAuto: Forecast: SAAR Expected to Surpass 17 Million in September

A WardsAuto forecast calls for U.S. light-vehicle sales to reach a 17.5 million-unit seasonally adjusted annual rate in September, following August’s 16.0 million SAAR and ending a 6-month streak of sub-17 million figures. In same-month 2016, the SAAR reached 17.6 million.

Preliminary assumptions pointed to October, rather than September, as the turning point for the market, as consumers replace vehicles lost due to natural disasters and automakers push sales to clear out excess model-year ’17 stock. However, the winds have already begun to turn, and September sales will be significantly higher than originally expected.
emphasis added

Sales have been below 17 million SAAR for six consecutive months.


Source link

Stan Weekly Roundup, 22 September 2017

By | ai, bigdata, machinelearning

(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

This week (and a bit from last week) in Stan:

  • Paul-Christian Bürkner‘s paper on brms (a higher-level interface to RStan, which preceded rstanarm and is still widely used and recommended by our own devs) was just published as a JStatSoft article. If you follow the link, the abstract explains what brms does.

  • Ben Goodrich and Krzysztof Sakrejda have been working on standalone functions in RStan. The bottleneck has been random number generators. As an application, users want to write Stan functions that they can use for efficient calculations inside R; it’s easier than C++, and we have a big stats library with derivatives backing it up (see the short R sketch after this list).

  • Ben Goodrich has also been working on multi-threading capabilities for RStan.

  • Sean Talts has been continuing the wrestling match with continuous integration. Headaches continue from new compiler versions, Travis timeouts, and fragile build scripts.

  • Sean Talts has been working with Sebastian on code review for MPI. The goal is to organize the code so that it’s easy to test and maintain (the two go together in well-written code, along with readability and crisp specs for the API boundaries).

  • Sean Talts has been working on his course materials for Simulation-Based Algorithmic Calibration (SBAC), the new name for applying the diagnostic of Cook et al.

  • Bob Carpenter has been working on writing a proper language spec for Stan looking forward to tuples, ragged and sparse structures, and functions for Stan 3. That’s easy; the denotational semantics will be more challenging as it has to be generic in terms of types and discuss how Stan compiles to a function with transforms.

  • Bob Carpenter has also been working on getting variable declarations through the model concept. After that, a simple base class and constant correctness for the model class to make it easier to use outside of Stan’s algorithms.

  • Michael Betancourt and Sean Talts have been prepping their physics talks for the upcoming course at MIT. There are post-it notes at metric heights in our office, and they filmed themselves dropping a ball next to a phone’s stopwatch (clever! hope that’s not too much of a spoiler for the course).

  • Michael Betancourt is also working on organizing the course before StanCon that’ll benefit NumFOCUS.

  • Jonah Gabry and T.J. Mahr released bayesplot 1.4 with some new visualizations from Jonah’s paper.

  • Jonah Gabry is working on Mac/C++ issues with R; the problem has gotten deep enough that he has had to communicate with the R devs themselves.

  • Lauren Kennedy has joined the team at Columbia, taking over for Jonah Gabry at the population research center in the School of Social Work; she’ll be focusing on population health. She’ll also be working with us (specifically with Andrew Gelman) on survey weighting and multilevel regression and poststratification with rstanarm.

  • Dan Simpson has been working on sparse autodiff architecture and talking with Aki and me about parallelization.

  • Dan Simpson and Andrew Gelman have been plotting how to marginalize out random effects in multilevel linear regression.

  • Andrew Gelman has been working with Jennifer Hill and others on revising his and Jennifer’s regression book. It’ll come out in two parts, the first of which is (tentatively?) titled Regression and Other Stories. Jonah Gabry has been working on the R packages for it and beefing up bayesplot and rstanarm to handle it.

  • Andrew Gelman is trying to write all the workflow stuff down with everyone else including Sean Talts, Michael Betancourt, and Daniel Simpson. As usual, a simple request from Andrew to write a short paper has spun out into real developments on the computational and methodological front.

  • Aki Vehtari, Andrew Gelman and others have a new version of the expectation propagation (EP) paper; we’re excited about the data parallel aspects of this.

  • Aki Vehtari gave a talk on priors for GPs at the summer school for GPs in Sheffield. He reports there were even some Stan users there using Stan for GPs. Their lives should get easier over the next year or two.

  • Aki Vehtari reports there are 190 students in his Bayes class in Helsinki!

  • Michael Betancourt, Dan Simpson, and Aki Vehtari wrote comments on a paper about frequentist properties of horseshoe priors. Aki’s revised horseshoe prior paper has also been accepted.

  • Ben Bales wrote some generic array append code and also some vectorized random number generators which I reviewed and should go in soon.

  • Bill Gillespie is working on a piecewise linear interpolation function for Stan’s math library; he already added it to Torsten in advance of the ACoP tutorial he’s doing next month. He’ll be looking at a 1D integrator as an exercise, picking up from where Marco Inacio left off (it’s based on some code by John Cook).

  • Bill Gillespie is trying to hire a replacement for Charles Margossian at Metrum. He’s looking for someone who wants to work on Stan and pharmacology, preferably with experience in C++ and numerical analysis.

  • Krzysztof Sakrejda started a postdoc working on statistical modeling for global scale demographics for reproductive health with Leontine Alkema at UMass Amherst.

  • Krzysztof Sakrejda is working on simplifying the makefiles and, in the process, inadvertently solved some of our clang++ compiler issues for CmdStan.

  • Matthijs Vákár got a pull request in for GLMs to speed up logistic regression by a factor of four or so by introducing analytic derivatives.

  • Matthijs Vákár is also working on higher-order imperative semantics for probabilistic programming languages like Stan.

  • Mitzi Morris finished the last changes for the pull request on the base expression type refactor (this will pave the way for tuples, sparse matrices, ragged arrays, and functional types—hence all the semantic activity).

  • Mitzi Morris is also refactoring the local variable type inference system to squash a meta-bug that surfaced with ternary operators and will simplify the code.

  • Charles Margossian is finishing a case study on the algebraic solver to submit for the extended StanCon deadline, all while knee-deep in first-year grad student courses in measure theory and statistics.

  • Breck Baldwin and others have been talking to DataCamp (hi, Rasmus!) and Coursera. We’ll be getting some Stan classes out over the next year or two. Coordinating with DataCamp is easy, Coursera plus Columbia less so.
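
On the RStan standalone-functions item above: the sketch below only illustrates the kind of workflow that work enables, using rstan's existing expose_stan_functions(); the file name and the function inside it are hypothetical, made up for this example.

library(rstan)

# my_functions.stan (hypothetical) contains only a functions block, e.g.:
#
#   functions {
#     real inv_logit_scaled(real x, real lb, real ub) {
#       return lb + (ub - lb) * inv_logit(x);
#     }
#   }

# Compile the functions block and export its functions into the R session
expose_stan_functions("my_functions.stan")

# The Stan function is now callable directly from R
inv_logit_scaled(0.3, 0, 10)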

The post Stan Weekly Roundup, 22 September 2017 appeared first on Statistical Modeling, Causal Inference, and Social Science.

Please comment on the article here: Statistical Modeling, Causal Inference, and Social Science

The post Stan Weekly Roundup, 22 September 2017 appeared first on All About Statistics.




Source link


Tutorial: Launch a Spark and R cluster with HDInsight

By | ai, bigdata, machinelearning

If you'd like to get started using R with Spark, you'll need to set up a Spark cluster and install R and all the other necessary software on the nodes. A really easy way to achieve that is to launch an HDInsight cluster on Azure, which is just a managed Spark cluster with some useful extra components. You'll just need to configure the components you want: in our case, R and Microsoft R Server, plus RStudio Server.

This tutorial explains how to launch an HDInsight cluster for use with R: how to size and launch the cluster, connect to it via SSH, install Microsoft R Server (with R) on each of the nodes, and install RStudio Server Community Edition as an IDE on the edge node. (If you find you need a larger or smaller cluster after you've set it up, it's easy to resize it dynamically.) Once you have the cluster up and running, you can start using R with Spark directly from the edge node.
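
For example, here is a rough sketch of a first session on the cluster. It assumes the sparklyr and nycflights13 packages are installed on the edge node and that Spark is reachable through YARN; the tutorial itself focuses on Microsoft R Server, so treat this only as one illustration of running R against Spark.

library(sparklyr)
library(dplyr)

# Connect to the cluster's Spark instance from the edge node
sc <- spark_connect(master = "yarn-client")

# Copy a small sample data set to Spark and run a simple distributed query
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
flights_tbl %>%
  group_by(origin) %>%
  summarise(n = n())

spark_disconnect(sc)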

To get started with R and Spark, follow the instructions for setting up a HDInsight cluster at the link below.

Microsoft Azure Documentation: Get started using R Server on HDInsight


Source link

Multi-Dimensional Reduction and Visualisation with t-SNE

By | ai, bigdata, machinelearning

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

t-SNE is a very powerful technique that can be used for visualising (looking for patterns in) multi-dimensional data. Great things have been said about this technique. In this blog post I did a few experiments with t-SNE in R to learn about this technique and its uses.

Its power to visualise complex multi-dimensional data is apparent, as well as its ability to cluster data in an unsupervised way.

What’s more, it is also quite clear that t-SNE can aid machine learning algorithms when it comes to prediction and classification. But the inclusion of t-SNE in machine learning algorithms and ensembles has to be ‘crafted’ carefully, since t-SNE was not originally intended for this purpose.

All in all, t-SNE is a powerful technique that merits due attention.

t-SNE

Let’s start with a brief description. t-SNE stands for t-Distributed Stochastic Neighbor Embedding and its main aim is that of dimensionality reduction, i.e., given some complex dataset with many many dimensions, t-SNE projects this data into a 2D (or 3D) representation while preserving the ‘structure’ (patterns) in the original dataset. Visualising high-dimensional data in this way allows us to look for patterns in the dataset.

t-SNE has become so popular because:

  • it was demonstrated to be useful in many situations,
  • it’s incredibly flexible, and
  • can often find structure where other dimensionality-reduction algorithms cannot.

A good introduction on t-SNE can be found here. The original paper on t-SNE and some visualisations can be found at this site. In particular, I like this site which shows how t-SNE is used for unsupervised image clustering.

While t-SNE itself is computationally heavy, a faster version exists that uses what is known as the Barnes-Hut approximation. This faster version allows t-SNE to be applied on large real-world datasets.

t-SNE for Exploratory Data Analysis

Because t-SNE is able to provide a 2D or 3D visual representation of high-dimensional data that preserves the original structure, we can use it during initial data exploration. We can use it to check for the presence of clusters in the data and as a visual check to see if there is some ‘order’ or some ‘pattern’ in the dataset. It can aid our intuition about what we think we know about the domain we are working in.

Apart from the initial exploratory stage, visualising information via t-SNE (or any other algorithm) is vital throughout the entire analysis process – from those first investigations of the raw data, during data preparation, as well as when interpreting the outputs of machine learning algorithms and presenting insights to others. We will see further on that we can use t-SNE even during the prediction/classification stage itself.

Experiments on the Optdigits dataset

In this post, I will apply t-SNE to a well-known dataset, called optdigits, for visualisation purposes.

The optdigits dataset has 64 dimensions. Can t-SNE reduce these 64 dimensions to just 2 dimensions while preserving structure in the process? And will this structure (if present) allow handwritten digits to be correctly clustered together? Let’s find out.

tsne package

We will use the tsne package, which provides an exact implementation of t-SNE (not the Barnes-Hut approximation), to reduce the dimensionality of the optdigits data to 2 dimensions. The final output of t-SNE will essentially be an array of 2D coordinates, one per row (image), and we can then plot these coordinates to get the final 2D map (visualisation). The algorithm runs in iterations (called epochs) until the system converges. Every K iterations, and upon convergence, t-SNE calls a user-supplied callback function and passes it the current list of 2D coordinates. In our callback function, we plot the 2D points (one per image) with their corresponding class labels, and colour-code everything by class label.

traindata <- read.table("optdigits.tra", sep=",")
trn <- data.matrix(traindata)

require(tsne)

cols <- rainbow(10)

# this is the epoch callback function used by tsne. 
# x is an NxK table where N is the number of data rows passed to tsne, and K is the dimension of the map. 
# Here, K is 2, since we use tsne to map the rows to a 2D representation (map).
ecb = function(x, y){ plot(x, t='n'); text(x, labels=trn[,65], col=cols[trn[,65] +1]); }

tsne_res = tsne(trn[,1:64], epoch_callback = ecb, perplexity=50, epoch=50)

The images below show how the clustering improves as more epochs pass.

As one can see from the above diagrams (especially the last one, for epoch 1000), t-SNE does a very good job in clustering the handwritten digits correctly.

But the algorithm takes some time to run. Let’s try out the more efficient Barnes-Hut version of t-SNE, and see if we get equally good results.

Rtsne package

The Rtsne package can be used as shown below. The perplexity parameter is crucial for t-SNE to work correctly – this parameter determines how the local and global aspects of the data are balanced. A more detailed explanation on this parameter and other aspects of t-SNE can be found in this article, but a perplexity value between 30 and 50 is recommended.

traindata <- read.table("optdigits.tra", sep=",")
trn <- data.matrix(traindata)

require(Rtsne)

# perform dimensionality reduction from 64D to 2D
tsne <- Rtsne(as.matrix(trn[,1:64]), check_duplicates = FALSE, pca = FALSE, perplexity=30, theta=0.5, dims=2)

# display the results of t-SNE
cols <- rainbow(10)
plot(tsne$Y, t='n')
text(tsne$Y, labels=trn[,65], col=cols[trn[,65] +1])

Gives this plot:

Note how clearly defined and distinctly separable the clusters of handwritten digits are. Only a handful of entries end up in the wrong cluster in the 2D map.

Of Manifolds and the Manifold Assumption

How can t-SNE achieve such a good result? How can it ‘drop’ 64-dimensional data into just 2 dimensions and still preserve enough (or all) structure to allow the classes to be separated?

The reason has to do with the mathematical concept of manifolds. A manifold is a d-dimensional surface that lives in a D-dimensional space, where d < D. For the 3D case, imagine a 2D piece of paper that is embedded within 3D space. Even if the piece of paper is crumpled up extensively, it can still be ‘unwrapped’ (uncrumpled) into the 2D plane that it is. This 2D piece of paper is a manifold in 3D space. Or think of an entangled string in 3D space – this is a 1D manifold in 3D space.

Now there’s what is known as the Manifold Hypothesis, or the Manifold Assumption, which states that natural data (like images, etc.) forms lower-dimensional manifolds in its embedding space. If this assumption holds (there is theoretical and experimental evidence for this hypothesis), then t-SNE should be able to find this lower-dimensional manifold, ‘unwrap’ it, and present it to us as a lower-dimensional map of the original data.
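
To make the manifold idea a bit more concrete, here is a small synthetic sketch (not from the original post): we generate points along a noisy helix, a 1D manifold embedded in 3D, and let Rtsne produce a 2D map. Because t-SNE preserves local neighbourhoods, points that are close along the curve should stay close in the embedding.

# Synthetic example: a 1D manifold (noisy helix) embedded in 3D
library(Rtsne)

t <- seq(0, 4 * pi, length.out = 600)
helix <- cbind(cos(t), sin(t), t / (4 * pi)) +
  matrix(rnorm(600 * 3, sd = 0.02), ncol = 3)

emb <- Rtsne(helix, dims = 2, perplexity = 30, check_duplicates = FALSE)

# colour points by their position along the curve; neighbouring colours
# should remain neighbours in the 2D map
plot(emb$Y, col = rainbow(600)[rank(t)], pch = 16)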

t-SNE vs. SOM

t-SNE is actually a tool that does something similar to Self-Organising Maps (SOMs), though the underlying process is quite different. We have used a SOM on the optdigits dataset in a previous blog and obtained the following unsupervised clustering shown below.

The t-SNE map is clearer than the one obtained via a SOM, and the clusters are much better separated.

t-SNE for Shelter Animals dataset

In a previous blog, I applied machine learning algorithms for predicting the outcome of shelter animals. Let’s now apply t-SNE to the dataset – I am using the cleaned and modified data as described in this blog entry.

trn <- read.csv('train.modified.csv.zip')
trn <- data.matrix(trn)

require(Rtsne)

# scale the data first prior to running t-SNE
trn[,-1]  <- scale(trn[,-1])

tsne <- Rtsne(trn[,-1], check_duplicates = FALSE, pca = FALSE, perplexity=50, theta=0.5, dims=2)

# display the results of t-SNE
cols <- rainbow(5)
plot(tsne$Y, t='n')
text(tsne$Y, labels=as.numeric(trn[,1]), col=cols[trn[,1]])

Gives this plot:

Here, while it is evident that there is some structure and patterns to the data, clustering by OutcomeType has not happened.

Let’s now use t-SNE to perform dimensionality reduction to 3D on the same dataset. We just need to set the dims parameter to 3 instead of 2. And we will use package rgl for plotting the 3D map produced by t-SNE.

tsne <- Rtsne(trn[,-1], check_duplicates = FALSE, pca = FALSE, perplexity=50, theta=0.5, dims=3)

#display results of t-SNE
require(rgl)
plot3d(tsne$Y, col=cols[trn[,1]])
legend3d("topright", legend = 1:5, pch = 16, col = rainbow(5))

Gives this plot:

This 3D map has a richer structure than the 2D version, but again the resulting clustering is not done by OutcomeType.

One possible reason could be that the Manifold Assumption is failing for this dataset when trying to reduce to 2D and 3D. Please note that the Manifold Assumption cannot always be true, for a simple reason. If it were, we could take an arbitrary set of N-dimensional points and conclude that they lie on some M-dimensional manifold (where M < N). We could then take those M-dimensional points and conclude that they lie on some other lower-dimensional manifold (with, say, K dimensions, where K < M). We could repeat this process indefinitely and eventually conclude that our arbitrary N-dimensional dataset lies on a 1D manifold. Because this cannot be true for all datasets, the manifold assumption cannot always be true. In general, if you have N-dimensional points being generated by a process with N degrees of freedom, your data will not lie on a lower-dimensional manifold. So probably the Animal Shelter dataset has more than three degrees of freedom.

So we can conclude that t-SNE did not aid the initial exploratory analysis much for this dataset.

t-SNE and machine learning?

One disadvantage of t-SNE is that there is currently no incremental version of this algorithm. In other words, it is not possible to run t-SNE on a dataset, then gather a few more samples (rows), and “update” the t-SNE output with the new samples. You would need to re-run t-SNE from scratch on the full dataset (previous dataset + new samples). Thus t-SNE works only in batch mode.
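
To make that concrete, here is a minimal sketch (the file names are hypothetical): Rtsne() offers no update or predict step for rows that arrive later, so the only option is a full re-fit on the combined data.

library(Rtsne)

old_rows <- as.matrix(read.csv("batch1.csv"))   # rows we have already embedded
new_rows <- as.matrix(read.csv("batch2.csv"))   # rows that arrive later

# There is no incremental update of an existing map; to include the new rows
# we must re-run t-SNE from scratch on the concatenated dataset.
tsne_all <- Rtsne(rbind(old_rows, new_rows),
                  check_duplicates = FALSE, pca = FALSE,
                  perplexity = 30, theta = 0.5, dims = 2)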

This disadvantage appears to make it difficult for t-SNE to be used in a machine learning system. But as we will see in a future post, it is still possible to use t-SNE (with care) in a machine learning solution. And the use of t-SNE can improve classification results, sometimes markedly. The limitation is the extra non-realtime processing brought about by t-SNE’s batch-mode nature.

Stay tuned.






    Source link

    Four short links: 22 September 2017

    By | ai, bigdata, machinelearning

    Molecular Robots, Distributed Deep Nets, SQL Notebook, and Super-Accurate GPS

    1. Scientists Create World’s First ‘Molecular Robot’ Capable Of Building Molecules: Each individual robot is capable of manipulating a single molecule and is made up of just 150 carbon, hydrogen, oxygen and nitrogen atoms. To put that size into context, a billion billion of these robots piled on top of each other would still only be the same size as a single grain of salt. The robots operate by carrying out chemical reactions in special solutions which can then be controlled and programmed by scientists to perform the basic tasks. (via Slashdot)
    2. Distributed Deep Neural Networks — in Adrian Colyer’s words: DDNNs partition networks between mobile/embedded devices, cloud (and edge), although the partitioning is static. What’s new and very interesting here though is the ability to aggregate inputs from multiple devices (e.g., with local sensors) in a single model, and the ability to short-circuit classification at lower levels in the model (closer to the end devices) if confidence in the classification has already passed a certain threshold. It looks like both teams worked independently and in parallel on their solutions. Overall, DDNNs are shown to give lower latency decisions with higher accuracy than either cloud or devices working in isolation, as well as fault tolerance in the sense that classification accuracy remains high even if individual devices fail. (via Morning Paper)
    3. Franchise: an open-source notebook for SQL.
    4. Super-Accurate GPS Chips Coming to Smartphones in 2018 (IEEE Spectrum) — 30cm accuracy (today: 5m), will help with the reflections you get in cities, and with 50% energy savings.

    Continue reading Four short links: 22 September 2017.




    Source link

    Mortgage Equity Withdrawal slightly positive in Q2

    By | ai, bigdata, machinelearning


    Note: This is not Mortgage Equity Withdrawal (MEW) data from the Fed. The last MEW data from Fed economist Dr. Kennedy was for Q4 2008.

    The following data is calculated from the Fed’s Flow of Funds data (released yesterday) and the BEA supplement data on single family structure investment. This is an aggregate number, and is a combination of homeowners extracting equity – hence the name “MEW” – and normal principal payments and debt cancellation (modifications, short sales, and foreclosures).

    For Q2 2017, the Net Equity Extraction was a positive $12 billion, or a positive 0.3% of Disposable Personal Income (DPI).

    Graph: Mortgage Equity Withdrawal.

    This graph shows the net equity extraction, or mortgage equity withdrawal (MEW), results, using the Flow of Funds (and BEA data) compared to the Kennedy-Greenspan method.

    Note: This data is impacted by debt cancellation and foreclosures, but much less than a few years ago.

    The Fed’s Flow of Funds report showed that the amount of mortgage debt outstanding increased by $64 billion in Q2.

    The Flow of Funds report also showed that Mortgage debt has declined by $1.23 trillion since the peak. This decline is mostly because of debt cancellation per foreclosures and short sales, and some from modifications. There has also been some reduction in mortgage debt as homeowners paid down their mortgages so they could refinance.

    With a slower rate of debt cancellation, MEW will likely be mostly positive going forward.

    For reference:

    Dr. James Kennedy also has a simple method for calculating equity extraction: “A Simple Method for Estimating Gross Equity Extracted from Housing Wealth”. Here is a companion spreadsheet (the above uses my simple method).

    For those interested in the last Kennedy data included in the graph, the spreadsheet from the Fed is available here.


    Source link

    Air rage update

    By | ai, bigdata, machinelearning

    (This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

    So. Marcus Crede, Carol Nickerson, and I published a letter in PPNAS criticizing the notorious “air rage” article. (Due to space limitations, our letter contained only a small subset of the many possible criticisms of that paper.) Our letter was called “Questionable association between front boarding and air rage.”

    The authors of the original paper, Katherine DeCelles and Michael Norton, published a response in which they concede nothing. They state that their hypotheses “are predicated on decades of theoretical and empirical support across the social sciences” and they characterize their results as “consistent with theory.” I have no reason to dispute either of these claims, but at the same time these theories are so flexible that they could predict just about anything, including, I suspect, the very opposite of the claims made in the paper. As usual, there’s a confusion between a general scientific theory and some very specific claims regarding regression coefficients in some particular fitted model.

    Considering the DeCelles and Norton reply in a context-free sense, it reads as reasonable: yes, it is possible for the signs and magnitudes of estimates to change when adding controls to a regression. The trouble is that their actual data seem to be of low quality, and due to the observational nature of their study, there are lots of interactions not included in the model that are possibly larger than their main effects (for example, interactions of plane configuration with type of flight, interactions with alcohol consumption, nonlinearities in the continuous predictors such as number of seats and flight difference).

    The whole thing is interesting in that it reveals the challenge of interpreting this sort of exchange from the outside: how it is possible for researchers to string together paragraphs that have the form of a logical argument in support of whatever claim they’d like to make. Of course, someone could say the same about us. . . .

    One good thing about slogans such as “correlation does not imply causation” is that they get right to the point.

    The post Air rage update appeared first on Statistical Modeling, Causal Inference, and Social Science.

    Please comment on the article here: Statistical Modeling, Causal Inference, and Social Science

    The post Air rage update appeared first on All About Statistics.




    Source link

    My advice on dplyr::mutate()

    By | ai, bigdata, machinelearning

    There are substantial differences between ad-hoc analyses (be they: machine learning research, data science contests, or other demonstrations) and production worthy systems. Roughly: ad-hoc analyses have to be correct only at the moment they are run (and often once they are correct, that is the last time they are run; obviously the idea of reproducible research is an attempt to raise this standard). Production systems have to be durable: they have to remain correct as models, data, packages, users, and environments change over time.

    Demonstration systems need merely glow in bright light among friends; production systems must be correct, even alone in the dark.


    “Character is what you are in the dark.”

    John Whorfin quoting Dwight L. Moody.

    I have found: to deliver production worthy data science and predictive analytic systems, one has to develop per-team and per-project field tested recommendations and best practices. This is necessary even when, or especially when, these procedures differ from official doctrine.

    What I want to do is share a single small piece of Win-Vector LLC‘s current guidance on using the R package dplyr.

    • Disclaimer: Win-Vector LLC has no official standing with RStudio, or dplyr development.
    • However:

      “One need not have been Caesar in order to understand Caesar.”

      Alternately: Georg Simmel or Max Weber.

      Win-Vector LLC, as a consultancy, has experience helping large companies deploy enterprise big data solutions involving R, dplyr, sparklyr, and Apache Spark. Win-Vector LLC, as a training organization, has experience in how new users perceive, reason about, and internalize how to use R and dplyr. Our group knows how to help deploy production grade systems, and how to help new users master these systems.

    From experience we have distilled a lot of best practices. And below we will share one.

    From: “R for Data Science; Wickham, Grolemund; O’Reilly, 2017” we have:

    Note that you can refer to columns that you’ve just created:

    mutate(flights_sml,
       gain = arr_delay - dep_delay,
       hours = air_time / 60,
       gain_per_hour = gain / hours
     )
    

    Let’s try that with database backed data:

    suppressPackageStartupMessages(library("dplyr"))
    packageVersion("dplyr")
    # [1] ‘0.7.3’
    
    db <- DBI::dbConnect(RSQLite::SQLite(), 
                         ":memory:")
    flights <- copy_to(db, 
                 nycflights13::flights,
                 'flights')
    
    mutate(flights,
           gain = arr_delay - dep_delay,
           hours = air_time / 60,
           gain_per_hour = gain / hours
    )
    # # Source:   lazy query [?? x 22]
    # # Database: sqlite 3.19.3 [:memory:]
    # year month   day dep_time sched_dep_time        ...
    #                        ...
    #   1  2013     1     1      517            515   ...
    # ...
    

    That worked. One of the selling points of dplyr is that much of it is source-generic or source-agnostic, meaning it can be run against different data providers (in-memory, databases, Spark).

    However, if a new user tries to extend such an example (say adding gain_per_minutes) they run into this:

    mutate(flights,
           gain = arr_delay - dep_delay,
           hours = air_time / 60,
           gain_per_hour = gain / hours,
           gain_per_minute = 60 * gain_per_hour
    )
    # Error in rsqlite_send_query(conn@ptr, statement) : 
    #   no such column: gain_per_hour
    

    It is hard for experts to understand how frustrating the above is to a new R user or to a part time R user. It feels like any variation on the original code causes it to fail. None of the rules they have been taught anticipate this, or tell them how to get out of this situation.

    This quickly leads to strong feelings of learned helplessness and anxiety.

    Our rule for dplyr::mutate() has been for some time:

    Each column name used in a single mutate must appear only on the left-hand-side of a single assignment, or otherwise on the right-hand-side of any number of assignments (but never both sides, even if it is different assignments).

    Under this rule, neither of the above mutate()s is allowed. The second should be written as (switching to pipe notation):

    flights %>%
      mutate(gain = arr_delay - dep_delay,
             hours = air_time / 60) %>%
      mutate(gain_per_hour = gain / hours) %>%
      mutate(gain_per_minute = 60 * gain_per_hour)
    

    And the above works.

    If we teach this rule we can train users to be properly cautious, and hopefully avoid them becoming frustrated, scared, anxious, or angry.

    dplyr documentation (such as "help(mutate)") does not strongly commit to what order mutate expressions are executed in, or to the visibility and durability of intermediate results (i.e., a full description of the intended semantics). Our rule intentionally limits the user to a set of circumstances where none of those questions matter.

    Now the error we saw above is a mere bug that one expects will be fixed some day (in fact it is dplyr issue 3095; we looked a bit at the generated queries here). It can be a bit unfair to criticize a package for having a bug.

    However, confusion around re-use of column names has been driving dplyr issues for quite some time.

    It makes sense to work in a reliable and teachable sub-dialect of dplyr that will serve users well (or, barring that, you can use an adapter such as seplyr). In production you must code to what systems have historically been reliably capable of, not just to the specification.





    Source link

    Data liquidity in the age of inference

    By | ai, bigdata, machinelearning

    Probabilistic computation holds too much promise for it to be stifled by playing zero sum games with data.

    It’s a special time in the evolutionary history of computing. Oft-used terms like big data, machine learning, and artificial intelligence have become popular descriptors of a broader underlying shift in information processing. While traditional rules-based computing isn’t going anywhere, a new computing paradigm is emerging around probabilistic inference, where digital reasoning is learned from sample data rather than hardcoded with boolean logic. This shift is so significant that a new computing stack is forming around it with emphasis on data engineering, algorithm development, and even novel hardware designs optimized for parallel computing workloads, both within data centers and at endpoints.

    A funny thing about probabilistic inference is that when models work well, they’re probably right most of the time, but always wrong at least some of the time. From a mathematics perspective, this is because such models take a numerical approach to problem analysis, as opposed to an analytical one. That is, they learn patterns from data (with various levels of human involvement) that have certain levels of statistical significance, but remain somewhat ignorant to any physics-level intuition related to those patterns, whether represented by math theorems, conjectures, or otherwise. However, that’s also precisely why probabilistic inference is so incredibly powerful. Many real-world systems are so multivariate, complex, and even stochastic that analytical math models do not exist and remain tremendously difficult to develop. In the meanwhile, their physics-ignorant, FLOPS-happy, and often brutish machine learning counterparts can develop deductive capabilities that don’t nicely follow any known rules, yet still almost always arrive at the correct answers.
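
    As a toy numerical illustration of that point (simulated data, nothing from the article itself): a classifier learned from samples is right most of the time and wrong some of the time, without any hand-coded rule in sight.

    # Toy sketch: a learned probabilistic classifier on simulated data.
    set.seed(1)
    n <- 1000
    x <- rnorm(n)
    y <- rbinom(n, 1, plogis(2 * x))            # the true process is stochastic

    fit  <- glm(y ~ x, family = binomial)       # model learned from sample data
    pred <- as.integer(predict(fit, type = "response") > 0.5)
    mean(pred == y)                             # high accuracy, but never exactly 1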

    Continue reading Data liquidity in the age of inference.




    Source link