

HVS: Q1 2017 Homeownership and Vacancy Rates


The Census Bureau released the Residential Vacancies and Homeownership report for Q1 2017.

This report is frequently mentioned by analysts and the media to track household formation, the homeownership rate, and the homeowner and rental vacancy rates. However, there are serious questions about the accuracy of this survey.

This survey might show the trend, but I wouldn’t rely on the absolute numbers. The Census Bureau is investigating the differences between the HVS, ACS and decennial Census, and analysts probably shouldn’t use the HVS to estimate the excess vacant supply or household formation, or rely on the homeownership rate, except as a guide to the trend.

Homeownership Rate

The red dots are the decennial Census homeownership rates for April 1st 1990, 2000 and 2010. The HVS homeownership rate decreased to 63.6% in Q1, from 63.7% in Q4.

I’d put more weight on the decennial Census numbers – and given changing demographics, the homeownership rate is probably close to a bottom.

Homeowner Vacancy Rate

The HVS homeowner vacancy rate declined to 1.7% in Q1.

Once again – this probably shows the general trend, but I wouldn’t rely on the absolute numbers.

Rental Vacancy Rate

The rental vacancy rate increased to 7.0% in Q1.

The quarterly HVS is the most timely survey on households, but there are many questions about the accuracy of this survey.

Overall this suggests that vacancies have declined significantly, and my guess is the homeownership rate is probably close to the bottom.


A whole fleet of Wansinks: is “evidence-based design” a pseudoscience that’s supporting a trillion-dollar industry?


(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

Following a recent post, we got this blockbuster comment from Ujjval Vyas, which seemed worth its own post:

I work in an area where social psychology is considered the gold standard of research and thus the whole area is completely full of Wansink stuff (“people recover from surgery faster if they have a view of nature out the window”, “obesity and diabetes are caused by not enough access to nature for the poor”, biomimicry is a particularly egregious idea in this field). No one even knows how to really read any of the footnotes or cares, since it is all about confirmation bias and the primary professional organizations in the area directly encourage such lack of rigor. Obscure as it may sound, the whole area of “research” into architecture and design is full of this kind of thing. But the really odd part is that the field is made up of people who have no idea what a good study is or could be (architects, designers, interior designers, academic “researchers” at architecture schools or inside furniture manufacturers trying to sell more). They even now have groups that pursue “evidence-based healthcare design” which simply means that some study somewhere says what they need it to say. The field is at such a low level that it is not worth mentioning in many ways except that it is deeply embedded in a $1T industry for building and construction as well as codes and regulations based on this junk. Any idea of replication is simply beyond the kenning in this field because, as one of your other commenters put it, the publication is only a precursor to Ted talks and keynote addresses and sitting on officious committees to help change the world (while getting paid well). Sadly, as you and commenters have indicated, no one thinks they are doing anything wrong at all. I only add this comment to suggest that there are whole fields and sub-fields that suffer from the problems outlined here (much of this research would make Wansink look scrupulous).

Here’s the Wikipedia page on Evidence-based design, including this chilling bit:

As EBD is supported by research, many healthcare organizations are adopting its principles with the guidance of evidence-based designers. The Center for Health Design and InformeDesign (a not-for-profit clearinghouse for design and human-behaviour research) have developed the Pebble Project, a joint research effort by CHD and selected healthcare providers on the effect of building environments on patients and staff.

The Evidence Based Design Accreditation and Certification (EDAC) program was introduced in 2009 by The Center for Health Design to provide internationally recognized accreditation and promote the use of EBD in healthcare building projects, making EBD an accepted and credible approach to improving healthcare outcomes. The EDAC identifies those experienced in EBD and teaches about the research process: identifying, hypothesizing, implementing, gathering and reporting data associated with a healthcare project.

Later on the page is a list of 10 strategies (1. Start with problems. 2. Use an integrated multidisciplinary approach with consistent senior involvement, ensuring that everyone with problem-solving tools is included. etc.). Each of these steps seems reasonable, but put them together and they do read like a recipe for taking hunches, ambitious ideas, and possible scams and making them look like science. So I’m concerned. Maybe it would make sense to collaborate with someone in the field of architecture and design and try to do something about this.

P.S. It might seem kinda mean for me to pick on these qualitative types for trying their best to achieve something comparable to quantitative rigor. But . . . if there are really billions of dollars at stake, we shouldn’t sit idly by. Also, I feel like Wansink-style pseudoscience can be destructive of qualitative expertise. I’d rather see some solid qualitative work than bogus number crunching.

The post A whole fleet of Wansinks: is “evidence-based design” a pseudoscience that’s supporting a trillion-dollar industry? appeared first on Statistical Modeling, Causal Inference, and Social Science.




Thesis: Randomized Algorithms for Large-Scale Data Analysis by Farhad Pourkamali-Anaraki


Image 1

Stephen just sent me the following:

Hi Igor,

It’s a pleasure to write to you again and to announce the graduation of my PhD student Farhad Pourkamali-Anaraki.
His thesis contains a lot of good things, some published, some not. In particular (see attached Image 1), he has great work on a 1-pass algorithm for K-means that seems to be one of the only 1-pass algorithms to accurately estimate cluster centers (implementation at ), and also very recent work on efficient variations of the Nystrom method for approximating kernel matrices that seems to deliver the high accuracy of the clustered Nystrom method at a fraction of the computational cost (see Image 2).


Image 2
Thanks Stephen, but I think the following paper also does 1-pass K-means (Keriven N., Tremblay N., Traonmilin Y., Gribonval R., “Compressive K-means”, and its implementation SketchMLbox: A MATLAB toolbox for large-scale mixture learning), even though the construction seems different.
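For readers new to the technique, the plain uniform-sampling Nystrom approximation that these clustered and efficient variants improve on can be sketched in a few lines of Python/NumPy. This is a generic illustration, not code from the thesis; the RBF kernel, the landmark count, and the synthetic data are arbitrary choices:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    # Gaussian (RBF) kernel matrix between the rows of X and the rows of Y
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def nystrom(X, m, gamma=0.5, seed=None):
    # Plain Nystrom: sample m landmark rows, then approximate
    # K ~= C W^+ C^T with C = k(X, landmarks), W = k(landmarks, landmarks)
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)
    C = rbf_kernel(X, X[idx], gamma)
    W = rbf_kernel(X[idx], X[idx], gamma)
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 3))
K = rbf_kernel(X, X)              # exact 60 x 60 kernel matrix
K_hat = nystrom(X, m=60, seed=1)  # with all points as landmarks: exact
err = np.linalg.norm(K - K_hat) / np.linalg.norm(K)
```

With all n points as landmarks the reconstruction of a positive semidefinite kernel matrix is exact; the savings come from taking m much smaller than n landmarks, at the price of some approximation error.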
Anyway, congratulations, Dr. Pourkamali-Anaraki!
Randomized Algorithms for Large-Scale Data Analysis, by Farhad Pourkamali-Anaraki. The abstract reads:

Massive high-dimensional data sets are ubiquitous in all scientific disciplines. Extracting meaningful information from these data sets will bring future advances in fields of science and engineering. However, the complexity and high-dimensionality of modern data sets pose unique computational and statistical challenges. The computational requirements of analyzing large-scale data exceed the capacity of traditional data analytic tools. The challenges surrounding large high-dimensional data are felt not just in processing power, but also in memory access, storage requirements, and communication costs. For example, modern data sets are often too large to fit into the main memory of a single workstation and thus data points are processed sequentially without a chance to store the full data. Therefore, there is an urgent need for the development of scalable learning tools and efficient optimization algorithms in today’s high-dimensional data regimes.

A powerful approach to tackle these challenges is centered around preprocessing high-dimensional data sets via a dimensionality reduction technique that preserves the underlying geometry and structure of the data. This approach stems from the observation that high-dimensional data sets often have intrinsic dimension which is significantly smaller than the ambient dimension. Therefore, information-preserving dimensionality reduction methods are valuable tools for reducing the memory and computational requirements of data analytic tasks on large-scale data sets.

Recently, randomized dimension reduction has received a lot of attention in several fields, including signal processing, machine learning, and numerical linear algebra. These methods use random sampling or random projection to construct low-dimensional representations of the data, known as sketches or compressive measurements. These randomized methods are effective in modern data settings since they provide a non-adaptive data-independent mapping of high-dimensional data into a low-dimensional space. However, such methods require strong theoretical guarantees to ensure that the key properties of original data are preserved under a randomized mapping.

This dissertation focuses on the design and analysis of efficient data analytic tasks using randomized dimensionality reduction techniques. Specifically, four efficient signal processing and machine learning algorithms for large high-dimensional data sets are proposed: covariance estimation and principal component analysis, dictionary learning, clustering, and low-rank approximation of positive semidefinite kernel matrices. These techniques are valuable tools to extract important information and patterns from massive data sets. Moreover, an efficient data sparsification framework is introduced that does not require incoherence and distributional assumptions on the data. A main feature of the proposed compression scheme is that it requires only one pass over the data due to the randomized preconditioning transformation, which makes it applicable to streaming and distributed data settings.

The main contribution of this dissertation is threefold: (1) strong theoretical guarantees are provided to ensure that the proposed randomized methods preserve the key properties and structure of high-dimensional data; (2) tradeoffs between accuracy and memory/computation savings are characterized for a large class of data sets as well as dimensionality reduction methods, including random linear maps and random sampling; (3) extensive numerical experiments are presented to demonstrate the performance and benefits of our proposed methods compared to prior works.
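The sketches mentioned in the abstract can be illustrated with the simplest non-adaptive random linear map, a Gaussian Johnson-Lindenstrauss projection. This is again a generic Python/NumPy sketch with arbitrary dimensions, not the dissertation's code:

```python
import numpy as np

rng = np.random.default_rng(42)
d, k, n = 1000, 200, 50          # ambient dim, sketch dim, number of points
X = rng.standard_normal((n, d))  # high-dimensional data set

# A non-adaptive, data-independent map: a single Gaussian matrix,
# scaled so that squared Euclidean norms are preserved in expectation.
R = rng.standard_normal((k, d)) / np.sqrt(k)
Y = X @ R.T                      # n x k sketch of the data

# Distances between sketched points roughly match the original distances.
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
ratio = proj / orig              # close to 1 with high probability
```

The map is data-independent in exactly the sense the abstract describes: R is drawn once, without looking at X, yet pairwise geometry survives the 5x compression up to small distortion.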



a secretary problem with maximum ability


(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

The Riddler of today has a secretary problem, where one measures N random variables sequentially until one deems the current variable to be the largest of the whole sample. The classical secretary problem has a counter-intuitive solution: one first observes N/e random variables without taking any decision, and then and only then picks the first subsequent outcome larger than the largest of that initial group. The added information in the current riddle is that the iid random variables are known to be uniform on {1,…,M}, which calls for a modification of the algorithm: for instance, a draw equal to M is certainly the overall maximum and should be picked immediately.
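For reference, the classical skip-the-first-N/e rule is easy to simulate. Here is my own quick Python illustration (not from the original post), with a win defined as picking a value equal to the sample maximum; for N=10 the success rate sits a little above the asymptotic 1/e:

```python
import math
import random

def classical_secretary(N=10, M=100, trials=20000, seed=1):
    # Skip the first floor(N/e) draws, then accept the first draw that
    # beats the best value seen so far (falling back to the last draw).
    random.seed(seed)
    r = math.floor(N / math.e)
    wins = 0
    for _ in range(trials):
        m = [random.randint(1, M) for _ in range(N)]
        best_seen = max(m[:r]) if r > 0 else 0
        chosen = m[-1]
        for x in m[r:]:
            if x > best_seen:
                chosen = x
                break
        wins += chosen == max(m)
    return wins / trials

rate = classical_secretary()
```

Note that this baseline ignores the extra information that the draws are uniform on {1,…,M}, which is exactly what the modified rule below the R code exploits.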

The approach I devised is clearly suboptimal, as I decided to pick the currently observed value if the (conditional) probability that it is the largest exceeds the probability that one of the subsequent draws is larger. This translates into the following R code:

M=100 #maximum value
N=10  #total number of draws
# m is the sequence of draws so far, n its length
n=length(m)
# accept m[n] when it is the running maximum and the probability that the
# remaining N-n draws all stay at or below it exceeds one half
if ((m[n]==max(m))&&((m[n]/M)^(N-n)>.5)) pick=TRUE

which produces a winning rate of around 62% when N=10 and M=100, hence much better than the expected performance of the classical secretary algorithm, whose asymptotic winning frequency is 1/e.
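The 62% figure is easy to check by Monte Carlo. Below is my own Python re-implementation of the rule as described, writing the acceptance test as (m[n]/M)^(N-n) > 1/2; the details of the original R construction may differ slightly:

```python
import random

def threshold_rule(N=10, M=100, trials=20000, seed=7):
    # Accept draw n when it is the running maximum and the probability
    # that all N-n remaining draws stay at or below it exceeds one half.
    random.seed(seed)
    wins = 0
    for _ in range(trials):
        m = [random.randint(1, M) for _ in range(N)]
        chosen, best = m[-1], 0
        for n, x in enumerate(m, start=1):
            best = max(best, x)
            if x == best and (x / M) ** (N - n) > 0.5:
                chosen = x
                break
        wins += chosen == max(m)
    return wins / trials

rate = threshold_rule()
```

At the final draw the exponent N-n is zero, so the last value is always accepted when it is the running maximum, and a draw equal to M is accepted at any stage, as it should be.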




Advanced Machine Learning with scikit-learn – Training DVD


Number of Videos: 4 hours – 46 lessons
Ships on: DVD-ROM
User Level: Advanced
Works On: Windows 7, Vista, XP; Mac OS X

In this Advanced Machine Learning with scikit-learn training course, expert author Andreas Mueller will teach you how to choose and evaluate machine learning models. This course is designed for users who already have experience with Python. You will start by learning about model complexity, overfitting and underfitting. From there, Andreas will teach you about pipelines, advanced metrics and imbalanced classes, and model selection for unsupervised learning. This video tutorial also covers dealing with categorical variables, dictionaries, and incomplete data, and how to handle text data. Finally, you will learn about out-of-core learning, including the scikit-learn interface for out-of-core learning and kernel approximations for large-scale non-linear classification. Once you have completed this computer-based training course, you will have learned everything you need to know to be able to choose and evaluate machine learning models. Working files are included, allowing you to follow along with the author throughout the lessons.
Learn Advanced Machine Learning with scikit-learn from a professional trainer from your own desk.
Visual training method, offering users increased retention and accelerated learning
Breaks even the most complex applications down into simple steps.
Comes with Extensive Working Files
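As a taste of the pipeline and model-selection material such a course covers, here is a minimal, generic scikit-learn example (my own illustration, not taken from the course): a preprocessing step and a classifier chained into one estimator, with the regularization strength chosen by cross-validated grid search.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A toy binary classification problem.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Chaining preprocessing and the model keeps cross-validation honest:
# the scaler is re-fit on each training fold, never on held-out data.
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])

# Grid-search the regularization strength of the final pipeline step.
grid = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_tr, y_tr)
score = grid.score(X_te, y_te)
```

The `"clf__C"` naming convention is how scikit-learn routes a grid-search parameter to a named step inside a pipeline.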


Four short links: 27 April 2017


Open Source Mail Delivery, Superhuman AI, Probabilistic Graphical Models, and Golden Ages

  1. Postal — A fully featured open source mail delivery platform for incoming & outgoing e-mail, like SendGrid but open source. I enjoyed this comment on Hacker News, where the commenter talks about turning a $1K/mo mail bill into $4/mo by running their own mail infrastructure. (Downside: you would need to get yourself familiar with SMTP, postfix, SPF/DKIM, mx-validation, blacklists, etc. And by “familiar,” I mean “learn it to the core.”)
  2. The Myth of a Superhuman AI (Kevin Kelly) — he makes a good argument that buried in this scenario of a takeover of superhuman artificial intelligence are five assumptions that, when examined closely, are not based on any evidence. These claims might be true in the future, but there is no evidence to date to support them.
  3. Probabilistic Graphical Models — CS228 course notes turned into a concise introductory course […]. This course starts by introducing probabilistic graphical models from the very basics and concludes by explaining from first principles the variational auto-encoder, an important probabilistic model that is also one of the most influential recent results in deep learning.
  4. Watch It While It Lasts: Our Golden Age of Television — The Parisian golden age [of art] emerged out of the collapse of a system that penalized artistic innovation. For most of the 19th century, the Académie des Beaux-Arts, a state-sanctioned institution, dominated the production and consumption of French art. A jury of academicians decided which paintings were exhibited at the Salon, the main forum for collectors to view new work. The academy set strict rules on artistic expression, and preferred idealized scenes from classical mythology to anything resembling contemporary life. For the most part, the art that resulted was staid and predictable, painted by skilled but anonymous technicians. It sure doesn’t feel like we’re in a golden age of technology innovation, and I sure recognize a lot of the VC horde mentality in the Académie description.



Forecasting Skills Applied To Trade Areas


If you ever want to get the blood flowing, sit down in a meeting with your Real Estate Team and your Finance Team and a handful of Executives looking to either open a new store or close an existing store.

Do you remember that scene in Moneyball when the scouts are all talking about the intangibles regarding a potential player and the Brad Pitt character appears to be having intestinal cramps? That’s what a lot of these meetings are like. Of course nobody is spittin’ tobacco juice into plastic cups. But you get the picture. A Real Estate Director thinks that Penn Square Mall has the “potential” to flourish given the changes that are happening next door at J. Jill. Somebody in Finance is completely against the mall … “Oklahoma City is not an aspirational market” … even though the Finance Director has never been to Oklahoma. An EVP responsible for stores doesn’t want to get hit over the head by the Board of Directors for opening yet another store that never meets the sales projections authored by the Real Estate Team.

The big topic in 2017 is store closures. Y’all recall my omnichannel arguments of 2010 – 2014 … arguments suggesting that store closures were the logical outcome of a strategy of creating channel sameness in an effort to compete against Amazon. Turns out the customer doesn’t “demand” a one-brand approach to channels. Turns out the customer demands more from Amazon!

There’s no better place to apply your forecasting chops than in forecasting what happens to a trade area when a store closes. It’s work that is nearly impossible to get right. Each market behaves just a bit differently than a comparable market. If you are off by 10% or 15%, you might close a store that is actually profitable.

Next week, we’ll talk about the approach I use. The approach is not fundamentally different than the approach smart catalogers used 5-10 years ago to dramatically scale back unprofitable pages to online buyers. Most important – the approach is FUN! You get to see outcomes that are uniquely different than you’d expect. And you’ll be armed with the ammo to be like Brad Pitt heading into a meeting with scouts on Moneyball!!

Forecasting outcomes are the sum of all analytics and marketing knowledge possessed by your company.
