Enabling data science for the majority

By | ai, bigdata, machinelearning

How human-in-the-loop data analytics is accelerating the discovery of insights.

As practitioners who build data science tools, we seem to have a rather myopic obsession with the challenges faced by the Googles, Amazons, and Facebooks of the world—companies with massive and mature data analytics ecosystems, supported by experienced systems engineers, and used by data scientists who are capable programmers. However, these companies represent a tiny fraction of the “big data” universe. It’s helpful to think of them as the “1% of big data”: the minority whose struggles are not often what the rest of the “big data” world faces. Yet, they occupy the majority of discourse around how to utilize the latest tools and technologies in the industry.

Big data problems for the majority

So what about the other “99%”: who are these organizations or individuals, and what problems do they face? The range of professionals who rely on big data includes finance and business analysts, journalists, engineers, and consultants, to name a few, all of whom likely have an interest in managing and extracting value from data.

Most of the problems that occur when trying to extract value from data ultimately stem from the humans who are “in-the-loop” of the data analysis process. In particular, as the size, number, and complexity of data sets have grown, three things have not grown at a proportional rate: the amount of human time available, the number of humans with sophisticated analysis skills, and human cognitive capacity. This leads to a number of challenging data management problems.

The solutions to these problems are a new breed of intelligent tools that treat humans as first-class citizens, alongside data. These tools empower humans, regardless of programming skills, to extract value and insights from data. These tools, and many others like them, are defining a burgeoning sub-field of data management, one many of us are calling “HILDA”, short for “human-in-the-loop data analytics.”

In this post, I will discuss five key problems that stem from the “humans-in-the-loop.” For each of these problems, my research group at the University of Illinois, in collaboration with research groups at MIT, the University of Maryland, and the University of Chicago, has been developing open source HILDA tools, namely, DataSpread, Zenvisage, Datamaran, Orpheus, and Gestalt. I will describe these open source tools, along with others that I am aware of, that at least partially address these problems (which are far from solved!). I welcome any suggestions on problems that I missed, or on other work that similarly aims to address these problems.

The five key problems: A high-level view

The five key “human-in-the-loop” problems include:

  1. The Excel problem: Over-reliance on spreadsheets
    For many individuals, data analytics happens entirely within Excel or similar spreadsheet software. However, spreadsheet software is sluggish for large or complex spreadsheets, doesn’t scale beyond memory limits, and requires cumbersome mechanisms to express simple operations (e.g., joins).
  2. The exploration problem: Not knowing where to look
    When operating on large data sets with many attributes, data scientists have trouble figuring out what to visualize, what the typical patterns or trends are, and where to find them. Often, exploration, even with tools like Tableau or Excel, is laborious and time-consuming.
  3. The data lake problem: Messy cesspools of data
    Data scientists and organizations routinely accumulate machine-generated semi-structured data sets—it is hard to extract structured information from these data sets, as well as understand the relationships between them. Both of these steps are necessary before they can be put to use in analysis.
  4. The data versioning problem: Ad-hoc management of analysis
    As data scientists analyze their data, they generate hundreds of data sets and source code versions, by repeatedly transforming and cleaning their data and applying machine learning algorithms. The data set versions are managed via ad-hoc mechanisms, with no clear understanding of the derivation relationships, hurting reproducibility and collaboration, as well as fostering redundancies and errors.
  5. The learning problem: Hurdles in leveraging machine learning
    While machine learning is hugely beneficial in identifying patterns and making predictions with data, it is a pain to deploy, especially at scale, without a substantial amount of programming work to connect relevant tools. It also requires manual work to tune and iterate on machine learning algorithms—placing it beyond the reach of many business users who need to interact with data.

The Excel problem

For many people working with data, analytics begins and ends with Microsoft Excel. Unfortunately, Microsoft Excel is extremely sluggish when dealing with large spreadsheets or spreadsheets with many formulae. I’ve worked with spreadsheets with a few hundred thousand cells, for example, where scrolling takes several seconds, and propagation of a single change takes minutes.

At the same time, due to the limited expressivity of formulae, spreadsheet users end up expressing relatively simple operations in complex ways (e.g., joins via VLOOKUPs), or resort to cumbersome mechanisms to analyze data (e.g., copy-pasting relevant data to a different area of the sheet to evaluate a filter expression.) The preponderance of formulae also encourages errors in spreadsheets—these errors have actually led to the retraction of biology, psychology and economics papers!
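
To make the contrast concrete, here is a minimal sketch of the same kind of lookup expressed as a relational join, using Python's built-in sqlite3 module. The table names, column names, and rows are hypothetical and chosen only for illustration; this is not DataSpread's interface, just an example of how a database expresses in one statement what a spreadsheet needs a VLOOKUP per row to do.

```python
import sqlite3

# Hypothetical example data: two small "sheets" that a spreadsheet user
# would typically combine with one VLOOKUP formula per row.
orders = [(1, "alice", 30.0), (2, "bob", 12.5), (3, "alice", 7.5)]
customers = [("alice", "Chicago"), ("bob", "Boston")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
conn.execute("CREATE TABLE customers (customer TEXT, city TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
conn.executemany("INSERT INTO customers VALUES (?, ?)", customers)

# One declarative join replaces a whole column of per-row lookups.
joined = conn.execute(
    """
    SELECT o.order_id, o.customer, c.city, o.amount
    FROM orders AS o
    JOIN customers AS c ON c.customer = o.customer
    ORDER BY o.order_id
    """
).fetchall()

for row in joined:
    print(row)
```

The join condition is stated once and applies to every row, so there is no formula to copy down and no lookup range to keep in sync as the data grows.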

Related projects

DataSpread is a project aimed at combining the benefits of spreadsheets and databases, by keeping the familiar spreadsheet front end, while using a database back end. It supports flexible, ad-hoc spreadsheet operations, while inheriting the scalability and power of databases. Our research group at the University of Illinois has been tackling a number of challenges in developing DataSpread, including developing flexible representation schemes for a variety of spreadsheet structures, designing mechanisms for recording and retrieving information based on position, and devising lazy computation approaches that prioritize what the user is seeing over what is not currently being displayed.

Figure 1. DataSpread can scale to an arbitrary number of rows, supporting interactive exploration via positionally aware indices and lazy computation. In this figure, DataSpread is being used for exploring a genomics (VCF file) data set with about 100M rows. Screenshot courtesy of Aditya Parameswaran.

There have been a number of other projects targeted at improving the usability of databases, including some wonderfully prescient work by Joe Hellerstein and co-authors, and by Jagadish and co-authors. More recently, researchers have identified that SQL is a roadblock for analysts, and have proposed a variety of tools to alleviate that issue, such as GestureDB. Some companies, including Fieldbook, Airtable, and AlphaSheets, are addressing similar challenges.

The exploration problem

For many who do not know how to program, but have access to large data sets, even figuring out where to begin analysis can be overwhelming and unclear. Microsoft Excel’s visualization tools, or even visual analytics tools like Tableau, reduce this burden quite a bit by making it straightforward to generate drag-and-drop visualizations. But what if you had dozens of attributes and hundreds of attribute values? You likely wouldn’t know where to look or what typical patterns and trends look like, and you wouldn’t know where or how to find desired patterns. At present, this form of discovery happens via tedious trial-and-error: generate a number of visualizations until you find the ones you want.

Related projects

The Zenvisage project at the University of Illinois (as well as a previous incarnation, SeeDB) enables data scientists to specify a desired pattern (or more generally, a visual exploration query), and rapidly combs through collections of visualizations to find those that match the specification. The challenges here include developing a visual exploration language and interactive interface, determining what constitutes a “match,” along with query execution mechanisms that can explore a large number of candidate visualizations for matches quickly.
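
As a rough illustration of what deciding on a “match” can involve, the sketch below ranks candidate series by how closely their shape follows a sketched pattern. It assumes one simple notion of similarity (Euclidean distance between z-normalized series of equal length); the function names and the data are hypothetical, and this is not Zenvisage's actual matching algorithm or query language.

```python
import math

def z_normalize(series):
    """Rescale a series to mean 0 and standard deviation 1 so only shape matters."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))
    if std == 0:
        return [0.0] * len(series)  # a constant series has no shape to compare
    return [(x - mean) / std for x in series]

def distance(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_matches(sketch, candidates):
    """Return candidate names ordered from best to worst match to the sketch."""
    target = z_normalize(sketch)
    scored = [(distance(target, z_normalize(series)), name)
              for name, series in candidates.items()]
    return [name for _, name in sorted(scored)]

# Hypothetical data: one price series per city, all of the same length.
candidates = {
    "city_a": [100, 120, 140, 160, 180],  # steadily rising
    "city_b": [300, 280, 250, 230, 200],  # steadily falling
    "city_c": [150, 150, 151, 150, 150],  # flat with a small blip
}
sketch = [1, 2, 3, 4, 5]  # the user draws an upward trend

ranking = rank_matches(sketch, candidates)
print(ranking)
```

Normalizing first means a series is matched on its trend rather than its absolute values, which is one reasonable choice among several; a real system also has to avoid scoring every candidate exhaustively.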

Figure 2. Zenvisage supports a sketch-based querying paradigm, with the sketching canvas in the center, and the matches shown below. Representative and outlier visualizations are shown in two tabs on the right hand side. The data set being explored is a real estate data set. Screenshot courtesy of Aditya Parameswaran.

Google recently introduced the Explore functionality within Google Sheets, which, given a spreadsheet, recommends charts based on the data types being displayed. Other recent tools in this space include Voyager, Data Polygamy, and Profiler; Sunita Sarawagi’s seminal work on data cube browsing from the 90s is also relevant.

The data lake problem

Now that collection of data has become easier, organizations of all sizes routinely accumulate vast numbers of structured and semi-structured data sets. Semi-structured data sets are typically the output of programs, and there is very little discipline in how these data sets are organized. There’s no clear sense of what each data set contains, how each data set relates to other data sets, and how to extract value or structure to leverage for analysis.

Related projects

There are two underlying challenges here: the extraction of structured data and relating data sets to each other.

Trifacta, based on Data Wrangler, supports the interactive extraction of structured data from semi-structured log data sets. Other tools in this space include FlashExtract, PADS, and Foofah. Our Datamaran project is aimed at fully-unsupervised extraction from semi-structured machine-generated data sets.
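
For a sense of what “extracting structured data” means in practice, here is a minimal sketch that turns semi-structured log lines into records using a hand-written pattern. The log format, field names, and values are hypothetical. Writing the pattern by hand is exactly the manual step that the tools above try to assist with or, in Datamaran's case, eliminate, so this shows the baseline rather than any of those tools' methods.

```python
import re

# Hypothetical machine-generated log lines (not taken from any real system).
raw_log = """\
2017-10-18 08:30:01 INFO  worker=3 latency_ms=120
2017-10-18 08:30:02 WARN  worker=1 latency_ms=950
2017-10-18 08:30:05 INFO  worker=3 latency_ms=101
"""

# Hand-written template for one record; each named group becomes a column.
LINE_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+)\s+worker=(?P<worker>\d+) latency_ms=(?P<latency_ms>\d+)"
)

def extract_records(text):
    """Parse every line that fits the template into a dict of typed fields."""
    records = []
    for line in text.splitlines():
        match = LINE_PATTERN.fullmatch(line)
        if match is None:
            continue  # skip lines that do not fit the template
        record = match.groupdict()
        record["worker"] = int(record["worker"])
        record["latency_ms"] = int(record["latency_ms"])
        records.append(record)
    return records

records = extract_records(raw_log)
for record in records:
    print(record)
```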

Figure 3. Datamaran’s overall workflow on a simple example; while the idea is intuitively simple, Datamaran benefits from several optimizations that enable it to avoid expensive computation of matches to patterns. Screenshot courtesy of Aditya Parameswaran.

There have been multiple efforts to integrate and query large collections of data sets: Google’s WebTables project put some of these ideas to work in practice; more recently, Microsoft’s InfoGather project, Google’s Goods project, and the Data Civilizer project are addressing similar challenges. Data cleaning has also witnessed a bunch of recent interest, such as Microsoft’s data cleaning project, the HoloClean project, Luna’s Knowledge Fusion project, and the company Tamr. Classical work on data integration, and in particular, entity matching, is also relevant here.

The data versioning problem

When performing data science, data scientists routinely generate hundreds of data set versions at various stages of analysis. These versions are generated via various transformations, including cleaning, normalization, editing, and augmentation. At the same time, they are not stored in databases; instead, they are stored via ad-hoc mechanisms, like networked file-systems.

As a result, there is very little understanding of the derivation relationships across different versions, making it hard to locate or retrieve relevant versions for reproducibility and auditing. In addition, storing different data set versions as-is leads to massive redundancies.

Related projects

The Orpheus project at Illinois and U. Chicago supports data set versioning via a lightweight layer on top of a traditional relational database that is agnostic to the presence of versions. Via rewriting rules and partitioning to trade off storage and retrieval, we make data set versioning efficient and practical.
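
The redundancy and lineage problems can be illustrated with a small in-memory sketch: each distinct record is stored once, and a version is just a list of record identifiers plus a pointer to its parent. The class and method names are hypothetical, and this is a toy model of the general idea, not Orpheus's design, which works on top of a relational database.

```python
import hashlib
import json

class VersionStore:
    """Toy versioned data set store: records are deduplicated across versions."""

    def __init__(self):
        self.records = {}   # record id -> row, stored once however many versions use it
        self.versions = {}  # version id -> {"rows": [record ids], "parent": version id or None}

    def commit(self, version_id, rows, parent=None):
        """Save a version as a list of record ids, adding only unseen rows."""
        ids = []
        for row in rows:
            encoded = json.dumps(row, sort_keys=True).encode("utf-8")
            record_id = hashlib.sha256(encoded).hexdigest()
            self.records.setdefault(record_id, row)
            ids.append(record_id)
        self.versions[version_id] = {"rows": ids, "parent": parent}

    def checkout(self, version_id):
        """Reconstruct the full contents of a version."""
        return [self.records[rid] for rid in self.versions[version_id]["rows"]]

    def lineage(self, version_id):
        """Walk the derivation chain from a version back to its root."""
        chain = []
        while version_id is not None:
            chain.append(version_id)
            version_id = self.versions[version_id]["parent"]
        return chain

# Hypothetical data: v2 is a cleaned and extended copy of v1.
v1_rows = [{"id": 1, "score": 0.9}, {"id": 2, "score": None}, {"id": 3, "score": 0.4}]
v2_rows = [{"id": 1, "score": 0.9}, {"id": 2, "score": 0.0},
           {"id": 3, "score": 0.4}, {"id": 4, "score": 0.7}]

store = VersionStore()
store.commit("v1", v1_rows)
store.commit("v2", v2_rows, parent="v1")

print(len(store.records))   # distinct records stored across both versions
print(store.lineage("v2"))  # derivation chain for v2
```

Storing the two versions as separate files would keep seven rows; here only the five distinct rows are kept, and the parent pointer records which version was derived from which.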

Figure 4. Orpheus’s version browser interface operating on a protein links data set, displaying the version graph (right), the SQL input, querying over a collection of versions (center), and the results (center bottom). Other git-like commands are easily expressed by clicking the corresponding buttons on the right. Screenshot courtesy of Aditya Parameswaran.

Orpheus builds on the Datahub project; another offshoot is ProvDB, which aims to also capture model building within the same ecosystem. Ground is an emerging related effort to track data science metadata. In a similar vein, ModelDB helps track machine learning models developed in the course of data analysis. There is a lot of classical work on provenance in the context of databases and workflows, as well as work on time-travel databases, both of which are relevant.

The learning problem

Machine learning is a necessity in data science, especially for predictive capabilities, or for identifying hidden relationships between variables. Yet, deploying machine learning is rather challenging: there is a lot of code that needs to be written, machine learning pipelines often take hours to run, and rerunning the code after making small changes, such as adding or deleting features, or changing parameters (e.g., those that control the complexity of the model space), ends up requiring the user to spend several hours supervising the pipelines.

Related projects

The nascent Gestalt project is aimed at reducing the amount of time required in iterative machine learning by intelligently avoiding repeated computation across iterations; stay tuned for a release soon. Other related projects in this space include KeystoneML, DeepDive, Stan, Edward, TensorFlow, and SystemML, among others. There have been some exciting and important efforts to make machine learning more high level, reducing the effort involved in parameter tuning, feature specification, model maintenance, and training data specification.
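
One simple way to avoid repeated computation across iterations is to cache each pipeline step under a key derived from its name, its parameters, and the key of the step that feeds it. The sketch below assumes that approach; the step functions and parameter names are hypothetical stand-ins, and this is not Gestalt's implementation or API.

```python
import hashlib
import json

cache = {}     # step key -> cached result
executed = []  # names of the steps that actually ran (for illustration)

def run_step(name, func, params, upstream=None):
    """Run a step, or reuse its result if the same step, params, and input were seen."""
    upstream_key, upstream_value = upstream if upstream is not None else ("", None)
    raw = json.dumps([name, params, upstream_key], sort_keys=True)
    key = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    if key not in cache:
        cache[key] = func(upstream_value, **params)
        executed.append(name)
    return key, cache[key]

def featurize(_, n_features):
    # Stand-in for an expensive featurization step.
    return [[float(i * j) for j in range(n_features)] for i in range(4)]

def train(features, regularization):
    # Stand-in for model training; returns a toy "model".
    total = sum(sum(row) for row in features)
    return {"weight": total / (1.0 + regularization)}

def run_pipeline(n_features, regularization):
    feats = run_step("featurize", featurize, {"n_features": n_features})
    _, model = run_step("train", train, {"regularization": regularization}, upstream=feats)
    return model

model_a = run_pipeline(n_features=3, regularization=0.1)
model_b = run_pipeline(n_features=3, regularization=1.0)  # featurization is reused

print(executed)
```

Changing only the regularization parameter reruns training but not featurization, because the featurization key is unchanged. Changing `n_features` would change the upstream key and so invalidate both steps.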

Figure 5. Gestalt is a DSL built on Scala, enabling the use of standard learning libraries, along with intelligent execution of repeated machine learning workflows. Screenshot courtesy of Aditya Parameswaran.

A common recipe and a path forward

Overall, there is a pressing need for powerful HILDA tools that can help analysts—regardless of skill level—extract insights from large data sets. These tools need to be developed from the ground up to reduce labor, time, and tedium; simultaneously, the tools must minimize complexity and the need for sophisticated analysis skills.

The design of HILDA tools must also acknowledge that human attention is actually the scarcest resource—and, therefore, must optimize the participation of humans in the data science process. Developing these tools requires techniques from multiple fields: not just database systems, but also algorithms and ideas from data mining, as well as design and interaction principles from human-computer interaction (HCI).

This creates challenges in evaluation: not only do we need to optimize and evaluate the performance and scalability (from databases), but also accuracy and usefulness of insights (from data mining), and usability and interactivity (from HCI).

As more and more organizations and disciplines accumulate vast troves of data, HILDA is going to be not just hugely important, but also increasingly necessary. By enabling the majority of data users to tap into the hidden insights within these troves of data, HILDA has the potential to not just revolutionize, but also define the field of data analytics in the decades to come.


Source link

Grove – 125KHz RFID Reader

By | iot, machinelearning

The Grove-125KHz RFID Reader is a module used to read uem4100 RFID card information with two output formats: UART and Wiegand. It has a maximum sensing distance of 7cm.

Selectable output format: UART or Wiegand
4-pin Electronic Brick interface
High sensitivity
7cm sensing distance
125kHz RFID card, read only
Arduino library


Interview of Jerome Berthier, Head of BI and Big Data at ELCA

By | ai, bigdata, machinelearning

Data Mining Research (DMR): Can you tell us who you are and how you came to the field of Data Science?

Jerome Berthier (JB): My name is Jerome Berthier, I am an engineer in Computer Science and I have an MBA in management. After 10 years working in different roles for an IT provider (developer, sales representative, managing director), I joined ELCA in 2012 to head the BI division. At that time Big Data was starting to become a mainstream concept in IT and I had the great opportunity to be in charge of developing ELCA’s expertise in this field.

We had to start from scratch, as Big Data was completely new to us, and so, to research the subject we engaged an expert in NLP and created our own Big Data Lab devoted to testing and evaluating algorithms and Big Data solutions available on the market, basically anything that could help us to better understand the principles of Big Data and how they can be applied. This research was an amazing experience which achieved excellent results.

Since then, I have kept up to date on the evolution of Big Data and have been active in raising awareness of its benefits, through presentations in various contexts: for ELCA customers, at IT events and in media interviews.

DMR: On your LinkedIn profile, you describe yourself as “A Voice” of Data Science. What do you mean by that?

JB: To be honest, I don’t believe that any individual data scientist can be skilled in all fields: IT, marketing, communication, maths, sales, statistics, business (banking, insurance, travel…) and so on: an All-in-One data scientist, so to speak. Of course, if you know one, I would be delighted to meet them and have them in my team.

The applications of Data Science are often highly specialized and there are several types of data scientist. The key is to take advantage of these different profiles and involve them together to create a strong team of data scientists. The SAS institute defines 7 basic profiles: http://www.sas.com/content/dam/SAS/en_gb/image/other1/events/WMAGDS/DataScientist-survey-report-web%20FINAL.pdf

  • The Geeks 41%: The Geeks are the largest group in our sample and have the largest female membership of all the groups at 37 per cent. They have a naturally technical bias, strong logic and analytical skills. Essentially “black and white” thinkers, they like to speak plainly and stick to the point – don’t expect them to be moved by emotionally charged arguments. With their attention to detail and fondness for the rules, the Geeks are well suited to roles such as defining systems requirements, designing processes and programming.
  • The Gurus 11%: The next largest group, the Gurus, has a measure of reactive introversion, like the Geeks, which pre-disposes them to scientific and technical subjects. Yet they also display a diametrically opposite characteristic: the strong presence of proactive extroversion, including solid and often highly persuasive communications and social skills. The Gurus can play a very important role by using their enthusiasm, tact and diplomacy to promote the benefits of the data sciences to those holding the purse strings, or who have the authority to give projects the green light.
  • The Drivers 11%: The Drivers are proactive introverts: highly pragmatic individuals who use their determination and focus to realise their goals. Self-confident and results-oriented, they are ideal project managers and team leaders, who excel at prioritising, monitoring and driving projects to a successful conclusion.
  • The Crunchers 11%: This category is probably one of the least self-promoting groups. Strongly reactive – rather than proactive – personalities, the Crunchers like routine and constancy. They display high technical competence and consistency, making them superb in a range of technically-oriented support roles including data preparation and entry, statistical analysis, monitoring of incoming data and quality control.
  • The Deliverers 7%: Like the Drivers, these individuals are proactive and well suited to project and man management. This is also the group with the largest proportion of men at 80 per cent. However, the Deliverers also have a strong pre-disposition towards acquiring and/or applying technical skills. So, while they are capable of bringing focus and momentum to ensure project success, they are also likely to understand the finer technical details and devise solutions in much greater technical depth.
  • The Voices 6%: The Voices are strong communicators with less apparent detailed technical knowledge than the Gurus. The presence of this group suggests a strong demand for natural promoters who have the ability to generate enthusiasm for the potential of big data and the data sciences at a conceptual level – rather than the practical or technical level. The Voices are strongly valued for their positive outlook, and may be engaged in presenting the results of big data projects as well as supporting their implementation.
  • Other Personalities 13%: A smaller number of respondents displayed a range of other traits.
    • The Ground Breakers: offer new approaches, new methods and new possibilities, drawn from a mix of inspiration and dogged logical thinking. Roles include: system design and algorithm development.
    • The Seekers: combine superb technical knowledge and understanding with inquisitiveness and a drive to find solutions. Roles include: research.
    • The Teachers: skilled at imparting knowledge and inspiring others to want to learn. Roles include: training and mentoring.
    • The Lynchpins: important team players who may not have a depth of technical knowledge but provide essential support services. Roles include: co-ordination and administration.

So I am a “Voice”. Why did I decide to use this term?

Often our customers prefer to start with POC or POV to evaluate the potential of Big Data. But even if the results show great promise, it is still not easy to get the necessary budget from the Board (C-level) to go a step further, because past IT solutions did not necessarily live up to expectations. So it is my role to accompany the teams in their POC in order to become familiar with the company context and to assist them in presenting the results to C-levels in such a way as to encourage continued support…

DMR: What do you see as the main skills that a Data Scientist needs in 2017?

JB: All of them, of course, but I see 3 of them as being especially important:

  • Big Data has changed the relationship between infrastructure and analytics. In the past, there were 2 silos: on one hand, analytics; and on the other hand, infrastructure. Now it’s impossible to do useful analysis if you don’t understand the underlying architecture of the infrastructure; and it’s impossible to size the infrastructure accurately if you don’t know what type of analysis will be used. So knowledge of Big Data infrastructure will be as important as ever.
  • Content analytics is a high priority for me because more and more projects use flow aggregation, chatbots, document analysis, email analysis …
  • Finally, complex event processing is becoming increasingly relevant.

DMR: Based on your personal experience, what is the biggest challenge you face when involved in Data Science projects?

JB: For me, there are 2 key challenges:

  • First of all, Data is our “raw material”. Without it we are out of work. I often read and hear that the world is now submerged in data… That may be true but how much of this data is really relevant? I have seen a number of projects in which important data is lacking (which seems incredible when we realize how much data we possess…) and/or in which important data is not accessible. Quality/governance of data is clearly a big problem everywhere.
  • Another problem is habit. How can we explain to people who have spent years and years working in the same way that everything must change drastically because all of the old habits have now become obsolete? I once did a customer segmentation project which showed that the existing segmentation was irrelevant because it was not based on data analysis.

DMR: If you could give just one piece of advice to a future data scientist, what would it be?

JB: Open a book on Business Intelligence. I often receive new data scientists who know nothing about BI, ETL, SQL language or even what a database is…

I consider Big Data to be an evolution of our classic BI, but the basic principles stay the same: collection of data, transformation, merging, analysis and finally decision-making based on the results.

Today unstructured data can be used, merging can be on the fly and decisions can be predictive or prescriptive, but the principles remain the same.

Furthermore, outside of Academia, each time you want to work with big data, you will have to deal with old BI and old sources of data. So it’s necessary to understand how the old system works, too.



Thursday: Unemployment Claims, Philly Fed Mfg

By | ai, bigdata, machinelearning

Black Knight is expected to release their “First Look” mortgage delinquency data for September tomorrow morning. The Black Knight data includes mortgages 30 days or more delinquent (as opposed to the Fannie and Freddie data that is for seriously delinquent loans). Since this includes short term delinquencies, I expect a sharp increase in mortgage delinquencies due to the hurricanes. It will take longer for the impact of the hurricanes to show up in the Fannie and Freddie data.

• At 8:30 AM ET, the initial weekly unemployment claims report will be released. The consensus is for 240 thousand initial claims, down from 243 thousand the previous week.

• Also at 8:30 AM, the Philly Fed manufacturing survey for October. The consensus is for a reading of 20.2, down from 23.8.


Why Your HR Survey is a Lie and How to Get the Truth

By | ai, bigdata, machinelearning

OdinText Discovers Job Satisfaction Drivers in Anonymous Employee Data

Employee satisfaction surveys go by different names – “stakeholder satisfaction,” “360-degree surveys,” “employee engagement.” I blog a lot about the shortcomings of survey data and problems like respondent and data quality due to bad sample, long surveys, and poorly structured questions (which assumes the researcher already […]

The post Why Your HR Survey is a Lie and How to Get the Truth appeared first on OdinText.


No tradeoff between regularization and discovery

By | ai, bigdata, machinelearning

(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

We had a couple recent discussions regarding questionable claims based on p-values extracted from forking paths, and in both cases (a study “trying large numbers of combinations of otherwise-unused drugs against a large number of untreatable illnesses,” and a salami-slicing exercise looking for public opinion changes in subgroups of the population), I recommended fitting a multilevel model to estimate the effects in question. The idea is that such a model will estimate a distribution of treatment effects that is concentrated near zero, and the resulting inferences for the individual effects will be partially pooled toward zero, with the anticipated result in these cases that none of the claims will be so strong any more.

Here’s a simple example:

Suppose the prior distribution, as estimated by the hierarchical model, is that the population of effects has mean 0 and standard deviation of 0.1. And now suppose that the data-based estimate for one of the treatment effects is 0.5 with a standard error of 0.2 (thus, statistically significant at conventional levels). Also assume normal distributions all around. Then the posterior distribution for this particular treatment effect is normal with mean (0/0.1^2 + 0.5/0.2^2)/(1/0.1^2 + 1/0.2^2) = 0.10, with standard deviation 1/sqrt(1/0.1^2 + 1/0.2^2) = 0.09. Based on this inference, there’s an 87% posterior probability that the treatment effect is positive.

We could expand this hypothetical example by considering possible alternative prior distributions for the unknown treatment effect. Uniform(-inf,inf) is just too weak. Perhaps normal(0,0.1) is also weakly informative, and maybe the actual population distribution of the true effects is something like normal(0,0.05). In that case, using the normal(0,0.1) prior as above will under-pool, that is, the inference will be anti-conservative and be too susceptible to noise.

With a normal(0,0.05) prior and normal(0.5,0.2) data, you’ll get a posterior that’s normal with mean (0/0.05^2 + 0.5/0.2^2)/(1/0.05^2 + 1/0.2^2) = 0.03, with standard deviation 1/sqrt(1/0.05^2 + 1/0.2^2) = 0.05. Thus, the treatment effect is likely to be small, and there’s a 73% chance that it is positive.
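
For readers who want to check the arithmetic, here is a short sketch that reproduces the two calculations above under the same assumptions (normal prior, normal data, known standard error). The function names are mine, chosen for illustration.

```python
import math

def normal_posterior(prior_mean, prior_sd, estimate, std_error):
    """Precision-weighted combination of a normal prior and a normal estimate."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = 1.0 / std_error ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_mean * prior_precision + estimate * data_precision)
    return post_mean, math.sqrt(post_var)

def prob_positive(mean, sd):
    """P(effect > 0) under a normal(mean, sd) distribution."""
    return 0.5 * (1.0 + math.erf(mean / (sd * math.sqrt(2.0))))

results = {}
for prior_sd in (0.1, 0.05):
    mean, sd = normal_posterior(0.0, prior_sd, estimate=0.5, std_error=0.2)
    results[prior_sd] = (mean, sd, prob_positive(mean, sd))
    print(f"prior sd {prior_sd}: mean {mean:.2f}, sd {sd:.2f}, "
          f"P(effect > 0) = {results[prior_sd][2]:.2f}")
```

The posterior mean is a weighted average of the prior mean and the data estimate, with weights proportional to their precisions (inverse variances), which is why the tighter prior pulls the estimate much closer to zero.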

Also, all this assumes zero bias in measurement and estimation, which is just about never correct but can be an ok approximation when standard errors are large. Once the standard error becomes small, then we should think about including an error term to allow for bias, to avoid ending up with too-strong claims.

Regularization vs. discovery?

The above procedure is an example of regularization or smoothing, and from the Bayesian perspective it’s the right thing to do, combining prior information and data to get probabilistic inference.

A concern is sometimes raised, however, that regularization gets in the way of discovery. By partially pooling estimates toward zero, are we reducing our ability to discover new and surprising effects?

My answer is no, there’s not a tradeoff between regularization and discovery.

How is that? Consider the example above, with the 0 ± 0.05 prior with 0.5 ± 0.2 data. Our prior pulls the estimate to 0.03 ± 0.05, thus moving the estimate from clearly statistically significant (2.5 standard errors away from 0) to not even close to statistical significance (less than 1 standard error from zero).

So we’ve lost the opportunity for discovery, right?


No. There’s nothing stopping you from gathering more data to pursue this possible effect you’ve discovered. Or, if you can’t gather such data, you just have to accept this uncertainty.

If you want to be more open to discovery, you can pursue more leads and gather more and higher quality data. That’s how discovery happens.

B-b-b-but, you might say, what about discovery by luck? By regularizing, are we losing the ability to get lucky? Even if our hypotheses are mere lottery tickets, why throw away tickets that might contain a winner?

Here, my answer is: If you want to label something that might well be wrong as a “discovery,” that’s fine by me! No need for a discovery to represent certainty or even to represent near-certainty. In the above example, we have a 73% posterior probability that the treatment effect is positive. Call that a discovery if you’d like. Integrate this discovery into your theoretical and practical understanding of the world and use it to decide where to go next.

P.S. The above could be performed using longer-tailed distributions if that’s more appropriate for the problem under consideration. The numbers will change but the general principles are the same.

The post No tradeoff between regularization and discovery appeared first on Statistical Modeling, Causal Inference, and Social Science.


The post No tradeoff between regularization and discovery appeared first on All About Statistics.
