Category

bigdata

Catalog of visualization tools

By | ai, bigdata, machinelearning | No Comments

There are a lot of visualization-related tools out there. Here’s a simple categorized collection of what’s available, with a focus on the free and open source stuff.

This site features a curated selection of data visualization tools meant to bridge the gap between programmers/statisticians and the general public by only highlighting free/freemium, responsive and relatively simple-to-learn technologies for displaying both basic and complex, multivariate datasets. It leans heavily toward open-source software and plugins, rather than enterprise, expensive B.I. solutions.

I found some broken links, and the descriptions need a little editing, but it’s a good place to start.

Also, if you’re just starting out with visualization, you might find all the resources a bit overwhelming. If that’s the case, don’t fret. You don’t have to learn how to use all of them. Let your desired outcomes guide you. Here’s what I use.


Source link

Weekly Initial Unemployment Claims decrease to 234,000



The DOL reported:

In the week ending January 14, the advance figure for seasonally adjusted initial claims was 234,000, a decrease of 15,000 from the previous week’s revised level. The previous week’s level was revised up by 2,000 from 247,000 to 249,000. The 4-week moving average was 246,750, a decrease of 10,250 from the previous week’s revised average. This is the lowest level for this average since November 3, 1973 when it was 244,000. The previous week’s average was revised up by 500 from 256,500 to 257,000.

There were no special factors impacting this week’s initial claims.
emphasis added

The previous week was revised up.

The following graph shows the 4-week moving average of weekly claims since 1971.


The dashed line on the graph is the current 4-week average. The four-week average of weekly unemployment claims decreased to 246,750.

This was below the consensus forecast. This is the lowest level for the four-week average since 1973 (and the population is much larger now).

The low level of claims suggests relatively few layoffs.
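The arithmetic behind the report can be checked with a short sketch. The Python below is illustrative only: the function name `moving_average` and the four weekly values in the usage example are hypothetical (the report does not list the individual weeks), while the other figures are taken from the DOL text quoted above.

```python
def moving_average(values, window=4):
    """Return the mean of the last `window` values."""
    if len(values) < window:
        raise ValueError("need at least `window` values")
    recent = values[-window:]
    return sum(recent) / window

# Figures quoted in the DOL report.
initial_claims = 234_000
previous_week_revised = 249_000
four_week_avg = 246_750
previous_avg_revised = 257_000

# Week-over-week changes implied by those figures.
weekly_change = initial_claims - previous_week_revised
average_change = four_week_avg - previous_avg_revised

# Hypothetical weekly series (NOT from the report), used only to show the function.
example_weeks = [250_000, 240_000, 260_000, 230_000]
example_avg = moving_average(example_weeks)
```

The two computed changes match the decreases of 15,000 and 10,250 stated in the report.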



Counting is hard, especially when you don’t have theories


(This article was originally published at Big Data, Plainly Spoken (aka Numbers Rule Your World), and syndicated at StatsBlogs.)

In the previous post, I diagnosed one data issue with the IMDB dataset found on Kaggle. On average, the third-party face-recognition software undercounted the number of people on movie posters by 50%.

It turns out that counting the number of people on movie posters is a subjective activity. Reasonable people can disagree about the number of heads on some of those posters.

For example, here is a larger view of the Better Luck Tomorrow poster I showed yesterday:

[Image: Better Luck Tomorrow poster]

By my count, there are six people on this poster. But notice the row of photos below the red title: someone could argue that there are more than six people on this poster. (Regardless, the algorithm is still completely wrong in this case, as it counted only one head.)

So one of the “rules” that I followed when counting heads is to count only those people to whom the designer of the poster is drawing attention. Using this rule, I ignore the row of photos below the red title. Also by this rule, if a poster contains a main character and his or her shadow, I count the person only once. If the poster contains a number of people in the background, such as generic soldiers on the battlefield, I do not count them.

Another rule I used is to count the back or side of a person, even if I could not see his or her face, provided that this person is a main character of the movie. For example, the following Rocky Balboa poster has one person on it.

[Image: Rocky Balboa poster]

(By comparison, the algorithm counted zero heads.)
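To make the comparison concrete, here is a minimal Python sketch of one way an undercount rate could be computed from manual and algorithmic head counts. It uses only the two posters counted so far in this post. The function name `undercount_rate` is made up for illustration, the aggregate definition is an assumption rather than the calculation behind the 50% figure, and the result says nothing about the full dataset.

```python
def undercount_rate(manual_counts, algorithm_counts):
    """Share of manually counted heads that the algorithm missed, in aggregate."""
    total_manual = sum(manual_counts)
    if total_manual == 0:
        raise ValueError("manual counts sum to zero")
    total_algorithm = sum(algorithm_counts)
    return 1 - total_algorithm / total_manual

# (manual count, algorithm count) for the two posters discussed above.
posters = {
    "Better Luck Tomorrow": (6, 1),
    "Rocky Balboa": (1, 0),
}
manual = [m for m, _ in posters.values()]
algorithm = [a for _, a in posters.values()]

# Aggregate undercount across both posters: 1 - (1 + 0) / (6 + 1).
rate = undercount_rate(manual, algorithm)
```

Note that the manual counts themselves depend on the counting rules chosen, so the rate changes whenever the rules do.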

***

From the distribution of the number of heads predicted by the algorithm, I learned that some posters may have dozens of people on them. So I pulled out these outliers and looked at them.

This poster of The Master (2012) is said to contain 31 people.

[Image: The Master poster]

On a closer look, this is a tessellation of a triangle of faces. Should that count as three people or lots of people? As the color fades off on the sides of the poster, should we count those barely visible faces?

Counting is harder than it seems.

***

The discussion above leads to an important issue in building models. The analyst must have some working theory about how X is related to Y. If it is believed that the number of faces on the movie poster affects movie-goers' enthusiasm, then that guides us to count certain people but not others.

***

If one were to keep pushing on the rationale of using this face count data, one inevitably arrives at a dead end. Here are the top results from a Google Image Search on “The Master 2012 poster”:

[Image: Google Image Search results for The Master posters]

Well, every movie is supported by a variety of posters. The bigger the movie, the bigger the marketing budget, and the more numerous the posters. There are two key observations from the above:

The blue tessellation is one of the key designs used for this movie. Within this design framework, some posters contain only three heads, some maybe a dozen heads, and some (like the one shown on IMDB) many dozens of heads.

Further, there are at least three other design concepts, completely different from the IMDB poster, and showing different numbers of people!

Going back to the theory that movie-goers respond to the poster design (in particular, the number of people in the poster), the analyst now realizes that he or she has a huge hole in the dataset. Which of these posters did the movie-goer see? Did IMDB know which poster was seen most often?

Thus, not only are the counts subjective and imprecise, but it is not even clear that we are analyzing the right posters.

***

Once I led the students down this path, almost everyone decided to drop this variable from the dataset.

Please comment on the article here: Big Data, Plainly Spoken (aka Numbers Rule Your World)

The post Counting is hard, especially when you don’t have theories appeared first on All About Statistics.





Elements of a successful #openscience #rstats workshop


(This article was first published on R – christopher lortie, and kindly contributed to R-bloggers)

What makes an open science workshop effective or successful*?

Over the last 15 years, I have had the good fortune to participate in workshops as a student and sometimes as an instructor. Consistently, there were beneficial discovery experiences, and at times, some of the processes highlighted have been transformative. Last year, I had the good fortune to participate in Software Carpentry at UCSB and Software Carpentry at YorkU, and in the past, I attended (in part) workshops such as Open Science for Synthesis. Several of us are now deciding what to attend as students in 2017. I have been wondering about the efficacy of the workshop model and why workshops seem to be so effective. I propose that the answer is expectations. Here is a set of brief lists of observations from workshops that lead me to this conclusion.

*Note: I define a workshop as effective or successful when it provides me with something practical that I did not have before the workshop. Practical outcomes can include tools, ideas, workflows, insights, or novel viewpoints from discussion: anything that helps me do better open science. Efficacy for me is relative to learning by myself (i.e. through reading, watching webinars, or struggling with code or data), asking for help from others, taking an online course (that I always give up on), or attending a scientific conference.

Delivery elements of an open science training workshop

  1. Lectures
  2. Tutorials
  3. Demonstrations
  4. Q & A sessions
  5. Hands-on exercises
  6. Webinars or group viewing of recorded vignettes.

Summary of expectations from this list: a workshop will offer me content in more than one way, unlike a more traditional course offering. I can ask questions about the content on the spot and get an answer.

Content elements of an open science training workshop

  1. Data and code
  2. Slide decks
  3. Advanced discussion
  4. Experts that can address basic and advanced queries
  5. A curated list of additional resources
  6. Opinions from the experts on the ‘best’ way to do something
  7. A list of problems or questions that need to be addressed or solved both routinely and in specific contexts when doing science
  8. A toolkit in some form associated with the specific focus of the workshop.

Summary of expectations from this list: the best, most useful content is curated. It is contemporary, and it would be a challenge for me to find this out on my own.

Pedagogical elements of an open science training workshop

  1. Organized to reflect authentic challenges
  2. Uses problem-based learning
  3. Content is very contemporary
  4. Very light on lecture and heavy on practical application
  5. Reasonably small groups
  6. Will include team science and networks to learn and solve problems
  7. Short duration, high intensity
  8. Will use an open science tool for discussion and collective note taking
  9. Will be organized by major concepts such as data & meta-data, workflows, code, data repositories OR will be organized around a central problem or theme, and we will work together through the steps to solve a problem
  10. There will be a specific, quantifiable outcome for the participants (i.e. we will learn how to do or use a specific set of tools for future work).

Summary of expectations from this list: the training and learning experience will emulate a scientific working group that has convened to solve a problem. In this case, the problem is how we can all get better at a certain set of scientific activities, rather than, for instance, whether a group can aggregate and summarize a global alpine dataset. These collaborative problem-solving models need not be mutually exclusive.

Higher-order expectations that summarize all these open science workshop elements

  1. Experts, curated content, and contemporary tools.
  2. Everyone is focussed exclusively on the workshop, i.e. we all try to put our lives on hold to teach and learn together rapidly for a short time.
  3. Experiences are authentic and focus on problem solving.
  4. I will have to work trying things, but the slope of the learning curve/climb will be mediated by the workshop process.
  5. There will be some, but not too much, lecturing to give me the big picture highlights of why I need to know/use a specific concept or tool.

To leave a comment for the author, please follow the link and comment on their blog: R – christopher lortie.






Randy Hunt on design at Etsy


The O’Reilly Design Podcast: Collaborating with engineering, hiring for humility, and the code debate.

In this week’s Design Podcast, I sit down with Randy Hunt, VP of design at Etsy. We talk about the culture at Etsy, why it’s important to understand the materials you are designing with, and why humility is your most important skill.






Philly Fed: Manufacturing Activity Continued to Improve in January



Earlier from the Philly Fed: Manufacturing Activity Continued to Improve in January

Economic conditions continued to improve in January, according to the firms responding to this month’s Manufacturing Business Outlook Survey. The indexes for general activity, new orders, and employment were all positive this month and increased from their readings last month. Manufacturers have generally grown more optimistic in their forecasts over the past two months. The future indexes for growth over the next six months, including employment, continued to improve this month.

The index for current manufacturing activity in the region increased from a revised reading of 19.7 in December to 23.6 this month. … The general activity index has remained positive for six consecutive months, and the activity index reading was the highest since November 2014.

Firms reported an increase in manufacturing employment this month. … The current employment index improved 9 points, registering its second consecutive positive reading.

Here is a graph comparing the regional Fed surveys and the ISM manufacturing index:

[Graph: Fed Manufacturing Surveys and ISM PMI]

The New York and Philly Fed surveys are averaged together (yellow, through January), and five Fed surveys are averaged (blue, through December) including New York, Philly, Richmond, Dallas and Kansas City. The Institute for Supply Management (ISM) PMI (red) is through December (right axis).

It seems likely the ISM manufacturing index will show faster expansion again in January.



Knoxville, TN: R for Text Analysis Workshop


(This article was originally published at r4stats.com, and syndicated at StatsBlogs.)

The Knoxville R Users Group is presenting a workshop on text analysis using R by Bob Muenchen. The workshop is free and open to the public. You can join the group at https://www.meetup.com/Knoxville-R-Users-Group. A description of the workshop follows.

Seeking Cloud

R for Text Analysis

When analyzing text using R, it’s hard to know where to begin. There are 37 packages available and there is quite a lot of overlap in what they can do. This workshop will demonstrate three popular approaches: dictionary-based content analysis, latent semantic analysis, and latent Dirichlet allocation. We will spend much of the time on the data preparation steps that are important to all text analysis methods, including data acquisition, word stemming/lemmatization, removal of punctuation and other special characters, phrase discovery, tokenization, and so on. While the examples will focus on the automated extraction of topics in the text files, we will also briefly cover the analysis of sentiment (e.g. how positive is customer feedback?) and style (who wrote this? are they telling the truth?).

The results of each text analysis approach will be the topics found, and a numerical measure of each topic in each document. We will then merge that with numeric data and do analyses combining both types of data.

The R packages used include quanteda, lsa, topicmodels, tidytext and wordcloud; with brief coverage of tm and SnowballC. While the workshop will not be hands-on due to time constraints, the programs and data files will be available afterwards.
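As a rough illustration of the preparation steps and the dictionary-based approach described above, here is a minimal sketch. It is written in Python with the standard library only, not with the R packages the workshop uses, and the category dictionary, the sample sentence, and the function names are all invented for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())

def dictionary_scores(tokens, dictionary):
    """Count how many tokens fall into each dictionary category."""
    counts = Counter()
    for category, words in dictionary.items():
        counts[category] = sum(1 for token in tokens if token in words)
    return dict(counts)

# Hypothetical dictionary and document, for illustration only.
sentiment_dictionary = {
    "positive": {"great", "helpful", "fast"},
    "negative": {"slow", "broken", "rude"},
}
feedback = "Great service, very helpful staff. Shipping was slow!"

tokens = tokenize(feedback)
scores = dictionary_scores(tokens, sentiment_dictionary)
```

The per-category counts play the role of the "numerical measure of each topic in each document" mentioned above; stemming and phrase discovery are left out to keep the sketch short.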

Where: University of Tennessee Humanities and Social Sciences Building, room 201. If the group gets too large, the location may move and a notice will be sent to everyone who RSVPs on Meetup.com or who registers at the UT workshop site below. You can also verify the location the day before via email with Bob at [email protected]

When: 9:05-12:05 Friday 1/27/17

Prerequisite: R language basics

Members of UT Community register at: http://workshop.utk.edu under Researcher Focused

Members of other groups please RSVP on your respective sites so I can bring enough handouts.

Please comment on the article here: r4stats.com

The post Knoxville, TN: R for Text Analysis Workshop appeared first on All About Statistics.



