
Product Launch: Amped up Kernels Resources + Code Tips & Hidden Cells

By | ai, bigdata, machinelearning

Kaggle’s Kernels-focused engineering team has been working hard to make our coding environment one that you want to use for all of your side projects. We’re excited to announce a host of new changes that we believe make Kernels the default place you’ll want to train your competition models, explore open data, and build your data science portfolio. Here’s exactly what’s changed:

Additional Computational Resources (doubled and tripled)

  • Execution time: Kernels can now run for up to 60 minutes instead of the previous 20-minute limit.
  • CPUs: Use up to four CPUs for multithreaded workloads.
  • RAM: Work with twice as much data with 16 GB of RAM available for every kernel.
  • Disk space: Create killer output with 1 GB of disk space.

New resources

Code Tips

Code tips catch common mistakes as you work through coding a kernel. They will pop up when you run code with an identifiable error and significantly cut down your troubleshooting time.

Code Tips GIF

Here are some examples of the most common code tips you’ll run into:

Although you specified the “R” language, you might be writing Python code. Was this intentional? If not, start a Python script instead.

Couldn’t show a character. Did you happen to load binary data as text?

Did you mean “from bs4 import BeautifulSoup”?

Did you mean “ggplot2”?

Do you mean pandas.DataFrame?

Hidden Cells

You publish public kernels so you can share your data science work to build a portfolio, get feedback, and help others learn. We’ve added the ability to hide cells, making it possible to present your analysis cleanly so people can focus on what’s useful. If viewers want to dig in, it’s still possible to unhide cells to see the dirty details.

Improved Reliability

We know how frustrating it is to lose work, get stuck on Kaggle-side errors, or simply have a workbench that’s down when you’re trying to get up and running. It’s a priority of the Kernels engineering team to improve reliability so you can count on your code. Here are a few recent improvements:

  • Fewer notebook disconnections
  • Notebook editor now works with “Block Third Party Cookies” browser setting enabled
  • More reliable notebook auto-save

What else can we do to make Kernels your default data science and ML workbench? Aurelio will be monitoring this forum post closely…let us know what you’d like to see next!


Source link

Update: For Fun, Stock Market as Barometer of Policy Success

By | ai, bigdata, machinelearning


Note: This is a repeat of a June post with updated statistics and graph.

There are a number of observers who think the stock market is the key barometer of policy success. My view is there are many measures of success – and that the economy needs to work well for a majority of the people – not just stock investors.

However, as an example, Treasury Secretary Steven Mnuchin was on CNBC on Feb 22, 2017, and when asked whether the stock market rally was a vote of confidence in the new administration, he replied: “Absolutely, this is a mark-to-market business, and you see what the market thinks.”

And Larry Kudlow wrote in 2007: A Stock Market Vote of Confidence for Bush: “I have long believed that stock markets are the best barometer of the health, wealth and security of a nation. And today’s stock market message is an unmistakable vote of confidence for the president.”

Note: Kudlow’s comments were made a few months before the market started selling off in the Great Recession. For more on Kudlow, see: Larry Kudlow is usually wrong

For fun, here is a graph comparing S&P500 returns (ex-dividends) under Presidents Trump and Obama:

Stock Market Performance (graph)

Blue is for Mr. Obama, Orange is for Mr. Trump.

At this point, the S&P500 is up 10.1% under Mr. Trump compared to up 32.2% under Mr. Obama for the same number of market days.
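
For readers who want to reproduce a comparison like this, here is a rough sketch (not code from the original post) using the quantmod package; the start dates and the number of market days shown are illustrative assumptions to adjust:

library(quantmod)

# daily S&P 500 closes from Yahoo Finance (ex-dividends, as in the graph above)
sp500 <- getSymbols("^GSPC", src = "yahoo", from = "2009-01-01", auto.assign = FALSE)
sp_close <- Cl(sp500)

# percent change over the first n market days starting at a given date
pct_since <- function(start_date, n) {
  px <- sp_close[index(sp_close) >= as.Date(start_date)][1:n]
  100 * (as.numeric(px) / as.numeric(px[1]) - 1)
}

n <- 160  # illustrative: the same number of market days for both presidents
obama <- pct_since("2009-01-20", n)  # inauguration dates
trump <- pct_since("2017-01-20", n)

matplot(cbind(obama, trump), type = "l", lty = 1, col = c("blue", "orange"),
        xlab = "Market days in office", ylab = "S&P 500 return (%)")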


Source link

My advice on dplyr::mutate()

By | ai, bigdata, machinelearning

(This article was originally published at Statistics – Win-Vector Blog, and syndicated at StatsBlogs.)

There are substantial differences between ad-hoc analyses (be they: machine learning research, data science contests, or other demonstrations) and production worthy systems. Roughly: ad-hoc analyses have to be correct only at the moment they are run (and often once they are correct, that is the last time they are run; obviously the idea of reproducible research is an attempt to raise this standard). Production systems have to be durable: they have to remain correct as models, data, packages, users, and environments change over time.

Demonstration systems need merely glow in bright light among friends; production systems must be correct, even alone in the dark.


“Character is what you are in the dark.”

John Whorfin quoting Dwight L. Moody.

I have found: to deliver production worthy data science and predictive analytic systems, one has to develop per-team and per-project field tested recommendations and best practices. This is necessary even when, or especially when, these procedures differ from official doctrine.

What I want to do is share a single small piece of Win-Vector LLC‘s current guidance on using the R package dplyr.

  • Disclaimer: Win-Vector LLC has no official standing with RStudio, or dplyr development.
  • However:

    “One need not have been Caesar in order to understand Caesar.”

    Alternatively: Georg Simmel or Max Weber.

    Win-Vector LLC, as a consultancy, has experience helping large companies deploy enterprise big data solutions involving R, dplyr, sparklyr, and Apache Spark. Win-Vector LLC, as a training organization, has experience in how new users perceive, reason about, and internalize how to use R and dplyr. Our group knows how to help deploy production grade systems, and how to help new users master these systems.

From experience we have distilled a lot of best practices. And below we will share one.

From: “R for Data Science; Wickham, Grolemund; O’Reilly, 2017” we have:

Note that you can refer to columns that you’ve just created:

mutate(flights_sml,
   gain = arr_delay - dep_delay,
   hours = air_time / 60,
   gain_per_hour = gain / hours
 )

Let’s try that with database backed data:

suppressPackageStartupMessages(library("dplyr"))
packageVersion("dplyr")
# [1] ‘0.7.3’

db <- DBI::dbConnect(RSQLite::SQLite(), 
                     ":memory:")
flights <- copy_to(db, 
             nycflights13::flights,
             'flights')

mutate(flights,
       gain = arr_delay - dep_delay,
       hours = air_time / 60,
       gain_per_hour = gain / hours
)
# # Source:   lazy query [?? x 22]
# # Database: sqlite 3.19.3 [:memory:]
# year month   day dep_time sched_dep_time        ...
#                        ...
#   1  2013     1     1      517            515   ...
# ...

That worked. One of the selling points of dplyr is that much of dplyr is source-generic or source-agnostic: meaning the same code can be run against different data providers (in-memory data, databases, Spark).

However, if a new user tries to extend such an example (say, adding gain_per_minute) they run into this:

mutate(flights,
       gain = arr_delay - dep_delay,
       hours = air_time / 60,
       gain_per_hour = gain / hours,
       gain_per_minute = 60 * gain_per_hour
)
# Error in rsqlite_send_query(conn@ptr, statement) : 
#   no such column: gain_per_hour

(Some details on the failing query are here.)

It is hard for experts to understand how frustrating the above is to a new R user or to a part time R user. It feels like any variation on the original code causes it to fail. None of the rules they have been taught anticipate this, or tell them how to get out of this situation.

This quickly leads to strong feelings of learned helplessness and anxiety.

Our rule for dplyr::mutate() has been for some time:

Each column name used in a single mutate() must appear either only on the left-hand side of a single assignment, or only on the right-hand side of any number of assignments (but never on both sides, even across different assignments).

Under this rule neither of the above mutate() calls is allowed. The second should be written as (switching to pipe notation):

flights %>%
  mutate(gain = arr_delay - dep_delay,
         hours = air_time / 60) %>%
  mutate(gain_per_hour = gain / hours) %>%
  mutate(gain_per_minute = 60 * gain_per_hour)

And the above works.

If we teach this rule we can train users to be properly cautious, and hopefully keep them from becoming frustrated, scared, anxious, or angry.

dplyr documentation (such as “help(mutate)”) does not strongly commit to the order in which mutate expressions are executed, or to the visibility and durability of intermediate results (i.e., a full description of the intended semantics). Our rule intentionally limits the user to a set of circumstances where none of those questions matter.

Now the error we saw above is a mere bug that one expects will be fixed some day (in fact it is dplyr issue 3095; we looked a bit at the generated queries here). It can be a bit unfair to criticize a package for having a bug.

However, confusion around re-use of column names has been driving dplyr issues for quite some time.

It makes sense to work in a reliable and teachable sub-dialect of dplyr that will serve users well (or, barring that, to use an adapter such as seplyr). In production you must code to what systems have historically been reliably capable of, not just to the specification.

Please comment on the article here: Statistics – Win-Vector Blog

The post My advice on dplyr::mutate() appeared first on All About Statistics.




Source link

Because it's Friday: Blue skies or SkyNet?

By | ai, bigdata, machinelearning

I enjoyed attending the O'Reilly AI conference in San Francisco this week. There were many thought-provoking talks, but in the days since then my thoughts kept returning to one thing: incentives.

One way of training an AI involves a reward function: a simple equation that increases in value as the AI gets closer to its goal. The AI then explores possibilities in such a way that its reward is maximized. I saw one amusing video of a robot arm that was being trained to pick up a block and then extend its arm as far as possible and set the block down at the limits of its reach. The reward function for such an activity is simple: the reward value is the distance of the block from its original position. Unfortunately, the robot learned not to carefully pick up and set down the block, but instead reached back and whacked the block as hard as it possibly could, sending the block careening across the room.
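
As a toy illustration (a made-up sketch, not code from the talk), the reward for that block-moving task might be nothing more than the distance the block ends up from where it started:

# reward = how far the block moved from its starting position, so a violent
# whack that sends the block flying scores better than a careful placement
block_reward <- function(final_pos, start_pos) {
  sqrt(sum((final_pos - start_pos)^2))  # Euclidean distance moved
}

block_reward(c(3.0, 0.5), c(0, 0))  # about 3.04: the whack wins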

Artificial intelligences learn to do what we incentivize them to do. But what if those incentives end up not aligning with our interests, despite our best intentions? That's the theme of Tim O'Reilly's keynote below, and it's well worth watching.

That's all from us here at the blog for this week. We'll be back on Monday with more: have a great weekend, and see you then!


Source link

Big Data Analytics with H2O in R Exercises - Part 1

By | ai, bigdata, machinelearning

(This article was first published on R-exercises, and kindly contributed to R-bloggers)


We have dabbled with RevoScaleR before. In this exercise set we will work with H2O, another high-performance R library that can handle big data very effectively. This will be a series of exercises with increasing degrees of difficulty, so please do them in sequence.
H2O requires Java to be installed on your system, so please install Java before trying H2O. As always, check the documentation before attempting this exercise set.
Answers to the exercises are available here.
If you want to install the latest release of H2O, install it via these instructions.

Exercise 1
Download the latest stable release of H2O and initialize the cluster.

Exercise 2
Check the cluster information via h2o.clusterInfo().
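
A minimal sketch for these first two exercises, assuming the h2o package is already installed (the function names below come from the h2o R package):

library(h2o)

h2o.init()         # start, or connect to, a local H2O cluster
h2o.clusterInfo()  # report version, number of nodes, memory, and health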

Exercise 3
You can see how H2O works via the demo() function. Check H2O’s GLM via the demo method.
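
For example, the GLM demo ships with the h2o package:

demo(h2o.glm)  # walks through fitting a generalized linear model on bundled data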

Exercise 4

Download loan.csv from H2O’s GitHub repo and import it using H2O.
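
A sketch of the import step; substitute the raw GitHub URL (or a local path) for the placeholder file name below:

loan <- h2o.importFile("loan.csv")  # placeholder path: point at H2O's loan.csv
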
Exercise 5
Check the type of the imported loan data and notice that it is not a data frame; then check the summary of the loan data.
Hint: use h2o.summary()
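
A sketch of the inspection step:

class(loan)        # an "H2OFrame", not a data.frame
h2o.summary(loan)  # per-column summary, analogous to base summary()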

Exercise 6
One might want to transfer a data frame from the R environment to H2O. Use as.h2o() to convert the mtcars data frame into an H2OFrame.
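
For instance:

mtcars_hex <- as.h2o(mtcars)  # copy the in-memory data frame to the H2O cluster
class(mtcars_hex)             # now an H2OFrame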

Learn more about importing big data in the online course Data Mining with R: Go from Beginner to Advanced. In this course you will learn how to

  • work with different data import techniques,
  • know how to import data and transform it for a specific modeling or analysis goal,
  • and much more.

Exercise 7

Check the dimensions of the loan H2OFrame via h2o.dim().

Exercise 8
Find the column names of the loan H2OFrame.
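
A sketch covering both checks:

h2o.dim(loan)   # number of rows and columns
colnames(loan)  # column names of the H2OFrame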

Exercise 9

Plot a histogram of the loan amount column of the loan H2OFrame.
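
Assuming the loan amount column is named loan_amnt (check colnames(loan) first), something like:

h2o.hist(loan$loan_amnt)  # histogram computed on the cluster, plotted locally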

Exercise 10
Find the mean loan amount for each home ownership group in the loan H2OFrame.
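
One way this might look with h2o.group_by(); the column names loan_amnt and home_ownership are assumptions about loan.csv, so adjust them to the actual names:

h2o.group_by(data = loan,
             by = "home_ownership",
             mean("loan_amnt"))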


To leave a comment for the author, please follow the link and comment on their blog: R-exercises.





Source link
