Who's downloading the forecast package?

By | machinelearning

The github page for the forecast package currently shows the following information
Note the downloads figure: 264K/month. I know the package is popular, but that seems crazy. Also, the downloads figure on github only counts the downloads from the RStudio mirror, and ignores downloads from the other 125 mirrors around the world.

Here are the top ten downloaded packages from the last month:

library(cranlogs)
cran_top_downloads(when = 'last-month')

rank  package     count   from        to
   1  zoo         308290  2015-11-09  2015-12-08
   2  forecast    263797  2015-11-09  2015-12-08
   3  Rcpp        260636  2015-11-09  2015-12-08
   4  lmtest      258810  2015-11-09  2015-12-08
   5  fpp         244989  2015-11-09  2015-12-08
   6  expsmooth   244179  2015-11-09  2015-12-08
   7  fma         243556  2015-11-09  2015-12-08
   8  tseries     243172  2015-11-09  2015-12-08
   9  stringi     199384  2015-11-09  2015-12-08
  10  ggplot2     192072  2015-11-09  2015-12-08

OK, that is very weird.

Source link

Highlights from Strata Data Conference in London 2017

By | ai, bigdata, machinelearning

Watch highlights covering data-driven business, data engineering, machine learning, and more. From Strata Data Conference in London 2017.

Experts from across the data world came together in London for Strata Data Conference. Below you’ll find links to highlights from the event.

Using AI to create new jobs

Tim O’Reilly delves into past technological transitions, speculates on the possibilities of AI, and looks at what’s keeping us from making the right choices to govern our creations.

The science of visual interactions

Miriam Redi investigates how machine learning can detect subjective properties of images and videos, such as beauty, creativity, and sentiment.

Machine learning is a moonshot for us all

Darren Strange asks: What part will we each play in what is sure to be one of the most exciting times in computer science?

What Kaggle has learned from almost a million data scientists

Anthony Goldbloom shares lessons learned from top performers in the Kaggle community and explores the types of machine-learning techniques typically used.

Another one bytes the dust

Using the music industry as an example, Paul Brook shows how modern information points bring new data that changes the way an organization will make decisions.

The data subject first?

Aurélie Pols draws a broad philosophical picture of the data ecosystem and then hones in on the right to data portability.

Real-time intelligence gives Uber the edge

M. C. Srivas covers Uber’s big data architecture and explores the real-time problems Uber needs to solve to make ride sharing smooth.

Lessons from piloting the London Office of Data Analytics

Eddie Copeland explores how the London Office of Data Analytics overcame the barriers to joining, analyzing, and acting upon public sector data at city scale.

Accelerate analytics and AI innovations with Intel

Ziya Ma outlines the challenges for applying machine learning and deep learning at scale and shares solutions that Intel has enabled for customers and partners.

Is finance ready for AI?

Aida Mehonic explores the role artificial intelligence might play in the financial world.

Peeking into the black box: Lessons from the front lines of machine-learning product launches

Grace Huang shares lessons learned from running and interpreting machine-learning experiments.


Source link

RStudio just keeps getting better

By | ai, bigdata, machinelearning

RStudio has been a life-changer for the way I work, and for how I teach data analysis. I still have a couple of minor frustrations with it, but they are slowly disappearing as RStudio adds features.
I use dual monitors and I like to code on one monitor and have the console and plots on the other monitor. Otherwise I see too little context, and long lines get wrapped making the code harder to read.

Source link

Making data analysis easier

By | machinelearning

Di Cook and I are organizing a workshop on “Making data analysis easier” for 18-19 February 2016.
We are calling it WOMBAT2016, which is an acronym for Workshop Organized by the Monash Business Analytics Team. Appropriately, it will be held at the Melbourne Zoo. Our plan is to make these workshops an annual event.
Some details are available on the workshop website. Key features are:
Hadley Wickham is our keynote speaker.

Source link

THANYA 16.5FT Aluminum Multi Purpose Telescopic Extension Ladder Tall New Foldable Duty

By | iot, machinelearning

Here comes this 16-Step Dual Joints Aluminum Stretchable Ladder! We know a ladder is a practical item in daily life, widely seen in various working fields. It can help you easily reach the height. A folding ladder must be more convenient to use. Made with high-quality aluminum material, this ladder is lightweight yet durable enough for long-term use. It is in folding style, supporting the maximum height up to 500cm. Due to foldable design, it can be easily stored with little space occupied.

1. Material: Aluminum
2. Color: Black & Silver
3. Folded Dimensions: (36.00 x 17.76)” / (91.44 x 45.11)cm (L x W)
4. Weight: 639.45oz / 18128g
5. Number of Ladder Step: 16
6. Space Between Two Rungs: 11.61″ / 29.5cm
7. Max. Load: 150kg
8. Unfolded Height: 196.85″ / 500cm
9. Unfolded Dimensions: (198.00 x 17.76)” / (502.92 x 45.11)cm (L x W)

Package Includes:
1 x Extension Ladder

High quality and in a good condition. Meticulous treatment, delicate design and wearable performance
Ultra-light aluminum ladder, durable in use
16-step ladder in folding style, can be lengthened to 500cm
It won’t take up too much space when it is folded
A practical tool for many working areas


5 ways to measure running time of R code

By | ai, bigdata, machinelearning

A reviewer asked me to report detailed running times for all (so many :scream:) performed computations in one of my papers, and so I spent a Saturday morning figuring out my favorite way to benchmark R code. This is a quick summary of the options I found to be available.

A quick online search revealed at least three R packages for benchmarking R code (rbenchmark, microbenchmark, and tictoc). Additionally, base R provides at least two methods to measure the running time of R code (Sys.time and system.time). In the following I briefly go through the syntax of using each of the five options, and present my conclusions at the end.

1. Using Sys.time

The run time of a chunk of code can be measured by taking the difference between the time at the start and at the end of the code chunk. Simple yet flexible :sunglasses:.

sleep_for_a_minute <- function() { Sys.sleep(60) }

start_time <- Sys.time()
sleep_for_a_minute()  # the code chunk being timed
end_time <- Sys.time()

end_time - start_time
# Time difference of 1.000327 mins
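
The difference of two Sys.time values is a difftime object, and R chooses its unit automatically (minutes in the output above). A minimal sketch, assuming a hypothetical helper called time_in_seconds that is not part of the original post, pins the unit to seconds so that timings of short and long chunks stay comparable:

```r
# Hypothetical helper (not from the original post): run a function and
# return the elapsed wall-clock time as a plain number of seconds.
time_in_seconds <- function(f) {
  start_time <- Sys.time()
  f()
  end_time <- Sys.time()
  # Explicit units stop R from switching between secs, mins and hours
  as.numeric(difftime(end_time, start_time, units = "secs"))
}

# A short sleep stands in for the real code chunk
elapsed <- time_in_seconds(function() Sys.sleep(0.2))
print(elapsed)
```

Returning a bare number also makes it easy to collect several timings in a vector or data frame.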

2. Library tictoc

The functions tic and toc are used in the same manner for benchmarking as the just demonstrated Sys.time. However tictoc adds a lot more convenience to the whole.

The most recent development version of tictoc can be installed from github [1]:


One can time a single code chunk:


library(tictoc)

tic("sleeping")
print("falling asleep...")
sleep_for_a_minute()
print("...waking up")
toc()
# [1] "falling asleep..."
# [1] "...waking up"
# sleeping: 60.026 sec elapsed

Or nest multiple timers:

tic("total")
tic("data generation")
X <- matrix(rnorm(50000*1000), 50000, 1000)
b <- sample(1:1000, 1000)
y <- runif(1) + X %*% b + rnorm(50000)
toc()  # stops "data generation"
tic("model fitting")
model <- lm(y ~ X)
toc()  # stops "model fitting"
toc()  # stops "total"
# data generation: 3.792 sec elapsed
# model fitting: 39.278 sec elapsed
# total: 43.071 sec elapsed
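
Nested timers behave like a stack: each toc closes the most recently opened tic. The following base R sketch illustrates that idea only. It is not how tictoc is implemented, and the names timer_stack, my_tic and my_toc are hypothetical:

```r
# Illustration only: a base R sketch of stack-based nested timers.
# my_tic / my_toc / timer_stack are hypothetical names, not tictoc's API.
timer_stack <- list()

my_tic <- function(label) {
  # Push a (label, start time) pair onto the stack
  timer_stack[[length(timer_stack) + 1]] <<- list(label = label,
                                                  start = Sys.time())
  invisible(NULL)
}

my_toc <- function() {
  # Pop the most recently started timer and report its elapsed time
  n <- length(timer_stack)
  stopifnot(n > 0)
  timer <- timer_stack[[n]]
  timer_stack[[n]] <<- NULL
  elapsed <- as.numeric(difftime(Sys.time(), timer$start, units = "secs"))
  cat(sprintf("%s: %.3f sec elapsed\n", timer$label, elapsed))
  invisible(elapsed)
}

my_tic("total")
my_tic("step 1")
Sys.sleep(0.1)
step1 <- my_toc()   # closes "step 1"
my_tic("step 2")
Sys.sleep(0.2)
step2 <- my_toc()   # closes "step 2"
total <- my_toc()   # closes "total"
```

Because the outer timer is opened first and closed last, its elapsed time covers both inner steps.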

3. Using system.time

One can time the evaluation of an R expression using system.time. For example, we can use it to measure the execution time of the function sleep_for_a_minute (defined above) as follows.

system.time({ sleep_for_a_minute() })
#   user  system elapsed
#  0.004   0.000  60.051

But what exactly are the reported times user, system, and elapsed? :confused:

Well, clearly elapsed is the wall clock time taken to execute the function sleep_for_a_minute, plus some benchmarking code wrapping it (that’s why it took slightly more than a minute to run I guess).

As for user and system times, William Dunlap has posted a great explanation to the r-help mailing list:

“User CPU time” gives the CPU time spent by the current process (i.e., the current R session) and “system CPU time” gives the CPU time spent by the kernel (the operating system) on behalf of the current process. The operating system is used for things like opening files, doing input or output, starting other processes, and looking at the system clock: operations that involve resources that many processes must share. Different operating systems will have different things done by the operating system.
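
The distinction can be seen with a small experiment. This is a minimal sketch with arbitrary illustrative workloads (a short sleep and a simple loop), not a measurement from the original post: sleeping uses wall-clock time but almost no CPU time, whereas a computation-heavy loop spends most of its elapsed time as user CPU time.

```r
# Sleeping consumes wall-clock time but almost no CPU time
t_sleep <- system.time(Sys.sleep(0.3))

# A CPU-bound loop (arbitrary illustrative workload) consumes user CPU time
t_busy <- system.time({
  total <- 0
  for (i in 1:2e6) total <- total + sqrt(i)
})

# Compare the three components for both workloads
print(t_sleep[c("user.self", "sys.self", "elapsed")])
print(t_busy[c("user.self", "sys.self", "elapsed")])
```

The exact numbers depend on the machine and operating system, so only the pattern is meaningful here.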


4. Library rbenchmark

The documentation to the function benchmark from the rbenchmark R package describes it as “a simple wrapper around system.time”. However it adds a lot of convenience compared to bare system.time calls. For example it requires just one benchmark call to time multiple replications of multiple expressions. Additionally the returned results are conveniently organized in a data frame.

I installed the development version of the rbenchmark package from github [1]:


For example purposes, let’s compare the time required to compute linear regression coefficients using three alternative computational procedures:

  1. lm,
  2. the Moore-Penrose pseudoinverse,
  3. the Moore-Penrose pseudoinverse but without explicit matrix inverses.

library(rbenchmark)

benchmark("lm" = {
            X <- matrix(rnorm(1000), 100, 10)
            y <- X %*% sample(1:10, 10) + rnorm(100)
            b <- lm(y ~ X + 0)$coef
          },
          "pseudoinverse" = {
            X <- matrix(rnorm(1000), 100, 10)
            y <- X %*% sample(1:10, 10) + rnorm(100)
            b <- solve(t(X) %*% X) %*% t(X) %*% y
          },
          "linear system" = {
            X <- matrix(rnorm(1000), 100, 10)
            y <- X %*% sample(1:10, 10) + rnorm(100)
            b <- solve(t(X) %*% X, t(X) %*% y)
          },
          replications = 1000,
          columns = c("test", "replications", "elapsed",
                      "relative", "user.self", "sys.self"))

#            test replications elapsed relative user.self sys.self
# 3 linear system         1000   0.167    1.000     0.208    0.240
# 1            lm         1000   0.930    5.569     0.952    0.212
# 2 pseudoinverse         1000   0.240    1.437     0.332    0.612

Here, the meaning of elapsed, user.self, and sys.self is the same as described above in the section about system.time, and relative is simply the time ratio with the fastest test. Interestingly lm is by far the slowest here.
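
The relative column can be reproduced by hand from the elapsed column. A short base R sketch, using the elapsed times copied from the output above and assuming the ratio is rounded to three decimals for display:

```r
# Elapsed times in seconds, copied from the benchmark output above
elapsed <- c("linear system" = 0.167, "lm" = 0.930, "pseudoinverse" = 0.240)

# relative = elapsed time divided by the fastest elapsed time
relative <- round(elapsed / min(elapsed), 3)
print(relative)
```

The fastest test gets a ratio of 1, and lm comes out at roughly 5.6 times the fastest, matching the table.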

5. Library microbenchmark

The most recent development version of microbenchmark can be installed from github:


Much like benchmark from the package rbenchmark, the function microbenchmark can be used to compare running times of multiple R code chunks. But it offers a great deal of convenience and additional functionality.

I find that one particularly nice feature of microbenchmark is the ability to automatically check the results of the benchmarked expressions with a user-specified function. This is demonstrated below, where we again compare three methods computing the coefficient vector of a linear model.


library(microbenchmark)

n <- 10000
p <- 100
X <- matrix(rnorm(n*p), n, p)
y <- X %*% rnorm(p) + rnorm(n)  # one noise term per observation

# returns TRUE if all three coefficient vectors agree up to tol
check_for_equal_coefs <- function(values) {
  tol <- 1e-12
  max_error <- max(c(abs(values[[1]] - values[[2]]),
                     abs(values[[2]] - values[[3]]),
                     abs(values[[1]] - values[[3]])))
  max_error < tol
}

mbm <- microbenchmark("lm" = { b <- lm(y ~ X + 0)$coef },
               "pseudoinverse" = {
                 b <- solve(t(X) %*% X) %*% t(X) %*% y
               },
               "linear system" = {
                 b <- solve(t(X) %*% X, t(X) %*% y)
               },
               check = check_for_equal_coefs)

# Unit: milliseconds
#           expr      min        lq      mean    median        uq      max neval cld
#             lm 96.12717 124.43298 150.72674 135.12729 188.32154 236.4910   100   c
#  pseudoinverse 26.61816  28.81151  53.32246  30.69587  80.61303 145.0489   100  b
#  linear system 16.70331  18.58778  35.14599  19.48467  22.69537 138.6660   100 a

We used the function argument check to check for equality (up to a maximal error of 1e-12) of the results returned by the three methods. If the results weren’t equal, microbenchmark would return an error message.
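
A check function receives the list of values returned by the benchmarked expressions, so it can be tested on its own without running a benchmark. The sketch below is a hypothetical alternative to check_for_equal_coefs (the name check_all_close is an assumption, not part of microbenchmark) that works for any number of expressions. Note that it relies on all.equal, which compares relative rather than absolute differences:

```r
# Hypothetical alternative check: compare every result against the first
# one, using all.equal() with a chosen (relative) tolerance.
check_all_close <- function(values, tol = 1e-8) {
  reference <- as.vector(values[[1]])
  all(vapply(values[-1], function(v) {
    isTRUE(all.equal(reference, as.vector(v), tolerance = tol,
                     check.attributes = FALSE))
  }, logical(1)))
}

# Stand-alone demonstration on a small least-squares problem
set.seed(1)
X <- matrix(rnorm(200), 50, 4)
y <- X %*% c(1, 2, 3, 4) + rnorm(50)

results <- list(
  lm_fit        = coef(lm(y ~ X + 0)),
  pseudoinverse = solve(t(X) %*% X) %*% t(X) %*% y,
  linear_system = solve(t(X) %*% X, t(X) %*% y)
)

print(check_all_close(results))                   # the three methods agree
print(check_all_close(list(c(1, 2), c(1, 2.5))))  # deliberately different
```

Under the assumption that it behaves like the check function above, it could be passed as check = check_all_close in the microbenchmark call.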

Another great feature is the integration with ggplot2 for plotting microbenchmark results.


Microbenchmark results plot


This demonstration of the different benchmarking functions is surely not exhaustive. Nevertheless, I drew some conclusions for my personal benchmarking needs:

  • The Sys.time approach as well as the tictoc package can be used for timing (potentially nested) steps of a complicated algorithm (that’s often my use case). However, tictoc is more convenient, and (most importantly) foolproof.
  • We saw that microbenchmark returns other types of measurements than benchmark, and I think that in most situations the microbenchmark measurements are of a higher practical significance :stuck_out_tongue:.
  • To my knowledge microbenchmark is the only benchmarking package that has visualizations built in :+1:.

For these reasons I will go with microbenchmark and tictoc. :bowtie:

  1. Though the repository does not seem to be very active. So the github version is probably no different from the stable release on CRAN.

Source link

Starting a career in data science

By | machinelearning

I received this email from one of my undergraduate students:
I’m writing to you asking for advice on how to start a career in Data Science. Other professions seem a bit more straight forward, in that accountants for example simply look for Internships and ways into companies from there. From my understanding, the nature of careers in data science seem to be on a project-to-project basis. I’m not sure how to get my foot stuck in the door.

Source link