Using NYC Citi Bike Data to Help Bike Enthusiasts Find their Mate

By | ai, bigdata, machinelearning

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

There is no shortage of analyses of the NYC bike share system. Most of them aim at predicting the demand for bikes and balancing bike stock, i.e., forecasting when to remove bikes from fully occupied stations and when to refill stations before the supply runs dry.

This is why I decided to take a different approach and use the Citi Bike data to help its users instead.

The Challenge

The online dating scene is complicated and unreliable: there is a discrepancy between what online daters say and what they do. Although this challenge is not relevant to me anymore – I am married – I wish that, as a bike enthusiast, I had had a platform where I could have spotted like-minded people who actually rode a bike (and did not just pretend they did).

The goal of this project was to turn the Citi Bike data into an app where a rider could identify the best spots and times to meet other Citi Bike users and cyclists in general.

The Data

As of March 31, 2016, the total number of annual subscribers was 163,865, and Citi Bike riders took an average of 38,491 rides per day in 2016 (source: Wikipedia).

This is more than 14 million rides in 2016!

I used the Citi Bike data for the month of May 2016 (approximately 1 million observations). Citi Bike provides the following variables:

  • Trip duration (in seconds).
  • Timestamps for when the trip started and ended.
  • Station locations for where the trip started and ended (both the names and coordinates).
  • Rider’s gender and birth year – this is the only demographic data we have.
  • Rider’s plan (annual subscriber, 7-day pass user or 1-day pass user).
  • Bike ID.


Riders per Age Group

Before moving ahead with building the app, I was interested in exploring the data and identifying patterns in relation to gender, age and day of the week. Answering the following questions helped identify which variables influence how riders use the Citi Bike system and form better features for the app:

  • Who are the primary users of Citi Bike?
  • What is the median age per Citi Bike station?
  • How do the days of the week impact biking behaviours?

As I expected, based on my daily rides from Queens to Manhattan, 75% of the Citi Bike trips are taken by males. The primary users are 25 to 34 years old.
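As a rough illustration of how such a breakdown could be computed, here is a minimal sketch in base R on a toy data frame. The column names (gender, birth_year), the gender coding and the age brackets are assumptions made for the example, not necessarily what the original analysis used.

```r
# Toy stand-in for the Citi Bike trip file; column names are assumptions.
trips <- data.frame(
  gender     = c(1, 1, 1, 2, 1, 2, 1, 1),  # assumed coding: 1 = male, 2 = female
  birth_year = c(1985, 1990, 1978, 1988, 1969, 1992, 1983, 1986)
)

# Share of trips by gender.
gender_share <- prop.table(table(trips$gender))

# Age at the time of the ride (May 2016), bucketed into ten-year groups.
trips$age <- 2016 - trips$birth_year
trips$age_group <- cut(
  trips$age,
  breaks = c(15, 24, 34, 44, 54, 64, Inf),
  labels = c("16-24", "25-34", "35-44", "45-54", "55-64", "65+")
)
age_counts <- table(trips$age_group)

print(round(gender_share, 2))
print(age_counts)
```

On the real data, the same two tables would be built from the full May 2016 file rather than from eight made-up rows.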


Riders per Age Group

Distribution of Riders per Hour of the Day (weekdays)

However, while we might expect these young professionals to be the primary users during the weekdays around 8-9am and 5-6pm (when they commute to and from work), and the older audience to take over the Citi Bike system midday, this hypothesis proved to be wrong. The tourists don’t have anything to do with it; the short term customers only represent 10% of the dataset.


Distribution of Riders per Hour of the Day (weekdays only)

Median Age per Departure Station

Looking at the median age of the riders for each station departure, we see the youngest riders in East Village, while older riders start their commute from Lower Manhattan (as shown in the map below). The age trends disappear when mapping the station arrival, above all in the financial district (in Lower Manhattan), which is populated by the young wolves of Wall Street (map not shown).

The map also confirms that the Citi Bike riders are mostly between 30 and 45 years old.
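The median age per departure station can be sketched with a single aggregate() call. The station labels and column names below are hypothetical, and the data is made up purely to show the shape of the computation.

```r
# Toy trips; start_station and birth_year are assumed column names.
trips <- data.frame(
  start_station = c("A", "A", "A", "B", "B", "C"),
  birth_year    = c(1990, 1988, 1984, 1970, 1975, 1981)
)
trips$age <- 2016 - trips$birth_year

# One median age per departure station, youngest stations first.
median_age <- aggregate(age ~ start_station, data = trips, FUN = median)
median_age <- median_age[order(median_age$age), ]
print(median_age)
```

Swapping start_station for an end-station column would give the arrival view mentioned above.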


Median Age per Departure Station

Rides by Hour of the Day

Finally, when analyzing how the days of the week impacted biking behaviours, I was surprised to see that Citi Bike users didn’t ride for a longer period of time during the weekend: the median trip duration is 19 minutes for each day of the week.


Trip Duration per Gender and Age Group

However, as illustrated below, there is a difference in peak hours; during the weekend, riders hop on a bike later during the day, with most of the rides happening midday while the peak hours during the weekdays are around 8-9am and 5-7pm when riders commute to and from work.
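A comparison of this kind could be sketched as follows: extract the hour and the day type from each start timestamp, then count rides per hour for weekdays and weekends separately. The starttime column name, the UTC time zone and the seven toy rows are assumptions for the example.

```r
# Toy trips; the column name starttime and the UTC time zone are assumptions.
trips <- data.frame(
  starttime = as.POSIXct(c(
    "2016-05-02 08:15:00", "2016-05-02 08:40:00", "2016-05-02 17:30:00",
    "2016-05-03 08:05:00", "2016-05-07 12:10:00", "2016-05-07 13:45:00",
    "2016-05-08 12:30:00"
  ), tz = "UTC")
)

trips$hour <- as.integer(format(trips$starttime, "%H"))
# %u is the ISO weekday number, 1 (Monday) to 7 (Sunday), independent of locale.
trips$day_type <- ifelse(format(trips$starttime, "%u") %in% c("6", "7"),
                         "weekend", "weekday")

# Rides per hour, one row per day type.
rides_per_hour <- table(trips$day_type, trips$hour)
print(rides_per_hour)

# Busiest hour for each day type.
peak_hour <- apply(rides_per_hour, 1,
                   function(counts) names(counts)[which.max(counts)])
print(peak_hour)
```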


Number of Riders per Hour of the Day (weekdays vs. weekends)

The App

Where does this analysis leave us?

  • The day of the week and the hour of the day are meaningful variables which we need to take into account in the app.
  • Most of the users are between 30 and 45 years. This means that the age groups 25-34 and 35-44 won’t be granular enough when app users need to filter their search. We will let them filter by age instead.

The Citi Tinder app in a few words and screenshots.

There are 3 steps to the app:

  • The “when”: find the times and days where your ideal mate is more likely to ride.


  • The “where”: once you know the best times and days, filter the locations by day of the week, time of the day, gender and age. You can also select whether you want to spot where they arrive or depart.


  • The “how”: the final step is to grab a Citi Bike and get to those hot spots. The app calls the Google Maps API to show the directions with a little extra: you can compare the time estimated by Google to connect two stations versus the average time it took Citi Bike users. I believe the latter is more accurate because it factors in the time of the day and day of the week (which the app lets you filter).
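The Citi Bike side of that comparison could look something like the sketch below: average the observed trip durations between two stations for a given day and hour. All column names, the helper average_minutes() and the Google figure are hypothetical, and the call to the Google Maps API is left out entirely.

```r
# Toy trips; every column name here is an assumption.
trips <- data.frame(
  start_station = c("A", "A", "A", "A", "B"),
  end_station   = c("B", "B", "B", "B", "A"),
  weekday       = c("Mon", "Mon", "Sat", "Mon", "Mon"),
  hour          = c(8, 8, 12, 18, 8),
  tripduration  = c(600, 720, 900, 660, 640)  # seconds
)

# Hypothetical helper: mean observed trip time, in minutes, for one
# station pair on a given day of the week and hour of the day.
average_minutes <- function(trips, from, to, day, hr) {
  sel <- trips$start_station == from & trips$end_station == to &
    trips$weekday == day & trips$hour == hr
  if (!any(sel)) return(NA_real_)  # no matching rides observed
  mean(trips$tripduration[sel]) / 60
}

citibike_estimate <- average_minutes(trips, "A", "B", "Mon", 8)
google_estimate <- 9  # made-up stand-in for the routing estimate, in minutes
print(c(citibike = citibike_estimate, google = google_estimate))
```

Returning NA when no ride matches the filters keeps sparse station pairs from producing a misleading average.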


Although screenshots are nice, the interactive app is better so head to the first step of the app to get started!

Would Have, Should Have, Could Have

This is the first of the four projects from the NYC Data Science Academy Data Science Bootcamp program. With a two-week timeline and only 24 hours in a day, some things gotta give… Below is a quick list of the analysis I could have, would have and should have done if given more time and data:

  • Limited scope : I only took the data from May 2016. However, I expect the Citi Bike riders to behave differently depending on the season, temperature, etc. Besides, the bigger the sample size the more reliable the insights are.
  • Missing data : There was no data on the docks available per station that could be scraped from the Citi Bike website. The map would have been more complete if the availability of docks had been displayed.
  • Limited number of variables : I would have liked to have more demographics data (aside from gender and age); a dating app with only the age and gender as filters is restrictive…
  • Incomplete filters : With more time, I’d have added a filter ‘speed’ in the 2nd step of the app (the ‘where’ part) to enable the hard core cyclists to filter the fastest ones…
  • Sub-optimal visualization : I am aware that the map in the introduction page (with the dots displaying the median age per station) is hard to read and with more time, I’d have used polygons instead to group by neighbourhoods.
  • Finally, I would have liked to track unique users. Although users don’t have a unique identifier in the Citi Bike dataset, I could have identified unique users by looking at their gender, age, zip and usual start/end stations.

The post Using NYC Citi Bike Data to Help Bike Enthusiasts Find their Mate appeared first on NYC Data Science Academy Blog.

To leave a comment for the author, please follow the link and comment on their blog: R – NYC Data Science Academy Blog.

Source link

Designing a reactive HTTP server with RxJava

By | ai, bigdata, machinelearning

Achieve high scalability and performance while reducing system complexity.

The C10k problem was an area of research and optimization that tried to achieve 10,000 concurrent connections on a single commodity server. Even these days, solving this engineering task with the traditional Java toolkit is a challenge. There are many reactive approaches that easily achieve C10k, and RxJava makes them very approachable. In this chapter, we explore several implementation techniques that will improve scalability by several orders of magnitude. All of them revolve around the concept of reactive programming. If you are lucky enough to work on a greenfield project, you might consider implementing your application in a reactive manner top to bottom. Such an application should never synchronously wait for any computation or action. The architecture must be entirely event-driven and asynchronous in order to avoid blocking. We will go through several examples of a simple HTTP server and observe how it behaves with respect to the design choices we made. Admittedly, performance and scalability do have a complexity price tag. But with RxJava the additional complexity will be reduced significantly.

The classic thread per connection model struggles to solve the C10k problem. With 10,000 threads we do the following:

Continue reading Designing a reactive HTTP server with RxJava.


Re-thinking Enterprise business processes using Augmented Intelligence

By | ai, bigdata, machinelearning

In the 1990s, there was a popular book called Re-engineering the Corporation. Looking back now, Re-engineering certainly has had a mixed success – but it did have an impact over the last two decades. ERP deployments led by SAP and others were a direct result of the Business Process re-engineering phenomenon.

So, now, with the rise of AI: Could we think of a new form of Re-engineering the Corporation – using Artificial Intelligence? The current group of Robotic process automation companies focus on the UI layer. We could extend this far deeper into the Enterprise. Leaving aside the discussion of  the impact of AI on jobs, this could lead to augmented intelligence at the process level for employees (and hence an opportunity for people to transition their careers in the age of AI).

Here are some initial thoughts. I am exploring these ideas in more detail. This work is also part of an AI lab we are launching in London and Berlin in partnership with UPM and Nvidia, both for Enterprises and Cities.


How would you rethink Enterprise business processes using Augmented Intelligence?

To put the basics into perspective: we consider a very ‘grassroots’ meaning of AI. AI is based on Deep Learning. Deep Learning involves automatic feature detection using the data.  You could model a range of Data types (or combination thereof) using AI:

a)      Images and sound – Convolutional neural networks

b)      Transactional – ex Loan approval

c)       Sequences: including handwriting recognition via LSTMs and recurrent neural networks

d)      Text processing – ex natural language detection

e)      Behaviour understanding – via Reinforcement learning

To extend this idea to Process engineering for Enterprises and Cities, we need to

a)      Understand existing business processes

b)      Break the process down into its components

c)       Model the process using Data and Algorithms (both Deep Learning and Machine Learning)

d)      Improve the efficiency of the process by complementing the human activity with AI (Augmented Intelligence)

But this is just the first step: you would also have to consider the wider impact of AI itself.

So, here is my list / ‘stack’:

  • New processes due to disruption at the industry level (e.g., Uber)
  • Changes of behaviour due to new processes (e.g., employees collaborating with robots as peers)
  • Improvements in current Business Processes for Enterprises: Customer services, Supply chain, Finance, Human resources, Project management, Corporate reporting, Sales and Logistics, Management
  • The GPU-enabled enterprise (e.g., Nvidia Grid) and, more broadly, the ways GPUs could democratize the delivery of modern apps, allow more efficient hybridization of workflows, and unify compute and graphics
  • The availability of bodies of labelled data
  • New forms of Communications: Text analytics, Natural language processing, Speech recognition, chatbots

I am exploring these ideas in more detail as part of my work on the Enterprise AI lab we are launching in London and Berlin in partnership with UPM and Nvidia, both for Enterprises and Cities. I welcome your comments at ajit.jaokar at or @ajitjaokar


The Return of the Bees (Le retour des abeilles)

By | ai, bigdata, machinelearning

Following my previous post, “Maudites Abeilles” (“Damned Bees”), I received comments on the blog, and also on Twitter, telling me that there was, after all, a one-in-two chance of having a blue path connecting the two blue regions, the one in the north and the one in the south.

Indeed, in the original problem the regions are colored, and the question is not whether a path exists, but whether a path of the right color exists. We only need to add one small line to specify the color of the path:

> # make_graph(), connect() and adjacent_vertices() come from the igraph package
> library(igraph)
> simu2=function(d){
+ C=sample(c(rep(1,d^2/2),rep(2,d^2/2)),size=d^2)
+ B=NULL # edge list, filled in by the loop below
+ for(i in 1:(d^2)){
+ x=rep(i,6)
+ y=c(i-d,i-d+1,i-1,i+1,i+d-1,i+d)
+ if(i%%d==1) y=c(i-d,i-d+1,NA,i+1,NA,i+d)
+ if(i%%d==0) y=c(i-d,NA,i-1,NA,i+d-1,i+d)
+ D=data.frame(x,y)
+ D=D[(D$y>=1)&(D$y<=d^2),]
+ B=rbind(B,D[which((C[D$y]==C[D$x])&(C[D$x]==1)),])
+ }
+ B=B[(B[,2]>=1)&(B[,2]<=d^2),]
+ G=as.vector(t(B))
+ G_iter=connect(make_graph(G), d^2)
+ connectpt=function(x) as.numeric(adjacent_vertices(G_iter,x)[[1]])
+ sum(unlist(Vectorize(connectpt)(1:d))%in%(d^2+1-(1:d)))>0}
> appr=function(d,nsimu) mean(Vectorize(simu2)(rep(d,nsimu)))
> appr(4,10000)
[1] 0.4993

And indeed, there is a one-in-two chance of finding a blue path connecting the two blue regions. But as I said yesterday, I struggle with the symmetry argument put forward in the “solution”.


Philly Fed: State Coincident Indexes increased in 45 states in March

By | ai, bigdata, machinelearning

From the Philly Fed:

The Federal Reserve Bank of Philadelphia has released the coincident indexes for the 50 states for March 2017. Over the past three months, the indexes increased in 45 states, decreased in three, and remained stable in two, for a three-month diffusion index of 84. In the past month, the indexes increased in 45 states and decreased in five, for a one-month diffusion index of 80.
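The diffusion index figures quoted above are consistent with the usual definition: the percentage of states where the index increased minus the percentage where it decreased. A minimal sketch of that arithmetic, assuming this definition and a hypothetical helper name diffusion_index():

```r
# Diffusion index: % of states increasing minus % of states decreasing.
# The formula is assumed from the usual definition; the helper is illustrative.
diffusion_index <- function(increased, decreased, total = 50) {
  100 * (increased - decreased) / total
}

three_month <- diffusion_index(increased = 45, decreased = 3)
one_month   <- diffusion_index(increased = 45, decreased = 5)
print(c(three_month = three_month, one_month = one_month))
```

States whose index remained stable only enter through the total, which is why the three-month figure (two stable states) is higher than the one-month figure.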

Note: These are coincident indexes constructed from state employment data. An explanation from the Philly Fed:

The coincident indexes combine four state-level indicators to summarize current economic conditions in a single statistic. The four state-level variables in each coincident index are nonfarm payroll employment, average hours worked in manufacturing, the unemployment rate, and wage and salary disbursements deflated by the consumer price index (U.S. city average). The trend for each state’s index is set to the trend of its gross domestic product (GDP), so long-term growth in the state’s index matches long-term growth in its GDP.

[Graph: Philly Fed Number of States with Increasing Activity]

This is a graph of the number of states with one-month increasing activity according to the Philly Fed. This graph includes states with minor increases (which the Philly Fed lists as unchanged).

In March, 45 states had increasing activity (including minor increases).

The downturn in 2015 and 2016, in the number of states increasing, was mostly related to the decline in oil prices.

Here is a map of the three-month change in the Philly Fed state coincident indicators. This map was all red during the worst of the recession, and is almost all green now.

Source: Philly Fed. Note: For complaints about red / green issues, please contact the Philly Fed.


I hate R, volume 38942

By | ai, bigdata, machinelearning

(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

R doesn’t allow block comments. You have to comment out each line, or you can encapsulate the block in if(0){} which is the world’s biggest hack. Grrrrr.
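For readers who have not seen the two workarounds, here is a small sketch of both. Note that code wrapped in if (FALSE) { } is still parsed, so it must remain syntactically valid R even though it is never evaluated.

```r
x <- 1

# Workaround 1: wrap the block in a condition that is never true.
if (FALSE) {
  x <- x + 100
  stop("never reached")
}

# Workaround 2: comment out every line individually.
# x <- x + 10
# x <- x * 2

print(x)
```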

The post I hate R, volume 38942 appeared first on Statistical Modeling, Causal Inference, and Social Science.

Please comment on the article here: Statistical Modeling, Causal Inference, and Social Science

The post I hate R, volume 38942 appeared first on All About Statistics.


Euler Problem 18 & 67: Maximum Path Sums

By | ai, bigdata, machinelearning

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

An example of a pedigree. Source: Wikimedia.

Euler Problems 18 and 67 are identical, except that the data set in the second is larger than in the first. In this post, I kill two Eulers with one code.

These problems deal with binary trees, data structures in which each node has at most two children. A practical example of a binary tree is a pedigree chart, where each person or animal has two parents, four grandparents and so on.

Euler Problem 18 Definition

By starting at the top of the triangle below and moving to adjacent numbers on the row below, the maximum total from top to bottom is 23.

3
7 4
2 4 6
8 5 9 3

That is, 3 + 7 + 4 + 9 = 23. Find the maximum total from top to bottom of the triangle below:

75
95 64
17 47 82
18 35 87 10
20 04 82 47 65
19 01 23 75 03 34
88 02 77 73 07 63 67
99 65 04 28 06 16 70 92
41 41 26 56 83 40 80 70 33
41 48 72 33 47 32 37 16 94 29
53 71 44 65 25 43 91 52 97 51 14
70 11 33 28 77 73 17 78 39 68 17 57
91 71 52 38 17 14 91 43 58 50 27 29 48
63 66 04 68 89 53 67 30 73 16 69 87 40 31
04 62 98 27 23 09 70 98 73 93 38 53 60 04 23

As there are only 16,384 routes, it is possible to solve this problem by trying every route. However, Problem 67, is the same challenge with a triangle containing one-hundred rows; it cannot be solved by brute force, and requires a clever method! ;o)
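To make the route count concrete: every step down the triangle is a choice between two adjacent numbers, so a triangle with n rows has 2^(n-1) routes. The sketch below checks the count and brute-forces the small four-row example. The function names and the list-of-rows representation are choices made for this illustration only.

```r
# Number of top-to-bottom routes in a triangle with n_rows rows.
n_routes <- function(n_rows) 2^(n_rows - 1)
print(n_routes(15))   # the Problem 18 triangle
print(n_routes(100))  # the Problem 67 triangle, far too many to enumerate

# Brute force over every route of a small triangle stored as a list of rows.
sample_triangle <- list(3, c(7, 4), c(2, 4, 6), c(8, 5, 9, 3))

brute_force <- function(rows) {
  n <- length(rows)
  # Each route is a sequence of n - 1 moves: 0 = stay left, 1 = go right.
  moves <- as.matrix(expand.grid(rep(list(0:1), n - 1)))
  totals <- apply(moves, 1, function(m) {
    cols <- 1 + c(0, cumsum(m))  # column visited in each row
    sum(mapply(function(row, k) row[k], rows, cols))
  })
  max(totals)
}

print(brute_force(sample_triangle))
```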


This problem seeks a maximum path sum in a binary tree. The brute force method, as indicated in the problem definition, is a very inefficient way to solve this problem. The video visualises the quest for the maximum path, which takes eleven minutes of hypnotic animation.

A more efficient method is to define the maximum path layer by layer, starting at the bottom. The maximum sum of 2+8 or 2+5 is 10, the maximum sum of 4+5 or 4+9 is 13 and the last maximum sum is 15. These numbers are now placed in the next row. This process cycles until only one number is left. This algorithm solves the sample triangle in four steps:

Step 1:

3
7 4
2 4 6
8 5 9 3

Step 2:

3
7 4
10 13 15

Step 3:

3
20 19

Step 4:

23
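The layer-by-layer reduction can also be sketched on a list of rows, printing the collapsed row after each pass. This is a variant written for illustration, with a hypothetical function name; it is not the matrix-based code from the post.

```r
# Bottom-up reduction: fold the last row into the row above until one number is left.
reduce_triangle <- function(rows) {
  while (length(rows) > 1) {
    n <- length(rows)
    bottom <- rows[[n]]
    above <- rows[[n - 1]]
    # For each entry in the row above, add the larger of its two children.
    rows[[n - 1]] <- above + pmax(bottom[-length(bottom)], bottom[-1])
    rows[[n]] <- NULL
    cat(sprintf("%d row(s) left; last row: %s\n", length(rows),
                paste(rows[[length(rows)]], collapse = " ")))
  }
  rows[[1]]
}

sample_triangle <- list(3, c(7, 4), c(2, 4, 6), c(8, 5, 9, 3))
best <- reduce_triangle(sample_triangle)
print(best)
```

Each pass touches every number once, so the work grows with the number of entries in the triangle rather than with the number of routes.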


In the code below, the data is stored in a triangular matrix. The variables rij (row) and kol (column) drive the search for the maximum path. The triangle for Euler Problem 18 is manually created and the triangle for Euler Problem 67 is read from the website.

path.sum <- function(triangle) {
    # Work upwards from the bottom row
    for (rij in nrow(triangle):2) {
        for (kol in 1:(ncol(triangle)-1)) {
            # Add the larger of the two children to the entry above them
            triangle[rij - 1,kol] <- max(triangle[rij,kol:(kol + 1)]) + triangle[rij - 1, kol]
        }
        triangle[rij,] <- NA
    }
    return(max(triangle, na.rm = TRUE))
}

# Euler Problem 18
triangle <- matrix(ncol = 15, nrow = 15)
triangle[1,1] <- 75
triangle[2,1:2] <- c(95, 64)
triangle[3,1:3] <- c(17, 47, 82)
triangle[4,1:4] <- c(18, 35, 87, 10)
triangle[5,1:5] <- c(20, 04, 82, 47, 65)
triangle[6,1:6] <- c(19, 01, 23, 75, 03, 34)
triangle[7,1:7] <- c(88, 02, 77, 73, 07, 63, 67)
triangle[8,1:8] <- c(99, 65, 04, 28, 06, 16, 70, 92)
triangle[9,1:9] <- c(41, 41, 26, 56, 83, 40, 80, 70, 33)
triangle[10,1:10] <- c(41, 48, 72, 33, 47, 32, 37, 16, 94, 29)
triangle[11,1:11] <- c(53, 71, 44, 65, 25, 43, 91, 52, 97, 51, 14)
triangle[12,1:12] <- c(70, 11, 33, 28, 77, 73, 17, 78, 39, 68, 17, 57)
triangle[13,1:13] <- c(91, 71, 52, 38, 17, 14, 91, 43, 58, 50, 27, 29, 48)
triangle[14,1:14] <- c(63, 66, 04, 68, 89, 53, 67, 30, 73, 16, 69, 87, 40, 31)
triangle[15,1:15] <- c(04, 62, 98, 27, 23, 09, 70, 98, 73, 93, 38, 53, 60, 04, 23)

answer <- path.sum(triangle)

Euler Problem 67

The solution for problem number 67 is exactly the same. The data is read directly from the Project Euler website.

# Euler Problem 67
triangle.file <- read.delim("", stringsAsFactors = F, header = F)
triangle.67 <- matrix(nrow = 100, ncol = 100)
for (i in 1:100) {
    triangle.67[i,1:i] <- as.numeric(unlist(strsplit(triangle.file[i,], " ")))
}
answer <- path.sum(triangle.67)

The post Euler Problem 18 & 67: Maximum Path Sums appeared first on The Devil is in the Data.

To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data.
