By | ai, bigdata, machinelearning

(This article was originally published at Gianluca Baio’s blog, and syndicated at StatsBlogs.)

In the grand tradition of all recent election times, I’ve decided to have a go and try and build a model that could predict the results of the upcoming snap general election in the UK. I’m sure there will be many more people having a go at this, from various perspectives and using different modelling approaches. Also, I will try very hard to not spend all of my time on this and so I have set out to develop a fairly simple (although, hopefully reasonable) model.

First off: the data. I think that since the announcement of the election, the pollsters have stepped up the number of surveys; I have already found 5 national polls (two by YouGov, two by ICM and one by Opinium) $-$ there may be more, and I’m not claiming a systematic review/meta-analysis of the polls.

Arguably, this election will be mostly about Brexit: there surely will be other factors, but because this comes almost exactly a year after the referendum, it is a fair bet to suggest that how people felt and still feel about its outcome will also massively influence the election. Luckily, all the polls I have found do report data in terms of voting intention, broken up by Remain/Leave. So, I’m considering $P=8$ main political parties: Conservatives, Labour, UKIP, Liberal Democrats, SNP, Green, Plaid Cymru and “Others”. Also, for simplicity, I’m considering only England, Scotland and Wales $-$ this shouldn’t be a big problem, though, as in Northern Ireland elections are generally a “local affair”, with the mainstream parties not playing a significant role.

I also have available data on the results of both the 2015 election (by constituency and again, I’m only considering the $C=632$ constituencies in England, Scotland and Wales $-$ this leaves out the 18 Northern Irish constituencies) and the 2016 EU referendum. I had to do some work to align these two datasets, as the referendum did not use the usual geographical resolution. I have mapped the voting areas used in 2016 to the constituencies and have recorded the proportion of votes won by the $P$ parties in 2015, as well as the proportion of Remain vote in 2016.

For each observed poll $i=1,\ldots,N_{polls}$, I modelled the observed data among “$L$eavers” as $$y^{L}_{i1},\ldots,y^{L}_{iP} \sim \mbox{Multinomial}\left(\left(\pi^{L}_{1},\ldots,\pi^{L}_{P}\right),n^L_i\right).$$ Similarly, the data observed for “$R$emainers” are modelled as $$y^R_{i1},\ldots,y^R_{iP} \sim \mbox{Multinomial}\left(\left(\pi^R_{1},\ldots,\pi^R_P\right),n^R_i\right).$$
In other words, I’m assuming that within the two groups of voters, there is a vector of underlying probabilities associated with each party ($\pi^L_p$ and $\pi^R_p$) that are pooled across the polls. $n^L_i$ and $n^R_i$ are the sample sizes of each poll for $L$ and $R$.

I used a fairly standard formulation and modelled $$\pi^L_p=\frac{\phi^L_p}{\sum_{p=1}^P \phi^L_p} \qquad \mbox{and} \qquad \pi^R_p=\frac{\phi^R_p}{\sum_{p=1}^P \phi^R_p} $$ and then $$\log \phi^j_p = \alpha_p + \beta_p j$$ with $j=0,1$ to indicate $L$ and $R$, respectively. Again, using fairly standard modelling, I fix $\alpha_1=\beta_1=0$ to ensure identifiability and then model $\alpha_2,\ldots,\alpha_P \sim \mbox{Normal}(0,\sigma_\alpha)$ and $\beta_2,\ldots,\beta_P \sim \mbox{Normal}(0,\sigma_\beta)$.
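As a minimal sketch of this parameterisation (not the author's actual model code, which is not shown in the post), the probabilities can be computed from the log-scale effects as follows. The values of `alpha` and `beta` and the choice of three parties are made up purely for illustration.

```r
# Hypothetical log-scale effects for P = 3 parties.
# Party 1 is the baseline, so alpha[1] = beta[1] = 0 (identifiability).
alpha <- c(0, -0.4, -1.2)  # effects among Leavers (j = 0)
beta  <- c(0,  0.5,  0.9)  # extra effects among Remainers (j = 1)

# pi^j_p = phi^j_p / sum_p phi^j_p, with log(phi^j_p) = alpha_p + beta_p * j
party_probs <- function(alpha, beta, j) {
  phi <- exp(alpha + beta * j)
  phi / sum(phi)
}

pi_L <- party_probs(alpha, beta, j = 0)  # probabilities among Leavers
pi_R <- party_probs(alpha, beta, j = 1)  # probabilities among Remainers
```

Because the baseline is fixed at 0, every other party's probability is expressed relative to party 1, and each vector sums to 1 by construction.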

This essentially fixes the “Tory effect” to 0 (if only I could really do that!…) and then models the effect of the other parties with respect to the baseline. Negative values for $\alpha_p$ indicate that party $p\neq 1$ is less likely to grab votes among leavers than the Tories; similarly positive values for $\beta_p$ mean that party $p \neq 1$ is more popular than the Tories among remainers. In particular, I have used some informative priors by defining the standard deviations $\sigma_\alpha=\sigma_\beta=\log(1.5)$, to mean that it is unlikely to observe massive deviations (remember that $\alpha_p$ and $\beta_p$ are defined on the log scale).

I then use the estimated party- and EU result-specific probabilities to compute a “relative risk” with respect to the observed overall vote at the 2015 election $$\rho^j_p = \frac{\pi^j_p}{\pi^{15}_p},$$ which essentially estimates how much better (or worse) the parties are doing in comparison to the last election, among leavers and remainers. The reason I want these relative risks is that I can then distribute the information from the current polls and the EU referendum to each constituency $c=1,\ldots,C$ by estimating the predicted share of votes at the next election as the mixture $$\pi^{17}_{cp} = (1-\gamma_c)\pi^{15}_p\rho^L_p + \gamma_c \pi^{15}_p\rho^R_p,$$ where $\gamma_c$ is the observed proportion of remain voters in constituency $c$.
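A small numerical sketch of the relative risks and the mixture may help. All numbers below are invented, and the sketch assumes that the 2015 share entering the mixture is the constituency-level share (the post says 2015 results were recorded by constituency); that is an interpretation, not something the formula states explicitly.

```r
# Hypothetical inputs for P = 3 parties (none of these are real estimates).
pi15_nat <- c(0.40, 0.35, 0.25)  # overall 2015 vote shares
pi_L     <- c(0.55, 0.25, 0.20)  # estimated shares among Leavers
pi_R     <- c(0.30, 0.40, 0.30)  # estimated shares among Remainers

# Relative risks: how each party is doing versus 2015, by referendum vote.
rho_L <- pi_L / pi15_nat
rho_R <- pi_R / pi15_nat

# One constituency: assumed 2015 shares and observed proportion of Remain.
pi15_c  <- c(0.50, 0.30, 0.20)
gamma_c <- 0.6

# Mixture of the Leave- and Remain-adjusted 2015 shares.
pi17_c <- (1 - gamma_c) * pi15_c * rho_L + gamma_c * pi15_c * rho_R
sum(pi17_c)  # close to, but not exactly, 1
```

Under this reading the predicted shares need not add up to exactly 1 within a constituency, which is why a further normalising step is needed.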

Finally, I can simulate the next election by ensuring that in each constituency the $\pi^{17}_{cp}$ sum to 1. I do this by drawing the vote shares as $\hat{\pi}^{17}_{c1},\ldots,\hat{\pi}^{17}_{cP} \sim \mbox{Dirichlet}(\pi^{17}_{c1},\ldots,\pi^{17}_{cP})$.
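The simulation step can be sketched in base R by normalising independent Gamma draws, which is a standard way to sample from a Dirichlet distribution. The helper `rdirichlet_sketch`, the three-party share vector and the number of draws are all assumptions made for illustration; the post does not say how the draws were implemented or whether the parameters were rescaled.

```r
set.seed(1)

# Draw n vectors from a Dirichlet(a) by normalising independent Gamma(a_p, 1) draws.
rdirichlet_sketch <- function(n, a) {
  g <- matrix(rgamma(n * length(a), shape = rep(a, each = n)), nrow = n)
  g / rowSums(g)
}

# Hypothetical predicted shares for one constituency (3 parties).
pi17_c <- c(0.50, 0.29, 0.21)

draws  <- rdirichlet_sketch(1000, pi17_c)  # each row is one simulated election
winner <- apply(draws, 1, which.max)       # index of the winning party per draw

# Proportion of simulations won by each party in this constituency.
win_prob <- table(factor(winner, levels = seq_along(pi17_c))) / nrow(draws)
win_prob
```

Each simulated row sums to 1, so the draws can be read directly as vote shares, and the spread across rows is what carries the uncertainty into the seat predictions.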

In the end, for each constituency I have a distribution of election results, which I can use to determine the average outcome, as well as various measures of uncertainty. So in a nutshell, this model is all about i) re-proportioning the 2015 and 2017 votes based on the polls; and ii) propagating uncertainty in the various inputs.

I’ll update this model as more polls become available $-$ one extra issue then will be about discounting older polls (something like what Roberto did here and here, but I think I’ll keep things easy for this). For now, I’ve run my model for the 5 polls I mentioned earlier and this is the (rather depressing) result.

From the current data and the modelling assumptions, it looks like the Tories are indeed on course for a landslide victory $-$ my results are also broadly in line with other predictions (eg here). The model here may be flattering to the Lib Dems $-$ the polls seem to indicate almost unanimously that they will be doing very well in areas of a strong Remain persuasion, which means that the model predicts they will gain many seats, particularly where the 2015 election was won by a small margin (and often they leapfrog Labour to first place).

The following table shows the predicted “swings” $-$ who’s stealing votes from whom:

2015 winner (rows) by predicted 2017 winner (columns):

                         Conservative  Green  Labour  Lib Dem  PCY  SNP
Conservative                      325      0       0        5    0    0
Green                               0      1       0        0    0    0
Labour                             64      0     160        6    1    1
Liberal Democrat                    0      0       0        9    0    0
Plaid Cymru                         0      0       0        0    3    0
Scottish National Party             1      0       0        5    0   50
UKIP                                1      0       0        0    0    0
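A table of this kind can be built by cross-tabulating the 2015 winner against the predicted winner in each constituency. The sketch below uses five invented constituencies; the vectors `winner_2015` and `winner_2017` are hypothetical names, not objects from the author's code.

```r
# Hypothetical winners for five constituencies.
winner_2015 <- c("Labour", "Labour", "Conservative", "SNP", "Labour")
winner_2017 <- c("Conservative", "Labour", "Conservative", "Lib Dem", "Labour")

# Rows: party holding the seat in 2015; columns: predicted winner.
swings <- table(winner_2015, winner_2017,
                dnn = c("2015 winner", "Predicted winner"))
swings
```

Diagonal cells are seats that are held; off-diagonal cells are seats changing hands.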

Again, at the moment, it is a bad day at the office for Labour, which fails to win a single new seat while losing over 60 to the Tories, 6 to the Lib Dems, 1 to Plaid Cymru in Wales and 1 to the SNP (which would mean Labour completely erased from Scotland). UKIP is also predicted to lose its only seat $-$ but again, this seems a likely outcome.

Please comment on the article here: Gianluca Baio’s blog

The post Snap appeared first on All About Statistics.

Source link

Using NYC Citi Bike Data to Help Bike Enthusiasts Find their Mate

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

There is no shortage of analyses on the NYC bike share system. Most of them aim at predicting the demand for bikes and balancing bike stock, i.e., forecasting when to remove bikes from fully occupied stations and when to refill stations before the supply runs dry.

This is why I decided to take a different approach and use the Citi Bike data to help its users instead.

The Challenge

The online dating scene is complicated and unreliable: there is a discrepancy between what online daters say and what they do. Although this challenge is not relevant to me anymore – I am married – I wished that, as a bike enthusiast, I had a platform where I could have spotted like-minded people who did ride a bike (and not just pretend they did).

The goal of this project was to turn the Citi Bike data into an app where a rider could identify the best spots and times to meet other Citi Bike users and cyclists in general.

The Data

As of March 31, 2016, the total number of annual subscribers was 163,865, and Citi Bike riders took an average of 38,491 rides per day in 2016 (source: Wikipedia).

This is more than 14 million rides in 2016!

I used the Citi Bike data for the month of May 2016 (approximately 1 million observations). Citi Bike provides the following variables:

  • Trip duration (in seconds).
  • Timestamps for when the trip started and ended.
  • Station locations for where the trip started and ended (both the names and coordinates).
  • Rider’s gender and birth year – this is the only demographic data we have.
  • Rider’s plan (annual subscriber, 7-day pass user or 1-day pass user).
  • Bike ID.


Riders per Age Group

Before moving ahead with building the app, I was interested in exploring the data and identifying patterns in relation to gender, age and day of the week. Answering the following questions helped identify which variables influence how riders use the Citi Bike system and form better features for the app:

  • Who are the primary users of Citi Bike?
  • What is the median age per Citi Bike station?
  • How do the days of the week impact biking behaviours?

As I expected, based on my daily rides from Queens to Manhattan, 75% of the Citi Bike trips are taken by males. The primary users are 25 to 34 years old.


Riders per Age Group

Distribution of Riders per Hour of the Day (weekdays)

However, while we might expect these young professionals to be the primary users during the weekdays around 8-9am and 5-6pm (when they commute to and from work), and the older audience to take over the Citi Bike system midday, this hypothesis proved to be wrong. The tourists don’t have anything to do with it; the short term customers only represent 10% of the dataset.


Distribution of Riders per Hour of the Day (weekdays only)

Median Age per Departure Station

Looking at the median age of the riders for each departure station, we see the youngest riders in East Village, while older riders start their commute from Lower Manhattan (as shown in the map below). The age trends disappear when mapping by arrival station, above all in the financial district (in Lower Manhattan), which is populated by the young wolves of Wall Street (map not shown).
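For readers who want to reproduce this kind of summary, here is a minimal base R sketch of the median age per departure station. The data frame, its column names (`start_station`, `birth_year`) and the station labels are invented for illustration; the real Citi Bike export uses its own column names.

```r
# Toy trips table; in the real data each row is one trip.
trips <- data.frame(
  start_station = c("Station A", "Station A", "Station A", "Station B", "Station B"),
  birth_year    = c(1990, 1988, 1992, 1970, 1980)
)

# Age at the time of the ride (the data are from May 2016).
trips$age <- 2016 - trips$birth_year

# Median rider age for each departure station.
median_age <- aggregate(age ~ start_station, data = trips, FUN = median)
median_age
```

The same call with an arrival-station column in place of `start_station` would give the arrival-side view discussed above.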

The map also confirms that the Citi Bike riders are mostly between 30 and 45 years old.


Median Age per Departure Station

Rides by Hour of the Day

Finally, when analyzing how the days of the week impacted biking behaviours, I was surprised to see that Citi Bike users didn’t ride for a longer period of time during the weekend: the median trip duration is 19 minutes for each day of the week.


Trip Duration per Gender and Age Group

However, as illustrated below, there is a difference in peak hours; during the weekend, riders hop on a bike later during the day, with most of the rides happening midday while the peak hours during the weekdays are around 8-9am and 5-7pm when riders commute to and from work.
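The weekday versus weekend comparison boils down to counting rides per starting hour within each type of day. A rough sketch in base R, using six made-up start times from May 2016 (the vector `start_time` is a stand-in for the trip start timestamps):

```r
# Toy start times: three weekday rides and three weekend rides.
start_time <- as.POSIXct(
  c("2016-05-02 08:15:00", "2016-05-02 08:40:00", "2016-05-03 17:30:00",
    "2016-05-07 12:10:00", "2016-05-08 13:05:00", "2016-05-08 12:45:00"),
  tz = "UTC"
)

lt <- as.POSIXlt(start_time, tz = "UTC")

# wday is 0 for Sunday and 6 for Saturday.
day_type <- ifelse(lt$wday %in% c(0, 6), "weekend", "weekday")

# Number of rides per hour of the day, split by weekday/weekend.
rides <- table(day_type, hour = lt$hour)

# Busiest starting hour for each type of day.
peak <- apply(rides, 1, function(r) names(r)[which.max(r)])
peak
```

Using `wday` rather than day names keeps the classification independent of the machine's locale.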


Number of Riders per Hour of the Day (weekdays vs. weekends)

The App

Where does this analysis leave us?

  • The day of the week and the hour of the day are meaningful variables which we need to take into account in the app.
  • Most of the users are between 30 and 45 years old. This means that the age groups 25-34 and 35-44 won’t be granular enough when app users need to filter their search. We will let them filter by age instead.

The Citi Tinder app in a few words and screenshots.

There are 3 steps to the app:

  • The “when”: find the times and days where your ideal mate is more likely to ride.


  • The “where”: once you know the best times and days, filter the locations by day of the week, time of the day, gender and age. You can also select whether you want to spot where they arrive or depart.


  • The “how”: the final step is to grab a Citi Bike and get to those hot spots. The app calls the Google Maps API to show the directions with a little extra: you can compare the time estimated by Google to connect two stations versus the average time it took Citi Bike users. I believe the latter is more accurate because it factors in the time of the day and day of the week (which the app lets you filter).
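The comparison in the last step relies on an average observed trip time between two stations, restricted to the chosen hours and days. The sketch below shows one way this could be computed; the function `avg_trip_minutes`, the column names and the Google estimate are all hypothetical, and no call to the Google Maps API is made here.

```r
# Toy trips table with invented columns.
trips <- data.frame(
  start_station = c("A", "A", "A", "A", "B"),
  end_station   = c("B", "B", "B", "B", "A"),
  hour          = c(8, 8, 8, 17, 8),
  weekday       = c("Mon", "Mon", "Tue", "Mon", "Mon"),
  duration_sec  = c(600, 720, 660, 900, 500)
)

# Average observed trip time (in minutes) between two stations,
# restricted to the selected hours and days of the week.
avg_trip_minutes <- function(trips, from, to, hours, days) {
  sel <- trips$start_station == from & trips$end_station == to &
    trips$hour %in% hours & trips$weekday %in% days
  if (!any(sel)) return(NA_real_)  # no matching trips for this filter
  mean(trips$duration_sec[sel]) / 60
}

rider_avg <- avg_trip_minutes(trips, "A", "B", hours = 8, days = c("Mon", "Tue"))

google_estimate_min <- 9  # hypothetical value; the app would get this from Google
rider_avg - google_estimate_min
```

Returning `NA` when no trip matches the filter avoids reporting an average based on nothing.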


Although screenshots are nice, the interactive app is better so head to the first step of the app to get started!

Would Have, Should Have, Could Have

This is the first of the four projects from the NYC Data Science Academy Data Science Bootcamp program. With a two-week timeline and only 24 hours in a day, some things gotta give… Below is a quick list of the analyses I could have, would have and should have done if given more time and data:

  • Limited scope : I only took the data from May 2016. However, I expect the Citi Bike riders to behave differently depending on the season, temperature, etc. Besides, the bigger the sample size the more reliable the insights are.
  • Missing data : There was no data on the docks available per station that could be scraped from the Citi Bike website. The map would have been more complete if the availability of docks had been displayed.
  • Limited number of variables : I would have liked to have more demographics data (aside from gender and age); a dating app with only the age and gender as filters is restrictive…
  • Incomplete filters : With more time, I’d have added a filter ‘speed’ in the 2nd step of the app (the ‘where’ part) to enable the hard core cyclists to filter the fastest ones…
  • Sub-optimal visualization : I am aware that the map in the introduction page (with the dots displaying the median age per station) is hard to read and with more time, I’d have used polygons instead to group by neighbourhoods.
  • Finally, I would have liked to track unique users. Although users don’t have a unique identifier in the Citi Bike dataset, I could have identified unique users by looking at their gender, age, zip and usual start/end stations.

The post Using NYC Citi Bike Data to Help Bike Enthusiasts Find their Mate appeared first on NYC Data Science Academy Blog.

To leave a comment for the author, please follow the link and comment on their blog: R – NYC Data Science Academy Blog.

Designing a reactive HTTP server with RxJava

Achieve high scalability and performance while reducing system complexity.

The C10k problem was an area of research and optimization that tried to achieve 10,000 concurrent connections on a single commodity server.
Even these days, solving this engineering task with the traditional Java toolkit is a challenge. There are many reactive approaches that easily achieve C10k, and RxJava makes them very approachable. In this chapter, we explore several implementation techniques that will improve scalability by several orders of magnitude. All of them revolve around the concept of reactive programming. If you are lucky enough to work on a greenfield project, you might consider implementing your application in a reactive manner top to bottom. Such an application should never synchronously wait for any computation or action. The architecture must be entirely event-driven and asynchronous in order to avoid blocking. We will go through several examples of a simple HTTP server and observe how it behaves with respect to the design choices we made. Admittedly, performance and scalability do have a complexity price tag. But with RxJava, the additional complexity will be reduced significantly.

The classic thread per connection model struggles to solve the C10k problem. With 10,000 threads we do the following:

Continue reading Designing a reactive HTTP server with RxJava.

Re-thinking Enterprise business processes using Augmented Intelligence

In the 1990s, there was a popular book called Re-engineering the Corporation. Looking back now, Re-engineering certainly has had a mixed success – but it did have an impact over the last two decades. ERP deployments led by SAP and others were a direct result of the Business Process re-engineering phenomenon.

So, now, with the rise of AI: Could we think of a new form of Re-engineering the Corporation – using Artificial Intelligence? The current group of Robotic process automation companies focus on the UI layer. We could extend this far deeper into the Enterprise. Leaving aside the discussion of  the impact of AI on jobs, this could lead to augmented intelligence at the process level for employees (and hence an opportunity for people to transition their careers in the age of AI).

Here are some initial thoughts. I am exploring these ideas in more detail. This work is also a part of an AI lab we are launching in London and Berlin in partnership with UPM and Nvidia both for Enterprises and Cities

How would you rethink Enterprise business processes using Augmented Intelligence?

To put the basics into perspective: we consider a very ‘grassroots’ meaning of AI. AI is based on Deep Learning. Deep Learning involves automatic feature detection using the data.  You could model a range of Data types (or combination thereof) using AI:

a)      Images and sound – Convolutional neural networks

b)      Transactional – ex Loan approval

c)       Sequences: including handwriting recognition via LSTMs and recurrent neural networks

d)      Text processing – ex natural language detection

e)      Behaviour understanding – via Reinforcement learning

To extend this idea to Process engineering for Enterprises and Cities, we need to

a)      Understand existing business processes

b)      Break the process down into its components

c)       Model the process using Data and Algorithms (both Deep Learning and Machine Learning)

d)      Improve the efficiency of the process by complementing the human activity with AI (Augmented intelligence)

But this is just the first step: you would have to consider the wider impact of AI itself.

So, here is my list / ‘stack’:

  • New processes due to disruption at the industry level (ex Uber)
  • Change of behaviour due to new processes (ex: employees collaborating with Robots as peers)
  • Improvements in current Business Processes for Enterprises: Customer services, Supply chain, Finance, Human resources, Project management, Corporate reporting, Sales and Logistics, Management
  • The GPU-enabled enterprise (ex Nvidia Grid), but more broadly: GPUs will democratize delivery of modern apps, enable more efficient hybridization of workflows, and unify compute and graphics
  • The availability of bodies of labelled data
  • New forms of Communications: Text analytics, Natural language processing, Speech recognition, chatbots

I am exploring these ideas in more detail as part of my work on the Enterprise AI lab we are launching in London and Berlin in partnership with UPM and Nvidia, both for Enterprises and Cities. I welcome your comments at ajit.jaokar at or @ajitjaokar

The Return of the Bees

Following my previous post, “Maudites Abeilles” (“Damned Bees”), I received comments on the blog, and also on Twitter, telling me that, despite everything, there was indeed a one-in-two chance of having a blue path connecting the two blue regions, the one to the north and the one to the south.

Indeed, in the original problem the regions are coloured, and the question is not whether a path exists, but whether a path of the right colour exists. We just need to add one small condition to specify the colour of the path

> library(igraph)
> simu2=function(d){
+ # randomly colour the d^2 cells, half with colour 1 and half with colour 2
+ C=sample(c(rep(1,d^2/2),rep(2,d^2/2)),size=d^2)
+ B=NULL
+ for(i in 1:(d^2)){
+ # the (up to) six neighbours of cell i on the hexagonal grid
+ x=rep(i,6)
+ y=c(i-d,i-d+1,i-1,i+1,i+d-1,i+d)
+ if(i%%d==1) y=c(i-d,i-d+1,NA,i+1,NA,i+d)
+ if(i%%d==0) y=c(i-d,NA,i-1,NA,i+d-1,i+d)
+ D=data.frame(x,y)
+ D=D[(D$y>=1)&(D$y<=d^2),]
+ # keep only edges joining two cells of colour 1 (the colour of the path)
+ B=rbind(B,D[which((C[D$y]==C[D$x])&(C[D$x]==1)),])
+ }
+ B=B[(B[,2]>=1)&(B[,2]<=d^2),]
+ G=as.vector(t(B))
+ G_iter=connect(make_graph(G), d^2)
+ connectpt=function(x) as.numeric(adjacent_vertices(G_iter,x)[[1]])
+ # is some cell of the first row connected to some cell of the last row?
+ sum(unlist(Vectorize(connectpt)(1:d))%in%(d^2+1-(1:d)))>0}
> appr=function(d,nsimu) mean(Vectorize(simu2)(rep(d,nsimu)))
> appr(4,10000)
[1] 0.4993

and indeed, there is a one-in-two chance of finding a blue path connecting the two blue regions. But as I said yesterday, I have trouble with the symmetry argument invoked in the “solution”.

Philly Fed: State Coincident Indexes increased in 45 states in March

From the Philly Fed:

The Federal Reserve Bank of Philadelphia has released the coincident indexes for the 50 states for March 2017. Over the past three months, the indexes increased in 45 states, decreased in three, and remained stable in two, for a three-month diffusion index of 84. In the past month, the indexes increased in 45 states and decreased in five, for a one-month diffusion index of 80.
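The diffusion index figures quoted above are consistent with the usual definition: the percentage of states with an increasing index minus the percentage with a decreasing index. A quick check in R (the function name `diffusion_index` is just a label chosen here):

```r
# Percentage of states increasing minus percentage decreasing.
diffusion_index <- function(increased, decreased, n_states = 50) {
  100 * (increased - decreased) / n_states
}

three_month <- diffusion_index(increased = 45, decreased = 3)  # 84
one_month   <- diffusion_index(increased = 45, decreased = 5)  # 80
```

States whose index was unchanged count in the denominator but contribute to neither percentage.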

Note: These are coincident indexes constructed from state employment data. An explanation from the Philly Fed:

The coincident indexes combine four state-level indicators to summarize current economic conditions in a single statistic. The four state-level variables in each coincident index are nonfarm payroll employment, average hours worked in manufacturing, the unemployment rate, and wage and salary disbursements deflated by the consumer price index (U.S. city average). The trend for each state’s index is set to the trend of its gross domestic product (GDP), so long-term growth in the state’s index matches long-term growth in its GDP.

Philly Fed Number of States with Increasing Activity

This is a graph of the number of states with one-month increasing activity according to the Philly Fed. This graph includes states with minor increases (which the Philly Fed lists as unchanged).

In March, 45 states had increasing activity (including minor increases).

The downturn in 2015 and 2016, in the number of states increasing, was mostly related to the decline in oil prices.

Here is a map of the three-month change in the Philly Fed state coincident indicators. This map was all red during the worst of the recession, and is almost all green now.

Source: Philly Fed. Note: For complaints about red / green issues, please contact the Philly Fed.

I hate R, volume 38942

(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

R doesn’t allow block comments. You have to comment out each line, or you can encapsulate the block in if(0){} which is the world’s biggest hack. Grrrrr.
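For anyone who has not seen the hack, here it is in miniature, written with `FALSE` instead of `0` (the two behave the same in this context). The variable `x` is only there to show that the wrapped block has no effect.

```r
x <- 1

# "Block comment" hack: the braces are parsed but never evaluated.
if (FALSE) {
  x <- x + 100
  stop("this line is never reached")
}

x  # still 1
```

One reason it feels like a hack: unlike a real comment, the disabled block still has to be syntactically valid R, otherwise the whole script fails to parse.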

The post I hate R, volume 38942 appeared first on Statistical Modeling, Causal Inference, and Social Science.

Please comment on the article here: Statistical Modeling, Causal Inference, and Social Science

The post I hate R, volume 38942 appeared first on All About Statistics.
