a secretary problem with maximum ability

By | ai, bigdata, machinelearning

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

The Riddler of today has a secretary problem, where one measures sequentially N random variables until one deems the current variable to be the largest of the whole sample. The classical secretary problem has a counter-intuitive solution: one first observes N/e random variables without taking any decision and then, and only then, picks the first subsequent outcome larger than the largest of that first group. The added information in the current riddle is that the iid random variables are uniform on {1,…,M}, which begs for a modification of the algorithm, as for instance stopping immediately when observing M on the current draw.

The approach I devised is clearly suboptimal, as I decided to pick the currently observed value if the (conditional) probability that it is the largest of the whole sample is larger than the probability that one of the subsequent draws exceeds it. This translates into the following R code:

M=100 #maximum value
N=10  #total number of draws
# m is the sequence of draws so far, n the current index
# (reconstruction, as the original snippet is truncated): accept m[n] if it
# is the running maximum and more likely to survive the N-n remaining draws
# than to be beaten by one of them
if ((m[n]==max(m[1:n]))&((m[n]/M)^(N-n)>.5)) take=TRUE

which produces a winning rate of around 62% when N=10 and M=100, hence much better than the expected performance of the classical secretary algorithm, whose winning frequency is 1/e.
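For illustration, the rule above can be checked by simulation; here is a quick Monte Carlo sketch in Python (standing in for the blog's R), where the acceptance threshold (m[n]/M)^(N-n) > 1/2 is my reconstruction of the truncated code rather than the author's exact rule:

```python
import random

def play(N=10, M=100, rng=random):
    """One round: accept draw n if it is the running maximum and more
    likely than not to survive the N-n remaining draws; otherwise keep
    going, and take the last draw if nothing was accepted before."""
    m = [rng.randint(1, M) for _ in range(N)]
    for n in range(1, N + 1):
        last = (n == N)
        if last or (m[n-1] == max(m[:n]) and (m[n-1] / M) ** (N - n) > 0.5):
            # win if the accepted draw is the overall maximum
            return m[n-1] == max(m)

rng = random.Random(42)
wins = sum(play(rng=rng) for _ in range(100_000))
print(wins / 100_000)  # around 0.6 under this reconstruction
```

The winning frequency lands in the same ballpark as the 62% reported in the post, though the exact figure depends on the precise acceptance rule.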

Filed under: Kids, R Tagged: mathematical puzzle, R, secretary problem, stopping rule, The Riddler


Source link

Four short links: 27 April 2017


Open Source Mail Delivery, Superhuman AI, Probabilistic Graphical Models, and Golden Ages

  1. Postal: a fully featured open source mail delivery platform for incoming & outgoing e-mail, like SendGrid but open source. I enjoyed this comment on Hacker News, where the commenter talks about turning a $1K/mo mail bill into $4/mo by running their own mail infrastructure. (Downside: you would need to get yourself familiar with SMTP, postfix, SPF/DKIM, mx-validation, blacklists, etc. And by “familiar,” I mean “learn it to the core.”)
  2. The Myth of a Superhuman AI (Kevin Kelly) — he makes a good argument that buried in this scenario of a takeover of superhuman artificial intelligence are five assumptions that, when examined closely, are not based on any evidence. These claims might be true in the future, but there is no evidence to date to support them.
  3. Probabilistic Graphical Models — CS228 course notes turned into a concise introductory course […]. This course starts by introducing probabilistic graphical models from the very basics and concludes by explaining from first principles the variational auto-encoder, an important probabilistic model that is also one of the most influential recent results in deep learning.
  4. Watch It While It Lasts: Our Golden Age of Television: “The Parisian golden age [of art] emerged out of the collapse of a system that penalized artistic innovation. For most of the 19th century, the Académie des Beaux-Arts, a state-sanctioned institution, dominated the production and consumption of French art. A jury of academicians decided which paintings were exhibited at the Salon, the main forum for collectors to view new work. The academy set strict rules on artistic expression, and preferred idealized scenes from classical mythology to anything resembling contemporary life. For the most part, the art that resulted was staid and predictable, painted by skilled but anonymous technicians.” It sure doesn’t feel like we’re in a golden age of technology innovation, and I sure recognize a lot of the VC horde mentality in the Académie description.



Forecasting Skills Applied To Trade Areas


If you ever want to get the blood flowing, sit down in a meeting with your Real Estate Team and your Finance Team and a handful of Executives looking to either open a new store or close an existing store.

Do you remember that scene in Moneyball when the scouts are all talking about the intangibles regarding a potential player and the Brad Pitt character appears to be having intestinal cramps? That’s what a lot of these meetings are like. Of course nobody is spittin’ tobacco juice into plastic cups. But you get the picture. A Real Estate Director thinks that Penn Square Mall has the “potential” to flourish given the changes that are happening next door at J. Jill. Somebody in Finance is completely against the mall … “Oklahoma City is not an aspirational market” … even though the Finance Director has never been to Oklahoma. An EVP responsible for stores doesn’t want to get hit over the head by the Board of Directors for opening yet another store that never meets the sales projections authored by the Real Estate Team.

The big topic in 2017 is store closures. Y’all recall my omnichannel arguments of 2010 – 2014 … arguments suggesting that store closures were the logical outcome of a strategy of creating channel sameness in an effort to compete against Amazon. Turns out the customer doesn’t “demand” a one-brand approach to channels. Turns out the customer demands more from Amazon!

There’s no better place to apply your forecasting chops than in forecasting what happens to a trade area when a store closes. It’s work that is nearly impossible to get right. Each market behaves just a bit differently from any comparable market. If you are off by 10% or 15%, you might close a store that is actually profitable.

Next week, we’ll talk about the approach I use. The approach is not fundamentally different than the approach smart catalogers used 5-10 years ago to dramatically scale back unprofitable pages to online buyers. Most important – the approach is FUN! You get to see outcomes that are uniquely different than you’d expect. And you’ll be armed with the ammo to be like Brad Pitt heading into a meeting with scouts on Moneyball!!

Forecasting outcomes are the sum of all analytics and marketing knowledge possessed by your company.


Visualizing the Results of the First Round


Just a few lines of code to visualize the results of the first round of the French presidential election. The idea is to draw a fairly minimalist map, with circles centered on the centroids of the departments. We start by retrieving the data for the base map, a 7z file from the IGN website.

download.file("$GEOFLA_2-2_DEPARTEMENT_SHP_LAMB93_FXX_2016-06-28/file/GEOFLA_2-2_DEPARTEMENT_SHP_LAMB93_FXX_2016-06-28.7z",destfile = "dpt.7z")

This file contains information about the centroids

points(departements@data$X_CENTROID,departements@data$Y_CENTROID,pch=19,col="red")

Since this does not work very well, we will redo it by hand, for instance for Ille-et-Vilaine,

pos=which(departements@data[,"CODE_DEPT"]==35)
Poly_35=departements[pos,]
plot(departements)
departements@data[pos,c("X_CENTROID","Y_CENTROID")]
points(departements@data[pos,c("X_CENTROID","Y_CENTROID")],pch=19,col="red")

Since this works better, we will use these centroids. (The leftover fragment ",byid=TRUE))" suggests they were obtained with gCentroid(departements,byid=TRUE) from the rgeos package.)

Now we need the election results, by department. We can scrape the website of the French Ministry of the Interior. There is one page per department, so it is easy to loop over them. However, the url contains the region code. Being a bit lazy, instead of building a lookup table, we try every region code until it works. The idea is to fetch the number of votes obtained by one of the candidates.

candidat="M. Emmanuel MACRON"
# reconstruction of the garbled original (base_url and vect_reg, the vector
# of candidate region codes, are assumed defined elsewhere)
voix=function(no=35){
  testurl=FALSE; i=1
  while(!testurl & i<=length(vect_reg)){
    reg=vect_reg[i]
    nodpt=paste("0",no,sep="")
    if(as.numeric(no)<10) nodpt=paste("00",no,sep="")
    url=paste(base_url,reg,nodpt,"index.html",sep="/")
    test=try(htmlParse(url),silent=TRUE)
    if(!inherits(test,"try-error")) testurl=TRUE
    i=i+1}
  tab=readHTMLTable(url)[[2]]
  nb=tab[tab[,1]==candidat,"Voix"]
  as.numeric(paste(unlist(strsplit(as.character(nb)," ")),collapse=""))}

We can then test

> voix(35)
[1] 84648

Since this seems to work, we do it for all the departments

nb_voix=Vectorize(voix)(departements@data$CODE_DEPT) # reconstruction: the original line is garbled

We can then visualize it on a map.
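To give the flavor of that final map, here is a minimal sketch in Python (the post itself uses R), with hypothetical centroid coordinates and vote counts standing in for the GEOFLA centroids and the scraped results:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

# (x, y) Lambert-93-style centroids and vote counts for three
# hypothetical departments (illustrative values only)
centroids = np.array([[351_000, 6_790_000], [652_000, 6_860_000], [842_000, 6_520_000]])
votes = np.array([84_648, 120_000, 45_000])

fig, ax = plt.subplots(figsize=(6, 6))
# circle area proportional to the vote count
radii = 30_000 * np.sqrt(votes / votes.max())
for (x, y), r in zip(centroids, radii):
    ax.add_patch(plt.Circle((x, y), r, color="steelblue", alpha=0.6))
ax.set_xlim(200_000, 1_000_000)
ax.set_ylim(6_300_000, 7_000_000)
ax.set_aspect("equal")
ax.axis("off")
fig.canvas.draw()  # render in memory
```

Scaling the radius by the square root of the votes keeps circle area, not diameter, proportional to the count, which is the usual convention for such proportional-symbol maps.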


And we can also try another candidate,

candidat="Mme Marine LE PEN"

and we obtain the following map



Friday: GDP, Chicago PMI


From the Atlanta Fed: GDPNow

The final GDPNow model forecast for real GDP growth (seasonally adjusted annual rate) in the first quarter of 2017 is 0.2 percent on April 27, down from 0.5 percent on April 18. The forecast of first-quarter real consumer spending growth fell from 0.3 percent to 0.1 percent after yesterday’s annual retail trade revision by the U.S. Census Bureau. The forecast of the contribution of inventory investment to first-quarter growth declined from -0.76 percentage points to -1.11 percentage points after this morning’s advance reports on durable manufacturing and wholesale and retail inventories from the Census Bureau. The forecast of real equipment investment growth increased from 5.5 percent to 6.6 percent after the durable manufacturing report and the incorporation of previously published data on light truck sales to businesses from the U.S. Bureau of Economic Analysis.
(emphasis added)

From the NY Fed Nowcasting Report

The FRBNY Staff Nowcast stands at 2.7% for 2017:Q1 and 2.1% for 2017:Q2.

Mixed news from this week’s data releases left the nowcast for Q1 and Q2 essentially unchanged.

• At 8:30 AM ET, Gross Domestic Product, 1st quarter 2017 (Advance estimate). The consensus is that real GDP increased 1.1% annualized in Q1.

• At 9:45 AM, Chicago Purchasing Managers Index for April. The consensus is for a reading of 56.5, down from 57.7 in March.

• At 10:00 AM, University of Michigan’s Consumer sentiment index (final for April). The consensus is for a reading of 98.0, unchanged from the preliminary reading of 98.0.


There is no New Thing under the Sun – Yes and No


Twitter reminded me that there’s #NTTS2017 going on, Eurostat’s biennial scientific conference on New Techniques and Technologies for Statistics (NTTS).

The opening session also focused on official statistics and its current and future role in a world of data deluge and alt-facts. What will official statistics be in 30 years?
In Diego Kuonen’s presentation and discussion on ‘Big Data, Data Science, Machine Intelligence and Learning’ I could hear an answer to this question that reminded me of a text in the Bible: “… that [thing] which is done is that which shall be done: and there is no new thing under the sun”.
And this is to be understood not in a static but in a dynamic sense:
The work statistical institutions are doing today is the same work they will do tomorrow … BUT adapted to the changing context.
The algorithms (understood in the broad sense of ‘a set of rules that precisely defines a sequence of operations’) used in collecting, analyzing and disseminating data will change, and manual work will / must be replaced by automation and robots. But the core role of being a trusted source of data-based and (in all operations) transparently produced information serving professional decision making will remain.
The challenge will be that these institutions
– are known,
– are noted for their veracity,
– are consulted,
and with all this can play their role.
In this fight to be heard, humans will always play a decisive part.
That’s a clear message (as I understood it) of a data scientist looking ahead.
PS. A step towards automation consists of preparing and using linked data. See the NTTS 2017 satellite session “Hands-on workshop on Linked Open Statistical Data (LOD)”

Filed under: 09 Stat.Office / Organization Tagged: algorithm, automation, official statistics


Learning to Paint The Mona Lisa With Neural Networks


Can we recover an image by learning a deep regression map from pixels (x,y) to colors (r,g,b)?

Yes, we can.

The idea is to use a deep learning (DL) solution to do a deep regression, learning a mapping between pixel locations and RGB colors, with the goal of generating an image one pixel at a time. This means that if the dimensions of the target image are X-by-Y, the network must be run X*Y times; a 100-by-100 image yields 10,000 pixel samples.
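On a toy array, the pixel-to-color dataset described above looks like this (a small Python sketch, not the post's own code):

```python
import numpy as np

# toy 2x3 RGB "image"
im = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)

# one sample per pixel: location in, color out
X = [(i, j) for i in range(im.shape[0]) for j in range(im.shape[1])]
Y = [im[i, j] for i, j in X]

print(len(X))      # 6 samples = 2*3 pixels
print(X[4], Y[4])  # (1, 1) [12 13 14]
```

The regression task is then simply X → Y, one (row, column) pair to one (r, g, b) triple.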

Keras with TensorFlow as a backend provides a working model for this project. Pythonistas rejoice: the Keras functional API is a great abstraction over Theano and TensorFlow (Keras could support new DL frameworks later) to help define complex models, such as multi-output models, directed acyclic graphs, or models with shared layers.

The Sequential model is probably a better choice to implement such a network, but it helps to start with something surprisingly simple.

Using the Model class:

  • A layer instance is callable (on a tensor), and it returns a tensor.
  • Input tensor(s) and output tensor(s) can then be used to define a Model.
  • Such a model can be trained just like Keras Sequential models.

However, first, you need to be able to run TensorFlow on your computer.

Install Docker and run Tensorflow Notebook image on your machine

The best way to run TensorFlow is to use a Docker container. There is full documentation on installing Docker, but in a few words, the steps are:

  • Go to the Docker website in your browser.
  • Step one of the instructions sends you to download Docker.
  • Run that downloaded file to install Docker.
  • At the end of the install process, a whale icon in the top status bar indicates that Docker is running and accessible from a terminal.
  • Click the whale to get Preferences and other options.
  • Open a command-line terminal and run a few Docker commands to verify that Docker is working as expected; for example, docker version checks that you have the latest release installed.
  • Once Docker is installed, you can download the image which allows you to run TensorFlow on your computer.
  • In a terminal run: docker pull 3blades/tensorflow-notebook
  • MacOS & Linux: run the deep learning image on your system: docker run -it -p 8888:8888 -p 6006:6006 -v /$(pwd):/notebooks 3blades/tensorflow-notebook
  • Windows: run the deep learning image on your system: docker run -it -p 8888:8888 -p 6006:6006 -v C:/your/folder:/notebooks 3blades/tensorflow-notebook
  • Once you have completed these steps, you can check the installation by starting your web browser and opening this URL: http://localhost:8888.

We are now ready to paint the Mona Lisa using Deep Regression from pixels to RGB.

Let’s get started!

The famous Mona Lisa painting is our target image:

import matplotlib.image as mpimg
import matplotlib.pylab as plt
import numpy as np
%matplotlib inline

im = mpimg.imread("data/monalisa.jpg")


Our training dataset will be composed of pixel locations as input and pixel values as output:

X_train = []
Y_train = []
for i in range(im.shape[0]):
    for j in range(im.shape[1]):
        X_train.append([i, j])   # pixel location
        Y_train.append(im[i, j]) # pixel color (this loop body is missing in the original snippet)
X_train = np.array(X_train)
Y_train = np.array(Y_train)
print 'Samples:', X_train.shape[0]
print '(x,y):', X_train[0], '\n', '(r,g,b):', Y_train[0]
Samples: 30447
(x,y): [ 0.  0.] 
(r,g,b): [ 85 105 116]

Let’s now build our sequential model

import keras
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from keras.optimizers import Adam, RMSprop, Nadam

# Model architecture
model = Sequential()

# (the activations below are an assumption: the summary further down shows an
# Activation layer after each Dense, but not which function was used)
model.add(Dense(500, input_dim=2, init='uniform'))
model.add(Activation('relu'))

model.add(Dense(500, init='uniform'))
model.add(Activation('relu'))

model.add(Dense(500, init='uniform'))
model.add(Activation('relu'))

model.add(Dense(500, init='uniform'))
model.add(Activation('relu'))

model.add(Dense(500, init='uniform'))
model.add(Activation('relu'))

model.add(Dense(3, init='uniform'))
model.add(Activation('linear'))

# Compile model
# Why use the Nadam optimizer?
# Much like Adam is essentially RMSprop with momentum, Nadam is Adam with Nesterov momentum.
# (mse loss is an assumption, consistent with the loss values reported below)
model.compile(loss='mse', optimizer=Nadam(), metrics=['accuracy'])
model.summary()

Our output:

Layer (type)                     Output Shape          Param #     Connected to                     
dense_1 (Dense)                  (None, 500)           1500        dense_input_1[0][0]              
activation_1 (Activation)        (None, 500)           0           dense_1[0][0]                    
dense_2 (Dense)                  (None, 500)           250500      activation_1[0][0]               
activation_2 (Activation)        (None, 500)           0           dense_2[0][0]                    
dense_3 (Dense)                  (None, 500)           250500      activation_2[0][0]               
activation_3 (Activation)        (None, 500)           0           dense_3[0][0]                    
dense_4 (Dense)                  (None, 500)           250500      activation_3[0][0]               
activation_4 (Activation)        (None, 500)           0           dense_4[0][0]                    
dense_5 (Dense)                  (None, 500)           250500      activation_4[0][0]               
activation_5 (Activation)        (None, 500)           0           dense_5[0][0]                    
dense_6 (Dense)                  (None, 3)             1503        activation_5[0][0]               
activation_6 (Activation)        (None, 3)             0           dense_6[0][0]                    
Total params: 1005003
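As a sanity check, the total parameter count can be reproduced by hand from the layer sizes (a small Python calculation, not part of the original post):

```python
# Dense layer parameters = weights (n_in * n_out) + biases (n_out)
sizes = [2, 500, 500, 500, 500, 500, 3]
params = sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))
print(params)  # 1005003, matching the Keras summary
```

The 1500 / 250500 / 1503 per-layer figures in the summary fall out of the same formula applied layer by layer.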

Let’s now train our model with 1000 epochs and a batch size of 500.

# use this cell to find the best model architecture
history = model.fit(X_train, Y_train, nb_epoch=1000, shuffle=True, verbose=1, batch_size=500)
Y = model.predict(X_train, batch_size=10000)
k = 0
im_out = im.copy()  # copy, since slicing a numpy array returns a view
for i in range(im.shape[0]):
    for j in range(im.shape[1]):
        im_out[i, j] = Y[k]
        k += 1
print "Mona Lisa by DL"

Give it time to run: on a laptop with 4GB of RAM, it can take up to 3 hours to get a result.

Epoch 997/1000
30447/30447 [==============================] - 12s - loss: 231.1333 - acc: 0.9138    
Epoch 998/1000
30447/30447 [==============================] - 12s - loss: 213.8869 - acc: 0.9170    
Epoch 999/1000
30447/30447 [==============================] - 12s - loss: 215.9076 - acc: 0.9130    
Epoch 1000/1000
30447/30447 [==============================] - 12s - loss: 217.6785 - acc: 0.9154 

And here is our result, painting the Mona Lisa with Keras using TensorFlow as the backend.


Let’s now plot our model accuracy

# summarize history for accuracy
plt.plot(history.history['acc'], 'b')
plt.title('Model Accuracy')

And what about our model loss?

# summarize history for loss
plt.plot(history.history['loss'], 'r')
plt.title('Model Loss')


Several tests were made to achieve the presented result. The optimization method plays a significant role in the image quality; in this case, I decided to use the Nadam optimizer. Nevertheless, the number of neurons and layers used in the model, with a uniform init, played a major role in the speed of the script as well as the quality of the output.

Training times significantly improve when using nvidia-docker to take advantage of your machine’s GPU processor. We will post a new article on how to install and use nvidia-docker, which in our experience improves training times by almost 20x!

Get the full code on GitHub.

The post Learning to Paint The Mona Lisa With Neural Networks appeared first on 3Blades.
