splitting a field by annealing

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

A recent riddle [from The Riddler] that I pondered about during a [long!] drive to Luxembourg last weekend was about splitting a square field into three lots of identical surface for a minimal length of separating wire… While this led me to conclude that the best solution was a T-like separation, I ran a simulated annealing R code on my train trip to Autrans (Valence), seemingly in agreement with this conclusion. I discretised the square into n² units and explored configurations by switching two units with different colours, according to a simulated annealing pattern (although unable to impose connectivity on the three regions!):

n2=n^2 #n² units in the discretised square (n chosen by the user)
partz=matrix(1,n,n)
partz[,1:(n/3)]=2;partz[((n/2)+1):n,((n/3)+1):n]=3
#counting adjacent units of same colour
nood=hood=matrix(4,n,n)
for (v in 1:n2) hood[v]=bourz(v,partz)
minz=el=sum(4-hood)
for (t in 1:T){
  colz=sample(1:3,2) #picks colours
  a=sample((1:n2)[(partz==colz[1])&(hood<4)],1)
  b=sample((1:n2)[(partz==colz[2])&(hood<4)],1)
  partt=partz;partt[b]=colz[1];partt[a]=colz[2]
#collection of squares impacted by switch
  nood=hood
  voiz=unique(c(a,a-1,a+1,a+n,a-n,b-1,b,b+1,b+n,b-n))
  voiz=voiz[(voiz>0)&(voiz<n2+1)]
  for (v in voiz) nood[v]=bourz(v,partt)
  if (min(nood[voiz])>0){ #assumed check: no unit left without a same-colour neighbour
    difz=sum(nood)-sum(hood)
    if (log(runif(1))<difz*log(1+t)){ #Metropolis acceptance (cooling schedule assumed)
      partz=partt;hood=nood;el=sum(4-hood)
      minz=min(minz,el)}}}

(where bourz computes the number of same-colour neighbours), which produces completely random patterns at high temperatures (low t) and which returns, more or less, to the T configuration, if not always, as shown by the pictures in the original post. Once the (a?) solution was posted on The Riddler, it appeared that a triangular (Y) version proved better than the T one [if not started from the corners], with a gain of 3%, and that a curved separation was even better, with an extra gain of less than 1% [a solution that I find quite surprising, as straight lines should improve upon curved ones…]
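
The function bourz itself is not reproduced here; a minimal sketch of what such a neighbour counter could look like, assuming linear (column-major) indexing of the n×n grid and counting off-grid sides as matching colours (so that the outer fence costs no wire), is:

#hypothetical stand-in for bourz(): number of the four orthogonal neighbours
#of unit v sharing its colour in the n x n partition part
bourz=function(v,part){
  n=nrow(part)
  i=(v-1)%%n+1  #row index (R matrices are stored column-wise)
  j=(v-1)%/%n+1 #column index
  same=4        #off-grid sides counted as same colour
  if (i>1 && part[i-1,j]!=part[i,j]) same=same-1
  if (i<n && part[i+1,j]!=part[i,j]) same=same-1
  if (j>1 && part[i,j-1]!=part[i,j]) same=same-1
  if (j<n && part[i,j+1]!=part[i,j]) same=same-1
  same}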

Filed under: Kids, pictures, R, Statistics Tagged: Autrans, Luxembourg, mathematical puzzle, Palais Ducal, R, random walk, simulated annealing, The Riddler, Vercors



Industrial Production Increased 0.3% in September



From the Fed: Industrial production and Capacity Utilization

Industrial production rose 0.3 percent in September. The rates of change for July and August were notably revised; the current estimate for July, a decrease of 0.1 percent, was 0.5 percentage point lower than previously reported, while the estimate for August, a decrease of 0.7 percent, was 0.2 percentage point higher than before. The estimates for manufacturing, mining, and utilities were each revised lower in July. The continued effects of Hurricane Harvey and, to a lesser degree, the effects of Hurricane Irma combined to hold down the growth in total production in September by 1/4 percentage point.[1] For the third quarter as a whole, industrial production fell 1.5 percent at an annual rate; excluding the effects of the hurricanes, the index would have risen at least 1/2 percent. Manufacturing output edged up 0.1 percent in September but fell 2.2 percent at an annual rate in the third quarter. The indexes for mining and utilities in September rose 0.4 percent and 1.5 percent, respectively. At 104.6 percent of its 2012 average, total industrial production in September was 1.6 percent above its year-earlier level. Capacity utilization for the industrial sector increased 0.2 percentage point in September to 76.0 percent, a rate that is 3.9 percentage points below its long-run (1972–2016) average.
emphasis added

[Graph: Capacity Utilization]

This graph shows Capacity Utilization. This series is up 9.4 percentage points from the record low set in June 2009 (the series starts in 1967).

Capacity utilization at 76.0% is 3.9 percentage points below the long-run (1972–2016) average and below the pre-recession level of 80.8% in December 2007.

Note: y-axis doesn’t start at zero to better show the change.

[Graph: Industrial Production] The second graph shows industrial production since 1967.

Industrial production increased in September to 104.6. This is 20.1% above the recession low, and close to the pre-recession peak.

The hurricanes are still impacting this data.



Lop-sided precincts, a visual exploration


(This article was originally published at Junk Charts, and syndicated at StatsBlogs.)

In the last post, I discussed one of the charts in the very nice Washington Post feature, delving into polarizing American voters. See the post here. (Thanks again Daniel L.)

Today's post is inspired by the following chart (I am showing only the top of it – click here to see the entire chart):

[Image: Wpost_friendsparties2_top]

The chart plots each state as a separate row, so like most such charts, it is tall. The data analysis behind the chart is fascinating and unusual, although I find the chart harder to grasp than expected. The analyst starts with precinct-level data, and determines which precincts were “lop-sided,” defined as having a winning margin of over 50 percent for the winner (either Trump or Clinton). The analyst then sums the voters in those lop-sided precincts, and expresses this as a percent of all voters in the state.

For example, in Alabama, the long red bar indicates that about 48% of the state's voters live in lop-sided precincts that went for Trump. It's important to realize that not all of these people voted for Trump – they happened to live in precincts that went heavily for Trump. Interestingly, about 12% of the state's voters reside in precincts that went heavily for Clinton. Thus, overall, 60% of Alabama's voters live in lop-sided precincts.
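
As a sketch of the underlying computation (with a hypothetical precincts data frame and column names, since the Post's precinct file is not shown), the state-level shares could be tallied as follows:

library(dplyr)

# precincts is a hypothetical data frame: one row per precinct, with columns
# state, votes, winner ("Trump"/"Clinton") and margin_pts (the winner's margin in points)
lopsided_shares <- precincts %>%
  group_by(state) %>%
  summarise(
    trump_lopsided   = sum(votes[winner == "Trump"   & margin_pts > 50]) / sum(votes),
    clinton_lopsided = sum(votes[winner == "Clinton" & margin_pts > 50]) / sum(votes),
    total_lopsided   = trump_lopsided + clinton_lopsided) %>%
  arrange(desc(total_lopsided))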

This is more sophisticated than the usual analysis that shows up in journalism.

The bar chart may confuse readers for several reasons:

  • The horizontal axis is labeled “50-point plus margin for Trump/Clinton,” with values running from 0% into the 40–60% range. This labeling suggests that the values plotted are winning margins. However, the sub-header tells readers that the data values are percentages of total voters in the state.
  • The shades of colors are not explained. I believe the dark shade indicates the winning party in each state, so Trump won Alabama and Clinton, California. The addition of this information allows the analysis to become multi-dimensional. It also reveals that the designer wants to address how lop-sided precincts affect the outcome of the election. However, adding shade in this manner effectively turns a two-color composition into a four-color composition, adding to the processing load.
  • The chart adopts what Howard Wainer calls the “Alabama first” ordering. This always messes up the designer's message because the alphabetical order typically does not yield a meaningful correlation.

The bars face out from the middle, which is the 0% line. This arrangement is most often used in a population pyramid, when the designer feels it is important to let readers compare the magnitudes of two segments of a population. I do not feel that the Democrat versus Republican comparison within each state is crucial to this chart, given that most states were not competitive.

What is more interesting to me is the total proportion of voters who live in these lop-sided precincts. The designer agrees on this point and employs bar stacking to make it. This yields some amazing insights: several Democratic strongholds, such as Massachusetts, surprisingly have few lop-sided precincts.

***
Here then is a remake of the chart according to my priorities. Click here for the full chart.

[Image: Redo_wpost_friendsparties2_top]

The emphasis is on the total proportion of voters in lop-sided precincts. The states are ordered by that metric, from most lop-sided to least. This draws out an unexpected insight: most red states have a relatively high proportion of voters in lop-sided precincts (roughly 30 to 40%), while most blue states – except for the quartet of Maryland, New York, California and Illinois – do not exhibit such demographic concentration.

The gray area offers a counterpoint: most voters do not live in lop-sided precincts.

P.S. I should add that this is one of those chart designs that frustrate standard – I mean, point-and-click – charting software because I am placing the longest bar segments on the left, regardless of color.
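
For what it is worth, here is a rough ggplot2 sketch of the remake's two main ingredients (ordering states by total lop-sided share, and stacking the two party segments), using the hypothetical lopsided_shares frame from the earlier sketch; the per-bar trick of drawing the longest segment first is left out, which is precisely the frustration described above:

library(ggplot2)
library(tidyr)

plot_df <- pivot_longer(lopsided_shares,
                        c(trump_lopsided, clinton_lopsided),
                        names_to = "party", values_to = "share")

ggplot(plot_df, aes(x = reorder(state, share, FUN = sum), y = share, fill = party)) +
  geom_col() +      # stacked bars, one per state
  coord_flip() +    # states as rows, as in the original
  labs(x = NULL, y = "Share of state voters in lop-sided precincts")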


Writing Julia functions in R with examples


(This article was first published on R – insightR, and kindly contributed to R-bloggers)
By Gabriel Vasconcelos

The Julia programming language is growing fast, and its efficiency and speed are now well known. Even though I think R is the best language for Data Science, sometimes we just need more. Modelling is an important part of Data Science, and sometimes you may need to implement your own algorithms or adapt existing models to your problems.

If performance is not essential and the complexity of your problem is small, R alone is enough. However, if you need to run the same model several times on large datasets and the available implementations are not suited to your problem, you will need to go beyond R. Fortunately, you can go beyond R from within R, which is great because you can do your analysis in R and call complex models from elsewhere. The book “Extending R” by John Chambers presents R interfaces to C++, Julia and Python. The last two are covered by the XRJulia and XRPython packages, which are very straightforward.

XRJulia (Small example)

Now I will show how to write a small function in Julia and call it from R. You can also call existing Julia functions and run Julia packages (see the book and the package documentation for more information).

First let’s see if there is an existing Julia application in the system:

library(XRJulia)
findJulia(test = TRUE)
## [1] TRUE

Great! My Julia is up and running. A good way to define your Julia functions is through juliaEval (you can also use juliaSource to run scripts). This function evaluates Julia code and returns an object, but the object itself cannot be called. Next we need to tell R that this object is a function with JuliaFunction. Note that everything I am doing here is in Julia 0.5.2, and there may be some differences if you use older or newer versions.

# = Define a function in Julia for a linear regression = #
regjl <- juliaEval("
  function reg(x,y)
    n=size(x,1)
    xreg=hcat(ones(size(x)[1],1),x)
    k=size(xreg,2)
    p1=((xreg'xreg)^(-1))
    b=p1*xreg'y
    r=y-xreg*b
    sig=(r'r)/(n-k)
    vmat=sig[1]*p1
    sigb=sqrt(diag(vmat))
    t=b./sigb

    return (b,t)
  end
")

# = Tell R regjl is a function = #
regjl_function=JuliaFunction(regjl)

Now we can just call the Julia function as if it were an R function; the data will be converted automatically by the package. However, there are some possible pitfalls. For example, Julia's coercion rules differ from R's: it does not assume that a round float may be interpreted as an integer. If one of the arguments of the Julia function is used to index variables, it must be an integer, and you may have to convert it inside the Julia function or use the L suffix in R (0L, 1L).
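
For instance, here is a contrived sketch (takeat is not from the post) where the second argument is used as an index inside Julia:

# = A hypothetical Julia function whose second argument is used as an index = #
takeat_jl <- juliaEval("
function takeat(x,i)
  return x[i]
end
")
takeat <- JuliaFunction(takeat_jl)

takeat(c(10,20,30),2L)  # fine: 2L is passed as an integer
#takeat(c(10,20,30),2)  # would likely fail: 2 arrives as a float, and x[2.0] is not a valid index in Julia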

Let us generate some goofy data just to run a regression. We have 30 explanatory variables. Once we call the Julia regression function we will get an object that prints something like this: Julia proxy object Server Class: Array{Array{Float64,1},1}; size: 2.

# = Generate Data = #
set.seed(1)
x=matrix(rnorm(9000),300,30)
y=rnorm(300)

# = Run Julia Regression = #
test=regjl_function(x,y)
# = Convert Julia object to R = #
JLreg=juliaGet(test)[[1]]

# = Run regression using lm and compare = #
Rreg=coef(lm(y~x))
cbind(Rreg,JLreg)
##                     Rreg        JLreg
## (Intercept)  0.039077527  0.039077527
## x1          -0.007701335 -0.007701335
## x2          -0.056821632 -0.056821632
## x3           0.105108373  0.105108373
## x4           0.067983306  0.067983306
## x5           0.008417105  0.008417105
## x6           0.078409799  0.078409799
## x7           0.125869518  0.125869518
## x8          -0.083682827 -0.083682827
## x9           0.071740308  0.071740308
## x10         -0.005219314 -0.005219314
## x11         -0.031737503 -0.031737503
## x12         -0.059554921 -0.059554921
## x13         -0.006208602 -0.006208602
## x14         -0.020050620 -0.020050620
## x15          0.008620698  0.008620698
## x16         -0.030422210 -0.030422210
## x17         -0.070605635 -0.070605635
## x18          0.085396818  0.085396818
## x19         -0.039982253 -0.039982253
## x20          0.066555544  0.066555544
## x21         -0.059725849 -0.059725849
## x22         -0.037880127 -0.037880127
## x23         -0.009742695 -0.009742695
## x24          0.074308087  0.074308087
## x25          0.067308613  0.067308613
## x26          0.049848935  0.049848935
## x27         -0.020548783 -0.020548783
## x28         -0.014468850 -0.014468850
## x29         -0.038258085 -0.038258085
## x30          0.102003013  0.102003013

A big example: Performance

If you just needed an example to adapt to your own problems, you do not need to read the rest of this post. However, it is interesting to see how a complex Julia call performs in R. The main question is: is it worthwhile to use Julia from R even with all the data conversions we need to do from R to Julia and from Julia to R? I will answer with an example!!!

The model I will estimate is the Complete Subset Regression (CSR). Basically, it is a combination of many (really many!!) linear regressions. If you want more information on the CSR, click here. In this design, we will need to estimate 4845 regressions. The Julia function for the CSR is in the chunk below, and the R function can be downloaded from my github here or using install_github(“gabrielrvsc/HDeconometrics”) (devtools must be loaded).
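
(Judging from the call csrjl_function(x,y,4L,20L,0) further down, the 4845 regressions presumably correspond to all subsets of 4 regressors out of 20 pre-selected candidates:)

# = 4845 = number of subsets of size 4 out of 20 candidate variables = #
choose(20,4)
## [1] 4845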

# = Load Combinatorics package in Julia. You need to install it. = #
# = In julia, type: Pkg.add("Combinatorics") = #
comb=juliaEval("using Combinatorics")

# = Evaluate the CSR function (the beginning of the Julia code is missing from this copy) = #
csrjl <- juliaEval("
  ...
      if size(fixed_controls)[1] > 1
        final_coef[i,fixed_controls]=model1[1:(size(fixed_controls)[1])]
        model2=model1[size(fixed_controls)[1]+1:end]
      else
        final_coef[i,fixed_controls]=model1[1]
        model2=model1[2:end]
      end
      final_coef[i,floor(Int,comb[i])]=model2
    end
  elseif fixed_controls[1]==0
    for i in 1:m
      (model,tstat)=reg(x[:,floor(Int,comb[i])],y)
      final_const[i]=model[1]
      final_coef[i,floor(Int,comb[i])]=model[2:end]
    end
  end
  aux=hcat(final_const,final_coef)
  result=[mean(aux[:,i]) for i in 1:size(aux)[2]]
end
")
# = Define it as a function to R = #
csrjl_function=JuliaFunction(csrjl)

Finally, let's compare the R implementation and the Julia implementation. The R CSR is not as good as it could be; I still have some improvements to do. However, the Julia CSR is also far from perfect, given my skills in Julia.

# = Run Julia CSR = #
t1=Sys.time()
test=csrjl_function(x,y,4L,20L,0)
t2=Sys.time()
t2-t1
## Time difference of 0.5599275 secs
# = Run R CSR = #
t1=Sys.time()
testR=HDeconometrics::csr(x,y)
t2=Sys.time()
t2-t1
## Time difference of 5.391584 secs

As you can see, the Julia implementation is nearly 10 times faster than the R implementation. Most fast machine-learning functions available in R, such as glmnet or randomForest, are not written in R; instead, they call other languages like C++ and Fortran. Now you can do the same yourself with Julia and Python, which are much easier to learn. The last chunk below just shows that the two functions return the same result.

cbind("R"=colMeans(coef(testR)),"Julia"=juliaGet(test))
##                      R        Julia
## intersect  0.033839276  0.033839276
##            0.000000000  0.000000000
##           -0.010053924 -0.010053924
##            0.020620605  0.020620605
##            0.011568725  0.011568725
##            0.011807227  0.011807227
##            0.010009517  0.010009517
##            0.028981984  0.028981984
##           -0.008858019 -0.008858019
##            0.013655475  0.013655475
##            0.000000000  0.000000000
##           -0.008555735 -0.008555735
##           -0.014146991 -0.014146991
##            0.000000000  0.000000000
##            0.000000000  0.000000000
##            0.000000000  0.000000000
##            0.000000000  0.000000000
##           -0.010730430 -0.010730430
##            0.015761635  0.015761635
##           -0.009543594 -0.009543594
##            0.015337589  0.015337589
##           -0.008913380 -0.008913380
##           -0.009944097 -0.009944097
##            0.000000000  0.000000000
##            0.008933383  0.008933383
##            0.010226204  0.010226204
##            0.000000000  0.000000000
##            0.000000000  0.000000000
##            0.000000000  0.000000000
##           -0.007886220 -0.007886220
##            0.021236322  0.021236322



Hillstrom's Optimizer: Oh, Maybe The Optimizers Were Wrong


Our base case:


Then here’s what happens when you optimize every single individual year as each year comes down the turnpike:


The business is eleven million dollars more profitable by doing this than by following the base case. So, WOW! Life is good! The optimizers were right … they optimized each year and made you a fortune.

One problem.

In our optimized solution, where each individual year is optimized, we aren’t bringing on enough new customers. The “optimal” solution in an individual year was to cut catalog customer acquisition by 65%.

What happens if, instead of cutting customer acquisition spend via catalogs by 65%, we double catalog customer acquisition spend? I know, that’s crazy … that’s not the optimal solution!!!


Oh. My. Goodness.

I mean, we optimize each individual year, and we get a better answer.

Then we sub-optimize the answer and we get an answer that is SO MUCH BETTER!!

So what have we learned?
  • Optimization is good.
  • Sub-Optimization is better, especially when it comes to Customer Acquisition.
The secret to business, then, appears to require us to optimize to a point. To a point. And then, we need to sub-optimize the areas of our business that generate new customers … as long as we sub-optimize to the point where long-term value is maximized.


P.S.: You’ve heard of You Goat Mail, right? No? Well go check ’em out. And when you either buy or receive a goat you become a member of the “Goat Herd” (click here). What do you see when you visit the page? You see a ton of low-cost / no-cost customer acquisition awareness. It’s not hard to create a club that has no real meaning and no benefits, but is cute. Try something!

P.P.S.: NEMOA Sponsorship Opportunities (click here).

  • $2,800 = Host a Roundtable Session.
  • $5,500 = Host a Roundtable, Introduce a Session.
  • $11,000 = Speaking Opportunity (Not Guaranteed), Host Roundtable, Introduce Session. One Table for Sales Materials.
  • $21,000 (5 only) = Private Board and Speaker Event, Speaking Opportunity (Not Guaranteed), One Table for Sales Materials, Video Clip played in front of General Audience. Conference Organizers will find a unique way to get Sponsor Noticed.
  • $26,000 (1 only) = Guaranteed Speaking Opportunity, Guaranteed Roundtable Speaking Opportunity, Video Clip played in front of General Audience.
  • Go to the last page … 1 sponsor @ $26,000 + 4 sponsors @ $21,000 + 6 sponsors @ $11,000 + 13 sponsors @ $5,500 + 21 sponsors @ $2,800 = $306,300.
  • This is a very small … humble conference … and they're raking in more than a quarter-million dollars from sponsorships. What is easier … generating $306,300 in sponsorship fees or finding 400+ attendees? You already know the answer … and now you know why conferences are overrun by vendor-centric everything … and now you know why the attendees are folks who are dazzled by vendor-centric everything … and now you know why conference topics don’t align with the actual problems/challenges you face.






Wednesday: Housing Starts, Beige Book



Back in 2014, I wrote this …

For amusement: Years ago, whenever there was a market sell-off, my friend Tak Hallus (Stephen Robinett) would shout at his TV tuned to CNBC “Bring out the bears!”.

This was because CNBC would usually bring on the bears whenever there was a sell-off, and bulls whenever the market rallied.

Today was no exception with Marc Faber on CNBC:

“This year, for sure—maybe from a higher diving board—the S&P will drop 20 percent,” Faber said, adding: “I think, rather, 30 percent”

And Faber from August 8, 2013:

Faber expects to see stocks end the year “maybe 20 percent [lower], maybe more!”

And from October 24, 2012:

“I believe globally we are faced with slowing economies and disappointing corporate profits, and I will not be surprised to see the Dow Jones, the S&P, the major indices, down from the recent highs by say, 20 percent,” Faber said…

Since the market is up 30% since his 2012 prediction, shouldn’t he be expecting a 50% decline now?

Now the market is up about 80% since his 2012 prediction. I mention Faber – not because of his forecasting record – but because of his racist comments today (he will no longer be on CNBC).

I guess CNBC has an opening for a permabear!

Wednesday:
• At 7:00 AM ET, The Mortgage Bankers Association (MBA) will release the results for the mortgage purchase applications index.

• At 8:30 AM, Housing Starts for September. The consensus is for 1.170 million SAAR, down from the August rate of 1.180 million.

• During the day: The AIA’s Architecture Billings Index for September (a leading indicator for commercial real estate).

• At 2:00 PM, the Federal Reserve Beige Book, an informal review by the Federal Reserve Banks of current economic conditions in their Districts.


