Rational Inexuberance

By | ai, bigdata, machinelearning


Recently, Yoav Goldberg had a famous blog rant. I appreciate his concern, because the situation is game-theoretically dangerous: any individual researcher receives a benefit for aggressively positioning their work (as early as possible), but the field as a whole risks another AI winter as rhetoric and reality become increasingly divergent. Yoav’s solution is to incorporate public shaming in order to align local incentives with aggregate outcomes (cf. reward shaping).

I feel there is a better way, as exemplified by a recent paper by Jia and Liang. In this paper the authors corrupt the SQuAD dataset with distractor sentences which have no effect on human performance, but which radically degrade the performance of the systems on the leaderboard. This reminds me of work by Paperno et al. on a paragraph completion task which humans perform with high skill and for which all state-of-the-art NLP approaches fail miserably. Both of these works clearly indicate that our current automatic systems only bear a superficial (albeit economically valuable) resemblance to humans.

This approach to honest self-assessment of our capabilities is not only more scholarly, but also more productive, as it provides concrete tasks to consider. At minimum, this will result in improved technological artifacts. Furthermore, iterating this kind of goal-setting-and-goal-solving procedure many, many times might eventually lead to something worthy of the moniker Artificial Intelligence.

(You might argue that the Yoav Goldberg strategy is more entertaining, but the high from the Yoav Goldberg way is a “quick hit”, whereas having a hard task to think about has a lot of “replay value”.)


Source link

thinking with data with “Modern Data Science with R”

By | ai, bigdata, machinelearning

One of the biggest challenges educators face is how to teach statistical thinking integrated with data and computing skills to allow our students to fluidly think with data.  Contemporary data science requires a tight integration of knowledge from statistics, computer science, mathematics, and a domain of application. For example, how can one model high earnings as a function of other features that might be available for a customer? How do the results of a decision tree compare to a logistic regression model? How does one assess whether the underlying assumptions of a chosen model are appropriate?  How are the results interpreted and communicated? 
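
For instance, here is a minimal sketch in R (my own, not code from the book; the data frame, column names, and effect sizes are invented) of how one might fit a decision tree and a logistic regression to the same high-earnings question and compare their predictions:

library(rpart)

set.seed(1)
n <- 500
earnings <- data.frame(
  age            = sample(20:65, n, replace = TRUE),
  hours_per_week = sample(10:60, n, replace = TRUE)
)
# hypothetical relationship: high earnings become more likely with age and hours
p <- plogis(-8 + 0.08 * earnings$age + 0.07 * earnings$hours_per_week)
earnings$high_earner <- factor(rbinom(n, 1, p), labels = c("no", "yes"))

fit_tree  <- rpart(high_earner ~ age + hours_per_week, data = earnings, method = "class")
fit_logit <- glm(high_earner ~ age + hours_per_week, data = earnings, family = binomial)

# compare in-sample predicted probabilities of being a high earner
p_tree  <- predict(fit_tree, type = "prob")[, "yes"]
p_logit <- predict(fit_logit, type = "response")
cor(p_tree, p_logit)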


While there are a lot of other useful textbooks and references out there (e.g., R for Data Science, Practical Data Science with R, Intro to Data Science with Python) we saw a need for a book that incorporates statistical and computational thinking to solve real-world problems with data.  The result was Modern Data Science with R, a comprehensive data science textbook for undergraduates that features meaty, real-world case studies integrated with modern data science methods.  (Figure 8.2 above was taken from a case study in the supervised learning chapter.)

Part I (introduction to data science) motivates the book and provides an introduction to data visualization, data wrangling, and ethics.  Part II (statistics and modeling) begins with fundamental concepts in statistics, supervised learning, unsupervised learning, and simulation.  Part III (topics in data science) reviews dynamic visualization, SQL, spatial data, text as data, network statistics, and moving towards big data.  A series of appendices cover the mdsr package, an introduction to R, algorithmic thinking, reproducible analysis, multiple regression, and database creation.

We believe that several features of the book are distinctive:

  1. minimal prerequisites: while some background in statistics and computing is ideal, appendices provide an introduction to R, how to write a function, and key statistical topics such as multiple regression
  2. ethical considerations are raised early, to motivate later examples
  3. recent developments in the R ecosystem (e.g., RStudio and the tidyverse) are featured

Rather than focus exclusively on case studies or programming syntax, this book illustrates how statistical programming in R/RStudio can be leveraged to extract meaningful information from a variety of data in the service of addressing compelling statistical questions.  

This book is intended to help readers with some background in statistics and modest prior experience with coding develop and practice the appropriate skills to tackle complex data science projects. We’ve taught a variety of courses using it, ranging from an introduction to data science and a sophomore-level data science course to components of a senior capstone class.
We’ve made three chapters freely available for download: data wrangling I, data ethics, and an introduction to multiple regression. An instructor’s solution manual is available, and we’re working to create a series of lab activities (e.g., text as data).  (The code to generate the above figure can be found in the supervised learning materials at http://mdsr-book.github.io/instructor.html.)
Modern Data Science with R (book cover)


An unrelated note about aggregators:
We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.





Source link

R’s tidytext turns messy text into valuable insight

By | ai, bigdata, machinelearning

Authors Julia Silge and David Robinson discuss the power of tidy data principles, sentiment lexicons, and what they’re up to at Stack Overflow.

“Many of us who work in analytical fields are not trained in even simple interpretation of natural language,” write Julia Silge, Ph.D., and David Robinson, Ph.D., in their newly released book Text Mining with R: A Tidy Approach. The applications of text mining are numerous and varied, though; sentiment analysis can assess the emotional content of text, frequency measurements can identify a document’s most important terms, analysis can explore relationships and connections between words, and topic modeling can classify and cluster similar documents.

I recently caught up with Silge and Robinson to discuss how they’re using text mining on job postings at Stack Overflow, some of the challenges and best practices they’ve experienced when mining text, and how their tidytext package for R aims to make text analysis both easy and informative.

Let’s start with the basics. Why would an analyst mine text? What insights can be derived from mining instances of words, sentiment of words?

Text and other unstructured data is increasingly important for data analysts and data scientists in diverse fields from health care to tech to nonprofits. This data can help us make good decisions, but to capitalize on it, we must have the tools and the skills to get from unstructured text to insights. We can learn a lot by exploring word frequencies or comparing word usage, and we can dig deeper by implementing sentiment analysis to analyze the emotion or opinion content of words, or by fitting a topic model to discover underlying structure in a set of documents.

Why did you create the tidytext text mining package in R? How does it make an R user’s life easier?

We created the tidytext package because we believe in the power of tidy data principles, and we wanted to apply this consistent, opinionated approach for handling data to text mining tasks. Tidy tools like dplyr and ggplot2 are widely used, and integrating natural language processing into these tools allows R users to work with greater fluency.

One feature in tidytext 0.1.3 is the addition of the Loughran and McDonald sentiment lexicon of words specific to financial reporting, where words like “collaborate” and “collaborators” are tagged as positive and words like “collapsed” and “collapsing” as negative. For someone who is new to text mining, what is the general purpose of a sentiment lexicon? What are some ways this lexicon would be used by an analyst?

Sentiment lexicons are lists of words that have been assigned scores according to how positive or negative they are, or what emotions (such as “anticipation” or “fear”) they might be associated with. We can analyze the emotion content of text by adding up the scores of the words within it, which is a common approach to sentiment analysis. The tidytext package contains several general purpose English lexicons appropriate for general text, and we are excited to extend these with a context-specific lexicon for finance. A word like “share” has a positive meaning in most contexts, but is neutral in financial contexts, where it usually refers to shares of stock. Applying the Loughran-McDonald lexicon allows us to explore the sentiment content of documents dealing with finance with more confidence.
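
As a concrete illustration of this kind of lexicon join, here is a small sketch (mine, not from the interview); it assumes the Loughran-McDonald lexicon is available through tidytext’s get_sentiments(), and the two example documents are invented:

library(dplyr)
library(tidytext)

reports <- tibble(doc = 1:2,
                  text = c("The merger collapsed amid mounting losses",
                           "Firms collaborate on the new venture"))

reports %>%
  unnest_tokens(word, text) %>%                            # one word per row
  inner_join(get_sentiments("loughran"), by = "word") %>%  # attach finance-specific labels
  count(doc, sentiment)                                    # tally labels per document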

In your book, you perform text analysis on data sets ranging from classic Jane Austen novels to NASA metadata to Twitter archives. What are some of the ways you’re analyzing text data in your daily work at Stack Overflow?

We are swimming in text data at Stack Overflow! One example we deal with is text in job postings; we use text mining and modeling to match job listings with people who may be interested in them. Another example is text in messages between companies who are hiring and developers they want to hire; we use text mining to see what makes a developer more likely to respond to a company. But, we’re certainly not unique in this; many organizations are dealing with increasing amounts of text data that are important to their decision-making.

Text data is messy, and things like abbreviations, “filler” words, or repeated words can present many challenges. What are some common challenges practitioners might confront when wrangling or visualizing text data, as opposed to more traditional data types (e.g., numerical)?

Data scientists and analysts like us are usually trained on numerical data in a rectangular shape like a table (i.e., data frame), so it takes some practice to fluently wrangle raw text data. We find ourselves reaching for regular expressions and the stringr package a lot, to deal with challenges such as stripping out HTML tags or email headers, or extracting subsets of text we are interested in. We often put such tasks into practice using the purrr package; it’s a very useful tool for dealing with iteration.
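
For example, a quick stringr sketch (my own, not from the interview) of the tag-stripping step mentioned above:

library(stringr)

raw <- "<p>Senior <b>Data Scientist</b> - remote friendly</p>"   # made-up job-posting snippet
str_remove_all(raw, "<[^>]+>")    # drop anything that looks like an HTML tag
# [1] "Senior Data Scientist - remote friendly"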

What are some best practices you can offer to data scientists and analysts looking to overcome text mining problems?

We come from a particular, opinionated perspective on this question; our advice is that adopting tidy data principles is an effective strategy to approach text mining problems. The tidy text format keeps one token (typically a word) in each row, and keeps each variable (such as a document or chapter) in a column. When your data is tidy, you can use a common set of tools for exploring and visualizing them. This frees you from struggling to get your data into the right format for each task and instead lets you focus on the questions you want to ask.
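
A tiny illustration of that one-token-per-row structure (again my own toy example, not from the book):

library(dplyr)
library(tidytext)

chapters <- tibble(chapter = c(1, 2),
                   text = c("Tidy data makes text mining easier",
                            "Tidy tools share a common grammar"))

chapters %>%
  unnest_tokens(word, text) %>%    # one word per row; `chapter` is carried along
  count(chapter, word, sort = TRUE)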

Your book demonstrates how to do text mining in R. Which R tools do you commonly use to support text mining? And why is R your tool of choice?

Our main toolbox for text mining in R focuses on our package tidytext, along with the packages dplyr, tidyr, and ggplot2. These are all tools from the tidyverse collection of packages in R, and the availability and cohesion of these tools are the reasons why we use R for text mining. Using consistent tools designed for handling tidy data gives us a dependable framework for understanding how to represent text data in R, visualize the characteristics of text, model topics, and move smoothly to more complex machine learning applications.

What is the difference between text mining and natural language processing?

In our experience, definitions for these terms are somewhat vague and sometimes interchangeable. When people talk about text mining, they often mean getting insight from text through statistical analysis, perhaps looking at word frequencies or clustering. When people talk about natural language processing, they’re often describing the interaction between language and computers, and sometimes the goal of extracting meaning to enable human-computer conversations. We describe our work as “text mining” because our goal is extracting and visualizing insights, but there is a great deal of overlap.

Continue reading R’s tidytext turns messy text into valuable insight.




Source link

Died in the Wool

By | machinelearning

Garrett M. writes:

I’m an analyst at an investment management firm. I read your blog daily to improve my understanding of statistics, as it’s central to the work I do.

I had two (hopefully straightforward) questions related to time series analysis that I was hoping I could get your thoughts on:

First, much of the work I do involves “backtesting” investment strategies, where I simulate the performance of an investment portfolio using historical data on returns. The primary summary statistics I generate from this sort of analysis are mean return (both arithmetic and geometric) and standard deviation (called “volatility” in my industry). Basically the idea is to select strategies that are likely to generate high returns given the amount of volatility they experience.

However, historical market data are very noisy, with stock portfolios generating an average monthly return of around 0.8% with a monthly standard deviation of around 4%. Even samples containing 300 months of data then have standard errors of about 0.2% (4%/sqrt(300)).

My first question is, suppose I have two time series. One has a mean return of 0.8% and the second has a mean return of 1.1%, both with a standard error of 0.4%. Assuming the future will look like the past, is it reasonable to expect the second series to have a higher future mean than the first out of sample, given that it has a mean 0.3% greater in the sample? The answer might be obvious to you, but I commonly see researchers make this sort of determination, when it appears to me that the data are too noisy to draw any sort of conclusion between series with means within at least two standard errors of each other (ignoring for now any additional issues with multiple comparisons).

My second question involves forecasting standard deviation. There are many models and products used by traders to determine the future volatility of a portfolio. The way I have tested these products has been to record the percentage of the time future returns (so out of sample) fall within one, two, or three standard deviations, as forecasted by the model. If future returns fall within those buckets around 68%/95%/99% of the time, I conclude that the model adequately predicts future volatility. Does this method make sense?

My reply:

In answer to your first question, I think you need a model of the population of these time series. You can get different answers from different models. If your model is that each series is a linear trend plus noise, then you’d expect (but not be certain) that the second series will have a higher future return than the first. But there are other models where you’d expect the second series to have a lower future return. I’d want to set up a model allowing all sorts of curves and trends, then fit the model to past data to estimate a population distribution of those curves. But I expect that the way you’ll make real progress is to have predictors—I guess they’d be at the level of the individual stock, maybe varying over time—so that your answer will depend on the values of these predictors, not just the time series themselves.

In answer to your second question, yes, sure, you can check the calibration of your model using interval coverage. This should work fine if you have lots of calibration data. If your sample size becomes smaller, you might want to do something using quantiles, as described in this paper with Cook and Rubin, as this will make use of the calibration data more efficiently.
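
For what it’s worth, here is a bare-bones version of that coverage check in R (my own sketch, with simulated data standing in for real forecasts; it is not the letter writer’s code):

set.seed(123)
n         <- 300
mu_hat    <- 0.008                                   # forecast mean monthly return
sigma_hat <- runif(n, 0.03, 0.05)                    # forecast volatilities
ret       <- rnorm(n, mean = mu_hat, sd = sigma_hat) # realized out-of-sample returns

z <- (ret - mu_hat) / sigma_hat                      # returns standardized by forecast sd
c(within_1sd = mean(abs(z) <= 1),                    # expect about 0.68 if well calibrated
  within_2sd = mean(abs(z) <= 2),                    # about 0.95
  within_3sd = mean(abs(z) <= 3))                    # about 0.99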

The post Died in the Wool appeared first on Statistical Modeling, Causal Inference, and Social Science.

Source link

Why Join admitad CPA Affiliate Marketing Network?

By | iot

With the internet growing at a very fast pace, new and better business opportunities are being created every other day. One strategy you may have heard about is affiliate marketing. For those who are not yet aware of it, affiliate marketing is performance-based marketing in which a business rewards one or more of its affiliates for bringing in customers, in the form of traffic sent through their own affiliate links, to the business owner’s site. For each visitor …

The post Why Join admitad CPA Affiliate Marketing Network? appeared first on IoT Worm.

Source link

If you are using Facebook Ads split testing (A/B testing), stop fooling yourself

By | ai, bigdata, machinelearning

I don't speak often at marketing conferences and that's because my message is not easy to take. For example, one of my talks is titled “The Accountability Paradox in Big Data Marketing.” Google and other digital marketers claim that the ad-tech world is more measurable, and thus more accountable, than the old world of TV advertising – they claim that advertisers save money by going digital. The reality is not so. There has been some attention to this problem recently – but far from enough.

Let me illustrate the problems by describing my recent experience running ads on Facebook for Principal Analytics Prep, the analytics bootcamp I recently launched. For a small-time advertiser like us, Facebook presents a channel to reach large numbers of people to build awareness of our new brand.

So far, the results from the ads have been satisfactory but not great. We are quite content with the effectiveness but wanted to run experiments to get a higher volume of “conversions”. This last week, we ran an A/B test to see if different images result in more conversions. We designed a four-way split, so in reality, an A/B/C/D test. One of the test cells (call it D) is the “champion,” i.e. the image that has performed well prior to the test; the other images are new. We launched the test on a Friday.

Two days later, I checked the interim results. Only one of the test cells (A) had any responses. Surprisingly, test cell A had received about 90% of all “impressions.” Said differently, test cell A received 10 times as many impressions as each of the other three cells. The other test cells were getting such a measly allocation that I lost all confidence in this test.

It turns out that an automated algorithm (what is now labeled A.I.) was behind this craziness. Apparently, this is a well-known problem among people who tried to do so-called split testing on the Facebook Ads platform. See this paragraph from the AdEspresso blog:

This often results in an uneven distribution of the budget where some experiments will receive a lot of impressions and consume most of the budget leaving others under-tested. This is due to Facebook being over aggressive determining which ad is better and driving to it most of the Adset’s budget.

Then, one day later, I was shaken again when checking the interim report. Suddenly, test cell C got almost all the impressions – due to one conversion that showed up overnight for the C image. Clearly, anyone using this split-testing feature is just fooling themselves.

***

This is a great example of interesting math that looks good on paper but spectacularly fails in practice. The algorithm driving this crazy behavior is most likely a multi-armed bandit, a method named after the problem of a gambler choosing among slot machines; some academics have recently written many papers arguing that such algorithms are suitable for A/B testing. The testing platform in Google Analytics used to do a similar thing – it might still do so, but I wouldn't know because I avoid that one like the plague as well.

The problem setup is not difficult to understand: in traditional testing as developed by statisticians, you need a certain sample size to be confident that any difference observed between the A and B cells is “statistically significant.” The analyst would wait for the entire sample to be collected before making a judgment on the results. No one wants to wait especially when the interim results are showing a direction in one's favor. This is true in business as in medicine. The pharmaceutical company that is running a clinical trial on a new drug it spent gazillions to develop would love to declare the new drug successful based on interim positive results. Why wait for the entire sample when the first part of the sample gives you the answer you want?
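
To put a number on “a certain sample size” (my addition, using the standard two-proportion approximation), the per-cell sample size needed to detect a given lift in conversion rate is roughly

n \approx \frac{2\,\bar{p}(1-\bar{p})\,(z_{1-\alpha/2} + z_{1-\beta})^2}{\delta^2}

where \bar{p} is the baseline conversion rate, \delta is the lift to detect, and the z terms are the usual normal quantiles for the significance level and power. With a 0.5% baseline rate, a 0.1-percentage-point lift, \alpha = 0.05, and 80% power, that works out to roughly 78,000 impressions per cell – exactly the kind of wait that tempts people to peek early.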

So people come up with justifications for why one should stop a test early. They like to call this a game of “exploration versus exploitation.” They claim that the statistical way of running testing is too focused on exploration; they claim that there is “lost opportunity” because statistical testing does not “exploit” interim results. 

They further claim that the multi-armed bandit algorithms solve this problem by optimally balancing exploration and exploitation (don't shoot me, I am only the messenger). In this setting, they allow the allocation of treatment in the A/B test to change continuously in response to interim results. Those cells with higher interim response rates will be allocated more future testing units while those cells with lower interim response rates will be allocated fewer testing units. The allocation of units to treatment continuously shifts throughout the test.
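
To make the mechanism concrete, here is a rough sketch in R (my own; Facebook has not published its exact algorithm) of Thompson sampling, one common bandit method, run on four ads whose true conversion rates are identical. Even with nothing to exploit, the final allocation is routinely lopsided, and which arm "wins" typically changes from seed to seed:

set.seed(1)
n_arms <- 4
true_p <- rep(0.002, n_arms)   # all four ads convert at the same (made-up) rate
a <- rep(1, n_arms)            # Beta(1, 1) prior on each ad's conversion rate
b <- rep(1, n_arms)
shown <- integer(n_arms)

for (i in 1:20000) {                        # 20,000 impressions
  draws <- rbeta(n_arms, a, b)              # sample a plausible rate for each ad
  arm   <- which.max(draws)                 # show the ad that currently looks best
  conv  <- rbinom(1, 1, true_p[arm])
  a[arm] <- a[arm] + conv                   # update that ad's posterior
  b[arm] <- b[arm] + 1 - conv
  shown[arm] <- shown[arm] + 1
}
round(shown / sum(shown), 2)                # share of impressions per ad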

***

When this paradigm is put in practice, it keeps running into all sorts of problems. One reality is that 80 to 90 percent of all test ideas make no difference, meaning the test version B on average performs just as well as test version A. There is nothing to “exploit.” Any attempted exploitation represents swimming in the noise.

In practice, many tests using this automated algorithm produce absurd results. As AdEspresso pointed out, the algorithm is overly aggressive in shifting impressions to the current “winner.” For my own test, which has very low impressions, it is simply absurd for it to start changing allocation proportions after one or two days. These shifts are driven by single-digit conversions off a small base of impressions. The algorithm then swims around in the noise. Because of such aimless and wasteful “exploitation,” it would have taken me much, much longer to collect enough samples on the other images to definitively make a call!

***

AdEspresso and others recommend a workaround. Instead of putting the four test images into one campaign, they recommend setting up four campaigns each with one image, and splitting the advertising equally between these campaigns.

Since there is only one image in each campaign, you have effectively turned off the algorithm. When you split the budget equally, each campaign will get similar numbers of impressions. 

However, this workaround is also flawed. If you can spot what the issue is, say so in the comments!

Source link

MBA: Mortgage Applications Increase in Latest Weekly Survey

By | ai, bigdata, machinelearning


From the MBA: Mortgage Applications Increase in Latest MBA Weekly Survey

Mortgage applications increased 0.4 percent from one week earlier, according to data from the Mortgage Bankers Association’s (MBA) Weekly Mortgage Applications Survey for the week ending July 21, 2017.

… The Refinance Index increased 3 percent from the previous week. The seasonally adjusted Purchase Index decreased 2 percent from one week earlier to the lowest level since May 2017. The unadjusted Purchase Index decreased 2 percent compared with the previous week and was 8 percent higher than the same week one year ago. …

The average contract interest rate for 30-year fixed-rate mortgages with conforming loan balances ($424,100 or less) decreased to 4.17 percent from 4.22 percent, with points increasing to 0.40 from 0.31 (including the origination fee) for 80 percent loan-to-value ratio (LTV) loans.
emphasis added

Mortgage Refinance Index

The first graph shows the refinance index since 1990.

Refinance activity will not pick up significantly unless mortgage rates fall well below 4%.

The second graph shows the MBA mortgage purchase index.

According to the MBA, purchase activity is up 8% year-over-year.


Source link

Random segments and broken sticks

By | ai, bigdata, machinelearning

(This article was originally published at The DO Loop, and syndicated at StatsBlogs.)

A classical problem in elementary probability asks for the expected lengths of line segments that result from randomly selecting k points along a segment of unit length. It is both fun and instructive to simulate such problems. This article uses simulation in the SAS/IML language to estimate solutions to the following problems:

  • Randomly choose k points in the interval (0, 1). The points divide the interval into k+1 segments. What is the expected length of the largest (smallest) segment?
  • When k=2, the points divide the interval into three segments. What is the probability that the three segments can form a triangle? This is called the broken-stick problem and is illustrated in the figure to the right.

You can find a discussion and solution to these problems on many websites, but I like the Cut-The-Knot.org website, which includes proofs and interactive Java applets.

Simulate a solution in SAS

You can simulate these problems in SAS by writing a DATA step or a SAS/IML program. I discuss the DATA step at the end of this article. The body of this article presents a SAS/IML simulation and constructs helper modules that solve the general problem. The simulation will do the following:

  1. Generate k points uniformly at random in the interval (0, 1). For convenience, sort the points in increasing order.
  2. Compute the lengths of the k+1 segments.
  3. Find the length of the largest and smallest segments.

In many languages (including the SAS DATA step), you would write a loop that performs these operations for each random sample. You would then estimate the expected length by computing the mean value of the largest segment for each sample. However, in the SAS/IML language, you can use matrices instead of using a loop. Each sample of random points can be held in the column of a matrix. The lengths of the segments can also be held in a matrix. The largest segment for each trial is stored in a row vector.

The following SAS/IML modules help solve the general simulation problem for k random points. Because the points are ordered, the lengths of the segments are the differences between adjacent rows. You can use the DIF function for this computation, but the following program uses the DifOp module to construct a small difference operator, and it uses matrix multiplication to compute the differences.

proc iml;
/* Independently sort column in a matrix.
   See http://blogs.sas.com/content/iml/2011/03/14/sorting-rows-of-a-matrix.html */
start SortCols(A);
   do i = 1 to ncol(A);
      v = A[ ,i];  call sort(v);  A[ ,i] = v; /* get i_th col and sort it */
   end;
finish;
 
/* Generate a random (k x NSim) matrix of points, then sort each column. */
start GenPts(k, NSim);
   x = j(k, NSim);               /* allocate k x NSim matrix */
   call randgen(x, "Uniform");   /* fill with random uniform in (0,1) */
   if k > 1 then run SortCols(x);  
   return x;
finish;
 
/* Return matrix for difference operator.
   See  http://blogs.sas.com/content/iml/2017/07/24/difference-operators-matrices.html */
start DifOp(dim);
   D = j(dim-1, dim, 0);         /* allocate zero matrix */
   n = nrow(D); m = ncol(D);
   D[do(1,n*m, m+1)] = -1;       /* assign -1 to diagonal elements */
   D[do(2,n*m, m+1)] = 1;        /* assign +1 to super-diagonal elements */
   return D;
finish;
 
/* Find lengths of segments formed by k points in the columns of x.
   Assume each column of x is sorted and all points are in (0,1). */
start SegLengths(x);   
   /* append 0 and 1 to top and bottom (respectively) of each column */
   P = j(1, ncol(x), 0) // x // j(1, ncol(x), 1);
   D = DifOp(nrow(P));           /* construct difference operator */
   return D*P;                   /* use difference operator to find lengths */
finish;
 
P = {0.1  0.2  0.3,
     0.3  0.8  0.5,
     0.7  0.9  0.8 };
L = SegLengths(P);
print L[label="Length (k=3)"];

Lengths of segments formed by random points in the unit interval

The table shows the lengths of three different sets of points for k=3. The first column of P corresponds to points at locations {0.1, 0.3, 0.7}. These three points divide the interval [0, 1] into four segments of lengths 0.1, 0.2, 0.4, and 0.3. Similar computations hold for the other columns.

The expected length of the longer of two segments

For k=1, the problem generates a random point in (0, 1) and asks for the expected length of the longer segment. Obviously the expected length is greater than 1/2, and you can read the Cut-The-Knot website to find a proof that shows that the expected length is 3/4 = 0.75.
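
As a quick sketch of the standard argument (mine, consistent with the Cut-The-Knot proof): if X is uniform on (0, 1), the longer piece has length max(X, 1-X), so

E[\max(X, 1-X)] = \int_0^1 \max(x, 1-x)\,dx = 2\int_{1/2}^1 x\,dx = 3/4 .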

The following SAS/IML statements generate one million random points and compute the larger of the segment lengths. The average value of the larger segments is computed and is very close to the expected value:

call randseed(54321);
k = 1;  NSim = 1E6;
x = GenPts(k, NSim);             /* simulations of 1 point dropped onto (0,1) */
L = SegLengths(x);               /* lengths of  segments */
Largest = L[<>, ];               /* max length among the segments */
mean = mean(Largest`);           /* average of the max lengths */
print mean;

Estimate for expected length of longest segment

You might not be familiar with the SAS/IML max subscript operator (<>) and the min subscript operator (><). These operators compute the minimum or maximum values for each row or column in a matrix.

The expected length of the longest of three segments

For k=2, the problem generates two random points in (0, 1) and asks for the expected length of the longest segment. You can also ask for the average shortest length. The Cut-The-Knot website shows that the expected length for the longest segment is 11/18 = 0.611, whereas the expected length of the shortest segment is 2/18 = 0.111.

The following SAS/IML statements simulate choosing two random points on one million unit intervals. The program computes the one million lengths for the resulting longest and shortest segments. Again, the average values of the segments are very close to the expected values:

k = 2;  NSim = 1E6;
x = GenPts(k, NSim);             /* simulations of 2 points dropped onto (0,1) */
L = SegLengths(x);               /* lengths of segments */
maxL = L[<>, ];                  /* max length among the segments */
meanMax = mean(maxL`);           /* average of the max lengths */
minL = L[><, ];                  /* min length among the segments */
meanMin = mean(minL`);           /* average of the min lengths */
print meanMin meanMax;

Estimates for expected lengths of the shortest and longest segments formed by two random points in the unit interval

The broken stick problem

You can use the previous simulation to estimate the broken stick probability. Recall that three line segments can form a triangle provided that they satisfy the triangle inequality: the sum of the two smaller lengths must be greater than the third length. If you randomly choose two points in (0,1), the probability that the resulting three segments can form a triangle is 1/4, which is smaller than what most people would guess.
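
A short version of the standard argument (my sketch, consistent with the Cut-The-Knot proof): write the ordered break points as x < y, so the pieces have lengths x, y - x, and 1 - y. They form a triangle exactly when every piece is shorter than 1/2, that is,

x < 1/2, \qquad y > 1/2, \qquad y - x < 1/2 .

Inside the region {0 < x < y < 1}, which has area 1/2, these constraints cut out a triangle of area 1/8, so the probability is (1/8)/(1/2) = 1/4.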

The vectors maxL and minL each contain one million lengths, so it is trivial to compute the vector that contains the middle lengths.

/* what proportion of randomly broken sticks form triangles? */
medL = 1 - maxL - minL;          /* compute middle length */
isTriangle = (maxL <= minL + medL); /* do lengths satisfy triangle inequality? */
prop = mean(isTriangle`);        /* proportion of segments that form a triangle */
print prop;

Estimate for the probability that three random segments form a triangle

As expected, about 0.25 of the simulations resulted in segments that satisfy the triangle inequality.

In conclusion, this article shows how to use the SAS/IML language to solve several classical problems in probability. By using matrices, you can run the simulation with vectorized computations such as matrix multiplication and finding the minimum or maximum values of columns. (However, I had to use a loop to sort the points. Bummer!)

If you want to try this simulation yourself in the DATA step, I suggest that you transpose the SAS/IML setup. Use arrays to hold the random points and use the CALL SORTN subroutine to sort the points. Use the LARGEST function and the SMALLEST function to compute the largest and smallest elements in an array. Feel free to post your solution (and any other thoughts) in the comments.

The post Random segments and broken sticks appeared first on The DO Loop.





Source link

SQL Server 2017 release candidate now available

By | ai, bigdata, machinelearning

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

SQL Server 2017, the next major release of the SQL Server database, has been available as a community preview for around 8 months, but now the first full-featured release candidate is available for public preview. For those looking to do data science with data in SQL Server, there are a number of new features compared to SQL Server 2016 that might be of interest.

SQL Server 2017 Release Candidate 1 is available for download now. For more details on these and other new features in this release, check out the link below.

SQL Server Blog: SQL Server 2017 CTP 2.1 now available





Source link