Alternatives to jail for scientific fraud

Mark Tuttle pointed me to this article by Amy Ellis Nutt, who writes:

Since 2000, the number of U.S. academic fraud cases in science has risen dramatically. Five years ago, the journal Nature tallied the number of retractions in the previous decade and revealed they had shot up 10-fold. About half of the retractions were based on researcher misconduct, not just errors, it noted.

The U.S. Office of Research Integrity, which investigates alleged misconduct involving National Institutes of Health funding, has been far busier of late. Between 2009 and 2011, the office identified three cases with cause for action. Between 2012 and 2015, that number jumped to 36.

While criminal cases against scientists are rare, they are increasing. Jail time is even rarer, but not unheard of. Last July, Dong-Pyou Han, a former biomedical scientist at Iowa State University, pleaded guilty to two felony charges of making false statements to obtain NIH research grants and was sentenced to more than four years in prison.

Han admitted to falsifying the results of several vaccine experiments, in some cases spiking blood samples from rabbits with human HIV antibodies so that the animals appeared to develop an immunity to the virus.

“The court cannot get beyond the breach of the sacred trust in this kind of research,” District Judge James Gritzner said at the trial’s conclusion. “The seriousness of this offense is just stunning.”

In 2014, the Office of Research Integrity had imposed its own punishment. Although it could have issued a lifetime funding ban, it only barred Han from receiving federal dollars for three years.

Sen. Charles Grassley (R-Iowa) was outraged. “This seems like a very light penalty for a doctor who purposely tampered with a research trial and directly caused millions of taxpayer dollars to be wasted on fraudulent studies,” he wrote the agency. The result was a federal probe and Han’s eventual sentence.

I responded that I think community service would make more sense. Flogging seems like a possibility too. Jail seems so destructive.

I do agree with Sen. Grassley that a 3-year ban on federal dollars is not enough of a sanction in that case. Spiking blood samples is pretty much the worst thing you can do, when it comes to interfering with the scientific process. If spiking blood samples only gives you a 3-year ban, what does it take to get a 10-year ban? Do you have to be caught actually torturing the poor bunnies? And what would it take to get a lifetime ban? Spiking blood samples plus torture plus intentionally miscalculating p-values?

The point is, there should be some punishments more severe than the 3-year ban but more appropriate than prison, involving some sort of restitution. Maybe if you’re caught spiking blood samples you’d have to clean pipettes at the lab every Saturday and Sunday for the next 10 years? Or you’d have to check the p-value computations in every paper published in Psychological Science between the years of 2010 and 2015?

The post Alternatives to jail for scientific fraud appeared first on Statistical Modeling, Causal Inference, and Social Science.

Learning the structure of learning

If anything, there has been a flurry of effort in learning the structure of new learning architectures. Here is an ICLR2017 paper on the subject of meta learning and posters of the recent NIPS symposium on the topic.

Neural Architecture Search with Reinforcement Learning, Barret Zoph, Quoc Le (Open Review is here)

Abstract: Neural networks are powerful and flexible models that work well for many difficult learning tasks in image, speech and natural language understanding. Despite their success, neural networks are still hard to design. In this paper, we use a recurrent network to generate the model descriptions of neural networks and train this RNN with reinforcement learning to maximize the expected accuracy of the generated architectures on a validation set. On the CIFAR-10 dataset, our method, starting from scratch, can design a novel network architecture that rivals the best human-invented architecture in terms of test set accuracy. Our CIFAR-10 model achieves a test error rate of 3.65, which is 0.09 percent better and 1.05x faster than the previous state-of-the-art model that used a similar architectural scheme. On the Penn Treebank dataset, our model can compose a novel recurrent cell that outperforms the widely-used LSTM cell, and other state-of-the-art baselines. Our cell achieves a test set perplexity of 62.4 on the Penn Treebank, which is 3.6 perplexity better than the previous state-of-the-art model. The cell can also be transferred to the character language modeling task on PTB and achieves a state-of-the-art perplexity of 1.214.
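The controller-plus-reinforcement-learning loop the abstract describes can be sketched in miniature. Everything below is an illustrative toy: the search space, the `child_accuracy` stand-in for training a child network on a validation set, and all hyperparameters are made up; the paper’s actual controller is an RNN trained against real validation accuracy.

```python
import math
import random

random.seed(0)

# Toy search space: a "filter size" for each of three layers.
CHOICES = [1, 3, 5, 7]
NUM_LAYERS = 3

def child_accuracy(arch):
    # Stand-in for "train the child network, measure validation
    # accuracy"; here architectures near all-3s simply score higher.
    return 1.0 - 0.1 * sum(abs(f - 3) for f in arch) / NUM_LAYERS

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

# Controller: one independent softmax per layer (the paper's
# controller is an RNN conditioning each decision on previous ones).
logits = [[0.0] * len(CHOICES) for _ in range(NUM_LAYERS)]

baseline, LR, DECAY = 0.0, 0.5, 0.9
for step in range(500):
    idxs = [random.choices(range(len(CHOICES)), weights=softmax(l))[0]
            for l in logits]
    reward = child_accuracy([CHOICES[i] for i in idxs])
    baseline = DECAY * baseline + (1 - DECAY) * reward
    advantage = reward - baseline
    for layer, i in enumerate(idxs):
        probs = softmax(logits[layer])
        for j in range(len(CHOICES)):
            # REINFORCE: nudge the log-probability of each sampled
            # choice in proportion to the advantage.
            logits[layer][j] += LR * advantage * ((j == i) - probs[j])

best = [CHOICES[l.index(max(l))] for l in logits]
print("best architecture found:", best)
```

After training, the controller concentrates its probability mass on the higher-reward choices, which is the essence of the method, independent of the toy reward.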

At NIPS, we had the Symposium on Recurrent Neural Networks and Other Machines that Learn Algorithms:

  • Jürgen Schmidhuber, Introduction to Recurrent Neural Networks and Other Machines that Learn Algorithms
  • Paul Werbos, Deep Learning in Recurrent Networks: From Basics To New Data on the Brain
  • Li Deng, Three Cool Topics on RNN
  • Risto Miikkulainen, Scaling Up Deep Learning through Neuroevolution
  • Jason Weston, New Tasks and Architectures for Language Understanding and Dialogue with Memory
  • Oriol Vinyals, Recurrent Nets Frontiers
  • Mike Mozer, Neural Hawkes Process Memories
  • Ilya Sutskever, Using a slow RL algorithm to learn a fast RL algorithm using recurrent neural networks (Arxiv)
  • Marcus Hutter, Asymptotically fastest solver of all well-defined problems
  • Nando de Freitas, Learning to Learn, to Program, to Explore and to Seek Knowledge (Video)
  • Alex Graves, Differentiable Neural Computer
  • Nal Kalchbrenner, Generative Modeling as Sequence Learning
  • Panel Discussion Topic: The future of machines that learn algorithms, Panelists: Ilya Sutskever, Jürgen Schmidhuber, Li Deng, Paul Werbos, Risto Miikkulainen, Sepp Hochreiter, Moderator: Alex Graves

Posters of the recent NIPS2016 workshop


Catalog of visualization tools

There are a lot of visualization-related tools out there. Here’s a simple categorized collection of what’s available, with a focus on the free and open source stuff.

This site features a curated selection of data visualization tools meant to bridge the gap between programmers/statisticians and the general public by only highlighting free/freemium, responsive and relatively simple-to-learn technologies for displaying both basic and complex, multivariate datasets. It leans heavily toward open-source software and plugins, rather than enterprise, expensive B.I. solutions.

I found some broken links, and the descriptions need a little editing, but it’s a good place to start.

Also, if you’re just starting out with visualization, you might find all the resources a bit overwhelming. If that’s the case, don’t fret. You don’t have to learn how to use all of them. Let your desired outcomes guide you. Here’s what I use.


Weekly Initial Unemployment Claims decrease to 234,000

The DOL reported:

In the week ending January 14, the advance figure for seasonally adjusted initial claims was 234,000, a decrease of 15,000 from the previous week’s revised level. The previous week’s level was revised up by 2,000 from 247,000 to 249,000. The 4-week moving average was 246,750, a decrease of 10,250 from the previous week’s revised average. This is the lowest level for this average since November 3, 1973 when it was 244,000. The previous week’s average was revised up by 500 from 256,500 to 257,000.

There were no special factors impacting this week’s initial claims.
(emphasis added)

The previous week was revised up.

The following graph shows the 4-week moving average of weekly claims since 1971.

The dashed line on the graph is the current 4-week average. The four-week average of weekly unemployment claims decreased to 246,750.
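The trailing four-week average is simple arithmetic. In the sketch below, only the last value (234,000) and the revised prior week (249,000) come from the release quoted above; the two earlier figures are assumptions chosen so the window reproduces the reported 246,750.

```python
def moving_average(series, window=4):
    """Trailing moving average; returns one value per full window."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

# Weekly initial claims, oldest first. The first two values are
# illustrative assumptions; the last two are from the DOL release.
claims = [255_000, 249_000, 249_000, 234_000]
print(moving_average(claims))  # → [246750.0]
```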

This was below the consensus forecast. This is the lowest level for the four-week average since 1973 (even though the population is much larger now).

The low level of claims suggests relatively few layoffs.

Counting is hard, especially when you don’t have theories

(This article was originally published at Big Data, Plainly Spoken (aka Numbers Rule Your World), and syndicated at StatsBlogs.)

In the previous post, I diagnosed one data issue with the IMDB dataset found on Kaggle. On average, the third-party face-recognition software undercounted the number of people on movie posters by 50%.
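An aggregate undercount figure like that 50% can be computed by comparing manual counts against the algorithm’s counts over a sample of posters. The counts below are entirely made up for illustration:

```python
# Hypothetical head counts for five posters: manual counts vs. the
# face-recognition algorithm's counts (all numbers made up).
manual = [6, 1, 3, 2, 4]   # e.g. six people on Better Luck Tomorrow
algo   = [1, 0, 3, 2, 2]   # the algorithm often finds far fewer

undercount = 1 - sum(algo) / sum(manual)
print(f"undercount: {undercount:.0%}")  # → undercount: 50%
```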

It turns out that counting the number of people on movie posters is a subjective activity. Reasonable people can disagree about the number of heads on some of those posters.

For example, here is a larger view of the Better Luck Tomorrow poster I showed yesterday:


By my count, there are six people on this poster. But notice the row of photos below the red title: someone could argue that there are more than six people on this poster. (Regardless, the algorithm is still completely wrong in this case, as it counted only one head.)

So one of the “rules” I followed when counting heads is to count only the people to whom the poster’s designer is drawing attention. Under this rule, I ignore the row of photos below the red title. Also under this rule, if a poster contains a main character and his or her shadow, I count that person once; and if a poster contains a number of people in the background, such as generic soldiers on a battlefield, I do not count them.

Another rule I used is to count the back or side of a person even if I could not see his or her face provided that this person is a main character of the movie. For example, the following Rocky Balboa poster has one person on it.


(The algorithm, by contrast, counted zero heads.)


According to the distribution of number of heads predicted by the algorithm, I learned that some posters may have dozens of people on them. So I pulled out these outliers and looked at them.

This poster of The Master (2012) is said to contain 31 people.


On closer look, this is a tessellation built from a triangle of faces. Should that count as three people or as many? And as the color fades toward the sides of the poster, should we count the barely visible faces?

Counting is harder than it seems.


The discussion above leads to an important issue in building models. The analyst must have some working theory about how X is related to Y. If the belief is that the number of faces on a movie poster affects movie-goers’ enthusiasm, then that theory guides us to count certain people but not others.


If one were to keep pushing on the rationale of using this face count data, one inevitably arrives at a dead end. Here are the top results from a Google Image Search on “The Master 2012 poster”:


Well, every movie is supported by a variety of posters: the bigger the movie, the bigger the marketing budget, and the more numerous the posters. There are two key observations from the above:

The blue tessellation is one of the key designs used for this movie. Within this design framework, some posters contain only three heads, some maybe a dozen heads, and some (like the one shown on IMDB) many dozens of heads.

Further, there are at least three other design concepts, completely different from the IMDB poster, showing different numbers of people!

Going back to the theory that movie-goers respond to the poster design (in particular, the number of people on the poster), the analyst now realizes that he or she has a huge hole in the dataset. Which of these posters did the movie-goer see? Does IMDB know which poster was seen most often?

Thus, not only are the counts subjective and imprecise, but it is not even clear we are analyzing the right posters.


Once I led the students down this path, almost everyone decided to drop this variable from the dataset.




Please comment on the article here: Big Data, Plainly Spoken (aka Numbers Rule Your World)

The post Counting is hard, especially when you don’t have theories appeared first on All About Statistics.

Systems Architecture of Smart Healthcare Cloud Applications and Services IoT System: General Architectural Theory at Work

A system is complex in that it comprises multiple views, such as strategy/version n, strategy/version n+1, concept, analysis, design, implementation, structure, behavior, and input/output data views. Accordingly, a system is defined as a set of interacting components forming an integrated whole of those multiple views.
Because the structure and behavior views are the two most prominent, integrating them is a practical method for integrating all of a system’s views. In other words, structure-behavior coalescence (SBC) yields a coalescence of the multiple views, which makes the SBC architecture well suited to modeling them.
In this book, we use the SBC architecture description language (SBC-ADL) to describe and represent the systems architecture of the Smart Healthcare Cloud Applications and Services IoT System (SHCASIS). An architecture description language is a special kind of system model used to define the architecture of a system. SBC-ADL uses six fundamental diagrams to formally capture both the essence of a system and its details: a) architecture hierarchy diagram, b) framework diagram, c) component operation diagram, d) component connection diagram, e) structure-behavior coalescence diagram, and f) interaction flow diagram.
Systems architecture is on the rise. Through this book’s introduction and elaboration of the systems architecture of SHCASIS, readers can see clearly how SBC-ADL helps architects perform architecting effectively and construct productive systems architectures.