This paper presents a remarkably simple, yet powerful, algorithm for robust Principal Component Analysis (PCA). In the proposed approach, an outlier is set apart from an inlier by comparing their coherence with the rest of the data points. As inliers lie on a low dimensional subspace, they are likely to have strong mutual coherence provided there are enough inliers. By contrast, outliers do not typically admit low dimensional structures, wherefore an outlier is unlikely to bear strong resemblance with a large number of data points. The mutual coherences are computed by forming the Gram matrix of normalized data points. Subsequently, the subspace is recovered from the span of a small subset of the data points that exhibit strong coherence with the rest of the data. As coherence pursuit only involves one simple matrix multiplication, it is significantly faster than the state-of-the-art robust PCA algorithms. We provide a mathematical analysis of the proposed algorithm under a random model for the distribution of the inliers and outliers. It is shown that the proposed method can recover the correct subspace even if the data is predominantly outliers. To the best of our knowledge, this is the first provable robust PCA algorithm that is simultaneously non-iterative, can tolerate a large number of outliers and is robust to linearly dependent outliers.
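The steps in the abstract (normalize, form the Gram matrix, keep the most coherent points, take their span) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' reference implementation: scoring each point by the squared l2 norm of its Gram column, the default of `2 * rank` selected points, and the function name `coherence_pursuit` are all assumptions made here.

```python
import numpy as np

def coherence_pursuit(data, rank, n_select=None):
    """Estimate a rank-dimensional subspace from the columns of `data`."""
    # Normalize every data point (column) to unit Euclidean norm.
    norms = np.linalg.norm(data, axis=0)
    X = data / np.maximum(norms, 1e-12)
    # Gram matrix of mutual coherences; zero the diagonal (self-coherence).
    G = X.T @ X
    np.fill_diagonal(G, 0.0)
    # Assumed scoring rule: squared l2 norm of each point's coherence vector.
    scores = np.sum(G ** 2, axis=0)
    # Keep the most coherent points and take an orthonormal basis of their span.
    if n_select is None:
        n_select = 2 * rank  # assumed default, not taken from the paper
    top = np.argsort(scores)[::-1][:n_select]
    U, _, _ = np.linalg.svd(X[:, top], full_matrices=False)
    return U[:, :rank]

# Synthetic check: 100 inliers on a 2-D subspace of R^20, 200 random outliers.
rng = np.random.default_rng(0)
dim, rank, n_in, n_out = 20, 2, 100, 200
basis, _ = np.linalg.qr(rng.standard_normal((dim, rank)))
inliers = basis @ rng.standard_normal((rank, n_in))
outliers = rng.standard_normal((dim, n_out))
data = np.hstack([inliers, outliers])

U = coherence_pursuit(data, rank)
# Part of the true basis left over after projecting onto the estimate.
residual = np.linalg.norm(basis - U @ (U.T @ basis))
```

A `residual` close to zero means the estimated basis spans the same subspace as the true one, even though outliers outnumber inliers two to one in this toy setup.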
The Grupo Bimbo Inventory Demand competition ran on Kaggle from June through August 2016. Over 2000 players on nearly as many teams competed to accurately forecast sales of Grupo Bimbo’s delicious bakery goods. Kaggler Alex Ryzhkov came in second place with his teammates Clustifier and Andrey Kiryasov. In this interview, Alex describes how he and his team spent 95% of their time feature engineering their way to the top of the leaderboard. Read how the team used pseudo-labeling, typically used in deep learning, to improve their final forecast.
What was your background prior to entering this challenge?
I graduated from Mathematical Methods of Forecasting department at Moscow State University in 2015. My scientific advisor was Alexander D’yakonov, who once was the Top-1 Kaggler worldwide, and I have learnt a lot of tips and tricks from him.
Do you have any prior experience or domain knowledge that helped you succeed in this competition?
Of course I have. I participated in the first rotation of the PZAD course held by Alexander D’yakonov, where we developed our practical skills in machine learning competitions. Moreover, after each competition I spent several days reading winning solutions and figuring out what I could have done better.
How did you get started competing on Kaggle?
Almost at the beginning of my education in the Mathematical Methods of Forecasting department in university I joined Kaggle and totally loved it.
What made you decide to enter this competition?
I enjoyed this competition in two ways. My passion is to work with time-series data and I have several qualification works on this type of data. The second reason is that I wanted to check how far I can go using Amazon AWS servers’ power.
Let’s get technical
What preprocessing and supervised learning methods did you use?
For this competition we used several XGBoost, FTRL, and FFM models, and the initial dataset was hugely increased by:
- different aggregations (mean, median, max, min etc.) of target and sales variables by week, product, client and town IDs;
- New_Client_ID feature (for example, all OXXO shops have the same ID in it instead of different ones in the dataset from Bimbo);
- features from products’ names like weight, brand, number of pieces, weight of each piece;
- Truncated SVD on TF-IDF matrix of client and product names
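As a rough illustration of the first kind of feature in the list above, here is a small pure-Python sketch of group-wise aggregation. The rows, column order and helper name `aggregate_by` are hypothetical and not taken from the team's code.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical rows: (week, client_id, product_id, demand).
rows = [
    (3, "c1", "p1", 4), (4, "c1", "p1", 6), (5, "c1", "p1", 5),
    (3, "c2", "p1", 10), (4, "c2", "p1", 14),
]

def aggregate_by(rows, key_index, value_index=3):
    """Group the values by one key column and summarize each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[value_index])
    return {
        key: {"mean": mean(v), "median": median(v), "max": max(v), "min": min(v)}
        for key, v in groups.items()
    }

client_stats = aggregate_by(rows, key_index=1)
# Attach the client-level mean and max to every row as new feature columns.
features = [
    row + (client_stats[row[1]]["mean"], client_stats[row[1]]["max"])
    for row in rows
]
```

The same helper can be pointed at the week or product column by changing `key_index`.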
What was your most important insight into the data?
Since the public-private test split was made by time (one week in public and the next week in private), we can’t use features with a lag of 1 when training our models. We ran experiments to check this, and models that use lag_1 features score 0.03-0.05 worse on the private leaderboard, in logloss terms, than models without these features.
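To make the lag argument concrete, here is a small sketch with made-up week numbers and demand values. The helper `lag_features` is hypothetical. It shows that a lag of 1 has no value for the second test week, while lags of 2 or more are available for it.

```python
# Hypothetical weekly demand for one (client, product) pair, keyed by week.
# Weeks 3-7 are labeled training weeks; weeks 8 and 9 are the two test weeks.
demand_by_week = {3: 4, 4: 6, 5: 5, 6: 7, 7: 8}

def lag_features(history, week, lags):
    """Return {"lag_k": demand k weeks before `week`}, or None when unknown."""
    return {f"lag_{k}": history.get(week - k) for k in lags}

MIN_LAG = 2  # smallest lag that is still known for the second test week

# Training row for week 7, built only from lags that week 9 can also use.
train_row = lag_features(demand_by_week, week=7, lags=range(MIN_LAG, MIN_LAG + 3))

# For the second test week (9), lag_1 points at week 8, which has no label.
private_lag_1 = lag_features(demand_by_week, week=9, lags=[1])
private_row = lag_features(demand_by_week, week=9, lags=range(MIN_LAG, MIN_LAG + 3))
```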
Were you surprised by any of your findings?
It was surprising that initial client IDs worked as well as their clustered version. In the beginning of the competition I had an opinion that the initial ones have too much diversity but for the final model we saved both of them in the dataset.
Which tools did you use?
For this competition we used XGBoost packages in Python and R, as well as a Python implementation of FTRL algorithm and the FFM library for regression problems. To run heavy models on the whole dataset, spot Amazon r3.8xlarge servers were the best variant – fast and with huge RAM.
How did you spend your time on this competition?
From my point of view, it was a feature engineering competition. After my first script with XGBoost, I spent all of my time on preprocessing client and products tables, working with towns and states, creating new aggregations on sales and target variables. So it was 95% of time for feature engineering and only 5% for machine learning.
What was the run time for both training and prediction of your winning solution?
If we run it on r3.8xlarge, it will take around 146 hours (6 days) including feature engineering, training and predicting steps.
Words of wisdom
What have you taken away from this competition?
It was really surprising that pseudo labeling techniques can work outside deep learning competitions. Also you should spend a lot of time thinking about your validation and prediction techniques – it can prevent you from losing your position in the end.
Do you have any advice for those just getting started in data science?
From my side, competitions with kernels enabled are the best teachers for beginners. You can find all variety of scripts there – from simple (like all zeros or random forest on the whole initial dataset) to advanced ones (blend of several models with preprocessing and feature engineering). It’s also useful to read topics on forum – you can get a number of ideas from other competitors’ posts. The last advice but in my opinion the best one – don’t give up!
How did your team form?
I was in top-20, when I got stuck and understood the necessity of new views and ideas to be in top-10 on private – at that time I merged with Clustifier and we started to work together. Later we joined with Andrey to be competitive with another top team – The Slippery Appraisals.
How did your team work together?
We had a chat in Skype (later in Google Hangouts), where we could discuss our ideas. Otherwise, all data was shared on Google Drive and we uploaded our first level submissions there. Moreover, I also shared my RStudio server on AWS with Clustifier, so we could easily work on the same files simultaneously.
How did competing on a team help you succeed?
Firstly, merging your ideas about one to two weeks before the end of competition increases your score. Secondly, you can exchange your ideas with teammates and each of them would implement those ideas in his own manner – this boosts your models even more. Finally, it’s a nice time to share experience and tips & tricks, which help you to go up and improve stability of your solution before private LB.
Just for fun
If you could run a Kaggle competition, what problem would you want to pose to other Kagglers?
It would be nice to create a small challenge with leaderboard shake up prediction. This topic is always popular on forums near the end of each competition.
What is your dream job?
Data scientist outside Russia
Alexander Ryzhkov graduated from the Mathematical Methods of Forecasting department at Moscow State University in 2015, where his scientific advisor was Alexander D’yakonov. He now works as a software developer at Deutsche Bank Technology Center (Moscow).
Many data centers pump water past server racks in order to mitigate exhaust heat given off by the equipment. This is a time-honored method for cooling data centers, but it’s not without risk. A leaky or burst pipe can cause irreparable damage to the servers and systems. Not only is it incredibly costly to replace damaged equipment, it’s also expensive to deal with water-damage-related outages and network downtime.
Benefits Of Using Symphony Link As A Water Leak Detection System For Data Centers
Symphony Link is a wireless low power, wide-area network (LPWAN) solution ideal for those who want to connect devices to the cloud easily, inexpensively, reliably, and securely. Symphony Link sensors are simply placed in potential “problem” areas that may experience leakage, and the information collected from those sensors is transmitted wirelessly online.
There are several reasons why Symphony Link may work well as a water leak detection system in data centers. Here are three:
- You can achieve a ten-year battery life. The longer your network nodes are able to last, the lower your power costs and maintenance costs will be. Both of these elements should be considered when selecting your water leak detection system.
- Symphony Link operates well in areas with a lot of interference. Electromagnetic interference (EMI) occurs regularly in data centers—but this interference doesn’t bother the Symphony connection.
- The network doesn’t rely on WiFi. This has numerous benefits. First, there are fewer security concerns associated with its integration. Also, the network can be closed-loop, so it doesn’t have to rely on WiFi if an alarm needs to sound.
Alternative Solutions For Water Leak Detection
Mesh network topologies are common in this space, but you may be required to use more gateways or sensors than you actually need to ensure you have enough link budget. The last thing you want to discover is that your mesh network couldn’t make a link, and therefore couldn’t alert you right away to a pipe leak. On the other hand, Symphony’s star-based topology can support thousands of endpoints per gateway.
WiFi-based sensor networks can have difficult IT integration requirements, which makes them a pain to set up. The power consumption is also significantly higher in WiFi than it is when using Symphony Link, which may be limiting in some use cases.
Wired sensor networks are almost always a good option, particularly if the data center is new and cost isn’t a major concern. Symphony Link is better suited for retrofits or customers looking to save costs.
Need a top-of-the-line water leak detection system for your data center? Let’s talk.
We have Symphony-enabled off-the-shelf detectors that you can purchase today. Additionally, Link Labs is interested in developing solutions with partners—so if you’re building out a solution that fits this use case, get in touch with us.
The post Using Symphony Link As A Water Leak Detection System For Data Centers appeared first on Link Labs.
Why You Get Different Results With Different Runs Of An Algorithm With The Same Data
Applied machine learning is a tapestry of breakthroughs and mindset shifts.
Understanding the role of randomness in machine learning algorithms is one of those breakthroughs.
Once you get it, you will see things differently. In a whole new light. Things like choosing between one algorithm and another, hyperparameter tuning and reporting results.
You will also start to see the abuses everywhere. The criminally unsupported performance claims.
In this post, I want to gently open your eyes to the role of random numbers in machine learning. I want to give you the tools to embrace this uncertainty. To give you a breakthrough.
Let’s dive in.
(special thanks to Xu Zhang and Nil Fero who promoted this post)
Why Are Results Different With The Same Data?
A lot of people ask this question or variants of this question.
You are not alone!
I get an email along these lines once per week.
Here are some similar questions posted to Q&A sites:
- Why do I get different results each time I run my algorithm?
- Cross-Validation gives different result on the same data
- Randomness in Artificial Intelligence & Machine Learning
- Why are the weights different in each running after convergence?
- Does the same neural network with the same learning data and same test data in two computers give different results?
Machine Learning Algorithms Use Random Numbers
Machine learning algorithms make use of randomness.
1. Randomness in Data Collection
Trained with different data, machine learning algorithms will construct different models. It depends on the algorithm. How different a model is with different data is called the model variance (as in the bias-variance trade off).
So, the data itself is a source of randomness. Randomness in the collection of the data.
2. Randomness in Observation Order
The order that the observations are exposed to the model affects internal decisions.
Some algorithms are especially susceptible to this, like neural networks.
It is good practice to randomly shuffle the training data before each training iteration. Even if your algorithm is not susceptible. It’s a best practice.
3. Randomness in the Algorithm
Algorithms harness randomness.
An algorithm may be initialized to a random state. Such as the initial weights in an artificial neural network.
Votes that end in a draw (and other internal decisions) during training in a deterministic method may rely on randomness to resolve.
4. Randomness in Sampling
We may have too much data to reasonably work with.
In which case, we may work with a random subsample to train the model.
5. Randomness in Resampling
We sample when we evaluate an algorithm.
We use techniques like splitting the data into a random training and test set or use k-fold cross validation that makes k random splits of the data.
The result is an estimate of the performance of the model (and process used to create it) on unseen data.
There’s no doubt, randomness plays a big part in applied machine learning.
The randomness that we can control, should be controlled.
Random Seeds and Reproducible Results
Run an algorithm on a dataset and get a model.
Can you get the same model again given the same data?
You should be able to. It should be a requirement that is high on the list for your modeling project.
We achieve reproducibility in applied machine learning by using the exact same code, data and sequence of random numbers.
Random numbers are generated in software using a pseudorandom number generator. It’s a simple math function that generates a sequence of numbers that are random enough for most applications.
This math function is deterministic. If it uses the same starting point called a seed number, it will give the same sequence of random numbers.
We can get reproducible results by fixing the random number generator’s seed before each model we construct.
In fact, this is a best practice.
We should be doing this if not already.
In fact, we should be giving the same sequence of random numbers to each algorithm we compare and each technique we try.
It should be a default part of each experiment we run.
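Here is a tiny sketch of the idea using Python’s standard `random` module. The function `noisy_run` is a made-up stand-in for a stochastic training run, not a real learning algorithm.

```python
import random

def noisy_run(seed):
    """Stand-in for a stochastic training run: returns three pseudorandom draws."""
    rng = random.Random(seed)  # private generator, seeded explicitly
    return [rng.random() for _ in range(3)]

same_a = noisy_run(seed=42)
same_b = noisy_run(seed=42)  # same seed, same sequence
other = noisy_run(seed=7)    # different seed, different sequence
```

Fixing the seed makes `same_a` and `same_b` identical, while `other` differs. Using a private `random.Random` instance rather than the module-level generator is a design choice made here so that other code cannot disturb the sequence.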
Machine Learning Algorithms are Stochastic
If a machine learning algorithm gives a different model with a different sequence of random numbers, then which model do we pick?
Ouch. There’s the rub.
I get asked this question from time to time and I love it.
It’s a sign that someone really gets to the meat of all this applied machine learning stuff – or is about to.
- Different runs of an algorithm with…
- Different random numbers give…
- Different models with…
- Different performance characteristics…
But the differences are within a range.
A fancy name for this difference or random behavior within a range is stochastic.
Machine learning algorithms are stochastic in practice.
- Expect them to be stochastic.
- Expect there to be a range of models to choose from and not a single model.
- Expect the performance to be a range and not a single value.
These are very real expectations that you MUST address in practice.
What tactics can you think of to address these expectations?
Tactics To Address The Uncertainty of Stochastic Algorithms
Thankfully, academics have been struggling with this challenge for a long time.
There are 2 simple strategies that you can use:
- Reduce the Uncertainty.
- Report the Uncertainty.
Tactics to Reduce the Uncertainty
If we get different models essentially every time we run an algorithm, what can we do?
How about we try running the algorithm many times and gather a population of performance measures.
We already do this if we use k-fold cross validation. We build k different models.
We can increase k and build even more models, as long as the data within each fold remains representative of the problem.
We can also repeat our evaluation process n times to get even more numbers in our population of performance measures.
This tactic is called random repeats or random restarts.
It is more prevalent with stochastic optimization and neural networks, but is just as relevant generally. Try it.
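A minimal sketch of random repeats might look like this. The function `evaluate_once` is a hypothetical placeholder for one full train-and-evaluate cycle, and the 0.80 centre and 0.02 spread are invented numbers.

```python
import random

def evaluate_once(seed):
    """Hypothetical stand-in for one train/evaluate cycle; returns a score."""
    rng = random.Random(seed)
    return 0.80 + rng.gauss(0, 0.02)

def repeated_evaluation(n_repeats, base_seed=0):
    # A different but fixed seed per repeat keeps the experiment reproducible.
    return [evaluate_once(base_seed + i) for i in range(n_repeats)]

scores = repeated_evaluation(n_repeats=30)
```

Each repeat gets its own seed, so the scores vary from run to run, yet the whole population can be regenerated exactly.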
Tactics to Report the Uncertainty
Never report the performance of your machine learning algorithm with a single number.
If you do, you’ve most likely made an error.
You have gathered a population of performance measures. Use statistics on this population.
This tactic is called report summary statistics.
The distribution of results is most likely a Gaussian, so a great start would be to report the mean and standard deviation of performance. Include the highest and lowest performance observed.
In fact, this is a best practice.
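For example, with a made-up population of ten accuracy scores, the summary could be computed with the standard `statistics` module:

```python
from statistics import mean, stdev

# Hypothetical accuracy scores from ten repeated evaluations.
scores = [0.81, 0.79, 0.80, 0.83, 0.78, 0.82, 0.80, 0.79, 0.81, 0.77]

summary = {
    "mean": mean(scores),
    "std": stdev(scores),  # sample standard deviation
    "min": min(scores),
    "max": max(scores),
}
report = "accuracy: {mean:.3f} +/- {std:.3f} (min {min:.2f}, max {max:.2f}, n={n})".format(
    n=len(scores), **summary
)
```

The `report` string carries the mean, the spread, the extremes and the number of runs in one line, which is much harder to misread than a single number.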
You can then compare populations of result measures when you’re performing model selection. Such as:
- Choosing between algorithms.
- Choosing between configurations for one algorithm.
You can see that this has important implications on the processes you follow. Such as: to select which algorithm to use on your problem and for tuning and choosing algorithm hyperparameters.
Lean on statistical significance tests. Statistical tests can determine whether one population of result measures is significantly different from a second population of results.
Report the significance as well.
This too is a best practice, that sadly does not have enough adoption.
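The post does not name a specific test. As one possible choice, here is a sketch of a two-sided permutation test on the difference of means, written with the standard library only. The score populations and the function name `permutation_test` are made up for illustration.

```python
import random
from statistics import mean

def permutation_test(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference of means; returns a p-value."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        # Reassign scores to the two groups at random.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical score populations for two algorithms.
algo_a = [0.81, 0.79, 0.80, 0.83, 0.78, 0.82, 0.80, 0.79]
algo_b = [0.85, 0.86, 0.84, 0.87, 0.85, 0.86, 0.84, 0.88]
p_value = permutation_test(algo_a, algo_b)
```

A small `p_value` suggests the gap between the two populations is unlikely to come from random variation alone.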
Wait, What About Final Model Selection
The final model is the one prepared on the entire training dataset, once we have chosen an algorithm and configuration.
It’s the model we intend to use to make predictions or deploy into operations.
We also get a different final model with different sequences of random numbers.
I’ve had some students ask:
Should I create many final models and select the one with the best accuracy on a hold-out validation dataset?
“No” I replied.
This would be a fragile process, highly dependent on the quality of the held out validation dataset. You are selecting random numbers that optimize for a small sample of data.
Sounds like a recipe for overfitting.
In general, I would rely on the confidence gained from the above tactics on reducing and reporting uncertainty. Often I just take the first model, it’s just as good as any other.
Sometimes your application domain makes you care more.
In this situation, I would tell you to build an ensemble of models, each trained with a different random number seed.
Use a simple voting ensemble. Each model makes a prediction and the mean of all predictions is reported as the final prediction.
Make the ensemble as big as you need to. I think 10, 30 or 100 are nice round numbers.
Maybe keep adding new models until the predictions become stable. For example, continue until the variance of the predictions tightens up on some holdout set.
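A toy sketch of such an ensemble is shown here. The function `train_model` is a hypothetical stochastic learner whose fitted weight wobbles around 2.0 depending on the seed; a real model would take its place.

```python
import random
from statistics import mean

def train_model(seed):
    """Hypothetical stochastic learner: a line y = w * x with a seed-dependent w."""
    rng = random.Random(seed)
    w = 2.0 + rng.gauss(0, 0.1)
    return lambda x: w * x

def ensemble_predict(models, x):
    # Average the member predictions to get the ensemble's output.
    return mean(model(x) for model in models)

# Thirty members, each trained with a different random seed.
models = [train_model(seed) for seed in range(30)]
single = models[0](1.0)
combined = ensemble_predict(models, 1.0)
```

Averaging over the members smooths out the seed-to-seed variation that any single model carries.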
In this post, you discovered why random numbers are integral to applied machine learning. You can’t really escape them.
You learned about tactics that you can use to ensure that your results are reproducible.
You learned about techniques that you can use to embrace the stochastic nature of machine learning algorithms when selecting models and reporting results.
For more information on the importance of reproducible results in machine learning and techniques that you can use, see the post:
Do you have any questions about random numbers in machine learning or about this post?
Ask your question in the comments and I will do my best to answer.
Source by Dulce Dotcom
We’ve been talking a lot about how innovative companies are realizing the need to enhance their solutions with more customer-facing data products. For example, GoToMeeting launched a new feature called “Insights” where they send you engagement summary information from your meetings. Here is one from a recent Juice Lunch & Learn:
DJ Patil (U.S. Chief Data Scientist) defines data products as “a product that facilitates an end goal through the use of data.” I’ve described data products as turning analytics inside-out to deliver value to your customers. But the question we most often get is: How are data products different from the customer reporting we already provide?
Here are ten important differences between customer reporting and data products:
1. Instead of summarizing data, solve a problem. Most reporting simply regurgitates data in a semi-aggregated format. A data product starts with customer’s pain points and asks how data can bring insight and better decisions.
2. Instead of starting from the data, start from the customer. Report writers will often look at the data available to them and ask “How can we deliver all this information?” That’s what gets you to self-service analytics tools sitting on top of data. Not good. Data products need to start by asking how you can make your customers smarter and more effective in their job.
3. Instead of stopping at showing the data, guide users to specific actions. Customer reporting may be satisfied with making data accessible. Data products need to do more — they need to move people to take action. Start from the end point: what kinds of things do you want your users to do? How will you give them the right information?
4. Instead of focusing on metric values, deliver context for decisions. Key metrics are only as good as the context you put around them. Data products wrap context around metrics with goals, benchmarks, comparison, and trends. Then your users will know how they should react to the numbers they are seeing.
5. Instead of passive objectivity, bake-in best practices, predictive models, and/or recommendations. “Let the data speak for itself” — that’s like a chef saying: let the diners enjoy the raw ingredients. Bring your expertise to the data product. Your customer knows their pain — but you know the data and what can be done with it.
6. Instead of trying to show more data, reduce to only the data needed. When it comes to presenting information, more data is seldom better. Customer reporting only expands, into dozens of dashboards or reports. Data products should strive for less.
7. Instead of putting the burden on users to figure it out, strive to reduce burden. Customer reporting tosses responsibility to the customer, effectively saying “you figure out what’s important to you.” Data products recognize that few people inherently enjoy messing with data; most people just want to be better at what they do. The data can facilitate that goal.
8. Instead of being designed for analysts, data products are designed for decision makers. Many customer reporting solutions assume the end-user wants to dig in and analyze the data. Data products are for a different audience: front-line decision-makers. These people are busy with their regular job and have little interest in learning something new.
9. Instead of “show me the data”, strive to make the data invisible. The best data products of the future will make the data invisible. Consider how Google Search tries to predict your need and point you to the best answer at the top of your search results. Google wants to hide the data (search results) and jump straight to the answer.
10. Instead of a cost-center for your business, become a profit center and differentiator. Customer reporting is considered a necessary evil for many companies. For example, we’ve had dozens of conversations with advertising agencies who feel compelled to provide reporting, but clearly don’t relish the task. In contrast, companies that view their customer data as an asset recognize that they can create new revenue streams.
That’s where Juicebox comes in — it is the quickest, best path to turn your data into differentiated, revenue-generating data products.
The world of business has changed with rapidly evolving market dynamics and high customer expectations. Modern customers demand fast, responsive and tailor-made services. To remain relevant in this changing landscape and to secure future growth, enterprises are adopting cloud service delivery models.
Cloud offers agile, flexible and scalable pay-per-use service delivery models. For today’s IT enterprises, cloud is a great operating model, offering automated tools that are fine-tuned for scalable performance, quick deployment and building business capabilities. By moving their IT enterprise to the cloud, organizations can explore next generation business models that ensure a streamlined efficient IT infrastructure as well as innovative channels for customer engagement, self-service social media tools for collaboration and business analytic tools for strategic planning. The most important thing to note here is that enterprises no longer look at cloud as an isolated technology but as a key platform for driving business growth. While there are many cloud solutions – ranging from private to public to hybrid cloud – the intent behind the adoption is always the same: better business performance. However, IT enterprises need to make a critical assessment of the associated benefits and risks before jumping onto the cloud bandwagon.
There is no doubt that moving the IT enterprise to the cloud can result in a number of tactical advantages for organizations. However, the move is not without some risks – the biggest of which are security and privacy. Public cloud in particular is an area that not many enterprises are ready to step into given its multi-tenant environment. As the 2013 Target data breach showed, not only are such incidents damaging in terms of information lost, but they also cost the company dearly in terms of reputation and customer perception. Regulatory compliance and governance is another concern with cloud models. Non-compliance with relevant industry regulations and local laws can result in legal problems for organizations. Added to these are concerns relating to accountability, control and even costs.
Cloud models are revolutionizing the way IT resources are managed, operated, and consumed. Cloud models help an enterprise to scale up the performance of its IT resources. This can be a true blessing for a small enterprise setting out on a growth path because it means that resources can be provisioned immediately when needed, suspended when no longer required, and most importantly, billed only when used. This on-demand service model enables enterprises to make judicious use of their infrastructure assets by lowering costs and increasing capacity to handle business peaks. Cloud also promotes greater optimization and utilization of IT assets, allowing enterprises to do more with less. This reduces capital expenses and saves significantly on costs. Thus, cloud lowers IT costs through faster, more flexible and cheaper delivery of IT services within an enterprise. It also provides better overall performance for IT operations by simplifying the underlying infrastructure resources to fewer standardized products, technologies and platforms. This reduces operational complexity on one hand, and gives organizations the much needed ability to build business capabilities on the other.
However, the move to the cloud is not without its pitfalls. The real question before organizations is that of balance. Do the risks outweigh the benefits? Are the benefits worth taking the risks? And most importantly, are the risks manageable? It is a good idea to spend some time and effort in structuring a cloud strategy that will help in achieving the desired cost, performance and business goals without compromising on security.
The post Managing your IT Enterprise in the cloud – Risk Factor or Efficiency Booster appeared first on Internet Of Things | IoT India.