Data Science Weekly – Issue 204



Curated news, articles and jobs related to Data Science. 
Keep up with all the latest developments

Issue #204

Oct 19 2017

Editor Picks

 

  • Google's Learning Software Learns to Write Software
    Artificial-intelligence researchers at Google are trying to automate the tasks of highly paid workers more likely to wear a hoodie than a coat and tie—themselves. In a project called AutoML, Google’s researchers have taught machine-learning software to build machine-learning software…
  • Your Brain Limits You to Just Five BFFs
    The number of people we can have meaningful contact with is limited by the size of our brains. Now this group seems to be subdivided into layers, say anthropologists…
  • Generalization in Deep Learning
    This paper explains why deep learning can generalize well, despite large capacity and possible algorithmic instability, nonrobustness, and sharp minima, effectively addressing an open problem in the literature…

 


 

A Message from this week's Sponsor:

 

 
Attend the Future Labs AI Summit in NYC on October 30 – 31

Two days of technical trainings and talks from leading executives at Google, NASA, Vector Institute, NYU, AI4ALL, and more. Day 1 courses will be taught by experts from Intel, Amazon Web Services (AWS), Insight Data Science, Paperspace, and NYU, and include courses such as intro to machine learning and deep learning, machine learning for statisticians, and building and deploying deep learning models on AWS. Day 2 presentations will cover investing in AI, democratizing AI, the current state of quantum computing, social and ethical impacts of the technology, and more. Sign up before prices rise on October 25!
 
 


 

Data Science Articles & Videos

 

  • Advice For New and Junior Data Scientists
    Two years ago, I shared my experience on doing data science in the industry. The writing was originally meant to be a private reflection for myself to celebrate my two year twitterversary at Twitter, but I instead published it on Medium because I believe it could be very useful for many aspiring data scientists. Fast forward to 2017, I have been working at Airbnb for a little bit less than two years and have recently become a senior data scientist — an industry title used to signal that one has acquired a certain level of technical expertise. As I reflect on my journey so far and imagine what’s next to come, I once again wrote down a few lessons that I wish I had known in the early days of my career….
  • Coloring B&W portraits with neural networks.
    Earlier this year, Amir Avni used neural networks to troll the subreddit /r/Colorization – a community where people colorize historical black and white images manually using Photoshop. They were astonished by Amir’s deep learning bot – what could take up to a month of manual labour could now be done in just a few seconds. I was fascinated by Amir’s neural network, so I reproduced it and documented the process. First off, let’s look at some of the results/failures from my experiments…
  • Neural Networks for Advertisers
    Recently I came across a problem that called for some sort of machine learning: the need to count the total time during which a specific company was advertised in various places at a football match…
  • Lifting from the Deep: Convolutional 3D Pose Estimation from a Single Image
    We propose a unified formulation for the problem of 3D human pose estimation from a single raw RGB image that reasons jointly about 2D joint estimation and 3D pose reconstruction to improve both tasks. We take an integrated approach that fuses probabilistic knowledge of 3D human pose with a multi-stage CNN architecture and uses the knowledge of plausible 3D landmark locations to refine the search for better 2D locations. The entire process is trained end-to-end, is extremely efficient and obtains state-of-the-art results on Human3.6M outperforming previous approaches both on 2D and 3D errors…

 


 

Jobs

 

  • Data Scientist – MealPal – New York

    Are you passionate about helping an organization make smart decisions in order to deliver the best product and user experience? Do you want to join a fast-paced, growing company? As a Data Scientist at MealPal, you will focus on using data to drive business strategy and take our company to the next level. You will have the opportunity to think critically and problem solve in order to drive valuable and executable insights…

 


 

Training & Resources

 

  • Streaming Dataframes
    This post describes a prototype project to handle continuous data sources of tabular data using Pandas and Streamz…

 


 

Books

 

 


 
P.S., Want to reach our audience / fellow readers? Consider sponsoring. We've just opened up booking for November & December – grab a spot now; first come first served! Email us for more details – All the best, Hannah & Sebastian

Copyright © 2013-2017 DataScienceWeekly.org, All rights reserved.

Pixel Visual Core: image processing and machine learning on Pixel 2


The camera on the new Pixel 2 is packed full of great hardware, software and machine learning (ML), so all you need to do is point and shoot to take amazing photos and videos. One of the technologies that helps you take great photos is HDR+, which makes it possible to get excellent photos of scenes with a large range of brightness levels, from dimly lit landscapes to a very sunny sky.

HDR+ produces beautiful images, and we’ve evolved the algorithm that powers it over the past year to use the Pixel 2’s application processor efficiently, and enable you to take multiple pictures in sequence by intelligently processing HDR+ in the background. In parallel, we’ve also been working on creating hardware capabilities that enable significantly greater computing power—beyond existing hardware—to bring HDR+ to third-party photography applications.

To expand the reach of HDR+, handle the most challenging imaging and ML applications, and deliver lower-latency and even more power-efficient HDR+ processing, we’ve created Pixel Visual Core.

Pixel Visual Core is Google’s first custom-designed co-processor for consumer products. It’s built into every Pixel 2, and in the coming months, we’ll turn it on through a software update to enable more applications to use Pixel 2’s camera for taking HDR+ quality pictures.

Magnified image of Pixel Visual Core

Let’s delve into the details for you technical folks out there: The centerpiece of Pixel Visual Core is the Google-designed Image Processing Unit (IPU)—a fully programmable, domain-specific processor designed from scratch to deliver maximum performance at low power. With eight Google-designed custom cores, each with 512 arithmetic logic units (ALUs), the IPU delivers raw performance of more than 3 trillion operations per second on a mobile power budget. Using Pixel Visual Core, HDR+ can run 5x faster and at less than one-tenth the energy than running on the application processor (AP). A key ingredient to the IPU’s efficiency is the tight coupling of hardware and software—our software controls many more details of the hardware than in a typical processor. Handing more control to the software makes the hardware simpler and more efficient, but it also makes the IPU challenging to program using traditional programming languages. To avoid this, the IPU leverages domain-specific languages that ease the burden on both developers and the compiler: Halide for image processing and TensorFlow for machine learning. A custom Google-made compiler optimizes the code for the underlying hardware.
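As a rough sanity check on those figures, the quoted throughput and ALU count imply a particular clock rate. The sketch below is a back-of-envelope calculation, not from the article, and it assumes one operation per ALU per cycle (real ALUs often count a multiply-accumulate as two operations, which would halve the implied clock):

```python
# Back-of-envelope check of the IPU throughput figure quoted above.
# Assumption (not from the article): each ALU retires one operation per cycle.
cores = 8
alus_per_core = 512
total_alus = cores * alus_per_core              # 4096 parallel ALUs

target_ops_per_sec = 3e12                       # "more than 3 trillion ops/s"
implied_clock_hz = target_ops_per_sec / total_alus

print(f"{total_alus} ALUs -> implied clock of roughly "
      f"{implied_clock_hz / 1e6:.0f} MHz")
```

Under these assumptions the figure works out to a clock in the 700–800 MHz range, which is a plausible mobile co-processor frequency.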

In the coming weeks, we’ll enable Pixel Visual Core as a developer option in the developer preview of Android Oreo 8.1 (MR1). Later, we’ll enable it for all third-party apps using the Android Camera API, giving them access to the Pixel 2’s HDR+ technology. We can’t wait to see the beautiful HDR+ photography that you already get through your Pixel 2 camera become available in your favorite photography apps.

HDR+ will be the first application to run on Pixel Visual Core. Notably, because Pixel Visual Core is programmable, we’re already preparing the next set of applications. The great thing is that as we port more machine learning and imaging applications to use Pixel Visual Core, Pixel 2 will continuously improve. So keep an eye out!




Introducing custom pipelines and extensions for spaCy v2.0


As the release candidate for spaCy v2.0 gets closer, we’ve been excited to implement some of the last outstanding features. One of the best improvements is a new system for adding pipeline components and registering extensions to the Doc, Span and Token objects. In this post, we’ll introduce you to the new functionality, and finish with an example extension package, spacymoji.

spaCy v2.0 alpha: The new version of spaCy is available on pip via spacy-nightly. To try out the examples in this post, you need the latest version, 2.0.0a17+. See this page for details on the new features. For an overview of the new models, see the models directory.

Previous versions of spaCy have been fairly difficult to extend. This has been especially true of the core Doc, Token and Span objects. They’re not instantiated directly, so creating a useful subclass would involve a lot of ugly abstraction (think FactoryFactoryConfigurationFactory classes). Inheritance is also unsatisfying, because it gives no way to compose different customisations. We want to let people develop extensions to spaCy, and we want to make sure those extensions can be used together. If every extension required spaCy to return a different Doc subclass, there would be no way to do that. To solve this problem, we’re introducing a new dynamic field that allows new attributes, properties and methods to be added at run-time:

import spacy
from spacy.tokens import Doc

Doc.set_extension('is_greeting', default=False)

nlp = spacy.load('en')
doc = nlp(u'hello world')
doc._.is_greeting = True

We think the ._ attribute strikes a nice balance between readability and explicitness. Extensions need to be nice to use, but it should also be obvious what is and isn’t built-in – otherwise there’s no way to track down the documentation or implementation of the code you’re reading. The ._ attribute also makes sure that updates to spaCy won’t break extension code through namespace conflicts.

The other thing that’s been missing for extension development was a convenient way of modifying the processing pipeline. Early versions of spaCy hard-coded the pipeline, because only English was supported. spaCy v1.0 allowed the pipeline to be changed at run-time, but this has been mostly hidden away from the user: you’d call nlp on a text and stuff happens – but what? If you needed to add a process that should run between tagging and parsing, you’d have to dig into spaCy’s internals. In spaCy v2.0 there’s finally an API for that, and it’s as simple as:

nlp = spacy.load('en')
component = MyComponent()
nlp.add_pipe(component, after='tagger')
doc = nlp(u"This is a sentence")

Fundamentally, a pipeline is a list of functions called on a Doc in order. The pipeline can be set by a model, and modified by the user. A pipeline component can be a complex class that holds state, or a very simple Python function that adds something to a Doc and returns it. Under the hood, spaCy performs the following steps when you call nlp on a string of text:

doc = nlp.make_doc(u'This is a sentence')   # create a Doc from raw text
for name, proc in nlp.pipeline:             # iterate over components in order
    doc = proc(doc)                         # call each component on the Doc

The nlp object is an instance of Language, which contains the data and annotation scheme of the language you’re using and a pre-defined pipeline of components, like the tagger, parser and entity recognizer. If you’re loading a model, the Language instance also has access to the model’s binary data. All of this is specific to each model, and defined in the model’s meta.json – for example, a Spanish NER model requires different weights, language data and pipeline components than an English parsing and tagging model. This is also why the pipeline state is always held by the Language class. spacy.load() puts this all together and returns an instance of Language with a pipeline set and access to the binary data.

A spaCy pipeline in v2.0 is simply a list of (name, function) tuples, describing the component name and the function to call on the Doc object:

>>> nlp.pipeline
[('tagger', <spacy.pipeline.Tagger>), ('parser', <spacy.pipeline.DependencyParser>),
 ('ner', <spacy.pipeline.EntityRecognizer>)]

To make it more convenient to modify the pipeline, there are several built-in methods to get, add, replace, rename or remove individual components. spaCy’s default pipeline components, like the tagger, parser and entity recognizer now all follow the same, consistent API and are subclasses of Pipe. If you’re developing your own component, using the Pipe API will make it fully trainable and serializable. At a minimum, a component needs to be a callable that takes a Doc and returns it:

def my_component(doc):
    print("The doc is {} characters long and has {} tokens."
          .format(len(doc.text), len(doc)))
    return doc

The component can then be added at any position of the pipeline using the nlp.add_pipe() method. The arguments before, after, first, and last let you specify component names to insert the new component before or after, or tell spaCy to insert it first (i.e. directly after tokenization) or last in the pipeline.

nlp = spacy.load('en')
nlp.add_pipe(my_component, name='print_length', last=True)
doc = nlp(u"This is a sentence.")

When you implement your own pipeline components that modify the Doc, you often want to extend the API, so that the information you’re adding is conveniently accessible. spaCy v2.0 introduces a new mechanism that lets you register your own attributes, properties and methods that become available in the ._ namespace, for example, doc._.my_attr.

Why ._? Writing to a ._ attribute instead of to the Doc directly keeps a clearer separation and makes it easier to ensure backwards compatibility. For example, if you’ve implemented your own .coref property and spaCy claims it one day, it’ll break your code. Similarly, just by looking at the code, you’ll immediately know what’s built-in and what’s custom – for example, doc.sentiment is spaCy, while doc._.sent_score isn’t.

There are mostly three types of extensions that can be registered via the set_extension() method:

  1. Attribute extensions. Set a default value for an attribute, which can be overwritten.
  2. Property extensions. Define a getter and an optional setter function.
  3. Method extensions. Assign a function that becomes available as an object method.
Doc.set_extension('hello_attr', default=True)
Doc.set_extension('hello_property', getter=get_value, setter=set_value)
Doc.set_extension('hello_method', method=lambda doc, name: 'Hi {}!'.format(name))

doc._.hello_attr            # True
doc._.hello_property        # return value of get_value
doc._.hello_method('Ines')  # 'Hi Ines!'

Being able to easily write custom data to the Doc, Token and Span means that applications using spaCy can take full advantage of the built-in data structures and the benefits of Doc objects as the single source of truth containing all information:

  • No information is lost during tokenization and parsing, so you can always relate annotations to the original string.
  • The Token and Span are views of the Doc, so they’re always up-to-date and consistent.
  • Efficient C-level access is available to the underlying TokenC* array via doc.c.
  • APIs can standardise on passing around Doc objects, reading and writing from them whenever necessary. Fewer signatures make functions more reusable and composable.

For example, let’s say your data contains geographical information like country names, and you’re using spaCy to extract those names and add more details, like the country’s capital or GPS coordinates. Or maybe your application needs to find names of public figures using spaCy’s named entity recognizer, and check if a page about them exists on Wikipedia.

Before, you’d usually run spaCy over your text to get the information you’re interested in, save it to a database and add more data to it later. This worked well, but it also meant that you lost all references to the original document. Alternatively, you could serialize your document and store the additional data with references to their respective token indices. Again, this worked well, but it was a pretty unsatisfying solution overall. In spaCy v2.0, you can simply write all this data to custom attributes on a document, token or span, using a name of your choice. For example, token._.country_capital, span._.wikipedia_url or doc._.included_persons.

The following example shows a simple pipeline component that fetches all countries using the REST Countries API, finds the country names in the document, merges the matched spans, assigns the entity label GPE (geopolitical entity) and adds the country’s capital, latitude/longitude coordinates and a boolean is_country to the token attributes. You can also find a more detailed version on GitHub.

import requests
from spacy.tokens import Token, Span
from spacy.matcher import PhraseMatcher

class Countries(object):
    name = 'countries'  # component name shown in pipeline

    def __init__(self, nlp, label='GPE'):
        # request all country data from the API
        r = requests.get('https://restcountries.eu/rest/v2/all')
        self.countries = {c['name']: c for c in r.json()}  # create dict for easy lookup
        # initialise the matcher and add patterns for all country names
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add('COUNTRIES', None, *[nlp(c) for c in self.countries.keys()])
        self.label = nlp.vocab.strings[label] # get label ID from vocab
        # register extensions on the Token
        Token.set_extension('is_country', default=False)
        Token.set_extension('country_capital', default=None)
        Token.set_extension('country_latlng', default=None)

    def __call__(self, doc):
        matches = self.matcher(doc)
        spans = []  # keep the spans for later so we can merge them afterwards
        for _, start, end in matches:
            # create Span for matched country and assign label
            entity = Span(doc, start, end, label=self.label)
            spans.append(entity)
            for token in entity:  # set values of token attributes
                token._.set('is_country', True)
                token._.set('country_capital', self.countries[entity.text]['capital'])
                token._.set('country_latlng', self.countries[entity.text]['latlng'])
        doc.ents = list(doc.ents) + spans  # overwrite doc.ents and add entities – don't replace!
        for span in spans:
            span.merge()  # merge all spans at the end to avoid mismatched indices
        return doc  # don't forget to return the Doc!

The example also uses spaCy’s PhraseMatcher, which is another cool feature introduced in v2.0. Instead of token patterns, the phrase matcher can take a list of Doc objects, letting you match large terminology lists fast and efficiently. When you add the component to the pipeline and process a text, all countries are automatically labelled as GPE entities, and the custom attributes are available on the token:

nlp = spacy.load('en')
component = Countries(nlp)
nlp.add_pipe(component, before='tagger')
doc = nlp(u"Some text about Colombia and the Czech Republic")

print([(ent.text, ent.label_) for ent in doc.ents])
# [('Colombia', 'GPE'), ('Czech Republic', 'GPE')]

print([(token.text, token._.country_capital) for token in doc if token._.is_country])
# [('Colombia', 'Bogotá'), ('Czech Republic', 'Prague')]

Using getters and setters, you can also implement attributes on the Doc and Span that reference custom Token attributes – for example, whether a document contains countries. Since the getter is only called when you access the attribute, you can refer to the Token’s is_country attribute here, which is already set in the processing step. For a complete implementation, see the full example.

Other ideas: In this case, we are able to fetch all data with one request to the REST API. However, you can also implement API requests via getter functions on individual objects, or add a method attribute to pass in additional parameters. Or how about a Token method that takes another country name or GPS coordinates, and computes the distance to the token’s country? This is all possible now!

has_country = lambda tokens: any([token._.is_country for token in tokens])
Doc.set_extension('has_country', getter=has_country)
Span.set_extension('has_country', getter=has_country)

Having a straightforward API for custom extensions and a clearly defined input/output (Doc in, Doc out) also helps make larger code bases more maintainable, and allows developers to share their extensions with others and test them reliably. This is relevant not only for teams working with spaCy, but also for developers looking to publish their own packages, extensions and plugins.

We’re hoping that this new architecture will help encourage a community ecosystem of spaCy components to cover any potential use case – no matter how specific. Components can range from simple extensions adding fairly trivial attributes for convenience, to complex models making use of external libraries such as PyTorch, scikit-learn and TensorFlow. There are many components users may want, and we’d love to be able to offer more built-in pipeline components shipped with spaCy – for example, better sentence boundary detection, semantic role labelling and sentiment analysis. But there’s also a clear need for making spaCy extensible for specific use cases, making it interoperate better with other libraries, and putting all of it together to update and train statistical models.

Adding better emoji support to spaCy has long been on my list of “cool things to build sometime”. Emoji are fun, hold a lot of relevant semantic information and, supposedly, are now more common in Twitter text than hyphens. Over the past two years, they have also become vastly more complex. Aside from the regular emoji characters and their unicode representations, you can now also use skin tone modifiers that are placed after a regular emoji, and result in one visible character. For example, 👍 + 🏿 = 👍🏿. In addition, some characters can form “ZWJ sequences”, e.g. two or more emoji joined by a Zero Width Joiner (U+200D) that are merged into one symbol. For example, 👨 + ZWJ + 🎤 = 👨‍🎤 (official title is “man singer”, I call it “Bowie”).
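These composition rules can be seen directly at the code-point level with nothing but the standard library, independent of spaCy:

```python
# Skin tone modifier: base emoji + modifier render as one visible glyph,
# but remain two Unicode code points.
thumbs_up = chr(0x1F44D)            # 👍
dark_skin = chr(0x1F3FF)            # 🏿 (EMOJI MODIFIER FITZPATRICK TYPE-6)
combined = thumbs_up + dark_skin    # 👍🏿
print(len(combined))                # 2 code points, one glyph

# ZWJ sequence: man + ZERO WIDTH JOINER + microphone = "man singer".
zwj = "\u200D"
man_singer = chr(0x1F468) + zwj + chr(0x1F3A4)
print(len(man_singer))              # 3 code points, one glyph
```

This is exactly why a tokenizer (or any character-based pipeline) has to merge these sequences back together before treating an emoji as a single unit.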

As of v2.0, spaCy’s tokenizer splits all emoji and other symbols into individual tokens, making them easier to separate from the rest of your text. However, emoji unicode ranges are fairly arbitrary and updated often. The \p{Other_Symbol} or \p{So} category, which spaCy’s tokenizer uses, is a good approximation, but it also includes other icons and dingbats. So if you want to handle only emoji, there’s no way around matching against an exact list. Luckily, the emoji package has us covered here.
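The \p{So} approximation can be reproduced with the standard library’s unicodedata module. This sketch (assumed here for illustration, not spaCy’s actual implementation) matches every character in the “Symbol, other” category, which includes emoji but also the extra icons and dingbats the paragraph mentions:

```python
import unicodedata

def is_symbol_other(ch):
    # True if ch falls in the Unicode "So" (Symbol, other) category,
    # the category spaCy's tokenizer uses as an emoji approximation.
    return unicodedata.category(ch) == "So"

text = "I give this a " + chr(0x1F44D) + " and 5 stars *"
symbols = [ch for ch in text if is_symbol_other(ch)]
print(symbols)  # only the thumbs-up; '*' is Po, letters are L*, digits Nd
```

Since “So” over-matches for emoji purposes, an exact lookup list (as provided by the emoji package) is still needed for strict emoji handling.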

spacymoji is a spaCy extension and pipeline component that detects individual emoji and sequences in your text, merges them into one token and assigns custom attributes to the Doc, Span and Token. For example, you can check if a document or span includes an emoji, check whether a token is an emoji and retrieve its human-readable description.

import spacy
from spacymoji import Emoji

nlp = spacy.load('en')
emoji = Emoji(nlp)
nlp.add_pipe(emoji, first=True)

doc = nlp(u"This is a test 😻 👍🏿")
assert doc._.has_emoji
assert len(doc._.emoji) == 2
assert doc[2:5]._.has_emoji
assert doc[4]._.is_emoji
assert doc[5]._.emoji_desc == u'thumbs up dark skin tone'
assert doc._.emoji[1] == (u'👍🏿', 5, u'thumbs up dark skin tone')

Pipeline position: By adding the component as the first in the pipeline, the spans are merged right after tokenization, and before the document is parsed. If your text contains a lot of emoji, this might even give you a nice boost in parser accuracy, as the parser only gets to see one token per emoji.

The spacymoji component uses the PhraseMatcher to find occurrences of the exact emoji sequences in the emoji lookup table and generates the respective emoji spans. It also merges them into one token if the emoji consists of more than one character – for example, an emoji with a skin tone modifier or a combined ZWJ sequence. The emoji shortcut, e.g. :thumbs_up:, is converted to a human-readable description, available as token._.emoji_desc. You can also pass in your own lookup table, mapping emoji to custom descriptions.
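The custom lookup-table idea can be sketched without spaCy at all: just a dict from emoji sequence to description. The entries and the describe helper below are purely illustrative, not spacymoji’s actual table or API:

```python
# Hypothetical custom lookup table mapping emoji sequences to descriptions,
# of the kind you could pass in to override the default descriptions.
lookup = {
    chr(0x1F63B): "happy cat",
    chr(0x1F44D) + chr(0x1F3FF): "thumbs up dark skin tone",
}

def describe(seq, table):
    # Fall back to a generic label for emoji not in the table.
    return table.get(seq, "unknown emoji")

print(describe(chr(0x1F63B), lookup))
```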

If you feel inspired and want to build your own extension, see this guide for some tips, tricks and best practices. With the growth of deep learning tools and techniques, there are now lots of models for predicting various types of NLP annotations. Models for tasks like coreference resolution, information extraction and summarization can now easily be used to power spaCy extensions – all you have to do is add the extension attributes, and hook the model into the pipeline. We’re looking forward to seeing what you build!


Data Science Weekly – Issue 203

Curated news, articles and jobs related to Data Science. 
Keep up with all the latest developments

Issue #203

Oct 12 2017

Editor Picks

 

  • Phone-Powered AI Spots Sick Plants with Remarkable Accuracy
    Researchers have developed a smartphone-based program that can automatically detect diseases in the cassava plant—the most widely grown root crop on Earth—with darn near 100 percent accuracy. It’s a glimpse at a future in which farmers in the developing world trade the expertise of a handful of specialists for increasingly omnipresent and powerful technology…

 


 

A Message from this week's Sponsor:

 

 
Attend the Future Labs AI Summit in NYC on October 30 – 31

Two days of technical trainings and talks from leading executives at Google, NASA, Vector Institute, NYU, AI4ALL, and more. Day 1 courses will be taught by experts from Intel, Amazon Web Services (AWS), Insight Data Science, Paperspace, and NYU, and include courses such as intro to machine learning and deep learning, machine learning for statisticians, and building and deploying deep learning models on AWS. Day 2 presentations will cover investing in AI, democratizing AI, the current state of quantum computing, social and ethical impacts of the technology, and more. Sign up before prices rise on October 25!
 
 


 

Data Science Articles & Videos

 

  • Behind the Magic: How we built the ARKit Sudoku Solver
    A few weeks ago my company, Hatchlings, released Magic Sudoku for iOS11. It’s an app that solves sudoku puzzles using a combination of Computer Vision, Machine Learning, and Augmented Reality. Many people have asked me about the app so I thought it would be fun to share some behind the scenes of how and why we built it…
  • Visualizing gender and race inequality in newsrooms
    Our latest project in the collaboration with Google News Lab is an exploration of gender and race in U.S. news publications. It was designed by Polygraph based on data from the American Society of News Editors (ASNE), which has also published an article about it…
  • #RecSys2017 summaries and reviews
    Every year after RecSys, our community takes the time to reflect and write summaries and reviews of the conference. Whether you could not make it to the conference, or you missed a session, it is always a good idea to keep an eye on what others are thinking. Here goes the list of all the #RecSys2017 summaries published so far…
  • GANs are Broken in More than One Way: The Numerics of GANs
    Last year, when I was on a mission to "fix GANs" I had a tendency to focus only on what the loss function is, and completely disregard the issue of how we actually find a minimum. Here is the paper that has finally challenged that attitude…
  • Interactions in fraud experiments: A case study in multivariable testing
    A while ago we observed something curious when we ran a set of simultaneous A/B tests around multiple antifraud features. These tests were to improve our passengers’ ride payment experience and our ability to collect fares to pay our drivers. The features centered around the temporary authorization hold we use to determine if a passenger has enough money for a Lyft ride…

 


 

Jobs

 

  • Data Scientist – MealPal – New York

    Are you passionate about helping an organization make smart decisions in order to deliver the best product and user experience? Do you want to join a fast-paced, growing company? As a Data Scientist at MealPal, you will focus on using data to drive business strategy and take our company to the next level. You will have the opportunity to think critically and problem solve in order to drive valuable and executable insights…

 


 

Training & Resources

 

  • 3Blue1Brown
    A channel about animating math. Check out the "Recommended" playlist for some thought-provoking one-off topics, and take a look at the "Essence of linear algebra" for some more student-focused material…

 


 

Books

 

 


 
P.S., Want to reach our audience / fellow readers? Consider sponsoring. We've just opened up booking for November & December – grab a spot now; first come first served! Email us for more details – All the best, Hannah & Sebastian

Follow on Twitter
Copyright © 2013-2017 DataScienceWeekly.org, All rights reserved.
unsubscribe from this list    update subscription preferences 


Data Science Weekly – Issue 202

By | machinelearning, TensorFlow

Data Science Weekly – Issue 202

Curated news, articles and jobs related to Data Science. 
Keep up with all the latest developments

Issue #202

Oct 5 2017

Editor Picks

 

  • What Happens When Algorithms Design a Concert Hall
    The auditorium—the largest of three concert halls in the Elbphilharmonie—is a product of parametric design, a process by which designers use algorithms to develop an object’s form. Algorithms have helped design bridges, motorcycle parts, typefaces—even chairs. In the case of the Elbphilharmonie, Herzog and De Meuron used algorithms to generate a unique shape for each of the 10,000 gypsum fiber acoustic panels that line the auditorium's walls like the interlocking pieces of a giant, undulating puzzle…

 


 

A Message from this week's Sponsor:

 

 
Big Data LDN 2017

The UK’s largest free-to-attend Big Data Conference and Exhibition takes place at Olympia London. Big Data LDN will host leading global data and analytics experts, ready to arm you with the tools you need to deliver the most effective data-driven strategy, across five free-to-attend conference theatres. The keynote programme is opened by Confluent CTO Neha Narkhede and Cloudera CTO Amr Awadallah. Complementing the free conference is an exhibition of approximately 80 global technology providers in the big data and analytics space, who will be hosting live product demos and will be able to talk to you about your product requirements.

Join the #BigDataRevolution by securing your free ticket to Big Data LDN
 
 


 

Data Science Articles & Videos

 

  • How AI Will Keep You Healthy
    An audacious Chinese entrepreneur wants to test your body for everything. But are computers really smart enough to make sense of all that data?…
  • Germany’s election and the trouble with correlation
    Readers of the Financial Times will be familiar with the aphorism that “correlation is not causation”, referring to the fact that a statistical association between two measures cannot be taken as evidence of a causal link. But what about when correlation is not even correlation?…
  • Why I don’t like Jupyter Notebooks
    We’ve had a number of tickets recently asking about running Jupyter Notebooks on Legion/Grace. Until the architecture of the Jupyter Notebook changes this will never be a good/safe idea. This sparked a discussion which descended into an argument between James and myself on the internal Slack about whether it is appropriate to encourage new researchers to use Jupyter notebooks. Because I like having the last word, I’m going to present James’ arguments first…
  • WaveNet launches in the Google Assistant
    Just over a year ago we presented WaveNet, a new deep neural network for generating raw audio waveforms that is capable of producing better and more realistic-sounding speech than existing techniques. At that time, the model was a research prototype and was too computationally intensive to work in consumer products. But over the last 12 months we have worked hard to significantly improve both the speed and quality of our model and today we are proud to announce that an updated version of WaveNet is being used to generate the Google Assistant voices for US English and Japanese across all platforms…
  • College Closure Risk
    College closure can cause significant distress to currently enrolled students as they have to transfer to a new school where their previous credits may not be honored. Given the huge public and private investments into post-secondary education, I developed a model that tries to predict whether a college would close by 2017 from Department of Education data from 2013. During this time, there were several high profile closures of colleges, particularly for-profit private colleges with shady practices…
  • Flood Water Detection … with A Semi-Supervised U-Net
    This project was started as part of the Metis/DigitalGlobe Data Challenge, but when DigitalGlobe released a substantial dataset of the areas just hit by Hurricane Harvey (through its Open Data Program), the project quickly evolved into something more immediately relevant, albeit utilizing a dataset in a less clean state than the SpaceNet dataset I had been planning to work with…

 


 

Jobs

 

  • Associate Data Scientist / Pricing Analyst – Expedia Inc – Bellevue WA

    The online travel market never stands still. At Expedia Affiliate Network (EAN), we're smack in the middle of it! We are an entrepreneurial start-up operating in the B2B market inside the world's biggest travel company. We create the tools and technology that help millions of travelers find the perfect hotels for their next trips. As the world's largest and fastest-growing affiliate network, we work with over 10,000 partners in 33 countries to turn their web traffic into hotel bookings and satisfied and loyal customers.

    We have the chance to work in the EAN's main global headquarters with the brightest minds in the travel business in an energetic and international work environment focused on innovation, creative problem-solving and collaboration between functional teams.

    We are the revenue optimisation team and we are responsible for identifying and driving significant revenue growth for the organisation. We are searching for an analyst to help us build on the strong growth results we have delivered to date.

 


 

Training & Resources

 

  • Scipy Lecture Notes
    Tutorials on the scientific Python ecosystem: a quick introduction to central tools and techniques. Each chapter corresponds to a 1-to-2-hour course, with increasing levels of expertise from beginner to expert…

 


 

Books

 

  • Reproducible Research with R and R Studio

    "a very practical book that teaches good practice in organizing reproducible data analysis and comes with a series of examples…"

    For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.

 


 

First contact with TensorBoard

By | machinelearning, TensorFlow

TensorBoard is a suite of visualization tools that lets you visualize your TensorFlow/Keras graph, plot quantitative metrics about the execution of your graph, and show additional data like images that pass through it (*). TensorBoard operates by reading TensorFlow event files, which contain summary data that you can generate when […]

The post First contact with TensorBoard appeared first on Jordi Torres – Professor and Researcher at UPC & BSC: Supercomputing for Artificial Intelligence and Deep Learning.


Introducing faster GPUs for Google Compute Engine

By | machinelearning, TensorFlow

Today, we’re happy to make some massively parallel announcements for Cloud GPUs. First, Google Cloud Platform (GCP) gets another performance boost with the public launch of NVIDIA P100 GPUs in beta. Second, NVIDIA K80 GPUs are now generally available on Google Compute Engine. Third, we’re happy to announce the introduction of sustained use discounts on both the K80 and P100 GPUs.

Cloud GPUs can accelerate your workloads including machine learning training and inference, geophysical data processing, simulation, seismic analysis, molecular modeling, genomics and many more high performance compute use cases.

The NVIDIA Tesla P100 is the state of the art in GPU technology. Based on the Pascal GPU architecture, it lets you increase throughput with fewer instances while saving money. P100 GPUs can accelerate your workloads by up to 10x compared to K80.¹

Compared to traditional solutions, Cloud GPUs provide an unparalleled combination of flexibility, performance and cost-savings:

  • Flexibility: Google’s custom VM shapes and incremental Cloud GPUs provide the ultimate amount of flexibility. Customize the CPU, memory, disk and GPU configuration to best match your needs.  
  • Fast performance: Cloud GPUs are offered in passthrough mode to provide bare-metal performance. Attach up to 4 P100 or 8 K80 GPUs per VM (we offer up to 4 K80 boards, each with 2 GPUs). For those looking for higher disk performance, optionally attach up to 3TB of Local SSD to any GPU VM. 
  • Low cost: With Cloud GPUs you get the same per-minute billing and Sustained Use Discounts that you do for the rest of GCP’s resources. Pay only for what you need! 
  • Cloud integration: Cloud GPUs are available at all levels of the stack. For infrastructure, Compute Engine and Google Container Engine allow you to run your GPU workloads with VMs or containers. For machine learning, Cloud Machine Learning can optionally be configured to utilize GPUs in order to reduce the time it takes to train your models at scale with TensorFlow. 
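As a concrete illustration of the infrastructure path, provisioning a GPU-backed VM from the gcloud CLI looks roughly like the sketch below. The instance name, zone, machine type, and image are placeholder choices, and GPU instances must set the host maintenance policy to TERMINATE:

```shell
# Create a VM with one P100 attached (illustrative values throughout).
gcloud compute instances create my-gpu-vm \
    --zone us-east1-c \
    --machine-type n1-standard-8 \
    --accelerator type=nvidia-tesla-p100,count=1 \
    --maintenance-policy TERMINATE \
    --image-family ubuntu-1604-lts \
    --image-project ubuntu-os-cloud
```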

With today’s announcement, you can now deploy both the NVIDIA Tesla P100 and K80 GPUs in four regions worldwide. All of our GPUs can now take advantage of sustained use discounts, which automatically lower the price (by up to 30%) of your virtual machines when you use them to run sustained workloads. No lock-in or upfront minimum fee commitments are needed to take advantage of these discounts.

Cloud GPUs Regions Availability – Number of Zones


Speed up machine learning workloads 

Since launching GPUs, we’ve seen customers benefit from the extra computation they provide to accelerate workloads ranging from genomics and computational finance to training and inference on machine learning models. One of our customers, Shazam, was an early adopter of GPUs on GCP to power their music recognition service.

“For certain tasks, [NVIDIA] GPUs are a cost-effective and high-performance alternative to traditional CPUs. They work great with Shazam’s core music recognition workload, in which we match snippets of user-recorded audio fingerprints against our catalog of over 40 million songs. We do that by taking the audio signatures of each and every song, compiling them into a custom database format and loading them into GPU memory. Whenever a user Shazams a song, our algorithm uses GPUs to search that database until it finds a match. This happens successfully over 20 million times per day.”   

 – Ben Belchak, Head of Site Reliability Engineering, Shazam

With today’s Cloud GPU announcements, GCP takes another step toward being the optimal place for any hardware-accelerated workload. With the addition of NVIDIA P100 GPUs, our primary focus is to help you bring new use cases to life. To learn more about how your organization can benefit from Cloud GPUs and Compute Engine, visit the GPU site and get started today!



¹ The 10x performance boost compares 1 P100 GPU versus 1 K80 GPU (½ of a K80 board) for machine learning inference workloads that benefit from the P100’s FP16 precision. Performance will vary by workload. Download this datasheet for more information.




How Machine Learning with TensorFlow Enabled Mobile Proof-Of-Purchase at Coca-Cola

By | machinelearning, TensorFlow

In this guest editorial, Patrick Brandt, IT Director and Solutions Strategist at The Coca-Cola Company, tells us how they’re using AI and TensorFlow to achieve frictionless proof-of-purchase.

Coca-Cola’s core loyalty program launched in 2006 as MyCokeRewards.com. The “MCR.com” platform included the creation of unique product codes for every Coca-Cola, Sprite, Fanta, and Powerade product sold in 20oz bottles and cardboard “fridge-packs” purchasable at grocery stores and other retail outlets. Users could enter these product codes at MyCokeRewards.com to participate in promotional campaigns.

Fast-forward to 2016: Coke’s loyalty programs are still hugely popular with millions of product codes having been entered for promotions and sweepstakes. However, mobile browsing went from non-existent in 2006 to over 50% share by the end of 2016. The launch of Coke.com as a mobile-first web experience (replacing MCR.com) was a response to these changes in browsing behavior. Thumb-entering 14-character codes into a mobile device could be a difficult enough user experience to impact the success of our programs. We want to provide our mobile audience the best possible experience, and recent advances in artificial intelligence opened new opportunities.

The quest for frictionless proof-of-purchase

For years Coke attempted to use off-the-shelf optical character recognition (OCR) libraries and services to read product codes with little success. Our printing process typically uses low-resolution dot-matrix fonts with the cap or fridge-pack media running under the printhead at very high speeds. All of this translates into a low-fidelity string of characters that defeats off-the-shelf OCR offerings (and can sometimes be hard to read with the human eye as well). OCR is critical to simplifying the code-entry process for mobile users: they should be able to take a picture of a code and automatically have the purchase registered for a promotional entry. We needed a purpose-built OCR system to recognize our product codes.

Bottlecap and fridge-pack examples

Our research led us to a promising solution: Convolutional Neural Networks. CNNs are one of a family of “deep learning” neural networks that are at the heart of modern artificial intelligence products. Google has used CNNs to extract street address numbers from StreetView images. CNNs also perform remarkably well at recognizing handwritten digits. These number-recognition use-cases were a perfect proxy for the type of problem we were trying to solve: extracting strings from images that contain small character sets with lots of variance in the appearance of the characters.

CNNs with TensorFlow

In the past, developing deep neural networks like CNNs has been a challenge because of the complexity of available training and inference libraries. TensorFlow, a machine learning framework that was open sourced by Google in November 2015, is designed to simplify the development of deep neural networks.

TensorFlow provides high-level interfaces to different kinds of neuron layers and popular loss functions, which makes it easier to implement different CNN model architectures. The ability to rapidly iterate over different model architectures dramatically reduced the time required to build Coke’s custom OCR solution because different models could be developed, trained, and tested in a matter of days. TensorFlow models are also portable: the framework supports model execution natively on mobile devices (“AI on the edge”) or in servers hosted remotely in the cloud. This enables a “create once, run anywhere” approach for model execution across many different platforms, including web-based and mobile.

Machine learning: practice makes perfect

Any neural network is only as good as the data used to train it. We knew that we needed a large set of labeled product-code images to train a CNN that would achieve our performance goals. Our training set would be built in three phases:

  1. Pre-launch simulated images
  2. Pre-launch real-world images
  3. Images labeled by our users in production

The pre-launch training phase began by programmatically generating millions of simulated product-code images. These simulated images included variations in tilt, lighting, shadows, and blurriness. The prediction accuracy (i.e. how often all 14 characters were correctly predicted within the top-10 predictions) was at 50% against real-world images when the model was trained using only simulated images. This provided a baseline for transfer-learning: a model initially trained with simulated images was the foundation for a more accurate model that would be trained against real-world images.

The challenge now turned to enriching the simulated images with enough real-world images to hit our performance goals. We created a purpose-built training app for iOS and Android devices that “trainers” could use to take pictures of codes and label them; these labeled images were then transferred to cloud storage for training. We did a production run of several thousand product codes on bottle caps and fridge-packs and distributed these to multiple suppliers who used the app to create the initial real-world training set.

Even with an augmented and enriched training set, there is no substitute for images created by end-users in a variety of environmental conditions. We knew that scans would sometimes result in an inaccurate code prediction, so we needed to provide a user-experience that would allow users to quickly correct these predictions. Two components are essential to delivering this experience: a product-code validation service that has been in use since the launch of our original loyalty platform in 2006 (to verify that a predicted code is an actual code) and a prediction algorithm that performs a regression to determine a per-character confidence at each one of the 14 character positions. If a predicted code is invalid, the top prediction as well as the confidence levels for each character are returned to the user interface. Low-confidence characters are visually highlighted to guide the user to update characters that need attention.

Error correction user interface lets users correct invalid predictions and generate useful training data
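The validate-then-highlight flow behind that interface can be sketched in plain Python. This is an illustrative stand-in, not Coca-Cola's production code; the threshold, the example code string, and the known-code set are all made up:

```python
# Sketch: validate a predicted 14-character code against a known-code
# store, and flag low-confidence character positions for user review.
CODE_LENGTH = 14
CONFIDENCE_THRESHOLD = 0.5  # assumption: below this, ask the user to check

def flag_low_confidence(prediction, confidences, known_codes):
    """Return (is_valid, positions_to_review)."""
    assert len(prediction) == CODE_LENGTH == len(confidences)
    if prediction in known_codes:  # stand-in for the validation service
        return True, []
    review = [i for i, c in enumerate(confidences) if c < CONFIDENCE_THRESHOLD]
    return False, review

pred = "4XKJ7Q9ZB2M6PH"
conf = [0.99] * CODE_LENGTH
conf[3] = 0.21  # the fourth character was read with low confidence
valid, positions = flag_low_confidence(pred, conf, known_codes={"OTHERCODE00000"})
print(valid, positions)  # False [3]
```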

This user interface innovation enables an active learning process: a feedback loop allows the model to gradually improve by returning corrected predictions to the training pipeline. In this way, our users organically improve the accuracy of the character recognition model over time.
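A minimal sketch of such a feedback loop, with illustrative names rather than those of the real pipeline: corrected predictions are queued as fresh labeled examples for periodic retraining.

```python
# Sketch of the active-learning feedback loop: only user corrections
# (prediction != corrected code) carry new information worth queueing
# for retraining. Names and codes here are illustrative.
from collections import deque

training_queue = deque()

def on_user_correction(image_id, predicted_code, corrected_code):
    if predicted_code != corrected_code:
        training_queue.append((image_id, corrected_code))

on_user_correction("img-001", "4XKJ7Q9ZB2M6PH", "4XKI7Q9ZB2M6PH")  # queued
on_user_correction("img-002", "AAAAAAAAAAAAAA", "AAAAAAAAAAAAAA")  # no-op
print(len(training_queue))  # 1
```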

Product-code recognition pipeline

Optimizing for maximum performance

To meet user expectations around performance, we established a few ambitious requirements for the product-code OCR pipeline:

  • It had to be fast: we needed a one-second average processing time once the image of the product-code was sent into the OCR pipeline
  • It had to be accurate: our goal was to achieve 95% string recognition accuracy at launch with the guarantee that the model could be improved over time via active learning
  • It had to be small: the OCR pipeline needs to be small enough to be distributed directly to mobile apps and accommodate over-the-air updates as the model improves over time
  • It had to handle diverse product code media: dozens of different combinations of font types, bottlecaps, and cardboard fridge-pack media

We initially explored an architecture that used a single CNN for all product-code media. This approach created a model that was too large to be distributed to mobile apps, and the execution time was longer than desired. Our applied-AI partners at Quantiphi, Inc. began iterating on different model architectures, eventually landing on one that used multiple CNNs.

This new architecture reduced the model size dramatically without sacrificing accuracy, but it was still on the high end of what we needed in order to support over-the-air updates to mobile apps. We next used TensorFlow’s prebuilt quantization module to reduce the model size by reducing the fidelity of the weights between connected neurons. Quantization reduced the model size by a factor of 4, but a more dramatic reduction came when Quantiphi had a breakthrough using a new approach called SqueezeNet.
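The size/fidelity trade-off behind quantization can be illustrated with a toy affine quantizer. This is a hand-rolled sketch of the general idea (float32 weights mapped onto 256 one-byte levels, a 4x storage reduction with a bounded rounding error), not TensorFlow's actual quantization module:

```python
# Toy affine quantization: map floats in [lo, hi] onto integers 0..255.
def quantize(weights):
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 or 1.0  # avoid zero scale for constant weights
    q = [round((w - lo) / scale) for w in weights]  # ints in 0..255
    return q, scale, lo

def dequantize(q, scale, lo):
    return [lo + v * scale for v in q]

weights = [-0.81, -0.13, 0.0, 0.42, 0.97]
q, scale, lo = quantize(weights)
restored = dequantize(q, scale, lo)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
assert max_err <= scale / 2 + 1e-9  # rounding error bounded by half a step
# Storage: 4 bytes per float32 -> 1 byte per level, a 4x reduction.
```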

The SqueezeNet model was published by a team of researchers from UC Berkeley and Stanford in November of 2016. It uses a small but highly complex design to achieve accuracy levels on par with much larger models against popular benchmarks such as ImageNet. After re-architecting our character recognition models to use a SqueezeNet CNN, Quantiphi was able to reduce the model size of certain media types by a factor of 100. Since the SqueezeNet model was inherently smaller, a richer feature-detection architecture could be constructed, achieving much higher accuracy at much smaller sizes compared to our first batch of models trained without SqueezeNet. We now have a highly accurate model that can be easily updated on remote devices; the recognition success rate of our final model before active learning was close to 96%, which translates into 99.7% character recognition accuracy (just 3 misses for every 1,000 character predictions).

Valid product-code recognition examples with different types of occlusion, translation, and camera focus issues
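As a sanity check on the two accuracy figures above: if a scan succeeds only when every character in a code is read correctly, per-character accuracy a and code length n relate by success rate ≈ a^n. The code length is not stated in this post, so n = 14 below is purely an assumption chosen for illustration:

```python
per_char = 0.997   # character-level accuracy quoted above
code_len = 14      # ASSUMED code length -- not stated in this post
full_code = per_char ** code_len
print(round(full_code, 3))  # 0.959, i.e. "close to 96%"
```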

Crossing boundaries with AI

Advances in artificial intelligence and the maturity of TensorFlow enabled us to finally achieve a long-sought proof-of-purchase capability. Since launching in late February 2017, our product code recognition platform has fueled more than a dozen promotions and resulted in over 180,000 scanned codes; it is now a core component for all of Coca-Cola North America’s web-based promotions.

Moving to an AI-enabled product-code recognition platform has been valuable for two key reasons:

  • Frictionless proof-of-purchase was enabled in a timely fashion, corresponding to our overall move to a mobile-first marketing platform.
  • Coke saved millions of dollars by avoiding the requirement to update printers in our production lines to support higher-fidelity fonts that would work with existing off-the-shelf OCR software.

Our product-code recognition platform is the first execution of new AI-enabled capabilities at scale within Coca-Cola. We’re now exploring AI applications across multiple lines of business, from new product development to ecommerce retail optimization.


Source link

Data Science Weekly – Issue 200

By | machinelearning, TensorFlow

Data Science Weekly – Issue 200



Curated news, articles and jobs related to Data Science. 
Keep up with all the latest developments

Issue #200

Sept 21 2017

**Special Notice**: Folks, this is our 200TH EDITION!! We just wanted to take this opportunity to thank you for joining, sharing, and being part of this community! If you'd like to make a small donation to help keep us up and running, please use the Donate button below – we would really appreciate any and all support 🙂

Editor Picks

 

  • Introducing: Unity Machine Learning Agents
    As the world’s most popular creation engine, Unity is at the crossroads between machine learning and gaming. It is critical to our mission to enable machine learning researchers with the most powerful training scenarios, and for us to give back to the gaming community by enabling them to utilize the latest machine learning technologies. As the first step in this endeavor, we are excited to introduce Unity Machine Learning Agents…

A Message from this week's Sponsor:

 

 
Are you data curious? An aspiring data scientist?

On 9/27, join Metis for a free, online event featuring 25+ incredible speakers from the data science field. Speakers will demystify data science and discuss the training, tools, and career path to one of the world's hottest jobs. Every registrant gets access to bonus material from some of the industry's most influential thought-leaders. Secure your spot today! #DemistifyDS

Data Science Articles & Videos

 

  • Predicting NFL Plays with the xgboost Decision Tree Algorithm
    In all levels of football, on-field trends are typically discerned exclusively through voluminous film study of opponent history, and decisions are made using anecdotal evidence and gut instinct. These methods in isolation are highly inefficient and prone to human error. Enter the play predictor: a tool that aims to enhance in-game NFL decision-making by predicting, with high accuracy and in real time, the type of play the opposing team will run…
  • Kullback-Leibler Divergence Explained
    In this post we’re going to take a look at a way of comparing two probability distributions called Kullback-Leibler Divergence (often shortened to just KL divergence). Very often in probability and statistics we’ll replace observed data or a complex distribution with a simpler, approximating distribution. KL divergence helps us measure just how much information we lose when we choose an approximation…
  • Learning to Optimize with Reinforcement Learning
    Yet, there is a paradox in the current paradigm: the algorithms that power machine learning are still designed manually. This raises a natural question: can we learn these algorithms instead? This could open up exciting possibilities: we could find new algorithms that perform better than manually designed algorithms, which could in turn improve learning capability…
  • Supporting Hypothesis
    In September, Stripe is supporting the development of Hypothesis, an open-source testing library for Python created by David MacIver. Hypothesis is the only project we’ve found that provides effective tooling for testing code for machine learning, a domain in which testing and correctness are notoriously difficult…
  • Visualizing Distributions
    Many charting taxonomies include distributions, but they only present a few options. Let’s remedy that with a post on the many. We’ll use a single (completely fake) data set so we can easily compare how each chart type displays the same data…
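For two discrete distributions, the KL divergence discussed in the article above reduces to a one-line sum; the distributions below are arbitrary toy values:

```python
from math import log

def kl_divergence(p, q):
    """D_KL(P || Q) in nats: information lost when Q approximates P."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]   # "observed" distribution (toy values)
q = [0.4, 0.4, 0.2]   # simpler approximating distribution
kl = kl_divergence(p, q)
print(round(kl, 5))   # 0.02527
```

Note that KL divergence is not symmetric: D(P||Q) generally differs from D(Q||P), which is why the direction of approximation matters.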


Jobs

 

  • Data Analyst – Glossier – New York, NY

    Glossier is looking for a Senior Data Analyst to take our data practice to the next level. You will work closely with our Head of Data to provide data-driven insights to teams across the organization in order to inform strategic decision-making. You will take a leading role in shaping our Data practices, and you will use your insights to scope projects, propose approaches, and help to drive them to completion. If you enjoy finding the signal in the noise, bringing order and structure to inefficiencies, and know the mean time and standard deviation of your commute then please apply…

Training & Resources

 

  • Creating a Bar Chart
    The very basics of how to create a bar chart or stacked bar chart with labels and an axis in Semiotic…

  • Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
    This is a PyTorch version of fairseq, a sequence-to-sequence learning toolkit from Facebook AI Research. The original authors of this reimplementation are (in no particular order) Sergey Edunov, Myle Ott, and Sam Gross. The toolkit implements the fully convolutional model described in Convolutional Sequence to Sequence Learning and features multi-GPU training on a single machine as well as fast beam search generation on both CPU and GPU. We provide pre-trained models for English to French and English to German translation…
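The “fast beam search generation” mentioned above keeps only the top-k highest-scoring partial sequences at each decoding step rather than exploring every continuation. A toy sketch of the idea (not fairseq’s actual implementation, which runs batched on GPU):

```python
import math

def beam_search(step_scores, beam_width=2):
    """Keep the best `beam_width` partial sequences at every step.

    step_scores[t] maps token -> log-probability at step t (a toy,
    position-independent stand-in for a real decoder's output).
    """
    beams = [((), 0.0)]  # (sequence, cumulative log-prob)
    for scores in step_scores:
        candidates = [(seq + (tok,), lp + s)
                      for seq, lp in beams
                      for tok, s in scores.items()]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

steps = [{"a": math.log(0.6), "b": math.log(0.4)},
         {"a": math.log(0.3), "b": math.log(0.7)}]
best_seq, best_lp = beam_search(steps)[0]
print(best_seq, round(math.exp(best_lp), 2))  # ('a', 'b') 0.42
```

Note how greedy decoding would commit to "a" then "b" here anyway, but with more steps the beam can recover sequences whose first token is not locally best.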

Reminder: if you enjoyed the first 200 newsletters and want many more, please make a donation to help keep us going 🙂 – All the best, Hannah & Sebastian

Follow on Twitter
Copyright © 2013-2017 DataScienceWeekly.org, All rights reserved.
unsubscribe from this list    update subscription preferences 

Source link