Showing posts with label Predictive Analytics. Show all posts

Friday, July 5, 2019

Handwritten Devanagari Character Identification


After playing with Cricket vs Baseball images, I wanted to try the fastai approach on a more concrete problem with published benchmarks. So I chose a Handwritten Devanagari Character dataset (Devanagari is the script of my mother tongue, Marathi), for which the reported state-of-the-art (SoTA) accuracy was 98.47%. The aim was to check whether I could beat this number.

I used the fastai environment: a collection of libraries and lessons that brings standard deep learning practices and techniques together in one place. fastai is built on top of PyTorch, an open source machine learning framework created at Facebook. PyTorch is a widely used alternative to TensorFlow.

The detailed post with code and output can be found on GitHub. In this post, I am sharing a quick summary.

Steps followed:
  1. Import fastai and other required libraries
  2. Set the config (path to your data, and the size the images will be resized to)
  3. Use a variant of a pre-trained neural network architecture called ResNet
  4. Train the model on our data and evaluate it on the validation set
  5. Explore the results
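The steps above can be sketched with the fastai v1 API of the time. This is a minimal sketch, not the exact code from the notebook: the data path, image size, ResNet variant, and epoch count are my assumptions.

```python
from fastai.vision import *  # fastai v1 style imports

# Step 2: config (path and image size are assumed, not the original values)
path = Path('data/devanagari')
data = ImageDataBunch.from_folder(path, valid_pct=0.2, size=32).normalize()

# Step 3: a variant of a pre-trained ResNet
learn = cnn_learner(data, models.resnet34, metrics=accuracy)

# Step 4: train; validation accuracy is reported after each epoch
learn.fit_one_cycle(4)

# Step 5: explore the results, e.g. the images the model got most wrong
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_top_losses(9, figsize=(10, 10))
```

`plot_top_losses` is what produces the "top incorrect predictions" grid shown further down.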

Dataset Samples

The following screenshot displays the top incorrect predictions. Some of these are hard to identify even for a native Devanagari reader and writer like me.


Saturday, May 12, 2018

Deep Learning for Image Classification: Baseball-player vs Cricket-player

If you are from a country that plays neither Cricket nor Baseball, it can be hard to tell the two apart. So I tried to build a simple model to do just that.

This post is based on Jeremy Howard's original "Cats vs Dogs" classification work. I decided to use the same architecture to build a neural network for a different and perhaps more exciting problem statement.

I created a model to label images based on whether they contain a Batter (label=Baseball) or a Batsman (label=Cricket). The aim was to build a basic model with relatively simple images: no exclusive pictures of pitchers, catchers, bowlers, fielders, etc.
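With this approach the labels come straight from how the images are arranged on disk; a layout along these lines (the directory names are my assumption) is what the fastai folder-based loaders expect:

```
data/cricket-vs-baseball/
├── train/
│   ├── baseball/   (images of batters)
│   └── cricket/    (images of batsmen)
└── valid/
    ├── baseball/
    └── cricket/
```

Each subdirectory name becomes a class label, so no separate label file is needed.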

I used the fastai environment: a collection of libraries and lessons that brings standard deep learning practices and techniques together in one place. fastai is built on top of PyTorch, an open source machine learning framework created at Facebook. PyTorch is a widely used alternative to TensorFlow.

The detailed post with code and output can be found on GitHub. In this post, I am sharing a quick summary.

Thursday, February 6, 2014

Signal and Noise: Optimizing the predictive model

Signal gives us the useful information we need, and we would like to maximize it in the input.
Noise gives us useless information, and we would like to remove as much of it as possible from the input.

Unfortunately, in real-life scenarios every input to a model contains both signal and noise, so we have to optimize the balance between the two.

For any predictive model, we can determine whether the inputs really affect the output using association tests: the lower the p-value, the stronger the association. Next, we arrange the predictors in decreasing order of association and add them one by one to build the predictive model. Every time we add a predictor, a new model should be built on the training data and then evaluated against the testing data.

Let's use the 'Area Under the Curve' (AUC) as the evaluator of the model.
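The procedure above can be sketched in plain Python. The `auc` function computes the Area Under the ROC Curve by pairwise comparison (the Mann-Whitney formulation), and `forward_select` stands in for the add-one-predictor-at-a-time loop. To keep the sketch self-contained, the "model" here is just the sum of the selected feature values; that is my simplifying assumption, and a real project would refit a proper model (e.g. logistic regression) at each step.

```python
def auc(labels, scores):
    """AUC via pairwise comparison (Mann-Whitney U statistic)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Fraction of (positive, negative) pairs ranked correctly; ties count half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def forward_select(X, y, ordered_features):
    """Add predictors one by one (strongest association first);
    keep each only if it improves AUC on the evaluation data."""
    selected, best = [], 0.0
    for f in ordered_features:
        candidate = selected + [f]
        # Stand-in "model": score = sum of the candidate feature values.
        scores = [sum(row[i] for i in candidate) for row in X]
        a = auc(y, scores)
        if a > best:
            selected, best = candidate, a
    return selected, best


# Tiny example: feature 0 separates the classes, feature 1 is noise.
X = [[0.1, 0.9], [0.2, 0.1], [0.8, 0.5], [0.9, 0.2]]
y = [0, 0, 1, 1]
print(forward_select(X, y, [0, 1]))  # keeps only feature 0
```

In practice you would replace the stand-in scoring with a real fit-and-predict on held-out data, but the stopping logic (keep a predictor only when AUC improves) is the same.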

Predictive Modelling project: Workflow and learnings

I got the opportunity to work on a project aimed at building a predictive model for a disorder X, so I am sharing a few observations and procedures here. Due to some constraints, I won't be getting into details.

Aim: Build a generic predictive model for disorder X across all ethnic groups. A predictive model works better when developed for a specific ethnic group; however, we were trying to build a generic model, which meant we might not get very good results.

Data set: The data was gathered from three different studies involving three ethnic groups. Each source was a case-control study: a case is a record with a positive outcome (here, for disorder X) and a control is an observation with a negative outcome.

Time division: It was a year-long project, and we spent around 80% of the time on data pre-processing (gathering, profiling, cleaning, and formatting). Most of the heavy processing was also done in this part. The remaining 20% of the time was spent on building and evaluating different predictive models.

Sunday, September 29, 2013

Hollywood and Big Data

There is a huge opportunity in movie market prediction. In my last semester, I got a chance to be part of a research project in my university's Marketing department, where they are trying to build a model that predicts movie revenue. This led me to explore how Hollywood is using big data and predictive analytics to maximize its business.

A few related stories are below.

1] For almost every movie released by the big Hollywood studios, there is a predicted financial target for the first week. However, there are so many input factors to consider that the predictions are still not even close to perfect. Recently, "City of Bones" missed its target by almost 25 percent. [ref: Hollywood's box office tracking system under fire]

Wednesday, November 21, 2012

Retailers and Big Data

Sometime in 2011, I read an article about how much data Walmart collects every hour and how it is used to optimize their business. That article ignited my interest in big data, and I decided to explore how retail businesses use it. Here are a few interesting finds.

Walmart:

Wal-Mart, a retail giant, handles more than 1 million customer transactions every hour, feeding databases estimated at more than 2.5 petabytes, the equivalent of 167 times the books in America's Library of Congress. [1] The magnitude of the data is so big that they recently shifted to a 250-node Hadoop cluster. Walmart Labs is building its own big data analysis tools and plans to open source them. [2]