Saturday, May 12, 2018

Deep Learning for Image Classification: Baseball-player vs Cricket-player

If you are from a country that plays neither cricket nor baseball, it can be hard to tell the two sports apart. So I tried to build a simple model to do just that.

This post is based on Jeremy Howard's original "Cats vs Dogs" classification work. I decided to use the same architecture to build a neural network for a different and perhaps more exciting problem.

I created a model to label images based on whether they contain a Batter (label = Baseball) or a Batsman (label = Cricket). The aim was to build a basic model with relatively simple images: no exclusive pictures of pitchers, catchers, bowlers, fielders, etc.

I used the fastai environment, a collection of libraries and lessons created to keep standard practices and technologies in one place. fastai is built on top of PyTorch, an open-source machine learning library created by Facebook developers. PyTorch is a widely used alternative to TensorFlow.

The detailed post with code and output can be found on GitHub. In this post, I am sharing a quick summary.

Steps followed:
  1. Import fastai and other required libraries
  2. Set config (Path to your data, the size that the images will be resized to)
  3. Use a pre-existing NN architecture (pre-trained model) called resnet34
  4. Train our data and evaluate the model on the validation set
  5. Explore the results
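The steps above can be sketched roughly as follows, based on the fastai (v0.7-era) lesson API. The data path, image size, and folder layout are assumptions for illustration, not the exact notebook: a data folder with train/ and valid/ directories, each holding one subfolder per label.

```python
# A minimal sketch, assuming fastai v0.7 and a hypothetical folder layout:
# data/baseball-vs-cricket/{train,valid}/{baseball,cricket}/
from fastai.conv_learner import *

PATH = "data/baseball-vs-cricket/"  # hypothetical path to the images
sz = 224                            # size the images are resized to

arch = resnet34  # pre-trained ResNet-34 architecture
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz))
learn = ConvLearner.pretrained(arch, data, precompute=True)

learn.fit(0.01, 4)  # train for 4 epochs at learning rate 0.01
```

The learning rate and epoch count here are the values the training log below suggests (4 epochs), not a tuned recommendation.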
Some interesting findings:

In just 4 epochs the model was able to achieve 81% accuracy on the validation dataset.
epoch      trn_loss   val_loss   accuracy       
    0      1.289021   0.991551   0.25      
    1      1.156484   0.812032   0.4375         
    2      1.012042   0.668625   0.5625         
    3      0.843908   0.548285   0.8125    
Some correctly predicted images (closer to 1 is Cricket, closer to 0 is Baseball)

Some incorrectly predicted images (closer to 1 is Cricket, closer to 0 is Baseball)

This was interesting! I had chosen a few images for the validation dataset that deviate slightly from the common pattern in the training dataset. For example, in the first image above (the Baseball image) we do not see the ground, unlike in the training data. In the second image above (the Cricket image) the third stump is not in place.

Values closer to 0.5 indicate less confidence in the prediction.

I believe that with more training data and a larger number of epochs the performance could be improved. In the next post, I hope to experiment more with fine-tuning the model on a slightly more difficult problem.

Sunday, January 28, 2018

Data Visualization: Classics Edition

On the first page of his first data visualization book, Edward Tufte states his fundamental beliefs about data visualization. One of his principles is to encourage the eye to compare different pieces of data. I wasn't sure what he meant by that until I came across some classic works of visualization. One can easily recognize and appreciate the thoughtfulness and arduous efforts of these data artists.

John Snow's Cholera Map

By John Snow, published by C.F. Cheffins, Lith, Southampton Buildings, London, 1854; reproduced in Snow, John, On the Mode of Communication of Cholera, 2nd ed., John Churchill, New Burlington Street, London, 1855. Public domain, via Wikimedia Commons.

Saturday, March 4, 2017

Indian Space Program Overview: Success Rate, Launcher Rankings, Budget and Profits, Ambitious Future Plans

On 15th February 2017, the Indian Space Research Organization (ISRO) launched 104 satellites using a single launcher, an excellent example of indigenous engineering. The curiosity to explore the organization after this recent achievement led to this post.

ISRO joined the space party a little late compared to the countries with highly successful space programs. Nonetheless, its accomplishments are pretty astonishing. In its first decade, ISRO focused on building its own launch vehicle that could deliver satellites into orbit.

ISRO built its first indigenous rocket capable of putting satellites into orbit, called SLV-3 (Satellite Launch Vehicle-3). The project was headed by A. P. J. Abdul Kalam, who later became the 11th President of India.

The following table shows the evolution of ISRO's launch vehicles:

Launcher   Status              Payload (kg)   Payload delivery capability
SLV-3      Decommissioned      40             Low Earth Orbit
ASLV       Decommissioned      150            Low Earth Orbit
                               1450           Sun-Synchronous Polar Orbit
                               1750           Geosynchronous and Geostationary Orbits
                               2500           Geosynchronous Transfer Orbit
                               5000           Low Earth Orbit
           Under development   4000           Geosynchronous Transfer Orbit
                               8000           Low Earth Orbit

Wednesday, March 1, 2017

Time for Tracking the Election Manifesto Promises?

A young journalist from The Hindu performed an interesting analysis of official election manifestos. After analyzing individual items, he assigned each to one of the following categories: fulfilled, in progress, yet to start, stalled, etc. According to the results, the Central Government of India had fulfilled 18% of its promises in 2.3 years. Though that seems low, I was actually surprised and happy to be wrong about my skepticism towards the government. The same journalist did another analysis of the Delhi government and concluded that 23% of its manifesto promises were fulfilled by the end of the first year. Another pleasant surprise! I think it's time we start seriously evaluating the election manifestos of political parties and asking questions about unfulfilled promises. Let it not be a mere box of carrots.

There are two simple problems with this particular analysis,
  1. Cross-sectional analysis: This analysis was performed at a specific point in time (almost a year ago) and is outdated now. Having these numbers continuously updated would help us compare apples to apples in real time.
  2. Only two governments: Hypothetically, if a BJP state government is performing poorly, it should not get bragging rights for the good work done by BJP's Central Government. So it would be extremely valuable to have these statistics for the central government and all state governments.

Wednesday, February 15, 2017

The Curse of Fake News and Possible Remedy

Well before the "Fake News" debate started trending, researchers at Stanford University began a project aimed at studying how well we evaluate information, especially from online sources. Starting in early 2015, the researchers studied the behaviour of students from schools and universities like Stanford for 18 months. In the summary of the report, the researchers summed up their disappointment by stating, "in every case and at every level, we were taken aback by students' lack of preparation." The participants did a pretty poor job of assessing the credibility of information and sources. Though unfortunate, it might not be entirely shocking, as it confirms the pattern we observe around us.

In the era of social media journalism, the reliability of information appears to have taken a back seat. Facebook announced its intention to crack down on fake news, and recently Twitter has joined the call. Maybe a few other companies will follow suit. Though a commendable initiative, it doesn't seem enough for the enormous scale of the problem. There are 85+ virtual communities worldwide with at least a million registered users each (like Facebook and Twitter). Additionally, there are a few dozen instant messaging services like Whatsapp. Making all these platforms accountable seems practically impossible. And even if many of them implement some measure of regulation, can we trust these platforms with their self-moderation policies?

Sunday, January 29, 2017

Influence of celebrities on Public Health

In late 2014, Anushka Sharma, one of India's leading actresses, spoke about her anxiety issues in an interview. She did not play the victim; she suggested it is as normal as having a constant stomach pain and encouraged talking about it.

"I have anxiety. And I’m treating my anxiety. I’m on medication for my anxiety. Why am I saying this? Because it’s a completely normal thing. It’s a biological problem. In my family there have been cases of depression. More and more people should talk openly about it. There is nothing shameful about it or something to hide. If you had a constant stomach pain, wouldn’t you go to the doctor? It’s that simple. I want to make this my mission, to take any shame out of this, to educate people about this."

Roughly around the same time, another leading Indian actress, Deepika Padukone, spoke openly about her depression in an interview with a national newspaper. [1] She talked about her plans to create more awareness about depression and used social media for the same.

"Anxiety,Depression and Panic Attacks are not signs of weakness.They are signs of trying to remain strong for way too long." - @deepikapadukone, 31 Dec, 2014 [2]

Sunday, January 22, 2017

Wordplay in Information Manipulation

There is a very interesting scene in the movie The Dark Knight. The Joker (the bad guy) is holding Rachel (the lead actress) hostage at the edge of a rooftop when Batman arrives. The short conversation goes something like this:

Batman: Let her go!
Joker: Ohh, very poor choice of words

Indeed; maybe Batman was under a lot of stress. If this poor^ example of information representation serves as one end of the spectrum, then researchers might be on the other end.

The way (good) researchers choose their words seems remarkably careful^^. They would love to say something like, "X is associated with an increased risk of Y with a p-value of blah-blah" (possibly with extra stress on the word associated).

^Poor = unintentional or careless
^^Careful = intentional and thoughtful

Thursday, December 8, 2016

Selection Bias

Barack Obama's article in Wired. [1]
Stephen Hawking's article in The Guardian. [2]
Peter Thiel's speech at RNC. [3]

In the last two months, three renowned people have shared their thoughts about the time we live in.

All of them are highly successful and revered figures in their fields. They are all data-driven; you will find them quoting facts and figures all the time. Yet there is a stark difference between their central messages.

Case 1: Barack Obama's article in Wired

Barack Obama wrote an article titled "Now is the greatest time to be alive". His argument is that we have achieved great breakthroughs; though it's not utopia, considering history, the current time is the best time to live in.

"Just since 1983, when I finished college, things like crime rates, teen pregnancy rates, and poverty rates are all down. Life expectancy is up. The share of Americans with a college education is up too. Tens of mil­lions of Americans recently gained the security of health insurance. Blacks and Latinos have risen up the ranks to lead our businesses and communities. Women are a larger part of our workforce and are earning more money. Once-quiet factories are alive again, with assembly lines churning out the components of a clean-energy age.

Sunday, January 25, 2015

How to Create and Publish R package on CRAN : Step-by-Step Guide

Prerequisites:
  • R Studio (this tutorial is based on R Studio 0.98.501)
  • Beginner-level R programming skills
  • devtools package (to build and compile the code)
  • roxygen2 package (to create the documentation)

Let's break it down into 7 simple steps:
  1. Create R project
  2. Create function(s)
  3. Create description file
  4. Create help file(s)
  5. Build, load and check the package
  6. Export package
  7. Submit on CRAN
Step 1

1.1  Open R Studio. Create a new project using "File > New Project > New Directory > Empty Project". Give the directory a name.

Tuesday, October 28, 2014

Important Concepts in Statistics

This is a collection of a few important statistical concepts. These notes provide simple explanations (not formal definitions) of the concepts and the reasons why we need them.

Sample space: The set of all possible outcome values.

So if we consider a coin flip, the sample space is {head, tail}. If one unbiased die is thrown, the sample space is {1, 2, 3, 4, 5, 6}.

Event: A subset of the sample space. For the event "getting an even number after throwing an unbiased die" the subset is {2, 4, 6}. Every time we run the experiment, either the event occurs or it doesn't.

Why we need it: The sample space and the event help us determine the probability of the event. Probability is nothing but the ratio of the number of elements in the event to the number of elements in the sample space.
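The die example can be written out directly; a minimal sketch:

```python
# Probability as |event| / |sample space|, for the event
# "getting an even number" when throwing an unbiased die.
sample_space = {1, 2, 3, 4, 5, 6}
event = {outcome for outcome in sample_space if outcome % 2 == 0}  # {2, 4, 6}

probability = len(event) / len(sample_space)
print(probability)  # 0.5
```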

Probability distribution: The probability of every possible outcome in the sample space.

So for an unbiased die, the probability of every outcome is equal: 1/6 each.

When a probability distribution looks like this (equal probability for all outcomes) it is called a uniform probability distribution.

An important thing to note here is that for any probability distribution, the probabilities of all outcomes sum to exactly one.
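As a quick sketch, the die's uniform distribution and the sums-to-one property look like this (exact fractions avoid floating-point error):

```python
# The uniform distribution of an unbiased die: each of the six outcomes
# gets probability 1/6, and the probabilities sum to exactly 1.
from fractions import Fraction

die_distribution = {outcome: Fraction(1, 6) for outcome in range(1, 7)}

print(sum(die_distribution.values()))  # 1
```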

Why we need it: Most statistical modelling methods make certain assumptions about the underlying probability distribution. Based on the kind of distribution the data follows, we can choose appropriate methods. Sometimes we transform the data (log transform, inverse transform) if the observed distribution is not what we expected or what certain statistical methods require.

We can categorize probability distributions into two classes, discrete and continuous.
  • Discrete: The sample space is a collection of discrete values, e.g. a coin flip or a die throw.
  • Continuous: The sample space is a collection of infinitely many continuous values, e.g. the heights of all people in the US, or the distance traveled to reach the workplace.

Normal distribution: One of the most important concepts in statistics. Many real-world distributions resemble the normal distribution, which looks like a bell-shaped curve approaching zero at both ends.

In reality we almost never observe an exact normal distribution in nature, but in many cases it provides a good approximating model.

[Figure: Normal distribution PDF. "Normal Distribution PDF" by Inductiveload, licensed under Public Domain via Wikimedia Commons.]

When the mean of a normal distribution is zero and the standard deviation is 1, it is called the standard normal distribution. The red curve is the standard normal distribution.

Why we need it: Attaching a screenshot from a Quora discussion which sums it up pretty well.

Law of large numbers: The law of large numbers implies that the larger the sample size, the closer our sample mean is to the true (population) mean.

Why we need it: Have you ever wondered why, if the probability of each outcome (head or tail) for a fair coin is exactly half, 10 trials might still give different results (e.g. 6 heads and 4 tails)? The law of large numbers provides the answer: as we increase the number of trials, the mean of all trials comes closer to the expected value.

Another simple example: for an unbiased die the probability of every outcome {1, 2, 3, 4, 5, 6} is exactly the same (1/6), so the mean should be 3.5.

"Largenumbers" by NYKevin - Own work. Licensed under CC0 via Wikimedia Commons.

As we can see in the image above, only after a large number of trials does the mean approach 3.5.
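The die experiment can be simulated in a few lines; the seed and trial counts below are arbitrary choices for illustration:

```python
# Law of large numbers: the mean of fair-die rolls approaches the
# expected value 3.5 as the number of trials grows.
import random

random.seed(42)  # fixed seed so the run is reproducible

def mean_of_rolls(n):
    """Mean of n simulated rolls of an unbiased six-sided die."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

for n in (10, 1_000, 100_000):
    print(n, mean_of_rolls(n))  # the mean drifts toward 3.5 as n grows
```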

Central limit theorem: Regardless of the underlying distribution, if we draw large enough samples and plot each sample's mean, the result approximates a normal distribution.

Why we need it: If we know data is normally distributed, that tells us more about the data than an unknown distribution would. The central limit theorem enables us to use real-world data (near-normal or non-normal) with statistical methods that assume normality.

One article summarizes the practical use of the CLT as follows:

"The assumption that data is from a normal distribution simplifies matters, but seems a little unrealistic. Just a little work with some real-world data shows that outliers, skewness, multiple peaks and asymmetry show up quite routinely. We can get around the problem of data from a population that is not normal. The use of an appropriate sample size and the central limit theorem help us to get around the problem of data from populations that are not normal.

Thus, even though we might not know the shape of the distribution where our data comes from, the central limit theorem says that we can treat the sampling distribution as if it were normal."
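The theorem can be sketched with a simulation; the uniform population, sample size of 50, and 2000 samples are illustrative assumptions:

```python
# Central limit theorem: means of many samples drawn from a non-normal
# (uniform) population cluster around the population mean.
import random
import statistics

random.seed(0)  # fixed seed for reproducibility

# Uniform(0, 1) population: mean 0.5, standard deviation 1/sqrt(12) ~ 0.289.
sample_means = [
    statistics.mean(random.random() for _ in range(50))  # one sample of 50
    for _ in range(2000)                                 # 2000 sample means
]

print(statistics.mean(sample_means))   # close to the population mean 0.5
print(statistics.stdev(sample_means))  # close to 0.289 / sqrt(50) ~ 0.041
```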

Correlation: A number representing the strength of association between two variables. A high correlation coefficient implies the two variables are strongly associated.

One way to measure it is Pearson's correlation coefficient, the most widely used method, which can measure only linear relationships between variables. The coefficient varies from -1 to 1.

A correlation coefficient of zero means there is no (linear) relationship between the two variables. A negative value means that as one variable increases, the other decreases.

The most important thing to remember here is that correlation does not necessarily mean there is causation. It only represents how two variables are associated with each other.

Source: xkcd

Peter Flom, a statistical consultant, explains the difference in simple words:
"Correlation means two things go together. Causation means one thing causes another."

Once we find a correlation, controlled experiments can be conducted to check whether any causation exists. There are also a few statistical methods, such as maximal correlation, which help us check for non-linear relationships between two variables.

Why we need it: A correlation coefficient tells us how strongly two variables are associated and the direction of the association.
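Pearson's coefficient can be computed directly from its definition, cov(x, y) / (sd_x * sd_y); the two small datasets below are made up for illustration:

```python
# Pearson's correlation coefficient from scratch.
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0: perfect linear association
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))  # -1.0: perfect inverse association
```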

P-value: The basic notion of this concept is a number representing how likely it is that results at least as extreme as those observed would occur purely by chance. The smaller the number, the less likely the results are due to chance. Generally 0.05 is considered the threshold; a p-value below it is treated as statistically significant.

Having said that, Fisher argued strongly that interpretation of the p-value was ultimately up to the researcher. The threshold can vary depending on requirements.

Why we need it: Roughly speaking, a p-value of 5% (0.05) tells us that if chance alone were at work, results this extreme would show up about 1 time in 20.
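Tying this back to the coin example: a one-sided p-value can be computed exactly from the binomial distribution. The numbers below (60 heads in 100 flips) are an illustrative assumption, not from the post:

```python
# Exact one-sided binomial p-value: the probability, under the null
# hypothesis of a fair coin, of seeing at least k heads in n flips.
from math import comb

def p_value_at_least(k, n, p_null=0.5):
    """P(X >= k) for X ~ Binomial(n, p_null)."""
    return sum(comb(n, i) * p_null**i * (1 - p_null)**(n - i)
               for i in range(k, n + 1))

p = p_value_at_least(60, 100)
print(round(p, 3))  # about 0.028, below the conventional 0.05 threshold
```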