Tuesday, October 28, 2014

Important Concepts in Statistics

This is a random collection of a few important statistical concepts. These notes give a simple explanation (not a formal definition) of each concept and the reason why we need it.

Sample space: This is the set of all possible outcomes of an experiment.

So if we consider a coin flip, the sample space is {head, tail}. If an unbiased die is thrown, the sample space is {1, 2, 3, 4, 5, 6}.

Event: An event is a subset of the sample space. For the event "getting an even number after throwing an unbiased die" the subset is {2, 4, 6}. Every time we run the experiment, the event either occurs or it doesn't.

Why we need it: Together, the sample space and the event let us determine the probability of the event. When all outcomes are equally likely, probability is simply the ratio of the number of elements in the event to the number of elements in the sample space.
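The counting rule above can be sketched in a few lines of Python (a minimal sketch for the equally-likely case, using the die example from earlier):

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}                 # all outcomes of one fair die
event = {x for x in sample_space if x % 2 == 0}   # "even number" -> {2, 4, 6}

# For equally likely outcomes, P(event) = |event| / |sample space|.
probability = Fraction(len(event), len(sample_space))
print(probability)  # 1/2
```

Using `Fraction` keeps the ratio exact (3/6 reduces to 1/2) instead of a rounded float.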

Probability distribution: An assignment of a probability to every possible outcome in the sample space.

So for an unbiased die, the probability of every outcome is the same: 1/6.

When a probability distribution looks like this (equal probability for all outcomes), it is called a uniform probability distribution.

An important thing to remember here is that the probabilities in a distribution always sum to exactly one.
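A quick sketch of the uniform distribution for a die, with a sanity check that the probabilities sum to one (allowing for floating-point rounding):

```python
import math

# Uniform distribution over a fair die: each outcome gets probability 1/6.
distribution = {outcome: 1 / 6 for outcome in range(1, 7)}

# A valid probability distribution must sum to 1.
total = sum(distribution.values())
assert math.isclose(total, 1.0)
print(total)
```

`math.isclose` is used because six copies of the binary approximation of 1/6 may sum to something infinitesimally off from 1.0.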

Why we need it: Most statistical modelling methods make certain assumptions about the underlying probability distribution. So based on what kind of distribution the data follows, we can choose appropriate methods. Sometimes we transform the data (log transform, inverse transform) if the observed distribution is not what we expected or what a method requires.

We can categorize probability distributions into two classes: discrete and continuous.
  • Discrete: The sample space is a collection of discrete values, e.g. a coin flip or a die throw.
  • Continuous: The sample space is a continuous (infinite) range of values, e.g. the heights of all people in the US, or the distance traveled to reach the workplace.
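The distinction shows up directly when simulating: discrete outcomes are picked from a finite set, while continuous outcomes are drawn from a range. A small sketch (the height parameters are made-up illustration values, not real data):

```python
import random

random.seed(42)  # fixed seed for reproducibility

# Discrete: outcome comes from a finite set of values.
coin_flip = random.choice(["head", "tail"])
die_roll = random.choice([1, 2, 3, 4, 5, 6])

# Continuous: outcome comes from an uncountable range, e.g. height
# modeled (hypothetically) as normal with mean 170 cm, sd 10 cm.
height = random.gauss(170, 10)

print(coin_flip, die_roll, round(height, 1))
```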

Normal distribution: This is one of the most important concepts in statistics. Many distributions in the real world resemble the normal distribution, a bell-shaped curve that approaches zero at both ends.

In reality we almost never observe an exact normal distribution in nature, but in many cases it provides a good approximation.

[Figure: "Normal Distribution PDF" by Inductiveload (self-made, Mathematica, Inkscape). Licensed under public domain via Wikimedia Commons.]

When the mean of a normal distribution is zero and its standard deviation is 1, it is called the standard normal distribution (the red curve in the figure).
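The standard normal curve has a simple closed form, f(x) = exp(-x²/2) / √(2π), which we can evaluate directly; its peak at x = 0 is about 0.3989, and the curve is symmetric about zero:

```python
import math

def standard_normal_pdf(x):
    """Density of the standard normal: f(x) = exp(-x**2 / 2) / sqrt(2*pi)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

print(standard_normal_pdf(0))                              # peak, ~0.3989
print(standard_normal_pdf(1) == standard_normal_pdf(-1))   # symmetric: True
```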

Why we need it: Many natural measurements are approximately normal, and many statistical methods assume normality, so recognizing a (near-)normal distribution tells us which tools apply. (A Quora discussion, originally attached as a screenshot, sums this up pretty well.)

Law of large numbers: The larger the sample size, the closer the sample mean gets to the true (population) mean.

Why we need it: Have you ever wondered why, if the probability of each outcome (head or tail) for a fair coin is exactly half, 10 trials might still give different results (e.g. 6 heads and 4 tails)? The law of large numbers provides the answer: as we increase the number of trials, the mean of all trials gets closer to the expected value.

Another simple example: for an unbiased die, the probability of every outcome {1, 2, 3, 4, 5, 6} is exactly the same (1/6), so the expected value is (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5.
[Figure: "Largenumbers" by NYKevin (own work). Licensed under CC0 via Wikimedia Commons.]

As we can see in the image above, only after a large number of trials does the mean approach 3.5.
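The same convergence can be reproduced with a short simulation (a sketch with a fixed seed, so reruns give the same numbers):

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

# Expected value of a fair die: (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5
expected = sum(range(1, 7)) / 6

for n in (10, 1_000, 100_000):
    rolls = [random.randint(1, 6) for _ in range(n)]
    print(n, sum(rolls) / n)  # sample mean drifts toward 3.5 as n grows
```

With only 10 rolls the mean can be far from 3.5; by 100,000 rolls it sits within a few hundredths of it.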

Central limit theorem: Regardless of the underlying distribution, if we repeatedly draw large enough samples and plot each sample's mean, the distribution of those sample means approximates a normal distribution.

Why we need it: Knowing that data is normally distributed tells us much more about it than an unknown distribution would. The central limit theorem enables us to use real-world data (near-normal or non-normal) with statistical methods that assume normality.

An article on about.com summarizes the practical use of the CLT as follows:

"The assumption that data is from a normal distribution simplifies matters, but seems a little unrealistic. Just a little work with some real-world data shows that outliers, skewness, multiple peaks and asymmetry show up quite routinely. We can get around the problem of data from a population that is not normal. The use of an appropriate sample size and the central limit theorem help us to get around the problem of data from populations that are not normal.

"Thus, even though we might not know the shape of the distribution where our data comes from, the central limit theorem says that we can treat the sampling distribution as if it were normal."
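A quick way to see the theorem in action is to start from a heavily skewed distribution and watch the sample means cluster tightly around the population mean. This sketch uses the exponential distribution (population mean 1.0) purely as an example of a non-normal population:

```python
import random
import statistics

random.seed(1)

# Population: exponential with rate 1.0 -> mean 1.0, heavily right-skewed.
def draw_sample(size):
    return [random.expovariate(1.0) for _ in range(size)]

# Draw 2,000 samples of size 50 and record each sample's mean.
sample_means = [statistics.mean(draw_sample(50)) for _ in range(2000)]

# The sample means pile up around the population mean (1.0), with spread
# close to sigma / sqrt(n) = 1 / sqrt(50) ~ 0.14, and a roughly bell shape.
print(statistics.mean(sample_means))
print(statistics.stdev(sample_means))
```

Plotting `sample_means` as a histogram would show the bell shape emerging even though each individual draw is skewed.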

Correlation: A number representing the strength of association between two variables. A high correlation coefficient implies the two variables are strongly associated.

One way to measure it is Pearson's correlation coefficient. It is the most widely used measure, and it captures only the linear relationship between variables. The coefficient varies from -1 to 1.

A correlation coefficient of zero means there is no linear relationship between the two variables. A negative value means that as one variable increases, the other decreases.
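Pearson's coefficient is just the covariance divided by the product of the standard deviations, which is easy to compute by hand. A minimal sketch, checked against the two extreme cases:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance over the product of std deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [1, 2, 3, 4, 5]
print(pearson_r(xs, [2, 4, 6, 8, 10]))   # perfect linear increase -> 1.0
print(pearson_r(xs, [10, 8, 6, 4, 2]))   # perfect linear decrease -> -1.0
```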

The most important thing to remember here is that correlation does not necessarily mean there is causation. It only describes how two variables are associated with each other.

[Comic omitted; source: xkcd]

Peter Flom, a statistical consultant, explains the difference in simple words:
"Correlation means two things go together. Causation means one thing causes another."

Once we find a correlation, controlled experiments can be conducted to check whether any causation exists. There are also a few statistical methods, such as maximal correlation, that help us check for non-linear relationships between two variables.

Why we need it: A correlation coefficient tells us how strongly two variables are associated and the direction of the association.

P-value: Informally, this is the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. The smaller the number, the stronger the evidence against the null hypothesis. Generally 0.05 is taken as the threshold; a P-value below it is considered statistically significant.

Having said that, Fisher argued strongly that the interpretation of the P-value was ultimately up to the researcher, and the threshold can vary depending on requirements.

Why we need it: It quantifies how surprising our results would be under the null hypothesis. A P-value of 5% (0.05) means that if the null hypothesis were true, results at least this extreme would occur by chance only about 1 time in 20.
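Tying this back to the coin example: suppose we flip a coin 10 times and see 8 heads. A sketch of the one-sided P-value under the null hypothesis that the coin is fair, computed exactly from the binomial distribution:

```python
from math import comb

def binomial_p_value(heads, flips, p=0.5):
    """One-sided P-value: P(at least `heads` heads in `flips` flips of a
    coin whose per-flip head probability is `p`)."""
    return sum(comb(flips, k) * p**k * (1 - p)**(flips - k)
               for k in range(heads, flips + 1))

# 8 or more heads out of 10: (45 + 10 + 1) / 1024 = 0.0546875
print(binomial_p_value(8, 10))
```

Since 0.0547 is just above the usual 0.05 threshold, this result alone would not be enough to reject the hypothesis that the coin is fair.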
