Monday, September 15, 2014

How to check normality of the data

All parametric tests make certain assumptions about the data. Many of them, such as the F-test and the Z-test, assume that the data is normally distributed, so it is useful to check the assumption of normality before we proceed. This post shares my notes on normality tests.

At a high level, I would group the tests into two categories:
  • Visual tests
  • Statistical tests

Visual tests: These are not the most reliable way to check for normality and can sometimes be ambiguous or misleading. Let's get a high-level overview of how to use them.

Histogram: We can plot a histogram of the observed data and check whether it
  • looks like a bell-shaped curve
  • is not skewed in either direction.

This check breaks down when the data is symmetric and gives a bell-shaped histogram but is NOT actually normal (for example, a heavy-tailed distribution such as Student's t).
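As a rough sketch of the histogram check, the snippet below bins a synthetic sample with NumPy, prints a text histogram, and reports the sample skewness from SciPy. The sample, its mean and standard deviation, and the bin count are all assumptions made up for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic sample for illustration only (assumed mean 50, sd 10).
rng = np.random.default_rng(0)
sample = rng.normal(loc=50.0, scale=10.0, size=1000)

# Bin the data; 15 bins is an arbitrary choice.
counts, edges = np.histogram(sample, bins=15)

# Print one row per bin, scaling the tallest bar to 40 characters.
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    bar = "#" * int(round(40 * count / counts.max()))
    print(f"{left:7.1f} to {right:7.1f} | {bar}")

# Sample skewness: values near 0 suggest a roughly symmetric shape.
skewness = stats.skew(sample)
print(f"skewness = {skewness:.3f}")
```

A skewness near zero only says the histogram is roughly symmetric; it says nothing about whether the tails are heavier or lighter than those of a normal distribution.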

Box plot: For a box plot we would check that
  • there are very few outliers, and none of them lie far away from the box
  • the box plot looks symmetrical, with the median near the middle of the box and whiskers of similar length.

In the plot above, the first box plot clearly shows skewness, while the second looks fairly symmetrical.

Again, as with the histogram, this check is not a good basis for a conclusion in close cases.
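The numbers behind a box plot can also be inspected directly. Here is a minimal sketch, assuming the common convention of flagging outliers beyond 1.5 times the interquartile range from the box; the helper name `box_plot_summary` and the synthetic samples are hypothetical.

```python
import numpy as np

def box_plot_summary(data):
    """Return the quartiles, the two half-box widths, and the outliers."""
    data = np.asarray(data, dtype=float)
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1
    # Fences at 1.5 * IQR from the box (a common box plot convention).
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = data[(data < lower_fence) | (data > upper_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_half": median - q1,  # box width below the median
        "upper_half": q3 - median,  # box width above the median
        "outliers": outliers,
    }

# Synthetic samples for illustration only.
rng = np.random.default_rng(0)
symmetric = rng.normal(loc=0.0, scale=1.0, size=1000)
skewed = rng.exponential(scale=1.0, size=1000)

for name, data in [("symmetric", symmetric), ("skewed", skewed)]:
    s = box_plot_summary(data)
    print(f"{name}: lower half = {s['lower_half']:.2f}, "
          f"upper half = {s['upper_half']:.2f}, "
          f"outliers = {s['outliers'].size}")
```

For symmetric data the two half-box widths should be similar; a clearly wider half on one side points to skewness in that direction.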

Probability plots: We plot the observed data against a theoretical normal distribution. We can do this with the cumulative probabilities of the observed and theoretical data (PP plot) or with their quantiles (QQ plot). We should check whether
  • the points fall close to a straight line
  • the slope of that line is positive.

The QQ plot shown above does not really look like a straight line, so the sampled data might not be a good fit to a normal distribution.
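To put a number on "looks like a straight line", one option is `scipy.stats.probplot`, which returns the QQ plot points together with a least-squares line and its correlation coefficient. This is only a sketch: the helper name `qq_line_fit` and both synthetic samples are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def qq_line_fit(data):
    """Return the slope and correlation r of the normal QQ plot points."""
    # probplot returns ((theoretical quantiles, ordered data),
    #                   (slope, intercept, r)) and draws nothing by default.
    (_, _), (slope, _, r) = stats.probplot(data, dist="norm")
    return slope, r

# Synthetic samples for illustration only.
rng = np.random.default_rng(1)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=500)
skewed_sample = rng.exponential(scale=1.0, size=500)

slope_n, r_n = qq_line_fit(normal_sample)
slope_s, r_s = qq_line_fit(skewed_sample)
print(f"normal sample: slope = {slope_n:.3f}, r = {r_n:.4f}")
print(f"skewed sample: slope = {slope_s:.3f}, r = {r_s:.4f}")
```

An r very close to 1 means the points hug the line. The points themselves can be passed to any plotting library to draw the QQ plot.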

Statistical tests: These are a less ambiguous and more convenient way to assess normality.

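Shapiro-Wilk test: The null hypothesis for this test is that the data is normally distributed. So when we get a p-value that leads to rejecting the null hypothesis (for example, one below a chosen significance level such as 0.05), we conclude that the data is NOT normal. A large p-value only means the test found no evidence against normality. In published power comparisons, this is generally found to be the most powerful of the normality tests listed here.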
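A minimal sketch of this decision rule with `scipy.stats.shapiro` follows. The 0.05 significance level, the helper name `looks_normal_shapiro`, and the synthetic samples are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05  # assumed significance level

def looks_normal_shapiro(data, alpha=ALPHA):
    """Return (fail_to_reject, W statistic, p-value) for Shapiro-Wilk."""
    statistic, p_value = stats.shapiro(data)
    # Reject normality only when the p-value falls below alpha.
    return p_value >= alpha, statistic, p_value

# Synthetic samples for illustration only.
rng = np.random.default_rng(2)
samples = {
    "normal": rng.normal(loc=0.0, scale=1.0, size=200),
    "skewed": rng.exponential(scale=1.0, size=200),
}

for name, data in samples.items():
    ok, w, p = looks_normal_shapiro(data)
    verdict = "no evidence against normality" if ok else "NOT normal"
    print(f"{name}: W = {w:.4f}, p = {p:.4g} -> {verdict}")
```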

Anderson-Darling test: This is one of the most powerful tests for detecting deviations from normality, second only to Shapiro-Wilk among the tests here. More generally, it is used to check whether the observed data follows a specified distribution.
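In SciPy this test is available as `scipy.stats.anderson`. The sketch below assumes a SciPy version in which the result exposes `statistic`, `critical_values` and `significance_level`, and compares the statistic with the critical value at the 5% level instead of reading a p-value. The helper name and the synthetic sample are hypothetical.

```python
import numpy as np
from scipy import stats

def anderson_rejects_normality(data, level_percent=5.0):
    """Return (reject, statistic, critical value) at the given level."""
    result = stats.anderson(data, dist="norm")
    # significance_level holds percentages such as 15, 10, 5, 2.5, 1.
    matches = np.flatnonzero(
        np.isclose(result.significance_level, level_percent)
    )
    if matches.size == 0:
        raise ValueError(f"no critical value for {level_percent}% level")
    critical = result.critical_values[matches[0]]
    # Normality is rejected when the statistic exceeds the critical value.
    return result.statistic > critical, result.statistic, critical

# Synthetic skewed sample for illustration only.
rng = np.random.default_rng(3)
skewed_sample = rng.exponential(scale=1.0, size=200)

reject, a2, crit = anderson_rejects_normality(skewed_sample)
print(f"A^2 = {a2:.3f}, 5% critical value = {crit:.3f}, reject = {reject}")
```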

Kolmogorov-Smirnov test: This is a non-parametric test used to compare the observed data with a specified distribution. It is NOT as powerful as the previous two tests.
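A minimal sketch with `scipy.stats.kstest` is shown here. It assumes the mean and standard deviation of the reference normal distribution are specified in advance; if they are estimated from the same sample, the standard p-value is no longer exact, which is what the Lilliefors variant corrects for. The helper name, the parameters (10, 10), and the synthetic sample are made up for illustration.

```python
import numpy as np
from scipy import stats

def ks_against_normal(data, mean, sd):
    """One-sample KS test against a fully specified normal distribution."""
    result = stats.kstest(data, "norm", args=(mean, sd))
    return result.statistic, result.pvalue

# Synthetic skewed sample for illustration only; an exponential with
# scale 10 has mean 10 and standard deviation 10.
rng = np.random.default_rng(4)
skewed_sample = rng.exponential(scale=10.0, size=200)

# Hypothesized reference distribution: normal with mean 10 and sd 10.
d_stat, p_value = ks_against_normal(skewed_sample, mean=10.0, sd=10.0)
print(f"D = {d_stat:.3f}, p = {p_value:.4g}")
```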

Chi-square test: This is another goodness-of-fit test, which tests the same null hypothesis as the Kolmogorov-Smirnov test above: that the observed data follows the specified distribution.
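One way to apply the chi-square test to continuous data is to bin it and compare observed bin counts with the counts expected under a fitted normal distribution. The sketch below is one possible setup, not the only one: it assumes 10 equal-probability bins, estimates the mean and standard deviation from the sample, and passes `ddof=2` to `scipy.stats.chisquare` to account for the two estimated parameters, so the resulting p-value is approximate. The helper name `chi_square_normality` is hypothetical.

```python
import numpy as np
from scipy import stats

def chi_square_normality(data, n_bins=10):
    """Chi-square goodness-of-fit test against a fitted normal."""
    data = np.asarray(data, dtype=float)
    n = data.size
    mean, sd = data.mean(), data.std(ddof=1)
    # Interior edges that give each bin probability 1/n_bins under the
    # fitted normal; the outer bins extend to -inf and +inf.
    interior_edges = stats.norm.ppf(
        np.arange(1, n_bins) / n_bins, loc=mean, scale=sd
    )
    bin_index = np.searchsorted(interior_edges, data)
    observed = np.bincount(bin_index, minlength=n_bins)
    expected = np.full(n_bins, n / n_bins)
    # ddof=2 because the mean and sd were estimated from the data.
    statistic, p_value = stats.chisquare(observed, f_exp=expected, ddof=2)
    return statistic, p_value

# Synthetic skewed sample for illustration only.
rng = np.random.default_rng(5)
skewed_sample = rng.exponential(scale=10.0, size=200)

chi2_stat, chi2_p = chi_square_normality(skewed_sample)
print(f"chi-square = {chi2_stat:.2f}, p = {chi2_p:.4g}")
```

The result depends on how the bins are chosen, which is one reason this test tends to be less convenient for continuous data than the tests above.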

If you want to explore further, there is an interesting published paper comparing these tests.
Paper: Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests
