Thursday, February 6, 2014

Chi-squared test: concept and example

Chi-squared is one of the important tests that help us understand the role of random chance in the variation between categorical variables, that is, whether the variation is random or reflects a real relationship.

It is denoted by χ² and pronounced Kai-squared or Kai-square test (which I learned after a few awkward situations). Let's try to understand how it's calculated using an example.


Let's assume we have a die which is 1-5-6 loaded (it favors 1, 5, or 6 due to altered weight). We are supposed to determine whether the die is loaded, with 95% confidence.

Let's roll the die 100 times every day for 6 days, 600 rolls in total. If the die is fair then theoretically each number should come up 100 times after 6 days.

  • Null hypothesis: H0 - the die is fair. 
  • Alternative hypothesis: H1 - the die is NOT fair.
Now we have to determine whether the observed variation is due to random chance or goes beyond what random chance should allow. In other words, we have to see how far our data can vary before we must reject the null hypothesis and conclude that the die is NOT fair. To do this, all we have to do is:

  • Based on the given conditions, derive the critical (threshold) Chi-squared value (part 1)
  • Based on the observations, find the actual Chi-squared value (part 2)
  • Compare the two and, based on the result, reject or keep the null hypothesis (part 3)

Part 1:

In order to find the critical Chi-squared value we need the tolerated error probability and the degrees of freedom.

The required confidence level is 95%, so the error probability we tolerate (the significance level, loosely referred to here as the p-value) is 0.05, and a result counts as significant when its p-value is <= 0.05. This value depends on the tolerance we have: if our tolerance is 10%, then it is 0.10.

Degrees of freedom (df) = Number of categories - 1 = ( 6 - 1 ) = 5
Critical value = CHIINV(error, df) = CHIINV(0.05, 5) = 11.07
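
CHIINV is a spreadsheet function, so here is a rough sketch of the same lookup in Python using only the standard library. The helper names `chi2_cdf` and `chi2_critical` are my own and purely illustrative; in practice a statistics library such as SciPy would normally be used instead of a hand-written series and bisection.

```python
import math

def chi2_cdf(x, df):
    """CDF of the chi-squared distribution, computed from the series
    for the regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    a = df / 2.0
    t = x / 2.0
    # P(a, t) = t^a * e^-t / Gamma(a + 1) * sum_{n>=0} t^n / ((a+1)...(a+n))
    term = 1.0
    total = 1.0
    n = 1
    while term > 1e-15 * total:
        term *= t / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(t) - t - math.lgamma(a + 1.0))

def chi2_critical(alpha, df):
    """Value x with P(X > x) = alpha (right-tail), found by bisection."""
    lo, hi = 0.0, 1.0
    # Grow the upper bound until the right-tail probability drops below alpha.
    while 1.0 - chi2_cdf(hi, df) > alpha:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if 1.0 - chi2_cdf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

critical_95 = chi2_critical(0.05, 5)
print(round(critical_95, 2))  # 11.07
```

Calling `chi2_critical(0.01, 5)` gives the stricter threshold for a 99% confidence level in the same way.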

This means our threshold (the critical value) for Chi-squared is 11.07. If the actual Chi-squared value for our die is greater than 11.07, we reject the null hypothesis H0 and claim the die is NOT fair.

Part 2:

As shown above, under the given conditions the Chi-squared value of the observed die must NOT be greater than 11.07 if we are to keep the null hypothesis.

The actual Chi-squared value is calculated as the summation Σ [ ( O - E ) ^ 2 / E ] over all six faces, where O is the observed count and E is the expected count (100 for each face here).

Refer to the following screenshot for the calculations.
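
To make the formula concrete, here is a small Python sketch. The counts below are hypothetical: they are NOT the observations from my experiment, just made-up numbers for a 1-5-6 loaded die that sum to 600 rolls and happen to produce the same statistic of 12.26. The function name `chi_squared_statistic` is also only illustrative.

```python
# Hypothetical counts for faces 1..6 after 600 rolls (illustrative only,
# not the actual observations from the experiment).
observed = [118, 82, 87, 89, 112, 112]

# A fair die is expected to show each face equally often: 600 / 6 = 100.
expected = [sum(observed) / len(observed)] * len(observed)

def chi_squared_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

statistic = chi_squared_statistic(observed, expected)
print(round(statistic, 2))  # 12.26
```

Notice that faces 1, 5, and 6 sit above 100 and the others below, which is the pattern a 1-5-6 loaded die would tend to produce.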

Part 3: 

Now you can see the observed Chi-squared value is greater than 11.07, so we must reject the null hypothesis and conclude the die is NOT fair.

Effect of p-value

As we have seen, the p-value threshold (significance level) expresses how much error we tolerate in our result.
Let's change the given conditions. Now we need to be 99% confident that the die is NOT fair, which means the observed values must differ very significantly from the expected ones.

Let's re-calculate the critical value of Chi-squared based on the new conditions: Critical value = CHIINV(0.01, 5) = 15.09.

So under the new conditions, we can reject the null hypothesis only if the Chi-squared value is greater than 15.09.

However, for the observed data (which is the same) the Chi-squared value is 12.26, which is NOT greater than 15.09. So we cannot reject the null hypothesis, and we treat the die as fair.
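
The two outcomes can be put side by side in a short Python sketch. It uses only the numbers quoted in this post (12.26, 11.07, and 15.09); the function name `decide` is illustrative.

```python
def decide(statistic, critical_value):
    """Decision rule: reject H0 only when the statistic exceeds the threshold."""
    if statistic > critical_value:
        return "reject H0"
    return "fail to reject H0"

statistic = 12.26  # observed Chi-squared value from the post

# Tolerated error probability -> critical value for df = 5 (values from the post).
critical_values = {0.05: 11.07, 0.01: 15.09}

for alpha, critical in critical_values.items():
    print(f"alpha={alpha}: critical={critical}, {decide(statistic, critical)}")
```

The same data leads to opposite decisions because only the threshold moved, not the statistic.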

In conclusion, with a higher confidence level (like 99% in the last case) the observed behavior must differ more strongly from random chance (a higher observed Chi-squared value) before we can reject the null hypothesis.
