Thursday, February 6, 2014

Signal and Noise: Optimizing the predictive model

Signal is the useful information in our inputs, and we would like to maximize it.
Noise is the useless information in our inputs, and we would like to remove as much of it as possible.

Unfortunately, in real-life scenarios every input to a model contains both signal and noise, so we have to optimize the balance between the two.

For any predictive model, we can check whether the inputs are really related to the output by running association tests. The lower the p-value, the stronger the evidence of association. Next, we arrange the predictors in decreasing order of association and add them one by one to build the predictive model. Every time we add a predictor, we build a new model on the training data and then evaluate it against the testing data.
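The ordering and one-by-one addition described above can be sketched roughly as follows. This is only an illustration: the function names, the predictor names other than RS1, and the p-values are made up for the example, and `fit_and_score` is a hypothetical callback standing in for whatever "fit on train, score on test" routine you actually use.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def rank_by_association(p_values: Dict[str, float]) -> List[str]:
    """Order predictors from smallest p-value to largest."""
    return sorted(p_values, key=lambda name: p_values[name])


def forward_models(
    ranked: Sequence[str],
    fit_and_score: Callable[[List[str]], float],
) -> List[Tuple[List[str], float]]:
    """Build one model per prefix of the ranked list and score each one.

    fit_and_score is a hypothetical callback: it should fit a model on the
    training data using the given predictors and return its score (for
    example AUC) on the testing data.
    """
    results = []
    for k in range(1, len(ranked) + 1):
        subset = list(ranked[:k])  # first k predictors, strongest first
        results.append((subset, fit_and_score(subset)))
    return results


# Invented p-values, for illustration only.
example_p_values = {"RS1": 0.001, "RS2": 0.20, "RS3": 0.03}
ranked_predictors = rank_by_association(example_p_values)
print(ranked_predictors)  # ['RS1', 'RS3', 'RS2']
```

Keeping the fitting step behind a callback means the same loop works regardless of which modelling library produces the score.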

Let's use 'Area Under the Curve' (AUC) as the evaluator of the model.
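For readers who want to see what this number measures, here is a minimal sketch, assuming AUC means the area under the ROC curve for a binary outcome coded 0/1. It uses the equivalent pairwise definition: the fraction of (positive, negative) pairs in which the positive case gets the higher predicted score, with ties counted as half. The function name `auc` and the sample values are illustrative only; in practice a library routine would normally be used.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Tiny made-up example: 3 of the 4 positive/negative pairs are ordered correctly.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 0.5 corresponds to random ordering and 1.0 to perfect separation, which is why a rising AUC on the testing data is read as added signal. The double loop is quadratic in the number of cases, so it is meant for understanding rather than for large data sets.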

If you look at the screenshot below, we have built three predictive models on the training data, called 'train', and evaluated them against the testing data, called 'test'. The area under the curve (AUC) keeps increasing as we add each new predictor.

However, as we go on, we might find that each newly added predictor leads to a decrease in the AUC value. This means that those predictors are adding more noise than signal, and we should probably not include them.

So the summary chart for all models might look like this:

As you can see in the screenshot above, the AUC kept increasing with every additional risk score up to model 7, but after that it keeps decreasing.
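Picking the turning point from such a summary can also be done programmatically. The sketch below assumes the test AUC of each model is held in a list in model order; the function name and the AUC values are invented placeholders shaped like the pattern described here, not the numbers from the screenshot.

```python
from typing import Sequence


def best_model_number(test_aucs: Sequence[float]) -> int:
    """Return the 1-based number of the model with the highest test AUC.

    On ties, the earliest (smallest) model is returned.
    """
    if not test_aucs:
        raise ValueError("no AUC values given")
    best_index = max(range(len(test_aucs)), key=lambda i: test_aucs[i])
    return best_index + 1


# Invented placeholder values: rising up to model 7, then falling.
example_aucs = [0.60, 0.68, 0.70, 0.71, 0.72, 0.73, 0.74, 0.73, 0.72]
print(best_model_number(example_aucs))  # 7
```

Preferring the smaller model on ties is a design choice in this sketch: with equal AUC, fewer predictors means less opportunity for noise.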

Let's plot the AUC values for all models:

We can see a comparatively big information gain at model 2 (after adding RS1: risk score 1), which can be interpreted as RS1 being the most useful risk score predictor. The plot also confirms that model 7 is the best model we have in this particular case, with maximized signal.
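The "information gain" read off the plot can be approximated as the change in test AUC from one model to the next. The following is a small sketch under that assumption; the function names and the AUC values are made up for illustration and are not the values from the plot.

```python
from typing import List, Sequence


def auc_gains(test_aucs: Sequence[float]) -> List[float]:
    """Change in AUC from each model to the next one."""
    return [later - earlier for earlier, later in zip(test_aucs, test_aucs[1:])]


def model_with_largest_gain(test_aucs: Sequence[float]) -> int:
    """1-based number of the model whose added predictor raised AUC the most."""
    gains = auc_gains(test_aucs)
    if not gains:
        raise ValueError("need at least two models to compute a gain")
    best = max(range(len(gains)), key=lambda i: gains[i])
    return best + 2  # gains[0] is the step from model 1 to model 2


# Invented placeholder values with the biggest jump at model 2.
example_aucs = [0.60, 0.68, 0.70, 0.71]
print(model_with_largest_gain(example_aucs))  # 2
```

A negative gain in this list marks a predictor that lowered the test AUC, that is, one that added more noise than signal.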
