Monday, July 29, 2013

Two heads are better than one in data science

We have all heard the saying "two heads are better than one". The same has been shown to hold in data science, where the approach goes by the name Overkill Analytics.

A simple description of the approach is given in this article. The author has used it repeatedly, including in a competition on Kaggle.

Simply put, the approach involves making predictions with several simple models and then combining those models into an ensemble for the final prediction. Sophisticated models are skipped in favor of brute force applied to simple ones. Further, the simple models should be chosen so that they complement each other. That way each model brings its own strengths to bear on the final prediction, while the noise in the individual predictions tends to cancel out. Hence the title: two models (or two heads) are better than one.

The article also gives an example of the approach in R. I plan to give a similar example using Python.
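
In the meantime, here is a rough sketch of the core idea in plain Python. It is not the article's R example. The synthetic data, the choice of a straight-line fit and a nearest-neighbour average as the two simple models, and all function names are assumptions made purely for illustration.

```python
import math
import random


def fit_linear(xs, ys):
    """Simple model 1: least-squares straight line (captures the overall trend)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_knn(xs, ys, k=5):
    """Simple model 2: average of the k nearest training points (captures local shape)."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)

    return predict


def ensemble(models):
    """Combine models by taking the plain average of their predictions."""
    return lambda x: sum(m(x) for m in models) / len(models)


def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def make_data(n, rng):
    """Synthetic data (an assumption): linear trend plus a wiggle plus noise."""
    xs = [rng.uniform(0, 5) for _ in range(n)]
    ys = [x + math.sin(3 * x) + rng.gauss(0, 0.5) for x in xs]
    return xs, ys


rng = random.Random(0)  # fixed seed so the run is repeatable
train_x, train_y = make_data(200, rng)
test_x, test_y = make_data(200, rng)

linear = fit_linear(train_x, train_y)
knn = fit_knn(train_x, train_y, k=5)
combined = ensemble([linear, knn])

for name, model in [("linear", linear), ("knn", knn), ("ensemble", combined)]:
    print(f"{name:8s} test MSE = {mse(model, test_x, test_y):.3f}")
```

With a plain average, the ensemble's squared error can never be worse than the average of the individual models' squared errors. Whether it also beats the better of the two depends on how well the models complement each other, which is why the choice of models matters.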