Wednesday, July 31, 2013

Randy Pausch and working in groups

Randy Pausch is one of the people I admire. His last lecture is just awesome. I frequently remind myself of his statement "Brick walls are there for a reason". 

Anyhow, I recently came across another article of his, with tips for working successfully in a group. Great advice indeed.

Monday, July 29, 2013

Eclipse on Linux

It's been a long time since I used Eclipse. I am planning to get back to some Java programming, so I installed Eclipse on Scientific Linux (SL) 6.3 today. It should have been easy, except that I overlooked a mismatch: I had 64-bit Java on my VM but had downloaded the 32-bit Eclipse release. This resulted in the error "JVM terminated Exit code=13". I wish the error message were less cryptic.

Anyhow, googling it showed me the reason for the error. A quick search turned up this article, which got Eclipse starting from the command line. But I also wanted to add it to the GNOME launcher. More searching turned up this article.

So now I can start Eclipse on my SL 6.3 VM by selecting "Applications" -> "Programming".

Searching through big data

Ever found yourself having to search through lots of data? I came across an article that describes how to use simple Unix commands for this. For details, check this article.
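The article sticks to Unix commands. For comparison, here is a minimal sketch of the same idea in Python: a grep-like search that streams through the data one line at a time, so memory use stays flat however large the file is. This is my own rough analogue, not code from the article; the function names `search_lines` and `search_file` and the sample log lines are made up for illustration.

```python
# Minimal sketch: a grep-like streaming search (an assumed Python analogue,
# not the Unix commands from the article).
import re


def search_lines(lines, pattern):
    """Yield (line_number, line) for each line matching the regex pattern."""
    regex = re.compile(pattern)
    for number, line in enumerate(lines, start=1):
        if regex.search(line):
            yield number, line.rstrip("\n")


def search_file(path, pattern):
    """Stream a file line by line instead of loading it all into memory."""
    with open(path, "r", errors="replace") as handle:
        yield from search_lines(handle, pattern)


# Usage on an in-memory sample (made-up log lines).
sample = ["GET /index.html 200\n", "GET /missing 404\n", "POST /login 500\n"]
for number, line in search_lines(sample, r" (404|500)$"):
    print(number, line)
```

For a real file, `search_file("access.log", r" 500$")` would work the same way (the file name here is hypothetical).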

Two heads are better than one in data science

We have heard the saying "two heads are better than one". This has also been shown in the field of data science. The term used to describe this is Overkill analytics.

A simple description of the approach is given in this article. The author has used the approach in several places, including a competition on Kaggle.

Simply put, the approach involves making predictions with several simple models and then using an ensemble of these models for the final prediction. Sophisticated models are skipped in favor of brute-force methods on simple ones. Further, the simple models should be chosen so that they complement each other. This way each model brings its own strengths to bear on the final prediction while helping to cancel out the noise. Hence the title: two models (or two heads) are better than one.

An example of using the model in R is also given. I plan to give a similar example using Python.
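Until then, here is a minimal sketch of the ensemble idea in plain Python. It is not the author's R code or his Kaggle setup. I am assuming two simple models that err in different ways (a least-squares line and a 2-nearest-neighbour average) and combining them with an unweighted mean. The training data and the function names are made up for illustration.

```python
# Minimal sketch: average the predictions of two simple, complementary models.
# The data and both models are illustrative assumptions, not the article's code.


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns a predict function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_knn(xs, ys, k=2):
    """k-nearest-neighbour regressor: mean y of the k closest training points."""
    points = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(points, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)

    return predict


def ensemble(models):
    """Combine models by taking the unweighted mean of their predictions."""
    return lambda x: sum(m(x) for m in models) / len(models)


# Toy training data (made up for illustration).
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [1.2, 1.9, 3.2, 3.8, 5.1]

line = fit_line(train_x, train_y)
knn = fit_knn(train_x, train_y, k=2)
combined = ensemble([line, knn])

for x in (2.5, 4.5):
    print(x, round(line(x), 3), round(knn(x), 3), round(combined(x), 3))
```

The line captures the global trend and the nearest-neighbour model captures local behaviour, so the average leans on both. A plain mean is the simplest way to combine them; weighting the models by their validation error would be a natural next step.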

Sunday, July 28, 2013

Big Data

The current trend towards increased connectivity has an interesting byproduct: lots of data. Data which can enable you to predict better, data which can enable you to analyze better.

There is a big movement to take advantage of this deluge of data. This is the Big Data movement. I have been working lately on several Big Data technologies. These include Hadoop, MapReduce, NoSQL, etc.

Why the new technologies? Because existing technologies are not sufficient to handle the amount of data that we are faced with. These new technologies bring a multitude of open problems with them, so this is an interesting area going forward.

Will start posting more details about these concepts as well as how to get your hands dirty with this stuff.