Pitfalls of Big Data II: Bad Data
Hi,
Today I’ll talk about the input to big data systems, i.e., the data.
We all implicitly know that bad data leads to bad models, and I guess bad data is also the most talked-about aspect of the darker side of big data. Still, I think how easy it is to end up with bad data is underappreciated.
Suppose you want to know the average number of friends a person on campus has. You go to all your friends, ask each of them how many friends they have, and average the numbers. Sounds good? Unfortunately, it is very likely that this number will be larger than the actual average. This is because you are more likely to be friends with people who have a larger number of friends, so people with many friends are overcounted. Think of it this way: suppose there is some person who has no friends. Then there is no chance that this person is friends with you, so you will never count that person, though she should be counted in the population average. Similarly, it is unlikely that someone with only one friend will be your friend, but if there is someone who is friends with almost everyone, then it is very likely that you are her friend too. So you end up counting people with more friends more often, which won’t happen in a truly random sample.
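To make the overcounting concrete, here is a minimal simulation sketch. Everything in it is made up for illustration: the campus is a random network of 500 people in which every pair is friends with probability 0.02, and the function names (`random_friendships`, `true_average`, `average_reported_by_friends`) are my own. It compares the true average number of friends with the average you would get by pooling the answers collected from everyone's friends.

```python
import random


def random_friendships(n_people, p_friend, seed=0):
    """Build a toy campus: each pair of people is friends with probability p_friend."""
    rng = random.Random(seed)
    friends = {person: set() for person in range(n_people)}
    for a in range(n_people):
        for b in range(a + 1, n_people):
            if rng.random() < p_friend:
                friends[a].add(b)
                friends[b].add(a)
    return friends


def true_average(friends):
    """Average number of friends over the whole population."""
    return sum(len(f) for f in friends.values()) / len(friends)


def average_reported_by_friends(friends):
    """Everyone asks each of their friends 'how many friends do you have?'.

    A person with k friends gets asked k times, so they appear k times
    in the pooled answers. A person with no friends never appears.
    """
    answers = [len(friends[f]) for person in friends for f in friends[person]]
    return sum(answers) / len(answers)


campus = random_friendships(n_people=500, p_friend=0.02, seed=42)
print("true average:          ", true_average(campus))
print("average asked of friends:", average_reported_by_friends(campus))
```

In this toy network everyone is fairly similar, so the gap is modest. The more unequal the friend counts are, the larger the overestimate becomes, since a few very popular people get asked again and again.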
This also illustrates why it is a bad idea to draw general conclusions from a survey you run on social media. Your friends/followers on social media are more likely to be like you than like the general population, so whatever conclusions you draw from your survey might be true for your friends but not necessarily for everyone else. <Rant> This is the main reason why the demonetization survey was such epic bullshit. It was popularized on social media, the survey app ran on smartphones, and the questions were in English. Someone who follows the people popularizing the app (likely ruling party supporters), knows English, and owns a smartphone is not the average Indian. </Rant>
The bias which creeps in because of the data selection process is called selection bias. The problem with selection bias is that it doesn’t get fixed with more data. In fact, more data makes things worse, as it gives false confidence in the results. This happened in the opinion polls for the 1936 US presidential election. Literary Digest had around 2.4 million survey responses and predicted that Alfred Landon would win the election, whereas George Gallup, with a much smaller sample of about 50 thousand data points, predicted that Roosevelt would win, and he was right by a good margin. What happened was that Literary Digest sent out the surveys through magazine subscriptions, and poorer households weren’t very likely to have magazine subscriptions, so their opinions weren’t counted enough. Gallup, on the other hand, had corrected for such factors and was able to give a more accurate result with less data.
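Here is a small sketch of why a huge biased sample loses to a small representative one. The numbers are invented for illustration and are not the historical 1936 figures: I assume 30% of voters are magazine subscribers, that subscribers support the candidate at 35% and everyone else at 70%, and that the big poll reaches subscribers 90% of the time. The `poll` function and all constants are hypothetical.

```python
import random

# Hypothetical population: two groups with different levels of support.
TRUE_SUPPORT = {"subscriber": 0.35, "non_subscriber": 0.70}
SUBSCRIBER_SHARE = 0.30  # assumed share of subscribers among all voters


def poll(n_responses, p_subscriber, rng):
    """Estimate support from n_responses, where each respondent is a
    subscriber with probability p_subscriber."""
    votes = 0
    for _ in range(n_responses):
        group = "subscriber" if rng.random() < p_subscriber else "non_subscriber"
        if rng.random() < TRUE_SUPPORT[group]:
            votes += 1
    return votes / n_responses


rng = random.Random(1936)

# True support in the whole population, weighted by group size.
population_support = (
    SUBSCRIBER_SHARE * TRUE_SUPPORT["subscriber"]
    + (1 - SUBSCRIBER_SHARE) * TRUE_SUPPORT["non_subscriber"]
)

# Huge poll that mostly reaches subscribers vs. a small representative poll.
big_biased = poll(200_000, p_subscriber=0.90, rng=rng)
small_random = poll(2_000, p_subscriber=SUBSCRIBER_SHARE, rng=rng)

print("population support:", population_support)
print("big biased poll:   ", big_biased)
print("small random poll: ", small_random)
```

Making the biased poll even bigger only makes its estimate more stable around the wrong value. The sampling noise shrinks, but the gap caused by who gets asked stays exactly where it was.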
To fix selection bias, one needs to reason about the process through which the data was collected, and this information is often not available in the data itself. The problem of bad data arises often in the sciences, and scientists (both social and natural) have to be very careful in designing their experimental setups so that unintended factors don’t affect the data they get. But the big data analytics we see nowadays seems to be completely oblivious to how the data came to be.
Previous Post: Pitfalls of Big Data I: What do you care about?
Next Post: Pitfalls of Big Data III: Fear of Unknown