Big Data - Treehouse

Published on October 2, 2015 by Jensen Goh

This article is part of the science and technology column at Treehouse. This column aims to distill issues related to science and technology, presenting them in an easy-to-read, digestible format.

“Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate.” — Wikipedia (Image source)

As university students, we are often tasked with endless readings of reports and experiments, in which we are challenged to critically examine and question previous research. A common source of possible error in an experiment’s setup is that the sample size was simply not big enough, so the results could have arisen from an inherent bias. Enter big data, the next step in science, the study of trends, human behaviour and whatever else can be gleaned from social media. Or is it?

“Database, database just living in the database” (Image source)

Big data has possibly been the most exciting buzzword in the tech world for some time now. Buzzword being the keyword. The difficulty in approaching conversations about big data is the vast number of interpretations of the subject itself. Sure, it is big, but even on that matter most would disagree on how big a set of data needs to be before it can be considered big data. Then again, who cares? All we know is that it is supposedly big enough to eliminate doubts about sample size and about whether the correlations and trends observed are significant. A study of emotional contagion on Facebook showed that a reduction of negative comments on a person’s feed had a minuscule impact on their own expression of positivity and negativity. In the regular studies and research we read about, such a small percentage might have been brushed off as insignificant and attributed to random deviation. The researchers argued, however, that by virtue of the sheer amount of data collected, any deviation, no matter how small, is significant.
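To see why sheer volume can turn a tiny deviation into a "significant" one, here is a minimal sketch of a two-sample z-test using only the Python standard library. The effect size, standard deviation and group sizes are made-up illustrative numbers, not figures from the Facebook study, and the function name is hypothetical.

```python
import math

def two_sided_p_value(mean_diff, std_dev, n_per_group):
    """Two-sample z-test p-value for a difference in means (equal group sizes)."""
    # The standard error shrinks as the groups grow.
    standard_error = std_dev * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    # Two-sided p-value under the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical numbers: a shift of 0.01 on a scale whose standard deviation is 1.
for n in (100, 10_000, 1_000_000):
    p = two_sided_p_value(mean_diff=0.01, std_dev=1.0, n_per_group=n)
    print(f"n per group = {n:>9,}  p-value = {p:.4g}")
```

The same small shift is indistinguishable from noise with a hundred people per group and overwhelmingly significant with a million. Note that statistical significance only says the shift is unlikely to be chance. It does not say the shift is large enough to matter.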

How does it work?

Much like how the lazy college student identifies the common things asked in class (What potential biases are there in this paper? What else do you think might be able to account for the researcher’s results?) and seeks to gain class participation points without doing the readings by regurgitating the same common inevitabilities, big data achieves its outcomes through trend identification of key terms.

Say you find yourself in the shoes of that same lazy college student, who has over fifty readings to complete for his examination next week. Rather than go through each and every paper and immerse yourself in the details and fine print, you’re going to need a strategy to get the gist of the readings in time for the examination. A common strategy would be to simply read the abstract and conclusion of each paper and look out for key findings. An alternative approach would be to search for keywords you think are important and look up statements, literature reviews and findings regarding each search term. Either way, the large amount of information presented is filtered in order to cope. Similarly, in big data, specific items are selected and searched for in order to filter out noise that might make it difficult to derive a rigorous, definite statement at the end.
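The keyword strategy above can be sketched in a few lines of Python. This is only an illustration of filtering by search term: the sample "papers", the keywords and the function name are all invented for the example, and splitting sentences on full stops is a deliberate simplification.

```python
def filter_by_keywords(documents, keywords):
    """Return (title, matching_sentences) for documents that mention any keyword."""
    wanted = [k.lower() for k in keywords]
    results = []
    for title, text in documents.items():
        # Naive sentence split; good enough for a toy example.
        sentences = [s.strip() for s in text.split(".") if s.strip()]
        hits = [s for s in sentences if any(k in s.lower() for k in wanted)]
        if hits:
            results.append((title, hits))
    return results

# Hypothetical readings, keyed by title.
readings = {
    "Paper A": "We surveyed 40 students. Sample size was a limitation. Cats were not involved.",
    "Paper B": "The weather was pleasant. Lunch was served at noon.",
    "Paper C": "A larger sample size reduced bias. Results were replicated.",
}

for title, hits in filter_by_keywords(readings, ["sample size", "bias"]):
    print(title, "->", hits)
```

Everything that does not mention a chosen keyword is thrown away, which is exactly what makes the approach fast and also what makes it easy to miss something important.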

Now, with the way academic papers work, there are going to be different commentaries and critiques within the information that do not necessarily match and might even conflict with each other. So once again, you come up with another cheap strategy that does not require you to read all 50 texts to decide who is right or wrong. You match the statements to the dates of publication, the number of references in the paper, or the number of times a cat is mentioned (one of these three is clearly less effective) to decide which statements hold the most weight. This is where the magic of big data comes in: we have the computer do this through algorithms such as statistical matching and linear regression, which are more accurately explained here.
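As a rough idea of what linear regression does, the sketch below fits a straight line by ordinary least squares. The data are hypothetical: a made-up "credibility" score plotted against the number of references in each paper. The function name is likewise an assumption for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: references per paper vs. an invented credibility score.
references = [10, 20, 30, 40, 50]
credibility = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(references, credibility)
print(f"credibility ~ {slope:.3f} * references + {intercept:.3f}")
```

The fitted slope says how much the score tends to change per extra reference. It describes a trend in the data and nothing more: the line would fit just as happily to cat mentions.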

So the lazy college student decides in the end to match the number of times cats are mentioned in the text to the abstracts to decide which papers most accurately represent his area of study. He walks into the examination hall and answers the questions with ease, writing down the bits of abstract that best fit each question and drawing cat icons at the bottom of the page to increase the credibility of his work. Surely he would be scoring an A now, right?

Unfortunately no.

Simply irresistible, this will make my article more credible. (Image source)

Internet power

Similar to the lazy college student, Google was enthralled by the idea of piecing correlations together. It attempted to predict flu trends by finding search terms from its search engine that correlated with flu rates obtained from disease control centers. In what was considered the biggest disappointment in the application of big data, the entire project proved to be a bust: its predictions fared badly, underrepresenting major spikes in flu trends and often turning up false positives. It turned out that matching the search terms to rises and falls in temperature would have been a more valid application of the Google Flu Trends model. Without any theory to link the two variables, the correlation between the search terms and actual flu rates turned out to be a spurious one.
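A spurious correlation of this kind is easy to reproduce with simulated data. In the sketch below, which uses invented numbers and is not Google's model or data, both a search term and the flu rate depend on temperature and on nothing else they share. The two still correlate strongly, and the correlation should largely disappear once the effect of temperature is removed from each.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def residuals(ys, xs):
    """What is left of ys after subtracting its best straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

random.seed(0)  # fixed seed so the simulation is repeatable

# Ten years of hypothetical weekly data with a seasonal temperature cycle.
temperature = [15 - 10 * math.cos(2 * math.pi * week / 52) + random.gauss(0, 2)
               for week in range(520)]
# Both series rise when it is cold but are otherwise unrelated (an assumption).
searches = [100 - 3 * t + random.gauss(0, 5) for t in temperature]
flu_rate = [50 - 2 * t + random.gauss(0, 5) for t in temperature]

raw = pearson(searches, flu_rate)
adjusted = pearson(residuals(searches, temperature),
                   residuals(flu_rate, temperature))
print(f"raw correlation:                  {raw:.2f}")
print(f"after accounting for temperature: {adjusted:.2f}")
```

The search term looks like a fine flu predictor until the season changes in a way the flu does not follow, at which point the prediction breaks.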

There has to be a link somewhere! (Image source)

Clearly, more than simply searching for correlations is needed if big data analysis is to go anywhere. Google Translate, once a steaming pile of disjointed gibberish if you attempted to input anything with context or nuance, would never have achieved the quality of translation seen in recent times on correlation alone. The translation system works through statistical matching of translated reports: it searches for terms that have been translated previously and turns up the most common translation. How did it progress from a poor man’s ‘lost in translation’ meme generator to today’s increasingly accurate (and actually useful) translations? User feedback.

Notice the ‘Was this translation helpful?’ button at the corner of the translation box? That button allowed the system to learn and refine itself by adding another point of reference that Google knew would be directly related to the correlations found by its algorithms (people’s appreciation versus the accuracy of findings). Theory, of sorts, is necessary in the application of big data to prevent it from simply being hubris. There has to be an explanation to reach a definitive conclusion.
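The combination of frequency matching and user feedback can be illustrated with a toy lookup table. This is a sketch of the general idea only, not how Google Translate is actually built: the class, its methods and the feedback weight of 2 are all assumptions made for the example. It uses the French word "avocat", which can mean either "lawyer" or "avocado".

```python
from collections import Counter, defaultdict

class PhraseTable:
    """Toy phrase lookup that returns the highest-scoring known translation."""

    def __init__(self):
        # source phrase -> Counter of candidate translations and their scores
        self.scores = defaultdict(Counter)

    def observe(self, source, target, weight=1):
        """Record that `source` was seen translated as `target`."""
        self.scores[source][target] += weight

    def translate(self, source):
        """Return the best-scoring translation, or None if the phrase is unseen."""
        candidates = self.scores.get(source)
        if not candidates:
            return None
        return candidates.most_common(1)[0][0]

    def feedback(self, source, target, helpful):
        """Nudge a translation up or down based on a user's vote."""
        self.observe(source, target, weight=2 if helpful else -2)

table = PhraseTable()
for _ in range(3):
    table.observe("avocat", "avocado")
for _ in range(2):
    table.observe("avocat", "lawyer")

before = table.translate("avocat")
table.feedback("avocat", "avocado", helpful=False)  # a user says it was wrong
after = table.translate("avocat")
print(before, "->", after)
```

Raw frequency alone picks whichever translation happens to be most common in the source material. The vote adds a signal that is tied directly to what the system is trying to get right.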

Wearing the big wigs

Now, the application of big data is not going to be exclusive to internet giants such as Google or Facebook. Any large corporation with enough consumer data would be able to employ such studies, and probably already does. Chances are you have never heard of it, as it is often buried in the terms of service agreement on which we all too often click ‘I accept’. Big Brother is watching.

Personalised ads, the time of day a news article is released online, irresistible Steam sales and many more little things that have proliferated in the online environment are results of such applications of big data. Target assigns every customer a Guest ID number, tied to their credit card, name, or email address, and stores a history of everything they have bought along with any demographic information Target has collected from them or bought from other sources. Having discovered that pregnant women tend to purchase certain types of things during their pregnancy, Target would then send them advertisements for baby products and items often needed for the care of a young child. Remember how all this big data only searches for specific things and ignores the rest? Like the age of the prospective mother in such a situation? Things spiralled out of control for a certain 16-year-old expectant mother who had yet to tell anyone about her pregnancy before the ads started showing up.

Sure, personalised ads might be able to generate a win-win experience, but they can be rather creepy as well, much like a stalker who can tell you where you went in the past week based on photos from your Instagram and Facebook feeds, your preferred eating habits and the number of times you have used a specific cubicle in the public restroom.

It gets worse when you consider the possibility of manipulation. Ever wonder why you see so many game app advertisements during periods when you are likely to get distracted, such as before finals week? It might sound like a conspiracy, but it’s simply the growing use of big data in the way things are marketed to us.

—

About the Author
As a competitive gamer, Jensen’s personal field is the study of winning. As a Shoutcaster for Garena League of Legends, Jensen loves to discuss the E-sports industry: how is it perceived? And how does it interact with our society? He is also a firm believer that competitive gaming will be recognized in the future. Trust him, he’s an engineer.