How do 87m records scraped from Facebook become an advertising campaign that could help swing an election? What does gathering that much data actually involve? And what does that data tell us about ourselves? The Cambridge Analytica scandal has raised question after question, but for many, the technological USP of the company, which announced last week that it was closing its operations, remains a mystery. For those 87 million people probably wondering what was actually done with their data, I went back to Christopher Wylie, the ex-Cambridge Analytica employee who blew the whistle on the company’s problematic operations in the Observer. According to Wylie, all you need to know is a little bit about data science, a little bit about bored rich women, and a little bit about human psychology…
Step one, he says, over the phone as he scrambles to catch a train: “When you’re building an algorithm, you first need to create a training set.” That is: no matter what you want to use fancy data science to discover, you first need to gather data the old-fashioned way. Before you can use Facebook likes to predict a person’s psychological profile, you need to get a few hundred thousand people to do a 120-question personality quiz.
The “training set” refers, then, to that data in its entirety: the Facebook likes, the personality tests, and everything else you want to learn from. Most important, it needs to contain your “feature set”: “The underlying data that you want to make predictions on,” Wylie says. “In this case, it’s Facebook data, but it could be, for example, text, like natural language, or it could be clickstream data” – the complete record of your browsing activity on the web. “Those are all the features that you want to [use to] predict.”
At the other end, you need your “target variables” – in Wylie’s words, “the things that you’re trying to predict for. So in this case, personality traits or political orientation, or what have you.”
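To make the vocabulary concrete, here is a minimal sketch in Python of what a training set might look like once it is laid out for an algorithm. It is an illustration only, not Cambridge Analytica's code: the user names, page names, trait scores and the `build_training_set` helper are all invented for this example. The features become a grid of ones and zeroes (did this person like this page or not), and the target variable is a single quiz score per person.

```python
# Toy illustration only: every user, page and score below is invented.

# Feature data: which pages each (hypothetical) person has liked.
likes = {
    "user_a": {"page_hiking", "page_poetry"},
    "user_b": {"page_poetry"},
    "user_c": {"page_hiking", "page_cars"},
}

# Target variable: one made-up personality score per person,
# standing in for the result of a personality quiz.
openness = {"user_a": 0.9, "user_b": 0.7, "user_c": 0.4}


def build_training_set(likes, targets):
    """Pair each person's like-vector (features) with their quiz score (target)."""
    # One column per page seen anywhere in the data, in a fixed order.
    pages = sorted({page for user_pages in likes.values() for page in user_pages})
    # Only people with both likes AND a quiz result can be used for training.
    users = sorted(user for user in likes if user in targets)
    # X: one row per person, 1 if they liked the page in that column, else 0.
    X = [[1 if page in likes[user] else 0 for page in pages] for user in users]
    # y: the value the algorithm should learn to predict for each row.
    y = [targets[user] for user in users]
    return users, pages, X, y


users, pages, X, y = build_training_set(likes, openness)

for user, row, score in zip(users, X, y):
    print(user, row, score)
```

Each row of `X` is one person's features and the matching entry in `y` is the target for that person. Only people who appear in both sets of data make it into the training set, which is why the quiz-takers matter so much: without their answers, there is nothing for the likes to be matched against.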