The New Washington Data Grid
Last week I took time off from work to attend the Future of Privacy Forum and the Stanford Center for Internet & Society’s “Big Data and Privacy: Making Ends Meet” workshop, held in Washington, D.C. on September 10th. I have a two-decade tradition, upon arriving in Washington, of going on a Forrest Gump-like jog for as long as my schedule and joints will allow. Late Monday afternoon before the conference I continued the tradition: I left from the Washington Hilton and managed a 10+ mile jog past the White House, the Washington Monument, and the Jefferson Memorial, across the Potomac, and back to the Hilton. What struck me on my run was the grid-like structure built around the Washington Monument (pictured above). This monument provides a good bookend for describing the stark contrast between the opening panels of privacy experts and the lunchtime keynote delivered by Rayid Ghani, former Chief Scientist for Obama for America 2012.
Held at the Microsoft Innovation and Policy Center, the workshop featured moderated panels of privacy experts sharing their perspectives on the privacy implications of Big Data (hashtag #BigPrivacy). Many of the panelists submitted papers in advance of the workshop. As part of my LL.M. in IP and Tech Law studies at Washington University School of Law, I co-authored and submitted a paper with Professor Neil Richards of Washington University called The Three Paradoxes of Big Data. Stanford Law Review Online published our paper, along with several others, on September 3rd as part of a Symposium Issue. All of the papers submitted in advance of the workshop are published here on the Future of Privacy Forum website.
The first panel, moderated by Jules Polonetsky of the Future of Privacy Forum, was called Framing Big Data and Privacy. Professor Richards joked early on:
“There is mystical thinking that infuses the thinking around big data. Replace the word big data with magic – if we had magic and wizards we would have to regulate them.”
Professor Richards summarized some of the arguments we made in our paper, saying, “As a society, if we are going to be making important life-changing decisions based on algorithmic processes, then we need to regulate and have transparency.” Professor Richards also described the need, in this rapidly changing environment, for the development of data ethics and the protection of ‘intellectual privacy’ rights for fundamental intellectual activities such as reading.
Professor Deirdre Mulligan, Assistant Professor in the School of Information and Co-Director of the Berkeley Center for Law and Technology, framed the debate more widely than privacy, saying, “we have to get past man versus machine discussions and look at the ‘socio-techno’ systems and realize that people wrote those algorithms.” “Ethical issues with big data,” said Professor Mulligan, “go far beyond privacy.” She continued, saying that “if the policy conversation is not broadened in several ways then decisional autonomy is at stake.”
Eric Jones, Policy Director and Assistant Attorney General to Lisa Madigan, Attorney General of the State of Illinois, provided a legislative perspective on Big Data. He described his previous role working for Senator Rockefeller on the Senate Commerce Committee and his new role as Policy Director with Attorney General Madigan. Jones shared that he was focused on providing general oversight on consumer protection issues for Senator Rockefeller until they started to see that “technology was having huge impact across the gamut.” Commenting on the federal perspective, Jones said, “Congress as a whole leans toward not getting in the way of innovation. They don’t want to do something that will prohibit innovation. Because of that you see Congress focusing on specific harms and fixing those.” Now working for Attorney General Madigan, Jones said that he is opening some investigations into data brokers, not because there are necessarily problems but “to ask the right questions.”
Natasha Singer of The New York Times moderated a lively second panel called the Social Ramifications of Big Data. Professor Evan Selinger, Associate Professor at Rochester Institute of Technology, shared his experiences teaching a privacy law class for science and engineering students. Professor Selinger said that at the beginning of his class students did not have privacy concerns. As he started to teach them hands-on applications of big data, however, he said “you could see a change happen.” The students were able to see firsthand how “seemingly innocuous information could become harmful.” Professor Selinger continued, saying, “We are so used to thinking about big data in a big organizational way that we are not yet fully able to think about what is going to happen to individuals.”
Karen Levy, Ph.D. Candidate in the Department of Sociology at Princeton University, shared a fascinating perspective gained from studying long haul truck drivers monitored by GPS. Levy chose to study truck drivers for her PhD as part of a larger inquiry into the impact of big data on relationships. Levy said “We should think about top down institutional collection but we should also think about smaller data practices in our relationships.” Levy is interested in looking at social domains in which data is being applied in relationships such as family. She observed that there is a proliferation of tools to track teenagers that “were not around when I was a teenager.” Levy’s studies are showing that monitoring products used across friendship, employment and family relationships have the “potential to change trust relationships, control relationships and change accountability.”
True to their titles, these opening panels provided big-picture privacy and social perspectives on Big Data. Then Rayid Ghani, former Chief Scientist for Obama for America 2012, took the podium and stole the show.
Introduced by Chris Wolf, Co-Chair of the Future of Privacy Forum and partner at Hogan Lovells LLP, Ghani is now at the Computation Institute and the Harris School of Public Policy at the University of Chicago and is co-founder of Edgeflip. Ghani opened by saying it was an “interesting morning hearing from people I typically don’t talk to.” Ghani shared that he normally hears from other computer scientists and that privacy is never discussed without a trade-off. Put another way, privacy is a constraint for data scientists. Ghani said, “I don’t know much about privacy and I don’t care that much about it.” Qualifying his statement, Ghani shared that, like many of us, he gives up information because “he is lazy and he wants to connect.” Ghani then commented on the overused term “big data” itself, saying “no one in the computational world talks about big data.” He dismissed the term as one that vendors had come up with to sell more. He also observed that the morning sessions had uttered the term big data more than he had ever witnessed.
Ghani then proceeded to deliver an insightful talk on data science and the role it played in the 2012 Obama campaign. Ghani first observed that nothing fundamental has changed in data analysis in the past ten years. That said, Ghani shared four ways in which access to more data is changing data science predictions:
- Better Predictions: “Most people use data to make predictions … When you have more data, the implications are that you can make finer grained predictions.”
- Earlier Predictions: “We can make these predictions much earlier than we used to.”
- More Accurate Predictions: “The goal is better than random, not 100% accuracy … What they are showing about Acxiom and big data is not data about you, it is inferences about you … People not in the big data world think about this as very deterministic. There is no this or that, it is a continuum. More accurate means we can really do something…”
- Reduce Risk Of Taking Certain Action: “Work the Obama campaign did was decisive in winning elections. We won for a lot of different reasons. What better analytics and data meant is that we reduced the risk of losing. What we did was increase the probability of winning [by predicting on] election day an 88% likely to win instead of 64% likely to win.”
Ghani then described the important role of experimentation in driving actions and decisions, and noted that privacy reactions themselves can become part of the experiments. Ghani said:
When on the Obama campaign we struggled every day about what we could do and not do; what would be perceived as privacy violation even though it would not be a privacy violation. For example, recommendations on which friends you should recommend to get out to vote … We then started to send users email.
Ghani walked through how they added additional features to the emails they sent, such as referencing names in subject lines and adding profile pictures from authorized Facebook friend lists. Netting out his discussion on experimentation, Ghani said, “for every additional personalization the response rate doubled.”
Ghani said there were big arguments internally about how many emails to send, so they would continue to run experiments. By sending emails, “you are asking people what they want, not a survey, but what do they actually respond to. Instead of hypothesizing you can do it. When people start unsubscribing, then you can adjust.”
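The experimentation Ghani describes can be sketched in a few lines. This is not campaign code; the variant names and response numbers below are hypothetical, purely to illustrate comparing observed response rates across email variants rather than hypothesizing in advance:

```python
# Hypothetical sketch: each email variant adds one personalization
# feature (e.g. name in subject line, friend's profile photo). Each
# variant goes to a slice of the list; we compare observed response rates.

def response_rate(responses: list[bool]) -> float:
    """Fraction of recipients who responded to a variant."""
    return sum(responses) / len(responses) if responses else 0.0

def best_variant(results: dict[str, list[bool]]) -> str:
    """Pick the variant with the highest observed response rate."""
    return max(results, key=lambda name: response_rate(results[name]))

# Toy data, not campaign numbers:
results = {
    "baseline":        [True] * 2 + [False] * 98,  # 2% respond
    "name_in_subject": [True] * 4 + [False] * 96,  # 4% respond
    "friend_photo":    [True] * 8 + [False] * 92,  # 8% respond
}
```

The design choice mirrors Ghani's point: instead of surveying people about what they want, you measure what they actually respond to, and adjust when unsubscribes rise.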
In the most fascinating part of his talk, Ghani outlined how the Obama campaign gathered voting data and plotted the electorate on a grid. The campaign did not use private data; it used public voting records. Ghani shared later, in response to a question from the audience, that they used voting records because they are the best predictor of how a voter will vote. Using voter records, the campaign could predict three things about every voter in a swing state:
- How likely you are to support Obama.
- How likely you are to be persuaded to support Obama.
- How likely you are to vote.
With these simple predictions, the Obama campaign could then plot every voter in a swing state onto a grid. Ghani described the four quadrants of the grid and the corresponding action the campaign would take in each quadrant:
- People who are not likely to vote and not supporting Obama: “Too expensive to focus on.”
- People who are unlikely to vote but have a high likelihood to vote for Obama: “Focus on getting those people to vote.”
- People who are not supportive of Obama but have a high likelihood of voting: “Focus on small sliver of percentage that are persuadable.” Ghani shared that these people are hard to identify because undecided voters typically will not tell you how they will vote. The campaign therefore ran experiments to find a subset of persuadable people and used them to develop models ranking persuadability. Volunteers would “talk to people about Obama’s policies and then poll again and figure out what kind of people increase their support as a result of this persuasion,” and the campaign would then apply that model “to everyone else in the country” to estimate how likely each voter is to be persuaded. The campaign would then rank voters from most persuadable to least persuadable and have volunteers work the list top to bottom.
- People who are likely to vote and vote for Obama: “Use this segment to really expand reach.” Referring to the earlier personalized emails, Ghani described that the focus for this group was to give them as many tools as possible to call the right people in the right bucket.
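The quadrant logic above can be sketched in a few lines. This is not the campaign's actual code; the 0.5 threshold and the field names are my own assumptions for illustration, mapping each voter's three predicted probabilities to one of the four actions Ghani described:

```python
from dataclasses import dataclass

@dataclass
class Voter:
    p_support: float   # predicted likelihood of supporting Obama
    p_persuade: float  # predicted likelihood of being persuadable
    p_turnout: float   # predicted likelihood of voting

def campaign_action(v: Voter, threshold: float = 0.5) -> str:
    """Map a voter to a quadrant action (threshold is an assumed cutoff)."""
    supports = v.p_support >= threshold
    votes = v.p_turnout >= threshold
    if not supports and not votes:
        return "skip"               # too expensive to focus on
    if supports and not votes:
        return "get_out_the_vote"   # push likely supporters to the polls
    if not supports and votes:
        return "persuade"           # chase the persuadable sliver
    return "expand_reach"           # arm supporters with outreach tools
```

Within the “persuade” quadrant, the persuadability score (`p_persuade` here) is what the campaign used to rank voters from most to least persuadable before sending volunteers down the list.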
Ghani closed with a line that stuck with me: “I don’t want to know what you know about me. I want to know what you predict about me. You can infer a lot, more or less, from data; what is important is what they predict about you.”
The contrast between Ghani and the first two privacy panels highlighted the need for many more workshops between policy makers and technologists. Ghani, by his own opening admission, did not seem to hear the privacy panelists, and I wonder whether the privacy attendees had the technical aptitude to understand Ghani. Many more workshops like the one hosted by the Future of Privacy Forum and the Stanford Center for Internet and Society last week are in order. At a time when the potential of data science applications is not only revolutionary but in many cases required, when the technology is changing rapidly and its use is already being experimented with and applied across all facets of life, we need grounded discussion about the way forward and the corresponding policies to guide us.
And now I come back to the Washington Monument surrounded by a protective grid (pictured above). The grid is there to allow repairs to the iconic monument after a 5.8 magnitude earthquake damaged the structure on August 23, 2011. I find the picture of this structure a fitting backdrop for the earthquake going on in Washington right now around the sharing and use of data. This most recent data earthquake originated with the tragic events of September 11th, 2001. Laws were passed to protect and defend the United States from the asymmetric threat of terrorism. This same body of law is now likely to be needed for new threats in cyberspace. Perhaps we should leave the grid around the monument to protect and defend it from the next earthquake, much like we keep the laws to protect and defend us from these new threats. Perhaps, more importantly, we should keep the grid around the monument to remind us that these laws are still in effect.