On June 9, the Wall Street Journal reported that for the last few years the National Security Agency has been relying on a software program "with the quirky name Hadoop" to help it make sense of its enormous collections of data. Named after a toy elephant that belonged to the child of one of the original developers of the program, "Hadoop," reported the Journal, is a crucial part of "a computing and software revolution ... a piece of free software that lets users distribute big-data projects across hundreds or thousands of computers."
"Revolution" is probably the most overused word in the chronicle of Internet history, but if anything, the Wall Street Journal undersold the real story. Hadoop's importance to how we live our lives today is hard to overstate. By making it economically feasible to extract meaning from the massive streams of data that increasingly define our online existence, Hadoop effectively enabled the surveillance state.
And not just in the narrowest, Big Brother, government-is-watching-everyone-all-the-time sense of that term. Hadoop is equally critical to private sector corporate surveillance. Facebook, Twitter, Yahoo, Amazon, Netflix -- just about every big player that gathers the trillions of data "events" generated by our everyday online actions employs Hadoop as part of its arsenal of Big Data-crunching tools. Hadoop is everywhere -- as one programmer told me, "it's taken over the world."
The Journal's description of Hadoop as "a piece of free software" barely scratches the surface of the significance of this particular batch of code. In the past half-decade Hadoop has emerged as one of the triumphs of the non-proprietary, open-source software programming methodology that previously gave us the Apache Web server, the Linux operating system and the Firefox browser. Hadoop belongs to nobody. Anyone can copy it, modify it, extend it as they please. Funny, that: A software program developed collaboratively by programmers who believe that their code should be shared in as open and transparent a process as possible has resulted in the creation of tools that everyone from the NSA to Facebook uses to annihilate any semblance of individual privacy. But what's even more ironic, and fascinating, is the sight of intelligence agencies like the NSA and CIA joining in and becoming integral players in the world of open source big data software. The NSA doesn't just use Hadoop. NSA programmers have improved and extended Hadoop and donated their changes and additions back to the larger community. The CIA actively invests in start-ups that are commercializing Hadoop and other open source projects.
They're all in it together. The spooks and the social media titans and the online commerce goliaths are collaborating to improve data-crunching software tools that enable the tracking of our behavior in fantastically intimate ways that simply weren't possible as recently as four or five years ago. It's a new military-industrial open-source Big Data complex. The gift economy has delivered us the surveillance state.
Hadoop's earliest roots go back to 2002, when Doug Cutting, then the search director at the Internet Archive, and Michael Cafarella, a graduate student at the University of Washington, started working on an open-source search engine called "Nutch." But the project did not get serious traction until Cutting joined Yahoo and began to merge his work into Yahoo's larger strategic goal of improving its search engine technology so as to better compete with Google. Significantly, Yahoo executives decided not to make the project proprietary. In 2006, they blessed the formation of Hadoop, an open-source project managed under the auspices of the Apache Software Foundation. (For a much more detailed look at the history of Hadoop, please read this four-part history of Hadoop at GigaOm.)
Hadoop is basically a nifty hack. The definition, per Wikipedia, is surprisingly simple: "It supports the running of applications on large clusters of commodity hardware." Bottom line, Hadoop provides a means for distributing both the storage and processing of an enormous amount of data over lots and lots of relatively inexpensive computers. Hadoop turned out to be cheap, fast and scalable -- meaning it could expand smoothly in capacity as the flows of data it was crunching burgeoned in size, simply by plugging extra computers into the network. Hadoop was also fundamentally modular -- different parts of it could be easily replaced by custom-designed chunks of software, making it seamlessly adaptable to the individual circumstances of different corporations -- or government agencies.
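The paradigm at Hadoop's core, MapReduce, is simple enough to sketch on a single machine. The toy Python version below is not Hadoop's actual API -- real jobs run these same three phases spread across a cluster -- but it shows the trick, using the canonical word-count example: because each map step and each reduce step is independent, the work can be farmed out to as many cheap machines as you care to plug in.

```python
from collections import defaultdict

def map_phase(records):
    # Map: each input line independently emits (word, 1) pairs.
    # Independence is what lets Hadoop run this on thousands of machines.
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    # (In a real cluster, this is the phase that moves data between nodes.)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: collapse each key's list of values into a final result.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big deal", "big data everywhere"]
result = reduce_phase(shuffle(map_phase(lines)))
print(result)  # {'big': 3, 'data': 2, 'deal': 1, 'everywhere': 1}
```

Swap in petabytes of clickstream logs for `lines` and thousands of commodity servers for the single process, and you have the shape of every Hadoop job.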
Hadoop's debut was timely, addressing not only the problems Yahoo faced in managing the enormous amounts of data produced by its users, but also those that the entire Internet industry was simultaneously struggling to cope with. Basically, the Internet had become a victim of its own success. The enormous flows of data generated by users of the likes of Facebook and Twitter far overwhelmed the ability of those companies to make sense of it. There was too much coming in too fast. Hadoop helped companies cope with the tsunami -- it was, in the words of Jeff Hammerbacher, an early employee of Facebook, "our tool for exploiting the unreasonable effectiveness of data."
Before Hadoop, you were at the mercy of your data. After Hadoop, you were in charge. You could figure out all kinds of interesting things. You could recognize patterns in the data and start to make inferences about what might happen if you made tweaks to your product. What did users do when the interface was adjusted like this? What kinds of ads made them more likely to pull out their credit cards? What did that batch of millions of Verizon calls reveal about the formation of a potential terrorist cell? Facebook wouldn't be able to exploit the insights of its so-called social graph without tools like Hadoop.
"Hadoop has become the de facto standard tool for cost-effectively processing Big Data," says Raymie Stata, who served as chief technology officer at Yahoo before eventually starting his own Hadoop-focused start-up, Altiscale. And the significance of being able to cheaply process Big Data, to accurately "measure" what your users are doing, he added, is a "big deal."
"Once you can measure what's happening 'out there' -- [you can] then use those measurements to understand and ultimately influence what's happening out there."
With engineers at multiple companies recognizing that Hadoop offered solutions to the specific challenges they faced on a daily basis, Hadoop quickly secured the critical mass of cross-industry support necessary for an open-source software program to become an essential part of Internet infrastructure. Even engineers at Google chipped in, although Hadoop, at its core, was basically an attempt to reverse-engineer proprietary Google technology. But that's just how the Internet has historically worked. For decades, so-called gift economy collaboration, in which the community as a whole benefits from the freely donated contributions of its members, has been a potent driver of Internet software evolution. As I wrote 16 years ago, when chronicling the birth of the Apache Web server, the success of open source software "testifies to the enduring vigor of the Internet’s cooperative, distributed approach to solving problems." Hadoop, which down to its fundamental structural essence is a distributed approach to solving problems, embodies that philosophy.
So, in a sense, Hadoop's success was just the same old story. But back in the mid-'90s, around the time that one of the first open source success stories, the Apache Web server, was taking off, I'm not sure that anyone would have predicted that the National Security Agency and CIA would end up becoming stalwart participants in the gift economy. Even though it makes total sense, in principle, that the fruits of government-funded software development should be shared with the general public, there's still something cognitively dissonant about intelligence agencies that shroud their every activity in great secrecy contributing to projects built on openness and transparency. On the one hand, employees of the NSA are appearing at conferences discussing how they have adapted Hadoop to solve the problems of dealing with unimaginably huge data sets, but on the other hand, we're not supposed to know anything about what they are actually doing with that data.
The intertwining of the intelligence agencies with the larger open source software community could hardly be more incestuous. In 2008, a group of Yahoo employees that eventually included Doug Cutting formed a start-up designed to commercialize Hadoop called Cloudera. The CIA, through its In-Q-Tel (named after James Bond's Q character) venture capital arm, was an early investor in, and customer of, Cloudera. The NSA built a significant piece of software that works "on top" of Hadoop, called Accumulo, designed to add sophisticated security controls over how data can be accessed -- and then promptly donated that code to the Apache Software Foundation. Later, a group of NSA software engineers formed another spinoff company, Sqrrl, to commercialize Accumulo.
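Accumulo's signature contribution is cell-level access control: each stored value carries a visibility label, and a scan returns only the cells a reader's authorizations satisfy. The sketch below is a toy illustration of that idea, not the real Accumulo API -- Accumulo's labels are boolean AND/OR expressions, simplified here to a set of required tokens, and the row, column and token names are invented for the example.

```python
# Toy model of Accumulo-style cell-level visibility.
# Each cell: (row, column, value, visibility), where visibility is the
# set of authorization tokens a reader must hold to see the cell.

def visible(required_tokens, reader_auths):
    # Simplification of Accumulo's visibility expressions:
    # the cell is visible only if the reader holds every required token.
    return required_tokens <= reader_auths

cells = [
    ("row1", "name",  "Alice",    {"public"}),
    ("row1", "phone", "555-0100", {"agent"}),
    ("row1", "case",  "file #42", {"agent", "secret"}),
]

def scan(cells, reader_auths):
    # Return only the cells this reader is authorized to see.
    return [(r, c, v) for r, c, v, vis in cells if visible(vis, reader_auths)]

print(scan(cells, {"public"}))           # only the name cell
print(scan(cells, {"public", "agent"}))  # name and phone, but not the case file
```

The point of pushing the check down to the individual cell, rather than the table or the row, is that analysts with different clearances can share one enormous data store -- which is precisely the problem an intelligence agency sitting on Hadoop-scale collections would need solved.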
What all this means is that the improvements to tools that the NSA is making, with the aim of more efficiently catching terrorists, are propagating into the private sector, where they will be used by Facebook and Netflix and Yahoo to more accurately target ads or influence our purchasing behavior or provide us with content algorithmically shaped to our very specific desires. And vice versa. Innovations and increased capabilities pioneered by private companies trickle back to the NSA. The collective bootstrapping never stops.
Again, in principle, there is nothing necessarily wrong going on here. There is no one to blame. Some of the fiercer apologists for unfettered free markets might complain that government involvement in open source projects unfairly competes with private sector proprietary businesses, but a much stronger case can be made that any software development work that is funded by taxpayer money should by definition be considered freely sharable with the wider public. The NSA should probably be applauded for helping to improve Hadoop. And if the capabilities unlocked by Hadoop result in the prevention of some horrific terrorist act, then every programmer who contributed a line of code to the project justly deserves some congratulation.
But there's also an intriguing inversion occurring here of what, for better or worse, we might call the purpose of the Internet. The Internet was initially created by the U.S. government to facilitate the sharing of information between geographically separate research centers. The Internet took off in the mid-'90s in large part because the general public recognized it as a phenomenal tool for sharing information with each other. The fact that so much of the Internet's infrastructure was also built from code that was freely shared seemed like a pleasing match of form and function.
Free software and open-source software evolution is frequently driven not so much by hope for financial gain but by individuals looking to solve their immediate engineering problems. Over time, on the Internet at large, one of those problems has turned out to be the gnarly challenge of how to manage all the data created by all those people sharing so promiscuously with each other. Hadoop can justly be seen as the natural response to all that promiscuous sharing. And it certainly helped solve the problems faced by engineers at Facebook and elsewhere.
But what ended up getting enabled by the success of Hadoop is something significantly different from good old peer-to-peer sharing. The ability to make sense out of petabytes of data isn't necessarily useful to you or me. But it's god's gift to the profit-minded corporations and terrorist-seeking intelligence agencies seeking to leverage the data we generate for their own purposes, to measure our behavior and ultimately to influence it. That could mean Netflix figuring out exactly what combination of plot twists and acting talent proves irresistible to streaming video watchers, or Facebook figuring out exactly how to stock our newsfeeds with advertisements that generate acceptable click-through, or Twitter knowing exactly where we are on the surface of the planet so it can pop up a sponsored tweet pushing a coupon for a happy hour at the bar just down the street -- or the NSA spotting a peculiar pattern of pressure cooker purchases. This is no longer about sharing information with each other; it's about manipulation, control and punishment. It's about keeping stock prices up. We're a long, long way here from the ideal gift economy, where everyone brings their home-cooked delicacy to the potlatch. We've arrived at a destination where the tools offer more power to them than to us.
I posed a version of this analysis to Michael Cafarella, one of the original authors of Hadoop, now a computer scientist at the University of Michigan. He conceded that "there's a certain irony that the open ideas of open source have enabled the construction of systems that can undermine openness so substantially."
But Raymie Stata, who has been closely involved with the growth of Hadoop for the last seven years, warned against "conflating 'open source software' with 'Open Society.'"
"Everyone involved with Hadoop in the early days certainly did believe that Hadoop, as a piece of open source software, would make the world a better place. I can't say, back then, that we saw Hadoop moving from cyberspace to the real world, but we did recognize that it would become foundational to building Internet applications of the future, and we wanted to contribute to advancing that agenda.
"But individuals who find common ground in contributing to open source projects do not, as a whole, share beliefs on what constitutes the ideal 'Open Society,'" said Stata. "Is using Big Data to make inferences about people a Bad Thing at all, no matter who does it? Or is it no big deal? Or does it depend on who's doing it, and for what reason (and with what transparency)? Should we be more worried about Big Business, or Big Government?"
"I guess in some ways this incident is evidence that it's hard to encode ideals in a piece of software," said Cafarella. "The right way to do that is via legislation."
Cafarella's point is hard to dispute. Brian Behlendorf, one of the founders of the Apache Software Foundation, told me that at one juncture, contributors to the various software projects managed by Apache had argued over whether the license that determined the rules for how their code could be shared should include restrictions against organizations using that code for purposes deemed morally or ethically unacceptable by the open source software programmer community. But it was relatively quickly determined that attempting such restrictions would open up an impossible-to-resolve, subjective can of worms. Society at large has to figure out what limits it wants to put on the surveillance state, on what either Facebook or the NSA is allowed to do.
It's also important to acknowledge that as users of online services, we benefit in many ways from our instant-gratification, access-to-everything, always-on lives. But still: When we first started to log on, did we realize what the tradeoffs would be? Did we know that we were entering the Panopticon? That we would be making it substantially easier than ever before for governments and businesses to track our behavior and monitor our every whim?
Behlendorf says we kind of did. He recalls his days, fresh out of college in 1995, working for HotWired, Wired magazine's first foray into online publishing. AT&T was running an ad on HotWired, under the theme "Imagine the Future," that pictured an arm with a "wrist-watch phone" on it.
"Someone printed it out," said Behlendorf, "put it up on the wall, and wrote in black marker over the top of the ad, 'NSA primate tracking device.'"
And guess what? We went ahead and built it.