In his 1992 book "To Engineer Is Human: The Role of Failure in Successful Design," Duke civil engineering professor Henry Petroski tosses out a little-known statistic from the history of bridge design: During the latter half of the 19th century, a period that introduced the locomotive train to most corners of the industrial world, roughly a quarter of all iron truss bridges failed.
The simplified reason: Bridge designers, unused to iron as a structural material and railroad trains as a service load, had yet to grasp the full impact of a minor miscalculation anywhere within their plans. It wasn't until designers started introducing a conservative fudge factor, now known as the margin of safety, that bridge designs developed enough redundancy and robustness to account for the occasional errant crossbeam or overloaded rail car.
"Basically civil engineers made bridges safe by recognizing that humans would be involved in every step of the bridge-building process," says David Patterson, a Berkeley computer science professor who has cited Petroski's statistic in numerous papers. "With human involvement comes the risk of human failure."
For Patterson, the iron-truss story is more than just a quick attention grabber; it's a hint that today's software programmers, oft derided for their failure to deliver bug-free code, have yet to grasp the full weight of their own discipline.
Coauthor of the landmark 1987 paper that laid out the low-cost storage strategy now known as RAID (the acronym stands for "redundant array of inexpensive disks"), Patterson has long been a proponent of hardware architectures that treat component failure as a given yet still find a way to get the job done. Since 2002, he's been putting forward the same strategy in the realm of software systems, banding together with Stanford counterpart Armando Fox, head of that university's Software Infrastructures Group, to launch the Recovery-Oriented Computing project.
In a June 2003 article for Scientific American, Fox and Patterson cited Petroski's observation and laid out their own project's philosophy and goals. "As digital systems have grown in complexity, their operation has become brittle and unreliable," they wrote. "Rather than trying to eliminate computer crashes -- probably an impossible task -- our team concentrates on designing systems that recover rapidly when mishaps do occur."
While somewhat fatalistic on the surface, treating failure as inevitable just might be the key to pushing software development out of its current malaise. From Berkeley to MIT and points in between, software engineers are buzzing over the prospect of "autonomic computing" -- systems built to recognize and recover from their own flaws without tying down a human administrator in the process. Such systems remain a few years over the current commercial horizon, of course, but the sense of collective mission, something akin to the mammoth World War II science projects that spawned computer science in the first place, is growing.
"We're running into a complexity barrier in computing," says Steve White, senior manager for autonomic computing at IBM Research. "Computer scientists have done a great job of making software faster and cheaper. But we haven't paid as much attention to the people costs."
Maybe that's because, until recently, counting up the "people costs" was an inexact science itself. "Total cost of ownership" studies vary from platform to platform and often fall prey to vendor bias. Still, over the last decade, one common statistic has emerged: When it comes to running enterprise-level software, most companies spend twice as much on human talent as they do on licensing and acquisition.
While companies strive to reduce this two-thirds tax through lower labor costs (read: outsourcing), researchers are looking further down the road. One problem with hiring any human to fix or tune a system is the assumption that the system is fixable at the human level and that once fixed, it stays fixed. A quick review of recent software history, however, suggests otherwise. For at least three decades now, programmers have joked of "heisenbugs" -- software errors that surface at seemingly random intervals and whose root causes consistently evade detection. The name is a takeoff on Werner Heisenberg, the German physicist whose famous uncertainty principle posited that no amount of observation or experimentation could simultaneously pinpoint both the position and the momentum of an electron.
"A lot of the bugs we're seeing in modern systems have been plaguing programmers from the beginning of time," says Fox. "The only difference now is machines just crash faster."
One remedy to this situation is a strategy so simple every user has relied on it at least once or twice: Reboot the machine and start from scratch. Fox and Stanford University doctoral student George Candea have collaborated on a series of papers investigating a tactic originally known as partial rebooting but which Candea now calls "micro-rebooting." Instead of digging through the source code to fix errors, their strategy calls upon system managers to simply reboot the offending components while leaving the rest of the network operationally intact.
"In a lot of cases, rebooting cures the problem much faster than fixing the root cause," Candea says. "We see this all the time with PCs. Rebooting takes 30 seconds to a minute, enough time for a bathroom break. When you come back, the problem is usually gone and you can go back to work."
Rebooting the components of a computer network is, of course, more challenging than rebooting an individual PC. Network administrators have to guard against lost data and whatever performance loss such outages might incur. Still, thanks to clustering, a strategy that bundles low-cost hardware resources in a way that makes it easy for one machine to pick up another machine's workload in the event of a failure or shutdown, most e-commerce networks already have that built-in safeguard. Fox and Candea have worked together to develop a process they call recursive restartability, in which an automated network manager systematically goes through a network's node tree, rebooting each branch as a form of preventive maintenance.
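For readers who think in code, the tree walk described above might look something like the following Python sketch. It is an illustration only, not Fox and Candea's implementation: the Node class, the rolling_restart function and the demo tree are all hypothetical names, and a "restart" here is just a counter. The sketch also assumes that interior nodes hold state of their own and so get restarted after their children.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One restartable unit in a hypothetical component tree."""
    name: str
    children: List["Node"] = field(default_factory=list)
    restarts: int = 0

    def restart(self) -> None:
        # Stand-in for a real restart: this sketch only counts it.
        self.restarts += 1


def rolling_restart(node: Node, log: List[str]) -> None:
    """Restart every node in the tree, one at a time, leaves first.

    Only one node is restarted at any moment, so its siblings can keep
    carrying the workload while it is down.
    """
    for child in node.children:
        rolling_restart(child, log)
    node.restart()
    log.append(node.name)


def build_demo_tree() -> Node:
    # Hypothetical layout: two web servers and one database node.
    return Node("site", [
        Node("web", [Node("web-1"), Node("web-2")]),
        Node("db", [Node("db-1")]),
    ])


if __name__ == "__main__":
    order: List[str] = []
    rolling_restart(build_demo_tree(), order)
    print(order)
```

Run as a script, the sketch prints the restart order, with each branch's leaves ahead of the branch itself. A production version would presumably also check that a cluster peer has picked up the workload before taking a node down.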
Lately, however, Candea has been looking at an even more sophisticated approach, one that gives a system its own ability to target and correct failing components. He calls it crash-only computing, and the strategy is to marry micro-rebooting with the increasingly popular diagnostic tactic known as fault injection. Candea has built a Java application server divided into two main components: management and monitoring. The monitoring side periodically sends queries into the software system and watches for any sign of bad data.
If the messages trigger an erroneous response, the monitors' own components compare notes on the error path, generate a statistical estimate of the faulty component, and send a signal to the management component to perform a micro-reboot. According to a paper released last year, Candea's self-monitoring Java server was able to increase system dependability by 78 percent while reducing service outages from 12 per hour to zero.
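The monitor-then-reboot loop can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not Candea's Java code: it assumes each test query records which components it passed through, it uses a plain failure rate as the "statistical estimate," and the names suspect, Manager, micro_reboot and monitor_once are invented for the example.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

# One probe: (components on the request path, did the response look right?)
Probe = Tuple[List[str], bool]


def suspect(probes: Iterable[Probe], threshold: float = 0.5) -> Optional[str]:
    """Return the component with the highest failure rate, or None.

    A component's failure rate is the share of probes passing through it
    that came back wrong. Nothing is blamed unless the worst rate
    exceeds `threshold`.
    """
    seen: Counter = Counter()
    failed: Counter = Counter()
    for path, ok in probes:
        for name in dict.fromkeys(path):  # de-duplicate, keep order
            seen[name] += 1
            if not ok:
                failed[name] += 1
    if not seen:
        return None
    worst = max(sorted(seen), key=lambda n: failed[n] / seen[n])
    return worst if failed[worst] / seen[worst] > threshold else None


class Manager:
    """Hypothetical management side: it only records what it rebooted."""

    def __init__(self) -> None:
        self.rebooted: List[str] = []

    def micro_reboot(self, name: str) -> None:
        self.rebooted.append(name)


def monitor_once(probes: Iterable[Probe], manager: Manager) -> Optional[str]:
    """Estimate the faulty component and ask the manager to reboot it."""
    name = suspect(probes)
    if name is not None:
        manager.micro_reboot(name)
    return name


if __name__ == "__main__":
    probes = [
        (["frontend", "cart", "db"], False),
        (["frontend", "catalog", "db"], True),
        (["frontend", "cart"], False),
        (["frontend", "catalog"], True),
    ]
    manager = Manager()
    print(monitor_once(probes, manager))  # prints "cart"
```

In the demo data, every failed probe passed through the cart component, while the front end and database also appear on healthy paths, so only the cart is restarted. Keeping the Manager class separate from the scoring function mirrors the split between monitoring and management described above.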
It's at this point that a technology journalist must fight the urge to evoke biological metaphors, an urge all the more compelling because many programmers, IBM's White included, consider experiments like Candea's a first step toward autonomic computing systems that manage internal resources the same way the human body's own autonomic nervous system regulates heart rate and breathing.
"First of all, I'm a real fan of ROC," says White, referring to recovery-oriented computing. "It's that notion of self that I think is the key idea of autonomic computing and the most revolutionary part."
Candea, for one, is hesitant to invoke biological metaphors but notes that, for discussing overly complex systems, sometimes they are the only parallels available. Like the body's own autonomic system, which operates independently of the conscious brain, his Java server works best when the monitoring component is strictly isolated from the management component. The same goes for all components. Without rigid functional boundaries, the software equivalent of cell membranes, it is almost impossible to tell which component is in need of a restart.
"It's all about having isolation of what we in computer-speak call the fault domain," says Candea.
Across the country, University of Virginia computer scientist David Evans has taken this notion of cellular segregation one step further. Three years ago, he and his colleagues developed a program that shows how a software network might function if limited to the same rules governing cellular interaction. In other words, modules communicate not by direct electronic query but in a fashion modeled on the physics of chemical diffusion. Signals move outward in a slow-moving spherical field, delivering information in variable doses.
While significantly slower than standard electronic communications, this diffusion strategy has one sizable advantage: When healthy components fail, the "signal" remains, leaving a distributed memory of its position and function, a memory the overall network can use to replace the damaged component.
To demonstrate survivability, Evans and his colleagues have taken a cue from biological evolution and programmed the individual modules to build and maintain an arbitrary three-dimensional superstructure -- a sphere, for example. Once it is built, various modules are subjected to damaging data and flushed out of the system when they fail. The question then becomes whether the superstructure can rebuild the same shape with a fraction of its original components.
So far, Evans says, diffused signaling works like a charm: "We can survive damage to nearly all the cells as long as the structure is maintained through these types of interactions."
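A toy version of the idea fits in a short Python sketch. It is not Evans' program: it shrinks the three-dimensional structure to a line of five sites, assumes every site belongs to the target shape, and uses made-up emission, decay and threshold values. The function names diffuse, step and warm_up are likewise invented for illustration.

```python
from typing import List, Tuple


def diffuse(field: List[float], rate: float = 0.25) -> List[float]:
    """One diffusion step on a line: each site trades signal with its neighbours."""
    n = len(field)
    out = []
    for i in range(n):
        # Edge sites have no outside neighbour, so nothing leaks off the ends.
        left = field[i - 1] if i > 0 else field[i]
        right = field[i + 1] if i < n - 1 else field[i]
        out.append(field[i] + rate * (left + right - 2 * field[i]))
    return out


def step(
    alive: List[bool],
    field: List[float],
    emit: float = 1.0,
    decay: float = 0.5,
    threshold: float = 0.2,
) -> Tuple[List[bool], List[float]]:
    """Live cells emit signal, the signal spreads and fades, gaps refill."""
    emitted = [f + (emit if a else 0.0) for f, a in zip(field, alive)]
    new_field = [decay * f for f in diffuse(emitted)]
    # An empty site where enough signal lingers gets a replacement cell.
    new_alive = [a or f > threshold for a, f in zip(alive, new_field)]
    return new_alive, new_field


def warm_up(n_cells: int = 5, steps: int = 3) -> Tuple[List[bool], List[float]]:
    """Let a fully healthy structure build up its signal field."""
    alive = [True] * n_cells
    field = [0.0] * n_cells
    for _ in range(steps):
        alive, field = step(alive, field)
    return alive, field


if __name__ == "__main__":
    alive, field = warm_up()
    alive = [True, False, False, False, False]  # destroy all but one cell
    alive, field = step(alive, field)
    print(alive)
```

With these particular numbers, the signal left behind by the dead cells is still above the threshold one step later, so every empty site is refilled even though only one cell survived the damage. Different parameter choices would make recovery slower or prevent it altogether.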
Building cartoon spheres might seem a little frivolous, but Evans says the experiment has solid business-world roots. A security specialist, he says it was the creativity of Internet hackers that forced him to consider a more creative approach to network defense.
"The attackers have really taken advantage of the interconnectedness of the Internet," he says. "Defenders haven't."
With self-healing software at the blastula stage of software evolution, it seems a bit premature to speak of full-scale autonomic computing. Even so, NASA, DARPA, IBM (which boasts a 3-year-old Autonomic Computing Division) and a growing number of research underwriters have taken an active interest in seeing what's next. Evans' sphere project is already supported by the National Science Foundation. This summer, Evans and university colleagues John Knight, Jack Davidson, Anh Nguyen-Tuong and Chenzi Wang will start a new project backed by DARPA's Self-Regenerative Systems program. "[We'll] study approaches to system security inspired by biological diversity," he says.
Whether that inspiration leads to outright mimicry remains to be seen. For the moment, says IBM's White, terms like "self-healing software" and "autonomic computing" offer a convenient reference point for scientists eager to explore the next level of software complexity. Just as the sound barrier forced aircraft designers to radically revise aircraft and engine designs, so today's complexity barrier is forcing computer scientists to rethink systems design or, at the very least, to seek out new sources of inspiration.
"Today's systems have too many dials to watch; people can spend their whole lives figuring out how to make a database run well," White says. "We want to stand this notion of systems management on its head. The system has to be able to set itself up. It has to optimize itself. It has to repair itself, and if something goes wrong, it has to know how to respond to external threats. If I can think about the system at that level, I'm using humans for what they're good at, and I'm using the machines for what they're good at. That's the idea here."