Prowling the ruins of ancient software

Famous programs from just a generation or two ago are in danger of disappearing from human ken, forever.

Published July 30, 2003 7:30PM (EDT)

For Grady Booch, the nightmare goes something like this: Deep in the future, a team of archaeologists stumbles onto a rare cache of 20th century art, a major assortment of works thought lost to the ravages of time.

The only problem, of course, is that they don't know it. All the images are recorded in an obsolete digital format, JPEG, and nobody knows how to unscramble the data. As a result, the hard disk containing said artwork spends its days not in a museum but as a coffee coaster in some college professor's crowded office.

"It might seem silly now, but put yourself 1,000 years in the future," says Booch, chief scientist at IBM's Rational Software subsidiary. "It's not too hard to imagine."

In an industry where one man's clever C code is another man's Linear B, Booch already knows the frustration of playing software archaeologist. As co-developer of the Unified Modeling Language (UML), a mid-1990s effort to create a common "blueprint" notation for object-oriented software programs, he's spent the last 10 years laboring to spare future programmers the same torment.

It's an uphill battle on a hill that is only growing steeper. With new programs replacing old and no major company or institution playing the central role of source-code archivist, the amount of software history currently circling the memory hole is scarily large. And even if there were a central institution, recent changes to the copyright code have made the transfer of source code from old media to new forms of storage a dicey prospect, legally. Add it all up, and you have the ideal makings for what some are already calling the "digital dark age."

"Things are going to be lost not because people don't want to save them or because the original creators don't want to save them, but because they can't save them," says Brewster Kahle, founder of the Internet Archive, an institution that has lobbied for a safe harbor within the Digital Millennium Copyright Act to shield institutions looking to archive source code.

For Booch, the barriers to software preservation aren't so much legal as educational. Most developers have come to accept the evolvable nature of software programs. What is lacking is the ability to examine static source-code snapshots with a scholarly, comparative eye. In the interest of encouraging that skill, Booch this fall will lead a seminar on software archaeology and preservation at the newly reopened Computer History Museum in Mountain View, Calif.

"Our industry has had a major effect in changing the world," says Booch, talking over the phone from his Denver, Colo., office. "It would be great if we could preserve the artifacts and interview the architects while they're still alive."

Booch isn't alone. Now that the hysteria surrounding Y2K has faded, developers are free to worry about legacy code again. One increasingly common worry: What to do with it? For every modern offshoot of DOS/Windows, Unix and Macintosh OS evolving with the marketplace, a dozen ghost programs lurk inside yellowed engineering pads, punch-card stacks and slowly degaussing magnetic memories. Even if programmers could get their hands on these programs and find a way to preserve and update their contents, a new question emerges: How do you qualitatively analyze those contents on a historical basis?

"It's funny," says Dave Thomas, a Dallas software consultant and co-author, with Andrew Hunt, of "The Pragmatic Programmer," a 1999 book on software design methods. "Colleges spend a lot of time teaching people how to write code, but very few teach them how to read code. When you think about it, we programmers spend most of our time reading code, not writing code."

To help fill the gap, Thomas served as cohost of the 2001 Software Archaeology: Understanding Large Systems workshop, held at the Object-Oriented Programming, Systems, Languages and Applications (OOPSLA) conference. Starting with the unifying question, "How do you come to grips with 1,000,000 lines of code right away?" conference speakers traded various tips, tools and techniques acquired through professional and personal encounters with unfamiliar systems.

"Whenever we're faced with big problems in software, we tend to fall back on metaphors," says Thomas. "In this case the archaeology metaphor happens to be a good one. Sometimes you do archaeology with a backhoe. Sometimes you do it with a toothbrush."

Those partial to the backhoe approach can use Ward Cunningham's Signature Survey program. Billed as a "method for browsing unfamiliar code," Signature Survey scans through source code and compresses each file into a single line of punctuation symbols. Operating on the assumption that a file's size is proportional to the number of punctuation marks separating individual elements (packages and files in Java, for example), Signature Survey offers a quick guide to programming thickets and areas of quick repetition.
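For readers who want to see the idea in miniature, here is a rough sketch in Python. It is not Cunningham's program. It assumes the punctuation worth keeping is braces and semicolons, and the names `signature` and `survey` are invented for this illustration.

```python
def signature(source: str, keep: str = "{};") -> str:
    """Reduce source text to only the punctuation characters in `keep`."""
    return "".join(ch for ch in source if ch in keep)


def survey(files: dict[str, str], keep: str = "{};") -> list[str]:
    """Build one report line per file: name, line count, then signature.

    `files` maps a file name to its full text (an assumption made so the
    sketch stays self-contained and never touches the disk).
    """
    report = []
    for name, text in sorted(files.items()):
        line_count = len(text.splitlines())
        report.append(f"{name} ({line_count}) {signature(text, keep)}")
    return report


if __name__ == "__main__":
    sample = {
        "Hello.java": "class Hello {\n  void hi() {\n    print();\n  }\n}\n",
    }
    for line in survey(sample):
        print(line)
```

Run over a whole source tree, a report like this puts long, dense or oddly repetitive signatures side by side, which is where the human pattern recognition is supposed to take over.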

"It's a satellite system for looking over large bodies of work," Cunningham says. "It lets you use your own human pattern recognition to see variation over the whole program. It also leads you to interesting parts of the program to read."

Thomas says his own preferred technique is to import a program's contents into Microsoft Word and reduce the zoom factor as far as it will go. The resulting 50-page image leaves little for the eye to make out other than jagged patterns of text and blank page. Still, even these patterns can reveal peculiar anomalies in developer mood or style. "Sometimes the structure is easier to see at that level than if you're digging around line-by-line," he says.
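A text-only approximation of the same zoomed-out view can be sketched in a few lines of Python. This is not Thomas' method, which relies on a word processor. The function name `silhouette` and the `scale` factor are assumptions made for the illustration.

```python
def silhouette(source: str, scale: int = 4, mark: str = "#") -> str:
    """Shrink each line of code to a short bar that keeps only its shape.

    Indentation and line length are both divided by `scale`, so the output
    shows the jagged outline of the text rather than the text itself.
    """
    rows = []
    for line in source.expandtabs(4).splitlines():
        stripped = line.strip()
        if not stripped:
            rows.append("")  # blank lines stay blank to preserve the gaps
            continue
        indent = (len(line) - len(line.lstrip())) // scale
        width = max(1, len(stripped) // scale)
        rows.append(" " * indent + mark * width)
    return "\n".join(rows)


if __name__ == "__main__":
    print(silhouette("def f():\n    return 1234\n\nx = 1\n"))
```

Deeply nested blocks drift to the right and long runs of similar lines form solid slabs, the kind of patterns Thomas describes spotting at low zoom.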

Both Thomas and Cunningham liken their techniques to the aerial surveys some archaeologists use to spot the overall structure of burial mound networks, neolithic cairn patterns, etc.

"It shows the most interesting places to dig," says Cunningham.

It also provides a quick way to track the flow of ideas and source code from one program to the next. Cunningham, a man best known on the Web as the creator of the Wiki collaborative online authoring language, has loaned out his forensic talents to companies embroiled in legal disputes over intellectual property and prior art. He's also used Signature Survey to refactor, or streamline, his own programs, stripping out redundant sections and commands.

When it comes to the toothbrush level, forensic tools and techniques are still in development. Booch says the fall workshop will discuss ways to analyze the fine structure of programs and to detect the emergence of novel techniques. One potential benefit of such knowledge would be a steep reduction in the number of frivolous patent claims filed by software companies.

"IBM believes in patents. I believe in them, too, but there are a lot that look suspicious," Booch says. "What better way to check for prior art than to have the source code ready and available for inspection?"

Herein lies the final goal of the fall Computer History Museum conference: to provide a foundation for a future exhibit on classic software programs and to provide a "vocabulary" for the intellectual dissection and discussion of these programs.

"Maybe I'm horribly geeky," says Booch, "but I find tremendous beauty in looking at well-written software programs. There's an elegance, a brilliance that we're only now developing the critical means to describe. We have literary critics. We have art critics. We don't have any software critics, yet. We need software critics, too."

Booch and his allies will need to overcome a number of obstacles first. The largest obstacle at the moment is the lack of a central source code repository. In an online article, Elisabeth Kaplan, an archivist at the University of Minnesota's Charles Babbage Institute, lays out the frustrating history of software preservation. In 1986 the Computer Museum, a Boston forerunner of the current Computer History Museum, commissioned a report on how to archive software programs. That report identified many of the challenges but left the solutions to future reports. In 1988, the Library of Congress created a Machine Readable Collections Reading Room, essentially a repository of old machines capable of reading out-of-date programs. The project was phased out a few years later, however.

Since then, the topic of preservation has resurfaced every three years or so, a periodic rate roughly coincidental with the upgrade cycle of most commercial software programs, by the way.

"The issue comes up again and again," says Kaplan. "From an archival perspective, though, it's just not worth it to put resources into preserving software. There's just not enough projected use. The fact is, when you add up the amount of people who can use these programs, there are like five of them."

One institution willing to take up the burden is Kahle's Internet Archive. The Internet Archive already stores screen shots of Web sites and other artifacts of the digital age. Adding source code to the mix would be easy enough, says staff software preservationist Simon Carless. Unfortunately, legal issues and aging copy-protection mechanisms make it difficult to provide a decent record of historic programs.

Carless says the Digital Millennium Copyright Act clouds the current preservation landscape. Although the 1998 law lets archives make copies of copyright-protected works for preservation purposes, it imposes harsh criminal penalties for any circumvention of copy-protection mechanisms. Rather than risk legal blowback, Carless and the Internet Archive are currently petitioning Congress to clarify that archival organizations are exempt from such penalties.

"Even if you're an institution that's allowed to archive stuff, there's still a possible DMCA problem," Carless says. "If there's a physical hardware dongle that restricts copying, are you allowed to emulate that dongle to get the software running or does that qualify as a circumvention? We don't know."

Carless and the Internet Archive have recently requested that Congress expand its list of exemptions to Sec. 1201 of the DMCA, the portion that prohibits the circumvention of copy-protection mechanisms, to include software source-code preservation efforts. While waiting for a response, the Internet Archive has built a page displaying famous programs currently on the brink of software extinction.

In a similar attempt to rally the public, the Computer History Museum's Booch has sent out surveys asking programmers to nominate "classic" programs for a potential source-code exhibit. The list, originally intended to be a Top 50, already includes more than 150 games, applications, tools and programming languages. He hopes to devote the upcoming seminar to discussing how to present such programs to the public in a way that encourages further study and preservation.

"There's a great difference between walking up and showing somebody the Illiac and showing them the original source code for Lotus 1-2-3," Booch admits.

Booch hopes to ally the preservation movement with two powerful forces: the World Wide Web and the open-source software community. Both have already proven invaluable in the preservation and publication of coding techniques, he says. He also plans to lobby companies with a stake in seeing their early works preserved.

Though Booch is hesitant to predict a donation of the original DOS source code from Microsoft, he has spoken with archivists inside the Redmond-based company wrestling with the same ideas. He also holds out hope that, with a little schmoozing and a little ego massage, the Computer History Museum might be able to encourage a more direct form of participation.

"Imagine somebody 100 years from now watching Bill Gates explaining the structures of his first program," says Booch, throwing out yet another hypothetical scenario. "Just think: Fox could have a reality show on software programming."

Booch punctuates his dream scenario with a quick laugh: "Actually, that's pretty scary when you think about it."


By Sam Williams

Sam Williams is a freelance reporter who covers software and software-development culture. He is also the author of "Free as in Freedom: Richard Stallman's Crusade for Free Software."
