Data-mining life on earth

Every blade of grass, every fish and fowl, slug and snail, has a place on the Web.

Published October 28, 2002 6:30PM (EST)

David Maddison, an entomologist at the University of Arizona, specializes in the evolution of beetles -- not all beetles everywhere in all eras, just a subgroup of present-day ground beetles known as carabids.

Maddison's specialty may sound comically narrow, but his purview includes no fewer than 30,000 species. And not even an expert like Maddison can keep track of all the connections that link those tens of thousands of different bugs, much less the rest of the insects and fungi, plants and animals living on earth. There are between 1.5 and 1.7 million known species today, with millions more as yet undiscovered.

"The average layperson thinks that we know about 99 percent of the species that occur on the earth, and that's the farthest thing from the truth," says Gary Waggoner, a botanist at the Center for Biological Informatics at the United States Geological Survey. "On a global scale, we know maybe 10 percent of what's out there, and the remaining 90 percent have yet to be discovered and described and classified."

Estimates of the number of species on earth range from 5 million to as high as 100 million, with many researchers guessing in the 30 million range, says Ryan Phelan, CEO of the All Species Foundation, a nonprofit that has set a moon-shot goal for science of cataloging all life on earth within 25 years.

But the push to uncover new species faces many obstacles. Scientists, even specialists in the same narrow field, often literally can't figure out what their counterparts are talking about. A single well-known species, like, say, a quaking aspen, can go by a dozen different scientific names at different times and in different cultures -- and that's not even counting the common names. "It was a real problem," says Waggoner, who has worked on a taxonomic index database for the U.S. government since 1993. "People who called things by different names from different parts of the country would get together and discuss things, and that confusion would foster debates."

The result is a colossal taxonomic cross-indexing problem, with no easy way of getting your beetle: "The vast majority of what we know is actually sitting on dusty library shelves and not in electronic form," laments Maddison.

Enter the Internet, a taxonomist's best friend. Maddison and Waggoner and many other researchers are busily data-mining life on earth, using the Web to create a collective memory of all living (and dead) species. They're pitting database architecture against the mysteries of nature.

Maddison is taking the evolutionary approach: his Tree of Life Web Project aims to trace the entire history of life right down to its roots. Other catalogs range from the regional, such as CalFlora, which focuses on toting up all California plants, to the universal, such as the All Species Toolkit. At last count, the toolkit, the largest collection of species in any one place on the Web, included references to 873,979 living things.

Other scientists are trying to deploy the Web to help unravel the "synonym" problem -- duplicate names for the same critter. The United States' no-nonsense Integrated Taxonomic Information System currently contains virtually all the country's known vertebrates, among other creatures. The ITIS is contributing to an ambitious worldwide effort to do the same, the Species 2000 Catalogue of Life, which currently lists more than 260,000 species along with 420,000 synonyms.
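
To make the synonym problem concrete, here is a minimal sketch in Python of how an index might point many names at one accepted name. The table and the `resolve` function are hypothetical illustrations, not the actual ITIS or Species 2000 schema, and the handful of aspen names shown should be checked against a taxonomic authority before being relied on.

```python
# Hypothetical synonym table: every recorded name, scientific or common,
# points at a single accepted scientific name. Entries are illustrative only.
ACCEPTED = "Populus tremuloides"

SYNONYMS = {
    "populus tremuloides": ACCEPTED,
    "populus aurea": ACCEPTED,
    "populus tremuloides var. aurea": ACCEPTED,
    "quaking aspen": ACCEPTED,
    "trembling aspen": ACCEPTED,
}


def resolve(name):
    """Return the accepted name for a recorded synonym, or None if unknown."""
    # Normalize case and collapse runs of whitespace before looking up.
    key = " ".join(name.lower().split())
    return SYNONYMS.get(key)


print(resolve("Quaking Aspen"))
print(resolve("Populus aurea") == resolve("trembling aspen"))
```

Two researchers who type different names land on the same record, which is the whole point of such an index. Deciding which name counts as the accepted one is left to the taxonomists.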

Plans for using the Web to study the earth's biology go far beyond just getting the proper name on every known beastie. The Global Biodiversity Information Facility, a worldwide consortium of scientists, is dedicated to unlocking all the biodiversity data currently housed in the planet's natural history museums, libraries and databases, while the U.S. is hard at work on its own National Biological Information Infrastructure.

"Some people speculate that in the future, a lot of experiments are going to be performed not so much in a lab, as against your database," says Robert Wilensky, UC Berkeley computer science professor, who works on the Digital Library Project, a program that's helped bring more than 8,000 California plants online, as well as 5,000 species of amphibians.

But putting a stake into the earth -- teeming with microscopic organisms -- and declaring that you're going to catalog all known life isn't as simple as getting a bunch of databases to talk to each other.

"This is much less a technological challenge than a social challenge," says Ann Davis, the executive director of CalFlora, who has collected more than 200,000 observations about plants in the site's database. Getting scientists and even laypersons to centralize their field research means, paradoxically, inspiring them to give up their control over their own data. Sharing an observation about an invasive artichoke thistle amounts to giving that information away.

Another problem is even more fundamental. Is scientific information really so definitive that it can be neatly organized into a happy tree of life that everyone can agree on? Evolution is messy, and understanding it even more so -- it's not that easy to reduce it to a neat and tidy series of Web pages that will make everything completely transparent to an audience that ranges from rangers in the forestry service to kindergarten kids researching dinosaurs.

"Biologists now share the view that there is a real tree of life out there," says Maddison. At least they agree on that much. "That tree is 3-point-some billion years old, and we're just on the leaves of this tree. As you look back at the branches, you're looking into the ancestral lineages that connect us one to another." On the Tree of Life Project, the links between pages represent the genetic connections over time, organizing species according to their "phylogeny" -- the history or evolution of a species.

Maddison and his brother Wayne Maddison first wrote software to try to help untangle evolution in the 1980s, when they developed their MacClade program, now a standard for phylogenetic analysis. Since that work moved onto the Web as the Tree of Life project in 1994, some 340 scientists from 21 different countries have contributed to the tree, with experts in starfish and sea cucumbers each contributing to the branches they specialize in.
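
The idea of a tree whose links stand for ancestral lineages can be sketched in a few lines of Python. This is only an illustration, not the Tree of Life project's own software or data model: the `PARENT` table is a heavily simplified fragment with most intermediate branches left out, and the function names are hypothetical.

```python
# Hypothetical, heavily simplified fragment of a tree of life.
# Each group points to the larger group it branches from.
PARENT = {
    "Carabidae": "Coleoptera",    # ground beetles are beetles
    "Coleoptera": "Insecta",      # beetles are insects
    "Lepidoptera": "Insecta",     # so are butterflies and moths
    "Insecta": "Arthropoda",
    "Arthropoda": "Life",         # many intermediate branches omitted
}


def lineage(group):
    """List a group and its ancestors, from the leaf back toward the root."""
    path = [group]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def common_ancestor(a, b):
    """Return the most recent group that both lineages share, or None."""
    ancestors_of_a = set(lineage(a))
    for group in lineage(b):
        if group in ancestors_of_a:
            return group
    return None


print(lineage("Carabidae"))
print(common_ancestor("Carabidae", "Lepidoptera"))
```

Following the links upward from any two leaves until the paths meet is what "looking back at the branches" amounts to in this toy version. Representing competing hypotheses would take more than a single parent per group.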

The focus so far has been on getting the main branches and roots right, with some success, but the 2,400 pages produced by the collective effort represent just a fragment of the whole history of life on earth. As for dissension among the experts, for now, the learned opinions of whichever biologist has authored a given page hold sway, but alternative views are often included.

"The eventual hope is the system will allow for whole alternative trees," says Maddison, trees that will represent the ideological debates among evolutionary biologists, and show the tree of life as it really grows -- through dispute and discussion.

By Katharine Mieszkowski

Katharine Mieszkowski is a senior writer for Salon.
