Anyone who works with or on today’s computer networks knows how fragile and crotchety they can be. They are rarely more than a step or two away from disaster, and so, as Ellen Ullman once put it, “The real-world experience of system managers is a kind of permanent state of emergency.”
The sirens were screaming in Redmond this past week, as it was Microsoft’s turn to experience the sheer awfulness of network collapse. The company had to admit that it had fallen off the Internet for the better part of two days because, er, well, somebody screwed up. Caught up in the downtime were not only Microsoft’s own Web site but its MSN.com Internet service, MSNBC, Hotmail, Expedia and even Slate.
What had happened that could bring this legendarily proud “Business @ the Speed of Thought” company to its network knees? Though there was much idle chatter about hacker attacks in the wake of the initial outage, it gradually emerged that Microsoft had what those in the business call “a DNS problem.” A big DNS problem.
DNS is the Domain Name System, which translates the familiar verbal monikers by which we address computers on the Net, like www.salon.com, into the Internet’s fundamental numeric (IP) addresses: opaque strings of digits like 184.108.40.206. Without the DNS, there’d be no “dot com” or “dot” anything.
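For readers who like to see the idea in miniature, here is a toy sketch of that name-to-number lookup. The table, the resolve function and the address in it are all invented for illustration (the address is a reserved documentation placeholder, not Salon's real one); a real program would hand the question to the operating system's resolver, for instance through Python's socket.gethostbyname.

```python
# Toy stand-in for the DNS: a table from names to numeric addresses.
# The entry is hypothetical; 192.0.2.10 is a reserved documentation address.
RECORDS = {"www.salon.com": "192.0.2.10"}


def resolve(name, records=RECORDS):
    """Return the numeric address for a name, or None if the name is unknown."""
    # Domain names are case-insensitive and may end with a trailing dot.
    key = name.rstrip(".").lower()
    return records.get(key)
```

Calling resolve("www.salon.com") hands back the placeholder address, while a name that is missing from the table yields None, which is roughly what a browser experiences when the servers holding a company's records cannot be reached.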
The DNS is the reason that we can remember Web addresses, and even have verbal fun with them — like the pranksters who spammed the Internet registry with graffiti-like anti-Microsoft slogans by simply registering them as domain names. Compare it to the telephone system — which requires that we attempt to remember (or recall from tattered phonebooks or balky software programs) long strings of meaningless digits in order to contact anyone — and you realize that the Internet could be a lot harder to use than it is.
But in order for the DNS to work, we depend on a distributed network of computers that ask and tell one another where to find the various names, and update one another when addresses change. These domain-name server computers keep the Internet running — but they can also form what engineers call a “single point of failure,” a weak spot in the chain between you and a Web site you want to visit. Large Internet operations typically safeguard themselves by running multiple domain-name servers — so that if one goes down the others keep responding to the “where are you?” questions pouring in from the Net. This is called redundancy: a bad thing in writing, a prized thing in networking.
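Redundancy of this kind can be sketched in a few lines. The version below is only an illustration under simple assumptions: each "server" is a plain Python function (dead_server and live_server are made-up stand-ins, and the address is a placeholder), and an unreachable server is modeled by raising ConnectionError. Real resolvers also deal with timeouts, retries and caching.

```python
def ask_with_failover(name, servers):
    """Ask each name server in turn and return the first answer received."""
    for server in servers:
        try:
            return server(name)
        except ConnectionError:
            continue  # this server is unreachable, so try the next one
    raise ConnectionError("no name server answered for " + name)


def dead_server(name):
    # Hypothetical server that has been knocked off the network.
    raise ConnectionError("unreachable")


def live_server(name):
    # Hypothetical healthy server; the address is a placeholder.
    return "192.0.2.10"
```

With the list [dead_server, live_server] the question still gets answered. With nothing but dead servers in the list, the lookup fails outright, no matter how many of them there are.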
Until its problems hit, Microsoft was running four domain-name servers, but apparently all four were located together (physically and in terms of their network addresses), so that when the hapless Microsoft employee who misconfigured a router last Tuesday knocked them off the network, the corporation’s entire Internet presence gradually dimmed and died. This, then, was a double human error: A poor engineering choice in designing the system’s architecture in the first place made it highly vulnerable to an operational error down the line.
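A crude way to spot that sort of setup is to check whether every name server's address falls inside the same block of network addresses. The sketch below uses Python's standard ipaddress module; treating a shared /24 block as "located together" is an assumption made purely for illustration, and the function names are hypothetical. It says nothing about how Microsoft's servers were actually numbered.

```python
import ipaddress


def shared_networks(addresses, prefix_len=24):
    """Group addresses by the network block (of the given prefix length) they fall in."""
    groups = {}
    for addr in addresses:
        # strict=False lets us pass a host address and get its enclosing network.
        net = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        groups.setdefault(str(net), []).append(addr)
    return groups


def has_single_point_of_failure(addresses, prefix_len=24):
    """True when every address sits in one and the same network block."""
    return len(shared_networks(addresses, prefix_len)) == 1
```

Four servers numbered 192.0.2.1 through 192.0.2.4 (placeholder addresses) would all land in one block and trip the check; move one of them to a different network and it passes. Address diversity alone does not prove the machines sit in different buildings, so this is a first-pass sanity test rather than a guarantee.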
The latter can happen to anyone, but the former should be something a technological colossus like Microsoft can easily avoid, networking experts say — and they flayed Microsoft mercilessly for the goof on mailing lists and in press reports.
It was this same “single point of failure,” having been widely publicized as a result of the Tuesday-to-Wednesday outage, that in turn became the apparent target of “denial of service” attacks on Thursday — which, embarrassingly for Microsoft, knocked it offline again only a few hours after it had assured the world everything was back to normal.
Microsoft shouldn’t feel too bad — every Web site hits a problem patch like this sooner or later. (Earlier this month Salon had its own DNS problems, thanks to a screw-up on the part of the facility that hosts our Web servers.) Sure, when you’re a megabillion-dollar corporation trying to lead the business world into a new era in which all your software services are delivered over the Net, an extended and repeat service outage like this raises embarrassing questions. But reality checks like this are valuable in reminding users that networks are never perfect and forcing companies to fix mistakes. You can bet Microsoft’s DNS setup won’t ever be this shaky again (if it is, the company will have earned all the abuse it tends to draw in online forums).
What Microsoft ought to feel bad about is the inept and self-defeating way it went about dealing with the public as its ordeal unfolded. Initial radio silence from the company kept customers guessing what was really going on, and then confused comments trickled out for the public to try to decipher. A spokesman for Microsoft in Europe cued derisive flames by clumsily implying that the problem was not Microsoft’s fault but instead that of ICANN, the administrative group that oversees the assignment of domain names.
When the company finally issued a statement, instead of just coming out and admitting, “Oops, we goofed!” it headlined the release “Microsoft Explains Site Access Issues” — as though the whole fiasco were simply an opportunity for Microsoft to educate the public on a matter of mutual interest. It recycled that headline for its second-day release about the denial-of-service attack — inadvertently reinforcing the very notion it was trying to dispel, that the two outages were related. Each of Microsoft’s communiqués took pains to emphasize that the problems had nothing to do with its own products; they read more like a politician’s attempt to spin a scandal than anything else.
Let’s be fair: Anyone’s network can go down. Those engineers who heaped scorn on Microsoft for its dumb gaffe will sooner or later see their own handiwork fail thanks to some simple oversight or unpredictable complex snafu. When Microsoft’s critics assume their own superiority, they broadcast a kind of arrogance that’s just waiting to be knocked down a few pegs.
All that said, Microsoft’s reaction to its problem offered one more piece of data to reinforce the picture of the company that has emerged over the last decade. From its approach to fixing bugs in its software to its treatment of competitors to its stance in federal court during its antitrust trial — and now, in its handling of its own customers during a massive service failure — Microsoft presents itself as a monolithic giant that’s reluctant to admit failure, to share necessary information or to work with its customers in a spirit of openness. Though Bill Gates has tried to position his company as an innovative leader of the “new economy,” Microsoft’s evasive language and unresponsive behavior suggest that it remains remarkably unaffected by the changes in corporate culture that the Internet has begun to inspire in many other places.
Last year the authors of a book called “The Cluetrain Manifesto” argued, rather optimistically, that the new channels of communication the Internet has opened will make old-style corporate doublespeak obsolete: “In just a few more years, the current homogenized ‘voice’ of business — the sound of mission statements and brochures — will seem as contrived and artificial as the language of the 18th century French court.”
By that analysis, “Microsoft Explains Site Access Issues” may go down in history as the Bill Gates version of “Let them eat cake.”