Photo: Steffen / Flickr
Skype's developers have just posted a fascinating explanation for last week's massive service disruption, in which millions of people were unable to log in to the popular Internet phone service for more than a day.
On the company's network-status blog, Skype explains that contrary to many rumors, it wasn't terrorists or hackers or other nefarious types who caused the system to go down. It was us: me and you and everyone else who loves Skype. Last week, a huge number of us restarted our computers at more or less the same time -- and all those simultaneous Skype login requests triggered a previously unknown bug in the software that "prompted a chain reaction that had a critical impact," Skype says.
But wait a minute -- why did everyone restart their machines at the same time? Because Microsoft put out some updates for Windows last week, and people rebooted after installing the patches. The Windows updates were routine, and nothing in them interfered with Skype -- it was merely that the reboots caused folks to log out of Skype and then log back in.
Skype operates on a peer-to-peer network. On such networks, data isn't sent through a central server, but is instead routed through many network nodes; peer-to-peer networks are hailed for stability, because they contain few critical locations whose failure could cause the whole network to come down. This outage, though, highlights an intriguing vulnerability in such networks: social behavior. A peer-to-peer network is nothing, after all, if a huge number of peers suddenly quit the system at the same time -- and who can predict when that'll happen?
Well, maybe you'd say that Skype should have predicted this; after all, Windows updates aren't unusual. The bug uncovered in last week's outage has apparently been lying dormant in the software for more than four years -- shouldn't someone have spotted it by now?
But this is a classic peak-load problem. Tens of millions of people logging in at the same time is a very rare thing, and it's unreasonable to expect any network to handle that well. I'd venture it's also bad management to spend engineering resources planning for such an unusual problem rather than for what happens during ordinary use. Perhaps Skype could have run simulations to test the system in the event of a flood of login requests -- and, who knows, maybe it did, and still missed this bug.
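For the curious, here is a toy sketch in Python of what the simplest possible login-flood simulation might look like. Nothing in it reflects Skype's actual architecture or the bug itself; the function name `simulate`, the capacity figure and the arrival numbers are all invented for illustration. It only shows how a system sized for ordinary traffic copes with one sudden burst.

```python
def simulate(arrivals, capacity):
    """Return the backlog of pending logins after each tick.

    arrivals: new login requests per tick (hypothetical numbers)
    capacity: logins the network can complete per tick (hypothetical)
    """
    backlog = 0
    history = []
    for new_requests in arrivals:
        # Whatever cannot be served this tick carries over to the next one.
        backlog = max(0, backlog + new_requests - capacity)
        history.append(backlog)
    return history


# An ordinary day: demand stays below capacity, so nothing queues up.
ordinary = [80] * 10

# Same day, except that in one tick a crowd of rebooted PCs logs in at once.
spike = [80, 80, 1000, 80, 80, 80, 80, 80, 80, 80]

print(simulate(ordinary, capacity=100))
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(simulate(spike, capacity=100))
# [0, 0, 900, 880, 860, 840, 820, 800, 780, 760]
```

In this made-up model the spare capacity is only 20 logins per tick, so a single burst leaves a backlog that takes a long time to drain. A real test would be vastly more involved, which is part of why a rare peak-load bug can slip through.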
And that's in the nature of bugs; through use and testing, we clean them out, but there are many others lying in wait. Skype has had four years of pretty good service. This outage, it says, "was unprecedented in terms of its impact and scope."
The company says it will put out "a number of improvements to its software to ensure that our users will not be similarly affected in the unlikely possibility of this combination of events recurring." That's good. Just don't expect perfection.