Could the Colorado Rockies' site crash have been prevented?
If you follow either baseball or IT infrastructure, I'm sure you're aware of the crash of the Colorado Rockies' web server, shortly after they started selling tickets for the World Series games. The site is back up, although the company isn't admitting to the actual source of the problem other than to say that they were "the victim of an external, malicious attack that shut down the system."
Journalists are known for shoving their microphones (virtual and otherwise) into the middle of a disaster and asking the victims, "How do you feel?" I haven't called anyone at the Rockies or their partner Paciolan (the service that actually runs the e-commerce site). Aside from the fact that they're in the middle of a technology and PR firestorm, with the company located in Irvine, California, I think they have another kind of fire to put out right now. (How kind of me, huh?)
Instead, I've found myself contemplating the IT and business issues. By its nature, ticket sales for the World Series must be the worst kind of load testing nightmare imaginable: a sputter of traffic followed by several million people who want to witness whether Matt Holliday will actually be able to touch home plate this time. But if the ticket sales site was the victim of a malicious attack (and my cynical side whispers, "Sure sounds better to say that than to admit they screwed up, doesn't it?"), could they have done anything to mitigate the problem?
You probably don't have to cope with millions of angry customers... but if your server went down at a critical time, the result might be equally devastating.
What—if anything—could the IT folks at Paciolan have done to prevent this from happening? Was this a failure of load testing? Did they choose the wrong architecture? What lessons can you learn from their misfortune?
The architecture, first: The company doesn't say much about the technology it's using, but thanks to a smart correspondent on the Software QA Forums, it's easy enough to find out from their job search listings: currently Pick/Universe but moving to J2EE, with some uncertainty about the database to use.
I doubt, though, that the problem had anything to do with their platform. According to performance testing consultant Roland Stens, the site probably suffered a distributed denial of service (DDoS) attack, or was hit by automated scripts trying to snatch tickets. And, says Stens, the company tried to limit the damage by blocking IP addresses that made repeated requests for the same information:
"Alves explained that those who saw a 'page cannot be displayed' message had 'IP addresses that we blocked due to suspicious/malicious activity to our website during the last 24 to 48 hours. As an example, if several inquiries came from a single IP address they were blocked.'"
Said Stens, "This, effectively, also blocked traffic from sites with multiple users that use only one external IP address," which describes most companies. Stens also pointed out that Paciolan's online ticketing brochure promotes their solution with the line, "Power up your ticket sales, with the industry’s most robust online ticketing solution." That claim, he remarked, has obviously and painfully been contradicted.
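To see why Stens's objection matters, here is a minimal sketch of a count-per-IP block rule. It is not Paciolan's actual logic, which the company hasn't described; the function name, the threshold of 10, and the addresses (taken from the reserved documentation ranges) are all assumptions made up for illustration.

```python
from collections import Counter


def blocked_ips(requests, max_per_ip):
    """Return the set of IPs that sent more than max_per_ip requests.

    `requests` is a list of (ip, user) pairs. The rule only looks at the
    IP address, which is all a naive filter of this kind can see.
    """
    counts = Counter(ip for ip, _user in requests)
    return {ip for ip, n in counts.items() if n > max_per_ip}


# Hypothetical traffic, invented for this example:
# one scripted client hammering the site...
bot = [("203.0.113.7", "bot")] * 40
# ...50 employees at one company sharing a single external address...
office = [("198.51.100.1", f"employee{i}") for i in range(50)]
# ...and one fan on a home connection.
home = [("192.0.2.10", "fan")]

blocked = blocked_ips(bot + office + home, max_per_ip=10)
print(sorted(blocked))
```

Under these assumptions the rule blocks the bot, but it also blocks the office address, even though each employee there made only one request. The lone home user gets through. A filter keyed on IP address alone cannot tell 50 legitimate buyers behind one address from one script making 50 requests.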
DDoS attacks are not rare. Brian Karas, who's built several high-availability server farms, offers several specific suggestions. "This can be a particular weak point in many modern web servers because the content is often dynamically generated or pulled from a

