Wed, Apr 29, 2009 11:52 EDT

Topic: MobileBlog: Mobile WorkHorse
On Tuesday, April 28, users of Research In Motion’s (RIM) BlackBerry service throughout Europe, the Middle East and Asia suffered a widespread data outage of varying severity. Service stayed down for a couple of hours for most affected parties, longer for others. The downtime has since been attributed to a RIM Server Routing Protocol (SRP) outage that began at roughly 1:35 PM (GMT).
SRP is RIM's proprietary network protocol employed to transfer data between the company's BlackBerry infrastructure and organizations’ BlackBerry Enterprise Servers.
BlackBerry outages aren’t exactly uncommon nowadays, and Tuesday’s service disruption wasn’t particularly serious, since the problem was resolved relatively quickly for most of those affected.
Shortly after news of the outage, I spoke with both Zenprise and BoxTone, which offer competing BlackBerry infrastructure management and support products. I’ve covered both Zenprise and BoxTone frequently on CIO.com. (Read, “Eyes on Zenprise: How the Red Sox Keeps BlackBerrys in the Game,” and “myBoxTone Expert: The On-Device IT Help Desk” for more.)
Ahmed Datoo, VP of marketing for Zenprise, says the company’s U.K. customers were first alerted to the SRP disconnect at 13:32 (1:32 PM GMT). From Datoo:
“One of our customers in the UK got an alert from Zenprise...that the RIM SRP network went down. That network looked to be back up and running around 15:00. One of our U.S. customers (who) supports users in Europe received an alert from Zenprise (at) roughly the same time, but service for them was restored at 14:13.”
“One of the automated diagnostics that our product runs before triggering an alert is to telnet to port 3101 to test RIM connectivity (that’s the port the BES server talks to the RIM network on). It looks like one of the advertised IP addresses of the RIM network went down, and the traffic was rerouted to the secondary IP address. The propagation of the DNS changes may have taken some time, which is why some customers saw service restore faster than others.”
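For readers who want to see what a check like the one Datoo describes might look like, here is a minimal Python sketch. It is not Zenprise's code. It simply resolves a host name to every IP address DNS currently advertises and attempts a TCP connection to each one on port 3101, which is roughly what a manual telnet test confirms. The function names and the placeholder host name `srp.example.net` are assumptions made for illustration; substitute the SRP host your own BES is configured to use.

```python
import socket


def resolve_addresses(hostname, port):
    """Return the distinct IP addresses DNS currently advertises for hostname."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in addresses:
            addresses.append(ip)
    return addresses


def can_connect(ip, port, timeout=5.0):
    """Try to open a TCP connection, roughly what `telnet <ip> <port>` checks."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


def probe_host(hostname, port=3101, timeout=5.0):
    """Map each advertised address to True (reachable) or False (unreachable)."""
    return {
        ip: can_connect(ip, port, timeout)
        for ip in resolve_addresses(hostname, port)
    }


if __name__ == "__main__":
    # "srp.example.net" is a placeholder, not a real RIM host name.
    try:
        print(probe_host("srp.example.net"))
    except socket.gaierror as exc:
        print("DNS lookup failed:", exc)
```

If one advertised address fails while another answers, that would be consistent with the rerouting Datoo describes, though a probe like this cannot say anything about why an address stopped responding.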
Mitch Berk, director of product management with BoxTone, shared information that mostly coincides with Zenprise’s findings.
BoxTone customers in the U.K. first received alerts and specifics via e-mail updates at 1:35 PM (GMT), along with possible resolution instructions. The following is an example of what one of those BoxTone alerts looked like.
Alert : BES XXX : Critical -> Unavailable
Explanation = The SRP connection to the BES infrastructure has been lost due to network conditions such as packet loss, latency, or other symptoms of poor network conditions. The BES will automatically attempt to reconnect.
Possible Action = If this event is observed repeatedly or for a long duration, there may be network access issues from the BES to the RIM SRP Host (or the RIM NOC itself may be experiencing issues). The following items can be tested to check the connection with RIM:
1: Ping the SRP host via the windows command prompt (Ping Hostname)
2: Use the bbsrptest.exe utility (See RIM KB KB00804).
3: Telnet to the SRP host on port 3101 to verify connectivity to the SRP server.
4: Verify you aren't experiencing network outages or firewall configuration changes.
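The alert above suggests investigating only if the disconnect is "observed repeatedly or for a long duration." As a rough sketch of how a monitoring script might turn that advice into a rule, the hypothetical Python class below tracks probe results and flags either several separate disconnects within a time window or one disconnect that lasts too long. The class name and all three thresholds are assumptions chosen for illustration; they do not come from BoxTone or RIM.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class OutageTracker:
    # Hypothetical thresholds; tune them to your own environment.
    max_disconnects_in_window: int = 3   # "observed repeatedly"
    window_seconds: float = 3600.0       # how far back disconnects are counted
    max_outage_seconds: float = 900.0    # "for a long duration"
    _disconnect_times: List[float] = field(default_factory=list, init=False)
    _down_since: Optional[float] = field(default=None, init=False)

    def record(self, timestamp: float, connected: bool) -> bool:
        """Record one probe result; return True if someone should investigate."""
        if connected:
            self._down_since = None
            return False
        if self._down_since is None:
            # Transition from up to down: count it as a new disconnect event.
            self._down_since = timestamp
            self._disconnect_times.append(timestamp)
        # Forget disconnect events that fall outside the counting window.
        cutoff = timestamp - self.window_seconds
        self._disconnect_times = [t for t in self._disconnect_times if t >= cutoff]
        repeated = len(self._disconnect_times) >= self.max_disconnects_in_window
        prolonged = timestamp - self._down_since >= self.max_outage_seconds
        return repeated or prolonged
```

Feeding the tracker one result per probe, with timestamps in seconds, keeps a single brief disconnect quiet, since the BES reconnects on its own, while flagging the conditions the alert calls out.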
That specific BoxTone customer saw the SRP connection restored in roughly 2.5 hours, according to Berk.
Berk also