Engineers at its European headquarters in Slough, Berkshire, as well as its corporate base in Waterloo, Ontario, are still investigating what went so badly wrong. According to industry sources, however, a picture is beginning to emerge.
While Slough is the site of RIM’s European headquarters, and is also in charge of operations in the Middle East and Africa, it is not the physical location of the stacks of networking equipment that actually serve the tens of millions of BlackBerry users in these regions.
The firm is famously tight-lipped on such matters, but it is widely known within the mobile industry that the machines are actually maintained at its site in Egham, Surrey.
The problems began on Monday at around 10AM BST. Mobile networks noticed that BlackBerry internet traffic had fallen away completely. Senior RIM executives confirmed to them that there was a problem, and that an urgent investigation had been launched.
RIM’s investigation revealed the apparent cause of the outage to be a failed Cisco switch in its core network. Switches are basic components of Internet Protocol networks. They are specialised computers that direct communications within networks; in this case the emails, web browsing and instant messages of millions of BlackBerry Internet Service users.
On day three of the crisis, RIM publicly admitted it had suffered a “core switch failure”.
If everything had worked to plan, the failure would not have mattered. A backup system also failed, however, for reasons that remain obscure and will surely be among the top priorities of RIM’s own post-mortem investigation.
Ironically, suspicion has fallen on a network upgrade programme specifically designed to prevent outages.
Involving “fundamental” changes, it was initiated after a North American BlackBerry outage in December 2009. Work in Britain was completed only two months ago, sources said.
After Monday morning’s collapse, RIM’s engineers decided to revert the software running the switching infrastructure to the pre-upgrade version. This meant the Internet Protocol backbone of the BlackBerry network in Europe, the Middle East and Africa had to be rebuilt from scratch. Effectively reset, the switches and routers had to learn where they were within the network and how to talk to each other again.
Yet this is normally a relatively simple job, perhaps taking a few hours, experienced network engineers have told The Telegraph. The core switch and backup switch failure, and the software rollback, need not have caused a 72-hour-plus disaster.
But at an unknown point following the switch failure, the Egham data centre’s Oracle database, a bespoke and heavy-duty communications data storage application, was corrupted. This database is effectively the “brain” of the BlackBerry Internet Service, handling messages and forwarding data to users.
With saving the Oracle database the top priority, RIM was forced to repair software while it was still running – a difficult and fraught process known as a “hotfix”.
“Working with a live database like that is the stuff of nightmares,” explained one network engineer.
This database corruption problem, according to industry sources, is thought to be the reason the outage lasted well into Thursday for many users.
The period of message delays and patchy browsing that marked the end of the outage, and spread to North America and Asia, was caused by the backlog of data that built up. RIM’s global systems had to grind through huge quantities of data as its European systems were gradually fixed.
The firm had prematurely declared victory on Tuesday morning, when it said services had been “restored”. But the database problems prompted a second collapse before users received any data. Of all the public relations mistakes commentators have said RIM made this week, this act of optimism annoyed its network partners the most.
“They should have just asked us - we weren’t seeing any BlackBerry traffic,” said a source at a British network.
"They were being far too positive. It didn't help."
Britain’s mobile network chiefs are said to be “absolutely furious” with RIM, but they know that they must continue to work with it, as BlackBerry users’ internet service will always depend on the firm.
In a press conference at 3PM on Thursday, Mike Lazaridis, RIM's founder and co-CEO, having already apologised profusely, finally announced systems were returning to normal. This time he was backed up by the mobile networks, who saw their BlackBerry subscribers’ data connections flicker back to life.
Source : http://www.telegraph.co.uk