BACKGROUND -- I was responsible for managing second- and third-tier production support for the electronic transfer of funds and securities routed between back-office settlement applications and external agencies such as other banks and global clearance networks. Because these transfers were generally valued at over a billion dollars per day, there were strict quality-of-service requirements: messages could not be lost or duplicated, and they had to be delivered in order within specific timeframes. As in most commercial B2B trading chains, multiple companies, departments, computers, and network elements sat between our applications and our trading partners, and any single element in these chains could disrupt service. Disaster recovery was critical.
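To make the delivery requirements concrete, here is a minimal sketch of an audit for lost, duplicated, and out-of-order messages. It assumes each message carries a consecutive integer sequence number starting at 1. That numbering scheme, and the name `audit_sequence`, are illustrative assumptions rather than a description of the production system, and the sketch does not cover the delivery-timeframe requirement.

```python
from typing import Dict, List


def audit_sequence(received: List[int], first: int = 1) -> Dict[str, List[int]]:
    """Classify delivery problems in a stream of sequence numbers.

    Assumption: the sender numbers messages consecutively from `first`.
    """
    seen = set()
    duplicates: List[int] = []
    out_of_order: List[int] = []
    highest = first - 1

    for seq in received:
        if seq in seen:
            # Same sequence number delivered more than once.
            duplicates.append(seq)
            continue
        if seq < highest:
            # Arrived after a later-numbered message had already arrived.
            out_of_order.append(seq)
        seen.add(seq)
        highest = max(highest, seq)

    # Any number up to the highest seen that never arrived counts as lost.
    missing = [s for s in range(first, highest + 1) if s not in seen]
    return {"missing": missing, "duplicates": duplicates, "out_of_order": out_of_order}
```

For example, `audit_sequence([1, 2, 2, 4, 3, 6])` reports message 5 as missing, message 2 as duplicated, and message 3 as out of order.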
MY APPROACH -- My team managed this traffic over a four-year period with no failures that resulted in adverse business consequences. Our compliance with SLAs (service level agreements) was due to a combination of agile architecture, well-rehearsed recovery processes, management oversight, and fall-back options. My hardest lesson was accepting responsibility for failures beyond my control, namely those at other companies. I have a talent for coordinating the troubleshooting efforts of many geographically distributed teams (with different agendas) while keeping attention focused on the core problem. My approach generally involves these steps: evaluate risks, remove the failure from the critical path of customer expectations, fix the problem offline, restore failed services, and document lessons learned.
EXAMPLE -- The following is an actual production problem that occurred in a fully redundant MQSeries network that was never supposed to fail, but did, 15 minutes before the Federal Reserve Wire deadline. After traffic for that day was re-routed (within the 15-minute window), the problem was examined offline. The cause was a technician at a remote bank (beyond our team's control) who had modified a firewall access list for an unrelated application. The change disrupted the TCP three-way handshake for session establishment, which knocked out MQSeries between the banks. MQSeries still failed even though we used RIP (Routing Information Protocol) to dynamically route traffic from the primary to a secondary firewall router. I found the problem by coordinating an investigation at both banks, using divide and conquer across teams with protocol traces at each test point. I learned the value of fire drills, and never to rely on a single technology solution (e.g., MQSeries) to maintain business continuity.
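The divide-and-conquer idea can be sketched in a few lines: walk the test points along the path in order and report the first one where the TCP three-way handshake does not complete. This is only an illustration, not the tooling used in the actual investigation, which relied on protocol traces gathered by each team. The function names and the test-point tuples are hypothetical.

```python
import socket
from typing import Callable, Iterable, Optional, Tuple

# (label, host, port); every value supplied here is a hypothetical test point.
TestPoint = Tuple[str, str, int]


def tcp_handshake_ok(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection returns only after the three-way handshake completes.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def first_failing_point(
    points: Iterable[TestPoint],
    probe: Callable[[str, int], bool] = tcp_handshake_ok,
) -> Optional[str]:
    """Walk the test points in path order; return the first label that fails."""
    for label, host, port in points:
        if not probe(host, port):
            return label
    return None  # every test point accepted a connection
```

A caller would pass the test points in path order, for example `first_failing_point([("local listener", "mq-local.example", 1414), ("remote listener", "mq-remote.example", 1414)])`. The host names are placeholders, and 1414 is the default MQ listener port. A failed probe only localizes the fault to a segment of the path; a protocol trace at that point is still needed to see why the handshake fails.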