But I do want to say that the film did a nice job of covering the main themes of the event and (some of) its causes. It’s a Hollywood disaster movie, rather than a documentary, so one shouldn’t expect perfect coverage of all the details. But the essence of the disaster is quite well captured, from the relationship between BP and Transocean to the disturbing scene (and a compelling illustration of confirmation bias in action) when the negative pressure test is redone because it doesn’t initially provide the “right” result.
My friend John Almandoz and I recently published a paper on the relationship between the proportion of banking experts on a bank’s board of directors and the likelihood that that bank will fail. A reader-friendly summary just appeared in the online edition of Harvard Business Review. Click HERE to take a look, and let me know what you think!
What my trip to the airport teaches us about why catastrophe happens.
I noticed the feeling of rain in the air as soon as I stepped out of my final afternoon meeting on this trip to New York. Since moving to Seattle, I have missed the powerful, dramatic summer thunderstorms that the East Coast produces, and I suspected that one was in store.
I had a few hours until I had to be at JFK, but it was too early to have dinner, so I decided to retreat to a cafe in nearby Flatiron for a respite from the heat and humidity.
A few minutes later, the deluge arrived. The rain hammered the street as well-prepared pedestrians hoisted umbrellas. It soaked the unprepared.
I was hoping to have dinner at a restaurant that was a fifteen-minute walk away. Burdened by a rolling suitcase, and lacking an umbrella or rain jacket, I decided to check on Uber.
When I first looked, the surge multiplier was around 2.0, but in minutes the rain and rush-hour commute caused it to skyrocket to nearly four (with long wait times to boot). I decided to simplify and grab dinner around the corner, scurrying from awning to awning to avoid a soak.
After a quick dinner, I checked in on Uber again. Still a surge multiplier of 4.0, which made the ride from Manhattan to JFK around $400, more than I wanted to spend.
But I’ve lived in New York before, so I decided to use my go-to backup option: the Long Island Rail Road and AirTrain. The rain let up a little, and I made my way to Penn Station, a few-block walk and a quick subway ride away. Commuters packed the station, some milling about waiting for more information about delays on their lines. But the train to Jamaica still seemed OK, so I purchased a ticket, rushed down to the track, and jumped on the 6:31 train with only a moment to spare. If everything went well, I’d arrive in plenty of time for my 8:45 flight.
But everything wasn’t going well. Instead of speeding out of Penn Station, the train just sat.
A few minutes later, the conductor announced, “I have just heard from the Station Master: we’re being held momentarily in the station.” Commuters groaned collectively—a momentary delay on a day like today was unlikely.
Their suspicions were well founded. In a few more minutes, the conductor announced that “due to weather-related signal problems near Jamaica, this train line is being suspended until further notice.”
There I sat, a victim of illusory redundancy: the situation in which a backup system is vulnerable to the same disruption as the system it’s meant to protect. Illusory redundancy is all about unexpected correlations. In this case, the rain both drove up demand for Uber and caused signal failures on the LIRR.
Illusory redundancies lie at the core of many meltdowns. During Hurricane Sandy, for example, the storm surge destroyed a key substation and blacked out a large chunk of Manhattan. At the same time, the surge flooded NYU’s state-of-the-art hospital, located a few blocks uptown, disabling its backup generator and forcing the evacuation of critically ill patients.
Before the launch of the Space Shuttle Challenger, engineers at NASA assumed that having two O-rings on the potentially problematic solid rocket boosters provided sufficient protection: even though cold temperatures might affect one important seal, they argued, the second would provide a backup. Yet when the primary O-ring failed during the stress of the launch, hot gases from the booster quickly eroded the backup, causing it to fail as well and leading to Challenger’s destruction.
And illusory redundancy doesn’t just occur in physical systems. During the financial crisis, products like auction rate securities, which once seemed safe and liquid because of the breadth of participants that traded them, froze as shocks simultaneously affected many financial institutions. Anyone relying on the redundancy of multiple participants quickly realized their folly.
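The arithmetic behind illusory redundancy is easy to sketch. The snippet below is only an illustration, and every number in it is made up: it assumes a primary and a backup system that each fail with one probability when a storm hits and with a much smaller one otherwise. It then compares the naive estimate (multiply the two individual failure rates, as if the systems were independent) with the actual chance that both fail once the shared cause is taken into account.

```python
def failure_probabilities(p_storm, p_fail_storm, p_fail_clear):
    """Return (naive, actual) probability that primary and backup both fail.

    Assumption for illustration: each system fails with probability
    p_fail_storm during a storm and p_fail_clear otherwise, and the two
    systems are independent once the weather is known.
    """
    # Overall failure rate of a single system, averaged over the weather.
    p_single = p_storm * p_fail_storm + (1 - p_storm) * p_fail_clear
    # Naive estimate: pretend the two systems share no common cause.
    naive = p_single ** 2
    # Actual: compute the joint failure within each weather state, then average.
    actual = p_storm * p_fail_storm ** 2 + (1 - p_storm) * p_fail_clear ** 2
    return naive, actual


# Hypothetical inputs: storms on 5% of days, 80% failure rate in a storm,
# 1% failure rate on a clear day.
naive, actual = failure_probabilities(
    p_storm=0.05, p_fail_storm=0.8, p_fail_clear=0.01
)
print(f"naive: {naive:.4f}, actual: {actual:.4f}")
```

With these invented inputs, the chance that both systems fail together is about thirteen times larger than the independence assumption suggests. The backup is real, but it buys far less protection than it appears to.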
I thought of these examples as I made my way off the train and onto the main level of Penn Station. It was so crowded with commuters that police had shut down the station and were preventing people from entering. Inside, I found a corner with a trickle of 4G internet and checked out the Uber situation – surge pricing had dropped to 1.2. I requested a pickup and raced up to street level to meet the driver.
Traffic was bad, and we didn’t get to the airport until 8:52. But I had one thing going for me – I knew that the same weather system that shut down the LIRR had also caused delays at JFK. I thought that I still had a chance.
After passing through security, I ran to my gate, and arrived just as the gate agent called my boarding group.
I was lucky – though I was surprised by illusory redundancy, unexpected correlations, in the form of flight delays, worked in my favor.
Knowing that New York can be virtually shut down by heavy rain, I might have headed straight to the airport to wait there at the first sign of rain. It certainly would have made for a less exciting afternoon.
NY Magazine has an interesting (fictional) story about a 2017 cyber attack. Almost all of the elements seem plausible: a mix of hardware-based and people-based attack vectors, failures of coordination, confusion, and the financial consequences.
Congratulations to Rotman School of Management graduate Anthony Harbour for being selected by Poets & Quants as one of their MBAs To Watch this year. Anthony was a student in my Catastrophic Failure in Organizations course as part of a great cohort in 2016 (thanks for the shoutout to the course, Anthony!). A Los Angeles native, Anthony came to Rotman with prior experience at the U.S. Securities and Exchange Commission and left a lasting mark on the Rotman School. You can read about his many great contributions to our community on his Poets & Quants profile. Congratulations, Anthony!
I seriously doubt that the CFTC can attract people with the coding skill necessary to track down errors in trading algorithms, or can devote the time necessary… for a truly effective review.
This is a great point. If the CFTC is so burdened with what’s on their regulatory plate already, how can they possibly add this? And how can the CFTC hope to compete with trading firms for the technical talent required to effectively review such code?
Second, and more substantively, reviewing individual trading algorithms in isolation is of limited value in determining their potentially disruptive effects…
This is because in complex systems, attempts to improve the safety of individual components of the system can actually increase the probability of system failure.
Pirrong is a scholar after our own hearts, and he hits on so many important points here. The theory of complex systems tells us that non-holistic safety mechanisms often make things worse.
For example, after the 2010 Flash Crash, the SEC implemented single-stock circuit breakers. Such measures seem like a good idea, and the circuit breakers often help minimize disruptions. But on August 24, 2015, these single-stock circuit breakers halted trading in 471 different ETFs and stocks. This in turn led to further dislocation as many key ETF liquidity providers simply stopped trading because they could no longer model the baskets of securities that underlie many ETFs.
Worse, if the intent is to prevent Knight-like fiascos, the CFTC should look elsewhere. Knight’s problem wasn’t even a coding error. Knight’s code worked—it was just deployed incorrectly. If that sounds like splitting hairs, that’s precisely the point. These systems are so complicated that code divorced from configuration files and deployment procedures is essentially meaningless.
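To make the point concrete, here is a minimal sketch of a check that looks at the deployment rather than the code. It is not a description of Knight’s actual tooling: the host names, build labels, and the `find_stale_hosts` helper are all hypothetical, and it assumes each server can report which build it is running.

```python
def find_stale_hosts(deployed_versions, expected_version):
    """Return the hosts that are not running the expected build.

    deployed_versions maps a host name to the build label it reports.
    Both the mapping and the labels are hypothetical illustrations.
    """
    return {
        host: version
        for host, version in deployed_versions.items()
        if version != expected_version
    }


# Hypothetical fleet of eight servers: seven updated, one left behind.
fleet = {f"server-{i}": "build-2" for i in range(1, 8)}
fleet["server-8"] = "build-1"

stale = find_stale_hosts(fleet, expected_version="build-2")
if stale:
    print(f"Do not enable trading; stale hosts: {stale}")
```

A review of the source code alone would pass both builds. Only a check against the running system catches the mismatch.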
I understand where the desire for a Reg AT-type solution comes from. The complexity of the financial markets is increasing, and we’ve seen over and over that regulators are struggling to get a handle on things. But if the CFTC really wants a window into the risk of automated trading, they should take a page from the Federal Aviation Administration’s playbook (as we’ve argued before). The FAA supports the airline industry’s quest for safety by cooperatively interfacing with airline-run Safety Management Systems. These systems specify a structure for reporting, discussing, and correcting errors, and for auditing those corrections, largely without the fear of regulatory reprisal.
The CFTC should drop the costly, draconian, ultimately counterproductive Reg AT proposal. Instead, they should consider “Reg SMS,” in which they work with the industry to set up standard error capture, discussion, and QA processes—modeled after the airlines’ Safety Management Systems—so we can all get a handle on this complexity together.
Just as there are best practices for coding, there are best practices for managing complexity. The CFTC needs to look for them.
T. Rowe Price: Invest with Confidence… but vote with skepticism?
When we think about complexity, we naturally think about systems that seem risky, like nuclear power, aviation, space flight, the power grid, or high frequency trading. But a recent, and costly, proxy voting mistake shows that even systems that seem really boring can have big consequences when they fail.
T. Rowe Price, the asset manager, announced this week that it was paying almost two hundred million dollars to clients for mishandling of a proxy vote related to the 2013 leveraged buyout of Dell Inc. At the time of the buyout, T. Rowe Price held the computer maker’s shares in a variety of its mutual funds and client accounts.
Even as T. Rowe Price actively opposed the buyout and advocated for a higher price for Dell shares, their proxy voting system mistakenly voted “for” the merger. Like almost all complexity-driven errors, this was a combination of human error (T. Rowe Price employees failed to check that the voting record matched what they expected), external factors (the shareholder vote was postponed several times, which overwrote the “Against” vote that T. Rowe Price recorded), and seemingly benign design decisions that have unintended consequences: in this case, that T. Rowe Price’s default vote for a management-supported merger was “For” the proposal.
On May 31, 2016, the court ruled that Dell’s fair value per share was $17.62 and not $13.75. Because of their mistaken vote for the merger, T. Rowe Price’s shareholders were denied the additional $3.87 per share.
In a press release, T. Rowe Price pointed out that the $3.87 difference in share value “[validated] the firm’s original investment thesis.” A validation that’s now resulting in a $200 million loss for the firm. Seems like a Pyrrhic victory.
The challenge for any firm with complex technology like this is that it’s hard to tell where such errors might be lurking. The vast majority of the time, T. Rowe Price’s system recorded the intended vote. The problem is that this mistake came with a large price tag. And more likely than not, the next costly and unexpected error (at T. Rowe Price or another firm) won’t have anything to do with proxy voting. Instead, it will be a mishandled options conversion, dividend election, or something outside of the corporate actions space entirely.
So how can firms protect themselves against the spectrum of possible errors? First, they should think of complexity itself as a risk factor. One explicit way to do this is to track near misses: instances where a vote or other corporate action was almost recorded incorrectly, but was caught. Sensitivity to near misses allows firms to correct deep and systematic errors before they become costly.
Second, recognize that organizational (and technical) boundaries can obscure what’s going on. In this case, interactions between T. Rowe Price’s fund managers and corporate actions group diffused responsibility. And their technology platform, integrated with an external processing agent, didn’t always tell the full picture. Boundaries like this create risk.
Finally, design systems defensively with the assumption that individuals are fallible. T. Rowe Price’s corporate action voting system had sensible defaults recorded for the majority of votes. And though there was a process to change the vote, the Dell leveraged buyout was a clear special case, especially as its multiple postponements required multiple votes. Just as happened at Knight Capital (where a technologist failed to roll out new code on all eight of Knight’s servers), humans struggle to accomplish tasks that require exceptional precision with little differentiation. Designing and using checklists can help, but only when supported by an organizational culture of dissent and healthy checks and balances.
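One defensive measure is an automated reconciliation that compares what the firm intended to vote with what the system actually holds, and that runs again whenever a meeting is rescheduled. The sketch below is purely illustrative and says nothing about how T. Rowe Price’s platform works: the meeting identifiers, vote labels, and the `reconcile_votes` helper are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoteMismatch:
    meeting_id: str
    intended: str
    recorded: str


def reconcile_votes(intended, recorded):
    """List every meeting whose recorded vote differs from the intended one.

    Both arguments map a (hypothetical) meeting identifier to a vote label.
    A meeting with an intended vote but no recorded vote shows up with the
    recorded value "MISSING".
    """
    mismatches = []
    for meeting_id, want in sorted(intended.items()):
        got = recorded.get(meeting_id, "MISSING")
        if got != want:
            mismatches.append(VoteMismatch(meeting_id, want, got))
    return mismatches


# Hypothetical data: after a postponement, one vote fell back to the default.
intended = {"dell-2013-buyout": "AGAINST", "acme-annual": "FOR"}
recorded = {"dell-2013-buyout": "FOR", "acme-annual": "FOR"}

for m in reconcile_votes(intended, recorded):
    print(f"{m.meeting_id}: intended {m.intended}, recorded {m.recorded}")
```

The check is deliberately dumb. It does not try to understand why a vote changed; it simply refuses to stay silent when the record and the intention disagree, which is exactly the kind of task that people find hard to do reliably by hand.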