Illusory Redundancy Strikes Again

What my trip to the airport teaches us about why catastrophe happens.

I noticed the feeling of rain in the air as soon as I stepped out of my final afternoon meeting on this trip to New York. Since moving to Seattle, I have missed the powerful, dramatic summer thunderstorms that the East Coast produces, and I suspected that one was in store.

I had a few hours until I had to be at JFK, but it was too early to have dinner, so I decided to retreat to a cafe in nearby Flatiron for a respite from the heat and humidity.

A few minutes later, the deluge arrived. The rain hammered the street as well-prepared pedestrians hoisted umbrellas. It soaked the unprepared.

I was hoping to have dinner at a restaurant that was a fifteen-minute walk away. Burdened by a rolling suitcase, and lacking an umbrella or rain jacket, I decided to check on Uber.

When I first looked, the surge multiplier was around 2.0, but in minutes the rain and rush-hour commute caused it to skyrocket to nearly four (with long wait times to boot). I decided to simplify and grab dinner around the corner, scurrying from awning to awning to avoid a soaking.

After a quick dinner, I checked in on Uber again. Still a surge multiplier of 4.0, which made the ride from Manhattan to JFK around $400, more than I wanted to spend.

But I’ve lived in New York before, so I decided to use my go-to backup option: the Long Island Rail Road and AirTrain. The rain let up a little, and I made my way to Penn Station, a few-block walk and a quick subway ride away. Commuters packed the station, some milling about waiting for more information about delays on their lines. But the train to Jamaica still seemed OK, so I purchased a ticket, rushed down to the track, and jumped on the 6:31 train with only a moment to spare. If everything went well, I’d arrive in plenty of time for my 8:45 flight.

But everything wasn’t going well. Instead of speeding out of Penn Station, the train just sat.

A few minutes later, the conductor announced, “I have just heard from the Station Master: we’re being held momentarily in the station.” Commuters groaned collectively—a momentary delay on a day like today was unlikely.

Their suspicions were well founded. In a few more minutes, the conductor announced that “due to weather-related signal problems near Jamaica, this train line is being suspended until further notice.”

There I sat, a victim of illusory redundancy: a backup system that is vulnerable to the same disruption as the system it’s meant to protect. Illusory redundancy is all about unexpected correlations. In this case, the rain both drove up demand for Uber and caused signal failures on the LIRR.

Illusory redundancies lie at the core of many meltdowns. During Hurricane Sandy, for example, the storm surge destroyed a key substation and blacked out a large chunk of Manhattan. At the same time, the surge flooded NYU’s state-of-the-art hospital, located a few blocks uptown, disabling their backup generator and forcing the evacuation of critically ill patients.

Before the launch of the Space Shuttle Challenger, engineers at NASA assumed that having two O-rings on the potentially problematic solid rocket boosters provided sufficient protection: even though cold temperatures might affect one important seal, they argued, the second would provide a backup. Yet when the primary O-ring failed during the stress of the launch, hot gases from the booster quickly eroded the backup, causing it to fail as well and leading to Challenger’s destruction.

And illusory redundancy doesn’t just occur in physical systems. During the financial crisis, products like auction rate securities, which once seemed safe and liquid because of the breadth of participants that traded them, froze as shocks simultaneously affected many financial institutions. Anyone relying on the redundancy of multiple participants quickly realized their folly.
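To make the arithmetic behind illusory redundancy concrete, here is a minimal sketch in Python. The probabilities are invented for illustration and are not estimates for any of the systems described above, and the function names are hypothetical. It compares the chance that a primary system and its backup both fail when their failures are treated as independent with the chance when a shared trigger (heavy rain, say) raises the failure rate of both.

```python
def marginal_failure(p_trigger: float, p_fail_given_trigger: float,
                     p_fail_otherwise: float) -> float:
    """Overall chance that one system fails, averaged over the trigger."""
    return p_trigger * p_fail_given_trigger + (1 - p_trigger) * p_fail_otherwise


def joint_failure_if_independent(p_primary: float, p_backup: float) -> float:
    """Chance both fail if we (wrongly) assume the failures are unrelated."""
    return p_primary * p_backup


def joint_failure_common_cause(p_trigger: float, p_fail_given_trigger: float,
                               p_fail_otherwise: float) -> float:
    """Chance both fail when a shared trigger raises both failure rates.

    Assumes both systems have the same failure rates and that, once we know
    whether the trigger occurred, they fail independently of each other.
    """
    with_trigger = p_trigger * p_fail_given_trigger ** 2
    without_trigger = (1 - p_trigger) * p_fail_otherwise ** 2
    return with_trigger + without_trigger


if __name__ == "__main__":
    # Illustrative numbers only: a storm on 10% of days, each system
    # failing half the time in a storm and 1% of the time otherwise.
    p_storm, p_in_storm, p_normal = 0.10, 0.50, 0.01

    p_single = marginal_failure(p_storm, p_in_storm, p_normal)
    naive = joint_failure_if_independent(p_single, p_single)
    actual = joint_failure_common_cause(p_storm, p_in_storm, p_normal)

    print(f"single system fails:        {p_single:.4f}")
    print(f"both fail (assumed indep.): {naive:.4f}")
    print(f"both fail (shared trigger): {actual:.4f}")
```

With these made-up inputs, the backup looks several times more protective on paper than it is, because nearly all of the joint failures happen on the same stormy days.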

I thought of these examples as I made my way off the train and onto the main level of Penn Station. It was so crowded with commuters that police had shut down the station and were preventing people from entering. Inside, I found a corner with a trickle of 4G internet and checked out the Uber situation – surge pricing had dropped to 1.2. I requested a pickup and raced up to street level to meet the driver.

Traffic was bad, and we didn’t get to the airport until 8:52. But I had one thing going for me – I knew that the same weather system that shut down the LIRR had also caused delays at JFK. I thought that I still had a chance.

The incoming flight. Note the racetrack holding pattern in the top left.

After passing through security, I ran to my gate, and arrived just as the gate agent called my boarding group.

I was lucky – though I was surprised by illusory redundancy, unexpected correlations, in the form of flight delays, worked in my favor.

Knowing that New York can be virtually shut down by heavy rain, I might have headed straight to the airport to wait there at the first sign of rain. It certainly would have made for a less exciting afternoon.

The Big Hack – NY Magazine

NY Magazine has an interesting (fictional) story about a 2017 cyber attack. Almost all of the elements seem plausible: a mix of hardware-based and people-based attack vectors, failures of coordination, confusion, and the financial consequences.

It’s worth a read.


Reg AT – Don’t Go There

Craig Pirrong, the Streetwise Professor, recently wrote about his skeptical take on the CFTC’s desire to examine the source code of trading algorithms. The proposed Regulation AT (Automated Trading) has many issues, and Pirrong calls out two of them:

I seriously doubt that the CFTC can attract people with the coding skill necessary to track down errors in trading algorithms, or can devote the time necessary… for a truly effective review.

This is a great point. If the CFTC is so burdened with what’s on their regulatory plate already, how can they possibly add this? And how can the CFTC hope to compete with trading firms for the technical talent required to effectively review such code?

Second, and more substantively, reviewing individual trading algorithms in isolation is of limited value in determining their potentially disruptive effects…

This is because in complex systems, attempts to improve the safety of individual components of the system can actually increase the probability of system failure.

Pirrong is a scholar after our own hearts, and he hits on so many important points here. The theory of complex systems tells us that non-holistic safety mechanisms often make things worse.

For example, after the 2010 Flash Crash, the SEC implemented single-stock circuit breakers. Such measures seem like a good idea, and the circuit breakers often help minimize disruptions. But on August 24, 2015, these single-stock circuit breakers halted trading in 471 different ETFs and stocks. This in turn led to further dislocation as many key ETF liquidity providers simply stopped trading because they could no longer model the baskets of securities that underlie many ETFs.

Worse, if the intent is to prevent Knight-like fiascos, the CFTC should look elsewhere. Knight’s problem wasn’t even a coding error. Knight’s code worked—it was just deployed incorrectly. If that sounds like splitting hairs, that’s precisely the point. These systems are so complicated that code divorced from configuration files and deployment procedures is essentially meaningless.

I understand where the desire for a Reg AT-type solution comes from. The complexity of the financial markets is increasing, and we’ve seen over and over that regulators are struggling to get a handle on things. But if the CFTC really wants a window into the risk of automated trading, they should take a page from the Federal Aviation Administration’s playbook (as we’ve argued before). The FAA supports the airline industry’s quest for safety by cooperatively interfacing with airline-run Safety Management Systems. These systems specify a structure for reporting, discussing, and correcting errors, and for auditing those corrections, largely without the fear of regulatory reprisal.

The CFTC should drop the costly, draconian, ultimately counterproductive Reg AT proposal. Instead, they should consider “Reg SMS,” in which they work with the industry to set up standard error capture, discussion, and QA processes—modeled after the airlines’ Safety Management Systems—so we can all get a handle on this complexity together.

Just as there are best practices for coding, there are best practices for managing complexity. The CFTC needs to look for them.

Complexity Strikes T. Rowe Price

T. Rowe Price: Invest with Confidence… but vote with skepticism?

When we think about complexity, we naturally think about systems that seem risky, like nuclear power, aviation, space flight, the power grid, or high-frequency trading. But a recent, and costly, proxy voting mistake shows that even systems that seem really boring can have big consequences when they fail.

T. Rowe Price, the asset manager, announced this week that it was paying almost two hundred million dollars to clients for mishandling of a proxy vote related to the 2013 leveraged buyout of Dell Inc. At the time of the buyout, T. Rowe Price held the computer maker’s shares in a variety of its mutual funds and client accounts.

Even as T. Rowe Price actively opposed the buyout and advocated for a higher price for Dell shares, their proxy voting system mistakenly voted “for” the merger. Like almost all complexity-driven errors, this was a combination of human error (T. Rowe Price employees failed to check that the voting record matched what they expected), external factors (the shareholder vote was postponed several times, which overwrote the “Against” vote that T. Rowe Price recorded), and seemingly benign design decisions that have unintended consequences (in this case, that T. Rowe Price’s default vote for a management-supported merger was “For” the proposal).

On May 31, 2016, the court ruled that Dell’s fair value per share was $17.62 and not $13.75. Because of their mistaken vote for the merger, T. Rowe Price’s shareholders were denied the additional $3.87 per share.
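The per-share arithmetic is easy to check directly. In the small sketch below, the two prices come from the ruling described above, while the share count passed to `forgone_value` is a hypothetical placeholder (this post doesn't state how many shares were affected), so any total it produces is purely illustrative.

```python
from decimal import Decimal

FAIR_VALUE = Decimal("17.62")      # per-share fair value set by the court
REJECTED_VALUE = Decimal("13.75")  # per-share figure the court rejected

# Difference the shareholders missed out on, per share.
PER_SHARE_GAP = FAIR_VALUE - REJECTED_VALUE


def forgone_value(shares: int) -> Decimal:
    """Total forgone value for a given (hypothetical) number of shares."""
    return PER_SHARE_GAP * shares


if __name__ == "__main__":
    print(f"gap per share: ${PER_SHARE_GAP}")
    # 1,000,000 shares is an arbitrary illustration, not the actual holding.
    print(f"per million shares: ${forgone_value(1_000_000):,.2f}")
```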

In a press release, T. Rowe Price pointed out that the $3.87 difference in share value “[validated] the firm’s original investment thesis.” A validation that’s now resulting in a $200 million loss for the firm. Seems like a Pyrrhic victory.

The challenge for any firm with complex technology like this is that it’s hard to tell where such errors might be lurking. The vast majority of the time, T. Rowe Price’s system recorded the intended vote. The problem is that this mistake came with a large price tag. And more likely than not, the next costly and unexpected error (at T. Rowe Price or another firm) won’t have anything to do with proxy voting. Instead, it will be a mishandled options conversion, dividend election, or something outside of the corporate actions space entirely.

So how can firms manage to protect themselves against the spectrum of possible errors? First, they should think of complexity itself as a risk factor. One way that this could have been explicitly considered is by noting near misses, instances where a vote or other corporate action was almost recorded incorrectly, but was caught. Sensitivity to near misses allows firms to correct deep and systematic errors before they become costly.

Second, recognize that organizational (and technical) boundaries can obscure what’s going on. In this case, interactions between T. Rowe Price’s fund managers and corporate actions group diffused responsibility. And their technology platform, integrated with an external processing agent, didn’t always tell the full picture. Boundaries like this create risk.

Finally, design systems defensively with the assumption that individuals are fallible. T. Rowe Price’s corporate action voting system had sensible defaults recorded for the majority of votes. And though there was a process to change the vote, the Dell leveraged buyout was a clear special case, especially as its multiple postponements required multiple votes. Just as happened at Knight Capital (where a technologist failed to roll out new code on all eight of Knight’s servers), humans struggle to accomplish tasks that require exceptional precision with little differentiation. Designing and using checklists can help, but only when supported by an organizational culture of dissent and healthy checks and balances.
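As a rough illustration of what designing defensively might look like, the sketch below models a ballot whose recorded vote silently reverts to a default when a meeting is rescheduled, along with a reconciliation check that compares the intended vote with the one actually on record. Every name here (`Ballot`, `instruct`, `reschedule`, `reconcile`) is hypothetical, and the reset-on-reschedule behavior is a simplified assumption based on the description above, not a depiction of T. Rowe Price’s actual system.

```python
from dataclasses import dataclass
from typing import List, Optional

DEFAULT_VOTE = "FOR"  # assumed default for management-supported proposals


@dataclass
class Ballot:
    proposal: str
    intended_vote: Optional[str] = None  # what the decision-makers want
    recorded_vote: str = DEFAULT_VOTE    # what would actually be submitted

    def instruct(self, vote: str) -> None:
        """Record an explicit instruction that overrides the default."""
        self.intended_vote = vote
        self.recorded_vote = vote

    def reschedule(self) -> None:
        """Simplified assumption: a new meeting date resets the recorded
        vote to the default and drops the earlier override."""
        self.recorded_vote = DEFAULT_VOTE


def reconcile(ballots: List[Ballot]) -> List[str]:
    """Return proposals whose recorded vote differs from the intended one."""
    return [
        b.proposal
        for b in ballots
        if b.intended_vote is not None and b.recorded_vote != b.intended_vote
    ]


if __name__ == "__main__":
    ballot = Ballot("Example buyout")
    ballot.instruct("AGAINST")
    print(reconcile([ballot]))  # nothing to flag yet

    ballot.reschedule()         # the meeting is postponed
    print(reconcile([ballot]))  # the mismatch is now flagged
```

The design choice worth noting is that the check runs against the stored intent rather than relying on someone remembering to re-enter the vote; it would need to run after every postponement, not just once when the instruction is first entered.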

Hat tip to Steve Lofchie at The Cadwalader Cabinet for the story.

The Bracken Bower Prize

To say that we entered the Bracken Bower prize on a whim wouldn’t be quite fair. The original details about the contest came from Ken McGuffin, the Media Relations Manager for the Rotman School of Management, who helped us through the process of writing an Op Ed for the Guardian on the continued risks of deepwater drilling. Ken recommended that we look into the prize. We agreed that it seemed interesting, but we started out by putting it on the back burner, where it sat for several months.

Chaos via Control

Regulations, Enforcement Create Risk in the Complex and Rigid Markets

The Flash Crash should have been the impetus for serious reconsideration of the structure of our national markets. But in the five years since its occurrence, assumptions have not been re-examined. The market’s complex and interconnected structure, which regulations mandate, increases the likelihood of destabilizing failures. Despite extensive study, and a recent indictment, regulators have not fully grasped the lessons of the Crash: a simpler set of rules would result in a market more resistant to explosions of volatility.


Courting Catastrophic Failure

Will Managers Ever Learn?

From BP’s Deepwater Horizon disaster, to deadly component failures at Toyota and GM, to technological meltdowns at major stock market participants like Knight Capital (now KCG Holdings), Goldman Sachs, and NASDAQ, catastrophic failures have devastating effects on the environment, businesses, and customers. And the potential for failures of this kind is growing as both the complexity of our systems and the probability of extreme “trigger” events increase.
