Normal Accidents: Living with High-Risk Technologies
{{a|systems|{{image|Erebus|gif|Air New Zealand Flight TE901}}}}{{bi}}{{quote|'''[[Accident]]''' /ˈaksɪd(ə)nt/ ''(n).'' <br>
An inevitable occurrence due to the action of immutable laws.
:— {{author|Ambrose Bierce}}, {{br|The Devil’s Dictionary}}}}


{{quote|Humans in general do not reason well (even experts can be found to make simple mistakes in probabilities and interpretation of evidence); heroic effort would be needed to educate the general public in the skills needed to decide the complex issues of risk.
:— {{author|Charles Perrow}}, ''Normal Accidents'', Chapter 9}}


This is one of those “books that will change your life”. Well — one that ''should'' change lives. That it was written in 1984 — {{author|Charles Perrow}} passed away in 2019 — and it isn’t on the bookshelf of every [[thought leader]] in the land suggests that maybe it hasn’t: that the irrationalities that motivate so much of what we do are more pervasive than plainly written common sense.
 
{{author|Charles Perrow}} was a sociologist who fell into the discipline of [[systems analysis]]: analysing how social structures like businesses, governments and public utilities, being loose networks of autonomous individuals, work. Perrow’s focus fell upon organisations that present specific risks to operators, passengers and innocent bystanders — nuclear and other power stations, airways, shipping lines: the read-across to the financial system is obvious — where a combination of what he termed '''[[complexity|complex interactions]]''' and '''[[tight coupling]]''' in distributed systems means that catastrophic accidents are not just likely but, from time to time, ''inevitable''. Such unpredictable failures are an intrinsic property of a complex, tightly coupled system, not merely a function of “operator error” that can be blamed on a negligent employee — although be assured, that is how management will be [[inclined]] to characterise it if given half a chance.
 
The classic case of such a tightly-coupled system is a nuclear power plant. Perrow was an accident investigator at the Three Mile Island incident. The early part of his book contains a fascinating blow-by-blow account of how TMI unfolded and how close it came to being catastrophically worse than it was.
 
Yet, while there were no fatalities, it is premature to conclude that the technology is therefore safe.
 
{{Quote|“Large nuclear plants of 1,000 or so megawatts have not been operating very long—only about thirty-five to forty years of operating experience exists, and that constitutes “industrial infancy” for complicated, poorly understood transformation systems.”}}
 
The unnerving practical conclusion that Perrow draws is that, for all the easy speeches<ref name="syed">[https://www.thetimes.co.uk/article/f8a262f8-4490-11ec-b414-b1f6389ab345 We are too emotional about risk — no wonder we make bad decisions] — Matthew Syed, ''The Sunday Times'', 14 November 2021.</ref> given about the relatively low risk of nuclear power compared with traditional fossil fuel-based energy generation, it is just far too early to draw meaningful conclusions about the tail risk of nuclear meltdown. It is like rolling a die six times and concluding that, because a six has not yet come up, one is not possible.
 
The potential for unanticipatable accidents that trigger unstoppable catastrophic chain reactions is incalculable, and the time horizon over which these accidents could occur or have effect is literally millennial. In traditional industries, by contrast, these risks are better understood and generally less prevalent.
 
To claim that the statistics we have suggest nuclear power is safe<ref name="syed"/> is to mistake an “absence of evidence” for “evidence of absence”.
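The die-rolling point can be put in numbers. Here is a back-of-envelope sketch of our own (the forty-year figure is from the passage above; the per-year meltdown probabilities are illustrative assumptions, not Perrow’s): even if the true annual risk per plant is material, a spotless early record is quite likely.

```python
# If catastrophic failure has a true per-plant-year probability p,
# the chance of seeing *zero* events in `years` of operation is
# (1 - p) ** years, which stays uncomfortably high when `years` is small.

def p_no_event(p: float, years: int) -> float:
    """Probability of observing no catastrophe in `years` of operation."""
    return (1 - p) ** years

for p in (0.01, 0.005, 0.001):  # illustrative annual probabilities
    print(f"p={p:.3f}: chance of a clean 40-year record = {p_no_event(p, 40):.0%}")
```

Even at an assumed p of 1% a year (one meltdown per plant per century) there is roughly a two-thirds chance of a clean forty-year record. The clean record is weak evidence of safety: absence of evidence, not evidence of absence.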
 
===Financial services relevance===
This site is mostly concerned with financial services and not nuclear energy, of course. You would think [[financial services]] meet exactly the conditions of [[non-linearity]] and [[tight coupling]] that Perrow describes.
 
If this is right, it has profound consequences for how we who inhabit [[complex]], [[tightly-coupled]] systems should think about risk. If you work in [[financial services]], you ''do'' inhabit a complex, tightly-coupled system, and it seems unarguably right.
 
Yet you don’t hear many people in [[financial services]] talking about how to handle [[normal accidents]]. Instead you hear a lot about [[technological unemployment]] and how [[chatbot]]s are going to put us all out of work. Hmmm.
 
===[[Complex interaction]]s and [[tight coupling]]===
First, some definitions.  
*'''[[Complex interaction]]s''': Perrow anticipates the later use of the concept of “[[complexity]]” — a topic which is beginning to infuse the advocacy part of this site — without the benefit of [[systems analysis]], which hadn’t really been invented when he was writing, using it to describe interactions between non-adjacent sub-components of a system that were neither intended nor anticipated by the designers of the system. Complex interactions are not only unexpected, but for a period of time (which may be critical, if the interacting components are [[tightly coupled]]) will be ''incomprehensible''. This may be because the interactions cannot be seen, buried under second-order control and safety systems, or even because they are not ''believed''. If your — ''wrong'' — theory of the game is that the risk in question is a [[ten sigma event]], you know, expected only once in one hundred million years, you may have a hard time believing it could be happening in your fourth year of operation, as surviving partners of [[Long Term Capital Management]] may tell you. Here even [[epistemology]] is in play. Interactions that are not in our basic conceptualisation of the world are not ones we can reasonably anticipate. These interactions cannot be ''designed'' into the system; no one ''intends'' them. “They baffle us because we acted in terms of our own designs of a world that we expected to exist—but the world was different.”<ref>{{br|Normal Accidents}}, p. 75. Princeton University Press. Kindle Edition.</ref>
*'''[[Linear interaction]]s''': Contrast [[complex interaction]]s with the much more common “[[linear interaction]]s”, where parts of the system interact with other components that precede or follow them in the system in ways that are expected and planned: “if ''this'', then ''that''”. In a well-designed system, these will (of course) predominate: in normal operation, any decent system should do what it is designed to do and not act erratically. Some systems are more complex than others, but even the most linear systems are susceptible to some complexity: where they interact with the (intrinsically [[complex]]) environment.<ref>Perrow characterises a “complex system” as one where more than ten percent of interactions are complex, and a “linear system” as one where less than one percent of interactions are. The greater the percentage of complex interactions in a system, the greater the potential for system accidents.</ref> Cutting back into the language of [[systems analysis]] for a moment, consider that [[linear interaction]]s are a ''feature'' of [[simple]] and [[complicated system]]s, and can be “pre-solved” and brute-force computed, at least in theory. They can be managed by [[algorithm]], or [[playbook]]. But [[complex interactions]], by definition, ''cannot'' — they are the interactions the [[algorithm]] ''didn’t expect''.
*'''[[Tight coupling]]''': [[Complex interactions]] are only a source of catastrophe if another condition is satisfied: that unexpectedly-interacting components of the [[complex system]] are “tightly coupled” — processes happen fast, can’t be turned off, failing components can’t be isolated. Perrow’s observation is that complex systems tend to be more tightly coupled than we realise, and we usually only find out the hard way.
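The playbook point above can be caricatured in a few lines of code. This is a toy sketch of our own, not anything from the book: the event names and scripted responses are invented for illustration. A playbook is a lookup table over anticipated states; a complex interaction is, by definition, a key the table’s designers never wrote.

```python
# A playbook handles linear, anticipated interactions: "if this, then that".
PLAYBOOK = {
    "coolant_pump_trip": "start backup pump",
    "margin_call": "post collateral from liquidity buffer",
}

def respond(event: str) -> str:
    """Scripted response if the event was anticipated; escalation if not."""
    try:
        return PLAYBOOK[event]
    except KeyError:
        # A complex interaction: the algorithm has no entry for it. At this
        # point the system needs a human exercising judgment, not a longer
        # lookup table.
        return "ESCALATE: unanticipated interaction, no scripted response"

print(respond("margin_call"))                   # linear: scripted answer
print(respond("pump_trip_during_margin_call"))  # complex: off the map
```

Note the design point: adding more keys to the dictionary never changes the second outcome, because the defining feature of a complex interaction is that nobody thought to write the key.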
 
===Normal accidents===
Where you have a complex system, you should therefore ''expect'' accidents — yes, and opportunities, quirks and serendipities, to be sure, but here we are talking about risk — to arise from unexpected, [[non-linear interaction]]s. Such accidents, says Perrow, are “normal”, not in the sense of being regular or expected, but in the sense that ''it is an inherent property of the system to have this kind of accident at some point or other''.<ref>In the forty-year operating history of nuclear power stations, there had (at the time of writing!) been ''no'' catastrophic meltdowns, “... but this constitutes only an “industrial infancy” for complicated, poorly understood transformation systems.” In 1984, Perrow had a chilling prediction:
 
{{quote|“... the ingredients for such accidents are there, and unless we are very lucky, one or more will appear in the next decade and breach containment.”}}
Ouch.</ref>
 
Is a financial system [[complex]]? About as complex as any distributed system known to humankind. Is it tightly coupled? Well, you could ask the principals of [[LTCM]], [[Enron]], Bear Stearns, Amaranth Advisors, [[Lehman]] Brothers or Northern Rock, if any of those venerable institutions were still around to tell you about it. But yes. Might reckless mortgage securitisation, excess [[leverage]] and flash boys have been on Perrow’s mind? We rather think so:
 
{{quote|“New financial instruments such as [[Financial weapons of mass destruction|derivatives]] and [[hedge fund]]s and new techniques such as programmed trading further increase the complexity of interactions. ''Breaking up a loan on a home into tiny packages and selling them on a world-wide basis increases interdependency.''”<ref>{{br|Normal Accidents}} p. 385.</ref>}}
 
He wrote this in ''1999'', for Pete’s sake.
 
===How to deal with [[system accidents]]===
So, [[financial services]] [[risk controller]]s take note: if your system is a [[complex]], [[tightly-coupled]] system — and it is — ''you cannot solve for systemic failures''. You can’t prevent them. You have to have arrangements in place to ''deal'' with them. These arrangements need to be able to deal with the unexpected interactions of components in a ''[[complex]]'' system, not the predictable effects of a merely ''[[complicated]]'' one.
 
Why make the distinction between [[complex]] and [[complicated]] like this? Because the financial services industry is in the swoon of automated, pre-configured safety mechanisms — think [[chatbot]]s, [[risk taxonomy|risk taxonomies]], [[playbook]]s, [[checklist]]s, [[neural networks]], even ~ ''cough'' ~ [[contract|contractual rights]] — and while these may help resolve isolated and expected failures in ''complicated'' components, they have ''no'' chance of resolving systems failures, which, by definition, will confound them. Instead, these safety mechanisms ''will get in the way''. They are ''of'' the system. They are ''part'' of what has failed. Not only that: safety mechanisms, by their existence, ''add'' [[complexity]] to the system — they create their own unexpected interactions — and when a system failure happens they can make it ''harder'' to detect what is going on, much less how to stop it.
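The claim that safety mechanisms ''add'' complexity has a simple combinatorial flavour. This is a back-of-envelope illustration of our own, not Perrow’s arithmetic, and the component counts are made up: every component added to a system of n components creates n new potential pairwise interactions, so the pool of candidate unexpected interactions grows quadratically.

```python
def potential_interactions(n: int) -> int:
    """Pairwise interactions among n components: n choose 2."""
    return n * (n - 1) // 2

base = potential_interactions(50)    # a system of 50 components
padded = potential_interactions(60)  # the same system after bolting on
                                     # 10 "safety" components

print(base, padded)  # 1225 vs 1770: ~45% more places for surprises
```

A 20% increase in component count yields roughly a 45% increase in possible pairwise interactions, almost all of which nobody will ever examine. That is the arithmetic behind safety kit creating its own accidents.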
 
===When Kramer hears about this ...===
[[File:Shit fan.jpg|400px|thumb|right|Kramer hearing about this, yesterday.]]
So far, so hoopy; but here’s the rub: we can make our systems less complex and ''reduce'' [[tight coupling]] by careful design, functional redundancy and iterative improvement — [[air crash|air transport has become progressively safer]] as it has developed: it has learned from each accident — but, as long as it is a complex system with the scope for complex interaction, ''we cannot eliminate [[system accident]]s altogether''. They are, as coders like to joke, a feature, not a bug.
 
Furthermore, in our efforts to pre-solve for catastrophe, we tend ''not'' to simplify, but to complicate: we add prepackaged “risk mitigation” components: [[Policy|policies]], [[taxonomy|taxonomies]], [[key performance indicator]]s, [[tick-boxes]], [[dialog box]]es, [[bloatware]] processes, rules, and [[Chatbot|new-fangled bits of kit]] to the process ''in the name of programmatic risk management''.
 
These might give the [[middle management]] layer comfort; they can set their [[RAG status]]es green, and it may justify their planned evisceration of that cohort of troublesome [[subject matter expert]]s who tend to foul up the mechanics of the [[Heath Robinson machine]] — but who will turn out to be just the people you wish you hadn’t fired {{shitfan}}.
 
Here is the folly of elaborate, [[complicated]] safety mechanisms: adding components to any complex system ''increases'' its complexity. That, in itself, makes dealing with [[system accident]]s, when they occur, ''harder''. The safety mechanisms beloved of the [[middle management]] layer derive from experience. They secure stables from which horses have bolted. They are, as {{author|Jason Fried}} elegantly put it,
 
{{quote|“organisational scar tissue. Codified responses to situations that are unlikely to happen again.”<ref>{{br|Rework}}, {{author|Jason Fried}}</ref>}}
They are, in a word, ''[[Linear interaction|linear]]'' responses to what will be, when it happens, by definition a ''[[Non-linear interaction|non-linear]]'' problem.
 
Not only do linear safety mechanisms exacerbate or even create their own accidents, but they also afford a degree of false comfort that encourages managers — who typically have financial targets to meet, not safety ones — to run the system harder, thus increasing the tightness of the coupling between unrelated components. That same Triple-A [[Ratings notches|rating]] that lets your risk officer catch some zeds at the switch encourages your trader to double down. ''I’m covered. What could go wrong?''
 
Perrow documents the sequential failures at Three Mile Island, the Space Shuttle ''Challenger'' and Air New Zealand’s Erebus crash, among many other disasters and near-misses, in salacious detail. The chapter on maritime collisions would be positively hilarious were it not so distressing.
 
===“Operator error” is almost always the wrong answer===
Human beings being system components, it is rash to blame them when they are a component that is constitutionally disposed to fail — we are frail, mortal, inconstant, narratising beings — even when not put in a position, through system design or economic incentive, that makes failure inevitable. A ship’s captain who is expected to work a 48-hour watch and meet unrealistic deadlines is hardly positioned, let alone incentivised, to prioritise safety. Perrow calls these “forced operator errors”: “But again, “operator error” is an easy classification to make. What really is at stake is an inherently dangerous working situation where production must keep moving and risk-taking is the price of continued employment.”<ref>{{br|Normal Accidents}} p. 249.</ref>
 
If an operator’s role is simply to carry out a tricky but routine part of the system then the march of technology makes this ever more a fault of design and not personnel: humans, we know, are not good computers. They are good at figuring out what to do when something unexpected happens; making decisions; exercising judgment. But they — ''we'' — are ''lousy'' at doing repetitive tasks and following instructions. As ''The Six Million Dollar Man'' had it, ''we have the technology''. We should damn well use it.
 
If, on the other hand, the operator’s role is to manage ''[[complexity]]'' — then technology, checklists and pre-packaged risk taxonomies can only take you so far and, at the limit, can get in the way. Perrow’s account of the control deck at Three Mile Island, as reactor coolant pumps began cavitating, thumping and shaking, is instructive:
 
{{quote|“In the control room there were three audible alarms sounding, and many of the 1,600 lights (on-off lights and rectangular displays with some code numbers and letters on them) were on or blinking. The operators did not turn off the main audible alarm because it would cancel some of the annunciator lights. The computer was beginning to run far behind schedule; in fact it took some hours before its message that something might be wrong with the PORV finally got its chance to be printed. Radiation alarms were coming on. The control room was filling with experts; later in the day there were about forty people there. The phones were ringing constantly, demanding information the operators did not have. Two hours and twenty minutes after the start of the accident, a new shift came on.” <ref>{{br|Normal Accidents}} p. 28.</ref> }}
 
This is, as Perrow sees it, the central dilemma of the [[complex system]]. The nature of [[normal accidents]] is such that they need experienced, wise operators on the ground ready to think quickly and laterally to solve unfolding problems, but the enormity of the risks involved means that central management are not prepared to delegate so much responsibility to the mortal, inconstant, narratising [[meatware]].
 
=== How best to manage? ===
The optimal means of managing differs depending on the type of risk.
 
 
For non-linear, tightly coupled systems, like banks, this presents a control paradox: complex systems demand decentralised control and local, on-the-ground expertise, to react quickly and wisely to unexpected events; tightly-coupled systems that are susceptible to chain reactions require centralised management to control the event quickly at any point in the organisation.

{| class="wikitable"
|+Suitability of centralisation or local control to management of different systems
!
!
!Linear
!Complex
{{aligntop}}
| rowspan="2" |Tight
|Examples
|Dams, power grids, rail transport, marine transport
|Nuclear power plants, DNA, chemical plants, aircraft, space missions, BANKS
{{aligntop}}
|Control method
|'''Centralisation''': best to deal with chain reactions, and best to deal with visible, expected linear reactions
|'''Centralisation''': best to deal with chain reactions once they happen.
'''Local control''': best to deal with non-linear reactions and unexpected events as they happen.
{{aligntop}}
| rowspan="2" |Loose
|Examples
|Manufacturing, single-purpose agencies
|Mining, research and development, multi-purpose agencies, universities
{{aligntop}}
|Control method
|'''Centralisation or local control''': few complex interactions; component failures create predictable results, and can be managed centrally.
|'''Local control''': allows indigenous solutions where there is little risk of unstoppable chain reactions, and is best to deal with non-linear reactions and unexpected events as they happen.
|}

===What is to be done===
Dumb operators aren’t the problem, but neither are those perennial culprits: technology, capitalism and greed.

Technology generally doesn’t ''create'' system accidents so much as fail to stop them and, at the limit, make them harder to foresee and deal with. And there is no imperative — beyond those of scale and economy, which are both very human imperatives to cut corners to profitability — that forces technology upon us. We choose it. We can complain about Twitter all we like, but — yeah.<ref>Twitter isn’t, of course, a technology company. It’s a publisher.</ref>

And while capitalism does generate externalities, unreasonably concentrate economic power, and reward those who have wealth out of all proportion to their contribution, a “capitalist” economy is no worse at this than a socialist one. (Perrow was writing in 1984, when the distinction between “capitalist” and “socialist” economies was a good deal starker, and the social democratic third way had not really made itself felt. It is a curious irony that we ''feel'' ever more polarised now, whilst our political economies are far more homogenised. Even China, the last socialist standing, is closer to the centre than it was.)
 
And nor is greed — perhaps the thread that connects the capitalist entrepreneur to the socialist autocrat (let’s face it: it connects ''everyone'') — any more causative. Or, if it is, it is baked into the human soul, so can’t really be solved for.
 
Perrow thought it better to look at the by-product of these three modes as the problem in itself: ''[[Externality|externalities]]'': the social costs of the activity that are not reflected in its price, and are borne by those who do not benefit from the activity. When the externality is powered by a tightly-coupled, non-linear system it can be out of all proportion to the bounties conferred on beneficiaries of that system — who are often a different class of individuals altogether. The Union Carbide accident at Bhopal is a great example: few of the half-million casualties would have bought an Eveready battery, let alone been Union Carbide shareholders, and only 1,000 were employees.
 
This led Perrow to frame his approach to the problem by reference to “catastrophic potential”, which may be ''inherent'': the activity is so tightly-coupled and non-linear that no amount of reorganisation can prevent occasional system accidents; or ''actual'': preventable shortcomings in design, equipment, procedures, operators, supplies and materials, or environment, whose component failures could be catastrophic — these being things one can, in theory, defend against, where inherent catastrophic potential is not — weighed in each case against the cost of alternative solutions to the same problem.
 
This leads to three categories of system: those one should tolerate but seek to improve (mining, chemicals, dams, airways); those one should restrict (marine transport and DNA); and those one should abandon altogether, the benefits, however great, being out of all proportion to their downside risk. Here he includes nuclear weapons — no surprise — but also nuclear power.
 
This is a long review already, so I should stop here. This is a fantastic book. It is somewhat hard to get hold of — there’s no audio version, alas — but it is well worth the effort of trying.
 
{{sa}}
*[[Complexity]]
{{ref}}
{{Book Club Wednesday|6/1/21}}

Latest revision as of 16:34, 5 November 2024

The JC’s amateur guide to systems theory
Air New Zealand Flight TE901
Index: Click to expand:
Tell me more
Sign up for our newsletter — or just get in touch: for ½ a weekly 🍺 you get to consult JC. Ask about it here.

Accident /ˈaksɪd(ə)nt/ (n).

An inevitable occurrence due to the action of immutable laws.

Ambrose Bierce, The Devil’s Dictionary

Humans in general do not reason well (even experts can be found to make simple mistakes in probabilities and interpretation of evidence); heroic effort would be needed to educate the general public in the skills needed to decide the complex issues of risk.

Charles Perrow, Normal Accidents, Chapter 9

This is one of those “books that will change your life”. Well — that should change lives — that it was written in 1984 — Charles Perrow passed away in 2019 — and it isn’t on the bookshelf of every thought leader in the land suggests that, maybe it hasn’t: that the irrationalities that motivate so much of what we do are more pervasive than plainly written common sense.

Charles Perrow was a sociologist who fell into the discipline of systems analysis: analysing how social structures like businesses, governments and public utilities, being loose networks of autonomous individuals, work. Perrow’s focus fell upon organisations that present specific risks to operators, passengers, innocent bystanders — nuclear and other power stations, airways, shipping lines: the read-across to the financial systems is obvious — where a combination of what he termed complex interactions and tight coupling in distributed systems mean that catastrophic accidents are not just likely but, from time to time, inevitable. Such unpredictable failures are an intrinsic property of a complex, tightly coupled system, not merely a function of “operator error” that can be blamed on a negligent employee — although be assured, that is how management will be inclined to characterise it if given half a chance.

The classic case of such a tightly-coupled system is a nuclear power plant. Perrow was an accident investigator at the Three Mile Island incident. The early part of his book contains a fascinating blow-by-blow account of how TMI unfolded and how close it came to being catastrophically worse than it was.

Yet, while there were no fatalities, it is premature to conclude that the technology is therefore safe.

“Large nuclear plants of 1,000 or so megawatts have not been operating very long—only about thirty-five to forty years of operating experience exists, and that constitutes “industrial infancy” for complicated, poorly understood transformation systems.”

The unnerving practical conclusion that Perrow draws is that, for all the easy speeches[1] given about the relative low risk of nuclear power compared with traditional fossil fuel-based energy generation, it is just far too early to draw meaningful conclusions about the tail risk of nuclear meltdown. It is like rolling a die six times, and concluding that, because a six has not yet come up, one is not possible.

The potential for unanticipatable accidents that trigger unstoppable catastrophic chain reactions is incalculable, and the time horizon over which these accidents could occur or have effect is literally millennial. Which traditional industries these risks are better understood and generally less prevalent.

To claim that the statistics we have suggest nuclear power is is safe[1] is to mistake an “absence of evidence” for “evidence of absence”.

Financial services relevance

This site is mostly concerned with financial services and not nuclear energy, of course. You would think financial services meet exactly the conditions of non-linearity and tight coupling this that Perrow describes.

If this is right, it has profound consequences for how we who inhabit complex, tightly-coupled systems, should think about risk. If you work in financial services, you do inhabit a complex, tightly-coupled system, and it seems unarguably right.

Yet you don’t hear many people in financial services talking about how to handle normal accidents. Instead you hear a lot about technological unemployment and how chatbots are going to put as all out of work. Hmmm.

Complex interactions and tight coupling

First, some definitions.

  • Complex interactions: Perrow anticipates the later use of the concept of “complexity” — a topic which is beginning to infuse the advocacy part of this site — without the benefit of systems analysis, since it hadn’t really been invented when he was writing, but to describe interactions between non-adjacent sub-components of a system that were neither intended nor anticipated by the designers of the system. Complex interactions are not only unexpected, but for a period of time (which may be critical, if the interacting components are tightly coupled) will be incomprehensible. This may be because the interactions cannot be seen, buried under second-order control and safety systems, or even because they are not believed. If your — wrong — theory of the game is that the risk in question is a ten sigma event, you know, expected only once in one hundred million years, you may have a hard time believing it could be happening in your fourth year of operation, as surviving partners of Long Term Capital Management may tell you. Here even epistemology is in play. Interactions that are not in our basic conceptualisation the world, are not ones we can reasonably anticipate. These interactions cannot be designed into the system; no one intends them. “They baffle us because we acted in terms of our own designs of a world that we expected to exist—but the world was different.”[2]
  • Linear interactions: Contrast complex interactions with much more common “linear interactions”, where parts of the system interact with other components that precede or follow them in the system in ways that are expected and planned: “if this, then that”. In a well-designed system, these will (of course) predominate: in normal operation, any decent system should do what it is designed to do and not act erratically. Some systems are more complex than others, but even in the most linear systems are susceptible to some complexity: where they interact with the (intrinsically complex) environment.[3] Cutting back into the language of systems analysis for a moment, consider that linear interactions are a feature of simple and complicated systems, and can be “pre-solved” and brute-force computed; at least in theory. They can be managed by algorithm, or playbook. But complex interactions, by definition, cannot — they are the interactions the algorithm didn’t expect.
  • Tight coupling: Complex interactions are only a source of catastrophe if another condition is satisfied: that unexpectedly-interacting components of the complex system are “tightly coupled” — processes happen fast, can’t be turned off, failing components can’t be isolated. Perrow’s observation is that complex systems tend to be more tightly coupled than we realise, and we usually only find out the hard way.

Normal accidents

Where you have a complex system, you should therefore expect accidents — yes, and opportunities, quirks and serendipities, to be sure, but here we are talking about risk — to arise from unexpected, non-linear interactions. Such accidents, says Perrow, are “normal”, not in the sense of being regular or expected, but in the sense that it is an inherent property of the system to have this kind of accident at some point or other.[4]

Is a financial system complex? About as complex as any distributed system known to humankind. Is it tightly coupled? Well, you could ask the principals of LTCM, Enron, Bear Stearns, Amaranth Advisors, Lehman brothers or Northern Rock, if any of those venerable institutions were still around to tell you about it. But yes. Might reckless mortgage securitisation, excess leverage and flash boys have been on Perrow’s mind? We rather think so:

“New financial instruments such as derivatives and hedge funds and new techniques such as programmed trading further increase the complexity of interactions. Breaking up a loan on a home into tiny packages and selling them on a world-wide basis increases interdependency.”[5]

He wrote this in 1999, for Pete’s sake.

How to deal with system accidents

So, financial services risk controllers take note: if your system is a complex, tightly-coupled system — and it is — you cannot solve for systemic failures. You can’t prevent them. You have to have arrangements in place to deal with them. These arrangements need to be able to deal with the unexpected interactions of components in a complex system, not the predictable effects of a merely complicated one.

Why make the distinction between complex and complicated like this? Because the financial services industry is in a swoon over automated, pre-configured safety mechanisms — think chatbots, risk taxonomies, playbooks, checklists, neural networks, even ~ cough ~ contractual rights — and while these may help resolve isolated and expected failures in complicated components, they have no chance of resolving systems failures, which, by definition, will confound them. Instead, these safety mechanisms will get in the way. They are of the system. They are part of what has failed. Not only that: safety mechanisms, by their very existence, add complexity to the system — they create their own unexpected interactions — and when a system failure happens they can make it harder to detect what is going on, much less how to stop it.


So far, so hoopy; but here’s the rub: we can make our systems less complex and reduce tight coupling by careful design, functional redundancy and iterative improvement — air transport has become progressively safer as it has developed: it has learned from each accident — but, as long as it is a complex system with the scope for complex interaction, we cannot eliminate system accidents altogether. They are, as coders like to joke, a feature, not a bug.

Furthermore, in our efforts to pre-solve for catastrophe, we tend not to simplify, but to complicate: we add prepackaged “risk mitigation” components: policies, taxonomies, key performance indicators, tick-boxes, dialog boxes, bloatware processes, rules, and new-fangled bits of kit to the process in the name of programmatic risk management.

These might give the middle management layer comfort; they can set their RAG statuses green, and it may justify their planned evisceration of that cohort of troublesome subject matter experts who tend to foul up the mechanics of the Heath Robinson machine — but who will turn out to be just the people you wish you hadn’t fired when the ordure hits the fan.

Here is the folly of elaborate, complicated safety mechanisms: adding components to any complex system increases its complexity. That, in itself, makes dealing with system accidents, when they occur, harder. The safety mechanisms beloved of the middle management layer derive from experience. They secure stables from which horses have bolted. They are, as Jason Fried elegantly put it,

“organisational scar tissue. Codified responses to situations that are unlikely to happen again.”[6]

They are, in a word, linear responses to what will be, when it happens, by definition a non-linear problem.
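Why does adding components make things worse rather than better? Some back-of-envelope arithmetic (ours, not Perrow’s) makes the intuition vivid: the number of potential pairwise interactions among a system’s components grows quadratically, so each bolt-on safety mechanism adds not one new thing to reason about, but a whole fan of new interaction channels.

```python
def pairwise_interactions(n: int) -> int:
    """Potential pairwise interactions among n components: n choose 2.
    Each added component can, at least in principle, interact with
    everything already present."""
    return n * (n - 1) // 2

# Illustrative numbers only: a 20-component process gains five
# "safety" components.
before = pairwise_interactions(20)  # 190 potential interactions
after = pairwise_interactions(25)   # 300 potential interactions
# Five safety mechanisms have opened 110 new interaction channels,
# any one of which may be the unexpected, non-linear one.
```

The arithmetic is crude — most pairs never interact — but it shows the direction of travel: complexity compounds, and the safety apparatus is not exempt.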

Not only do linear safety mechanisms exacerbate or even create their own accidents, but they also afford a degree of false comfort that encourages managers — who typically have financial targets to meet, not safety ones — to run the system harder, thus increasing the tightness of the coupling between unrelated components. That same Triple-A rating that lets your risk officer catch some zeds at the switch encourages your trader to double down. I’m covered. What could go wrong?

Perrow documents the sequential failures at Three Mile Island, the Space Shuttle Challenger and Air New Zealand’s Erebus crash, among many other disasters and near-misses, in salacious detail. The chapter on maritime collisions would be positively hilarious were it not so distressing.

“Operator error” is almost always the wrong answer

Human beings being system components, it is rash to blame them when they are a component that is constitutionally disposed to fail — we are frail, mortal, inconstant, narratising beings — even when not put in a position, through system design or economic incentive, that makes failure inevitable. A ship’s captain who is expected to work a 48-hour watch and meet unrealistic deadlines is hardly positioned, let alone incentivised, to prioritise safety. Perrow calls these “forced operator errors”: “But again, ‘operator error’ is an easy classification to make. What really is at stake is an inherently dangerous working situation where production must keep moving and risk-taking is the price of continued employment.”[7]

If an operator’s role is simply to carry out a tricky but routine part of the system then the march of technology makes this ever more a fault of design and not personnel: humans, we know, are not good computers. They are good at figuring out what to do when something unexpected happens; making decisions; exercising judgment. But they — we — are lousy at doing repetitive tasks and following instructions. As The Six Million Dollar Man had it, we have the technology. We should damn well use it.

If, on the other hand, the operator’s role is to manage complexity — then technology, checklists and pre-packaged risk taxonomies can only take you so far and, at the limit, can get in the way. Perrow’s account of the control deck at Three Mile Island, as reactor coolant pumps began cavitating, thumping and shaking, is instructive:

“In the control room there were three audible alarms sounding, and many of the 1,600 lights (on-off lights and rectangular displays with some code numbers and letters on them) were on or blinking. The operators did not turn off the main audible alarm because it would cancel some of the annunciator lights. The computer was beginning to run far behind schedule; in fact it took some hours before its message that something might be wrong with the PORV finally got its chance to be printed. Radiation alarms were coming on. The control room was filling with experts; later in the day there were about forty people there. The phones were ringing constantly, demanding information the operators did not have. Two hours and twenty minutes after the start of the accident, a new shift came on.” [8]

This is, as Perrow sees it, the central dilemma of the complex system. The nature of normal accidents is such that they need experienced, wise operators on the ground ready to think quickly and laterally to solve unfolding problems, but the enormity of the risks involved means that central management is not prepared to delegate so much responsibility to the mortal, inconstant, narratising meatware.

How best to manage?

The optimal means of managing differs depending on the type of risk.


For non-linear, tightly coupled systems, like banks, this presents a control paradox: complex systems demand decentralised control and local, on-the-ground expertise, to react quickly and wisely to unexpected events; tightly-coupled systems that are susceptible to chain reactions require centralised management to control the event quickly at any point in the organisation.

What is to be done

Dumb operators aren’t the problem, but neither are those perennial culprits: technology, capitalism and greed.

Technology generally doesn’t create system accidents so much as fail to stop them and, at the limit, make them harder to foresee and deal with. And there is no imperative that forces technology upon us, beyond those of scale and economy (which are both very human imperatives to cut corners on the road to profitability). We choose it. We can complain about Twitter all we like, but — yeah.[9]

And while capitalism does generate externalities, unreasonably concentrate economic power, and reward those who have wealth out of all proportion to their contribution, a “capitalist” economy is no worse at this than a “socialist” one. (Perrow was writing in 1984, when the distinction between “capitalist” and “socialist” economies was a good deal starker, and the social democratic third way had not really made itself felt. It is a curious irony that we feel ever more polarised now, while our political economies are far more homogenised. Even China, the last socialist standing, is closer to the centre than it was.)

Suitability of centralisation or local control to the management of different systems:

  • Tight coupling, linear interactions (dams, power grids, rail transport, marine transport): centralisation — best to deal with chain reactions, and with visible, expected linear reactions.
  • Tight coupling, complex interactions (nuclear power plants, DNA, chemical plants, aircraft, space missions, BANKS): centralisation — best to deal with chain reactions once they happen — but also local control — best to deal with non-linear reactions and unexpected events as they happen.
  • Loose coupling, linear interactions (manufacturing, single-purpose agencies): centralisation or local control — few complex interactions; component failures create predictable results and can be managed centrally.
  • Loose coupling, complex interactions (mining, research and development, multi-purpose agencies, universities): local control — allows indigenous solutions where there is little risk of unstoppable chain reactions, and is best to deal with non-linear reactions and unexpected events as they happen.
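Perrow’s grid can be sketched as a small decision function. The mapping follows his table; the code itself, and the function name, are our own illustration.

```python
def recommended_control(complex_interactions: bool, tight_coupling: bool) -> str:
    """Perrow's 2x2: which control style suits a system, given its
    interaction type and coupling? (Our paraphrase of his table.)"""
    if not complex_interactions:
        # Linear systems: centralisation copes with chain reactions
        # (tight) and with predictable component failures (loose).
        return "centralisation" if tight_coupling else "centralisation or local control"
    if tight_coupling:
        # Chain reactions want centralisation; unexpected interactions
        # want local expertise. You need both at once: the paradox.
        return "centralisation AND local control (the paradox)"
    # Loosely coupled, complex: let locals improvise.
    return "local control"

# Banks and nuclear plants land in the paradoxical quadrant:
print(recommended_control(complex_interactions=True, tight_coupling=True))
```

Note the tightly-coupled, complex quadrant is the only one where the function cannot return a single answer — which is exactly the control paradox described above.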

Nor is greed — perhaps the thread that connects the capitalist entrepreneur to the socialist autocrat (let’s face it: it connects everyone) — any more causative; or, if it is, it is baked into the human soul, so can’t really be solved for.

Perrow thought it better to treat the by-product of these three modes as the problem in itself: externalities — the social costs of an activity that are not reflected in its price, and are borne by those who do not benefit from it. When the externality is powered by a tightly-coupled, non-linear system it can be out of all proportion to the bounties conferred on the system’s beneficiaries, who are often a different class of individuals altogether. The Union Carbide accident at Bhopal is a good example: few of the half-million casualties would ever have bought an Eveready battery, let alone held Union Carbide shares, and only 1,000 of them were employees.

This led Perrow to frame his approach by reference to “catastrophic potential”, which comes in two flavours. Inherent catastrophic potential: the activity is by nature so tightly-coupled and non-linear that no amount of reorganisation can prevent occasional system accidents. Actual catastrophic potential: preventable shortcomings in design, equipment, procedures, operators, supplies and materials, or environment — or component failures in the system — could have catastrophic consequences. The latter one can, in theory, defend against; the former one cannot. Each is then to be weighed against the cost of alternative means of solving the same problem.

This leads to three categories of system: those one should tolerate but seek to improve (mining, chemicals, dams, airways); those one should restrict (marine transport and DNA); and those one should abandon altogether, their benefits, however great, being out of all proportion to their downside risk. Here he includes nuclear weapons — no surprise — but also nuclear power.

This is a long review already, so I should stop here. This is a fantastic book. It is somewhat hard to get hold of — there’s no audio version, alas — but it is well worth the effort of trying.

See also

References

  1. We are too emotional about risk — no wonder we make bad decisions — Matthew Syed, The Sunday Times, 14 November 2021.
  2. Normal Accidents, p. 75. Princeton University Press. Kindle Edition.
  3. Perrow characterises a “complex system” as one where more than ten percent of interactions are complex; and a “linear system” where less than one percent of interactions are. The greater the percentage of complex interactions in a system, the greater the potential for system accidents.
  4. In the forty-year operating history of nuclear power stations, there had (at the time of writing!) been no catastrophic meltdowns, “... but this constitutes only an “industrial infancy” for complicated, poorly understood transformation systems.” In 1984, Perrow had a chilling prediction:

    “... the ingredients for such accidents are there, and unless we are very lucky, one or more will appear in the next decade and breach containment.”

    Ouch.

  5. Normal Accidents p. 385.
  6. Jason Fried, Rework.
  7. Normal Accidents p. 249.
  8. Normal Accidents p. 28.
  9. Twitter isn’t, of course, a technology company. It’s a publisher.