Normal Accidents: Living with High-Risk Technologies

{{a|devil|}}This is one of those “books that will change your life”. Well, one that ''should'' change lives. That it was written in 1984 ({{author|Charles Perrow}} passed away in 2019) suggests that maybe it hasn’t: that the irrationalities that motivate so much of what we do are more pervasive than plainly written common sense.

{{author|Charles Perrow}} was a sociologist who fell into the discipline of [[systems analysis]]: analysing how social structures like businesses, governments and public utilities, being loose networks of autonomous individuals, work. Perrow’s focus fell upon organisations that present specific risks to operators, passengers and innocent bystanders (nuclear and other power stations, airways, shipping lines, though the read-across to financial systems is obvious) where a combination of [[complexity]] and [[tight coupling]] means that periodic catastrophic accidents are not just likely, but ''inevitable''. It is an intrinsic property of a complex, tightly coupled system, not merely a function of operator error that can be blamed on a negligent employee, that it will fail catastrophically.

If it is right, it has profound consequences for how those of us in complex, tightly coupled systems should think about risk. It seems inarguably right.

===Normal accidents===
Where you have a complex system, we should ''expect'' accidents — and opportunities, quirks and serendipities, but here we are talking about risk — to arise from unexpected, non-linear interactions. Such accidents, says Perrow, are “normal”, not in the sense of being regular or expected,<ref>In the forty-year operating history of nuclear power stations, there had (at the time of writing!) been ''no'' catastrophic meltdowns, “... but this constitutes only an “industrial infancy” for complicated, poorly understood transformation systems.” Perrow had a chilling prediction: “... the ingredients for such accidents are there, and unless we are very lucky, one or more will appear in the next decade and breach containment.” Ouch.</ref> but in the sense that it is an inherent property of the system to have this kind of accident.

Are financial systems [[complex]]? About as complex as any distributed system known to humankind. Are they tightly coupled? Well, you could ask the principals of [[LTCM]], [[Enron]], Bear Stearns, Amaranth Advisors, [[Lehman]] Brothers or Northern Rock, if any of those venerable institutions were still around to tell you about it. But yes. Might mortgage securitisations have been on Perrow’s mind?
:''New financial instruments such as derivatives and hedge funds and new techniques such as programmed trading further increase the complexity of interactions. Breaking up a loan on a home into tiny packages and selling them on a world-wide basis increases interdependency.''<ref>{{br|Normal Accidents}} p. 385. This in 1999, for Pete’s sake.</ref>

So, financial services [[risk controller]]s take note: if your system is a complex, tightly-coupled one (and it is) then ''you cannot solve for systemic failures''. You can’t prevent them. You have to have arrangements in place to ''deal'' with them when they happen. These arrangements need to be able to handle the unexpected outputs of a ''[[complex]]'' system, not just the predictable effects of a merely ''[[complicated]]'' one.

Why make the distinction between [[complex]] and [[complicated]] like this? Because pre-configured safety mechanisms (think [[risk taxonomy|risk taxonomies]], [[playbook]]s, [[checklist]]s, [[neural networks]], even ~ ''cough'' ~ [[contract|contractual rights]]) may help resolve isolated failures in ''complicated'' components, but they have ''no'' chance of resolving systems failures. ''They are more likely to get in the way''. They are ''of'' the system. They are ''part'' of what has failed. Not only that: these safety mechanisms, by their existence, ''add'' complexity to the system, and when a system failure happens they can make it ''harder'' to detect what has gone wrong.

===Inadvertent complexity===
So far, so hoopy; but here’s the rub: we can make systems and processes more or less complex and, to an extent, reduce [[tight coupling]] by careful system design and iterative improvement.<ref>Air transport has become progressively less complex as it has developed. It has learned from each accident.</ref> But it is axiomatic that we can’t eliminate complexity altogether.

Here is where the folly of [[complicated]] safety mechanisms comes in: adding linear safety systems to a system ''increases'' its complexity, and makes dealing with systems failures, when they occur, even harder. Not only do linear safety mechanisms exacerbate, or even create, their own accidents, but they also afford a degree of false comfort that encourages managers, who typically have financial targets to meet rather than safety ones, to run the system harder, thus increasing the tightness of the coupling between unrelated components. That same triple-A rating that lets your risk officer catch some zeds at the switch encourages your trader to double down. ''I’m covered. What could go wrong?''

Part of the voyeuristic pleasure of Perrow’s book is the salacious detail with which he documents the sequential failures at Three Mile Island, the Space Shuttle ''Challenger'', Air New Zealand’s Erebus flight, and other disasters and near misses. The chapter on maritime collisions would be positively hilarious were it not so distressing.

===“Operator error” is almost always the wrong answer===
:''But again, “operator error” is an easy classification to make. What really is at stake is an inherently dangerous working situation where production must keep moving and risk-taking is the price of continued employment.''<ref>{{br|Normal Accidents}} p. 249.</ref>
If an operator’s role is simply to carry out a tricky but routine part of the system, then the inevitable march of technology makes failure ever more a fault of design, not of personnel: humans, we know, are not good computers. They are good at figuring out what to do when something unexpected happens; making decisions; exercising judgment. But they (''we'') are ''lousy'' at doing repetitive tasks and following instructions. As ''The Six Million Dollar Man'' had it, ''we have the technology''. We should damn well use it.
If, on the other hand, the operator’s role is to manage ''complexity'', then technology, checklists and pre-packaged risk taxonomies will be of little use. Perrow’s account of the control room at Three Mile Island is instructive:
:''Besides, about this time—just four or five minutes into the accident—another more pressing problem arose. The reactor coolant pumps that had turned on started thumping and shaking. They could be heard and felt from far away in the control room. Would they withstand the violence they were exposed to? Or should they be shut off? A hasty conference was called, and they were shut off. (It could have been, perhaps should have been, a sign that there were further dangers ahead, since they were “cavitating”—not getting enough emergency coolant going through them to function properly.) In the control room there were three audible alarms sounding, and many of the 1,600 lights (on-off lights and rectangular displays with some code numbers and letters on them) were on or blinking. The operators did not turn off the main audible alarm because it would cancel some of the annunciator lights. The computer was beginning to run far behind schedule; in fact it took some hours before its message that something might be wrong with the PORV finally got its chance to be printed. Radiation alarms were coming on. The control room was filling with experts; later in the day there were about forty people there. The phones were ringing constantly, demanding information the operators did not have. Two hours and twenty minutes after the start of the accident, a new shift came on.''<ref>{{br|Normal Accidents}} p. 28.</ref>

This is, as Perrow sees it, the central dilemma of the complex system. The nature of normal accidents is such that they need experienced, wise operators on the ground ready to think quickly and laterally to solve unfolding problems, but the enormity of the risks