System redundancy
“I think the people in this country have had enough of experts from organisations with acronyms saying that they know what is best and getting it consistently wrong.”
- —Michael Gove
The JC likes his pet management theories, as you know, readers, and none is dearer to his heart than the idea that the high-modernists have, for forty years, held western management orthodoxy hostage.
The modernist programme is as simple to state as it is self-serving: a distributed organisation is best controlled centrally, and from the place with the best view of the big picture: the top. All relevant information can be articulated as data — you know: “In God we trust, all others must bring data” — and, with enough data, everything about the organisation’s present can be known and its future extrapolated.
Even though, inevitably, one has less than perfect information, extrapolations, mathematical derivations and algorithmic pattern matches from a large but finite data set will have better predictive value than the gut feel of “ineffable expertise”: the status we have historically assigned to experienced experts is grounded in folk psychology, lacks analytical rigour and, when compared with sufficiently granular data, cannot be borne out. This is the lesson of Moneyball: The Art of Winning an Unfair Game. Just as Wall Street data crunchers can have no clue about baseball and still outperform veteran talent scouts, so can data models and analysts who know nothing about the technical details of, say, the law outperform humans who do when optimising business systems. Thus, from a network of programmed but uncomprehending rule-followers, a smooth, steady and stable business revenue stream emerges.
Since the world overflows with data, we can programmatise business. Optimisation is a mathematical problem to be solved. It is a knowable unknown. To the extent we fail, we can put it down to not enough data or computing power.
Since data quantity and computing horsepower have exploded in the last few decades, the high-modernists have grown ever surer that their time — the Singularity — is nigh. Before long, everything will be solved.
But, a curious dissonance: these modernising techniques arrive and flourish, while traditional modes of working requiring skill, craftsmanship and tact are outsourced, computerised, right-sized and AI-enhanced — and yet the end product gets no less cumbersome, no faster, no leaner, and no less risky. There may be fewer subject matter experts around, but there seem to be more software-as-a-service providers, MBAs, COOs, workstream leads and itinerant school-leavers in call-centres on the outskirts of Brașov.
The pioneer of this kind of modernism was Frederick Winslow Taylor, progenitor of the maximally efficient production line. His inheritors say things like “the singularity is near” and “software will eat the world”, but for all their millenarianism the on-the-ground experience at the business end of all this world-eating software is as grim as it ever was.
We have a theory that this “data reductionism” — reducing everything to quantisable inputs and outputs — amounts to a kind of reductionism about time: just as radical rationalists see all knowledge as reducible to, and explicable in terms of, its infinitesimally small sub-atomic essence, so the data modernists see it as explicable in terms of infinitesimally small windows of time.
This is partly because computer languages don’t do tense: they are coded in the present, and have no frame of reference for continuity.[1] And it is partly because having to cope with history, the passage of time, and the continued existence of objects makes things exponentially more complex than they already are. An atomically thin snapshot of the world as data is already enough of a beast to be well beyond the operating parameters of even the most powerful quantum machines: that level of detail extending into the future and back from the past is, literally, infinitely more complicated. The modernist programme supposes that “time” is really just composed of billions of infinitesimally thin, static slices, each functionally identical to any other, so that by measuring the delta between them we have a means of handling that complexity.
That it does not have a hope of working seems beside the point.
In any case, just-in-time rationalisers take a single cycle and code for that: what is the process, start to finish? What are the dependencies? What are the plausible unknowns? And how do we optimise for efficiency of movement, components and materials?
It’s the long run, stupid
The usual approach for system optimisation is to take a snapshot of the process as it is over its lifecycle, and map that against a hypothetical critical path. Kinks and duplications in the process are usually obvious, and we can iron them out to reconfigure the system to be as efficient and responsive as possible. Mapping best-case and worst-case scenarios for each phase in that life cycle can give good insights into which parts of the process are in need of re-engineering: it is often not the ones we expect.
But how long should that life cycle be? We should judge it by the frequency of the worst possible negative event that could happen. Given that we are contemplating the infinite future, this is hard to say, but it is longer than we think: not just a single manufacturing cycle or reporting period. The efficiency of a process must take in all parts of the cycle — the whole gamut of the four seasons — not just that nice day in July when all seems fabulous with the world. There will be other days; difficult ones, on which multiple unrelated components fail at the same moment, or the market drops, clients blow up, or tastes gradually change. There will be almost imperceptible, secular changes in the market which will demand products be refreshed, replaced, updated, reconfigured; opportunities and challenges will arise which must be met: your window for measuring who and what is truly redundant in your organisation must be long enough to capture all of those slow-burning, infrequent things.
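By way of illustration only (the phases, durations and figures below are invented), a sketch of that sort of best-case and worst-case mapping might look like this:

    # A toy life-cycle map: for each phase, a best-case and worst-case duration
    # in days. All phases and figures are invented for illustration.
    phases = {
        "negotiate": (2, 30),
        "onboard":   (1, 5),
        "document":  (3, 10),
        "settle":    (1, 2),
        "remediate": (0, 90),   # the slow-burning, infrequent part of the cycle
    }

    # Best-case and worst-case paths, assuming the phases run in sequence.
    best = sum(lo for lo, hi in phases.values())
    worst = sum(hi for lo, hi in phases.values())
    print(f"best case: {best} days; worst case: {worst} days")

    # The phase most in need of re-engineering is often not the slowest one on a
    # good day, but the one contributing the most to the worst case.
    bottleneck = max(phases, key=lambda p: phases[p][1] - phases[p][0])
    print(f"widest best-to-worst spread: {bottleneck}")

The point of the toy is only this: the phase with the widest best-to-worst spread, not the slowest one on a nice day in July, is the one that deserves the re-engineering budget.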
Take our old, now dearly departed, friends at Credit Suisse. Like all banks, over the last decade they were heavily focused on the cost of their prime brokerage operation. Prime brokerage is a simple enough business, but it’s also easy to lose your shirt doing it.
In peacetime, things looked easy for Credit Suisse, so they juniorised their risk teams. This, no doubt, marginally improved their net peacetime return on their relationship with Archegos. But those wage savings — even if they ran to $10m annually — were out of all proportion to the incremental risk they assumed as a result.
(We are, of course, assuming that better human risk management might have averted that loss. If it would not have, the firm should not have been in business at all.)
The skills and operations you need for those difficult phases are different, and more expensive, but likely far more determinative of the success of your organisation over the long run.
There is a Simpson’s paradox effect here: over a short period the efficiency curve may seem to go one way; over a longer period it may run perpendicular to it.
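A toy illustration of the sort of reversal we mean, with invented numbers: every quarterly snapshot shows improvement, yet the long run points the other way.

    # Measured "efficiency" rises within each quarter, so every short snapshot
    # looks like progress; but each quarter starts from a lower base, so the
    # long-run trend runs the other way. All figures invented.
    quarters = [
        [100, 102, 104, 106],   # Q1
        [95, 97, 99, 101],      # Q2
        [90, 92, 94, 96],       # Q3
    ]

    for i, q in enumerate(quarters, start=1):
        print(f"Q{i} trend: {q[-1] - q[0]:+d}")        # +6 in every quarter

    whole_run = [x for q in quarters for x in q]
    print(f"whole-run trend: {whole_run[-1] - whole_run[0]:+d}")   # -4 overall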
The perils, therefore, of data: it is necessarily a snapshot, and in our impatient times we imagine time horizons that are far too short. A sensible time horizon should be determined not by reference to your expected regular income, but to your worst possible day. Take our old friend Archegos: it hardly matters that you can earn $20m from a client in a year, consistently, every year for twenty years if you stand to lose five billion dollars in the twenty-first.
Then your time horizon for redundancy is not one year, or twenty years, but two hundred and fifty years. A quarter of a millennium: that is how long it would take to earn back $5 billion in twenty-million-dollar clips.
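For the avoidance of doubt, the arithmetic, using the approximate Archegos figures above:

    # How long it takes to earn back one bad day in $20m annual clips.
    annual_revenue = 20_000_000       # $20m a year from the client
    one_bad_day = 5_000_000_000       # the $5bn loss
    print(one_bad_day / annual_revenue)   # 250.0 years: a quarter of a millennium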
Tight coupling
Redundancy is another word for “slack”, in the sense of looseness in the tether between interconnected parts of a wider whole.
For optimum normal operation, one should minimise slack, thereby generating maximum responsiveness, handling, cornering: what musicians would call “attack” — the greatest torque, the most direct transmission of power to the road, the minimum latency.
The tighter we couple inputs to outputs, the faster the response. But the less margin there is for variation.
And, as Charles Perrow notes,[2] this in-the-moment flow state, when the machine is humming, is only a stable state in tightly constrained environments, where every outcome can be predicted and monitored, and sub-optimal ones avoided by rote.
But, generally, these are not very interesting environments. They are production lines. Factory shop floors: every element of the process is within your gift and under your control.
That is because just-in-time systems have the lowest tolerance for failure. The more efficient a system is, the more “single points of failure”, any one of which can bring the whole system to a halt, there will be. Any component misbehaviour can trigger a chain reaction leading to catastrophe.
The very lack of “give” that makes a sports car turn on a dime on a dry track makes it skid off a wet one. The tighter the coupling, the less time one has to diagnose a failure and fix it, or shut the system down, before catastrophic damage results.
Conversely, a system with built-in back-ups and redundancies can go on working while we repair failed components. A certain amount of “stockpiling” in the system allows production to continue should there be outages or supply chain problems anywhere in the process.
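A back-of-the-envelope sketch of why that is, with an illustrative (that is, invented) component reliability: in a tightly-coupled chain every component is a single point of failure; give each component a spare and the picture changes entirely.

    # Probability the system runs today if each of n components independently
    # works with probability p. The figures are illustrative only.
    p, n = 0.99, 50

    # Tightly coupled: every component is a single point of failure.
    series = p ** n

    # One spare per component: a stage stops only if both copies fail.
    with_spares = (1 - (1 - p) ** 2) ** n

    print(f"tightly coupled: {series:.1%} uptime")       # roughly 60%
    print(f"with redundancy: {with_spares:.1%} uptime")   # roughly 99.5%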
Even a production line environment is not perfectly stable, of course. It is — or should be — in a constant state of improvement — “kaizen” in the Toyota Production System — whereby engineers adjust for evolving demand, react to market developments, and capitalise on new technology and knowhow.
This is “meta-production”: a “background processing” function — important and valuable, but not day-to-day “urgent” — with which “redundant” personnel can be occupied, and from which they can redeploy immediately should a crisis arise.
This has two benefits. Firstly, the process of “peacetime” self-analysis should in part be aimed at identifying emerging risks and design flaws in the system, thus heading off incipient crises. Secondly, to do that, the personnel need expertise: an intimate, detailed, holistic understanding of the process and the system. Understanding the system that intimately, these second-line workers should be better able to react to a crisis should one arise.
This behaviour rewards long-term “skin in the game”. The best employees here are long-serving, local, full-time employees, full of institutional knowledge and practical, hands-on systems knowhow. Inexperienced outsourced labour, of the sort by which these traditional experts are being systematically replaced, will be far less use in either role.
To be sure, the importance of employees, and the value they add, is not constant. We all have flat days on which we don’t achieve very much. In an operationalised workplace they pick up a penny a day on 99 days out of 100; if on that hundredth day they save the firm a fortune, it is worth paying them two pennies a day, every day, even if, 99 days out of 100, you are making a loss on them.
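On an expected-value view the arithmetic looks something like this. The crisis-day saving below is a hypothetical figure; the point holds for any saving that dwarfs the daily wage.

    # Expected daily value of the "redundant" expert. All figures are invented;
    # the crisis-day saving in particular is hypothetical.
    daily_wage = 0.02           # two pennies a day
    ordinary_day = 0.01         # the penny picked up on 99 days out of 100
    crisis_saving = 100.00      # hypothetical: what is saved on the hundredth day

    expected_value = 0.99 * ordinary_day + 0.01 * crisis_saving
    print(f"expected value per day: {expected_value:.4f}; wage: {daily_wage}")
    # A loss on 99 days out of 100, and a bargain over the whole cycle.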
Fragility
Another thing about a lean system is that it is fragile. Super-fragile, in fact, as it has multiple critical failure points, any one of which will, in the best case, stop the whole system from functioning and, in a tightly-coupled complex system, may itself trigger further component failures or a chain reaction.
A modern market is undoubtedly a complex system. It is boundaryless, comprised of the interactions of an indeterminate number of autonomous actors, and many of those “actors” — notably corporations — are themselves complex systems comprising indeterminate numbers of autonomous actors. The question becomes: how loosely connected are the parts? How much slack is there? Increasingly, none at all.
When the JC started in practice, the standard means of conveying written communication was the mail. Facsimile was the innovation: while exponentially quicker than the mail, the fax was still manual and bounded by analogue processes, such that the end product, like the starting product, was indelibly embedded in a physical substrate. You couldn’t forward a fax,[3] much less cut and paste from one.
Chain reactions were slow, and somewhat containable. The bank run that did for Northern Rock took days to unfold: it required media propagation, and account holders had to stop whatever they were doing and physically go to their local branch, a queue had to form, that had to be reported, and so on. In the same time, the liquidity crunch that ruined Silicon Valley Bank had also finished off Credit Suisse, an unrelated bank on a different continent that had nothing to do with Silicon Valley Bank at all.
So, yes, financial services are tightly coupled. The increasing complexity of their interconnectedness means that coupling grows ever tighter.
Redundancy, in this environment, is a virtue. The regulators fixate on one kind of redundancy — regulatory capital — but we should look at others: redundancy of skill, experience and expertise.
Just as we cannot expect to hold capital only when we need it, nor should we expect to run expertise on a fair-weather shoestring. You can’t buy in institutional knowledge in a time of crisis. (You can’t buy institutional knowledge at all.) Even uncontextualised expertise will, at a time of panic, command outrageous premiums.
We should consider holding regulatory human capital.
Unlike tier 1 capital, human capital does not just sit there costing money. These are people you can use as systems design and process experts: to analyse systems, root out anachronisms, and build parallel state-of-the-art IT systems to which legacy infrastructure can be migrated. This is jidoka — automation with a human touch. This is creative, rewarding work that builds institutional knowledge.
We run a gamut from superfragility, where component failure triggers system meltdown (these are Charles Perrow’s “system accidents”), through normal fragility, where component failure causes system disruption; to normal robustness, where there is enough redundancy in the system that it can withstand outages and component failures, though components will continue to fail in predictable ways; and finally to antifragility, where the redundancy itself is able to respond to component failures and secular challenges, and redesigns the system in light of experience to reduce the risk of known failures.
The difference between robustness and antifragility here is the quality of the redundant components. If your redundancy strategy is to have lots of excess stock, lots of spare components and an inexhaustible supply of itinerant, enthusiastic but inexpert school-leavers from Bucharest, then your machine will be robust and functional: it will be able to keep operating as long as macro conditions persist, but it will not learn, it will not develop, and it will not adapt to changing circumstances.
An antifragile system requires both kinds of redundancy: plant and stock, to keep the machine going, and tools and knowhow, to tweak the machine. Experience, expertise and insight. The same things — though they are expensive — that can head off catastrophic events can also apprehend and capitalise upon outsized business opportunities. ChatGPT will not help with that.
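A crude sketch of the distinction, on invented assumptions: the “robust” system replaces failed components but never gets any better, while the “antifragile” one redesigns itself after each failure, so known failures become rarer over time.

    import random

    def failures_over(years, failure_rate, learns):
        """Count failures over a run of years. If the system 'learns', each
        failure prompts a redesign that trims the rate of that kind of failure."""
        count = 0
        for _ in range(years):
            if random.random() < failure_rate:
                count += 1
                if learns:
                    failure_rate *= 0.8   # redesign in light of experience
        return count

    random.seed(1)
    print("robust:     ", failures_over(200, 0.10, learns=False))
    random.seed(1)
    print("antifragile:", failures_over(200, 0.10, learns=True))

Both machines keep running; only one of them has fewer accidents in its second century than in its first.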
Redundancy as a key to successful change management
Damon Centola’s research on the concentration and bunching of constituents to ensure that change is permanent.
- ↑ I have put some thoughts together on that here. I should say this is all my own work and may therefore be nonsense.
- ↑ In one of the JC’s favourite books, Normal Accidents: Living with High-Risk Technologies.
- ↑ Not, at least, without time, manual intervention and further loss of fidelity.