Thursday, April 25, 2024

Much ado about nothing?

I keep thinking that I'm done writing about Boeing. Is there really more that needs to be said at this point? And aren't there other Quality topics out there? But it's like Michael Corleone says: "Just when I thought I was out, they pulled me back in."

This time the trigger was a reader who sent me a recent article about the rash of airline safety incidents that have been making the news this spring. The article, by Kelsey Piper, appeared in Vox under the title "Are there really more things going wrong on airplanes?" Piper argues that, while the reporting of airline safety incidents is way up this spring, the actual numbers are consistent with last year, and the years before that. She summarizes the last 15 years of US commercial aviation as having "a safety record of about one or two passenger fatalities per light-year traveled."

Piper never says it outright, and at the end of the article she even hints briefly that there might be a story behind the scenes. But as soon as she frames her statistics in terms of "fatalities per light-year traveled," the message to the casual reader is certainly that all this focus on airline safety is much ado about nothing. 

I always look forward to conversation with my readers, but in this case I think my reply must have been a little too abbreviated or dismissive. In any event my reader doubled down, suggesting that I look up several years of statistics from the National Transportation Safety Board (NTSB) so that I could determine independently whether we are seeing more problems with new Boeing aircraft than we saw in the past, even if the increase is being drowned out by other statistical noise.

As an aside, it's nice to know my readers have such confidence in me. But I'm not going to do that, because it is the wrong question and it hides the right one. The real story is about the changes Boeing has made in their safety management system, and about the failures of their configuration and documentation systems. These stories are critical, because they describe the root causes of failures that may not even have happened yet. By contrast, counting up how many planes have lost tires while landing, or how many harmless engine fires have broken out, is a distraction. And it misses the point.

The critical fact that distinguishes these two topics is that it is perfectly possible to make a safe airplane with no Quality system (or Safety system) whatsoever! Therefore a story about Boeing hobbling their Quality system is in principle a different story from one about their current safety statistics.

Let me explain. Of course it's not likely that a team could make a safe aircraft without suitable systems in place. But if it just so happened that by random chance the team did all the right things to make the airplane safe … well then, it would be safe. The odds are against it, to be sure, but it's not impossible.

More realistically, when Boeing management removed this or that inspection step, doing so did not automatically mean that every single plane built under the new regimen was henceforth—let's say—3% less safe. The employees still remembered how to do their work, and nobody ever shows up to work wanting to do a bad job. Yes, human fallibility is always a factor, but there is no formula that rigidly connects the exact number of Quality steps in a procedure with the safety rating of the output product.

You remember that last week we talked about how Boeing management decided that Quality was "non-value-adding" overhead? This is part of why they thought that! They found, empirically, that they could eliminate one Quality inspection, save a few dollars, and no planes fell out of the sky. OK, good. How about eliminating two inspections? Three? Four? Where do we stop? You can see how, in the absence of visible negative feedback (like an increased accident rate), this could get out of hand quickly.
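The dynamic in that paragraph can be sketched with a toy simulation (all numbers here are hypothetical, chosen only to illustrate the shape of the problem): suppose each inspection step independently catches most defects, and only a small fraction of escaped defects ever produce a visible failure. Removing inspections then multiplies the latent risk while the visible feedback stays near zero.

```python
# Toy model with hypothetical numbers: a built-in defect rate of 1 per 100
# planes; each inspection independently catches 90% of defects; an escaped
# defect produces a *visible* failure only 1 time in 50.
def escape_rate(defect_rate, inspections, catch_prob=0.9):
    """Probability that a defect ships, after a chain of independent inspections."""
    return defect_rate * (1 - catch_prob) ** inspections

for n in range(4, -1, -1):  # start with 4 inspections, remove one at a time
    escapes = escape_rate(0.01, n)
    visible = escapes / 50  # visible failures per plane
    print(f"{n} inspections: {escapes:.2e} escaped defects/plane, "
          f"{visible:.2e} visible failures/plane")
```

Even with every inspection removed, this toy model predicts only about one visible failure per 5,000 planes, so management sees almost no negative feedback; meanwhile the latent defect rate has quietly risen ten-thousand-fold.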

But wait. If that's true, why do we bother with Safety and Quality systems at all? If you can build a safe and reliable airplane without such a system, why were the Boeing executives wrong to eliminate all those extra costs?

Here's the thing: Yes, you can (in principle) build a safe aircraft without a formal Safety system. But you can't know that it is safe! What you buy with all the extra expenditures on Safety and Quality is certainty. And, of course, in order to get that certainty you implement a lot of inspections which then find problems … so the problems can be fixed before the plane is put into service. This improves the plane's safety even further, which is all to the good.

So when Boeing pulled back their Quality system, what they did was to make their planes less certain, not less safe. This or that specific aircraft might be perfectly airworthy—who knows? To bring this discussion back to Piper's article, it is tempting to answer the question by looking at statistics: "Well, the failure rate in new planes is pretty low, so I'll take my chances." The problem with this answer is that it assumes that all planes of the same model are pretty much alike, except for the normal statistical fluctuations of the manufacturing process. But how can you know that? The presumption that all planes of the same model are pretty much alike is just one more kind of certainty. With fewer inspection steps there is less certainty, so you can't even know that the new planes are the same as the old ones. And therefore, like the ads for investment products always say, "Past performance is no guarantee of future results."
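The point that inspections buy certainty rather than safety can be made quantitative with the statisticians' "rule of three": if you inspect n units and find zero defects, the 95% upper confidence bound on the true defect rate is roughly 3/n. A short sketch, with hypothetical sample sizes, shows how the bound balloons as inspection coverage shrinks, even though the observed record stays spotless:

```python
# Rule of three: after n defect-free inspections, the 95% upper confidence
# bound on the true defect rate is approximately 3/n. The sample sizes
# below are hypothetical.
def upper_bound_95(n_inspected):
    """95% upper confidence bound on the defect rate, given zero observed defects."""
    return 3.0 / n_inspected

for n in (3000, 300, 30):
    print(f"{n} defect-free inspections -> "
          f"true defect rate below {upper_bound_95(n):.1%} (95% confidence)")
```

With 3,000 clean inspections you can claim the defect rate is below 0.1%; cut coverage to 30 and the very same spotless record only supports "below 10%." Nothing about the planes changed; only your certainty did.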

We can all be grateful that the accident rate for commercial aircraft is so very, very low. But to have confidence … or certainty … that it will stay low, we need aircraft manufacturers to rely on robust Quality and Safety systems. That's why the story that matters is about those systems, and the statistics are a tempting distraction.

               

Thursday, April 18, 2024

Is Quality a "value-added" activity?

Does Quality add value?

When I was researching my posts about Boeing this spring, I ran across several sources who said Boeing had been cutting back Quality activity for years, on the grounds that Quality work was merely "overhead" and not "value-added."* So even though I've touched on this topic once or twice before,** maybe it's useful to review the question again.

It should be no surprise that I think Boeing was wrong to say that Quality doesn't add value, but in a sense they were on to something. There are two fundamental ways in which Quality differs from components like wheels or doors:

  1. Quality is not tangible or material. Quality isn't a What, but a How.
  2. Quality depends on the user. Quality means getting what you want, and different people want different things. So it's easy to think that Quality isn't objective.

The first point means that there's no container in inventory labelled "Quality." You can't reach in and pull out half a kilogram of Quality to install in one of the engines. Whether you build an airplane with or without Quality, mostly you use the same parts and the same tools. The difference is in how you use them. Do you really need to pay Quality personnel for that? Isn't Quality free?

Well no, it's not. We've discussed this before. People make mistakes. The way to prevent those mistakes is to put systems in place. The systems will save you money in the long run (because you won't be paying for warranty repair or liability lawsuits), but they still cost you in the salaries of the people who run them. It's just cheaper to pay your Quality personnel a predictable sum now, than it is to pay angry customers and victorious plaintiffs incalculably more at some unexpected time in the future.
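The cost argument in the paragraph above is really just an expected-value comparison, which a few lines make concrete (all dollar figures and probabilities here are invented for illustration):

```python
# Hypothetical numbers: compare a certain, predictable Quality budget against
# the expected cost of skipping it (a small annual chance of a huge liability).
quality_budget = 2_000_000          # salaries and systems, per year (certain)
failure_probability = 0.05          # chance per year of a major field failure
failure_cost = 150_000_000          # recalls, lawsuits, lost reputation

expected_cost_without_quality = failure_probability * failure_cost
print(f"Certain cost with Quality:  ${quality_budget:,}")
print(f"Expected cost without it:   ${expected_cost_without_quality:,.0f}")
```

The expected loss dwarfs the budget, and the comparison actually understates the case: the loss, when it comes, arrives all at once and at the worst possible time, while the Quality budget is smooth and plannable.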

The second point is easier to explain with an example. Suppose one of Boeing's airplanes is still around far in the future, and is discovered by a band of scavengers crossing a post-apocalyptic hellscape. They won't care about the precision machining that went into the parts, nor about the multiple fail-safe systems that keep the plane in the air. All that will matter to them is that the airplane can be torn apart for scrap metal. So that precision machining will add no extra level of Quality from the perspective of the scavengers. They won't find Quality anywhere as they rip the plane apart. Doesn't that mean that Quality is subjective? 

Of course not. The answer is that Boeing's actual customers aren't scavengers in a post-apocalyptic hellscape. Boeing's customers are airlines, all of whom want the same thing—namely, to satisfy their own customers. For their part, the airline customers want to get where they are going safe and sound, and more or less on time. In cases like this, where everyone wants the same thing, Quality is absolutely objective. Anything that makes an airplane easier to fly and safer in the air is part of Quality. Anything that makes it more difficult and more dangerous is Wrong, and has to be avoided! 

All the same, I can see how these two points could mislead the Boeing management. When Harry Stonecipher took over Boeing, he avowedly set out to shift the company's focus from engineering to business. But that means that management had to focus on what was tangible and objectively quantifiable: we've all heard the admonition, "You can't manage what you can't measure." And so, ineluctably, the business focus on strict measurables with a visible impact on the bottom line meant that management had no alternative but to pay less attention to Quality.

Boeing is going through a lot of very public troubles right now, so maybe we shouldn't focus on them too relentlessly. Let's look elsewhere. Can we find other areas where Quality—an intangible that relates to customer experience and customer preferences—adds a value that people are willing to pay for?

Yes. Everywhere.

Sometimes it's not quantifiable, but it's still real. There is an old saying in sales, "Don't Sell the Steak, Sell the Sizzle." The point is that—mostly—nobody cares nearly as much about the composition of a product as they care about their experience of it. You can't eat the sizzle, but that's what people pay for. More generally, people pay for experiences that make them happy; the only time they pay for specific physical components is when they believe that those specific components are necessary to achieve their happiness.*** But their experience, their happiness, is not tangible; and in principle it can change from one customer to another. In other words, customers pay for Quality, and not for things.

It also happens that sometimes people pay for Quality in a way that is very quantifiable! I once had an employee who used to work for a company that made medical implants. And he told me that on his very first day, his boss sat him down to say:

We sell plastic devices that cost us $5. We sell them for $125. The extra $120 pays for Quality! So don't mess it up.

Yes, Quality adds value. Sometimes you can measure it in dollars. Even when you can't, it is absolutely real.  

__________

* See for example "The last days of the Boeing whistleblower" from Fortune, March 16, 2024 (especially the next-to-last paragraph), or the On Point Podcast from NPR titled "Whistleblowers, an executive shakeup, and the future of Boeing" (especially from about 11:20 to 11:40). 

** See for example the series "Do audits really add value?" in 2021 (parts 1, 2, and 3), and the series on "Parasitic certifications?" in 2022 (parts 1, 2, 3, and 4). Or just search the blog for the phrase "value add."  

*** And sometimes this is obvious. If I drive over a nail that punctures my tire, the only thing that is going to make me happy is a new tire with no holes in it.        

                

Thursday, April 11, 2024

"The system is broken!"

As Quality professionals, we get into the habit of thinking along certain lines. Often these lines are very useful, which is why we develop the habits. And even when we knock off work to go home, it can be a huge benefit in our daily private lives not to blame people when things go wrong, to use incremental improvement to get better at golf, or to remember the process approach when negotiating with unhelpful Help Desks.

But every so often those habits can trip us up, when they encourage us to assume things that aren't really there. A couple of days ago I was talking to an official from ASQ, and I asked why I hadn't gotten a certain mailing. I was sure this was a sign of a bug in the routine that generated mailing lists, so I requested a bunch of information to help locate the error: "Which mailing list did you use? What email address does that list show for me?" After a while the answer came back, and it turned out that the mailing hadn't gone out to anybody yet. Maybe I could afford to be a little more patient? 😀

The same thing can happen in bigger cases.

Last Friday, April 5, some 60,000 households in the Province of Alberta lost power in rolling blackouts, starting at 6:49 am and continuing over four hours until about 11:00.* Fortunately this was in April, so temperatures were a little warmer than they were back in January—the last time that Alberta's power grid almost failed. (I discussed that failure at the time in this post.) This time, thermometer readings hovered cozily right around freezing: from 28°F to 32°F in Calgary, and from 30°F to 32°F in Edmonton. Even the rural town of Conklin in the north-east experienced temperatures in the same range. Still, the unplanned blackouts caused understandable alarm across the province, and many people rushed to assign blame.

  • Premier Danielle Smith said the blackouts were all because the market doesn't encourage natural gas plants to stay operating, so that they can pick up the slack in a moment when other sources fail. "This is at the heart of everything that we've been saying for the last year, that the system is broken."
  • On the other hand, Marie-France Samaroden, the vice-president of grid reliability operations with the Alberta Electric System Operator (AESO), pointed out that gas plants aren't a panacea: they are as subject to disruption as any other type of generator. And in fact one of the immediate triggers for Friday's blackouts was that the 420-megawatt Keephills 2 natural gas power plant went offline unexpectedly. (At the moment it is not clear why.)
  • Andrew Leach, an energy and environmental economist and professor at the University of Alberta, said that the whole market structure misallocates energy production, because it has been set up inflexibly. 
  • Blake Shaffer, an associate professor of economics at the University of Calgary, summarized the situation by remarking, "People like to assign blame on power system woes to their least favorite generation technology. And the reality is, all generation technologies have reliability challenges." 

The last time I wrote about Alberta's power grid, I discussed the kind of analysis and planning that we might expect: FMEAs for individual components of the system, plus an overall system analysis for the entire grid. And I assumed that such an analysis, if it were thorough enough, would highlight exactly where Alberta has to take action to improve weaknesses, in order to prevent future failures. But I overlooked one huge fact that makes the entire problem far more difficult. It is only a small consolation to reflect that everyone else who has commented on the problem has made the same mistake.

The critical mistake we all have made is that we think of Alberta's electric grid as one large system. But it's not.

Think about it for a minute. The grid consists of producers (who generate power) and consumers (who use it). The producers are plants powered by natural gas or coal; wind turbines; banks of solar cells; and so forth. The consumers are private homes, businesses, and in fact even the production plants themselves to the extent that they need electricity to power their own operations. 

But now, what is a system? According to Wikipedia,** a system is "a group of interacting or interrelated elements that act according to a set of rules to form a unified whole"; and for the concept of system planning to make any sense at all, the planner has to be able to intervene in the system to adjust it in any spot where it is not running correctly. That's how a machine works, and it's how a factory works. All our Quality tools are designed to analyze systems that look like this.

And this is exactly what the Province of Alberta cannot do! Each of those producers is a private company. Each of the consumers is a private company or else a private citizen. The Province of Alberta has no authority to tell the owners of those power plants how to run their businesses, nor to tell private homes how much electricity they are allowed to use. But this means that the Province is powerless to reach in and adjust this or that element of the "system" in order to make the whole thing run better. (Or to keep it from breaking down!) The most they can do is to provide information and offer incentives in the hope of coordinating and influencing the behavior of the "system components." But those "components" remain stubbornly independent.

Of course I have overstated the situation when I say that Alberta's electric grid is "not a system." There are plenty of other systems that have the exact same features: the economy is one of them, and a natural ecosystem is another. Sellers and shoppers are "system components" in the economy; animals and plants are "system components" in a natural ecosystem. In both cases, the "components" do what they want, and not what we tell them to; but we still talk about both the Economy and Nature as "systems." All the same, it is important to recognize that they are systems of a very different kind than a machine or a factory, precisely because the "system components" can do what they want and ignore all our good planning. I won't enumerate examples where an attempt to plan the economy (or an ecosystem!) has had unexpected or unfavorable results, because you can probably come up with plenty of examples on your own. What is important is to recognize that Alberta faces the same kind of challenge in planning the electrical grid.

Does this mean blackouts are inevitable? Not exactly. But so long as the system is structured the way it is today, it is probably not possible to guarantee that future blackouts will be prevented.

Well, is it possible to restructure the system to prevent blackouts? Maybe, but take it slowly here. When I say that the current structure cannot guarantee an end to blackouts, I'm talking about the structure where producers and consumers are all independent. Theoretically I could imagine trying to "simplify" the system by giving the provincial government full authority over all the power companies and all the consumers. Then they could adjust the system wherever needed, to make sure it runs smoothly. But that's a lot of authority. 

Does anyone really want the provincial government telling them how many hours they are allowed to keep their lights on, or what days they are allowed to recharge their phones? Probably not. 

Or if you own a power company, do you want the provincial government to tell you how much you can produce and when you have to produce it, even if their decisions mean you lose your shirt? Again, probably not. 

Of course the whole question is a political one, to be answered by the voters of Alberta and not by me. But I can imagine an outcome where the voters decide that they'd rather put up with the risk of future blackouts, because the available alternatives are even worse. 

Like I said at the beginning, sometimes our habits as Quality professionals can mislead us. Our familiarity and facility with technical tools can make us think that enough technical skill can solve any problem. But sometimes the most difficult issues are not technical ones. 

_____

* You can google the event to find coverage. Here are some of the articles I consulted in writing this piece:
https://www-cbc-ca.cdn.ampproject.org/c/s/www.cbc.ca/amp/1.7165290
https://www.theenergymix.com/rotating-brownouts-in-alberta-highlight-need-for-more-flexible-grid/
https://tnc.news/2024/04/08/alberta-to-modernize-power-grid/
https://calgary.ctvnews.ca/alberta-s-second-grid-alert-in-2-days-leads-to-rolling-blackouts-1.6835023
https://globalnews.ca/news/10405013/alberta-electric-system-grid-alert-april/
   

** https://en.wikipedia.org/wiki/System        

                    

Thursday, April 4, 2024

Disasters happen!

There are people on the Internet who claim that what we see as Reality is actually a giant Simulation, and some days it seems like they have a point. Would random chance in real life have given us the entertaining string of disasters we've experienced so reliably this spring, or should we assume that it's a plotting device dreamed up by some intergalactic blogger and content creator with an offbeat sense of humor? Since my purpose in this blog is not to tackle the Big Metaphysical Questions I'll leave this one unanswered, remarking only that our record of calamities over the last few months has been strikingly consistent.

A lot of my posts since January have been related, in one way or another, to the tribulations of Boeing, who seem to have dominated the headlines for some time now in spite of themselves. But of course that's not all that has been going on. Also back in January, the electric grid in the Province of Alberta came close to shutting down, seemingly because, … (checks notes) … it got too cold. (I discuss this event here.) Then in another extreme weather event that did not repeat the Alberta experience but somehow rhymed with it, massive hailstorms in central Texas three weeks ago destroyed thousands of solar panels.* And perhaps the most dramatic recent catastrophe (upstaging even Alaska Airlines flight 1282) took place early Tuesday morning a week ago, when a massive container vessel leaving Baltimore Harbor collided with one of the supports of the Francis Scott Key Bridge—and demolished the bridge.


It should go without saying that tragedies like this are devastating. If there is any way to find a silver lining around clouds this dark, it is that by analyzing what went wrong we can often learn how to prevent similar catastrophes in the future.

Sometimes this analysis can rely on straightforward data collection about the environment in which the planned operation will take place. Historical records could offer information, for example, on the likelihood of cold weather in Alberta in January, or the risk of hail in central Texas. But often the question is more difficult. For example, the Dali (the container vessel in Baltimore Harbor) appears to have suffered some kind of power failure just before the accident, a power failure which could have made it impossible to steer the ship. I'm sure there was some kind of planned protocol for how to handle a power failure; there was probably an emergency backup power supply available. But how much time did it take to activate the backup power? Did the advance planning take account of the possibility that the power would go out when the ship was in such a position that even a minute or two without steering could mean catastrophe? At this point I don't have any information to answer that question. But I can easily imagine that the answer might be "No, we assumed that five minutes [for example] would be plenty fast enough" … and I can also imagine that back when the planning was done, that might have sounded reasonable! Today we would evaluate the same question differently, but only because we have seen an accident where seconds counted.**

So it turns out that analyzing catastrophes is a hard thing to do. In particular, it is important to recognize that even when we can collect all the data, there are huge innate biases we have to overcome in order to understand what the data are telling us. Two important ones are the Hindsight Bias, and the Outcome Bias.

The Hindsight Bias means that when we already know the outcome, we exaggerate (in retrospect) our ability to have seen it coming at the time. This is why, when people play tabletop games to refight battles like Gettysburg or Waterloo, the other side sometimes ends up winning. Once you know what stratagems your opponent could use to win (because they are part of the historical record), it becomes easier to block them.

The Outcome Bias means that when we already know the outcome, we judge the decisions that people made in the moment by how far they contributed to the outcome. So if someone took steps in the middle of a crisis which looked logical at the time but ultimately made things worse, retrospectively we insist that he's an idiot and that it was his "bungling" that caused the disaster. We ignore the fact that his actions looked logical at the time, for reasons that must have made sense—and therefore, if it happens again, somebody else will probably do the exact same thing. By blaming the outcome on one person's alleged "stupidity" we lose the opportunity to prevent a recurrence.  

If you can spare half an hour, there's a YouTube video (see the link below) that explains these biases elegantly. It traces the history of the nuclear accident at Three Mile Island on March 28, 1979. The narrator walks us through exactly what happened, and why it turned out so badly. And then the narrator turns around to show us that the whole story he just told is misleading! It turns out that Hindsight Bias and Outcome Bias are fundamentally baked into the way we tell the story of any disaster. And if we allow ourselves to be misled by them, we can never make improvements to prevent the next accident.

The basic lessons are ones you've heard from me before—most critically, that human error is never a root cause but always a symptom. (See also here, here, and here.) But the video does a clear and elegant job of unpacking them and laying them out. And even though we all know how the story is going to end, the narrator makes it gripping. Find a free half hour, and watch it. 



__________

* I have seen multiple posts on Twitter insisting that this happened again a week later, but the weather websites which I've cross-checked disagree. See for example this news report, which showcases a tweet that pegs the storm on March 24, whereas the text of the article dates it to March 15. 

** Again, to be clear, I have no genuine information at all about the disaster planning aboard the Dali. I am reconstructing an entirely hypothetical situation, to show how our judgements about past decisions can be affected by our experience in the present.   

           
