Thursday, January 25, 2024

Can you still fail with a Quality system?

It has been a tough year for Boeing. 

Five days ago, Delta Air Lines flight 982 from Atlanta to Bogotá (757) lost a wheel while taxiing for departure.

On January 13, a crack was found in a cockpit window of All Nippon Airways flight 1182 from Sapporo to Toyama (737-800) shortly after takeoff.

On January 5, a door-sized plug blew out of the airframe of Alaska Airlines flight 1282 (737 MAX-9) six minutes after the plane took off from Portland International Airport.  

Image from the NTSB investigation of the Jan. 5 accident involving Alaska Airlines Flight 1282 on a Boeing 737-9 MAX.
Captured on Jan. 7. (Image in the public domain.)

A few days before that, Boeing asked the FAA to exempt a new model of the 737 MAX from a safety standard implemented last summer, after it became clear that using an anti-icing system in dry air could overheat engine-housing parts, which could then break away from the plane. (The Associated Press reports that "Boeing needs the exemption to begin delivering the new, smaller Max 7 to airlines.")
 

Back in December, Boeing asked airlines to inspect their 737 MAX planes for a loose bolt in the rudder control system. (This month Boeing followed up by issuing a bulletin to their suppliers to ensure that all bolts are properly torqued.)

Last April, Boeing said that production of the 737 MAX could be delayed because one supplier used a “non-standard manufacturing process” during installation of some fittings.

Then there are the engine fires: 

What's an airplane manufacturer to do?

As a first answer, let me suggest ... that's why they have a Quality system.

It sounds crazy to look at a string of failures and conclude, "That's why they have a Quality system," but in fact it's true. No Quality system can prevent all possible errors. The best Quality systems reduce the number of failures until there are very few, but even so there is always room for improvement.

And in a sense it is remarkable that the number of failures has been so small. The FAA handles an average of 45,000 flights per day in the United States alone, or well over sixteen million per year. Let's assume Boeing made half of those planes (and is therefore responsible for eight million flights a year), which is probably close enough.** A dozen errors in a year—even if you think they've been under-reported and the real number is a few dozen—is a very small fraction of eight million.
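
Just to make that concrete, here is the arithmetic as a quick sketch (treating the fifty-percent share and the dozen errors as the rough assumptions they are):

    # Back-of-the-envelope failure rate, using the figures cited above.
    flights_per_day = 45_000                  # FAA average, U.S. only
    flights_per_year = flights_per_day * 365  # ~16.4 million
    boeing_share = 0.5                        # assume Boeing made ~half the fleet
    boeing_flights = flights_per_year * boeing_share

    incidents = 12                            # "a dozen errors in a year"
    rate = incidents / boeing_flights
    print(f"{boeing_flights:,.0f} flights/year; failure rate ~ {rate:.1e}")
    # -> 8,212,500 flights/year; failure rate ~ 1.5e-06 (about 1.5 per million)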

The next thing that a Quality system gives you is a way to respond when something goes wrong.*** Maybe the most remarkable fact about all the failures I listed above is that nobody was seriously hurt in any of them. Even when the door plug blew out of the airframe, some phones and other belongings fell out, but the only injuries were minor. In each case, the airline personnel knew what to do so that a problem didn't become a catastrophe. Even Boeing's bulletin to their suppliers is clearly a response—no doubt defined by their Quality system—to a discovered Quality defect in purchased materials.

Even if you certify your Quality system to a management system standard (such as ISO 9001, or AS9100 for aviation)****, that doesn't guarantee that your products will be perfect. All it does is certify that your system meets a defined threshold of goodness. Using that system every day to get the work done is still a challenge for every organization.

__________

* S7 is a Russian airline, and current sanctions on international trade with Russia make it impossible for them to import spare parts from Boeing; so this incident could have resulted from inadequate maintenance.

** Boeing's website says that they have manufactured "almost half the world fleet" of commercial jetliners, and that "about 90 percent of the world’s cargo is carried onboard Boeing planes."

*** That was the most important point in this post from 2021: no Quality system can prevent (for example) someone going crazy just before he does a critical operation. But a Quality system can define how you respond after the fact, to contain and correct the problem. 

**** Yes, I remember that Boeing isn't formally certified to AS9100, though they flow the requirements down to their suppliers and clearly state that their internal Quality system meets the AS9100 requirements. Note that I said "Even if." Whether they would do better if they were in fact certified is a question for another day.      


Thursday, January 18, 2024

Alberta gets cold in the winter

Alberta gets cold in the winter. Maybe you already knew that.

Last weekend it got cold enough that the Alberta Emergency Management Agency issued an alert: the electric grid was overtaxed, and rotating power outages would have to be implemented unless Albertans cut back on their electricity usage immediately. The good news is that enough people responded right away to avert the crisis. The Globe and Mail reports on the story here, and YouTube has reported on it in a number of videos (for example, here).

How cold is "cold"? Calgary (in the south of the province) recorded a high of -17°F on Saturday, January 13; that same day, Fort Chipewyan (in the north) recorded a low of -49°F. (Neither of these numbers takes account of wind chill.) At those temperatures, a power outage could have been devastating, perhaps even lethal. It is tremendously fortunate that there was no need for one.

But it's hard to think that this weather—not to mention the consequent demand on the electric grid—was unexpected. As I suggested above, everybody knows that Alberta gets cold in the winter! So how did things get to the point where there was a serious risk of rotating outages? How could the system be functioning this badly?

A real answer requires real data, which is more than I have. The most I can do is to sketch how an analysis should be structured, and maybe incidentally to advance a hypothesis or two. If any of my readers live in Alberta, or are otherwise close enough to the event to have good data on what happened, please comment to set me straight! I would love to know the answer; and if my hypotheses turn out to be no more than hot air (excuse the pun), that's a small price to pay to get the facts.

The first step in an analysis is to understand the components that make up the system. According to the website of the AESO (Alberta Electric System Operator), 60% of Alberta's electricity comes from burning natural gas; 20% from wind turbines; 7% from coal; 6% from solar power; 5% from hydro power; and 2% from other sources. Last weekend, though, several of those sources were interrupted; a rough tally of what that left is sketched after the list below.

  • According to The Globe and Mail, two of the natural gas plants were offline: one for maintenance, and the other (ironically) because of weather.
  • According to EnergyNow.ca, the wind farms were shut down—even though there was wind—because in temperatures that low there is a high risk that the steel turbines will shatter.
  • And several sources have pointed out that the biggest demand hit after dusk, when solar cells weren't collecting any sunshine.
  • One report says that the coal-fired plants worked just fine; but at only 7% of the total supply (at least theoretically) they could only contribute so much. 
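
Putting rough numbers on those outages makes the squeeze vivid. In the sketch below, the shares are AESO's figures from above, but the availability assumptions are mine and purely illustrative (in particular, the 15% I assign to the two offline gas plants is a guess):

    # Purely illustrative tally of last weekend's available supply.
    # Shares are the AESO figures; availability assumptions are guesses.
    mix = {"gas": 0.60, "wind": 0.20, "coal": 0.07,
           "solar": 0.06, "hydro": 0.05, "other": 0.02}

    availability = {
        "gas": 0.85,    # guess: two offline plants cost ~15% of gas capacity
        "wind": 0.0,    # shut down in the extreme cold
        "coal": 1.0,    # reported to be running fine
        "solar": 0.0,   # peak demand hit after dusk
        "hydro": 1.0,
        "other": 1.0,
    }

    supply = sum(mix[s] * availability[s] for s in mix)
    print(f"Available supply ~ {supply:.0%} of normal capacity")
    # -> about 65% of normal, just as demand was spiking

Even under generous assumptions, roughly a third of the normal supply was simply gone.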

The second step is to understand the failures in those components that failed. Of these, I find the wind farms most interesting. I assume that there was some kind of FMEA done when the wind turbines were designed, and that it was this analysis which discovered the risk that steel parts shatter when they get too cold. Probably the shutdown protocol was defined as a protective measure, to reduce the risk of damaging or destroying the wind turbines.

In isolation from other factors, the logic here is impeccable. But in the context of the entire system, this "protective measure" means that when the weather gets very cold—in other words, precisely when we should expect the demand for electricity to spike—20% of Alberta's electric supply goes offline. Is that right? Is that how we want the grid to work?

This leads us to the third step of any analysis, and the one most likely to turn up unexpected conclusions: look at the system design as a whole to understand how failures might arise even when each component is behaving perfectly as designed. (See, for example, the discussions here and here.) Looking at the system also means understanding how it responds when individual components fail, as of course they are sure to do from time to time. And if you discover any critical points—places where the system is particularly fragile, or where it is highly likely to fail—those are the points you especially have to strengthen or stabilize. (Among other things, the system's capacity should exceed demand by a wide enough margin that it can keep functioning even when some components fail.)
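
The capacity-margin point in that last parenthesis can be made concrete with a toy "N-1" check: can the system still meet peak demand if any single source drops out? The capacities below are invented for illustration; they are not Alberta's actual numbers:

    # Toy "N-1" contingency check with invented capacities (MW).
    sources_mw = {"gas": 6000, "wind": 2000, "coal": 700,
                  "solar": 600, "hydro": 500, "other": 200}
    peak_demand_mw = 8500

    total = sum(sources_mw.values())
    for name, capacity in sources_mw.items():
        remaining = total - capacity
        verdict = "OK" if remaining >= peak_demand_mw else "SHORTFALL"
        print(f"lose {name:6s}: {remaining:5d} MW remaining -> {verdict}")

In this toy grid, losing either the gas fleet or the wind fleet alone produces a shortfall at peak. Those are exactly the critical points the analysis has to flag.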

There is one more factor which will have to be part of any analysis, though it is not so much a separate step as a consideration that may affect the analysis in multiple places. Any decision made by or about a public utility is inherently a political decision, even if the utility is nominally private. This means that there will inevitably be factors involved in the decision quite different from those of pure design and operational functionality, and these factors cannot be ignored. As we have seen before: when the Quality process conflicts with the political process, the political process usually wins.   

The encouraging side to this last reflection is that it is to nobody's political advantage for the lights to go out in the middle of winter. So there is no reason to suppose that politics will get in the way of a proper root-cause analysis of last weekend's near-failure. 


Thursday, January 11, 2024

Stage gates: looking back and forward

Last week I talked about how a formal stage-gate process can take the drama out of project management decisions. But I oversimplified one point.

Remember that the basic idea is to define objective criteria at the beginning of the project for what it takes to wrap up each stage and move on to the next one. Then when it is time for one of these stage transitions, you know what you should have achieved. So you can go down the list and check: Did we do them all? If yes, you move forward. If no, you've still got work to do in the current stage.
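
At its core, the gate check really is that mechanical. A minimal sketch (the criteria here are hypothetical):

    # Minimal stage-gate check: criteria defined up front; pass only if all met.
    exit_criteria = {
        "All planned features implemented": True,
        "All tests passed": False,            # Task 17 isn't done yet
        "User documentation reviewed": True,
    }

    if all(exit_criteria.values()):
        print("Gate passed: move on to the next stage.")
    else:
        open_items = [c for c, done in exit_criteria.items() if not done]
        print("Gate not passed. Still open:", "; ".join(open_items))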

Sure, what's wrong with that?

The oversimplification is that I described it as if you can make these decisions blindly, guided just by the mechanical operation of the checklist itself. But while it's true that the use of a checklist helps to depersonalize the process and thus reduce the associated drama—Yes, we have to delay the release but it's not about you; it's just that Task 17 isn't done yet—on the other hand we all know that making decisions blindly is a bad idea.

Things might have changed since you made your first plan. Maybe you've learned during the project that one of your planned features is impossible, or (more likely) that it's going to cost far more to develop than you will ever make back. So you haven't passed all your tests (because the product is still missing that one feature) but the rest of the product is fine and now you think it's good enough.

Or maybe once you got deep into the development process you discovered new risks you hadn't known about before. In this case maybe the product does pass all of its tests, because back at the beginning of the project you never thought to test for these new issues. But now that you know about them, you're really not comfortable releasing it yet—even though your formal testing is at 100%.

So what do we do?

The answer is to use the stage gates to make two different evaluations: one looking backward to assess what has actually been accomplished, and one looking forward to weigh risks in the future.

In practice, I have seen this idea implemented in different ways. One company actually held two meetings for each stage gate, with different (but overlapping) attendance lists. Another company kept it to a single meeting, but the discussion points shifted halfway through.

Part one

The first part, then, is to ascertain the facts about what has really been done. Pull out your list of all the things that were supposed to be accomplished in the stage you are now ending, and invite the people responsible for accomplishing them. Then start at the top.

"Attendee #1, did you succeed in __________?" If you get the answer Yes, ask for proof (such as a test report). If you get the answer No, ask for the status: is it just because the department needs another day to wrap things up, or are there fundamental problems that have surfaced? Write down the issues, plus any actions that have already been planned and any projected completion dates. Then go on to the next question, and proceed through the whole list just like that. 

At the end, you will have a full picture of the project's status as of today. These tasks are done, and here is a list of the relevant artifacts. The rest of the tasks are still open; and for each one here is an explanation of why, plus a list of the actions still planned and expected completion dates. So far, so good.
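
For what it's worth, that picture is easy to capture in a structured record. Here is one possible shape for it; the field names are my own invention, not a standard:

    # One possible record per checklist item, with invented field names.
    from dataclasses import dataclass, field

    @dataclass
    class TaskStatus:
        task: str
        done: bool
        evidence: str = ""                     # e.g. a test report, if done
        issues: list = field(default_factory=list)
        planned_actions: list = field(default_factory=list)
        expected_completion: str = ""          # date, if still open

    status = [
        TaskStatus("Environmental testing", done=True,
                   evidence="Test report TR-042"),
        TaskStatus("Install feature X", done=False,
                   issues=["Supplier part failed qualification"],
                   planned_actions=["Qualify an alternate part"],
                   expected_completion="2024-02-15"),
    ]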

Part two

Then the second part is to decide What next? But notice that this is a very different kind of question from the ones you asked in the first part. 

  • The questions in the first part were all project management questions: What's the status? Who is responsible? When will you finish? 
  • The questions in the second part are business decisions: What do we do? What are our options? What have we committed? What can we afford? What are the risks? What are the consequences?  

The point is that you have to know all the facts before you make a decision; but ultimately mere facts don't make the decision for you.  

So in the case where you found that one feature was a lot harder than planned, you can discuss the consequences of releasing your product without that feature. I worked on a project once where exactly that happened. Since we were designing the product for one specific customer, we asked the customer what they wanted us to do. They said, "We don't need that feature until next year, but we need the rest of the product right away—literally as soon as you can start shipping." We agreed with them to release the product as-is, but to start a second project to develop that last remaining feature so we could release it in an update a year later. 

Or in the case where you discover previously unsuspected risks, you can decide that even though your checklist is 100% satisfied you are going to hold the product and not release it yet. Then you can assign a team to investigate the new risks, and to evaluate how likely they are and how much damage they would entail. Maybe another team can work on ways to mitigate the risks, so that the likelihood and severity both decrease. And when the teams are done, meet again to evaluate what they have learned, and to decide whether and how to move forward.
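
One common way (though by no means the only one) to make "how likely" and "how much damage" concrete is a likelihood-times-severity score. The 1-5 scales, the sample risks, and the threshold below are all illustrative:

    # Likelihood x severity scoring for newly discovered risks.
    # The 1-5 scales and the release threshold are illustrative choices.
    risks = [
        # (description, likelihood 1-5, severity 1-5)
        ("Overheating under sustained load", 4, 5),
        ("Cosmetic discoloration", 3, 1),
    ]

    RELEASE_THRESHOLD = 10    # scores at or above this hold the release

    for description, likelihood, severity in risks:
        score = likelihood * severity
        verdict = "hold release" if score >= RELEASE_THRESHOLD else "acceptable"
        print(f"{description}: score {score} -> {verdict}")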

Do we have to go through this rigamarole for each stage gate, or just to release the product? 

Well, naturally it's most important to go through it before you release a product to your customer or to the public. If you release a product with defects, you'll likely incur warranty costs repairing it, and in some cases there might be legal penalties as well. The business impact of a wrong decision at this point can be huge.

But in principle you should handle the other stage gates the same way. After all, what if you discovered a problem in the middle of the design phase such that you can already tell this project will become a money pit? Do you really want to keep working on it all the way up to release? Or do you want a chance to reconsider? Using the interim stage gates to make the business decision "Yes, keep going" or "No, stop this now" affords you a finer granularity of control over the work. 

Who attends these meetings? 

Quality has to be there, of course. Quality, as a neutral arbiter, often runs the meetings. Other than Quality, there are two answers:

You need the responsible members of the project team to give you information about the project status. This is what I have called the "first part" of the stage gate decision.

But the project team typically shouldn't make business decisions (the "second part"). Business decisions cost money, and they expose the company to risk. So they have to be made by people who are authorized to spend the money and assume the risk. Often this means Senior Management. In a small or mid-sized company, it could well mean the President.

Does this mean you have to drag the CEO into all these project meetings? 

Every company is different. Figure out what works for you. But before you dismiss the idea as "obviously ridiculous," do a little math on the back of a napkin. How much will it cost if your stage-gate decision is wrong? You'll find the numbers add up fast. And when your company has to make any other decision to put that much money at risk, who has to be in the room?
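
That napkin math is short. Every number below is a placeholder; substitute your own figures:

    # Back-of-a-napkin expected cost of a wrong gate decision.
    p_wrong = 0.05                # chance the release decision is wrong
    cost_if_wrong = 2_000_000     # recall, warranty, rework, penalties
    expected_loss = p_wrong * cost_if_wrong
    print(f"Expected loss per gate decision: ${expected_loss:,.0f}")
    # -> $100,000 -- the kind of sum that normally needs senior sign-off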

Well then. 

Summary (TL;DR)

To recap, here are the basics.

  • The stage gate decisions in any formal product development process must have two parts: a project management assessment of the current status, and a business decision about whether and how to move forward. 
  • Normally a clean project management assessment means you are ready to move ahead. But not always.
  • Normally a failed project management assessment means you can't move forward because you still have to fix things in your current stage. But not always.
  • All these decisions have to be made from the perspective of what's good for the whole business. And they have to be made by people who can speak for the business.

That's all.


Thursday, January 4, 2024

"I don't want to stand in the way!"

Over the holidays, I was talking to a friend of mine about a tangle in the internal documentation in her organization. I don't understand all the details yet, and I won't try to write about it until I do. 😀 But in the course of our discussion she referenced some issues that took me straight back to my days supporting a product development process.

A formal product development process is typically articulated in stages: Definition, Design, Development, Testing, and Launch, or something similar. Generally there are entry-criteria that have to be satisfied before you can start work on a stage, and exit-criteria that have to be satisfied before you can leave it. For example, entry-criteria for the Test stage should include "You have a set of tests defined" and "You have some kind of prototype to run the tests on." (There are probably others, too.) Exit-criteria from the Test stage nearly always include, "The product passed all its tests." (Or at any rate, if it didn't pass all its tests there had better be a good reason why not.) And it works the same way with all the other stages.
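
Written down explicitly, those criteria can be as plain as this hypothetical fragment for the Test stage:

    # Hypothetical entry- and exit-criteria for the Test stage, written
    # down explicitly so the gate check itself can be mechanical.
    stage_criteria = {
        "Test": {
            "entry": [
                "A set of tests is defined",
                "A prototype exists to run the tests on",
            ],
            "exit": [
                "The product passed all its tests",
                "Any failed test has a documented, approved rationale",
            ],
        },
    }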

This is the Stage-Gate® new product development process from The AIM Institute.
(See website.) But you can find many others that look almost the same.
 

Part of the role of Quality in a system like this is to act as a neutral arbiter. (Compare this post here, for example.) Otherwise there's always a risk that one department (for example, Marketing or Sales) might try to push a product out the door when it's not really ready, because there are eager customers waiting for it. But if Quality has to sign off on every "stage gate" (transition from one stage to the next), then Quality can check the list of stage-gate criteria and ask for objective evidence on each one: "It says here the product has to pass all its tests. Can you show me a test report? [Pauses to read test report.] Wait a minute, here on page 57 there's a test that failed. What's the story with that?" 

And so on.

If you've never worked under a system like this before, it can look a little slower than developing without these artificial stopping points. But in the long run you save time and anxiety, because there is so much less confusion about what your status really is.

Sometimes, though, people got very anxious when it was time to hold a stage-gate review. If the project was late and things weren't coming together well, team members might start putting in longer hours to make sure all the documented artifacts were ready in time for me to review them. And once in a while, even the longer hours wouldn't help because the product just didn't work right.

I remember one project where I had been collecting the artifacts prior to a stage-gate review, and I was missing one test report from an engineer I'll call Ken (not his real name). So I went to see him.

"Hi, Ken. Do you know when your test report will be ready?"

"Well, ... do you really need it?"

"Of course I need it. Is there a problem?" 

"Well, ... some of the tests aren't working out the way they should."

"Do you mean you can't execute them, or do you mean the product fails?"

"It's that last one. The product isn't doing what it's supposed to do."

"Fine, that's no problem. Just write up the report and document the results you are getting. If some of the tests fail, we need to know that."

"That's easy for you to say. But I don't want to be the guy that stands in the way of releasing this product on time! I don't want to be the guy that tells the whole train to stop!"

Aha. That was the issue. So I explained.

"Ken, you're not standing in the way of anything. All you are doing is providing data. When we sit down in the stage-gate review with the management team, we'll decide what we want to do about that data. Depending on the problem, if it's small enough we might go ahead even though the problem is there. Or if it's an important use-case that's failing, we might decide that it's better for everyone to wait till the problem is fixed."

"This problem is pretty important."

"Fine. But either way, we're helpless unless we know the facts! And once we know the facts, we can make an informed decision about the right way forward."

"OK, I see that. But I still feel like I'm the one who's standing in the way of releasing on time."

Standing in the way

I smiled.

"Trust me, Ken, you're not the one who is going to stand in the way of releasing the product on time. I am. But I've got to have your report."

"I'll email it to you this afternoon."

"Thank you." 

Ken's perspective is actually a common misconception. It's easy to look at the project organization barreling forward at top speed, and to conclude that project management decisions are based on Who Argues the Loudest ... that somehow decisions about when to release a product come down to a contest where the loud shouters prevail and the quiet ones just have to grumble.

But under a formal product development process, it's nothing like that. A formal process forces you to define all your decision criteria in advance. Then, when the moment arrives, it's a simple factual calculation: have you met the criteria, or not? If yes, go forward; if no, stop. 

Often it is helpful if there is some wiggle room to handle anomalous or difficult cases. But even that wiggle room has to be clearly defined up front. I can talk about that more later.  

