Thursday, June 27, 2024

Return of the FMEA

Last week I wrote about how to carry out a Failure Mode and Effects Analysis (FMEA) before you build a product. The idea is to look ahead to find all the risks you can foresee, and then prevent or mitigate them.

The method is straightforward, which means it can be applied in different contexts as needed. Some organizations, when they get large enough, find it useful to split the FMEA into two parts: 

  1. a Design FMEA (DFMEA) to look specifically at the product design;
  2. and a Process FMEA (PFMEA) to look at the manufacturing processes that will be needed in order to build the product once it is designed.

So to take last week's example of my friend's suitcase with the flimsy handle, a DFMEA would have evaluated whether the right plastic had been chosen for the job and whether the two little screws were enough to anchor it. But a PFMEA would have asked how the handle with its screws was actually going to be attached to the rest of the suitcase. Will those screws be tightened by hand, or by a machine? What torque will screw them in without either stripping the threads or ripping the fabric? Would it be easier and more reliable to replace the screws with brads, or would that fail more often? And so on.

In other words, in addition to the risks we discussed last week—that a product might fail in use or that someone might get hurt—the PFMEA team must also watch for the risk that the manufacturing process might break down and create a bunch of unusable scrap. The operations where this last risk is a big worry are precisely those that manufacture on such a large scale that they routinely separate DFMEAs from PFMEAs to keep both of them a manageable size.

But once you've gone through all this work, found all your relevant risks and prevented the ones that matter, once you are finally into production, then at last you are done. Right?

It depends.

Specifically, it depends on your overall risk profile. Because as soon as you go into production and start selling your product, the first thing you discover is that you didn't foresee everything after all. As one fellow commented on one of my posts on LinkedIn several months ago, "Things will fail in new and exciting ways that were never even thought of during FMEA." And he is absolutely right.

Partly this is because of the sheer cussedness of things, that even when you do the very best you know how to do, there's often some issue you never even thought to explore.

Another factor is that your customers may well prove creative in finding new use cases that you never dreamed of. This special talent was once summarized in the remark that "It is impossible to design anything that is foolproof because fools are so ingenious." 

What now?

One way or another, once your product is out in the market you will start getting information about how it is performing. Read the customer complaints as they come in. Ask your Service or Repair departments what kinds of issues they are seeing. All of this is raw data that helps you learn more about your product.

What do you do with this data? Well, in principle you should feed it back into the FMEA cycle. Pull the FMEA records off the shelf and revisit them. Is the actual performance that you see in the field consistent with what you expected? Or are there discrepancies? 

  • Maybe the data tells you that you estimated some of your values wrong when calculating RPN for all the risks you identified. In that case plug in the new numbers, and see what that does to your overall calculations.
  • It's also possible that you discover a risk you never considered before. In that case add it, and use the real-life data from the field to work out reasonable RPN values.
  • When you are done with this part, then treat this review just like any other FMEA. Check whether you have any risks higher than your threshold value. (You could also use the data from the field to help you decide if your threshold is really in the right place.) If yes, plan an improvement of the product to address those risks; if no, then things are good. But even so, you should keep watching.   
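Fed back into code, that review loop is mostly simple bookkeeping. The sketch below is purely illustrative: the risk names, the 1-5 scores, and the threshold of 40 are all invented for the example, not taken from any real FMEA.

```python
# Hypothetical FMEA records: each risk carries P (probability),
# S (severity), and D (detectability) scores on a 1-5 scale.
risks = {
    "handle snaps": {"P": 2, "S": 4, "D": 5},
    "zipper jams":  {"P": 3, "S": 2, "D": 2},
}

def rpn(r):
    """Risk Priority Number = P x S x D."""
    return r["P"] * r["S"] * r["D"]

# Field data says we underestimated the handle's failure probability:
# plug in the new number and see what it does to the calculation.
risks["handle snaps"]["P"] = 4

# Field data also reveals a risk we never considered; add it, with
# scores worked out from the real-life failure reports.
risks["wheel detaches"] = {"P": 2, "S": 3, "D": 3}

THRESHOLD = 40  # example value; each organization sets its own

# Anything scoring above the threshold goes into the improvement plan.
needs_action = [name for name, r in risks.items() if rpn(r) > THRESHOLD]
print(needs_action)  # ['handle snaps']  (4 * 4 * 5 = 80 > 40)
```

With the invented numbers here, the updated handle risk climbs from 40 to 80 and crosses the threshold, so it goes back into design; the other risks simply stay on the watch list.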

Wow, this sounds like a lot of work. Do we really have to revisit every single FMEA on a regular basis? And how often is a "regular basis" anyway? Do we have to do this annually? Monthly? Weekly? Where will we find the time?

A fair question. Notice that I said "in principle." It really does depend on your overall risk profile. If you are making products that have a high risk of hurting people—medical devices, airplanes, spacecraft—then yes, you need to take this process very seriously. I can't tell you exactly how often to revisit your FMEAs, but you absolutely should keep abreast of all the news from the field and respond accordingly.

But if the risks inherent in your products are a lot milder, then your risk-mitigation activities can be less intensive as well. As with all aspects of your Quality system, your level of effort should be proportional to the scale of your operations and your risks. Of course at some level you always want to know what's going on with your products, so you can improve them and keep your customers happy. But you also have to keep it pragmatic.

                

Thursday, June 20, 2024

Think about your product before you build it

Earlier this year a friend of mine was traveling, and her suitcase broke. The handle snapped in two places. She was already past the deadline for returning it, but she left a detailed review on Amazon.


This suitcase is a great size and weight and I like the zippered top compartment that prevents contents from falling out when the suitcase is opened. But unfortunately, the handle failed during the second trip I used it on. The design looks flimsy—the handle is held on by a single small screw on each side of the mount, and the mount is a flimsy plastic piece. It appears unrepairable; and even though it has been less than 5 months, I am outside the return window and Amazon will not refund my money. I will not purchase this suitcase again and I will look for a suitcase that has a more durable construction and a decent warranty.

Of course I sympathized with her bad luck. But at the same time—naturally enough—I started to wonder, How did this happen? A failure like this can't be waved away as a random accident or an Act of God. Surely the company who designed and manufactured these suitcases could have seen this coming, as soon as they selected "flimsy plastic" for the handle and chose to anchor it with "a single small screw on each side of the mount." Even so, I guess it could have been fine if all you carried in it were marshmallows. But it was sold as a suitcase, not a marshmallow-carrier. And many people pack their suitcases full. (I know my friend does!)

What should this company have done differently?

That's simple: as part of their design process, before they went into production, they should have carried out a Failure Mode and Effects Analysis, universally abbreviated FMEA.

The point of an FMEA is to avoid exactly this problem. 

  • Look at your product design, and think through—in advance!—all the ways it can possibly fail. 
  • Once you've collected a list of all foreseeable failures, go back and update the design as necessary to eliminate them. 
  • Now with your updated design, update your FMEA to see if you've introduced any new failures. 
  • Rinse and repeat.

Of course you can't do this forever. At some point you have to exit the design cycle and move into production. Also, you might identify some failure modes which, yes, are theoretically possible but highly unlikely. Maybe you've designed an umbrella that will protect you just fine from rain but won't protect against ray guns in case of an alien invasion from Mars. On the other hand, the odds of an invasion from Mars are pretty slim. One way or another, then, you need a criterion for when to let it go.

The answer is to assign every possible failure a Risk Priority Number, or RPN. This number is most commonly calculated based on three other numbers that you assign first. (The method here is an extension of the method for risk prioritization we discussed in the post "Basic risk management.")

  • Evaluate the probability (P) of the failure happening, typically on a scale from 1-5. 
    • 1 means that the failure is extremely unlikely, or virtually impossible. 
    • 5 means that the failure is frequent or almost inevitable.
  • Evaluate the severity (S) of the damage caused in case the failure does happen, again on a scale from 1-5. 
    • Naturally, the scale depends on knowing, "What's the worst that could happen?" If the worst outcome is that the product stops working, that's a 5. But if there's also a chance that somebody could get hurt, obviously that's even worse than the product merely shutting down.
    • 1 means that even if the failure happens, there is no effect on reliability or safety.
    • 5 means that if the failure happens, the results are catastrophic. The product stops working, and—if there is any possibility for people to get hurt—people get badly hurt.  
  • Evaluate the detectability (D) of the failure, on a scale from 1-5. 
    • The idea is that if it's obvious something has gone wrong, the user will put the product down before anything bad happens. That's why car manufacturers design your brakes to make a terrible noise when the brake pads are getting thin—so you'll know it's time to replace them. But hidden problems can catch you unawares.
    • 1 means that you are certain to detect the problem in time.
    • 5 means that the problem will be invisible to users or even regular maintenance personnel.
  • Then your RPN = P x S x D.

Now every possible failure in your list has an RPN between 1 and 125 (= 5 x 5 x 5). The next step is that you have to assign a threshold value, call it N. Then the rule is that you have to correct every possible failure whose RPN is greater than N. When the RPN is less than N, you leave it alone.

Do you see how this solves the two problems I identified above, where you get stuck in a Design loop forever?

  • It's true that after you carry out your FMEA, you go back into design to correct all the failure modes that scored worse than your threshold N. And it's true that after you've redesigned the product, you should redo the FMEA to see if you introduced any new errors (and also to check that you really did prevent the ones you tried to prevent). But you don't stay in this loop forever. As soon as all the failure modes on your list have an RPN less than N, you are free to move on to the next step.
  • Also, it's unlikely that you will end up trying to protect against Martian ray-guns. The odds of an invasion from Mars pretty clearly deserve a probability rating of P = 1. So even if S = D = 5, your final RPN will be only 25. And it's likely that your threshold is higher than that.
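The whole scoring-and-threshold rule fits in a few lines of code. Again, this is a sketch only: the failure modes and the threshold N = 40 are made up for illustration.

```python
# Each failure mode gets three 1-5 scores: probability, severity, detectability.
failures = [
    ("seam tears under load",   4, 3, 4),
    ("color fades in sunlight", 3, 1, 1),
    ("Martian ray-gun attack",  1, 5, 5),
]

N = 40  # threshold: a judgement call made by the team

# Sort worst-first by RPN = P x S x D, then split at the threshold.
scored = sorted(((p * s * d, name) for name, p, s, d in failures), reverse=True)
must_fix = [name for rpn, name in scored if rpn > N]
print(must_fix)  # ['seam tears under load']  (48 > 40; the ray guns score only 25)
```

Even with maximum severity and detectability, the Martian invasion's P = 1 caps its RPN at 25, which is exactly the reasoning in the bullet above: it never makes the must-fix list.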

How do you decide where to set your threshold N? This is a judgement call, and there is always a risk that the decision might be corrupted by someone pushing to set it in the wrong place. For example, someone might urge the team to set the value too high, so that they don't have to spend time preventing foreseeable problems. The best advice I can give is to get suggestions from stakeholders across the organization—for example, from Customer Service and Manufacturing, as well as Design—and to use honest common sense. Most of the time, when you list all your possible failures in order of RPN (from worst down to best), it will be obvious that the risks at the top of the list are terrible, and the ones at the bottom of the list are inconsequential. And often it will be equally obvious where to draw the line between them. There might be a small handful that you have to discuss because they are close to the line on one side or the other, but usually there aren't many. 

If The Suitcase Company had carried out an FMEA on their suitcase design, would that have helped my friend? I think so. 

  • Assuming that their engineers knew their job, they should have recognized that a failure in the handle was pretty likely once the suitcase was full.
    So let's say P = 4.
  • Since suitcases typically don't present a big safety risk to users, the relevant measure for severity would be whether the suitcase was still usable after the handle broke; and the answer is "Mostly no."
    So let's say S = 4.
  • And the suitcase gave no warning signs before the handle suddenly snapped, which argues for the worst score for detectability.
    So let's say D = 5.
  • Then the RPN for the failure "Handle could snap" would be 4 x 4 x 5 = 80.

Ah, but where was their threshold? Of course I don't know. But it seems to me it should have been lower than 80. 

                

Thursday, June 13, 2024

How do you write a Quality Policy?

I've discussed Quality Policies in earlier posts—especially here and here, for example—but I've never explained how to write one. Honestly, I'd never thought about it before now. I guess I thought that somehow the writing would take care of itself. But Kyle Chambers and Caleb Adcock over at Texas Quality Assurance have done me one better on this point. In a recent episode of their #QualityMatters podcast (episode 181, to be exact) they talk through the theory behind Quality Policies and give step-by-step instructions for writing one. And mostly I agree with them.

You can play the episode as a podcast by clicking here, or you can play it on YouTube by clicking the video image below.


Kyle and Caleb start their episode with a general discussion of whether Quality Policies have to be inspiring (2:30) and of what the formal requirements for a Quality Policy actually say (5:00).*  (Kyle quotes the requirements from ISO 9001, clause 5.2, but makes the point that the analogous clauses in other comparable standards all say basically the same thing.) But soon they move to the central question. Kyle asks (at about 8:50), Is there a formula for writing a Quality Policy? And then he proposes one: Simply describe what your business does, and lay it out like this:

  • We provide <these goods or services>
  • … to <these customers>.
  • We meet all requirements and we improve continually <by using these and those methods.> 

Absolutely straightforward.

Then Kyle shows how this formula works in practice by using it (from about 11:00 to 18:00) to generate a Quality Policy for his own business, Texas Quality Assurance. He first states it at about 16:45, and then goes back to revise it a little later. 

The next part of the conversation (from 19:45 to about 24:20 or so) is on how to derive major Quality objectives from your Quality Policy, and the method is equally direct.

  • Look at your Policy.
  • See the things it says you do?
  • Those are your major Quality objectives. Do them. 
  • More exactly, use your Policy statement to make a concrete list of what you have to achieve.
  • Naturally you might need to track department-level KPIs as well, just to make sure you are on track. But the business-level objectives are to do the things you say you do in your Policy.

After that, Kyle and Caleb discuss some pointers on things-NOT-to-do and on how to brainstorm effectively, they bring up a few incidental clarifications, and the talk is over.

I said at the outset of this post that mostly I agree with everything Kyle and Caleb say here. My one caution is a point of emphasis, rather than anything stronger. If you follow Kyle's formula, your output will be perfectly serviceable. But will it be a policy, or just a statement of the scope of your management system? ISO 9000:2015, clause 3.5.8, defines a policy as:

intentions and direction of an organization as formally expressed by its top management.

In other words, a policy should answer certain kinds of broad questions before they are ever asked. Do you allow a customer to return a product if it is defective because the customer himself broke it? Yes or No, that's a policy (though not exactly a Quality Policy). On the other hand, "We sell lawn furniture" is less obviously a policy. Now, I can imagine circumstances where it might function as a policy,** and that's why I qualify my disagreement as no more than a point of emphasis. But if you can think of something stronger to say about your attitude or strategic direction with respect to Quality, then say it.

All the same, Kyle's formula generates results that are a lot better than many of the Quality policies currently out there. Kyle talks (6:30) about one company he knows whose "Quality Policy" expresses an aspiration about the kind of company they want to grow into twenty years from now. But that's a Vision, not a Quality Policy. Your Policy has to describe what you are doing today. If it doesn't, and if you hope to be certified to ISO 9001, your auditor will write you up for any discrepancy. Don't give him findings that are so obvious.  

There is one topic on which Kyle admits a little disappointment with his own formula: it is not reliably flashy or inspiring. That is, if you apply this formula, the output might sound a little dull (2:30, 3:30, 32:50). I think if that's the biggest risk you have to worry about then you are doing pretty well. There are a lot of dreadful Quality Policies in the world, and this formula will give you one that works just fine. If you really want one that's flashy and inspiring, well, Quality Policies have to express a commitment to continual improvement, so maybe "Writing a flashier Policy" is a good target for future improvement.  

__________

* When I make reference to the podcast, I will give approximate time markers. These are not exact.  

** For example, if there were a debate within top management whether to start a second product line that was totally unrelated to the primary business, "We sell lawn furniture" might be a useful policy statement to shut down the proposal.   

               

Thursday, June 6, 2024

Can a robot do your audits?

A couple weeks ago, I stumbled across an article from the Harvard Business Review that asked, "Are You Developing Skills That Won’t Be Automated?" The article itself is five years old, but clearly the topic is still a live one: just two months ago, the AI and deep-learning start-up Nanonets updated an article on their website about the use of robotic process automation in internal audits. Much of the discussion seemed to relate primarily to financial audits, but it's only natural for us in Quality to watch the developments closely. 

The Nanonets article emphasizes those tasks at which robots are undoubtedly more powerful than humans: data collection, review for formal compliance (and discrepancies), report generation, and the like. And there is no question that these are areas where automated support is invaluable. While I've never used robotic tools in my own auditing, I've used tools like AHP's iQ-Audit, which automate parts of the report-writing process; and it is always a help not to have to rewrite the same findings and the same boilerplate multiple times.*

But there is more to auditing than just reviewing paperwork and writing reports, even though sometimes in the moment it can be hard to remember that. And here is where the HBR essay becomes relevant. This article looks not at whole jobs per se, but rather at specific skills inside jobs. And it identifies two areas where automation is unlikely.

First, emotion. Emotion plays an important role in human communication (think about [a] physician sitting with [a] family, or [a] bartender interacting with customers). It is critically involved in nonverbal communication and in empathy. But more than that, it plays a role in helping us to prioritize what we do, for example helping us decide what needs to be attended to right now as opposed to later in the evening.

Second, context. Humans can easily take context into account when making decisions or having interactions with others. Context is particularly interesting because it is open ended — for instance, every time there’s a news story, it changes the context (large or small) in which we operate. Moreover, changes in context … can change not just how factors interact with each other, but can introduce new factors and reconfigure the organization of factors in fundamental ways. This is a problem for machine learning, which operates on data sets that by definition were created previously, in a different context. Thus, taking context into account (as a congenial bartender can do effortlessly) is a challenge for automation.

The key to our question today—Can a robot do your audits?—is that an awareness of emotion and an understanding of context are critical tools for a successful auditor. At some level we all know this. 

We listen to emotions, for example, as closely as we listen to words. You are talking to the Operations Manager and ask to look at his shipping records; he says, "Sure, here they are," and hands them to you without a second thought. Then you ask for his calibration records; his eyelids flutter and he stares at the floor. "The … calibration records? Umm … OK … they're … in this file over here." At that point, is there any working auditor in the world who would NOT go over the calibration records with extra care? Of course we all would.

As for context, that can make the difference between trivial errors and disasters. If the Social Events Committee fails to keep minutes on one of their meetings to plan the New Year's Party, that might rate as an Opportunity for Improvement or you might decide not to mention it at all in your final report. But if a design team holds a formal Design Review—and especially if, let's say, that Design Review is a regulatory requirement in your industry—the very same failure to keep minutes could rate as a Major Nonconformity because of the risk it poses to regulatory compliance. When you have to evaluate the severity of what you have seen, context is everything.  

I think this is good news for auditors, because it means we are unlikely to be replaced by robots any time soon. We might even get them to do some of our paperwork for us. Or at least we can hope.

 __________

* This is not a paid endorsement! 😀     

               
