Thursday, December 30, 2021

Finding root causes, Part 1: 5-Whys

Last week I talked about what a "real root cause" actually is, but I didn't say much about how to find them. Maybe a couple more words would be helpful. 

There are several tools you can use to dig out a root cause from under a big pile of symptoms. The simplest one is called a "Five-why analysis," and you can think of it as "problem-solving by a bright, persistent six-year-old." 

It all starts when something goes wrong. Somebody asks "Why?" and you give an answer.

"Yes, but why did that happen?"

Another answer.

"Yes, but why did that happen?"

A third answer.

"But Daddy, why did that happen??"

And so on. Just remember — six-year-olds never, ever get tired of this game.

The system is called "5-Why" but there is no law that you have to repeat the question "Why?" exactly five times. Maybe you can do it with fewer repetitions; sometimes it takes a lot more. But you keep at it until you get to a cause that is fundamental and actionable.

Here's an example.

  • Problem: My car won't start.
  • Why won't it start? The battery is dead.
  • Why is the battery dead? The alternator isn't working.
  • Why isn't the alternator working? The alternator belt is broken.
  • Why is the alternator belt broken? It wore out and was never replaced.
  • Why was it never replaced? I didn't maintain the car according to the schedule in the manual.
  • So the root cause of my car not starting is that I didn't maintain it properly.

Notice a few things about this example. 

FIRST: The most basic point is that the root cause really is a cause. It is a cause in the narrow sense that you can toggle it like a light switch and see the problem disappear or reappear. If I maintain my car regularly, this kind of problem will never happen. If I don't, it's bound to.

SECOND: Each "Why?" is based exactly, word-for-word, on the answer to the previous question. This is important: it keeps you from jumping around, and it makes sure that the analysis has no logical breaks in it.

THIRD: Related to this is another point: you have to be able to read the answers backwards, linking them with "therefore." If you can't, you've made a mistake somewhere in your analysis. In this example, it works:

  • I didn't maintain the car according to the schedule in the manual.
  • Therefore the alternator belt wasn't replaced when it wore out.
  • Therefore the alternator belt broke.
  • Therefore the alternator didn't work.
  • Therefore the battery died.
  • Therefore my car wouldn't start.

Does that make logical sense? Yes it does. But now consider this example:

  • Problem: I was late to work.
  • Why? There was a lot of traffic.
  • Why? I took a different route than usual.
  • Why? It was raining.

If you are not used to the 5-Why method, it can be easy to start down an analytical path like this one because this is how explanations burble up when you ask people what went wrong. And superficially it doesn't sound crazy. But let's rewrite it backwards:

  • It was raining.
  • Therefore I took a different route than usual.
  • Therefore there was a lot of traffic.
  • Therefore I was late to work.

Does that make logical sense? Maybe it makes a kind of sense, but right away you can see some gaps.

  • "It was raining, therefore I took a different route" is missing some explanation of what was wrong with my normal route. Was it closed? Flooded out? 
  • "I took a different route, therefore there was a lot of traffic" is weak too. Did I cause the extra traffic by taking a different route? No, of course not. Maybe I'm trying to say that I didn't know how much traffic to expect on that route because I don't usually take it, but that's not what I actually say. And would there normally have been so much extra traffic on that alternate route, or was the traffic jam caused by something else — like the rain — which makes my choice of a different route irrelevant? 
  • Of course these are little quibbles, and in this example they probably don't matter. But in a real-life example, it matters a lot which causes are relevant because those are the ones you will spend time on.

So no, as it stands this line of investigation has gaps in it. And notice that in this example I just asked "Why ... why ...?" instead of repeating the previous answer each time. If I had repeated the answers word-for-word, I probably would have seen the gaps earlier.
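
If it helps to see the mechanics, here is a minimal sketch in Python (my own illustration, not part of the formal method) that stores the causal chain from the car example and prints it in both directions, so you can ask the "Why?" questions forward and run the "therefore" test backward:

    # The causal chain, from the problem down to the root cause.
    chain = [
        "my car won't start",
        "the battery is dead",
        "the alternator isn't working",
        "the alternator belt is broken",
        "the alternator belt wore out and was never replaced",
        "I didn't maintain the car according to the schedule in the manual",
    ]

    # Forward: each entry answers "Why?" about the entry before it.
    print(f"Problem: {chain[0]}")
    for effect, cause in zip(chain, chain[1:]):
        print(f"Why {effect}? Because {cause}.")

    # Backward: the "therefore" test. Read the output aloud; if any
    # link doesn't follow logically, the analysis has a gap in it.
    rev = list(reversed(chain))
    for cause, effect in zip(rev, rev[1:]):
        print(f"{cause}; therefore {effect}.")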

FOURTH: Sometimes there is more than one answer to a single question. In my example about the car not starting, the fourth "Why?" has two answers: (1) the alternator belt wore out, and (2) the alternator belt was never replaced. But in the next step, I explore only one of them. Why not the other?

In this case it wasn't worth exploring answer (1) in its own right: the answer to "Why did the alternator belt wear out?" is that everything wears out sooner or later. We all know that and it doesn't help us. It's not actionable, because we can't do anything to prevent it. 

So the analysis focused on answer (2), that the alternator belt hadn't been replaced. But sometimes it won't be so obvious which branch is important. In that case, list all the causes as different branches and follow each branch individually. Some of them will trickle out into truisms like "Everything wears out," and then you learn that those branches aren't useful. But sometimes you are surprised by which branches turn out to be relevant.
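
If you are tracking several branches at once, it can help to write them down as a little tree and prune the ones that bottom out in truisms. Here is a rough sketch of that bookkeeping (again my own illustration; the structure and names are invented for the example):

    # Each node holds a cause, whether it is actionable, and its sub-causes.
    tree = {
        "cause": "the alternator belt is broken",
        "actionable": False,
        "branches": [
            {"cause": "the belt wore out",        # truism: everything wears out
             "actionable": False, "branches": []},
            {"cause": "the belt was never replaced",
             "actionable": False, "branches": [
                 {"cause": "I didn't follow the maintenance schedule",
                  "actionable": True, "branches": []},   # we can act on this one
             ]},
        ],
    }

    def root_causes(node):
        """Collect every actionable leaf; those are candidate root causes."""
        if not node["branches"]:
            return [node["cause"]] if node["actionable"] else []
        found = []
        for child in node["branches"]:
            found.extend(root_causes(child))
        return found

    print(root_causes(tree))   # ['I didn't follow the maintenance schedule']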

FIFTH: As I just repeated, a root cause has to be actionable. It has to be something you can correct. I can do something about maintaining my car on schedule; but I can't do anything about the overall tendency of things to wear out with time. For another example, see my discussion of wildfires last week.


So that's how you find a root cause, at the most basic level. Next week I'll talk about two ways you can expand the investigation, to make it broader and more comprehensive.

         

Thursday, December 23, 2021

Real root causes

You hear a lot about "real root causes" in the Quality business. What's that mean and why do you care? 

Why you care is that if you fix a problem without fixing the root cause, it'll just happen again tomorrow. So the idea is not to fix it for a day, but to avoid that same problem — and others like it — in the future. 

So what is a "real root cause"? That's a little harder. Let me start by giving examples of what it isn't:

  • A real root cause is not just a restatement of the problem. Let's say — this is a fictitious example! — that we get a batch of 5000 plastic housings from our supplier, and they all have a big ugly scratch across the front. Obviously no good. We call them to say we're rejecting the lot, and we ask for a root cause analysis of why the housings were no good and how they got through the supplier's QA inspection. Suppose they answer, "The root cause for why the housings were no good is that they've got a big scratch on the front." Is that any help? Of course not. We already knew that. We wanted them to figure out how the housings got that way. In other words, they haven't found the real root cause.
     
    This sounds like a silly example, but in highly technical engineering problems it happens a lot. When the problem is very sophisticated, it is amazingly easy to think you've found a meaningful root cause when all you've done is to re-state the problem in totally different verbiage. Be warned. 
  • A real root cause does not assign blame. I worked for a company once that hired a certain freight-forwarder, and this freight-forwarder mixed up a hugely important (and very expensive) order. We asked them for a root cause analysis. They replied that the root cause was a guy named Fred in their warehouse who was always making mistakes ... and as a Corrective Action they fired Fred! They sent us a copy of his pink slip as proof.
     
    But what good is that? Who's to say that the guy next to him — call him Max — won't make the very same mistake tomorrow? They never actually looked at how they process orders, to see if their system is confusing. 
  • A real root cause has a clear causal connection to the problem. Take the plastic housing example, above. The real root cause will not turn out to be, "Oh that's because Mercury is retrograde in Taurus right now." It also won't be, "Gee, I dunno. These things just happen." 
  • A real root cause is something you can fix. Out here in California we have wildfires from time to time. Sometimes they are caused by natural forces like lightning, and sometimes by human carelessness. The Zaca Fire, back in 2007, was started by sparks from a grinding machine. But clearly part of the reason these fires spread is that Earth has oxygen in its atmosphere. Right? No oxygen, no wildfire. But it doesn't help to call that a root cause unless we can seriously entertain the option of moving to Mars. 
  • "Human error" is not a real root cause. This is a special case of the previous point. We'll never get rid of "human error" any more than we can live on a planet without oxygen. The whole point of a Quality System is that you start by assuming that human beings make mistakes, and so you build in safeguards to reduce the likelihood or the impact of those mistakes to a minimum. 

With those cautions as a background, investigate the causes of a failure. You'll probably end up with a chain like this: "A was caused by B, which was caused by C, which was caused by D ...." Did they lead you to a real root cause? Well, you can test them: 

  • First, make sure that each step in the chain really does represent a causal link that makes sense. You should be able to re-word your whole causal chain as, "D, so as a result C, so as a result B, so as a result A." If it makes no sense that way, keep looking. 
  • Second, make sure there's nothing personal in the causal chain. (More exactly, "Fred had a hangover" might be a legitimate cause but "Fred's an idiot" is not.) The assumption is always that everybody is trying his best to do a good job, so anybody could make the same mistakes Fred made. 
  • Third, make sure there's something you can do about each step. Telling everybody to "be more careful next time" does nothing to solve the problem. 
  • How far do you push it? As far as makes sense. Typically that means push it as far as you can go and still derive some meaningful countermeasures. That's your real root cause.
     

Thursday, December 16, 2021

Basic risk management

A while ago I was talking with a friend who works in retail, and she told me about a time when one of the clerks at her store helped a customer take her [the customer's] bags to the car. Then, as he loaded the bags into the car, the customer’s dog bit him.

There was more to the story, though the rest of the details don’t matter right now. But one of the things I asked my friend was, “You’re on the store’s Safety Committee. Did you update your Risk List to include ‘Getting bitten by a customer’s dog’?” She said they discussed it, but it seemed like something that would happen only once in a blue moon. And in that case, does it really make sense to add it to the list?

This happens a lot – I mean, identifying a risk that shows up only rarely. It’s only common sense to want to know what risks you might be facing, and (for example) the ISO management system standards all require some level of risk identification. (ISO 9001, ISO 14001, and ISO 45001 all put this requirement in section 6.1.1.) But of course you can’t take action to prevent everything you think of, so you need some way to rank your list in order of importance. That way you can plan for the ones that really matter, and let the rest go. But what ranking do you choose? Generally there are at least two questions to consider:

  • How likely is this risk?
  • And how bad will the impact be if it happens?

Anything that scores high on both questions goes to the top of the list. After that, it’s not so obvious. But here’s one simple approach you can take. Please note two things:

  • You can use this approach for any kind of risks. In my story about the dog, I was talking about safety risks. But your marketing team can do the very same thing to analyze competitive risks. Your product developers can use this approach (or a more sophisticated version of it) as an FMEA (Failure Mode and Effects Analysis) to think through potential product failures. Your shipping department can do this to evaluate different logistical methods. It is a very general and very powerful tool.
  • There are a lot of ways to make this approach more sophisticated, depending on the needs of your organization. What I describe here is the simplest possible version.

Step one: Score all of your risks according to how likely they are, using just three values: High, Medium, Low.

Step two: Now score all of your risks according to their impact – how bad things would be if they happened – using the same three values: High, Medium, Low.

Step three: Use these two scores to calculate a priority for each risk, using the following formula:

Priority = Likelihood x Impact

                   Impact
Likelihood         High       Medium     Low
High               High       High       Medium
Medium             High       Medium     Low
Low                Medium     Low        Low

On this scale, for example, “getting bitten by a customer’s dog” would probably rank Low for likelihood but potentially High for impact, for a composite priority of Medium.
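
If your risk list lives in a spreadsheet or a small script, this lookup is easy to encode. Here is a minimal sketch of the idea (the risk entries are invented examples, not from my friend's actual list):

    # Priority = Likelihood x Impact, per the table above.
    PRIORITY = {
        ("High", "High"):     "High",
        ("High", "Medium"):   "High",
        ("High", "Low"):      "Medium",
        ("Medium", "High"):   "High",
        ("Medium", "Medium"): "Medium",
        ("Medium", "Low"):    "Low",
        ("Low", "High"):      "Medium",
        ("Low", "Medium"):    "Low",
        ("Low", "Low"):       "Low",
    }

    risks = [
        # (description, likelihood, impact)
        ("Bitten by a customer's dog", "Low", "High"),
        ("Spill on the sales floor", "Medium", "Medium"),
    ]

    for description, likelihood, impact in risks:
        print(f"{description}: priority = {PRIORITY[likelihood, impact]}")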

Now that you have assigned a priority to every risk on your list, what next? The next step should be to address the important ones. 

  • What does it mean to “address” a risk? If possible, prevent it. If you can’t prevent it, take steps now to mitigate the impact when it happens. Also, consider how you will respond when it does happen: those are your contingency actions. 
  • Which ones are “important”? It depends on what you are doing. At the very least, you should address all the risks with priority = High. Naturally you don’t have to stop there. Maybe you want to address the Medium ones as well, or some of them. Maybe there are steps you can take for a few of the Low risks too, though typically you should think about them last. You have to decide what works for you. But addressing all the risks rated High is pretty much a minimum.

What happens to the risks that you choose not to address? If my friend’s company updated their list of safety risks to include “getting bitten by a customer’s dog” and then calculated its priority as only Medium, they might not plan any action for it. So why put it on the list?

The point is that the priority ratings aren’t static. From time to time you’ll review your list to see if things have changed. As you take mitigation steps, for example, the impact of some risks will drop. The impact of others might rise, depending on changes in the outside world. Back in 2019, most American companies that did disaster planning probably rated “global pandemic” at a very low likelihood; by mid-2020, it had become a simple fact of life. So even if a risk falls below your threshold and you decide not to address it right now, keep it on the list. Then the next time you review the list – next quarter, next year, or whenever – you can think about it again. And as long as it stays on the list, you won’t forget.

      

Thursday, December 9, 2021

Process fragility — or — "People Before Process" Part 3 of 3

In last week's post we saw that there are powerful reasons why companies build up their process systems, while the motives to build up their people are often less obvious or less urgent. But the week before, we saw that in the long run it looks more important for an organization to have the right people than to have the right process; because good people will improve a bad process, while bad people will degrade a good one. What does this mean? Is it just one more case where the easy and obvious motives line up in support of short-term benefits at the cost of long-term ones?

Maybe so, but in the full picture we can see other things as well. The first of these is that a reliance on process is fragile, while a reliance on competence is resilient. Let me tell you a story.

Once upon a time, I helped to support the Quality system in a small factory. The factory had run successfully, under one owner or another, for most of the 20th century; some people operating the lines had worked there all their lives and were nearing retirement age. Recently the factory had been acquired by a new owner, and part of the "post-merger integration" was to implement the new owner's QMS across the board. This meant, among other things, generating Control Plans for every factory operation — something that had never been done before. A couple of manufacturing engineers were assigned the task; they made a quick inventory of all the things the factory could do, listed the steps for each in an Excel table, and published the results as Control Plans.

Then one day it was time for our external surveillance audit. In order to audit section 8.5.1 of ISO 9001:2015, the auditor asked for a Control Plan. We offered him several, and he picked one that covered the plating bath. Then he walked out on the line to watch it in action. Right away he discovered that one of the vats was at the wrong temperature. The defined reaction in the Control Plan was, "Stop the line and call the manufacturing engineer," but the line was still running. Our auditor had been watching the process for less than five minutes, and — presto! — he found a Nonconformity.

When the day was over and the auditor had gone back to his hotel for the night, my boss and I walked out on the line to ask the operator what was going on. What was he thinking, to keep running the line when the temperature was significantly outside of range? He wasn't the least bit bothered. He explained that the plating reaction depended on both the temperature of the bath and its chemical composition. When he saw that the heater for one vat was malfunctioning, he changed the chemical composition of that specific stage of the bath to compensate. The final output would be indistinguishable; the customer would get exactly what they ordered, and there would be no need to delay this production order. And a good thing too, because he happened to know that the responsible manufacturing engineer was on vacation for another week yet. But he assured us it was all fine. The product would be correct, and the customer would be happy.

"All fine" is a matter of perspective, of course. My boss and I had to do a lot of talking to persuade the auditor to rate this Nonconformity as a Minor and not a Major. But from the customer's perspective it really was "all fine." The product that shipped to the customer really was going to be indistinguishable from one that had been made at the right temperature and with the defined chemical bath. This means two things:

  • From the perspective of the audit, the finding really should have been a Major Nonconformity, because the system was absolutely not working the way it was defined (on paper). The written Control Plan said that if anything was out of adjustment, the whole process should stop until the responsible manufacturing engineer could review the situation and instruct the operators what to do. (And that would have been another week, at least.)
  • But if the organization had followed the written Control Plan, the order would have been a week late — needlessly! In this particular case, the operator himself already knew exactly what to do because he was so deeply familiar with the process. Because the operator could rely on his own competence, work did not stop ... the order was not late ... and the customer was not disappointed. Because the operator could rely on his own competence, the organization could confront an unexpected problem and then roll with it — resiliently.

It still should have been a Major Nonconformity from the perspective of the audit. But probably the operator never even looked at the Control Plan. That should have been another Nonconformity, come to think of it.*

This is what I mean when I say that relying on process is fragile. No written process can possibly cover all situations that might arise, so every written process runs the risk that one day the organization will face a situation that the process does not address. When this happens, the process breaks down. But relying on competence is resilient, because a well-trained expert with deep knowledge of the process can figure out a response to any unexpected situation, with a high probability of getting it right.

Notice something else. The whole pattern of thought and planning that underlies modern industrial capitalism favors this fragile, process-based approach over the resilient, competence-based one. For consider:

  • On the one hand, someone who is simply trained to follow a process (and no more than that) is unprepared to solve problems or handle novel situations on his own. But he is a lot cheaper than the employee with wide experience and deep knowledge.
  • On the other hand, most of the time your organization shouldn't be facing problems or novel situations.
  • Therefore in principle you shouldn't need your line operators to have wide experience or deep knowledge. If you have one knowledgeable problem-solver for every ten ignorant line operators, that should give you plenty of coverage for the number of problems you are actually likely to face and it's a lot more cost-effective than training everyone.
  • What's more, this arrangement means that your line operators are interchangeable human resources. You can move them wherever you need them in the organization. As long as they know how to follow procedures, you can use them to carry out any task that has been defined by a written procedure. And this gives you far more flexibility than you would have if they were tied to specific tasks because that's all they knew. This is what you want.

But look where this line of calculation takes us. By following the ordinary patterns of thought and planning that underlie modern industrial capitalism, we end up adopting a policy towards our people which has been proven to be very powerful, and which supports indefinite expansion; but this same policy makes our whole organization more fragile, and risks bringing us to our knees if something truly unusual happens.

How can this be? Is there something wrong with the theory?

Well yes, in a sense. Peter Drucker argued for years that our economy is no longer truly "capitalist" because Capital is no longer the most important factor of production. Capital is almost irrelevant these days, because it can be crowdsourced — either in a traditional manner, by issuing shares of stock; or in a contemporary manner, by launching a campaign on GoFundMe. The critical factor of production today, in Drucker's argument, is Knowledge; and the most critical member of any organization is the knowledge worker. (Drucker argued this point in many places but see for example his Post-Capitalist Society (1993).) A knowledge worker is any employee whose unique value comes from the knowledge he carries in his head. And because that knowledge is always of something specific, knowledge workers are in general not interchangeable. (If you have too many quality auditors, it is typically not easy to repurpose some of them as accountants or design engineers.)

Note also that the story above about the plating bath shows that even line operators can be knowledge workers. As a result, the whole approach of treating line employees as interchangeable units starts to look misguided or (at best) out of date.

None of this is to deny that a process focus really is very powerful in the short run. But if anything happens to interrupt normal operations — if, ... oh I don't know, ... say a global pandemic throws the Designated Problem-Solvers out of the office at the same time that it disrupts all the organization's supply chains — then an organization that has relied on a process-focus will be in deep difficulties, while an organization that has built up the competence of all its employees will be able to roll with the changes and adapt.

This development is something that we in the Quality business need to understand and pay attention to. We've heard the message before: W. Edwards Deming insisted in his fourteen key principles on the need for training on the job (point 6), for breaking down barriers between functions (point 9), for pride of workmanship (point 11), and for a "vigorous program of education and self-improvement" (point 13). But we have yet to integrate these concepts into the "common sense" understanding that all Quality professionals carry around with them. We have yet to rewrite our standards — like ISO 9001 — so they give as much attention to people as to processes.

We can do this. Once upon a time, we Quality professionals didn't all think in terms of statistical variation, but now we do. Once upon a time we didn't all think in terms of business processes, but now we do. We can absorb this change just like all the others. But we need to start.

__________

* The attentive reader will have noticed that I describe the very same action as resilient (as well as good for both the customers and the company) and a potential Major Nonconformity. How can it be both? Aren't audits supposed to improve the company's behavior? Or am I trying to criticize audits as counterproductive?

I'm not criticizing audits per se, but the usefulness of any audit depends critically on the usefulness of the management system documentation that you are auditing against. In this case, the root cause of the finding was the slapdash way that the company threw together their Control Plans, aiming to get something written so they could check a box rather than thinking through what the controls should really be. Since what the operator actually did to respond to the condition was correct, it should have been permitted as one option under a proper Control Plan. Or else perhaps the operator's deep knowledge of the process could have qualified him to be designated as a responsible "Manufacturing Engineer" for this particular production line.

In real life, the company analyzed the audit finding and realized that all their other Control Plans were probably just as bad. So they started over from the beginning and rewrote the lot of them more carefully. It was the best possible response to that finding, and I was glad that's what they chose.         

Thursday, December 2, 2021

"Why do we always revert to process?" — or — "People Before Process" Part 2 of 3

After I published last week's post, I got a note from Jeff Griffiths asking why I think we in the business world regularly put so much emphasis on process. Among other things, he wrote, "I think the reason organizations always seem to revert to process is that developing people and actively managing competency is hard work, and most front-line leaders aren’t trained to do it, and they certainly aren’t invited to do it. What’s your experience been?"

Of course he's right. But his note got me thinking. And the longer I thought about it, the more reasons I could see that organizations choose to emphasize process development.

In the first place, of course, there are the undeniable benefits of the process approach. A process focus supports continual improvement, by letting you understand the overall flow of your operations so you can see where there are blockages. A process focus permits standardization across functions or locations. A process focus facilitates interaction of multiple functions across an organization. And a process focus makes possible the "checklist effect," which enhances the performance even of deeply trained experts like pilots and surgeons. All of these are good things. Nothing that follows is meant to minimize any of these genuine benefits, and I would never suggest that you try to do without defined processes!   

But there are other reasons for the focus on process, and not all of these reasons are so obviously wholesome.

One reason, for example, is that Quality professionals — people like me — push the process approach so hard. That (in turn) is because the process approach is a major focus of the ISO 9001 standard. Section 0.3 of the Introduction is about nothing else. In the normative sections of ISO 9001:2015 (I mean chapters 4-10), the word process or processes occurs 57 times. By contrast: competence or competent occurs 9 times, training occurs twice, people occurs once, and skills shows up only in Annex B (well outside the normative chapters). So organizations can be forgiven for thinking that process is something they have to emphasize.
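
(If you want to check a tally like this against your own copy of the standard, a few lines of Python will do it. The filename below is a placeholder; you would need to extract the text of chapters 4-10 yourself, and the exact totals will depend on the edition and on how you slice it.)

    import re

    # Placeholder path: your own extracted text of ISO 9001:2015, chapters 4-10.
    with open("iso9001_2015_ch4-10.txt", encoding="utf-8") as f:
        text = f.read().lower()

    for term in ["process", "processes", "competence", "competent",
                 "training", "people", "skills"]:
        # \b...\b keeps "process" from also matching "processes"
        count = len(re.findall(rf"\b{term}\b", text))
        print(f"{term}: {count}")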

Another reason is that organizations know how to write processes — even if they don't follow all my advice from last summer, they can write something that's good enough to get by — but they often don't know how to develop their people. As Griffiths wrote to me, most front-line leaders aren't trained to do it. I can personally confirm that when I first became a manager, nobody took me aside to train me how to develop my people — nor even to give me some basic pointers. Of course there are companies that are happy to step in to help with the task — Griffiths himself works for one — but it seems like there aren't nearly enough of them, and of course none of them works for free.

Related to the foregoing is the simple fact that a process focus is easier and cheaper than a competence focus. A process focus is narrow and finite, while a competence focus is potentially infinite — there's always room to learn more and get better. This means that a process focus is easier to replicate at scale, and therefore supports expansion.

For example, McDonald's is well-known as a process-focused business. Everyone knows that McDonald's has defined an exact procedure for every aspect of running each restaurant. The result is that McDonald's has spread across the globe, with more than 37,000 stores in 120 countries. What is more, the food they serve (with the exception of minor regional specialties) is absolutely uniform. When you place an order in a McDonald's, you know what you are going to get.

Contrast this with a restaurant that is not so process-focused, such as Le Bernardin in New York City. No doubt Le Bernardin uses some recipes, and there must be procedures for how to take reservations or manage the flow of customers. But there is also room for a cook to express personal artistry. And while Le Bernardin has been — in its class and with respect to its own (very different) criteria — every bit as successful as McDonald's, nonetheless there is only one.

For a more dramatic example, consider the situation of American war materiel when the country first entered World War Two. Nearly all weapons required precision optical sighting devices to aim them, and the best lenses in the world were all ground in Germany by deeply trained experts who had served long years of apprenticeship. So far as anyone knew at the time, that was the only way to get lenses ground. Peter Drucker tells the story:

Belief in the mystery of craft and skill persisted, as did the assumption that long years of apprenticeship were needed to acquire both. Indeed, Hitler went to war with the United States on the strength of that assumption. Convinced that it took five years or more to train optical craftsmen (whose skills are essential to modern warfare), he thought it would be at least that long before America could field an effective army and air force in Europe—and so declared war after the Japanese attack on Pearl Harbor.

We know now [Frederick W.] Taylor was right. The United States had almost no optical craftsmen in 1941. And modern warfare indeed requires precision optics in large quantities. But by applying Taylor’s methods of scientific management, within a few months the United States trained semiskilled workers to turn out more highly advanced optics than even the Germans were producing, and on an assembly line to boot. And by that time, Taylor’s first-class men with their increased productivity were also making a great deal more money than any craftsman of 1911 had ever dreamed of.

(You can find Drucker's article, from which these paragraphs are extracted, here. There is a short video about the same topic, made in 1945, available here.)

In short, there are powerful motives pushing businesses to adopt a process focus, even if it comes at the expense of putting similar effort into developing their people. And yet we saw last week that in the long run, investment in people is more important than investment in process. At the level of the individual firm, this is something to watch. Be careful that you don't overemphasize an approach that will give you only a limited return on your investment.

There are also implications for how we understand the broader economy, and in particular for how we Quality professionals practice our craft. I will address both of those topics next week. 

                     

Thursday, November 25, 2021

"People Before Process"

A few weeks ago, I had the good luck to attend a webinar called "People Before Process." This webinar was a real treat. It was clear, engaging, and insightful. And it touched on themes that I have discussed before about the role that defined processes play in achieving Quality — in getting what you want.

Before I forget, here are a few particulars. The talk was sponsored by the ASQ's Quality Management Division, and the speaker was Jeff Griffiths ("About" page, LinkedIn). If you are a member of ASQ, you can access the video here. But he discusses some of the same concepts in less detail here (no need for an ASQ membership) and you can find him in a number of video conversations here.

Griffiths's fundamental point throughout these talks is that, if you want to get results, there is no substitute for having the right people to get you those results. Starting from that foundation, he then discusses various ways that an organization can plant, nourish, and grow the needed competencies in their people. In the webinar I joined he introduced the Dreyfus model to distinguish multiple levels of skill acquisition, and described client quality problems that his firm had helped resolve specifically through enhancing worker competency rather than by introducing new procedures.

To make the point that people are more important than process, Griffiths proposed an interesting thought experiment. Suppose, he said, you have a table like the one below, and that your organization can fall in one of the four quadrants depending on whether your people are strong or weak and whether your processes are strong or weak.

                     Weak processes     Strong processes
    Strong people    Quadrant III       Quadrant II
    Weak people      Quadrant IV        Quadrant I

Of course anyone who has a choice wants to be in quadrant II, with strong people and strong processes. Likewise we can all agree that our last choice is to be in quadrant IV. That part is easy. But what if we have to choose between quadrant I and quadrant III? Griffiths argues that we are far better off in quadrant III, because if the people in the organization are fundamentally strong and competent — but they have been saddled with processes that are weak or badly-designed — the people will change the processes into something that works better, and thus will pull the organization in the direction of quadrant II. But if you have weak people — poorly trained, uncaring, or actively disengaged — the best process in the world can't overcome them.

Quadrant III is better because it is temporary: it is always pulling towards II.

Is it true? I think so. Even when an organization is not in control of their own processes*, they can make improvements at a daily level by interpreting the rules so they support the work, applying and enforcing them in ways that are helpful and productive. There is no single instruction for how to do this that fits all cases, no one-size-fits-all formula. Each situation has to be evaluated on its own. But in my experience it is possible.  

When I look at the table, I see something else too, something Griffiths never says explicitly. I bet that quadrant I is always pulling towards quadrant IV. Think about it. Suppose you have an organization with an excellent set of written processes, but where the people are poorly trained for the work and don't understand the processes — or don't care. What happens? That's easy: Scott Adams has made an entire career writing about it in Dilbert. Look at all the jokes made at the expense of ISO 9001: the standard itself is more or less a body of formalized common-sense, but when it is badly implemented or badly-applied it becomes a punchline. And so, bit by bit, processes which were once helpful and robust are misused and misapplied; enforcement is either too strict or too loose (or veers unpredictably between one and the other); the processes thus become obstructions rather than enablers; and the organization drifts from I to IV.

If people are so much more important than process, why do I write about process? The simplest answer is that I write what I know; further, I never said process was irrelevant. Obviously your business processes and the structure of your QMS still make a significant difference to your outcomes. But the overriding theme of this blog is that any QMS has to be applied pragmatically; this means that the system itself can never solve all your problems. The centrality of your people is one huge reason why not. 

If you find yourself wondering what to do about that fact and where to turn next, check out Griffiths's organization and blog. You'll find some advice there. 

__________

* This can happen in, say, a global company that requires all units to follow the same processes even if they do different work.


Thursday, November 18, 2021

Do audits really add value? Part 3 of 3

In the last couple of weeks, I've discussed the question whether external, third-party audit results are reliable. On the one hand, I've given reasons it is fair to be suspicious of them; on the other, I've had experiences where they have proven uncannily perceptive. What's the middle ground, the synthesis of these two conflicting positions? Last week I tipped my hand by saying that "yes, we can trust our audit results, provided that we understand clearly what the job of an external audit really is and don't expect it to do something else instead." In what follows, let me try to spell out what that job is.

What an external audit is not

In the first place, an external audit is not a complete health-check. After all, it is a commonplace that auditing is a sampling operation. Third-party auditors routinely remind clients of this during their Opening or Closing Meetings. And in an earlier post I mentioned an audit instructor who said clearly, Any time you do an audit, there will be minor nonconformities that you will miss. Even if your organization passes, that doesn't guarantee that everything is perfect. Notice that for this reason, it is not necessary for an auditor to make the experience painful for the client, because he's not even trying to catch everything. So if someone (like the fellow I quoted two weeks ago) says that auditors are taking a "kinder and gentler" approach than they did back in the early 1990's, that doesn't have to be a problem.

That's what an external audit does not do. Now what does it do?

Enforcement

The first thing an audit does is to support the enforcement of the organization's Quality Management System. Every QMS involves imposing a set of rules on the organization; and no matter how engaged the employees are, there will always be someone who thinks that this particular rule shouldn't apply to him. And maybe for a while he gets away with it: management's attention is somewhere else, and his colleagues don't feel like leaning on him. But sooner or later somebody schedules an audit. And then the message — from management and colleagues alike — suddenly becomes the same: Dude, even if you think the rule is dumb you have got to comply with it or else the auditor will write us up.  And that message is often convincing even when no other message has worked.

Fresh eyes

Sometimes there is something wrong in your system which you know is wrong, but you walk past it every single day and after a while you stop seeing it. I had something like this happen to me. The local organization I worked in had a procedure for processing 8D problem reports. It was based on a global procedure that covered our whole division worldwide, but there were local adaptations for one reason and another. Anyway, the global procedure changed, which meant that we (that means I) had to change the local one to match it. The adjustment was straightforward; I knew exactly what I had to do. So I put it on my to-do list. This was six months before our next surveillance audit.

You know what happened next. One thing and another interrupted me before I could work on it immediately, and then it slid far enough down the list that I didn't see it often. Occasionally I would notice it and remember, Oh right, I still have to fix that. But about that time another problem would cross my desk and I'd forget again.

And then our auditor showed up. As we reviewed the corrective action system, he asked to see our 8D procedure. I gave it to him, and he read to about page 2 where he suddenly asked, It says here you process 8Ds like this. Is that true? And then I remembered, Oops! Not any more we don't. I was going to fix that, wasn't I? Of course he wrote a nonconformity, and to answer it I finally updated the document correctly. It doesn't seem like a big issue in the grand scheme of things, but if he hadn't written that finding I might never have remembered to do it. And as I discussed last summer, it actually does matter that your procedure documents be accurate.

System integrity

There is at least one more job that an external audit does reliably. It guarantees the overall integrity of the system. To explain what I mean, let me tell another story.

Years ago, I worked in a place that was struggling to implement a disciplined QMS. We had gotten ISO 9001 certification, but keeping things at a sustainable level was a challenge. It seemed like every year I was writing Major Nonconformities in our internal audits.

So after one of our external surveillance audits, the General Manager took a few minutes out of his next staff meeting to complain that the audit process was useless.

Me: What do you mean "useless"?

General Manager: Well that guy spent a few days here, he seemed to talk to everyone, but then he gave us a clean bill of health! What's wrong with him? Didn't he see that we had seven Major Nonconformities in our internal audits? How could he say that our system is working OK?

I wasn't sure how to answer, so after the meeting I forwarded that question to the auditor (whose contact information I had kept). And he answered:

Auditor: Yes, I saw those seven Majors. But you found them, didn't you? They were all clearly stated in the internal audit report; and when we checked the action plans, the root-cause analyses looked reasonable and the corrective measures were on-schedule. The system was working exactly the way it's supposed to work.  

Then he went on.

Auditor: Look, if you want me to come out there, photocopy your internal audit results, sign my name to them, and then spend the rest of the week in a bar — and get paid for it — I can do that. But that's not going to give you a lot of value. So it's more important to me to make sure that your overall system is hanging together and functioning the way it should. Of course you're going to have problems or hit bumps in the road. That's normal. The important part is how you react to those problems, and right now you guys are doing fine.

And that's what I mean by guaranteeing the overall integrity of the system. This is why an external auditor doesn't have to find every little thing the organization is doing wrong: because if the system is working correctly, the organization will find those problems themselves. Therefore the one critical thing that the external auditor has to ensure is that the system itself is working.

This point relates also to our earlier discussion of the difference between Minors and Majors. Two weeks ago, when I listed reasons to be suspicious of audit results, most of those reasons applied to Minors. Didn't they?

  • If auditors used to strain at gnats and no longer do, that has to mean that they used to write a lot of Minors and no longer do, because writing Majors has always been the exception. 
  • More to the point, think about the external audit that started this whole train of thought, where the auditor asked a few simple questions and then wrapped up the audit. What made that possible was that the overall system was functioning just fine — and in that office, by that time, it was. Yes, if he had been more focused he could have found a few Minors for us to chase after. But fundamentally that's not what we needed from him. We had internal audits for that — and customer complaints, and nonconforming material reports, and the whole armamentarium of Quality Management tools. What we needed from him was assurance that the system was intact, and it was.
  • And while experts certainly disagree, I would argue that they are a lot more likely to disagree over Minors than over Majors because Minors are one-off failures. They are almost incidental. And therefore there is a lot more room for personal, subjective judgement to come into play. Majors, on the contrary, are by definition failures that endanger the system. My old instructor might have been exaggerating when he said that "if the organization has Majors, you will know it by the time you reach the Receptionist's desk!" But it is pretty hard to mistake a system breakdown for a one-off failure, or vice versa.   

From this point of view, the most important job of the external auditor is to find and report Majors, if there are any. Minors are lagniappe. If the external auditor happens to find them, of course he reports them; but if he doesn't, somebody else will. On the other hand if the system has broken down, that "somebody else" might never come along. So the external auditor has to report on Majors. 

And for that reason, as long as we remember the difference between what external audits must do and what they cannot pretend to do, we can continue our audit programs with a good conscience.

         

Thursday, November 11, 2021

Do audits really add value? Part 2 of 3

Last week I asked whether we can ever trust the third-party audit process, and suggested two (or maybe three) reasons we might not: registrars go easy on us because they don't want to lose clients, experts disagree, and (although this last point is a very individual matter) there are a few external auditors who are kind of goofy. (We've all met one somewhere.)

On the other hand, I've also had experiences that run smack in the other direction. Let me describe two.

One time I worked with an external auditor who I thought was going to end up in the "goofy" category. She was a chatty little old lady, who always started off her interviews by talking about her vacations or her grandkids, and who invariably wound up her interviews ahead of schedule. Then as soon as the auditee had left the room she'd ask me to step outside with her so she could have a cigarette. All through the audit, I had people comment to me quietly that they were amazed how smoothly it was going. And then at the end she wrote us three nonconformities which exactly nailed the three places we were having the most trouble. Her style was so relaxed that it put everyone off-guard, but her questions probed deeply — and got there fast.

There's also a phenomenon that I have experienced when I do internal audits, and that I have seen play out in external audits as well. I have sometimes jokingly called it a Special Providence for ISO Auditors. It works like this: 

  • The client has a drawer full of 100 files. 
  • Ninety-eight of those files are perfect. Two are wrong. 
  • You, the auditor, close your eyes and randomly pull three files out to check.
  • One of the ones you pull out will be wrong.

It's uncanny how often this works. I've had it happen to me when I do internal audits, and I've watched external auditors do the exact same thing. Those auditors wrote us up, too.
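
Just to put a number on "uncanny": with the figures above, a genuinely random three-file sample should turn up at least one of the two bad files only about six percent of the time. The arithmetic is a quick hypergeometric calculation (a sketch, using the numbers from my example):

    from math import comb

    total, bad, sample = 100, 2, 3

    # P(all sampled files are good), then its complement.
    p_all_good = comb(total - bad, sample) / comb(total, sample)
    print(f"P(at least one bad file): {1 - p_all_good:.1%}")   # about 5.9%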

So where does this leave us? Last week I talked about reasons not to trust audit results. But this week I've discussed two reasons we can trust them: trained auditors can be amazingly perceptive, and problems or errors seem to jump out in front of them. Which is it?

My answer is that yes, we can trust our audit results, provided that we understand clearly what the job of an external audit really is and don't expect it to do something else instead. I'll talk about the real job of an audit in next week's installment.

     

Thursday, November 4, 2021

Do audits really add value? Part 1 of 3

Auditing is a fundamental part of any management system. In the four-step cycle of "Plan-Do-Check-Act," auditing is the core of step three, "Check." Without audits we would never know if our systems were working.

But there are internal audits and external audits. And is the information we get from external audits actually reliable? Is it useful? Or are there institutional or organizational factors that skew or corrupt the auditing process?

I think the real picture is not a simple one, so I'll break up my answer into three parts, all based on my own personal experience. In the first post — this one — I'll give reasons to suspect that maybe external audits don't really add value. In the second post, I'll backtrack and give reasons that maybe they do add value after all. And in the third post I'll try to find a middle ground that does justice to both opinions.

You may be able to guess where I'll end up before I get there, or you may have opinions of your own. Either way, feedback is always welcome. Please feel free to comment.

My thinking along these lines all started after a singularly unimpressive third-party audit. The auditor walked around, asked a couple of aimless questions to which we gave answers that he apparently found acceptable, and wound up. All done. 

Afterwards I sat around talking with the site management, and we tried to understand what had happened. Partly it was just that this particular guy was a little goofy, and we've all met auditors for whom we could say the same. (Christopher Paris even jokes about it in his auditor cartoons.) But the General Manager remarked that our experience that day seemed, in his mind, about par for the course. He said that all the auditors he could remember seemed to have been more-or-less colorful individuals, but he couldn't think of any findings that made much of a difference to the organization. So what's the point, he asked. Is there any real value that we get out of interrupting operations for a day or two while we host these guys? Do their reports actually help us build a better organization? Or is this all just a song-and-dance we have to go through as part of the price for getting our certificate? Because we really do need the certificate. But is that all we're getting?

In reply, I explained something that another third-party auditor had described for me several years ago. When ISO 9001 was first introduced, so this man had told me, auditors were very strict. Getting certification was a big deal. But over time there were more and more registrars available, and they were in competition with each other. This meant that if a company was refused certification by, say, BVC or BSI, they could always call up DNV or DQS or Perry-Johnson and try again. The risk of losing repeat business pushed the registrars to grade more and more leniently, and to market themselves as “partners” in improving your management systems. They wrote fewer nonconformities and more “opportunities for improvement” … suggestions for things you might want to consider. 

The General Manager listened to all of this politely. Then, bless his heart, he jumped in with exactly the right question: You know who else behaved like that? The bond-rating agencies, back in 2008. And see how well that turned out for everybody!

Even when auditors aren't deliberately throwing slow softball pitches, though, there's another risk to the reliability of any audit report. Experts disagree. Often they disagree a lot.* If you have ever performed an audit and then discussed it with another trained auditor, you know that no two auditors will choose to include exactly the same data points; and even if they can agree on a particular finding, it's not infrequent that they'll rate it at different levels of severity. We always ask for "objective evidence" to justify any finding, and rightly so. But what we then do with that objective evidence depends more than we like to admit on subjective assessments and personal expertise.

So if audit reports are, in the end, based (at least partly) on the auditor's personal, subjective opinions — and if auditors are under institutional pressure not to be terribly hard on their clients — this brings us back to the original question: Do audits really add any value? And if you look at it from one point of view, you wouldn't be crazy if you suspected that the answer might be No.

But don't touch that dial. I'll come back next week to look at the question from another point of view.

 

__________

* You can find this topic discussed at great length in, for example, the medical field, where it would seem to be a matter of life-or-death to get the answers right. See for example this article from 2008, this one and this one from 2010 (both based on the same book), or this one from 2020. This extreme variability is one of the reasons that medicine relies more and more often on objective checklists instead of personal expertise, as described here.

Thursday, October 28, 2021

"Is there anything you want me to write up?"

Over the years I've gotten to work with a lot of other auditors, and I've learned something from each of them. Sometimes they've just had a really interesting perspective on the work of Quality: I remember an external auditor who explained over lunch that a few years before he had started his own business (something unrelated to Quality) and it failed. When he analyzed the failure, he concluded that the root cause was that he didn't know how to run a business. So he trained to become a Quality auditor, which would allow him to look at many other companies and study how they were run. His plan was that when he finally felt he had learned how to run a business, he would quit auditing and try again.

But sometimes I've learned techniques. One of the most surprising was when I was working with a colleague on an internal audit, and as we were about to wind up one interview he asked, "Just one more question: Is there anything you would like us to write up? That it would help you for us to write up?"

Wait, … what? In my experience most auditees look at an audit like some kind of oral exam: the last thing they usually want is to volunteer something to be written up.

But my friend was completely serious. He pointed out that as internal auditors we are there to help the organization improve. And all we are ever able to see is a sampling. So maybe there's something that isn't working the way it should, but that we missed. And if it would help our auditee for us to look at it, why not ask for it?

In fact the auditee said yes there was, or at least maybe. She wasn't sure, but what did we make of these project requests she had gotten yesterday? We looked at them, and they were requests for her to set up and track projects where half the estimated durations were blank and the costs were listed as "Don't know." We agreed that this didn't look like enough for her to work with, but for various reasons there wasn't an easy way for her to push back. (That sounds unlikely the way I'm describing it, but I'm leaving out a lot of details.) Most of the project requests that she got were just fine, so there didn't appear to be a systemic problem. But we did write a minor nonconformity that she was being asked to move forward without the planning data she needed to do her job.

In the big picture, that was probably the most meaningful finding we wrote during that entire audit, the one most likely to help the organization improve. And we would never have gotten it if my friend hadn't asked for it.

I've used that question ever since. Often the answer is No, but even then I think it helps the auditee see the audit differently. It helps make the point that this really is a collaborative effort — that we really are on the same team.

Of course there are risks to watch for, when you ask a question like this. Once in a while an auditee will take this as an invitation to air some personal grudge against a coworker, or to try to score political points in a fight between departments. Obviously you have to watch for those and can't write them up.

But it's a good question. And I'm grateful to my friend for having taught it to me.


Thursday, October 21, 2021

Auditing and consulting

My last couple of posts (see here and here) have suggested a kind of relationship between an auditor and the audited organization that has real risks, so let me talk about them briefly.

There is a fundamental principle that auditors must not do consulting. The difference is that an auditor tells you what's wrong (how your organization is deviating from its requirements) and a consultant tells you how to fix it. The reason to keep these roles separate is that combining them poses a temptation for the auditor to abuse his authority: first, he writes up a list of nonconformities; second, he comes back charging $500 an hour to tell the organization what they have to do to clear the nonconformities; third, he comes back next year to see if they did exactly what he said — and if not, he writes more nonconformities, ad infinitum. Permanent employment for the auditor, but really bad for the client. If you separate auditing from consulting, you prevent this cycle. So the general rule is, "I can tell you that you don't conform to your requirements, but I can't tell you how to correct the problem."

The basic principle is a good one, especially in the case of an external (or third-party) auditor who gets paid every time he shows up on the premises. But for internal auditors the distinction between auditing and consulting is often not so practical. In the first place, unless the organization is quite large, whoever does the internal audits is very likely the same person who will be assigned to lead or coach the corrective action team, because there is literally no-one else available and qualified. In the second place, even during the audit itself it's not unusual to hear the question, "Why is it wrong to do what I'm doing? I don't understand what that paragraph of the standard even means. What should I do differently so that I'm not violating the requirement?" When someone asks you a question like that, the line between explaining the finding and consulting on how to fix it becomes so thin it almost disappears.

In my last couple of posts, I said that sometimes you might talk to the organization's management before rating a finding, or you might take into account topics like the organization's overall level of maturity. This advice is most appropriate in internal audits, where the distinction between auditing and consulting is already compromised for the reasons I described above. What about the risk that the auditor might abuse his authority? In the internal case that risk is minimized, because if the auditor starts asking for something crazy, the department can easily escalate over his head to his manager and ask for intervention. And when you are all on the same team — when you are all paid out of the same payroll — there is no advantage to the auditor in demanding things that don't make the company healthy and prosperous.

It is important to understand that there is a difference between auditing and consulting, and also why the line between them is drawn so sharply. But when it comes to working in the real world, as with everything else, what you do depends on risk and judgement: what are the concrete risks here and now, and how do you judge that you can best address them?


Thursday, October 14, 2021

Minors and Majors

Last week I talked about how to distinguish Minor Nonconformities from Opportunities for Improvement. Now I'll review the difference between Minor Nonconformities and Major Nonconformities.

The difference to the organization is that Majors get a lot more attention and typically require a lot more work to close. If the auditor from your registrar raises a Major in an external audit, it can block your certification or recertification. Depending on the finding and the contract with your registrar, you may have to pay for a re-audit within a specified time frame (far sooner than you were planning for!) to prove that the nonconformity has been corrected and permanently prevented. Because the consequences for external Majors are so significant, organizations frequently define heavy procedures to handle internal Majors — so that they get immediate and sustained management attention, to make sure they have been resolved before the external audit.

In short, Minors can be comparatively innocuous but Majors are The Scary Ones. But what is the real difference? When do you write a Major?

The definition can be found in ISO/IEC 17021-1:2015, and it relates to the idea of a Quality Management System (QMS). Briefly, if the failure is a system failure, it's a Major; if not, it's a Minor. More exactly, according to definition 3.12 a major nonconformity is a:

nonconformity that affects the capability of the management system to achieve the intended results. 

In the same way, definition 3.13 tells us that a minor nonconformity is a:

nonconformity that does not affect the capability of the management system to achieve the intended results.

But what are "the intended results"? In a broad sense this probably has something to do with healthy operation and customer satisfaction; but in a narrow sense, surely every single procedure in the organization has as one of its intended results that everyone in its scope should comply with it. And if you read the term that way, counting compliance with every procedure as an "intended result," then any failure at all would count as a Major. Clearly that can't be the right way to see the question.

In casual conversation, the difference is usually described in terms of extremes: a Major is "a total breakdown of the system," while a Minor is "an isolated one-off error." Of course this leaves most nonconformities somewhere in the middle, with the auditor having to decide whether a finding is more like the first or more like the second. 

For example, suppose the organization has defined a specific template for all their internal documents; but when you examine these documents during the audit you find that nobody uses this template except the Quality department. Is that a Major or a Minor?

On the one hand, it's clearly not "an isolated one-off error," since you see the very same error almost everywhere. It certainly looks like the error is "systemic," or at any rate like there is no functioning system for introducing document templates and making sure everyone uses them.

On the other hand, will the organization take it seriously if you write a Major for document templates? 
  • Some will, especially if they have contractual requirements to other interested parties related to the use of those templates: but those organizations won't have this finding in the first place. 
  • An organization where this finding turns up is an organization that doesn't see any reason to care about internal document templates — and is therefore an organization that will never take such a Major seriously.
  • But can you as an auditor, in good professional conscience, justify calling it a Minor? Can you honorably say that it meets the definition of a Minor? 
    • It depends. If it's an internal audit, consider talking to them. Ask top management — the people who will receive your report when you are done — whether having a uniform format for internal documents is one of their "intended results." If not, then this finding does not affect the achievement of "intended results," and could count as a Minor. (I sketch this decision logic below, after the list.)
  • That doesn't mean that any time the organization doesn't care about something it's a Minor, of course. Some "intended results" (like customer satisfaction or legal compliance) are so serious that the organization has to care. So be reasonable.
  • Note also that as the organization matures, so will their list of "intended results." Maybe in a few years they will have reached a place where they take document formatting more seriously. By the time that happens, though, you probably won't see this particular finding any more.
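
If it helps to see that decision logic laid out mechanically, here is a minimal sketch in Python. Every name in it is hypothetical (the function, the sets, the example finding); nothing here comes from ISO/IEC 17021-1 itself, which of course defines no code. It simply restates the argument above in executable form.

    # Hypothetical sketch of the Major-vs-Minor decision described above.
    # All names are illustrative; none come from ISO/IEC 17021-1.

    # Some "intended results" are mandatory: the organization has to care
    # about them whether it says so or not.
    MANDATORY_INTENDED_RESULTS = {"customer satisfaction", "legal compliance"}

    def classify_finding(affected_result, declared_intended_results):
        """Return 'Major' or 'Minor' for a nonconformity.

        affected_result: the result the nonconformity affects.
        declared_intended_results: what top management says their
            intended results are.
        """
        intended = declared_intended_results | MANDATORY_INTENDED_RESULTS
        # Per the definitions: a nonconformity that affects the capability
        # of the system to achieve the intended results is a Major.
        return "Major" if affected_result in intended else "Minor"

    # The document-template example: management does not count uniform
    # templates among their intended results, so the finding is a Minor.
    print(classify_finding(
        "uniform document templates",
        {"customer satisfaction", "on-time delivery"},
    ))  # -> Minor

The point of the toy is only that the input to the decision is "is this one of their intended results?", not "how widespread is the error?" The real judgement, of course, is nothing like a set-membership test.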

My very favorite explanation of the difference, though, came from a discussion back when I took my first Lead Auditor training class. The instructor had just made the point that an audit is a sampling exercise. You can never see everything that goes on inside an organization. And one of the students had a concern.

Student: If an audit is a sampling exercise, doesn't that mean there's a big risk that when we audit an organization they might have huge, serious problems and we don't see them?

Instructor: That will never happen.

Student: But you just said an audit is a sampling. What if the big problems are all over here and we happen to be looking over there? What if we just miss them?

Instructor: That happens with Minors all the time. In fact, I guarantee that any time you do an audit, there will be minor nonconformities going on in the organization that you will miss. But if the organization has Majors, you will know it by the time you reach the Receptionist's desk! You will smell them! You will know they are there. The point is that if the organization suffers from major nonconformities, their attitude will come through in so many little things that it will be impossible for them to hide it, or for you to miss it. And then — since you already know the Majors are there waiting to be found — all you have to do is find them.  

That's obviously a very informal criterion, but it makes the point beautifully.
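
The instructor's claim is really a claim about probability, and a toy simulation shows why it holds up. The incidence rates below are invented purely for illustration: a one-off problem touching, say, 2% of records is easy for a small sample to miss, while a systemic problem that colors most of what an organization does is nearly impossible to miss.

    # Toy simulation of audit sampling. The incidence rates are invented
    # for illustration; they come from no standard.
    import random

    def miss_probability(incidence, sample_size, trials=100_000):
        """Estimate the chance that a random sample contains no instance
        of a problem occurring at the given incidence rate.
        (Analytically this is (1 - incidence) ** sample_size.)"""
        misses = 0
        for _ in range(trials):
            if not any(random.random() < incidence for _ in range(sample_size)):
                misses += 1
        return misses / trials

    # A Minor-style problem: isolated, affecting ~2% of records.
    print(miss_probability(0.02, 10))   # roughly 0.82 -- usually missed

    # A Major-style problem: systemic, coloring ~60% of everything you see.
    print(miss_probability(0.60, 10))   # roughly 0.0001 -- essentially never missed

Which is the instructor's point in numbers: you will miss Minors all the time, but a genuinely systemic problem shows up in almost any sample you draw, the Receptionist's desk included.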
