Thursday, December 30, 2021

Finding root causes, Part 1: 5-Whys

Last week I talked about what a "real root cause" actually is, but I didn't say much about how to find one. Maybe a couple more words would be helpful. 

There are several tools you can use to dig out a root cause from under a big pile of symptoms. The simplest one is called a "Five-why analysis," and you can think of it as "problem-solving by a bright, persistent six-year-old." 

It all starts when something goes wrong. Somebody asks "Why?" and you give an answer.

"Yes, but why did that happen?"

Another answer.

"Yes, but why did that happen?"

A third answer.

"But Daddy, why did that happen??"

And so on. Just remember — six-year-olds never, ever get tired of this game.

The system is called "5-Why" but there is no law that you have to repeat the question "Why?" exactly five times. Maybe you can do it with fewer repetitions; sometimes it takes a lot more. But you keep at it until you get to a cause that is fundamental and actionable.

Here's an example.

  • Problem: My car won't start.
  • Why won't it start? The battery is dead.
  • Why is the battery dead? The alternator isn't working.
  • Why isn't the alternator working? The alternator belt is broken.
  • Why is the alternator belt broken? It wore out and was never replaced.
  • Why was it never replaced? I didn't maintain the car according to the schedule in the manual.
  • So the root cause why my car won't start is that I didn't maintain it properly.

Notice a few things about this example. 

FIRST: The most basic point is that the root cause really is a cause. It is a cause in the narrow sense that you can toggle it like a light switch and see the problem disappear or reappear. If I maintain my car regularly, this kind of problem will never happen. If I don't, it's bound to.

SECOND: Each "Why?" is based exactly, word-for-word, on the answer to the previous question. This is important to keep you from jumping around -- to make sure that the analysis has no logical breaks in it.

THIRD: A related point is that you have to be able to read the answers backwards, linking them with "therefore." If you can't, you've made a mistake somewhere in your analysis. In this example, it works:

  • I didn't maintain the car according to the schedule in the manual.
  • Therefore the alternator belt wasn't replaced when it wore out.
  • Therefore the alternator belt broke.
  • Therefore the alternator didn't work.
  • Therefore the battery died.
  • Therefore my car wouldn't start.

Does that make logical sense? Yes it does. But now consider this example:

  • Problem: I was late to work.
  • Why? There was a lot of traffic.
  • Why? I took a different route than usual.
  • Why? It was raining.

If you are not used to the 5-Why method, it can be easy to start down an analytical path like this one because this is how explanations burble up when you ask people what went wrong. And superficially it doesn't sound crazy. But let's rewrite it backwards:

  • It was raining.
  • Therefore I took a different route than usual.
  • Therefore there was a lot of traffic.
  • Therefore I was late to work.

Does that make logical sense? Maybe it makes a kind of sense, but right away you can see some gaps.

  • "It was raining, therefore I took a different route" is missing some explanation of what was wrong with my normal route. Was it closed? Flooded out? 
  • "I took a different route, therefore there was a lot of traffic" is weak too. Did I cause the extra traffic by taking a different route? No, of course not. Maybe I'm trying to say that I didn't know how much traffic to expect on that route because I don't usually take it, but that's not what I actually say. And would there normally have been so much extra traffic on that alternate route, or was the traffic jam caused by something else — like the rain — which makes my choice of a different route irrelevant? 
  • Of course these are little quibbles, and in this example they probably don't matter. But in a real-life example, it matters a lot which causes are relevant because those are the ones you will spend time on.

So no, as it stands this line of investigation has some gaps in it. And notice that I just said "Why ... why ...?" instead of repeating the previous answer each time. If I had repeated each answer in the next question, probably I would have seen the gaps earlier.
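If you keep your 5-Why answers in a list, the read-it-backwards check is easy to automate. Here is a minimal sketch in Python; the function name read_backwards and the "Therefore:" formatting are my own choices for illustration, not part of any standard tool. It only prints the chain. You still have to judge for yourself whether each link makes sense.

```python
def read_backwards(problem, answers):
    """Return the chain from the last answer back to the problem.

    `answers` are in the order they were given, so the last one is the
    candidate root cause. Every later line is prefixed with "Therefore:".
    """
    chain = list(reversed(answers)) + [problem]
    return [chain[0]] + ["Therefore: " + step for step in chain[1:]]


# The "late to work" example from above.
late_to_work = read_backwards(
    "I was late to work.",
    [
        "There was a lot of traffic.",
        "I took a different route than usual.",
        "It was raining.",
    ],
)

for line in late_to_work:
    print(line)
```

Reading the printed lines one at a time is the whole test: any "Therefore" that makes you hesitate marks a gap in the analysis.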

FOURTH: Sometimes there is more than one answer to a single question. In my example about the car not starting, the fourth "Why?" has two answers: (1) the alternator belt wore out, and (2) the alternator belt was never replaced. But in the next step, I explore only one of them. Why not the other?

In this case it wasn't worth exploring answer (1) in its own right: the answer to "Why did the alternator belt wear out?" is that everything wears out sooner or later. We all know that and it doesn't help us. It's not actionable, because we can't do anything to prevent it. 

So the analysis focused on answer (2), that the alternator belt hadn't been replaced. But sometimes it won't be so obvious which branch is important. In that case, list all the causes as different branches and follow each branch individually. Some of them will trickle out into truisms like "Everything wears out," and then you learn that those branches aren't useful. But sometimes you are surprised by which branches turn out to be relevant.
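When there are several branches, it can help to write the analysis down as a tree and walk every branch to the end. The sketch below is one hypothetical way to do that in Python. The Cause class, the actionable flag, and the function name are all assumptions of mine for illustration, and deciding whether a cause is actionable is still a human judgment that you have to supply.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cause:
    text: str
    actionable: bool = False  # your own judgment: can you do something about it?
    because: List["Cause"] = field(default_factory=list)  # answers to the next "Why?"


def actionable_roots(cause):
    """Follow every branch to its end and keep only the actionable end points."""
    if not cause.because:
        return [cause.text] if cause.actionable else []
    roots = []
    for branch in cause.because:
        roots.extend(actionable_roots(branch))
    return roots


# The two branches under the fourth "Why?" in the car example.
belt_broken = Cause(
    "The alternator belt is broken.",
    because=[
        Cause(
            "The belt wore out.",
            because=[Cause("Everything wears out sooner or later.")],
        ),
        Cause(
            "The belt was never replaced.",
            because=[
                Cause(
                    "I didn't maintain the car according to the schedule in the manual.",
                    actionable=True,
                )
            ],
        ),
    ],
)

roots = actionable_roots(belt_broken)
print(roots)
```

The branch that ends in a truism drops out, and the branch that ends in something you can fix is what remains.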

FIFTH: As I just said, a root cause has to be actionable. It has to be something you can correct. I can do something about maintaining my car on schedule; but I can't do anything about the overall tendency of things to wear out with time. For another example, see my discussion of wildfires last week.


So that's how you find a root cause, at the most basic level. Next week I'll talk about two ways you can expand the investigation, to make it broader and more comprehensive.

         

Thursday, December 23, 2021

Real root causes

You hear a lot about "real root causes" in the Quality business. What's that mean and why do you care? 

Why you care is that if you fix a problem without fixing the root cause, it'll just happen again tomorrow. So the idea is not to fix it for a day, but to avoid that same problem, or any similar ones, in the future. 

So what is a "real root cause"? That's a little harder. Let me start by giving examples of what it isn't:

  • A real root cause is not just a restatement of the problem. Let's say — this is a fictitious example! — that we get a batch of 5000 plastic housings from our supplier, and they all have a big ugly scratch across the front. Obviously no good. We call them to say we're rejecting the lot, and we ask for a root cause analysis why the housings were no good and how they got through the supplier's QA inspection. Suppose they answer, "The root cause for why the housings were no good is that they've got a big scratch on the front." Is that any help? Of course not. We already knew that. We wanted them to figure out how the housings got that way. In other words, they haven't found the real root cause.
     
    This sounds like a silly example, but in highly technical engineering problems it happens a lot. When the problem is very sophisticated, it is amazingly easy to think you've found a meaningful root cause when all you've done is to re-state the problem in totally different verbiage. Be warned. 
  • A real root cause does not assign blame. I worked for a company once that hired a certain freight-forwarder, and this freight-forwarder mixed up a hugely important (and very expensive) order. We asked them for a root cause analysis. They replied that the root cause was a guy named Fred in their warehouse who was always making mistakes ... and as a Corrective Action they fired Fred! They sent us a copy of his pink slip as proof.
     
    But what good is that? Who's to say that the guy next to him — call him Max — won't make the very same mistake tomorrow? They never actually looked at how they process orders, to see if their system is confusing. 
  • A real root cause has a clear causal connection to the problem. Take the plastic housing example, above. The real root cause will not turn out to be, "Oh that's because Mercury is retrograde in Taurus right now." It also won't be, "Gee, I dunno. These things just happen." 
  • A real root cause is something you can fix. Out here in California we have wildfires from time to time. Sometimes they are caused by natural forces like lightning, and sometimes by human carelessness. The Zaca Fire, back in 2007, was started by sparks from a grinding machine. But clearly part of the reason these fires spread is that Earth has oxygen in its atmosphere. Right? No oxygen, no wildfire. But it doesn't help to call that a root cause unless we can seriously entertain the option of moving to Mars. 
  • "Human error" is not a real root cause. This is a special case of the previous point. We'll never get rid of "human error" any more than we can live on a planet without oxygen. The whole point of a Quality System is that you start by assuming that human beings make mistakes, and so you build in safeguards to reduce the likelihood or the impact of those mistakes to a minimum. 

With those cautions as a background, investigate the causes of a failure. You'll probably end up with a chain like this: "A was caused by B, which was caused by C, which was caused by D ...." Did that chain lead you to a real root cause? Well, you can test it: 

  • First, make sure that each step in the chain really does represent a causal link that makes sense. You should be able to re-word your whole causal chain as, "D, so as a result C, so as a result B, so as a result A." If it makes no sense that way, keep looking. 
  • Second, make sure there's nothing personal in the causal chain. (More exactly, "Fred had a hangover" might be a legitimate cause but "Fred's an idiot" is not.) The assumption is always that everybody is trying his best to do a good job, so anybody could make the same mistakes Fred made. 
  • Third, make sure there's something you can do about each step. Telling everybody to "be more careful next time" does nothing to solve the problem. 
  • How far do you push it? As far as makes sense. Typically that means push it as far as you can go and still derive some meaningful countermeasures. That's your real root cause.
     

Thursday, December 16, 2021

Basic risk management

A while ago I was talking with a friend who works in retail, and she told me about a time when one of the clerks at her store helped a customer take her [the customer's] bags to the car. Then, as he loaded the bags into the car, the customer’s dog bit him.

There was more to the story, though the rest of the details don’t matter right now. But one of the things I asked my friend was, “You’re on the store’s Safety Committee. Did you update your Risk List to include ‘Getting bitten by a customer’s dog’?” She said they discussed it, but it seemed like something that would happen only once in a blue moon. And in that case, does it really make sense to add it to the list?

This happens a lot – I mean, identifying a risk that shows up only rarely. It’s only common sense to want to know what risks you might be facing, and (for example) the ISO management system standards all require some level of risk identification. (ISO 9001, ISO 14001, and ISO 45001 all put this requirement in section 6.1.1.) But of course you can’t take action to prevent everything you think of, so you need some way to rank your list in order of importance. That way you can plan for the ones that really matter, and let the rest go. But what ranking do you choose? Generally there are at least two questions to consider:

  • How likely is this risk?
  • And how bad will the impact be if it happens?

Anything that scores high on both questions goes to the top of the list. After that, it’s not so obvious. But here’s one simple approach you can take. Please note two things:

  • You can use this approach for any kind of risks. In my story about the dog, I was talking about safety risks. But your marketing team can do the very same thing to analyze competitive risks. Your product developers can use this approach (or a more sophisticated version of it) as an FMEA (Failure Mode and Effects Analysis) to think through potential product failures. Your shipping department can do this to evaluate different logistical methods. It is a very general and very powerful tool.
  • There are a lot of ways to make this approach more sophisticated, depending on the needs of your organization. What I describe here is the simplest possible version.

Step one: Score all of your risks according to how likely they are, using just three values: High, Medium, Low.

Step two: Now score all of your risks according to their impact – how bad things would be if they happened – using the same three values: High, Medium, Low.

Step three: Use these two scores to calculate a priority for each risk, using the following formula:

Priority = Likelihood x Impact

            High      Medium    Low
  High      High      High      Medium
  Medium    High      Medium    Low
  Low       Medium    Low       Low

On this scale, for example, “getting bitten by a customer’s dog” would probably rank Low for likelihood but potentially High for impact, for a composite priority of Medium.
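If you keep your risk list in a spreadsheet or a script, the same lookup takes only a few lines. This is a minimal sketch in Python; the names PRIORITY and priority are mine, chosen for illustration. Because the matrix is symmetric, it gives the same result whichever of the two scores you treat as the row and which as the column.

```python
# Priority matrix from the table above. Keys are (likelihood, impact).
PRIORITY = {
    ("High", "High"): "High",
    ("High", "Medium"): "High",
    ("High", "Low"): "Medium",
    ("Medium", "High"): "High",
    ("Medium", "Medium"): "Medium",
    ("Medium", "Low"): "Low",
    ("Low", "High"): "Medium",
    ("Low", "Medium"): "Low",
    ("Low", "Low"): "Low",
}


def priority(likelihood, impact):
    """Look up the composite priority for one risk."""
    return PRIORITY[(likelihood, impact)]


# The dog-bite example: Low likelihood, potentially High impact.
dog_bite = priority(likelihood="Low", impact="High")
print(dog_bite)  # prints "Medium"
```

A more sophisticated version might use numeric scores and multiply them, but for three levels a plain lookup table is easier to read and to audit.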

Now that you have assigned a priority to every risk on your list, what next? The next step should be to address the important ones. 

  • What does it mean to “address” a risk? If possible, prevent it. If you can’t prevent it, take steps now to mitigate the impact when it happens. Also, consider how you will respond when it does happen: those are your contingency actions. 
  • Which ones are “important”? It depends on what you are doing. At the very least, you should address all the risks with priority = High. Naturally you don’t have to stop there. Maybe you want to address the Medium ones as well, or some of them. Maybe there are steps you can take for a few of the Low risks too, though typically you should think about them last. You have to decide what works for you. But addressing all the risks rated High is pretty much a minimum.

What happens to the risks that you choose not to address? If my friend’s company updated their list of safety risks to include “getting bitten by a customer’s dog” and then calculated its priority as only Medium, they might not plan any action for it. So why put it on the list?

The point is that the priority ratings aren’t static. From time to time you’ll review your list to see if things have changed. As you take mitigation steps, for example, the impact of some risks will drop. The impact of others might rise, depending on changes in the outside world. Back in 2019, most American companies who did disaster planning probably rated “global pandemic” at a very low likelihood; by mid-2020, it had become a simple fact of life. So even if a risk falls below your threshold and you decide not to address it right now, keep it on the list. Then the next time you review the list – next quarter, next year, or whenever – you can think about it again. And as long as it stays on the list, you won’t forget.

      

Thursday, December 9, 2021

Process fragility — or — "People Before Process" Part 3 of 3

In last week's post we saw that there are powerful reasons why companies build up their process systems, while the motives to build up their people are often less obvious or less urgent. But the week before, we saw that in the long run it looks more important for an organization to have the right people than to have the right process; because good people will improve a bad process, while bad people will degrade a good one. What does this mean? Is it just one more case where the easy and obvious motives line up in support of short-term benefits at the cost of long-term ones?

Maybe so, but in the full picture we can see other things as well. The first of these is that a reliance on process is fragile, while a reliance on competence is resilient. Let me tell you a story.

Once upon a time, I helped to support the Quality system in a small factory. The factory had run successfully, under one owner or another, for most of the 20th century; some people operating the lines had worked there all their lives and were nearing retirement age. Recently the factory had been acquired by a new owner, and part of the "post-merger integration" was to implement the new owner's QMS across the board. This meant, among other things, generating Control Plans for every factory operation — something that had never been done before. A couple of manufacturing engineers were assigned the task; they made a quick inventory of all the things the factory could do, listed the steps for each in an Excel table, and published the results as Control Plans.

Then one day it was time for our external surveillance audit. In order to audit section 8.5.1 of ISO 9001:2015, the auditor asked for a Control Plan. We offered him several, and he picked one that covered the plating bath. Then he walked out on the line to watch it in action. Right away he discovered that one of the vats was at the wrong temperature. The defined reaction in the Control Plan was, "Stop the line and call the manufacturing engineer," but the line was still running. Our auditor had been watching the process for less than five minutes, and — presto! — he found a Nonconformity.

When the day was over and the auditor had gone back to his hotel for the night, my boss and I walked out on the line to ask the operator what was going on. What was he thinking, to keep running the line when the temperature was significantly outside of range? He wasn't the least bit bothered. He explained that the plating reaction depended on both the temperature of the bath and its chemical composition. When he saw that the heater for one vat was malfunctioning, he changed the chemical composition of that specific stage of the bath to compensate. The final output would be indistinguishable; the customer would get exactly what they ordered, and there would be no need to delay this production order. And a good thing too, because he happened to know that the responsible manufacturing engineer was on vacation for another week yet. But he assured us it was all fine. The product would be correct, and the customer would be happy.

"All fine" is a matter of perspective, of course. My boss and I had to do a lot of talking to persuade the auditor to rate this Nonconformity as a Minor and not a Major. But from the customer's perspective it really was "all fine." The product that shipped to the customer really was going to be indistinguishable from one that had been made at the right temperature and with the defined chemical bath. This means two things:

  • From the perspective of the audit, the finding really should have been a Major Nonconformity, because the system was absolutely not working the way it was defined (on paper). The written Control Plan said that if anything was out of adjustment, the whole process should stop until the responsible manufacturing engineer could review the situation and instruct the operators what to do. (And that would have been another week, at least.)
  • But if the organization had followed the written Control Plan, the order would have been a week late, and needlessly! In this particular case, the operator himself already knew exactly what to do because he was so deeply familiar with the process. Because the operator could rely on his own competence, work did not stop ... the order was not late ... and the customer was not disappointed. Because the operator could rely on his own competence, the organization could confront an unexpected problem and then roll with it resiliently.

It still should have been a Major Nonconformity from the perspective of the audit. But probably the operator never even looked at the Control Plan. That should have been another Nonconformity, come to think of it.*

This is what I mean when I say that relying on process is fragile. No written process can possibly cover all situations that might arise, so every written process runs the risk that one day the organization will face a situation that the process does not address. When this happens, the process breaks down. But relying on competence is resilient, because a well-trained expert with deep knowledge of the process can figure out a response to any unexpected situation, with a high probability of getting it right.

Notice something else. The whole pattern of thought and planning that underlies modern industrial capitalism favors this fragile, process-based approach over the resilient, competence-based one. For consider:

  • On the one hand, someone who is simply trained to follow a process (and no more than that) is unprepared to solve problems or handle novel situations on his own. But he is a lot cheaper than the employee with wide experience and deep knowledge.
  • On the other hand, most of the time your organization shouldn't be facing problems or novel situations.
  • Therefore in principle you shouldn't need your line operators to have wide experience or deep knowledge. If you have one knowledgeable problem-solver for every ten ignorant line operators, that should give you plenty of coverage for the number of problems you are actually likely to face and it's a lot more cost-effective than training everyone.
  • What's more, this arrangement means that your line operators are interchangeable human resources. You can move them wherever you need them in the organization. As long as they know how to follow procedures, you can use them to carry out any task that has been defined by a written procedure. And this gives you far more flexibility than you would have if they were tied to specific tasks because that's all they knew. This is what you want.

But look where this line of calculation takes us. By following the ordinary patterns of thought and planning that underlie modern industrial capitalism, we end up adopting a policy towards our people which has been proven to be very powerful, and which supports indefinite expansion; but this same policy makes our whole organization more fragile, and risks bringing us to our knees if something truly unusual happens.

How can this be? Is there something wrong with the theory?

Well yes, in a sense. Peter Drucker argued for years that our economy is no longer truly "capitalist" because Capital is no longer the most important factor of production. Capital is almost irrelevant these days, because it can be crowdsourced — either in a traditional manner, by issuing shares of stock; or in a contemporary manner, by launching a campaign on GoFundMe. The critical factor of production today, in Drucker's argument, is Knowledge; and the most critical member of any organization is the knowledge worker. (Drucker argued this point in many places but see for example his Post-Capitalist Society (1993).) A knowledge worker is any employee whose unique value comes from the knowledge he carries in his head. And because that knowledge is always of something specific, knowledge workers are in general not interchangeable. (If you have too many quality auditors, it is typically not easy to repurpose some of them as accountants or design engineers.)

Note also that the story above about the plating bath shows that even line operators can be knowledge workers. As a result, the whole approach of treating line employees as interchangeable units starts to look misguided or (at best) out of date.

None of this is to deny that a process focus really is very powerful in the short run. But if anything happens to interrupt normal operations — if, ... oh I don't know, ... say a global pandemic throws the Designated Problem-Solvers out of the office at the same time that it disrupts all the organization's supply chains — then an organization that has relied on a process-focus will be in deep difficulties, while an organization that has built up the competence of all its employees will be able to roll with the changes and adapt.

This development is something that we in the Quality business need to understand and pay attention to. We've heard the message before: W. Edwards Deming insisted in his fourteen key principles on the need for training on the job (point 6), for breaking down barriers between functions (point 9), for pride of workmanship (point 12), and for a "vigorous program of education and self-improvement" (point 13). But we have yet to integrate these concepts into the "common sense" understanding that all Quality professionals carry around with them. We have yet to rewrite our standards, ISO 9001 among them, so they give as much attention to people as to processes.

We can do this. Once upon a time, we Quality professionals didn't all think in terms of statistical variation, but now we do. Once upon a time we didn't all think in terms of business processes, but now we do. We can absorb this change just like all the others. But we need to start.

__________

* The attentive reader will have noticed that I describe the very same action as resilient (as well as good for both the customers and the company) and a potential Major Nonconformity. How can it be both? Aren't audits supposed to improve the company's behavior? Or am I trying to criticize audits as counterproductive?

I'm not criticizing audits per se, but the usefulness of any audit depends critically on the usefulness of the management system documentation that you are auditing against. In this case, the root cause of the finding was the slapdash way that the company threw together their Control Plans, aiming to get something written so they could check a box rather than thinking through what the controls should really be. Since what the operator actually did to respond to the condition was correct, it should have been permitted as one option under a proper Control Plan. Or else perhaps the operator's deep knowledge of the process could have qualified him to be designated as a responsible "Manufacturing Engineer" for this particular production line.

In real life, the company analyzed the audit finding and realized that all their other Control Plans were probably just as bad. So they started over from the beginning and rewrote the lot of them more carefully. It was the best possible response to that finding, and I was glad that's what they chose.         

Thursday, December 2, 2021

"Why do we always revert to process?" — or — "People Before Process" Part 2 of 3

After I published last week's post, I got a note from Jeff Griffiths asking why I think we in the business world regularly put so much emphasis on process. Among other things, he wrote, "I think the reason organizations always seem to revert to process is that developing people and actively managing competency is hard work, and most front-line leaders aren’t trained to do it, and they certainly aren’t invited to do it. What’s your experience been?"

Of course he's right. But his note got me thinking. And the longer I thought about it, the more reasons I could see that organizations choose to emphasize process development.

In the first place, of course, there are the undeniable benefits of the process approach. A process focus supports continual improvement, by letting you understand the overall flow of your operations so you can see where there are blockages. A process focus permits standardization across functions or locations. A process focus facilitates interaction of multiple functions across an organization. And a process focus makes possible the "checklist effect," which enhances the performance even of deeply trained experts like pilots and surgeons. All of these are good things. Nothing that follows is meant to minimize any of these genuine benefits, and I would never suggest that you try to do without defined processes!   

But there are other reasons for the focus on process, and not all of these reasons are so obviously wholesome.

One reason, for example, is that Quality professionals — people like me — push the process approach so hard. That (in turn) is because the process approach is a major focus of the ISO 9001 standard. Section 0.3 of the Introduction is about nothing else. In the normative sections of ISO 9001:2015 (I mean chapters 4-10), the word process or processes occurs 57 times. By contrast: competence or competent occurs 9 times, training occurs twice, people occurs once, and skills shows up only in Annex B (well outside the normative chapters). So organizations can be forgiven for thinking that process is something they have to emphasize.

Another reason is that organizations know how to write processes — even if they don't follow all my advice from last summer, they can write something that's good enough to get by — but they often don't know how to develop their people. As Griffiths wrote to me, most front-line leaders aren't trained to do it. I can personally confirm that when I first became a manager, nobody took me aside to train me how to develop my people — nor even to give me some basic pointers. Of course there are companies that are happy to step in to help with the task — Griffiths himself works for one — but it seems like there aren't nearly enough of them, and of course none of them works for free.

Related to the foregoing is the simple fact that a process focus is easier and cheaper than a competence focus. A process focus is narrow and finite, while a competence focus is potentially infinite — there's always room to learn more and get better. This means that a process focus is easier to replicate at scale, and therefore supports expansion.

For example, McDonald's is well-known as a process-focused business. Everyone knows that McDonald's has defined an exact procedure for every aspect of running each restaurant. The result is that McDonald's has spread across the globe, with more than 37,000 stores in 120 countries. What is more, the food they serve (with the exception of minor regional specialties) is absolutely uniform. When you place an order in a McDonald's, you know what you are going to get.

Contrast this with a restaurant that is not so process-focused, such as Le Bernardin in New York City. No doubt Le Bernardin uses some recipes, and there must be procedures for how to take reservations or manage the flow of customers. But there is also room for a cook to express personal artistry. And while Le Bernardin has been — in its class and with respect to its own (very different) criteria — every bit as successful as McDonald's, nonetheless there is only one.

For a more dramatic example, consider the situation of American war materiel when the country first entered World War Two. Nearly all weapons required precision optical sighting devices to aim them, and the best lenses in the world were all ground in Germany by deeply trained experts who had served long years of apprenticeship. So far as anyone knew at the time, that was the only way to get lenses ground. Peter Drucker tells the story:

Belief in the mystery of craft and skill persisted, as did the assumption that long years of apprenticeship were needed to acquire both. Indeed, Hitler went to war with the United States on the strength of that assumption. Convinced that it took five years or more to train optical craftsmen (whose skills are essential to modern warfare), he thought it would be at least that long before America could field an effective army and air force in Europe—and so declared war after the Japanese attack on Pearl Harbor.

We know now [Frederick W.] Taylor was right. The United States had almost no optical craftsmen in 1941. And modern warfare indeed requires precision optics in large quantities. But by applying Taylor’s methods of scientific management, within a few months the United States trained semiskilled workers to turn out more highly advanced optics than even the Germans were producing, and on an assembly line to boot. And by that time, Taylor’s first-class men with their increased productivity were also making a great deal more money than any craftsman of 1911 had ever dreamed of.

(You can find Drucker's article, from which these paragraphs are extracted, here. There is a short video about the same topic, made in 1945, available here.)

In short, there are powerful motives pushing businesses to adopt a process focus, even if it comes at the expense of putting similar effort into developing their people. And yet we saw last week that in the long run, investment in people is more important than investment in process. At the level of the individual firm, this is something to watch. Be careful that you don't overemphasize an approach that will give you only a limited return on your investment.

There are also implications for how we understand the broader economy, and in particular for how we Quality professionals practice our craft. I will address both of those topics next week. 

                     
