Thursday, January 6, 2022

Finding root causes, Part 2: Reaching farther

When something goes wrong and you are investigating what corrective action to take, how many root causes do you have to find? Typically there is more than one. We've talked over the last couple of weeks about what a root cause is and how to find it, and it's not unusual for the logic path of a 5-Why to branch. For example: The fire started because there were oily rags and at the same time there was also a spark — either one of them alone would not have done it. So now you have to ask "Why were there oily rags?" and also (on a separate branch) "Why was there a spark?"
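If it helps to see the branching written down, here is a minimal sketch in Python. It assumes you record each statement as a node with a list of causes behind it; the names `Why` and `open_branches` are made up for this illustration and are not part of any standard tool.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Why:
    """One node in a 5-Why tree: a statement plus the causes behind it."""
    statement: str
    causes: list[Why] = field(default_factory=list)


def open_branches(node: Why) -> list[str]:
    """Return the leaf statements, i.e. the branches still waiting for a 'Why?'."""
    if not node.causes:
        return [node.statement]
    leaves: list[str] = []
    for cause in node.causes:
        leaves.extend(open_branches(cause))
    return leaves


# The fire example: two causes that both had to be present.
fire = Why("The fire started", [
    Why("There were oily rags"),
    Why("There was a spark"),
])

for statement in open_branches(fire):
    print(f"Why? {statement}")
```

Each leaf is a branch you still have to follow. When you answer one, you add it as a new cause under that node and ask again.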

But even in simpler cases you may need to follow several different paths if you want to get a full picture of what's going on. To understand why, remember that a Quality system is all about getting what you want, and that means minimizing the extent to which you are derailed by problems. You can do this in three ways, and a good Quality system uses all three: 

  1. When a problem occurs, fix it.
  2. Looking downstream from the problem (if it has already occurred), catch it.
  3. Looking upstream from the problem (if it hasn't occurred yet), prevent it.
And so a really thorough investigation of the root cause of some problem takes all three perspectives into account. 

  1. You want to know what happened and why, so you can fix it and make sure it never happens again. This is what we have been talking about up till now.
  2. You want to know how to catch it, which means asking a second question.
  3. And you want to know how it was possible in the first place, which means looking for a second kind of answer.
In the rest of this post, I will walk through both of these enhancements: the second question, and the second kind of answer.

One caution, before I begin: don't go crazy with any of this. Remember that your Quality methods have to be pragmatic: they have to serve you, and not vice versa. Add these enhancements so far as they are useful — and often they truly are very useful — but keep your level of effort proportional to the problem you are trying to solve.

Two questions

One way to make your investigation reach farther is to ask two questions instead of one, and to do a 5-Why analysis on each. The two questions are, "Why did the problem happen?" (which we have already discussed) and also "Why didn't we catch it in time?" 

The easy example is to think of a machine producing widgets that gets out of alignment, so that we start shipping crooked widgets to our customers. 

  • The first question asks about the machine: how did it get out of alignment? 
  • But the second question asks why nobody caught the problem in time: why didn't the inspectors at the end of the line see that the widgets were crooked and send up an alarm?

The point is that a working Quality system is built on the premise that things go wrong: machines break down, people make mistakes, and so on. As I noted above, a Quality system is designed to prevent mistakes before they can happen, and also to catch mistakes after they do. So if you've got a Quality system in place and a crooked widget slipped through anyway, there must have been several points of failure. Otherwise the problem would have been caught and corrected in the normal course of the workday.

If you ask about both the occurrence of a problem and its non-detection, that's sometimes called a "2 x 5-Why." And of course it makes your overall Quality system more robust and resilient, because it helps you to catch problems better as well as just fixing and preventing them. It enhances your investigation by adding the downstream perspective.

Two kinds of answers

The other way to reach farther is to look for two different kinds of root cause and not just one. The two kinds are "technical root cause" (which is the kind of root cause we have discussed up till now) and "managerial root cause." The idea behind this second one was summarized for me once by a senior colleague at one of the places I used to work — he was from another division, but I was lucky enough to get personalized training time with him — who said: "If you look at it right, everything that ever goes wrong in any plant is the fault of senior management."

Wait, what? How is that possible? I tried to make up some examples to prove him wrong.

  • What if some employee doesn't know what he's doing? Then the training process has broken down. And senior management set up the training program — either that, or they hired the person who did. So either they set up a faulty program, or they hired the wrong person.
  • What if a piece of equipment breaks? That equipment should have been covered by a preventive maintenance program: somebody should have been assigned to go around at regular intervals to check how the equipment is holding up, calibrate it if necessary, and then clean it and oil it. If that program had been in place, the employee responsible for maintenance would have seen that the part was getting worn and ordered a new one. But he didn't do it ... because there was no program ... because senior management either failed to set it up or failed to tell the Operations Manager to set it up.
  • What if an employee is measuring where to cut and the ruler slips? Isn't that a case of "Accidents happen"? Not at all. Why does he have to measure the cut with a ruler, when everybody knows that rulers can slip? The cutting operation should have been error-proofed by giving him a fixture to use: shove the thing to be cut until it is snug against the fixture, and then cut along the edge. He doesn't have to measure anything, and he gets the right length every time. But nobody built a fixture to error-proof the job, because — again — senior management didn't make sure that it happened and didn't hire an Area Lead who knew about these things. 
It went on like that for a while. Finally I got the point.  

Up above, the issue was that not only did something go wrong, but the system failed to detect it afterwards. In this case, the issue is that not only did something go wrong, but the system must have allowed it to go wrong in the first place. That means that someone failed to set up the system correctly, or failed to execute some system-level task with an appropriate level of diligence. Either way, that's a managerial responsibility.

Think again of the widget machine we discussed above, the one that is out of alignment and making crooked widgets. 

  • The technical root cause might be that a certain part wore out, or perhaps the machine's design could be improved in such a way that it is less likely to slide out of alignment in the future. Either of these could be a valid cause and something we want to fix. 
  • But the managerial root cause has to do with flaws or gaps in how the system was set up by human beings. If a part wore out, the machine should have been serviced under a preventive maintenance program that would have found the part and replaced it in time. If the design was faulty, the development process which designed the machine in the first place should have foreseen that flaw (maybe by using an FMEA) and used a better design from the beginning.

In brief, this enhances your investigation by adding the upstream perspective.

As an aside: In non-industrial applications, it is not always obvious how to apply both of these elaborations, but it is usually worthwhile to think about it. And if you see a place where it can help, then use it.


I said before that if you ask about both occurrence and non-detection, the method is sometimes called "2 x 5-Why." If you do that and then also ask for both technical and managerial root causes, it is sometimes called "2 x 2 x 5-Why." And sure enough, in this case you really are asking four different questions, and you are looking for some meaningful and actionable answer for each one:

  • What is the technical root cause why the problem occurred?
  • What is the managerial root cause why the problem occurred?
  • What is the technical root cause why we didn't detect the problem in time?
  • What is the managerial root cause why we didn't detect the problem in time?
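One way to keep track of the four questions is as a small two-by-two grid. The sketch below is only an illustration under my own assumptions: the labels and the helper names `empty_grid` and `unanswered` are hypothetical, and the two sample answers are taken from the widget machine example above.

```python
from itertools import product

# The two questions and the two kinds of answer (labels are illustrative).
QUESTIONS = ("occurrence", "non-detection")
KINDS = ("technical", "managerial")


def empty_grid() -> dict:
    """Build the four cells of the grid, each with no answer yet."""
    return {cell: None for cell in product(QUESTIONS, KINDS)}


def unanswered(grid: dict) -> list:
    """List the cells that still lack an actionable root cause."""
    return [cell for cell, answer in grid.items() if not answer]


grid = empty_grid()
grid[("occurrence", "technical")] = "A part wore out"
grid[("occurrence", "managerial")] = "No preventive maintenance program"

for question, kind in unanswered(grid):
    print(f"Still open: {kind} root cause of {question}")
```

The investigation is not finished until `unanswered(grid)` comes back empty, or until you have decided that the remaining cells are not worth the effort for this particular problem.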

But the name isn't the important thing. The important thing is to understand what really caused the problem, so you can fix it. And if you can give good, actionable answers to all four questions, you have a really good framework to make sure this problem — and anything like it — never happens again.

     
