Thursday, January 27, 2022

What about human error? Part 1 of 2

It's a commonplace in the Quality business that any time we start a problem investigation, we insist that "there is no such thing as human error." I say it in this post here. But what does that mean, anyway? And is it true?

At a superficial level, at any rate, it looks like there is something wrong. About a month ago the topic came up in this post and this one on LinkedIn. If the links don't open for you, the basic point is made by Christopher Paris, who points out that obviously all errors are made by humans! After all, they certainly aren't made by space aliens.

Obviously Paris is right that errors are made by human beings and not space aliens. But sometimes I think he's a little too hard on those of us in the Quality industry.* For myself, I've always taken the principle about human error as a motivational slogan rather than a statement of fact, and I think that in a pragmatic sense it performs two roles.

First, if you want to do a decent root-cause analysis, you have to get all the facts. This means getting the cooperation of whoever was there on-site when the problem happened. Now if this employee thinks you're going to blame the whole thing on him and his errors, he's not going to tell you a thing. So you start off by saying that the problem has to be with the system, not with him, and you just need his help to figure out how to improve the system. With luck this will put him at his ease, so you can make progress.

Second, sometimes when your problem-solving team is in the middle of its work, you'll have someone who really wants to get back to his desk to work on something else instead. So he says, "Look, this whole accident was caused by human error. Next time we just have to try harder, that's all. So can we wrap it up and get back to our real jobs?" The problem is that "trying harder" has never solved anything. Often — nearly always, in fact — there is something that can be improved in the system to make it easier to do the job right and harder to make a mistake. So to keep your team from giving up too early, you remind them that "there is no such thing as human error," and if they really believe in "trying harder" then the problem-solving team should try harder to find a systemic cause.

What do I mean by a "systemic cause"? It's the kind of thing I talked about here (and then expanded on in the next two posts here and especially here). If someone made a mistake out of ignorance, see if you can improve your training system. If someone made a mistake because his hand slipped, see if you can get him a tool that makes the work easier. If someone forgot that those drums were filled with nasty waste until he almost dropped his lunch in one of them (let's say it was a "near-miss" and nothing bad actually happened to the lunch), see if you can label the drums or put up signs. Those are all system-level improvements.

At the same time, it's important to notice something else. You remember that there's no such thing as a perfect process, and in the same way there's no such thing as a perfect system. There's even an old joke that says, "You cannot make things foolproof because fools are so ingenious." So while the problem-solving team always has to look for additional system improvements, the organizational management has to emphasize improving the overall competence of every employee. This is because, as we discussed a couple months ago, good people can work under bad processes a lot better than bad people can work under good processes. So the best way to error-proof your operation, so far as you can, is to strengthen both.

Next week we'll look at a typology of human errors, and at the preventive measures which work for each one. It turns out there are several different kinds of human error, and the measures which prevent this kind are no help at preventing that kind. Join me.     

__________

* In fairness, his stated purpose is to motivate us to pull up our socks on a number of basic issues, so it's to be expected that he not go easy on us.

      

Thursday, January 20, 2022

Keep your root cause analysis out of the courtroom!

Over the last several weeks, we have talked a lot about how to do a good root-cause analysis. But of course the first step to doing a good root-cause analysis is making sure you do one at all.

Wait, what? Why wouldn't you?

Suppose the worst happens and somebody gets hurt using one of your products. Somewhere along the line, as you start the investigation, somebody is bound to ask, "Why are we doing a root-cause analysis at all? If we find out the real root-cause for the problem and take steps to fix it, doesn't that just mean some hot-shot attorney can subpoena all our files and then use them to prove our original design was at fault? Won't he sue us for everything we've got? Aren't we safer just closing our eyes and hoping it doesn't happen again?"

Yes, that sounds crazy -- how could it ever be better not to know what caused an accident? -- but it's also true that an ambitious plaintiff's attorney can do a lot of damage when a company is innocently trying to do the right thing. Where do you draw the line?

There's some good news here. It turns out that if you do an accident investigation that results in taking steps which would have made some past injury less likely, the results of that investigation are not admissible in U.S. federal court to prove negligence, culpable conduct, a product or design defect, or a need for a warning or instruction. Note the restrictions. There are other cases where the results of your investigation are admissible: for example, in intellectual property cases, or if the court is trying to determine whether any improvement is possible. But they cannot be used to hang you for your old design.

As I mentioned in an earlier post, I am not an attorney and nothing in this blog constitutes legal advice. Please consult with your own legal counsel.

On the other hand, you can look up the background information on the Internet. The rule in question is called Federal Rule of Evidence 407. And you can find discussions of it in several places. I based this blog post on:

But if you search for FRE 407, I'm sure you can find other sources of your own.

       

Thursday, January 13, 2022

Problem-solving: Anatomy of an 8D

Over the last three weeks, I've talked about ways to improve your root cause analyses. But a root cause analysis is just one step — albeit the most important step — in the whole process of problem-solving. Now that we've discussed it in some detail, let's back up to look at the whole cycle from beginning to end.

There are several different tools or methods to formalize solving a problem. The one I'm going to describe is called an 8D. The name is an abbreviation of "eight disciplines," although "eight steps" might have been just as good a name. It does, in fact, unfold in eight steps, and it is a systematic way to cover all the bases when you want to make sure that a problem is completely solved.

When do you use an 8D?

Because an 8D is thorough and systematic, it can also be time-consuming. So you probably won't want to use it for every single problem that comes along. But pick and choose the problems where you really need a thorough solution — where it is really important that you guarantee it will never happen again.

  • Any problem with legal implications has to get a thorough solution, because you don't want to make a habit of running legal risks.
  • Any problem where you fundamentally aren't meeting your stated commitments — where you are failing to do what your organization is there to do — is another one that needs a thorough solution: for example, if you sell a product that plain doesn't work.
Those two categories should pretty much require an 8D (or the equivalent). After those, think about where it will be useful.
  • One good choice is repeated or systemic problems, things that always go wrong at the same time and in the same way. Even if one of these problems isn't bad enough to fit into the first two categories, if you can fix it once and for all you will easily save enough time over the long haul to make it worth the time you invested up front.
  • On the other hand, an 8D is not a useful tool to untangle problems in a complex business process. 8Ds work best when there is already a clear definition of the target state — how things ought to be — so you can easily define the ways that reality doesn't match.

There might be other useful criteria too. Think about what is going to help you.

The steps in an 8D

Here are the eight steps. The first three can be done in any order, depending on the situation. In an emergency it can be more important to get D3 in place before you even think about D1 and D2.

D1: Name a team

Who's going to work on the problem? You need more than one person. Nobody knows everything, and everyone has blind spots. Besides, if one person knew enough to solve the problem by himself, he would have done it already and it would never have happened.

Name someone as the Team Lead. That's who will call the meetings, organize the investigation, and keep everyone on track.

Depending on your organization, you might also need to name a Sponsor. This happens especially when everybody has a lot of other things to work on, so team members risk being pulled away to do something else instead. A Sponsor is someone who can insist, "No, this problem has to be fixed and I want a report on my desk by Friday that tells me how far you've gotten." Ideally the Sponsor is high enough in the organization (or at least has enough authority) to make the 8D a real priority; but is also close enough to the working level that he feels the pain caused by the problem, and therefore really cares about getting it fixed.

D2: State the problem

OK, this sounds obvious, but write it down. As you study the problem, it is easy to get distracted by symptoms or side topics. Writing down what problem you are really trying to solve helps keep you on track. This is also the time to collect all the information you can find about the problem: what went wrong, where and when and how did it happen, who was involved, and so forth. Be as exact as you can.

Now, it's not unusual that you learn more about the problem as you get deeper into it. You might find that your original statement was too superficial, or that it only captured a symptom. In that case you can absolutely go back to update your problem statement based on your new and deeper understanding. Just remember that, when you are done, the problem you have identified in that statement is the problem you are going to have to fix so it never happens again. And all along the way, there should be a logical connection between the stated problem and the investigation you are doing.

D3: Contain the problem

Solving your problem will probably take a while, and you don't want it to get worse in the meantime. So before you get deep into the solution, take some kind of steps to contain it. If your widgets are coming off the line crooked, maybe you need to shut down the line until you find the problem; if only half of them are coming out crooked, maybe you put someone there to do a manual inspection of every single widget to filter out the bad ones. If there is a catastrophic bug in software that you distribute over the Internet, you probably want to pull it off your download server. Whatever it is, do something to block the problem so it won't get worse while you are figuring it out. Notice that your containment action doesn't have to be efficient or sustainable, because it's only temporary. But it has to be effective.

D4: Find the root cause

We've already discussed this topic at length: here, here, and here.

D5: Brainstorm possible corrective actions

Your root cause analysis may have uncovered several different root causes. (If you did a 2 x 2 x 5-Why analysis, for example, you should have at least four, and maybe more.) Now try to think of at least one permanent corrective action for each one of them: something you can do which will guarantee that that cause never recurs. I say "at least one," but it's fine to come up with more. Go ahead and list them all. Try to write each one so that it is obvious how it is relevant — how will this action prevent that cause from ever coming back again?

D6: Implement corrective actions

This step has several parts to it.

  1. Now you have a list of possible corrective actions. Depending on what you've got, it might not be practical to do them all. So evaluate them, one by one. Maybe this one is too expensive to be practical, or that one causes more problems than it solves. This is where you figure that out. Then, when you have evaluated all your possible actions, pick at least one corrective action to implement. I'd like to say "at least one corrective action for each root cause," but sometimes that's not practical. Figure out what you can pragmatically achieve. But also check the logic trees you worked out in D4 to make sure that the actions you've chosen really will be enough to prevent the problem from ever coming back.
  2. Go implement the action (or actions) that you chose.
  3. Follow up by checking your results. After you implemented your corrective actions, did the problem really go away? 
    1. If yes, that's great. Skip down to part 4. 
    2. If no, then there's a mistake somewhere in the analysis: either one of your root causes was wrong, or you missed a possible cause, or one of your corrective actions didn't completely eliminate the cause that it was supposed to resolve.
      1. In that case, go back through your analysis until you find the error, and re-enter the process at that point: D4 for a new root cause, D5 for a new batch of possible corrective actions, D6 to pick one (or more).
      2. Implement the new corrective action(s), and — once again — check to see if the problem disappears.
      3. And so on. Do this until you finally make the problem disappear.
  4. Once the problem has been permanently eliminated, there's one more part to D6. At this point you can finally afford to remove the temporary containment measure you put in place back in D3. Since there is no longer any possible chance of the problem recurring, there's no longer anything to contain.

At this point, your corrective actions are finally done.
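
If it helps to see the branching in D6 laid out as a loop, here is a minimal sketch in Python. It is only an illustration under some simplifying assumptions: actions are tried one at a time, and the function name `run_d6`, the callbacks, and the simulated `plant` state are all invented for this example. In real life "go back to D4 or D5" is a team decision, not an exception.

```python
from typing import Callable, Iterable, List


def run_d6(
    chosen_actions: Iterable[str],
    implement: Callable[[str], None],
    problem_is_gone: Callable[[], bool],
    remove_containment: Callable[[], None],
) -> List[str]:
    """Implement actions one at a time, checking results after each one."""
    implemented: List[str] = []
    for action in chosen_actions:
        implement(action)                 # part 2: implement the action
        implemented.append(action)
        if problem_is_gone():             # part 3: check the results
            remove_containment()          # part 4: lift the D3 containment
            return implemented
    # Nothing worked, so there is an error somewhere in the analysis.
    raise RuntimeError("Problem persists; revisit D4 and D5")


# Simulated plant state for the crooked-widget example (all values invented).
plant = {"widgets_crooked": True, "manual_inspection": True}


def implement(action: str) -> None:
    # In this simulation only one of the actions really fixes the problem.
    if action == "replace worn part":
        plant["widgets_crooked"] = False


def problem_is_gone() -> bool:
    return not plant["widgets_crooked"]


def remove_containment() -> None:
    plant["manual_inspection"] = False


done = run_d6(
    ["retrain operator", "replace worn part"],
    implement, problem_is_gone, remove_containment,
)
print(done, plant)
```

Notice that the containment measure is removed only inside the branch where the check succeeds. That ordering is the point of part 4.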

D7: Assess risks and learn lessons

The corrective actions are done, but you're not.

As a result of going through this detailed analysis, you've learned something you didn't know before. You have learned that it is possible for such-and-such a problem to occur any time that you have this or that initial condition. You didn't know that before, and now you do. 

So now ask yourself, "Where else do we have the same conditions?" In other words, "Where else are we at risk of the same problem happening, even though so far we've been lucky and it hasn't happened there yet?" This could mean almost anything, depending on what problem you just solved. 

  • If the problem was that your widgets were coming off the line crooked, and if one of the root causes was that the widget-making machine didn't get the preventive maintenance it needed, you might ask yourself "What other machines do we use, and are all of them already getting preventive maintenance? Or do we have another under-maintained machine somewhere in the plant that might go bad tomorrow?"
  • If the problem was that you ran a plating bath at the wrong temperature despite clear instructions in the Control Plan, and if one of the root causes was that the line operator knew better than the author of the Control Plan, you might ask yourself "Is that the only bad Control Plan in the plant? Or are there others that were done just as sloppily? How many of them do we have to fix?"
  • Maybe you sell your widgets internationally, and you need to get them certified before they can be legally imported into Grand Fenwick. Also these certifications expire every two years and have to be renewed. Maybe the problem which triggered your 8D was that someone sold a big shipment of widgets to Grand Fenwick a week after the certification expired, and maybe one of the root causes was that nobody ever told the Order Desk that there was any kind of legal restriction on the orders they are allowed to take. During D6 you will certainly get all the problems related to Grand Fenwick certification sorted out, but during D7 you'll want to ask, "Are there any other legal restrictions on where we can sell any of our other products?" And it would be good to know the answer.
And so on. 

Then, once you have answered the question where else this problem might be expected to show up, do something about it. Take steps to prevent it in those other spots before it has a chance to happen. Of course you should keep your level of effort proportional to the importance of the problem. But preventing a problem before it happens usually saves everyone a lot of time and money.
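
When the "where else?" question is about something you already keep records for, answering it can be as simple as filtering a list. Here is a small Python sketch for the preventive-maintenance example. The inventory, the dates, and the 180-day interval are all hypothetical, made up purely to show the idea.

```python
from datetime import date
from typing import List, Optional, Tuple

# Hypothetical inventory: (machine name, date of last preventive maintenance).
MACHINES: List[Tuple[str, Optional[date]]] = [
    ("widget press", date(2021, 12, 15)),
    ("plating bath heater", date(2021, 3, 2)),
    ("label printer", None),  # never serviced
]


def under_maintained(machines, today: date, max_days: int = 180) -> List[str]:
    """Return machines whose last service is missing or older than max_days."""
    at_risk = []
    for name, last_service in machines:
        if last_service is None or (today - last_service).days > max_days:
            at_risk.append(name)
    return at_risk


print(under_maintained(MACHINES, today=date(2022, 1, 13)))
```

The output is the list of places to look at next, before one of them goes bad.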

Finally, if you keep risk lists for any of your activities, look at them to see if any of them are affected by the new information you have learned. If they are, update them as needed with the new risks you have learned about.

D8: Close the 8D and thank the team

Once all these steps are done, the team should meet one last time to run through the results and check that everyone agrees they are complete. Close and sign off the documentation in whatever way your organization does these things. And thank the team, genuinely recognizing them for the improvements they have made in your system. Order pizza. 

Strictly speaking that last bit isn't mandatory. But it is hard to go wrong by ordering pizza.

Another way to think about it

By Efbrazil - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=102392470
There is no question that an 8D can be a lot of work. But it is a powerful tool. One way to think about it is to see the 8D process as a way to apply the scientific method to solving your organization's problems. At a high level, the steps are the same:

  • Observation / question: This corresponds to the statement of the problem in D2. 
  • Research topic area: This corresponds to the data collection in D2 and the logical analysis of the data in D4.
  • Hypothesis: This corresponds to the list of possible corrective actions in D5.
  • Test with experiment: This corresponds to D6, where you implement one or more corrective actions and then check to see if they succeeded in eliminating the problem.
    - Remember that the whole point of the scientific method is that we don't know how an experiment will turn out until we do it; that's why running an experiment teaches us something new.
    - But it's the same with an 8D: we don't know whether our proposed corrective action will really correct anything until we try it. That's why there are so many sub-parts to D6: we have to allow for branching paths, depending on how the results turn out in the real world.
    - And either way, regardless of whether the first action we try (or the second, or the third) succeeds or fails, we learn something new, something we didn't know before.
  • Analyze data: Part of this analysis takes place in D6, where we evaluate whether we really did eliminate the problem (and, if not, why not). The rest of it takes place in D7, where we evaluate the broader implications of the new information we just learned: Where else do we risk seeing the exact same problem?
  • Report conclusions: This is where we wrap up the paperwork at the end of D8. 

This particular cyclical representation of the scientific method fails to include a step for ordering pizza. But it's still a good thing to do.           

               

Thursday, January 6, 2022

Finding root causes, Part 2: Reaching farther

When something goes wrong and you are investigating what corrective action to take, how many root causes do you have to find? Typically there is more than one. We've talked over the last couple of weeks about what a root cause is and how to find it, and it's not unusual for the logic path of a 5-Why to branch. For example: The fire started because there were oily rags and at the same time there was also a spark — either one of them alone would not have done it. So now you have to ask "Why were there oily rags?" and also (on a separate branch) "Why was there a spark?"
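
One way to picture a branching 5-Why is as a tree: each finding points to the answers you got when you asked "Why?" about it, and the root causes sit at the ends of the branches. Here is a minimal Python sketch of that idea. The class name `Why` and the two end-of-branch answers are invented for illustration; a real analysis would have its own findings and usually more levels.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Why:
    """One node in a Why tree: a finding plus the answers to 'Why?'."""
    finding: str
    causes: List["Why"] = field(default_factory=list)


def root_causes(node: Why) -> List[str]:
    """Follow every branch to its end and collect the findings found there."""
    if not node.causes:
        return [node.finding]
    found: List[str] = []
    for cause in node.causes:
        found.extend(root_causes(cause))
    return found


# The fire example, with one made-up answer at the end of each branch.
fire = Why("The fire started", [
    Why("There were oily rags",
        [Why("No covered bin for used rags (hypothetical)")]),
    Why("There was a spark",
        [Why("Frayed cord on a power tool (hypothetical)")]),
])

for cause in root_causes(fire):
    print(cause)
```

Two branches give two root causes, and each one needs its own corrective action.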

But even in simpler cases you may need to follow several different paths if you want to get a full picture of what's going on. To understand why, remember that a Quality system is all about getting what you want, and that means minimizing the extent to which you are derailed by problems. You can do this in three ways, and a good Quality system uses all three: 

  1. When a problem occurs, fix it.
  2. Looking downstream from the problem (if it has already occurred), catch it.
  3. Looking upstream from the problem (if it hasn't occurred yet), prevent it.
And so a really thorough investigation of the root cause of some problem takes all three perspectives into account. 

  1. You want to know what happened and why, so you can fix it and make sure it never happens again. This is what we have been talking about up till now.
  2. You want to know how to catch it, which means asking a second question.
  3. And you want to know how it was possible in the first place, which means looking for a second kind of answer.
In the rest of this post, I will walk through both of these enhancements.

One caution, before I begin: don't go crazy with any of this. Remember that your Quality methods have to be pragmatic: they have to serve you, and not vice versa. Add these enhancements so far as they are useful — and often they truly are very useful — but keep your level of effort proportional to the problem you are trying to solve.

Two questions

One way to make your investigation reach farther is to ask two questions instead of one, and to do a 5-Why analysis on each. The two questions are, "Why did the problem happen?" (which we have already discussed) and also "Why didn't we catch it in time?" 

The easy example is to think of a machine producing widgets that gets out of alignment, so that we start shipping crooked widgets to our customers. 

  • The first question asks about the machine: how did it get out of alignment? 
  • But the second question asks why nobody caught the problem in time: why didn't the inspectors at the end of the line see that the widgets were crooked and send up an alarm?

The point is that a working Quality system is built on the premise that things go wrong: machines break down, people make mistakes, and so on. As I noted above, a Quality system is designed to prevent mistakes before they can happen, and also to catch mistakes after they do. So if you've got a Quality system in place and a crooked widget slipped through anyway, there must have been several points of failure. Otherwise the problem would have been caught and corrected in the normal course of the workday.

If you ask about both the occurrence of a problem and its non-detection, that's sometimes called a "2 x 5-Why." And of course it makes your overall Quality system more robust and resilient, because it helps you to catch problems better as well as just fixing and preventing them. It enhances your investigation by adding the downstream perspective.

Two kinds of answers

The other way to reach farther is to look for two different kinds of root cause and not just one. The two kinds are "technical root cause" (which is the kind of root cause we have discussed up till now) and "managerial root cause." The idea behind this second one was summarized for me once by a senior colleague at one of the places I used to work — he was from another division, but I was lucky enough to get personalized training time with him — who said: "If you look at it right, everything that ever goes wrong in any plant is the fault of senior management."

Wait, what? How is that possible? I tried to make up some examples to prove him wrong.

  • What if some employee doesn't know what he's doing? Then the training process has broken down. And senior management set up the training program — either that, or they hired the person who did. So either they set up a faulty program, or they hired the wrong person.
  • What if a piece of equipment breaks? That equipment should have been covered by a preventive maintenance program: somebody should have been assigned to go around at regular intervals to check how the equipment is holding up, calibrate it if necessary, and then clean it and oil it. If that program had been in place, the employee responsible for maintenance would have seen that the part was getting worn and ordered a new one. But he didn't do it ... because there was no program ... because senior management either failed to set it up or failed to tell the Operations Manager to set it up.
  • What if an employee is measuring where to cut and the ruler slips? Isn't that a case of "Accidents happen"? Not at all. Why does he have to measure the cut with a ruler, when everybody knows that rulers can slip? The cutting operation should have been error-proofed by giving him a fixture to use: shove the thing to be cut until it is snug against the fixture, and then cut along the edge. He doesn't have to measure anything, and he gets the right length every time. But nobody built a fixture to error-proof the job, because — again — senior management didn't make sure that it happened and didn't hire an Area Lead who knew about these things. 
It went on like that for a while. Finally I got the point.  

Up above, the issue was that not only did something go wrong, but the system failed to detect it afterwards. In this case, the issue is that not only did something go wrong, but the system must have allowed it to go wrong. That means that someone failed to set up the system correctly, or failed to execute some system-level task with an appropriate level of diligence. Either way, that's a managerial responsibility.

Think again of the widget machine we discussed above, the one that is out of alignment and making crooked widgets. 

  • The technical root cause might be that a certain part wore out, or perhaps the machine's design could be improved in such a way that it is less likely to slide out of alignment in the future. Either of these could be a valid cause and something we want to fix. 
  • But the managerial root cause has to do with flaws or gaps in how the system was set up by human beings. If a part wore out, the machine should have been serviced under a preventive maintenance program that would have found the part and replaced it in time. If the design was faulty, the development process which designed the machine in the first place should have foreseen that flaw (maybe by using an FMEA) and used a better design from the beginning.

In brief, this enhances your investigation by adding the upstream perspective.

As an aside: In non-industrial applications, it is not always obvious how to apply both of these elaborations, but it is usually worthwhile to think about it. And if you see a place where it can help, then use it.


I said before that if you ask about both occurrence and non-detection, the method is sometimes called "2 x 5-Why." If you do that and then also ask for both technical and managerial root causes, it is sometimes called "2 x 2 x 5-Why." And sure enough, in this case you really are asking four different questions, and you are looking for some meaningful and actionable answer for each one:

  • What is the technical root cause why the problem occurred?
  • What is the managerial root cause why the problem occurred?
  • What is the technical root cause why we didn't detect the problem in time?
  • What is the managerial root cause why we didn't detect the problem in time?

But the name isn't the important thing. The important thing is to understand what really caused the problem, so you can fix it. And if you can give good, actionable answers to all four questions, you have a really good framework to make sure this problem — and anything like it — never happens again.
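
The four questions above are just every combination of two events and two kinds of cause. As a small illustration, here is a Python sketch that builds them that way. The names `EVENTS`, `KINDS`, and `why_questions` are my own, chosen only for this example.

```python
from itertools import product
from typing import List

EVENTS = ("the problem occurred", "we didn't detect the problem in time")
KINDS = ("technical", "managerial")


def why_questions() -> List[str]:
    """Build one question per (event, kind) pair: 2 x 2 = 4 questions."""
    return [
        f"What is the {kind} root cause why {event}?"
        for event, kind in product(EVENTS, KINDS)
    ]


for question in why_questions():
    print(question)
```

Each question then gets its own 5-Why chain, which is where the "x 5" in the name comes from.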

     
