Thursday, April 11, 2024

"The system is broken!"

As Quality professionals, we get into the habit of thinking along certain lines. Often these lines are very useful, which is why we develop the habits. And even when we knock off work to go home, it can be a huge benefit in our daily private lives not to blame people when things go wrong, to use incremental improvement to get better at golf, or to remember the process approach when negotiating with unhelpful Help Desks.

But every so often those habits can trip us up, when they encourage us to assume things that aren't really there. A couple of days ago I was talking to an official from ASQ, and questioned why I hadn't gotten a certain mailing. I was sure this was a sign that there was a bug in the routine that generated mailing lists, and asked for a bunch of information to help locate the error: "Which mailing list did you use? What email address does that list show for me?" After a while the answer came back, and it turned out that mailing hadn't gone out to anybody yet. Maybe I could afford to be a little more patient? 😀

The same thing can happen in bigger cases.

Last Friday, April 5, some 60,000 households in the Province of Alberta lost power in rolling blackouts, starting at 6:49 am and continuing over four hours until about 11:00.* Fortunately this was in April, so temperatures were a little warmer than they were back in January—the last time that Alberta's power grid almost failed. (I discussed that failure at the time in this post.) This time, thermometer readings hovered cozily right around freezing: from 28°F to 32°F in Calgary, and from 30°F to 32°F in Edmonton. Even the rural town of Conklin in the north-east experienced temperatures in the same range. Still, the unplanned blackouts caused understandable alarm across the province, and many people rushed to assign blame.

  • Premier Danielle Smith said the blackouts were all because the market doesn't encourage natural gas plants to stay operating, so that they can pick up the slack in a moment when other sources fail. "This is at the heart of everything that we've been saying for the last year, that the system is broken."
  • On the other hand, Marie-France Samaroden, the vice-president of grid reliability operations with the Alberta Electric System Operator (AESO), pointed out that gas plants aren't simply a panacea, because they are as subject to disruption as any other type of generator. And in fact one of the immediate triggers for Friday's blackouts was that the 420-megawatt Keephills 2 natural gas power plant went offline unexpectedly. (At the moment it is not clear why.)  
  • Andrew Leach, an energy and environmental economist and professor at the University of Alberta, said that the whole market structure misallocates energy production, because it has been set up inflexibly. 
  • Blake Shaffer, an associate professor of economics at the University of Calgary, summarized the situation by remarking, "People like to assign blame on power system woes to their least favorite generation technology. And the reality is, all generation technologies have reliability challenges." 

The last time I wrote about Alberta's power grid, I discussed the kind of analysis and planning that we might expect: FMEAs for individual components of the system, plus an overall system analysis for the entire grid. And I assumed that such an analysis, if it were thorough enough, would highlight exactly where Alberta has to take action to improve weaknesses, in order to prevent future failures. But I overlooked one huge fact that makes the entire problem far more difficult. It is only a small consolation to reflect that everyone else who has commented on the problem has made the same mistake.

The critical mistake we all have made is that we think of Alberta's electric grid as one large system. But it's not.

Think about it for a minute. The grid consists of producers (who generate power) and consumers (who use it). The producers are plants powered by natural gas or coal; wind turbines; banks of solar cells; and so forth. The consumers are private homes, businesses, and in fact even the production plants themselves to the extent that they need electricity to power their own operations. 

But now, what is a system? According to Wikipedia,** a system is "a group of interacting or interrelated elements that act according to a set of rules to form a unified whole"; and for the concept of system planning to make any sense at all, the planner has to be able to intervene in the system to adjust it in any spot where it is not running correctly. That's how a machine works, and it's how a factory works. All our Quality tools are designed to analyze systems that look like this.

And this is exactly what the Province of Alberta cannot do! Each of those producers is a private company. Each of the consumers is a private company or else a private citizen. The Province of Alberta has no authority to tell the owners of those power plants how to run their businesses, nor to tell private homes how much electricity they are allowed to use. But this means that the Province is powerless to reach in and adjust this or that element of the "system" in order to make the whole thing run better. (Or to keep it from breaking down!) The most they can do is to provide information and offer incentives in the hope of coordinating and influencing the behavior of the "system components." But those "components" remain stubbornly independent.
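
To make the contrast concrete, here is a deliberately tiny sketch in Python. It is not a model of Alberta's actual market; the generator names, capacities, and prices are all invented. The only point is the structural difference between the two functions: in the first, the planner assigns output to each unit directly, while in the second the operator can only post a price and hope that enough independent owners choose to respond.

```python
# A toy sketch, not a model of Alberta's actual market.
# Generator names, capacities, and prices are invented for illustration.

demand_mw = 900

generators = [
    {"name": "GasPlantA", "capacity_mw": 400, "cost_per_mwh": 60},
    {"name": "WindFarmB", "capacity_mw": 300, "cost_per_mwh": 5},
    {"name": "GasPlantC", "capacity_mw": 420, "cost_per_mwh": 95},
]

def dispatch_by_command(gens, demand):
    """A machine-like system: the planner simply assigns output to each unit."""
    remaining, plan = demand, {}
    for g in sorted(gens, key=lambda g: g["cost_per_mwh"]):
        take = min(g["capacity_mw"], remaining)
        plan[g["name"]] = take
        remaining -= take
    return plan, remaining  # remaining > 0 would mean a shortfall

def dispatch_by_incentive(gens, demand, offered_price):
    """Independent actors: the operator only posts a price; each owner decides whether to run."""
    supplied, plan = 0, {}
    for g in gens:
        runs = offered_price >= g["cost_per_mwh"]  # the owner's private decision
        output = g["capacity_mw"] if runs else 0
        plan[g["name"]] = output
        supplied += output
    return plan, max(0, demand - supplied)

print(dispatch_by_command(generators, demand_mw))
print(dispatch_by_incentive(generators, demand_mw, offered_price=70))  # GasPlantC stays offline; shortfall
```

With the first function, a shortfall can always be traced back to a planning error somewhere. With the second, the same shortfall can appear even when the operator did everything it was allowed to do.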

Of course I have overstated the situation when I say that Alberta's electric grid is "not a system." There are plenty of other systems that have the exact same features: the economy is one of them, and a natural ecosystem is another. Sellers and shoppers are "system components" in the economy; animals and plants are "system components" in a natural ecosystem. In both cases, the "components" do what they want, and not what we tell them to; but we still talk about both the Economy and Nature as "systems." All the same, it is important to recognize that they are systems of a very different kind than a machine or a factory, precisely because the "system components" can do what they want and ignore all our good planning. I won't enumerate examples when an attempt to plan the economy (or an ecosystem!) has had unexpected or unfavorable results, because you can probably come up with plenty of examples on your own. What is important is to recognize that Alberta faces the same kind of challenge in planning the electrical grid.

Does this mean blackouts are inevitable? Not exactly. But so long as the system is structured the way it is today, it is probably not possible to guarantee that future blackouts will be prevented.

Well, is it possible to restructure the system to prevent blackouts? Maybe, but take it slowly here. When I say that the current structure cannot guarantee an end to blackouts, I'm talking about the structure where producers and consumers are all independent. Theoretically I could imagine trying to "simplify" the system by giving the provincial government full authority over all the power companies and all the consumers. Then they could adjust the system wherever needed, to make sure it runs smoothly. But that's a lot of authority. 

Does anyone really want the provincial government telling them how many hours they are allowed to keep their lights on, or what days they are allowed to recharge their phones? Probably not. 

Or if you own a power company, do you want the provincial government to tell you how much you can produce and when you have to produce it, even if their decisions mean you lose your shirt? Again, probably not. 

Of course the whole question is a political one, to be answered by the voters of Alberta and not by me. But I can imagine an outcome where the voters decide that they'd rather put up with the risk of future blackouts, because the available alternatives are even worse. 

Like I said at the beginning, sometimes our habits as Quality professionals can mislead us. Our familiarity and facility with technical tools can make us think that enough technical skill can solve any problem. But sometimes the most difficult issues are not technical ones. 

_____

* You can google the event to find coverage. Here are some of the articles I consulted in writing this piece:
https://www-cbc-ca.cdn.ampproject.org/c/s/www.cbc.ca/amp/1.7165290
https://www.theenergymix.com/rotating-brownouts-in-alberta-highlight-need-for-more-flexible-grid/
https://tnc.news/2024/04/08/alberta-to-modernize-power-grid/
https://calgary.ctvnews.ca/alberta-s-second-grid-alert-in-2-days-leads-to-rolling-blackouts-1.6835023
https://globalnews.ca/news/10405013/alberta-electric-system-grid-alert-april/
   

** https://en.wikipedia.org/wiki/System        

                    

Thursday, April 4, 2024

Disasters happen!

There are people on the Internet who claim that what we see as Reality is actually a giant Simulation, and some days it seems like they have a point. Would random chance in real life have given us the entertaining string of disasters we've experienced so reliably this spring, or should we assume that it's a plotting device dreamed up by some intergalactic blogger and content creator with an offbeat sense of humor? Since my purpose in this blog is not to tackle the Big Metaphysical Questions I'll leave this one unanswered, remarking only that our record of calamities the last few months has been strikingly consistent.

A lot of my recent posts since January have been related, in one way or another, to the tribulations of Boeing, who seem to have dominated the headlines for some time now in spite of themselves. But of course that's not all that has been going on. Also back in January, the electric grid in the Province of Alberta came close to shutting down, seemingly because, … (checks notes) … it got too cold. (I discuss this event here.) Then in another extreme weather event that did not repeat the Alberta experience but somehow rhymed with it, massive hailstorms in central Texas three weeks ago destroyed thousands of solar panels.* And perhaps the most dramatic recent catastrophe (upstaging even Alaska Airlines flight 1282) took place early Tuesday morning a week ago, when a massive container vessel piloting out of Baltimore Harbor collided with one of the supports of the Francis Scott Key Bridge—and demolished the bridge.


It should go without saying that tragedies like this are devastating. If there is any way to find a silver lining around clouds this dark, it is that by analyzing what went wrong we can often learn how to prevent similar catastrophes in the future.

Sometimes this analysis can rely on straightforward data collection about the environment in which the planned operation will take place. Historical records could offer information, for example, on the likelihood of cold weather in Alberta in January, or the risk of hail in central Texas. But often the question is more difficult. For example, the Dali (the container vessel in Baltimore Harbor) appears to have suffered some kind of power failure just before the accident, a power failure which could have made it impossible to steer the ship. I'm sure there was some kind of planned protocol for how to handle a power failure; there was probably an emergency backup power supply available. But how much time did it take to activate the backup power? Did the advance planning take account of the possibility that the power would go out when the ship was in such a position that even a minute or two without steering could mean catastrophe? At this point I don't have any information to answer that question. But I can easily imagine that the answer might be "No, we assumed that five minutes [for example] would be plenty fast enough" … and I can also imagine that back when the planning was done, that might have sounded reasonable! Today we would evaluate the same question differently, but only because we have seen an accident where seconds counted.**

So it turns out that analyzing catastrophes is a hard thing to do. In particular, it is important to recognize that even when we can collect all the data, there are huge innate biases we have to overcome in order to understand what the data are telling us. Two important ones are the Hindsight Bias, and the Outcome Bias.

The Hindsight Bias means that when we already know the outcome, we exaggerate (in retrospect) our ability to have seen it coming at the time. This is why, when people play tabletop games to refight battles like Gettysburg or Waterloo, the side that lost historically can end up winning: once you know what stratagems your opponent could use to win (because they are part of the historical record), it becomes easier to block them.

The Outcome Bias means that when we already know the outcome, we judge the decisions that people made in the moment by how far they contributed to the outcome. So if someone took steps in the middle of a crisis which looked logical at the time but ultimately made things worse, retrospectively we insist that he's an idiot and that it was his "bungling" that caused the disaster. We ignore the fact that his actions looked logical at the time, for reasons that must have made sense—and therefore, if it happens again, somebody else will probably do the exact same thing. By blaming the outcome on one person's alleged "stupidity" we lose the opportunity to prevent a recurrence.  

If you can spare half an hour, there's a YouTube video (see the link below) that explains these biases elegantly. It traces the history of the nuclear accident at Three Mile Island on March 28, 1979. The narrator walks us through exactly what happened, and why it turned out so badly. And then the narrator turns around to show us that the whole story he just told is misleading! It turns out that Hindsight Bias and Outcome Bias are fundamentally baked into the way we tell the story of any disaster. And if we allow ourselves to be misled by them, we can never make improvements to prevent the next accident.

The basic lessons are ones you've heard from me before—most critically, that human error is never a root cause but always a symptom. (See also here, here, and here.) But the video does a clear and elegant job of unpacking them and laying them out. And even though we all know how the story is going to end, the narrator makes it gripping. Find a free half hour, and watch it. 



__________

* I have seen multiple posts on Twitter insisting that this happened again a week later, but the weather websites which I've cross-checked disagree. See for example this news report, which showcases a tweet that pegs the storm on March 24, whereas the text of the article dates it to March 15. 

** Again, to be clear, I have no genuine information at all about the disaster planning aboard the Dali. I am reconstructing an entirely hypothetical situation, to show how our judgements about past decisions can be affected by our experience in the present.   

           

Thursday, March 28, 2024

A podcast on Boeing!

This week I had another chance to sit down with Kyle Chambers of Texas Quality Assurance, this time to talk about Boeing. Like me, Kyle has had a series of episodes dealing with Boeing's troubles in the last year, and he always brings a refreshing and very practical energy to all Quality topics. I start off talking about the FAA report that I discussed here last week; but in the course of the discussion we also cover why Boeing should hire TQA to revamp their training programs, and how to make safety classes matter to people who don't want to be there.

Please join us!

You can find the podcast version here: #QualityMatters episode 175.

Or there's a version on YouTube that also includes video, which you can find here:



Leave me a comment to let me know your thoughts! 

          

Thursday, March 21, 2024

What did the FAA find?

It's all very well to sit snugly behind a keyboard and criticize Boeing's safety culture (as I have done in a number of posts this spring, for example here and here). But how much of this is just talk, and how much is based on hard data? Has anyone done the hard work to sit down with Boeing and study their culture in detail? Maybe an exercise like that could tell us something useful.

In fact, a special Expert Panel completed just such a study last month. These experts were appointed by the Federal Aviation Administration (FAA) and began to meet a year ago, at the beginning of March, 2023. They wrapped up their investigation in February 2024 after spending a full year on it. The team reviewed 7 surveys and more than 100 policies and procedures, comprising over 4000 pages. They interviewed more than 250 people across 6 locations. In the end they issued 27 findings and 53 recommendations. You can find the full report online here, and the New York Times has an article about it here.

The report is devastating. 

More exactly, it's written in the bland bureaucratic language that is mandatory for reports like this. There are no bold headlines screaming "J'Accuse!" But I have been auditing since 1996, and I cannot remember ever reading—much less writing!—a report about a fully functioning organization* that painted in such broad strokes a picture of a management system floating so loose from its moorings.

Background and summary

The Expert Panel was formed in accordance with the provisions of the 2020 Aircraft Certification, Safety, and Accountability Act (ACSAA), Pub. L. 116-260, Div. V, § 103, which requires review of organizations that hold an Organization Designation Authorization (ODA) from the FAA. An ODA is the arrangement by which the FAA delegates certain Boeing employees to inspect Boeing's own work, on behalf of the FAA, so that the FAA does not have to assign their own people. The idea seems to be at least in part that there are a lot of inspections which are mandated by airworthiness regulations, and if all of them had to be carried out by FAA personnel then the FAA's staff and budget would have to be significantly increased. 

If you think it sounds crazy to ask a company to inspect its own work when there are serious safety risks at stake, … well, you can look up the text of the 2005 rule (70 FR 59932) establishing ODAs in the Federal Register; the "Background" section of that document explains how the idea grew incrementally over time as a way to cut down the long delays caused by airworthiness inspections. But the FAA still retains oversight of the whole process—naturally, right?—which is why the 2020 law referenced above requires all ODA holders explicitly "to adopt safety management systems (SMS) consistent with international standards and practices," and also directs the FAA "to review The Boeing Company’s ODA, safety culture, and capability to perform FAA-delegated functions." (Reference.)    

When the Expert Panel issued their report, they summarized their findings under four general headings:

  • Boeing's safety culture, where they found a "disconnect" between what they heard from senior management and what they heard from the rank and file;
  • Boeing's SMS, which was structured to reflect all the applicable standards perfectly but which appeared to have been glued on top of the organization with library paste;
  • Boeing's ODA management structure, which the Panel conceded had been recently reorganized to make it harder for the company to retaliate against an employee finding violations while acting in the name of the FAA (but "harder" still doesn't mean "impossible");
  • Other topics.

In the remainder of this post I will highlight and discuss some of the specific findings and other observations. (Sometimes I will indent my comments in blue, when I think it helps to distinguish my remarks from those of the Panel.)

Boeing's safety culture

The basic observation here is that Boeing has defined and rolled out a formal, written safety culture, but most employees don't really understand it. (Sec. 3.3) Concretely:

  • Many employees, when interviewed, didn't know about "Boeing's enterprise-wide safety culture efforts, nor its purpose and procedures." (Sec. 4.1, #1)
  • Even employees who knew the terminology of the safety culture couldn't use it in a sentence. (Sec. 4.1, #2)
  • Some Boeing sites have good, "confidential, non-punitive reporting systems" in place—but not all of them. (Sec. 4.1, #3)
  • Managers can investigate reports in their own reporting chain, which means they risk not being impartial. (Sec. 4.1, #4)
  • Employees don't know which reporting system to use for safety problems. Employees don't really trust any of the reporting systems, and prefer to report safety problems to their bosses. Employees especially don't trust the anonymity of the "preferred system." Employees do not (reliably) get informed of the outcome when they do report through these systems. (Sec. 4.1, #5)
    • My comment: When you first hear it, "reporting safety problems to your manager" doesn't sound like a bad idea. (Although naturally people who report problems should still hear back how they were dispositioned, or they'll start to think that reporting is a waste of time.) The reason that "reporting safety problems to your manager" can become a problem is that ….
  • When employees report safety problems to their managers, it's often done verbally. So there is no way to know if any particular problem ever made it into the reporting system. And if a problem didn't get into the system, there's no way to track whether it was ever analyzed or fixed. (Sec. 4.1, #6) 

Boeing's SMS

Grigory Potemkin, architect of the system?
The Panel makes a number of high-level observations about Boeing's SMS, before diving into the details. Among these observations are the following:

  • All the SMS documents are new, and there is no traceability to the changes from what came before. (Sec. 3.4, para. 4)
  • Most of the SMS documents cover general conduct and do not translate to the concrete working level. (Sec. 3.4, para. 5)
  • Many employees don't really understand the elements of the SMS, or else they think it is a management fad that won't stick around. (Sec. 3.4, para. 10) 
  • Many employees point out that Boeing already had a detailed safety system before the SMS was implemented—so why do we need this new one now? (In fact the old system is still referenced in many procedure documents.) (Sec. 3.4, para. 11)
  • Boeing requires employees to take safety training classes, but doesn't test whether they learned anything. (Sec. 3.4, para. 13)

In other words, the Panel says that Boeing's shiny new SMS—which complies perfectly with all the relevant requirements and standards—is a Potemkin system.

After those general observations, the specific findings might be an anticlimax, but here are a few of them:

  • The complexity of the SMS documentation, and "the constant state of document changes," make it hard for employees to understand it. (Sec. 4.2, #10)
  • Boeing uses an SMS dashboard to track safety goals, but employees (and some managers) don't understand what it is or how to use it. (Sec. 4.2, #12)
  • There are different tracking systems for the SMS and for the legacy safety systems, and many people are confused by them. (Sec. 4.2, #12, cont'd.)
  • Since Boeing has kept all the legacy safety systems in place, employees across the company don't trust that the new SMS will last long. (Sec. 4.2, #13)
  • Boeing has procedures on how to evaluate safety-relevant decisions, but there's nothing to explain how to tell which business decisions count as safety-relevant. (Sec. 4.2, #14)

In other words, employees don't understand the SMS and they have no motivation to learn it.

Boeing's ODA management structure

The Panel's general observation about the ODA program is that it is getting harder to fill, because participating inspectors (called Unit Members, or UMs) are retiring faster than new ones are being brought onboard. (Sec. 3.5, paras. 4-6; sec. 4.3, #18)

But the detailed findings have to do mostly with the risk that UMs could fear retaliation for speaking out about problems:

  • Boeing has not eliminated the possibility of retaliation when UMs raise safety concerns, and some UMs have experienced what looks like retaliation. Other UMs are not willing to help or step in, and their help is rejected as interference. (Sec. 4.3, #16)
  • Boeing says they took steps to make sure the ODA program is working correctly, but cannot provide proof. (Sec. 4.3, #17)
    • In which case, did they really do anything?
  • Supposedly Boeing has changed the ODA organizational structure, but nobody knows how. Employees still report to their old managers. Procedures are still written around the old structure. (Sec. 4.3, #19) 

There are some other smaller findings as well.

Other topics

Of the findings classified as "Other matters," the two that concern me the most state (in different ways) that input from pilots is treated inconsistently: if it comes into Executive A, it is treated seriously and addressed; but if it comes into Executive B, it might get lost or forgotten. (Sec. 4.4, #23 and #24) Less alarming are some technical points about how to handle the relationship between Boeing and the FAA in the future.

But a couple of the other general observations are worth noting.

Right at the beginning, Boeing welcomed the Panel and made sure to say that they looked forward to open collaboration. But the Panel says that in fact, Boeing answered questions rather as if the evaluation were an audit or a deposition, and asked for no input of any kind. (Sec. 2.6, paras. 12-13; sec. 3.2, para. 1)

So I have to ask, Did Boeing expect to learn anything from this evaluation? Or was the intent simply to get through it as fast as possible, with as few findings as possible? Because clearly, if you approach the whole exercise in a defensive frame of mind, you leave open fewer chances to learn and improve from the experience. 

Also interesting: the Board of Directors emphasized that they use safety-related performance metrics "when determining both Annual Incentive Pay and Long-Term Incentives." These metrics include, for example, "the requirement for executives to complete Boeing's Safety Management System training." This statement was intended to demonstrate Boeing's commitment to safety. (Sec. 3.7, paras. 5, 9, and 12)

The problem is, I think it demonstrates the reverse. Safety metrics in the bonus program? No! On the contrary, safety should be more important than any bonus program! Ironically, when you pay people for something, you cheapen it. At that point people start weighing one part of the bonus against another: Let's see, if I'm willing to give up a few dollars on safety, we can sell a lot more planes and by the end of the year the difference will more than make up for what I lost. Dollarizing the safety program is irresponsible if not worse. Safety should be non-negotiable, and paying people for it makes it negotiable. (I discuss this point in more detail in this post here.)

On the other hand, I understand why the Board of Directors would take this approach. To the man with a hammer, every problem looks like a nail. And it does seem like, in the last couple of decades, money is the hammer that Boeing's management has learned how to use. 



It's a long report. But I think it explains why Boeing has gotten into its present straits. From my point of view, the fundamental problems are all around system implementation. Boeing tried to create a new system, but went for the quick-n-easy approach rather than making sure the new system was fully implemented and integrated at all levels in the organization. As a result, people don't know what to do! Even people who want to do the right thing—and I firmly believe that this includes nearly everyone, nearly all the time—don't know how to do the right thing so that errors get caught, followed up, and fixed … and so that they themselves don't get in trouble for finding those errors in the first place.

Too much system can be as much a problem as not enough system. There's a balance and it always has to be pragmatic. I may have said this once or twice before now. 

__________

* I have participated in audits that were meant as gap analyses, for organizations that wanted ISO 9001 certification and knew they weren't ready yet; and the results of those were often far worse than this one. But it was no surprise because the organizations knew in advance they had a lot of work to do.

                

Thursday, March 14, 2024

The news just keeps coming!

I thought I was done writing about Boeing's current Quality problems, but the news just keeps coming and coming. Some of the stories simply confirm what we've already said about Boeing's current Quality culture; other stories talk about legal issues, and have less to do with Quality strictly understood. But one way or another, there continue to be a lot of them.

Here's a quick sampling of recent stories that I've found around the Internet:

It's an exciting time.

Ziad Ojakli, Boeing EVP
But the story I want to write about is a different one. In some ways it is smaller and quieter than the ones I just listed, but it sheds a helpful light on one of the least glamorous—but most critical!—of all the Quality disciplines. Yes, I'm talking about records control, and about how Boeing's records control system seems to have failed them at the worst possible moment.

The basic story is told by the Seattle Times here, and Associated Press chimes in here for corroboration. Briefly, it all started with the investigation into Alaska Airlines flight 1282, when a door plug blew out while the plane was in the air. The investigation revealed that four bolts were missing which were supposed to hold the door plug in place.

Why were the four bolts missing?

They had been removed to facilitate earlier rework.

Why was there rework?

There was damage to five rivets which had to be repaired. The procedure to repair those rivets required that the door plug be removed temporarily. Then after the repair the door plug was replaced.

Why weren't the four bolts replaced when the door plug was replaced?

… good question. Here the trail runs cold. The logical thing would be to ask the person who did the repair, but we don't know who that was.

Wait, what?? How can we not know who did the repair? Surely that information was captured as part of the repair documentation!

You would think so. But up till now Boeing has been unable to provide that documentation. And last Friday, Ziad Ojakli, Boeing executive vice president and the company’s chief government lobbyist, sent a letter to Sen. Maria Cantwell of the Senate Commerce Committee, saying, "We have looked extensively and have not found any such documentation." He added, as a "working hypothesis," that "the documents required by our processes were not created when the door plug was opened."

Let me repeat that, just to be clear:

  • Boeing's procedures require complete documentation of any rework, whenever rework is done. (So far, so good.)
  • But now they can't find the documentation for rework that was done two months ago.
  • The company's executive vice president is willing to tell a Senate Committee that he thinks maybe the documentation was never generated.

This is terrifying.

To be more exact, there are several possible explanations for this turn of events, and every single one of them is terrifying!

One possibility is that the documentation really wasn't generated for this particular rework. 

But in that case, what else are they doing that hasn't been documented? How could you ever know? (Hint: you couldn't.) And if you don't know what work has been done on an airplane, why would you ever be willing to fly on one again?

Another possibility is that the documentation was generated, but Boeing can't find it.

This raises the same fears. If you can't find your documentation, it might as well not exist. At that point you are totally unable to use the documentation: for example, to monitor trends, or to connect the dots between one failure and another. You can't do anything proactive, and you can't even do much that's reactive. All you can do is wait for the next plane to fall out of the sky.

And of course a third possibility is that Boeing is brazenly lying to a Senate Committee.

In some ways, I almost hope this last one is the answer. I would rather that a company like Boeing be competent, even while doing something villainous, than that they succumb to floundering ineptitude. At the very least, a competent villain is more likely to build planes that keep flying.

But if you make the conscious decision to lie to the Senate, it's because you are hiding something really bad. Nobody does that on a whim. And so, once again, I start to worry about "What else don't we know?"

Yes, of course there are other possibilities, but mostly I think they add filigree details to the ones I have already sketched out. Maybe the documentation was created, but then the guy who did the work snuck into the records system and destroyed it afterwards so he wouldn't get in trouble when flight 1282 lost its door plug in such a dramatic way. Or maybe his friend did it on his behalf. And naturally it's easy to understand why this guy would be afraid of being in the spotlight nationwide. What's not easy to understand—what is, in fact, flatly inexcusable—is why any company as big as Boeing would tolerate a document control system that could be subverted so easily by a single bad actor.

You keep documentation for a reason. And even when the documentation embarrasses you, it's better to provide it (and own up in public to your mistakes) than to hide it (and leave everyone imagining that things are even worse than they really are).
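
To see why records control carries so much weight, here is a minimal sketch of the kind of information a rework record has to capture so that questions like the ones above can be answered later. To be clear, the field names and values are my own invention for illustration; they are not Boeing's actual forms, systems, or procedures.

```python
# A minimal sketch of a rework record. Field names and values are invented for
# illustration; they are not Boeing's actual forms, systems, or procedures.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ReworkRecord:
    aircraft_serial: str                       # placeholder identifier
    date: str
    description: str                           # e.g. "rivet repair; door plug opened"
    performed_by: str                          # who did the work
    inspected_by: str                          # who verified it
    parts_removed: List[str] = field(default_factory=list)
    parts_reinstalled: List[str] = field(default_factory=list)

def open_items(record: ReworkRecord) -> List[str]:
    """Anything removed but never reinstalled is still open and needs disposition."""
    return [p for p in record.parts_removed if p not in record.parts_reinstalled]

# If a record like this exists and can be retrieved, "who did the repair?" and
# "were the bolts reinstalled?" are each one lookup. If it was never created,
# or cannot be found, neither question can be answered at all.
job = ReworkRecord(
    aircraft_serial="(placeholder)",
    date="(placeholder)",
    description="rivet repair; door plug removed and reinstalled",
    performed_by="(unknown)",
    inspected_by="(unknown)",
    parts_removed=["door plug", "retaining bolts x4"],
    parts_reinstalled=["door plug"],
)
print(open_items(job))  # -> ['retaining bolts x4']
```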

When I first started writing about Boeing's troubles (back in January) I tried to put those troubles in the best possible light by pointing out how few failures there have been (as a fraction of the total number of flights in a year) and by explaining that the whole point of a Quality Management System is to help you handle failures gracefully.

But document and records control is the single most basic element of any QMS. If Boeing never generated (or cannot find) rework documentation for a recent job, then their QMS fundamentally isn't working.

There is no way to tell this particular story so that it sounds good.

Photo from the National Transportation Safety Board


          

Thursday, March 7, 2024

Problem-solving is like breathing!

After a month and a half (or more!) of articles about corporate cultures and poor Quality choices, maybe I can afford to take a break and post something different. Yesterday I saw a delightful video by Jamie Flinchbaugh over on his JFlinch blog, about how problem-solving is like breathing.

Breathing?

Yes, exactly!

His point is that problem-solving is something we do all the time. And yes, we can learn to do it better. But that doesn't automatically mean we will always do it better when we aren't thinking about it intentionally.

On the other hand, if we practice being intentional about our problem-solving (or our breathing!) then yes, over time it can pay dividends in our daily lives as well. 

Here, listen to Jamie explain the point:


Robert Pirsig makes a similar point in Zen and the Art of Motorcycle Maintenance, after cataloguing a long list of "gumption traps" that prevent someone from doing good work. (He presents these in terms of doing mechanical work on your motorcycle, but in fact they apply to any kind of work you can think of.)

Some could ask, ‘Well, if I get around all those gumption traps, then will I have the thing licked?’

The answer, of course, is no, you still haven’t got anything licked. You’ve got to live right too. It’s the way you live that predisposes you to avoid the traps and see the right facts. You want to know how to paint a perfect painting? It’s easy. Make yourself perfect and then just paint naturally. That’s the way all the experts do it. The making of a painting or the fixing of a motorcycle isn’t separate from the rest of your existence. If you’re a sloppy thinker the six days of the week you aren’t working on your machine, what trap avoidances, what gimmicks, can make you all of a sudden sharp on the seventh? It all goes together.

But if you’re a sloppy thinker six days a week and you really try to be sharp on the seventh, then maybe the next six days aren’t going to be quite as sloppy as the preceding six. What I’m trying to come up with on these gumption traps I guess, is shortcuts to living right.

The real cycle you’re working on is a cycle called yourself.*

__________

* Robert Pirsig, Zen and the Art of Motorcycle Maintenance (New York: William Morrow, 1974, 1999), pp. 324-325.   

                

Thursday, February 29, 2024

The myth of the silver bullet

For the last few weeks we've been talking about corporate culture: in particular, about whether you can build a company's culture deliberately, and about how far that culture is implicated when things go well or badly. So it was through a delightful synchronicity that I recently ran across two very different sources which spoke to this topic in rather different ways.

The Patagonia case study

Building a culture ...

The first was a talk given by Carlos Conejo, LSSMBB, under the auspices of ASQ, about "The Patagonia Ethos." Conejo reviewed the outdoor clothing company Patagonia, and explained how they built a corporate culture deliberately and systematically. Back in the old days, when Yvon Chouinard (the founder) first started to make climbing equipment, he told customers they shouldn't expect quick responses during climbing or skiing seasons. Then, as the company grew, they introduced:

  • Flexible work arrangements
  • Casual dress code
  • Flat organization
  • No private offices
  • Health food in the offices
  • On-site daycare
  • Transparent communications to employees
  • Classes for employees on how to get involved in local, grassroots environmental causes 

100% of the electricity used by the company is from renewable resources. 

98% of the raw materials used by the company are recycled. 

If you have old gear from Patagonia, you can send it in and they will repair it. 

These principles make Patagonia's gear more expensive than that from their competitors, but customers gladly pay the higher prices because they support the company's mission.

Then in 2022, Patagonia transferred all its paying (but nonvoting) stock to the Holdfast Collective, "a nonprofit dedicated to fighting the environmental crisis and defending nature." The voting (but non-paying) stock went to the Patagonia Purpose Trust, "created to protect the company’s values." Chouinard described these transfers by saying, "Earth is now our only shareholder." (Interestingly, Robert Bosch GmbH has a very similar ownership structure.)

All of these steps have contributed to a clear and embedded corporate culture.

... but not a silver bullet

But it's not all roses. Conejo explained that one of the consequences of the company's pervasive informality was that for many years they were very weak when it came to formal planning, budgeting, and performance management. Then when it finally became clear that these activities were needed, they created a home-grown solution that lurched too far in the other direction. For a while, the business planning process took three whole months each year. Partly this was because—in the name of transparency—it engaged all employees at all levels clear across the organization. But many of these employees had no previous experience in (or even exposure to) business planning or the rudiments of project management. So the value of their input was compromised, or else they had to take the time out to learn the subjects they were contributing to.

Ultimately, Patagonia grew past these problems. They scaled back the planning process while continuing to emphasize openness and the development of their employees. But two overall messages were inescapable. 

First, culture is important but it is not a silver bullet. You need systems too. 

Second, every culture has its own failure mode. There is no "perfect culture"; each one has some strengths and some weaknesses. Which ones predominate is partly a matter of which circumstances the company faces.     

Boeing, again

All of which brings us back to Boeing.

In recent posts* I've suggested that Harry Stonecipher (Boeing President 1997-2001 and 2003-2005) deserves a measure of criticism for deliberately dragging the Boeing culture away from a focus on solid engineering and toward a focus on the economic bottom line. But the second source that I ran across a few days ago was a blog post that provided important insight into that transition. (See "The Myth Of Old Boeing," by Bill Sweetman.) 

What Sweetman makes clear is that Boeing, back in the days before Stonecipher took over, may well have had a solid culture; and the engineers were surely very smart. But their configuration-control system dated from World War Two! By any normal standards, Boeing should have been totally incapable of building airplanes for multiple customers** in the modern day. The only thing that saved them—for a while—is that they had low-ranking employees on the production floor who understood the archaic configuration system backwards and forwards, and who worked around it with heroic effort in order to get the planes built. But these were individual human beings. One by one they got old and retired. And we all know that any system which relies on heroes to get the job done will fail sooner or later.

This was the challenge that Stonecipher faced when he took over the company. Yes, he insisted that Boeing start thinking about the economics of profit and loss. And yes, in the end, it's possible that he went too far. But part of his motivation at the time was to drag Boeing—kicking and screaming—away from a configuration-control system that made factory production pointlessly expensive and mind-numbingly inefficient.

In other words: if it hadn't been Stonecipher, it would have been someone else. The only other alternative would have been for Boeing to collapse under the weight of its own inefficiency.

To repeat the two points above:

  1. Culture is important but it is not a silver bullet. You need systems too.
  2. Every culture has its own failure mode, and there is no "perfect culture."

For those of us in the Quality business, none of this should be controversial. In a sense, culture is about making sure that all your people are approaching their work in the right way. But Deming taught us years ago that "A bad system will beat a good person every time." That's why you need both. 

__________

* See specifically here and here.    

** A configuration-control system manages how changes or alternatives are introduced into a design. If you sell a single basic product to several customers, each of whom insists on their own unique package of options, you need a sophisticated configuration-control system to keep track of all the variations so that (for example) United gets airplanes tailored for United and not for American. By the early 1990's, Boeing's system for handling these variations was woefully out of date.   
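
If it helps to picture the bookkeeping involved, here is a toy sketch of the core idea. The option names and customer packages are invented, and a real configuration-control system tracks vastly more than this (change history, effectivity ranges, approvals); the sketch only shows how a common baseline plus customer-specific packages resolve into the exact configuration the factory has to build.

```python
# A toy sketch of configuration control: a common baseline plus per-customer options.
# Option names and packages are invented; real systems also track change history,
# effectivity ranges, and approvals for every variation.

BASE_CONFIG = {"engines": "standard", "galley": "standard", "avionics": "rev A"}

CUSTOMER_OPTIONS = {
    "United":   {"galley": "extended", "avionics": "rev B"},
    "American": {"engines": "high-thrust"},
}

def build_config(customer: str) -> dict:
    """Resolve the exact configuration for one customer's airplanes."""
    config = dict(BASE_CONFIG)                  # start from the common baseline
    config.update(CUSTOMER_OPTIONS[customer])   # layer on that customer's package
    return config

# The factory builds from the resolved configuration, so United gets airplanes
# tailored for United and not for American.
print(build_config("United"))
print(build_config("American"))
```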

               
