Most causal analysis programs in regulated industries where I’ve worked or consulted (commercial nuclear power, the nuclear weapons complex, the nuclear Navy, the Army and Aerospace) have adopted a graded approach to causal analysis. That is to say, the level of effort expended to investigate a problem or incident is commensurate with the level of significance and complexity of the problem or incident. For example, most organizations use a grading system similar to the one below:
- Significance Level A: high significance, complexity and consequences requiring a rigorous root cause analysis and actions that will prevent recurrence.
- Significance Level B: moderate significance, complexity and consequences requiring a less rigorous apparent cause analysis, and it is accepted that corrective actions may not prevent recurrence.
- Significance Level C: low significance, complexity and consequences requiring no formal investigation and only limited actions to correct the immediate issue (“fix and forget”).
At most sites, the vast majority of issues captured in the condition reporting system are classified SL-C. These require little or no investigation and just enough corrective action to address the immediate issue. The SL-C condition reports are captured in the Corrective Action Program (CAP) database and closed out as quickly as possible. After reviewing the information in CAP databases for three decades in several industries, I’ve noted that most organizations are not taking advantage of this large pool of lower-level issues and incidents, which could give us insights into the deeper causes of the problems at our sites.
One of the premises of root cause analysis is that the same root causes that manifested themselves in the larger events are also causing smaller events. Take for example the case of automobile accidents caused by texting while driving. For every 1000 incidents of texting while driving, 990 may have only resulted in minor swerving or otherwise minor and inconsequential results. Perhaps in 9 of those 1000 cases there may have been moderate consequences such as hitting a curb or another car in a minor fender-bender. And in 1 of 1000 instances, it may result in a serious crash or even a fatality. The same root cause can have a variety of effects, depending on an infinite number of other variables.
So it makes sense to do a slightly better job of evaluating our SL-C and SL-B events, so that we can more accurately identify their causes, which can then be entered into a trending and analysis program. Until now, however, it has been difficult to conduct an efficient analysis of these low significance level events, or even a sample of them. The BlueDragon HCA’s Quick-hit Investigations can set up an investigation and develop lines of inquiry in minutes, and the analysis can be completed in an hour or two.
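To make the trending idea concrete, here is a minimal sketch in Python of tallying cause codes across condition reports at every significance level. It is only an illustration: the `ConditionReport` structure, the cause code names and the sample reports are all hypothetical, and none of it represents the BlueDragon software or any real CAP database.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ConditionReport:
    # Hypothetical record layout, not a real CAP database schema.
    report_id: str
    significance: str  # "A", "B", or "C"
    cause_codes: Tuple[str, ...]


def trend_causes(reports):
    """Count how often each cause code appears at each significance level."""
    counts = defaultdict(Counter)
    for report in reports:
        for code in report.cause_codes:
            counts[code][report.significance] += 1
    return counts


def recurring_causes(counts, min_total=3):
    """Return cause codes seen at least min_total times, most frequent first."""
    totals = {code: sum(by_level.values()) for code, by_level in counts.items()}
    frequent = [code for code, total in totals.items() if total >= min_total]
    return sorted(frequent, key=lambda code: (-totals[code], code))


# Made-up sample data for illustration only.
reports = [
    ConditionReport("CR-001", "C", ("inadequate_training",)),
    ConditionReport("CR-002", "C", ("missing_labels", "inadequate_training")),
    ConditionReport("CR-003", "B", ("inadequate_training",)),
    ConditionReport("CR-004", "C", ("housekeeping",)),
    ConditionReport("CR-005", "A", ("inadequate_training", "contractor_oversight")),
    ConditionReport("CR-006", "C", ("missing_labels",)),
]

counts = trend_causes(reports)
for code in recurring_causes(counts):
    by_level = counts[code]
    print(code, {level: by_level[level] for level in "ABC"})
```

In this made-up sample, the cause behind the single SL-A event had already shown up three times in SL-B and SL-C reports. A tally like this can only reveal that pattern if the lower level reports carry reasonably accurate cause codes in the first place.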
Let’s take the example of a fire in the tool crib in a building that housed nuclear weapons (part of our nuclear weapons complex). The Fire Marshal and the site team completed their investigation and concluded that the fire was started when the Tool Crib Attendant plugged a charging cord into the wrong end of a battery charger. The component overheated, starting a fire and igniting combustible materials that were nearby. Case closed.
However, site personnel had recently completed BlueDragon Hyper-Integrated Causal Analysis (HCA) training and decided to build an HCA chart using the same information. In about an hour, they laid out the timeline, identified the defenses (admin requirements and physical barriers) that should have prevented the event, and developed their lines of inquiry. The team then invited the affected parties to participate in the causal analysis, including the Tool Crib Attendant involved in the incident. Within a few hours, many new latent weaknesses came to light, the least of which was the attendant’s mistake in connecting the battery charger. For example:
- The normal Tool Crib Attendant was not on site, and the person involved was an untrained stand-in who did not know much about the tool crib and its contents. They were tasked with charging a battery without any knowledge of the process.
- There were different models of batteries, chargers and wires, with no labels that made it easy to identify which cables went with which chargers.
- The battery charging station was not isolated from combustible materials, as required. Had the battery charging station been properly isolated, the chances of any other materials in the tool crib catching fire would have been greatly reduced.
However, the group quickly discovered that the most significant aspect of the event had not been investigated. Working from the HCA chart, the evaluation focused on the part of the timeline where a building fire alarm pull station was pulled and the alarm did not sound. Luckily, the person ran to a different pull station that did work, initiating a building evacuation. What made this most significant is that the building stored nuclear weapons. The investigation quickly found that the subcontractor conducting the annual tests of the fire alarm pull stations was not actually testing all the pull stations, but was signing off that they were all tested. The potential consequences of not finding this problem were quite large. Consider a fire that could rapidly spread to rooms that contained nuclear weapons: if we could not evacuate personnel from the building fast enough due to inoperable fire alarm pull stations, a weapon could go off (a “dirty bomb”), resulting in serious consequences to the workers and the site.
New corrective actions were put in place to address this most important aspect of the event, which was not broached in the initial investigation, along with the other latent weaknesses identified in the HCA. At the same time, any discussions about taking punitive actions against the alternate (untrained) Tool Crib Attendant faded rather quickly. The value of having a structured, well-organized methodology like HCA to evaluate incidents such as this one became apparent.
That scenario is repeated across many industries every day. Weak causal analyses that stop too soon are one of the main reasons that regulated industries experience recurring events and even fatalities. I estimate the cost of these human-centric problems (i.e. problems that cannot be solved by math and science) to be in the hundreds of billions of dollars across our regulated industries. The regulations of any industry that has to protect human life contain language requiring that significant problems be investigated to identify their causes, and that actions be taken to prevent recurrence.
But, as mentioned above, industries have taken the graded approach whereby only those problems or incidents classified as SL-A will actually get the attention of a root cause investigation. And unfortunately, at many sites, the classification of events is pushed down to SL-B and SL-C because of the time and resources that it takes to conduct a thorough SL-A root cause analysis. And many organizations wonder why they have recurring events and why their corrective action backlogs are so large.
The lesson learned is that we can investigate events of lower significance and identify latent weaknesses that may eventually cause more significant events, such as having 20% of the fire alarm pull stations at your facility not working when you need them.
BlueDragon Tip of the Week: When investigating problems or incidents of lower significance and complexity, take just a few extra minutes to lay out the information you have collected into a BlueDragon HCA chart and generate evidence-based lines of inquiry. Then, in as little as one session with interested parties in the room, answer the lines of inquiry and conduct a rapid cause and effect analysis. By using a structured, organized approach, you will arrive at deeper and more meaningful causes that can be addressed to prevent future incidents.
For more information on how to solve complex, human-centric problems, visit us at: https://bluedragon1-ips.com/