Why DevOps is Like Firefighting at Sea
October 1993, 08:30 pm, somewhere on the Atlantic. The fire alarm goes off, beeep beeeep: fire in the back engine room!!
A few hours later everything is under control and calm again. In 1993 I was part of the fire attack crew; in 2022 I'm working as a DevOps coach and realize that there are many similarities between DevOps and firefighting at sea. By telling the story of the fire and explaining all the steps involved, I will show you the similarities to modern IT teams, share some of the practices that helped us take control of the situation in minutes, and explain why we should also apply them in our DevOps teams, because s#!t will happen one day and we had better be ready for it.
Full transcript
The complete talk, organized by section.
Danny Higler
Great. I think everybody's in, so let's get started.
Actually, I'm going to take you back in time, almost 31 years ago, somewhere on the North Atlantic Ocean, on board of a Navy frigate called [inaudible]. And there's me, 19-year-old me, having dinner with some friends and colleagues on board of the ship, when suddenly all the ventilation drops.
I'm not sure if you're aware of it, but if you are 24/7 surrounded by the same background noise, you don't hear it anymore unless it drops. Then it will be silent. And when it was silent, there was an alarm. Beep beep: "Fire. Attention. Here is the technical control room. There is a fire in the back engine room. Fire in the back engine room. This is not a drill."
For a second, everybody was sitting motionless. And then, all of a sudden, everybody dropped their gear, what they were doing, and moved to their assigned positions. We had people setting up medical positions. There were people running to do boundary cooling around this engine room: if there is a fire, we need to cool the surroundings down to prevent it from spreading.
My position was what we called a fire post. I was the third attack crew, so I ran to that position. When I arrived there close to the back engine room, I met my buddy Pete. Together with Pete, I was the one to jump into the fire. We were helped getting dressed in these fireproof or fire-resistant suits, having breathing equipment on. Some guys already rolled out the hoses, and we had a high-power water shield and a high-volume foam cannon. We were almost ready to go into that engine room behind us.
There was a petty officer with a thermal camera and a radio connection to the technical control room. We were ready to go into the fire, into the back engine room. Before we continue, I would like to go back a few minutes before that and explain a little bit about how it looks on board of those frigates.
This is technology 40 years old, so to modern standards it looks different. But there is a control room. It is set up, it is manned 24/7: some able seamen, some petty officers. Whatever happens on the ship is displayed there. On the side you see a map of the ship, a complete layout of the ship. Every compartment, every room has a fire indicator. So whenever a fire or smoke alarm goes off, there is a sound alarm and a little red light blinking: boom, boom, fire.
What they do first is they send somebody out. The guy sitting in front of it will grab a radio, grab some fire extinguishers, run to the back engine room in this case, and see what is going on, what will happen. You arrive there, and there is a hatch on the floor. If you open it, it might be a big fire. You never know. So touching the hatch: it is still cold, so there is probably not a big fire inside. You open it up, but a lot of smoke is coming out. So he immediately knew, this is something serious. There is something going on in that back engine room. He radioed back to the technical control room, and they sounded the alarm. They stopped the ventilation and sounded the alarm. That is what happened there.
So I was with my buddy Pete on the way back to the engine room. It is good to know that if there is a fire on board of a Navy frigate, or any ship actually, there are three waves of attack. Maybe more is needed, but three initial waves.
The first one is the first person on the scene carrying this manual equipment. But if there is smoke, he cannot do anything. Then there is a second attack. The second attack is the people gearing up quickly with some gloves, a balaclava, but also portable fire extinguishers and breathing apparatus, so they have some breathing protection, oxygen on their back. They can go into the fire room, and they did. Before we entered, two guys had already gone down into the back engine room trying to extinguish the fire.
Me and my buddy Pete entered. The hatch was already open. Slowly we went down the stairs, having a guy with a thermal camera behind us, staying in close contact with the technical control room. He was constantly updating us: "I don't see a fire. I see a heat source, but it is not hot enough to be fire. There are also two persons. They are still there. They are moving." Great. Good for us to know that the second attack crew was at least still alive and was doing their job.
We went through the smoke, and finally we came to the position where the heat source was. Two guys were there still working. We shouted and screamed, "How are you doing? What's going on?" They turned around and said, "It's okay. It's controlled. Not a big fire. We extinguished it. So you're here for nothing."
Actually, it is good. My heart rate was already 180, I think, going into that fire. But those two guys already extinguished it. Two hours later we vented out the room. It turned out there was a bearing of, I think it was a port-side shaft. There is a lot of grease there. It overheated. Heat and grease is smoke, maybe some fire, a small one, but nothing too serious. Two hours later we were having a beer, sitting down, and said to each other: wow, that was exciting, but we dealt with it. We made it work.
Let's do a bit of fast forward, because almost 31 years later I was on an assignment. I am an agile coach and I work in these different environments, and there are IT incidents happening.
One of those incidents, and I am pretty sure you will know it: something is going on, we are having a production issue, and first thing is, yeah, there might be something wrong. There is an indication. People already started calling us. Let's call John; he probably knows it. Well, John is on safari, on a rural Africa trip. We cannot reach him. Let's get to the team. Yeah, which team? Okay, we find a team and maybe they can deal with this problem. We need to access the logs. Yeah, but I do not have access to them. Actually, I do not know where they are.
So I was thinking to myself, why could it be that 30 years ago we had a crew of 19-, 20-year-olds - actually, I have been there, you would not say it, but I think it is also 30 kilos and 30 years - why were we able to solve that problem quickly in a structured way? And today, when there is a relatively simple IT incident, we are not able to deal with that. What is happening there?
I kept on thinking about that, and in no particular order I came up with a few, for me, valuable lessons which I think I can share in this presentation with you.
One of the most important reasons: fire prevention on board of a ship. It is all about scheduled maintenance. We make sure that all the machines are there, they are in tiptop order, they are repaired, they are running smoothly. There is a Boy Scout rule: clean up any debris. Actually, back then we did not call it Boy Scout rule. Just clean up your mess. If you see something which is not correct, clean it up. If you see something leaking, fix it. If you cannot fix it, report it and then fix it. Kind of basic stuff, but it is important to remember. Any risks solved immediately.
How does that relate to our technical environment at the moment? Make technical debt a priority. So often I see backlogs with incidents and leaks we know about, but we are not really fixing them. No time. It is not important for us. Know your system and potential risks. Map them, do something with them, at least make sure you are aware of them. Peer reviews as your first line of defense. It is all about preventing incidents from happening. I think we can learn a lot from that. That is an important one.
Another one: train. When I joined the Navy, I had a six-week training course on how do you fight fire? How do you deal with that? On the basic training ground in the port, they have a simulated ship. You can go in there and they can react to any kind of fire, any kind of even flooding of the ship. We can train it there.
On board, we had different fire drills. Still up to the day, there is FOST, Fleet Operational Sea Training. It is a crew of British Navy engineers who just enter your ship. They stay there for a couple of weeks and they act like Chaos Monkeys. I am pretty sure you all heard about Chaos Monkeys, but they literally break stuff or set it on fire - not real fire, but set up smoke pots or disable machines in the middle of the Atlantic Ocean. They stop your energy, your propulsion. That is Chaos Monkey hardcore. But it works. It makes you aware of what you should do. If you practice in a safe environment, we can deal with the real situations, knowing what we should do.
I think in our IT environments, practice in production-like, maybe even in production environments if you are really good, with multiple teams, not only one team, but make teams work together on that. Set up the schedule for that basic training in the sandbox environment. I think it was Jennifer this morning with a really good presentation on Sink or Swim, the importance of training and why it matters. Do that. And then have the guts, or trust, to go Chaos Monkey.
Who is practicing incidents at the moment in an IT environment? Can you raise some hands if you are practicing IT incidents? Nice, nice, but not a lot. I see three, four hands. The rest, who would like to practice? Why aren't you doing it? Maybe ask him what he will do if his ship is on fire. So crossing your fingers is not really a good strategy. "Oh, something happens, I think we will manage." I am pretty sure you will not if you do not practice. So set it up and do something with it.
Having the right people, that is an important one as well. On board of a ship, it is top-one priority. The first few weeks on the ship, you will learn how the ship is built up: the different decks, the compartments, which rooms are there, what are the dangers of each room, where ammunition is stored, what is an engine room, what are all the hazards. Getting to know the ship is top priority.
Anybody on board of the ship could do attack one or two. They know where the portable fire extinguishers are. A few specialists trained could also do attack three, and even a Halon attack, meaning just close off a complete engine room, fill it with gas. It is not allowed anymore, I think, but it used to work pretty well. Data centers were also using it, not anymore.
In your team, make sure every team member, or people in all the teams you have, have the basic knowledge of the system setup. Invest in it. Train them there. Make sure that the tools you are using, everybody can use them and know how to operate them and where to find them and how to find those logs. The majority of the incidents should be solved by the team. There are always specialists you can reach out to, but put your teams in such a status that they can deal with at least most of the problems they run into. That will help you there.
Trust, that is so important. Even in a previous presentation in this room, they talked about trust. On board of a ship, it is my life if fire is involved. And it is not only on a ship, it is also fire departments. I am trusting you with my life, with the rest of the team. So I must trust you and you must trust me. That goes two ways. It is also about following orders, almost blindly sometimes. Literally, if you are in the smoke and you do not see anything, and somebody tells you the fire is to the left, I have to trust him it is there.
Building trust in your IT environment, in your IT teams or your DevOps teams: start building a team, invest in it, make it grow. Do things together, practice together, train together. Make sure you can trust each other on your capabilities and your skills. Spend some time on helping and teaching each other. Pair programming not only for the sake of pair programming, but really: I want to help you so in case something happens, or we have an incident, you are also able to solve it.
Also in Jennifer's talk, you saw this nice quadrant she wrote where junior people are put on a certain level until they can also do the standby services. That would really help you there.
Important: if we have an emergency, we need some leadership. That can be leadership on a team level. That could be leadership on a department level, organizational level. [In firefighting] there is an ultra-short and clear line of command, but also with the freedom for the people really going into that engine room to do whatever they thought was good and make sure the fire goes out.
Lead with intent. Turn the Ship Around!, David Marquet, great book also about the same topic, Navy ships - submarine in this case - but leading with intent. The attack crew could make their own decisions. Once we are in that engine room, it is up to us how to attack that fire. It is up to us to make sure it is extinguished.
Also in the environments, be clear on purpose, method, and end state if there is an incident. The purpose is: we want to get back in business as soon as possible. The method could be: you can use some workarounds. And the end state should be: in an hour from now, we should be able to serve at least 10% of our traffic.
It could be other circumstances. The purpose could be: we want to have a fully operational service again. The method could be: be precise, no shortcuts, just do it right. It might take a little bit longer. And the end state is: we have a 100% running service again, end of the day hopefully. So it could differ per situation, but be clear on what you want. What does it look like? Are you allowed, in the method, to take shortcuts or not? Or do you want to be more precise?
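One way a team might write this down before an incident is as a small, explicit record. The following is only a sketch in Python: the names `IncidentIntent`, `Method`, and `briefing` are invented for illustration and do not come from the talk or from any incident tool. The two example values are the two situations described above.

```python
from dataclasses import dataclass
from enum import Enum


class Method(Enum):
    """Hypothetical: whether responders may take shortcuts."""
    WORKAROUNDS_ALLOWED = "workarounds allowed"
    NO_SHORTCUTS = "be precise, no shortcuts"


@dataclass(frozen=True)
class IncidentIntent:
    """Purpose, method, and end state for one incident (illustrative only)."""
    purpose: str
    method: Method
    end_state: str

    def briefing(self) -> str:
        # Three short lines that can be read out at the start of an incident.
        return (
            f"Purpose: {self.purpose}\n"
            f"Method: {self.method.value}\n"
            f"End state: {self.end_state}"
        )


# The two situations from the talk, written as intents.
quick_recovery = IncidentIntent(
    purpose="get back in business as soon as possible",
    method=Method.WORKAROUNDS_ALLOWED,
    end_state="in an hour from now, serve at least 10% of our traffic",
)
full_restore = IncidentIntent(
    purpose="have a fully operational service again",
    method=Method.NO_SHORTCUTS,
    end_state="a 100% running service again by the end of the day",
)

print(quick_recovery.briefing())
```

The point of the sketch is that the people going in decide how to get there; the record only fixes what "done" means and whether shortcuts are acceptable.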
Decentralized decision making if possible. The people in the teams have the knowledge, they know the context. Trust them to get the job done. Trust the teams there.
Sorry for my cough, got something with my lungs going on and sometimes I have to cough.
Having the right tools: specialized tools for different environments. What really made me think last time is that even on board of this ship, the people using the tools are also the people maintaining the tools, making sure that all these fire extinguishers are checked every month or something, that the pumps are working, that the valves are working. The same people operate that equipment, make sure to take care of the maintenance of those, and know how to use the tools.
An important one: having access to all the rooms and spaces. On board of the ship, there are some rooms you cannot access. They are locked. It is with ammunition, or there is some secret stuff going on. Sometimes on Navy ships, you cannot enter all the rooms, but the petty officers in charge of the technical control room have a master key. Whatever happens, if there is an alarm, grab the master key, run to it, and you can enter everything.
How about the tools the teams are working with? Having the right tools for logging, monitoring, make sure they know how to use them, and trust them, and have tested identity and access management. Make sure that in case of an incident you can enter, you can reach your destination, you can view the logs. I have worked with quite some teams. I said, "No, we have it all in control." Have you tested it? "Not really." Try it. See what happens. "We cannot access." Okay, make sure you can come in. It will help you in time of an emergency and make sure you make the right decisions there.
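The "have you tested it?" question can be turned into a small recurring drill. The sketch below, in Python, is an assumption about how such a drill could look: `run_access_drill` and the check functions are hypothetical, and the sets of names stand in for whatever your real identity and access management would answer.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical: a check returns True if the responder can reach the resource.
AccessCheck = Callable[[str], bool]


def run_access_drill(
    responders: List[str],
    checks: Dict[str, AccessCheck],
) -> List[Tuple[str, str]]:
    """Return (responder, resource) pairs where access failed."""
    failures = []
    for person in responders:
        for resource, check in checks.items():
            try:
                ok = check(person)
            except Exception:
                ok = False  # a check that crashes counts as "cannot get in"
            if not ok:
                failures.append((person, resource))
    return failures


# Stand-in data for illustration; a real drill would query the real systems.
log_readers = {"alice", "bob"}
dashboard_users = {"alice"}

checks = {
    "production logs": lambda person: person in log_readers,
    "monitoring dashboard": lambda person: person in dashboard_users,
}

failures = run_access_drill(["alice", "bob", "carol"], checks)
for person, resource in failures:
    print(f"{person} cannot access {resource}")
```

Running something like this on a schedule, rather than during an incident, is the equivalent of checking that the master key actually opens the doors.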
What else? Monitoring. 24/7 monitoring on board of the ship. It is always manned, looking out. PagerDuty, in front of me, could send you the alarms. Every alarm is treated seriously. So if there is an indication for a fire, we treat it like a fire. How many times did I see a red alert going off on dashboards and people saying, "Ah, it is probably nothing," or "That alarm is always giving wrong information. It goes off, but it does not matter." Do not make it red, right?
There are scenarios. Whatever happens, if we have a fire in that room, this is how we should deal with it. If we have an incident, helicopter ditch, this is how we should deal with it. Make sure you think of scenarios. What if we have incidents in our production environment? What if we have incidents in just one data center or two? What if our connections drop? I do not know. Create scenarios and work on them. Measure, and be sure the measurements count. Measure everything. If it does not matter, and you would say, yeah, there is a red alarm going on but it does not really matter, ignore it. At intervals, rethink your scenarios. It is good to see: has something changed? Should we change something else? Is the environment different? Adjust, plan, and adjust.
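If every red alarm is supposed to be treated like a fire, it helps to find the alarms that people have learned to ignore. The Python sketch below is one possible way to do that from a record of past firings. The function name `noisy_alerts`, the thresholds, and the sample alert names are all assumptions made for illustration, not something from the talk or from a specific monitoring product.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def noisy_alerts(
    history: Iterable[Tuple[str, bool]],
    min_fired: int = 5,
    max_action_ratio: float = 0.2,
) -> List[str]:
    """Find alerts that fire often but are rarely acted on.

    history holds one (alert_name, was_acted_on) pair per firing.
    Thresholds are arbitrary defaults chosen for this example.
    """
    fired = Counter()
    acted = Counter()
    for name, was_acted_on in history:
        fired[name] += 1
        if was_acted_on:
            acted[name] += 1
    return sorted(
        name
        for name, count in fired.items()
        if count >= min_fired and acted[name] / count <= max_action_ratio
    )


# Made-up firing history: (alert name, did anyone act on it?)
history = (
    [("disk-latency", False)] * 9
    + [("disk-latency", True)]
    + [("checkout-errors", True)] * 4
    + [("cert-expiry", False)] * 2
)

print(noisy_alerts(history))
```

Anything this returns is a candidate for the periodic review: either fix the cause, or stop making it red.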
So these were some of the thoughts I had while thinking on why was it that 30 years ago we could deal with it, and now not, with incidents? Is there something else we can do? Just some of the things that came up in my mind.
As a wrap-up: you know incidents will happen. You know that whatever you do, whenever you are working, incidents will show up. Be ready for them. Be prepared. Do not cross your fingers and hope they will pass or the impact will not be that high. Just make sure you are ready for those impacts. Invest in your people, invest in your tools, and trust the teams to get the job done. That is one of the most important messages I want to give to you.
As a final remark for this conversation or this talk, this is my experience. This is my experience 30 years ago. But I am pretty sure that all of you will have a similar kind of experience, maybe not firefighting at sea, but you could be part of a team or a group of people where you think, with this group of people, we could perform magic. We could do anything. We could solve any problem. Think back and try to remember what were the circumstances there? What gave us that feeling, or what made us achieve such good results? Try, if you can, to recreate them as leaders of your teams or your organizations. Can we recreate that, or can we learn something from that?
Thank you.