"DevOps to the Metal": Achieving “Flow” in a Large Organization and In Cyber-physical Systems
This experience report describes the transition of a large medical organization that develops computed tomography modalities towards more agile and lean approaches that improve the speed and the “flow” in the organization and in the products it produces. It describes the broad set of challenges faced, especially when developing complex software-intensive systems in a safety-critical and regulated medical domain. This practical case study details the concrete activities, insights, and “lessons learned” from this organization-wide transition, in a structure that allows people or organizations with similar challenges to apply them to their own organizations.
Furthermore, this experience report describes how we apply DevOps approaches not only to software, but to software-intensive cyber-physical systems. Minimizing system downtime during updates is vital for our customers, as is the robustness of in-the-field updates. In a world of complex cyber-physical systems of systems, that is far from an easy task. It requires early continuous * activities, including continuous deployment to in-house systems before the subsequent final release, as well as an extremely high transparency of the installed base through a regular feedback loop and data analytics methods.
Full transcript
Thomas Jachmann
Welcome to the talk about bringing DevOps into a complex environment of cyber-physical system of systems, into a large-scale organization and a regulated environment. Welcome to DevOps to the Metal.
I work at Siemens Healthineers. Siemens Healthineers is a leader in the global healthcare industry with impressive numbers. We provide the systems, software, and solutions to clinicians worldwide for the best possible analysis and treatment of our customers and patients.
I am from one of the segments, computed tomography, and you will find computed tomographs in almost every radiology department worldwide. Even today, we are a key lever for fighting the global pandemic, as computed tomographs are the best radiology equipment to analyze and treat COVID patients in the best possible way.
My name is Tom Jachmann. I am head of software of computed tomography. As you see in this picture here, computed tomography offers a strong portfolio of software and CT scanners. Most of these scanners are driven by a single software platform, which lets us bring the great quality and functionality of this platform to the whole fleet of scanners and to all of our customers.
This is the story about DevOps and what we did in our organization.
We are facing many challenges. Not only are we in a regulated environment, but we also have to cope with growing complexity, both in the world of our products and within our growing global organization. We are transforming from a hardware-only business into a world where software becomes more and more dominant, where the key selling point is no longer the rotation speed of the gantry alone, but what the software makes of the images received from the scanner.
Systems become more and more interconnected. You might have attended the talk at last year's conference where teamplay was presented in a global environment, but that is software only. We are incorporating not only teamplay elements, but many other elements that extend the boundaries of our systems.
We also have compatibility concerns between the parts of this large-scale system. The hardware, the software, the firmware, all the different elements are changing at their own pace. How do you get all of them together? That is where we see DevOps as a major answer.
We also see a fast pace of change, of course, when it comes to all the deliveries we are incorporating, be it open source software, off-the-shelf software, or components we incorporate from other Siemens Healthineers units. Last but not least, the innovation cycles are also accelerating dramatically. All of this cries out for fast feedback, for fully automated loops, and for building into our systems the ability to leverage the benefits of DevOps.
This, as you might have seen, is a picture of a typical deployment of a homogeneous cloud application like teamplay. If you go into the world of cyber-physical system of systems, the world looks different. Here you have not only the scanner by itself, but it consists of control systems and other PC-based elements. It consists of embedded solutions. You have a multitude of individual components associated with the very different elements of a scanner, and all are intelligent, and all are driving the complexity of a single step, which is continuous deployment, through the roof.
This is a typical DevOps picture, a continuous cycle of your various levels. In our world, we are in a regulated world. Of course, the healthcare business has to obey the rules of a regulated environment. We cannot drop and deliver our solutions to customers worldwide on a daily basis. There is a regulatory wall, and the regulatory wall prevents us from going to the end customer.
But it does not prevent us, it even encourages us, to go in a fast feedback cycle internally and to deploy and test in user-equivalent environments along the way, so that at the end, when we push the final button, the release becomes a non-event.
Behind the regulatory wall, of course, there needs to be diligent verification and validation. That is a formal part. But we can already make sure everything is right up front, so that the V&V phase is simple, quick, and does not reveal too many new findings.
This is where we started. We realized many years ago that we have to invest. The first thing we invested in was efficient software integration. The whole organization was growing, and so software integration became more and more the bottleneck of continuously delivering value. With the goal of making release a non-event, we also had to invest in continuous deployment.
Automatic system installation on such a scanner, with its many sub-components as you have seen, is not an easy task. It is not like a homogeneous cloud-based environment, where you basically have everything already in place to accommodate this. Here you are in a novel environment where you have to discover your own ways and solutions.
But even after being able to do so, we were still on a six-week cycle because many steps in between were still manual. So we invested even more. We came down to a daily, fully automatic cycle and the ability to deploy our software on each in-house CT scanner automatically, without any intermediate manual steps. This is when we could really call it continuous software integration. A huge step forward. A major leap, you might think.
But this was already the point where we almost lost our DevOps endeavor in our organization. Why? Well, we realized very quickly that we had high goals. We set KPIs. The KPIs basically matched our North Star. But we did not reach those KPIs.
So yes, we could automatically deploy on a scanner. Yes, we were very close to green. Yes, we were able to run fully automatic test suites on a multitude of scanners to see what happened in between. But those tests revealed issues, and we had to fix those issues. Even the steps in between revealed issues. We had changing software and changing hardware, and the two together did not allow us a daily green deployment at the beginning. So there were already some people saying it is not worth the effort. Let's rather focus on something else.
What did we do? We realized you have to create what we called plateaus: distinct steps where we found people, groups, and roles who were excited about the new ways of deployment and the new ways of integration. What they did, basically, was help us keep our DevOps spirit alive, because we realized benefits for them. We brought them into the picture. We made them stakeholders of this DevOps endeavor.
As you see, there might be developers in there, or development teams. There might be system engineers in there. Of course, in our system-focused DevOps environment, the system engineers play a vital role. Not all of those steps were fully automated at the beginning. But we had people behind every one of those layers, and every one of those layers was something people were working towards too, because they felt the benefit of it.
So what did we learn in these early stages? We have to have a balanced stimulus. We almost lost ourselves in the final details of automatically deploying to a scanner. Those details do not always come from the software side. A lot of times it was a hardware change or a change in the test environment that caused those problems on the pipelines. But that became the predominant observation of our DevOps endeavor.
What we realized: if you balance the stimulus so that you also aim for the next plateau once you have conquered the current one, even though the last final steps are not yet done, then you can sustain the spirit of DevOps. Do not rush to the next topic. Do not leave all of the unsolved topics behind, because this will leave you with a huge bag of technical debt at the end. But understand when you have really reached a plateau and when you have reasonably secured it. Then, while finishing those last steps, and this might take even a year or two more, set a new stimulus and a new direction to keep the spirits up.
The second topic where we almost lost the organization was that we set this North Star as a KPI, which always showed red because we had not reached the North Star yet. Of course we had not. But by doing so, we were also not able to visibly celebrate intermediate successes.
So the lesson we learned is that we had to create reachable, beneficial plateaus, as you saw on the previous slide. If they are reachable, you can celebrate the success, because you will get the success out of them. Here, even celebrating small victories is very important to keep the spirit of your organization directed towards your DevOps endeavor.
And remember, it is all about conducting experiments and measuring the results. Some of you might know the Cynefin complexity model, which you see in the lower right corner of the slide. There we have a simple environment and a complicated environment. Those are the ones where you already have established best practices.
But if you move into the complex space, you will realize there is no predefined solution to your problems. Even more, you cannot find one path from the start to the goal that you can plan diligently up front and then just walk to the end and be successful. There you have to subdivide. As Cynefin tells us: probe, sense, respond. You set up a KPI which guides you in the right direction. You run an experiment, you see what the outcome of the experiment is, and then you respond to the result. If it is leading in the right direction, great. If it is not leading in the right direction, you learn something new and you have to readjust. By doing so on a rapid basis, you are not wandering off too far in the wrong direction, and even the wrong direction gives you insight into your organization.
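The probe, sense, respond loop described here can be sketched in a few lines of Python. This is only an illustration, not the tooling used at Siemens Healthineers: the `Experiment` class, the experiment names, and the KPI values are all hypothetical, and the sketch assumes a single KPI where higher is better (for example, the fraction of green nightly deployments).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experiment:
    name: str
    run: Callable[[], float]  # applies a change and returns the measured KPI


def probe_sense_respond(baseline: float, experiments: List[Experiment]) -> List[str]:
    """Run experiments one at a time; keep a change only if the KPI improves.

    Assumes a higher KPI value is better.
    """
    log = []
    current = baseline
    for exp in experiments:
        measured = exp.run()           # probe: try the change and measure
        if measured > current:         # sense: did the KPI move the right way?
            current = measured         # respond: keep it and build on it
            log.append(f"keep {exp.name}: {measured:.2f}")
        else:                          # respond: revert, but record the lesson
            log.append(f"revert {exp.name}: {measured:.2f}")
    return log


# Hypothetical experiments with made-up measurements.
log = probe_sense_respond(
    0.60,
    [
        Experiment("pin firmware versions", lambda: 0.72),
        Experiment("parallel installs", lambda: 0.65),
    ],
)
print(log)
```

The second experiment is reverted because it is measured against the improved value of 0.72, not the original baseline: each step responds to where you currently are, which is what keeps you from wandering far in the wrong direction.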
Those are the lessons we learned in the very first stage. Let me give you one example. I am talking about the plateaus for continuous integration here. You can see this simplified chart of the number of tests we have in our platform. Yes, I could have shown you something like test coverage, but it would not have made this example so illustrative. So let's stick to this KPI for a moment, knowing there are more to watch. What you can see here is that the number of our tests is going up continuously, and we are in the range of 150,000. This is good.
The second thing, after we acknowledge the check mark of high auto-test coverage, is that if you look at the unit tests, they are the most numerous. Then come the integration and subsystem tests, so they really form a pyramid. You might have heard about test pyramids which have the shape of an hourglass, or even a test pyramid standing on its tip. But in our case, I think after redefining our test pyramid three times or so, we really reached the point where our test pyramid is a real pyramid. Wonderful. So we have a pyramid.
The other part: if you look at the execution time of all of those tests, you see we are at 80-plus hours of execution time for all of them. We knew this up front. As the number of tests grows, we continuously have to invest in a scaling infrastructure. And we did, and it worked.
But at that point, we almost lost our development teams, because we got alarming signals from the teams saying: we can still do our impact-based rolling tests without any problem, but normally every team runs a nightly build every night in which all of the tests run. The time those nightly builds required had grown, unnoticed in the beginning, until they did not fit into a night anymore. They grew to eight hours, 10 hours, 12 hours, while we continuously invested in our test infrastructure.
So we realized something was wrong. If you look very closely, you see what is wrong. The number of subsystem tests is very, very small. But at the same time, they consume almost 80% of the overall execution time. So we did something very, very good here. We created a test environment and a framework, and even a model-driven language for testing, that made the creation of subsystem tests very, very easy.
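A check for this kind of imbalance can be sketched as follows. The per-layer numbers are invented for illustration: only the rough totals (about 150,000 tests, about 80 hours) and the roughly 80% time share of the subsystem tests come from the talk. The `Layer` class, the `time_heavy_layers` function, and the `factor` threshold are hypothetical names and choices.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Layer:
    name: str
    count: int     # number of tests in this layer
    hours: float   # total execution time of this layer


def shares(layers: List[Layer]) -> Dict[str, Tuple[float, float]]:
    """Return (share of test count, share of execution time) per layer."""
    total_count = sum(layer.count for layer in layers)
    total_hours = sum(layer.hours for layer in layers)
    return {
        layer.name: (layer.count / total_count, layer.hours / total_hours)
        for layer in layers
    }


def time_heavy_layers(layers: List[Layer], factor: float = 5.0) -> List[str]:
    """Layers whose time share exceeds `factor` times their count share."""
    return [
        name
        for name, (count_share, time_share) in shares(layers).items()
        if time_share > factor * count_share
    ]


# Illustrative split only; the real per-layer figures are not given in the talk.
layers = [
    Layer("unit", count=120_000, hours=6.0),
    Layer("integration", count=27_000, hours=10.0),
    Layer("subsystem", count=3_000, hours=64.0),
]
print(time_heavy_layers(layers))
```

With these numbers the pyramid looks healthy by count, yet the subsystem layer holds 2% of the tests and 80% of the time, so it is the only layer flagged. Tracking both shares side by side is one way to notice the problem before the nightly build stops fitting into a night.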
But on the other side, people did not think too much anymore and simply created subsystem tests. Those subsystem tests require the startup of the full platform to be executed. So we thought we were almost done with the topic of continuous integration when we were sent back to the drawing board, because we realized we were missing something in the middle between the subsystem tests and the integration tests. Now we are investing here again. We are investing in things that mock and stub various subsystems and components, so that we do not have to start our whole platform to run the tests, because we realized we can no longer scale our test infrastructure as quickly as we would need to in order to run all of the subsystem tests.
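As a minimal sketch of the stubbing idea, assuming a component that depends on one subsystem through a narrow interface: the test below replaces that subsystem with a hand-written stub, so nothing resembling a platform has to start. Every name here (`Gantry`, `StubGantry`, `estimated_scan_time_s`) is hypothetical and does not describe the actual platform or its test framework.

```python
from typing import Protocol


class Gantry(Protocol):
    """The narrow interface the component under test depends on."""

    def rotation_time_s(self) -> float: ...


class StubGantry:
    """Stand-in for the real subsystem: returns a canned value instantly."""

    def __init__(self, rotation_time_s: float) -> None:
        self._rotation_time_s = rotation_time_s

    def rotation_time_s(self) -> float:
        return self._rotation_time_s


def estimated_scan_time_s(gantry: Gantry, rotations: int) -> float:
    """Component under test: knows only the Gantry interface."""
    if rotations < 0:
        raise ValueError("rotations must be non-negative")
    return gantry.rotation_time_s() * rotations


def test_estimated_scan_time_uses_rotation_time() -> None:
    # No platform startup: the stub supplies the subsystem's answer directly.
    assert estimated_scan_time_s(StubGantry(0.5), rotations=4) == 2.0


test_estimated_scan_time_uses_rotation_time()
```

A test like this sits between the unit and subsystem layers: it exercises the interaction across a subsystem boundary but costs about as much to run as a unit test. The trade-off is that the stub can drift from the real subsystem's behavior, so a smaller set of full-platform subsystem tests is still needed.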
So we could say we failed, because we went almost all the way to the end before realizing something was wrong. But in the end, we did not fail. We just uncovered the next level we need in order to reach the next plateau and move forward in our endeavor. The next plateau is, of course, that we continuously shift left even more, make sure that we keep our balanced test pyramid, and establish a new layer, so that the pyramid is balanced not only in the number of tests but also in execution time. I wish somebody had told us before.
DevOps brings transparency to your whole organization, regardless of which of those boxes you are opening up here. For this talk, I am looking very quickly at continuous deployment with you, and there you will find a lot of puzzle pieces which fit your DevOps exactly. You find other things where, at first glance, you say, hmm, what is this? But then you realize, yeah, if I tweak it, if I change it slightly, it works quite well.
But then you will uncover those skunks. You do not want to have them on your front porch, and you might not want to have them in your DevOps endeavor. But it is certain that you will find all the bad things of your organization underneath.
Now you can do two things. One possibility is that you turn away from DevOps because you say, no, sorry, that is too much effort. Or you embrace the change and say, it is important to uncover those issues. Even going beyond that: I am not only celebrating my successes, I am also celebrating the people and the failures. If you are doing so, you are well underway with your DevOps endeavor. If you do not, then you might be looking too short-term.
Let me share a story with you. I do not know what you are doing on a sunny weekend when you are chit-chatting with your neighbor. But it actually happened to me just a few weeks ago, when I was chit-chatting with my neighbor about system DevOps. He is working in an industrial environment, and he basically got the mandate from his boss to establish DevOps in his organization. But at the same time, the message was: here are at most 500 grand, and get the check mark behind DevOps within the next two years at most.
So I was trying to tell him that this is the wrong approach to DevOps. DevOps is not something where you can predefine that with exactly this budget and that time you will succeed. It is more about the journey.
So regardless of where you are coming from: you could have started in the test automation arena. You could have started from the quest to be a more efficient software development organization. You could have started somewhere in your software integration. This is mostly where we started, because we felt that integration in the software realm was becoming more and more of a burden. Or you could come from the product quality area. Or you could move towards shifting left your regulatory-required V&V activities, your verification and validation activities, and bringing them much closer to the software.
You could also look from the customer perspective and say, if you are in a clinical environment, every downtime of the scanner is a problem. You cannot turn off a scanner sitting in an intensive care unit or an emergency department and install an update there for 10 or 12 hours. So it is all about minimizing the downtime and maximizing how much your customers can rely on their computed tomographs. Of course, it is also about how smoothly the installations, the deployments, work for the customers. Do they have to do it manually? Do they have to do many individual steps, or is it basically updating automatically in the background for them, and they just acknowledge the change?
Regardless of where you come from, all roads lead to Rome, as the saying goes. If you want to reach Rome, with flow, continuous system integration, operations, the ability to continuously release with trunk-based development, with continuous deployment or continuous software verification, the way is not straight. It is not well-defined. It is not a paved road on which you can go full speed. It will force you to sidetrack. It will force you to take not only detours but maybe also loops. In some cases, you might decide to add new stories next to it, like digital twin.
But it always helps your organization to grow. In a continuous fashion, you will uncover and learn new things about your organization, and it will drive your ability and your willingness to continuously improve basically to the limits. After going through this DevOps journey, your organization will not look like it did in the beginning.
And yes, we in our organization are facing possibly more challenges than others who deploy in a homogeneous cloud-based environment. But this also means we can benefit even more from this journey, because we will see more elements working together much better than before, be it system approaches, be it deploying to our customers in a highly operative fashion.
So if you are willing to embrace the continuous change and you understand that DevOps is a journey, you will realize DevOps will lead you to a better organization.
Here are the key takeaways from my presentation. It is a culture and a mindset you have to tackle. The culture and mindset of continuous improvement is a very decisive factor. You have to have this in mind and also in the minds of your management. If your management is not buying into this as a journey, but they are looking forward to it as a check mark behind a topic, then you will struggle in between. Rather discuss it up front and then have a smooth ride in between.
It is also about establishing as many automated fast feedback cycles as possible. You have seen that some intermediate steps might not be fully automated in the beginning, but you will work towards it. Even if you are not doing continuous delivery to your customers, as you might also be in a regulated environment, the benefits coming from the internal cycles are so great that you can stick to it for the greater good of the organization.
You have learned that we are in a complex environment where probe, sense, respond is the only approach. Run experiments and measure. But put the right KPIs in place. Do not set your KPIs at the level where you are managing your North Star with them, or you will lose the ability to celebrate and have success stories on the way. It is okay to measure your North Star, but it is more important to have intermediate KPIs which show you that you are working for the greater good in the right direction.
Strive to meet the overall goals of the various involved roles. If the people do not feel what is in it for them, they will be lost on this long journey. Since it is a journey, the people need to be kept entertained, and the only way to entertain them is to create wins for them. The only way to create wins for them is by driving their specific goals, making them stakeholders, making them protagonists of your DevOps journey.
Accept that your approaches will not show the results you expected at the beginning. You will uncover those topics in between, like the skunks on your front porch. But by doing so and treating them correctly, celebrating them the proper way, your organization will grow on the journey towards DevOps.
Thank you very much for listening to my talk. I am happy to chit-chat with you on Slack and also hear your questions and try to answer them. But we are all seekers of the DevOps truth. So let's see if I can also learn a lot from you. Thank you very much.