Fear to Hope - How HCSC Became Nimble Through Experimentation During Peak Demand


US 2021


Does seasonal demand affect your ability to implement change? From insurance to retail to entertainment, peak demand can limit an organization's capacity to take advantage of one of the best times to run experiments and learn from them. In this talk, attendees will hear how one of the largest insurance companies in North America evolved to conduct experiments during their open enrollment season.

Focused on business results, HCSC teams addressed missing feedback loops in their delivery engine so they could be more responsive to insights gained from customer experiences. They acted on this feedback with fast experiments in feature delivery, which made it easier and more compelling for customers to sign up for HCSC’s plans. From frozen to flowing, HCSC’s strategy set teams up to be nimble during peak season.

This session is presented by Tasktop.


Full transcript


Tashfeen Mahmood and Dominica DeGrandis

Tashfeen Mahmood: Hello, my name is Tashfeen Mahmood, and I am the Senior Manager of DevOps Engineering at HCSC.

Dominica DeGrandis: Hi, everyone. I am Dominica DeGrandis. I am Principal Flow Advisor at Tasktop.

Tashfeen Mahmood: Today, we will tell you the story of how we at HCSC evolved our mindset about implementing production change during peak demand season. For us, that is our open enrollment period.

We will discuss how we went from a fear of making change in production to being hopeful about the learning that we will get from it; how we went from a place where changes were frozen to where business value started flowing to production; and how we went from being distant from our business to having shared business goals and objectives.

Here is an overview of HCSC by the numbers. HCSC is Health Care Service Corporation, but you may know us as Blue Cross Blue Shield. We operate in five states: Illinois, New Mexico, Oklahoma, Texas, and Montana.

It is funny, because a lot of times when I tell people that I work at HCSC, it does not get much of a reaction. But when I tell them that it is Blue Cross Blue Shield, I get all kinds of expressions on their faces.

We are a large company, with about 24,000 employees, and we serve 17 million members. That is 17 million souls for whom we are the healthcare insurer. We are very proud that we have been named one of the World's Most Ethical Companies for five straight years now.

That is us. We are a large, conservative healthcare company, and that is going to play into some of these slides coming up next.

Tashfeen Mahmood

Tashfeen Mahmood: I am going to tell you the story of our IT journey, and I will cover its milestone moments in more detail in the slides that follow. At a glance: first, our big IT transformation in 2016. Then we will focus on the open enrollment period, which is our period of peak demand. It is like Black Friday for a retail company, in that our systems are under the most stress during this time.

In that context, we will discuss our change management practices and the various types of production freezes. Then I will tell you how a zero-downtime deployment completely changed our perspective on what we could learn, how it led us to start measuring product metrics, and how it shifted our mindset toward experimentation.

Before 2016, we did not focus much on DevOps practices. Deployments were usually handled by an enterprise team: we would email them our deployment files or put them into a drop folder, and they would pick them up from there and, in a lot of cases, deploy them manually.

Things were not really great, and our leadership, especially our CIO, saw the problem with that and wanted to do an IT transformation. So we set about doing a product and DevOps transformation.

In the first stage, new roles were defined and a new product-based organizational structure was created. These products were defined as collections of applications, and the roles were aligned to those products.

In the second stage, Agile practices were defined, modern DevOps tools were selected, and automated pipelines were built. Then we trained the practitioners aligned to those product teams on Agile and DevOps concepts and practices.

With that in place, these practitioners were able to start running sprints: they worked in a Scrum model and began observing Agile ceremonies. That is pretty much where we are today. We have kept improving since then, adding more automation and making more process updates.

As great as that transformation was, it still left some challenges. The first was that the transformation was centrally driven. It was top-down, and our practitioners felt that the change was done to them rather than for them. That made it really hard for them to take ownership of the new model.

The second challenge was that this was only an IT transformation. The business was not really involved. Work from the business still came to us as projects. We in IT would map it onto our products, do the work, and then bundle it back into projects to deliver it. This meant we were still pretty far from our business.

The third challenge was that our organization was complex and highly matrixed. As you can see on the right-hand side, we have product management teams that sit in product lines and technical teams that sit in resource pools, and together they form a product team. That makes for a complex, interdependent organizational model, and the teams working within it create complex, interdependent releases.

That fed into our change management mindset, which was the fourth challenge: a mindset of avoiding change. Because releases were so large and complex, we did not want to take on that pain any more than we had to.

This was especially true during our period of peak demand, which again is open enrollment. Our enterprise goals prioritized stability, and around open enrollment especially, they focused heavily on production reliability, with goals like "production is job one" and "make sure production is stable."

However, this focus was not balanced by any goals that prioritized learning and fast delivery of business value. So, up until 2018, we implemented an enterprise-level production freeze during open enrollment. No change was allowed unless absolutely necessary.

In 2019, our change management policy evolved a little bit. We allowed only necessary changes to production. However, we added quite a bit of rigor. Questions like, "Why can this change not wait until after open enrollment ends?" were asked with every release.

To make a change in production, you had to show that it was absolutely necessary and that, without it, disaster was looming. So, yeah, we had the production freeze, then the production slush, anything to discourage change.

Tashfeen Mahmood and Dominica DeGrandis

Dominica DeGrandis: So what was the difference between the freeze and the slush? Was it subtle? Did it evolve? Or was there one incident that led you to allow the slush?

Tashfeen Mahmood: Yeah. That is a good point, Dominica. I think a lot of the reason behind the slush was that we kept looking at our change management analytics and finding that, no matter what, the change volume was still there. People wanted, or needed, to make production changes.

The production freeze really was not working, however hard we tried. That is why we moved to a model of: let us just add more rigor and get done only the changes we absolutely need to get done.

Dominica DeGrandis: Great. Thank you.

Tashfeen Mahmood: In 2020, we had what I call a backward story, and this is what I mean by it. A lot of DevOps transformation stories begin with the business needing to move forward, or needing to deliver faster to compete in the marketplace, et cetera. Not ours.

Especially during open enrollment, our business was sold on the idea of keeping production stable by not changing it. In the past, production changes were scheduled for Friday night. That way, if something went wrong, we had the whole weekend to recover.

In 2020, however, my DevOps engineering team worked with our retail team to start open enrollment by making a zero-downtime deployment. This was not a trivial change either. This was the change that started open enrollment, and we did it on a Wednesday afternoon. That raised some eyebrows.

This meant that we could use a blue/green deployment setup as a fallback. If something went wrong during a release, we did not need the whole weekend to recover; we could just switch back to the blue environment.
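A minimal Python sketch of the blue/green fallback described above. It assumes a toy in-memory router with two environments, only one of which is live at a time; the `BlueGreenRouter` class and its methods are hypothetical, and a real setup would typically flip traffic at a load balancer or DNS layer rather than in application code. What it shows is that rollback is just another traffic switch, because the previous release is still running on the idle side:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class BlueGreenRouter:
    """Hypothetical toy model: two environments, exactly one serving traffic.

    Real deployments usually switch traffic at a load balancer or DNS layer;
    this only illustrates the control flow of deploy, switch, and rollback.
    """

    versions: Dict[str, Optional[str]] = field(
        default_factory=lambda: {"blue": "v1", "green": None}
    )
    live: str = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version: str) -> None:
        # Install the new release on the idle side; users keep hitting `live`.
        self.versions[self.idle] = version

    def switch(self) -> None:
        # Cut all traffic over to the idle side in one step (no downtime window).
        if self.versions[self.idle] is None:
            raise RuntimeError(f"nothing deployed to {self.idle}")
        self.live = self.idle

    def rollback(self) -> None:
        # The previous release is still running on the now-idle side,
        # so backing out is just another switch.
        self.switch()

    def serving(self) -> Optional[str]:
        return self.versions[self.live]


if __name__ == "__main__":
    router = BlueGreenRouter()
    router.deploy("v2")      # v2 lands on green while blue keeps serving v1
    router.switch()          # green (v2) is now live
    print(router.serving())  # v2
    router.rollback()        # something went wrong: back out to blue
    print(router.serving())  # v1
```

Keeping the old release running on the idle side is what turns recovery from a weekend-long redeploy into a single switch.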

The risk of making changes in production was now significantly lower, and our business saw the potential and wanted to deliver faster. In other words, we built a capability first; then our business saw its potential and now has a thirst for speed, so to speak.

Dominica DeGrandis: Yeah, and also a little bit more trust was there too, since you were able to show them that you could do this zero-downtime release and they caught on: oh, yeah, maybe we can get a bit of change out there and accept a wee bit more risk.

Tashfeen Mahmood: Yeah. Nothing like building trust by actually showing that you can do it. You can talk all you want about doing something, but when you make a change and when you start something as consequential as open enrollment with a zero-downtime release, that really gets attention.

Earlier this year, in 2021, we started looking at our product metrics. This is a snapshot of the metrics for one of our typical products. You can see here that during the year-end freeze, delivery slows down. In the rectangular box on the top-right graph, as soon as open enrollment starts, sure enough, velocity goes down. We are delivering less to production.

But the rectangular box at the bottom shows that our work in progress continues to go up. What was happening was that we were not making changes in production, but we were still working on those changes. Our work in progress kept going up; it just was not going into production.
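The WIP pattern above follows from simple arithmetic: work in progress at the end of each period is the running total of items started minus the running total of items finished. A small Python sketch, using a hypothetical `wip_series` helper and made-up weekly counts (not HCSC data), shows how WIP keeps climbing when teams keep starting work while a freeze throttles delivery:

```python
from itertools import accumulate
from typing import List, Sequence


def wip_series(started: Sequence[int], finished: Sequence[int],
               initial_wip: int = 0) -> List[int]:
    """WIP at the end of each period: initial WIP + cumulative started - cumulative finished."""
    if len(started) != len(finished):
        raise ValueError("started and finished must cover the same periods")
    return [initial_wip + s - f
            for s, f in zip(accumulate(started), accumulate(finished))]


# Hypothetical weekly counts (not HCSC data): teams keep starting about five
# items a week, but from week 3 a freeze throttles delivery to one a week.
started = [5, 5, 5, 5, 5]
finished = [5, 5, 1, 1, 1]
print(wip_series(started, finished, initial_wip=8))  # [8, 8, 12, 16, 20]
```

Velocity drops to one item a week, yet WIP grows by four every week, which is the shape of the two boxes on the slide.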

The other very interesting thing, and I call your attention to the oval on the top right-hand graph, is that there was frequently an urgent change we needed to make in production that just could not wait. You can see this is still within the open enrollment period: suddenly there was a spike, and we needed to deliver some features.

This was a very typical pattern. This is something that happened with many of our products: within open enrollment, after the initial dip in the velocity, we then went back up and started delivering more.

The learning here was that we are invariably going to need to deliver more before January 1, when the new year starts. That means we need more agility even during the year-end freeze, our period of peak demand.

But what was preventing us from being agile was, first, the work in progress that kept piling up. The more work in progress we had, the harder it was to make these changes fast, and the harder it was to be agile during the period when we really needed to be.

Second, we had neglected our technical debt, because we were focused on delivering features right before open enrollment started, knowing a freeze was coming. With so much technical debt accumulated, we could not deliver as fast as we wanted to when that spike came. We were not well prepared for it.

Dominica DeGrandis: Yeah. It is also interesting to note that as WIP increases, delivery slows down, throughput and velocity start to tank, and then all kinds of problems show up. War rooms are happening, and people do not have capacity to attend what we would typically call training sessions or to fix technical debt.

But the forces were there and the pressure was on, so the teams worked really hard to get it delivered. Then we can see that the impacts went well past the open enrollment period and the end of the year, into the following year. At the beginning of 2021, we can see the impact on flow time, which increased during that period.

Tashfeen Mahmood: Yeah. You remember these times, right, Dominica? These are the times when we were in meetings, or we were supposed to be meeting with somebody, and they would be like, "Sorry, I am too busy right now because I have all this work to do." This is what was happening. Suddenly things were spiking up. This was a very interesting time for sure.

Dominica DeGrandis: Yeah. Also sometimes it is last minute, like, "Oh, we have got to cancel. We cannot make it. We are in a war room."

Tashfeen Mahmood: Yeah.

Dominica DeGrandis: "We cannot handle it."

Tashfeen Mahmood: Last minute.

Dominica DeGrandis: Or just new priorities that pop up.

Tashfeen Mahmood: Right. Yeah. That is exactly right. We had some formal interviews with our practitioners that confirmed this issue. Some were saying that everything is a priority. Others were complaining that there was too much work being done all at the same time.

We had already seen this with the metrics. We could see that the work in progress was high. We could also see that there was a lot of technical debt. The closer we got to open enrollment, the more we were neglecting our debt, and now it was coming back to bite us.

Here is what we did. We started identifying business goals, such as improving customer retention rate or reducing customer acquisition cost. The idea was to get closer to our business.

Now that we knew which business goals we were going for, we tied our flow metrics, like work in progress, to those business results.

Our hypothesis was that product teams think everything is a priority because they are not focused on business results. Therefore, we tried to close the gap between the business and the IT teams by making those business results shared goals.

Now every member of the team is behind improving customer retention rate and lowering customer acquisition cost. Only the features tied to customer retention and acquisition cost are a priority now.

We set up experiments designed to move these business results while lowering our WIP and tech debt. That way, it would be easier for us to become more nimble and respond to any emergent needs during our peak season.

This also enhanced our focus on fast delivery and feedback. When you learn that a slight tweak can improve the goal that you as a group believe in, then you want to quickly make that tweak so you can positively impact that goal. This was very powerful for us.

Dominica DeGrandis: Where were the insights gained and communicated between your IT teams and business folks when you started doing these experiments and presenting the flow metrics and the data to show the progress that was being made? Was that something that was occurring on a monthly basis or at a quarterly review? What did that look like, just real briefly?

Tashfeen Mahmood: There are what we call the product line retros, or product line retrospectives. That is where we look at the business results being tracked and confirm that those are our shared business results.

At this point, though, we were not really going for the business results themselves. We were trying to show our business that we were prepared to deliver on those business results. This is work we are doing to prepare, so that there is less work in progress and we can build up agility for open enrollment. That is the concept we are going for within our team right now.

Dominica DeGrandis: By building the agility up, are you talking about making sure that the teams have more capacity to do this type of work instead of being so overloaded?

Tashfeen Mahmood: Yeah, exactly right. The less work in progress, the more capacity we have to quickly make the changes that inevitably come up during open enrollment. The less tech debt, the better our ability to get those changes into production fast.

Dominica DeGrandis: Which leads into flow distribution here, with some technical debt showing up on the far-right bar.

Tashfeen Mahmood: Yeah. Thank you for calling that out. This is the flow distribution map. Green is features, red is bugs, and purple is tech debt. As you can see, this product was not addressing much of its technical debt until recently, when we called out this need and started talking about how technical debt had to be reduced in order to deliver faster. That is what we are focusing on now: reducing technical debt so that we can move faster during open enrollment.
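Flow distribution is essentially the share of completed work that falls into each category. A minimal Python sketch, assuming completed work items are labeled with one of the three categories on the slide; the labels `feature`, `bug`, and `debt` and the `flow_distribution` helper are hypothetical, and the counts are made up rather than the product's actual numbers:

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical labels mirroring the slide's colors: green, red, purple.
CATEGORIES = ("feature", "bug", "debt")


def flow_distribution(completed: Iterable[str]) -> Dict[str, float]:
    """Percentage of completed work items in each category, rounded to 0.1."""
    counts = Counter(completed)
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unexpected categories: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        return {c: 0.0 for c in CATEGORIES}
    return {c: round(100 * counts[c] / total, 1) for c in CATEGORIES}


# Made-up counts for one period, not the product's real numbers.
items = ["feature"] * 14 + ["bug"] * 4 + ["debt"] * 2
print(flow_distribution(items))  # {'feature': 70.0, 'bug': 20.0, 'debt': 10.0}
```

Tracking these percentages period by period is what makes a shift like the one on the slide visible: a `debt` share rising from near zero indicates that capacity is actually being allocated to technical debt.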

This slide is really about what we have learned: experimentation, especially during the period of peak demand, is paramount. We also learned that allocating capacity to reduce technical debt is the way to go, because the more we do that, the more process improvement we make, and we visualize it with our flow distribution metric. That way we can be more nimble in addressing the needs of our peak demand season.

Our year-end processing starts in October, and our open enrollment is going to start in early November. We have our fingers crossed, and hopefully we are going to learn a lot from our experiment.

Dominica DeGrandis: Yeah. I am really looking forward to watching you make these daily improvements and improve the process for making them. That is one of the five ideals of The Unicorn Project, the third ideal, right? Make time, and allocate capacity, for continuous improvement through the improvement of daily work. We will be able to see the impact it has on your business objectives.

Tashfeen Mahmood: Yeah. Thank you. All right. Thank you.