Your Bureaucracy. Our Tenacity.

Las Vegas 2023

Andrew Fichter

Deputy Director - VA Lighthouse Developer Experience · U.S. Department of Veterans Affairs

Rob Monroe

Sr Product Manager · Rise8

John F. Kennedy once praised our Veterans for their bravery and selflessness by saying, “as we express our gratitude, we must never forget that the highest appreciation is not to utter words, but to live by them.”

On any given day, the Department of Veterans Affairs (VA) aspires to live by these words in every in-person and digital interaction for over 9 million Veterans, their families, caregivers, and survivors, by meeting their needs for health care, disability compensation, education, housing assistance, and service record maintenance…all around the world.

Unfortunately, Congress has heavily scrutinized the VA’s cybersecurity practices as a contributing factor to its failure in serving the needs of our Veterans. Something must change! As the VA attempts to shift towards a more DevSecOps mindset and associated culture to support its transformation objectives, there is a sobering realization that traditional approaches to risk management frameworks have resulted in both technology and policies that silo people, deteriorate trust, and ultimately are not aligned to enable modern, agile software delivery.

In this talk, I will share insights, lessons learned, and recommendations, including:

- The art and science of hacking your bureaucracy to achieve higher levels of maturity for DevSecOps, continuous Authority to Operate (cATO)

- How continuous risk management reduced MVP time to market from 450+ days to under 90 days

- How to pivot from policy-as-paper to policy-as-code to address security throughout product life cycles, at scale

- Why the success of your security culture shift starts and ends with your people strategy

- How Product Management can help your platform and services adoption stick, and scale

- Why project and output based efforts are impeding your organization's ability to compete

Full transcript

The complete talk, organized by section.

Andrew Fichter

All right, so it's now 3:30, so we'll go ahead and get started.

Hi, everybody. Welcome to our talk. I know it's a few days in and everyone's talked out, so thanks for the time to come here late in the conference. All kinds of great information and conversations already, so I'm sure you have a lot in your head. Thank you.

We're here today to talk to you about our journey achieving one of, I haven't fact-checked it, I have tried to fact-check it, but I don't know for sure, but one of the first ongoing authorizations, continuous authority to operate, in a civilian federal agency. If you don't know what that means, don't worry. We got you covered. We're going to dig into that in our talk and try to break it down pretty clearly for you. It's a very exciting accomplishment, and I think it's a funny story, so I can't wait to share it with you.

Let's start off. My name is Andrew. I work at the Department of Veterans Affairs. I've been there since 2019. At VA, I lead parts of VA's API platform. We have a public API platform, plus a developer experience program, which primarily entails a delivery platform that we recently launched.

I'm excited to talk to you today because I love creating opportunities to learn and connect with shared challenges. I know many of us here today, we've encountered similar challenges on many levels, and we all have a unique perspective and unique lessons learned from overcoming those challenges. So we can both share what we've learned and also learn from each other.

Rob Monroe

Hey, everyone. My name is Rob Monroe. I've been a practitioner within the DevSecOps transformation domain for the last eight years of my career, and now I'm a senior product manager at a company called Rise8. We're a three-year-old startup in gov tech. Love the company, love our culture, and we're purely obsessed with trying to make the world a better place where fewer bad things happen because of bad software.

Andrew and I can't thank you enough for giving us some of your time this week. We know it's been jam-packed, but we hope you enjoy this presentation as much as we enjoyed putting it together and going on this journey as well.

Really quickly, I would like to see a show of hands. How many of us believe that bureaucracy is a major impediment to leadership in the organizations we work with or for? Okay. No surprise, pretty much everyone here.

Before we get into how we achieved continuous delivery by focusing on how we accelerate with continuous risk management, Andrew and I really wanted to get into the details of what bureaucracy means to us. We talked about it being a problem in the federal government space.

The term bureaucracy was actually first coined in the mid-18th century, and it didn't carry the negative connotation that it does today. A recent definition that I came across states that bureaucracy is a system of managing an organization by strictly following fixed routines and procedures that often lead to delay.

Our takeaway from this was that we all have bureaucracy. Whether you work in commercial, private, or government, you all have structures, and it's in your DNA as to how you get things done and how you work. So it's not out of the realm of possibility that actually some bureaucracies are good, some bureaucracies are bad.

But let's get a little bit more specific then. When Andrew and I are targeting bureaucracy as the impediment to getting shit done and something that we want to reject, we're specifically talking about the behaviors and actions that we've witnessed keeping product teams from achieving continuous delivery of value.

Instead, what we really want to strive for is actively shifting our ways of working to something that provides more effective clarity, transparency, and structure that we believe optimizes for rapid and continuous adoption of change as a means to enable greater value.

I'll now hand things back over to Andrew to help set the scene for why it's so important that we focus on solving problems for the government and why we should be rallying behind it, even if we're not the ones directly involved.

Andrew Fichter

All right, I can talk for days about this, so I'm going to try to limit myself. But if I start talking fast, it's just because I'm passionate about it and I have a lot to cover here, and we have limited time. So bear with me.

On the screen here, you can see the placard outside of the VA headquarters in downtown Washington, DC. On there, the VA's mission is described there on the bottom. I'm not going to read it verbatim, but it's a pretty easy sell.

Basically, VA's whole purpose for existing is so that our veterans are supported and have access to the benefits and services they're entitled to. They've made the ultimate sacrifice. They put their lives on the line. They left their families. They put their health and well-being on the line to defend our nation and our liberty. And we're all so fortunate to have the liberty that they fought for and that they protect for us on a daily basis. VA is here to help support and empower those veterans throughout the remainder of their civilian life.

The mission is an easy sell. The impact potential is clear. There are obvious inefficiencies in an agency as large and complex as VA, and the potential for change is massive. As Paul Gaffney and Courtney Kissler discussed yesterday when comparing efficiency and effectiveness, there's a lot of inefficiency and ineffectiveness in an organization like the VA. It's a very humbling challenge that I'm very passionate about, both as a technologist and as someone who wants to make an impact on a cause that matters, that makes a difference in lives.

On this slide here, you can see that VA obviously has the resources to address these challenges: over 390,000 employees. Many of them work for the Veterans Health Administration, the single largest provider of healthcare in the United States. You can also see there are more than 18 million veterans. VA's Office of Information Technology, or OIT, alone is comparable in size to many Fortune 500 companies, with over 8,000 employees and just as many contractors supporting them.

As you can imagine, our technology landscape is very, very complex, full of redundancies and siloed systems. This is a bit of inside baseball, but the VA effectively operates as three separate independent government agencies. There's the Veterans Health Administration, which I mentioned. There's the Veterans Benefits Administration, which provides benefits such as VA loans, pension, and disability compensation to veterans. And there's also the National Cemetery Administration, which handles afterlife care for veterans and their families.

But these siloed, effectively independent agencies, they don't share data, they don't share systems, they don't share processes. And individually, they have too many data systems and processes to fathom just on their own.

Our veterans deserve better. They shouldn't have to print out and mail a claim form to VA in order to apply for benefits. They shouldn't have to wait for their claim to be processed and likely kicked back to them for more information. They shouldn't be outright denied benefits due to unreliable systems, data quality, or data retention, to say nothing of the quality of their digital experience, even if things went exactly as expected.

Now we'll talk a bit about the challenges and continuous learning we've had to apply to make progress and move our needle in the right direction.

Here's a very high-level snapshot of the journey we've been on. Each milestone you see on the slide was prefaced by identifying and solving a problem. While solving that problem, we identified new problems and chose to tackle those. So I'd like to think of this as a journey of continuous learning.

In 2018, we launched our public API platform. If anyone has their phone or a computer in front of them, you can go to developer.va.gov. You can see our API portal with the APIs that are available right now.

We really wanted to enable VA teams to build good APIs and expose third-party innovation. That was our goal. We wanted to provide a platform, and we wanted to enable other teams within OIT to build good APIs.

What we learned is that VA teams really struggle with this for many reasons: costs, priorities, manpower, skill set, so many obstacles that stand in the way. And also there's a lot of fear around exposing data outside VA or even to other VA teams. Teams are very siloed and protective of their systems and their data. Making their data available, it's a scary notion. It can be a tough sell.

Around the time COVID hit, we decided to address this challenge with delivery and having teams be able to have the skills and resources they need to build APIs. We decided to address that by building a delivery platform.

Yesterday there was a talk from U.S. Bank. They talked about their delivery platform. What we're doing is very similar to that, if anyone watched that. We wanted to reduce costs, reduce time there. We launched that platform in 2021.

But the big elephant in the room still is authorization. We realized teams can only ship code as fast as it's authorized. We can run a platform that allows a team to ship to a dev environment in a week and to prod in a couple weeks, but if they can't be authorized to run in production, what good is it? It still takes months just to get authorized.

So we recognized we wanted to solve that problem, and we also knew that we had our work cut out for us. We had a pretty serendipitous discovery of Rise8 around the time that we were thinking about this problem. That's when we decided to partner with Rise8, who actually had experience doing this exact thing at Air Force with Kessel Run. It was a very timely relationship.

With Rise8 on board over the past couple years, we've built the first ongoing authorization, continuous ATO, in a civilian agency.

I mentioned a continuous ATO. What the hell does that mean? It's a continuous authority to operate. That means we're effectively managing risk continuously rather than every couple of years.

I started to define that and some other key concepts on the left-hand side of this slide, which you can read through later if you're curious. But what's important for you to understand right now is that the standard method to achieve security goals, following a nonlinear flexible framework and implementing against this framework and inheriting through the processes that you build around it, ultimately leads to permission to run a system in production within the federal government, which is an ATO.

As you can imagine, at VA and other federal agencies, the process to achieve an ATO doesn't lend itself to modern software development. Very waterfall. You can see on this slide here that the standard waterfall happens, including development and testing before everything is handed over for assessment and authorization, which is required before you can go to production.

And so a development team actually working on a thing, the security piece of this is a total black box. It's not something they're taught to even think about throughout the development process. The assessment and authorization process is very disjointed from the development process. It's largely based off static documentation, which itself is a couple layers removed from the actual dev team. And then the assessment typically yields a one- to three-year authorization.

When teams are shipping to prod once a quarter or a couple times a year, I guess that kind of makes sense and works. But it doesn't work if you're trying to ship code that's authorized on a daily or weekly basis like we are.

Feedback loops are obviously a challenge. Security scanning under this legacy process is a perfect example. The team would develop their app and then had to request a security scan from a completely separate and isolated scanning team. The scan had a 30-day SLA, so you could put in a ticket and then just hope that within 30 days you got your scan results back. Sometimes you didn't.

The output of that would just be a list of vulnerabilities to be remediated, and then a rescan wouldn't be required for another quarter or year in many cases. There's no feedback loop to ensure that vulnerabilities that were detected are remediated in a specific timeframe. Sure, there are manual checks in the process, but that only works as well as the people and process behind it. And it's not perfect, obviously.

It's obvious how the existing process not only introduces a delay, but it also reduces security posture. You have outdated scan results and then you don't have tight feedback loops to make sure that those scan results are addressed, remediated, and rescanned in a timely manner.
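To make the missing feedback loop concrete, here is a minimal sketch (not something shown in the talk) of an automated check that flags findings still open past a remediation window. The `Finding` fields, the severity names, and the day counts in `REMEDIATION_DAYS` are all assumptions chosen for illustration, not VA policy.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

# Hypothetical remediation windows in days per severity (assumed values).
REMEDIATION_DAYS = {"critical": 15, "high": 30, "medium": 90, "low": 180}


@dataclass
class Finding:
    identifier: str
    severity: str
    detected_on: date
    remediated_on: Optional[date] = None  # None means still open


def overdue_findings(findings: List[Finding], today: date) -> List[Finding]:
    """Return open findings whose remediation window has already elapsed."""
    overdue = []
    for finding in findings:
        if finding.remediated_on is not None:
            continue  # already fixed, nothing to chase
        deadline = finding.detected_on + timedelta(
            days=REMEDIATION_DAYS[finding.severity]
        )
        if today > deadline:
            overdue.append(finding)
    return overdue
```

Run on every scan rather than once a quarter, a check like this would surface stale findings automatically instead of relying on someone remembering to follow up.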

I mentioned static documentation earlier. That documentation captures evidence as to how security controls are accounted for. Those controls are identified and assessed by a team within the Office of Information Security. That's within OIT.

The assessment's not technical in nature, and it's not transparent. Many people who are far removed from the app itself are involved in the assessment and authorization decision chain, and they have no capacity. They're very limited. They have many systems on their plate at one time that they're trying to assess and authorize.

That just leads to rushed assessments and emergency, all-hands-on-deck fire drills to sort things out before turning it over to the authorizing official to make an authorization decision. Fun fact, I was actually on a two-hour call yesterday morning with exactly this kind of fire drill at 6:00 a.m. back on the East Coast. So that was fun.

A ton of work and manpower goes into this. This creates an illusion of security, a false sense of trust. Surely if we have 20-plus people spending hours assessing a system's security controls, it must be secure by now, right? That's not the case. Systems are not secure just because the static documentation says so, or because a person reading the documentation confirms that, yes, the documentation says it's secure enough for me to agree.

Again, back to this topic of this interplay of efficiency and effectiveness, really think how inefficient it is. So indicative of the challenges that we're trying to solve.

I'll hand it back to Rob, and he'll walk us through strategies that we employ to address these problems.

Rob Monroe

Thank you so much, Andrew.

I thought it might be appropriate to reference David Marquet's Turn the Ship Around! because when you think about the challenges that Andrew just laid out, what we're really talking about is an attempt to move authority to where the information's located. That might be the systems themselves. It might be the people that are involved and understanding that they understand more context and have that readily available when necessary.

I think we can all agree at this point, day and age, that adopting cloud technologies and microservice-based architectures is a way of abstracting away overhead problems from our products and teams, making it easier for them to ship mission capabilities.

This also presents a unique opportunity, however, to establish a controls inheritance model where we actually have common control providers that are addressing security risks and vulnerabilities, or threats of such, in a way that can be adopted and inherited by the other layers themselves. In this case, that frees our application teams up from having to carry that concern, while obviously still keeping a tight loop on changes that are occurring in the different layers of this stack.

This allowed us to be more intentional, in fact, with our definitions of responsibilities for controls at the application layer. This also made it possible for clear lines of authority, making it so product teams are more accountable for actually owning the security and privacy aspects of their applications rather than passing the buck off to somebody else.
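One way to picture a controls inheritance model is as a lookup from each required control to the lowest layer that already provides it, with whatever is left falling to the application team. The sketch below is an illustration only: the layer names follow the talk, but the control labels and which layer provides what are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Layers ordered from the bottom of the stack to the top.
LAYERS = ["infrastructure", "platform", "application"]

# Hypothetical mapping of common control providers to the controls they cover.
PROVIDED: Dict[str, Set[str]] = {
    "infrastructure": {"physical-access", "network-boundary"},
    "platform": {"audit-logging", "runtime-monitoring"},
    "application": set(),
}


def split_controls(required: Set[str], layer: str) -> Tuple[Dict[str, str], List[str]]:
    """Split required controls into inherited ones and ones the layer must own.

    Returns (inherited, owned): inherited maps each control to the lower
    layer that provides it; owned lists controls no lower layer covers.
    """
    lower_layers = LAYERS[: LAYERS.index(layer)]
    inherited: Dict[str, str] = {}
    for control in sorted(required):
        for provider in lower_layers:
            if control in PROVIDED[provider]:
                inherited[control] = provider
                break
    owned = sorted(set(required) - set(inherited))
    return inherited, owned
```

Making the split explicit is what gives a product team a clear, bounded list of controls it is accountable for.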

Another bet that we made was that by optimizing for accessibility and transparency, we believed we actually could achieve sustained agility, trust, and velocity. Similar to the aspirations described by Clarissa Lucas the other day, to bring auditors into the newer ways of working, we actually decided to embed application security assessors with cybersecurity engineering backgrounds into a ratio of three to four product teams. So actually embedding them as if they're a teammate within the team. A third-party member of the team nonetheless, but still there to help them grow, help them learn about security threats and weaknesses about their systems, and helping them learn along the way.

And they're also learning about the maturity of that system as it continues to mature throughout its product lifecycle. So the feedback loops and decisions that can be made as changes are being made is a lot easier and more effective.

We then wanted to ensure that we could provide access and transparency to all of the tooling and systems that were maintaining this concept of risk across our software development lifecycle. So we implemented an initial secure release toolchain (we'll talk about the difference there with the pipeline in just a minute) where any party that needed the information could access it immediately and on an ongoing basis, across all forms of risk context, as it was iteratively and incrementally developed or changed.

And yes, this actually also includes documentation. So this might be a really funny GIF to walk away with today, but I think what we really want to ask ourselves is: how do you actually continuously update this, and how do you get the feedback loops to support that?

One of the solutions we utilized, which you saw on previous slides, is called SD Elements. With it, we could bring security, privacy, and application teams to a single place to have a conversation and capture context about risk and changes to risk, not just the first time, not just getting to prod the first time, but actually release over release.

What this actually allowed us to do is allocate more appropriate requirements based upon the explicit context in that moment. And as changes were occurring, by translating those common weaknesses and the explicit risk for a given system into an actionable backlog, we gave teams the opportunity to centralize the context of what priorities we're going to work on, across security, privacy, and nonfunctional requirements, in the actual product backlog itself.

And once completed, this led to greater confidence with verifiable evidence. When I talk about evidence, I mean even down to the need to see the lines of code that were implemented to address a common weakness, vulnerability, or threat, and doing so in an ongoing verification feedback loop, because you've structured the people around a common objective and around learning about each other's forms of risk and context about the system.

Now, this is an oversimplification of our secure release pipeline, but it illustrates actually a polarizing comparison in how systems are typically authorized in the federal government.

What I've come to learn in my one and a half to two years here with the VA is that it's actually common that an authorizing official could be responsible for authorizing tens, maybe even sometimes hundreds of systems. Let that sink in for a second. I have to know enough context about tens or hundreds of systems and feel confident about all the people that are working in our village to help me understand the risk levels and to make the decision of go, no-go.

If you think about it in terms of the typical approaches that are project-led, documentation-based techniques, it just simply won't scale. And it gets back to what Andrew was stating around how we're giving ourselves a false sense of security around what we're dealing with there.

So, as a recommendation, by investing in automated paths to production, where you start by building collaborative work environments for the people to work more collaboratively, and then govern it through policy as code within your security pipelines as one form of strategy, you are actually able to continuously deliver secured solutions that veterans will, in this case.
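As a rough illustration of policy as code in a security pipeline, the sketch below expresses a release policy as data and evaluates a scan report against it. It is not the VA pipeline: the policy fields, check names, thresholds, and report shape are all assumptions made up for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical release policy: every name and threshold here is an assumption.
POLICY = {
    "max_open_findings": {"critical": 0, "high": 0, "medium": 5},
    "required_checks": ["sast", "dependency_scan", "container_scan"],
}


def evaluate_release(policy: Dict, scan_report: Dict) -> Tuple[bool, List[str]]:
    """Return (allowed, reasons); reasons lists every policy violation found."""
    reasons: List[str] = []

    # Every required check must have run for this release candidate.
    for check in policy["required_checks"]:
        if check not in scan_report["checks_run"]:
            reasons.append(f"required check did not run: {check}")

    # Open findings must stay at or below the per-severity limits.
    for severity, limit in policy["max_open_findings"].items():
        count = scan_report["open_findings"].get(severity, 0)
        if count > limit:
            reasons.append(
                f"{count} open {severity} findings exceeds limit of {limit}"
            )

    return (not reasons, reasons)
```

Because the policy lives in version control next to the pipeline, a change to a threshold is reviewed and recorded like any other code change, rather than buried in a static document.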

I love the idea from yesterday's talk about having security addressed before code's even written. That sounds like an awesome future Nirvana state on the horizon. But I just want to be able to demonstrate that even low-fidelity solutions can yield really positive results if you're thinking about developer experience.

As an example, we had some security solutions that were only accessible through a different network, or you had to have a government-furnished equipment device to actually get to them. Think about having to pull away from where you're actually writing code to go log in, look something up or access the tool, learn the tool, learn how to navigate it, and then figure out what you're actually trying to resolve from a vulnerability perspective. You're adding tons of clicks and tons of time to that feedback loop, when we could be making engineers more productive with their time and helping them achieve what they're ultimately trying to solve.

In this case, by understanding and researching what our engineers needed to make decisions about these vulnerabilities, we actually just brought that data closer to them within their pipeline. We gave them the actual context of what the problem was, where it was located, what available fix would actually address that problem, and several other properties. This greatly reduced the time necessary to make those decisions and obviously made them happier.
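A low-fidelity version of "bringing the data closer" can be as simple as turning each scanner finding into a one-line message in the pipeline output. The sketch below assumes a hypothetical finding shape (`VulnerabilityFinding`) and message format; it does not reflect any particular scanner's output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VulnerabilityFinding:
    identifier: str                # advisory ID as reported by a scanner
    package: str
    installed_version: str
    fixed_version: Optional[str]   # None when no fix is available yet
    file_path: str                 # where the vulnerable dependency is declared
    line: int


def pipeline_annotation(finding: VulnerabilityFinding) -> str:
    """Format a finding as one line: where it is, what it is, and the fix."""
    if finding.fixed_version:
        fix = f"upgrade {finding.package} to {finding.fixed_version}"
    else:
        fix = "no fix available yet"
    return (
        f"{finding.file_path}:{finding.line}: {finding.identifier} in "
        f"{finding.package} {finding.installed_version} ({fix})"
    )
```

The point is that the problem, its location, and the available fix all appear where the engineer is already working, with no separate network or tool login in the loop.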

At this point, you're probably thinking to yourself, "That all sounds great, but I need to see the results. Actually really show me the proof here."

I'm going to start off by focusing on the business side of things for a second, in that we first reduced the average time to acquire an ATO from 568 days to roughly 88 days.

Thanks.

The other thing to point out on this slide is each of the digital products you see on the left have their own unique story of what they delivered into prod and what it meant to a veteran or a clinician, their actual end user that was leveraging the software that they developed.

But what I want to actually draw our attention to with that first step is we're talking about time to celebrate successes and failures. What this means is we no longer have to wait years to find out if we made the wrong bet on something. We were able to get the solution highlighted at the bottom, Ventake, into prod in the hands of the clinician, and actually get the feedback necessary to make business decisions to go ahead and retire that system.

Now, think about adding another 400-plus days onto that timeline. That would be ridiculous at this point. Why would we possibly want to go back to other ways of working that add more lead time to understanding what's going on in prod?

Through this journey of changing behaviors, we've also seen change of mindsets. Through this experimentation period, I often refer back to this quote that I heard from an information system owner, and I'm going to read it because it's fairly long.

"Going in, I thought C-ATO was an attempt to fast-track ATOs by avoiding VA processes and documentation altogether. Now I believe it's a more humanistic approach that emphasizes automation, transparency, and trust to support our modern SDLC process."

This is focusing on some of the outcomes in terms of what behaviors we saw from engineers now that they had accessibility, transparency to everything, and they had certain thresholds of decision-making power, how quickly they were demonstrating higher urgency of addressing security vulnerabilities.

And while I think that these stats in terms of before-and-after effects are amazing, I'm more interested in understanding, what do the actual users of our processes or systems actually have to say about this?

And so one engineer user of our secure pipeline stated that, "I love that I get vulnerability checks in minutes instead of days, approaches to how to fix those problems, and then I can do all of this from my software configuration management solution," really making things easier for our users. And that's really what's making this more effective in our environment.

On the other side, in terms of how frequently we're addressing assessments and what we call POA&Ms, or plans of action and milestones, we're actually decreasing that number because of the way that we're inviting everyone into the conversation and being more collaborative about what we're identifying as risks and how we address them. And for those cases where we do have POA&Ms, or let's say unfinished or insufficient requirements going into prod, those are also living for less time.

Again, I think it's really important to see what net effect this is actually having on the person who's involved and who's got the responsibility for this. And so, "I'm learning so much," a product manager stated, "I'm learning so much from our AppSec teammate and feel more confident that we're delivering a better product for our users and to our business."

But even with all of this initial success, there are many challenges ahead of us, and here's where we think we could use your help.

Andrew Fichter

Yeah. So that picture on the left there, I think it's pretty fitting in that I feel like a lot of the problems we solved are kind of the tip of the iceberg. And you can see on that picture on the left, the gray pyramid at the top, that's kind of what we've addressed right now through our inheritance model.

We do have other layers of controls that need to be accounted for. We want to mature and expand our approaches to the platform, which is in the middle, and then also the infrastructure layers. There's more controls to be managed at those layers, and we're looking for help in determining how we can automate monitoring and implementation of those controls.

Also, coalition of practitioners. I think that's really what the value is of conferences like this and getting people together who are having similar professional experiences and challenges. I think things like a coalition can come out of that.

We want to hear from folks who have had similar challenges making or establishing change. It doesn't even have to be the same kind of change. But I think in all large organizations where you have silos and ways of doing things that are really deeply ingrained, learning how to overcome those is really powerful, and so is sharing lessons on how you've done that.

We want to find people who have similar passions and figure out how we can start standardizing some of these ongoing security control assessments, too.

And then just a couple more things that kind of came to me. One, we want smart and mission-driven people to come work for VA OIT. So if any of this is you and your background, the government's a great place to work. We're hiring. Think about that. Tell your friends.

Anything to add, Rob?

Rob Monroe

Nope. As we started off the conversation, thank you for your time. We know it's very valuable, and we appreciate the opportunity to share our story. Thank you.