Productizing the Network: Square Peg, Round Hole?
We've been on an amazing journey at Capital One since productizing our network infrastructure teams in 2019. Listen as Girija Rao, Vice President Software Engineering; Denée Ferguson, Director-Edge Network Services; and Jennifer Miles, Director-Agile Portfolio discuss the drivers for productization, unique challenges with productizing network infrastructure teams, changes made, outcomes, and lessons learned.
Full transcript
The complete talk, organized by section.
Host Intro (Gene Kim)
[00:00:12.900] I love the story because it challenges the notion that DevOps principles and practices are only for customer-facing software applications. So up next is Girija Rao. She is a vice president who up until very recently led Enterprise Connectivity, presenting with Denée Ferguson, Director of Edge Network Services, and Jennifer Miles, Director of Agile Portfolio. So here is Girija, Denée, and Jennifer.
Girija Rao
[00:00:40.960] Hi everyone. We are here today to share with you the product-oriented, agile-driven transformation we performed in our network organization at Capital One two years ago, and a reflection on the outcomes we achieved.
[00:00:55.840] I am Girija Rao, and with me are Denée Ferguson and Jennifer Miles, and we each played a key role in this effort. I was responsible for the Enterprise Connectivity organization and, along with my leadership team, initiated and led the transformation. Denée Ferguson was responsible for tier-three operations at the time and was one of the key leaders in driving the effort with teams along the change curve. She now leads Edge Network Services, providing wired and wireless connectivity to over 50,000 associates across over 450 locations. Jennifer Miles leads Agile Portfolio and Program Management for cloud and connectivity and was a key contributor to figuring out how to make all work visible and evolve our delivery model.
[00:01:43.720] Before I get into the details of what we did, I would like to share some key facts about Capital One. We were founded in 1994 and are led by our founder, Richard Fairbank. We are a top 10 bank based on U.S. deposits, the third largest credit card issuer in the U.S., and a Fortune 100 company.
[00:02:04.220] From the start, Capital One has been a leader in technology-driven banking, and in 2020, we became the first U.S. bank to exit our data centers and go all in on public cloud.
[00:02:16.700] Our organization, Enterprise Connectivity, has approximately 350 associates. We have divided our team into three major towers: security and application services, connectivity, and horizontal services. Our technology scope ranges from proxy, VPN, DDoS, firewall, DNS, load balancing, and network admission control to contact centers and voice. And let us not forget my personal favorites: wireless LANs, routing and switching, and software-defined networking, which includes SD-WAN and SD-Access. We support connectivity for over 50,000 employees, over 100 offices, and hundreds of retail bank branches and cafes. We have in excess of 14,000 devices on our network and 185,000 carrier assets, which include circuits, toll-free lines, and POTS lines.
[00:03:07.240] So let us start by talking about why we felt change was needed. We had an organizational structure that had been in place for a few years, and several pain points and opportunities had become apparent to us.
[00:03:21.640] In this structure, there were two primary network organizations. One focused on network architecture and engineering, and the other focused on network operations, each in their own tower and under separate executive leadership. There was fragmentation in our approach and, despite all of our best efforts, a significant engineering-versus-operations dynamic. There was no unified sense of ownership or clear lines of accountability. It was too easy for things to get thrown across the fence without adequate consideration of the end-to-end lifecycle, sustainability, or customer experience.
[00:04:01.780] Due to this, for any project or issue, multiple network teams needed to be engaged, each focused on their own piece of the puzzle. Practically speaking, this meant that teams were working in parallel without sufficient visibility into what others were doing, absorbing the unplanned impact of upstream or interdependent projects, and lacking standardized knowledge and best practices for the same platforms. There was inefficient resourcing due to the siloed nature of the roles. Finally, some work was not being tracked at all, and some was being tracked in inconsistent ways. People leaders did not have full visibility into all the work that their team members were doing, and prioritization was a challenge.
[00:04:47.560] So in 2019, as we embarked upon this change, our drivers were to unify our vision and strategy, improve efficiency, make delivery predictable, and improve the overall quality of work delivered. To accomplish this, we needed to do several things: reorganize around key products and services with clear accountability; make all work visible to enable effective prioritization, dependency mapping, and resourcing; centralize our intake; and set limits on work in progress. Together, we felt these actions would strengthen us, position us better to deal with our complex and dynamic environment, and help us achieve predictable, high-quality outcomes.
[00:05:34.860] Our transformation involved changes in three major categories: organizational structure, prioritization, and reporting. We started with our organizational structure. This was huge and ended up with over 160 people changing managers.
[00:05:52.720] As I mentioned earlier, and you can see on this slide, engineering and operations ran as two separate organizations under different executive leadership while supporting the same network platforms and services. We consolidated all of it into one organization under a single accountable executive. We identified each of the distinct products and services and grouped them into product portfolios. For each product, for example wireless, we created dev and ops teams aligned under a single product owner, who also served as the people leader for the team. Agile delivery and program support were matrixed to the teams as a horizontal service.
[00:06:33.104] This provided a sense of ownership and accountability within each team for their product and the work delivered, and it helped establish consistent practices across teams. Where it did not make sense to do this, we left it as is. For example, we kept our 24-by-7 tier-one and tier-two ops teams separate as horizontal functions, engaging with all the other teams.
Jennifer Miles
[00:06:57.964] The second change we made was related to the way we prioritize our work. Before our transformation, work was prioritized at the team level with very little visibility into what was coming next. Operations and engineering prioritized work separately within their own silos, even though they may have been working on the same platform. This led to teams working on different goals and even different customer pain points. Context switching also became a challenge as team members were distributed across multiple teams within a silo. They had difficulty knowing where to focus as priorities varied between teams. We were also frequently disrupted by internal and external distractions, such as shoulder tapping and high-priority asks from leadership. These things became a huge problem. Whoever was the most visible at that moment received the attention of the team, regardless of existing priorities.
[00:07:46.424] So what do we do about this? As part of the product team reorg, we took the opportunity to adjust multiple aspects of our planning cycles. We started by creating an annual initiative prioritization process. We used a bottom-up approach to define our work pipeline with teams contributing potential initiatives to start the list. Leadership review forums were then held to prioritize and refine the list based on business need and organizational goals.
[00:08:15.404] Once the final list was prioritized and distributed, all teams were able to see the established priorities across the entire organization. Work was then broken down into achievable epics and stories, with tracking and course corrections made as needed, using quarterly increment and shorter sprint planning sessions. This more robust work breakdown process helped our teams realize several benefits, including better alignment of capacity and resources, the establishment of work-in-progress limits, and the ability to understand historical unplanned work cycles. Work-in-progress limits were an immediate outcome of the priority list, with above-the-line work taking priority and below-the-line work only being pulled in once teams had open capacity. Teams were then able to say no or not now to work that previously may have caused disruption.
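The above-the-line and below-the-line mechanism described here can be sketched in a few lines of Python. This is an illustration under assumptions, not the tooling the team used: the `Initiative` fields, the story-point sizing, the capacity figure, and the initiative names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Initiative:
    name: str    # hypothetical initiative name
    rank: int    # position on the prioritized list (1 = highest)
    effort: int  # rough size, assumed here to be in story points


def draw_the_line(initiatives, capacity):
    """Split a ranked list into above-the-line and below-the-line work.

    Work is taken strictly in priority order. The line is drawn at the
    first initiative that does not fit in the remaining capacity, so a
    smaller, lower-ranked item never jumps ahead of a higher-ranked one.
    """
    above, below = [], []
    remaining = capacity
    line_drawn = False
    for item in sorted(initiatives, key=lambda i: i.rank):
        if not line_drawn and item.effort <= remaining:
            above.append(item)
            remaining -= item.effort
        else:
            line_drawn = True
            below.append(item)
    return above, below


# Hypothetical backlog and capacity, for illustration only.
backlog = [
    Initiative("Firewall rule automation", rank=2, effort=30),
    Initiative("Wireless site surveys", rank=1, effort=40),
    Initiative("DNS platform upgrade", rank=3, effort=50),
    Initiative("Load balancer cleanup", rank=4, effort=10),
]
above, below = draw_the_line(backlog, capacity=100)
print("Above the line:", [i.name for i in above])
print("Below the line:", [i.name for i in below])
```

Drawing a strict line is one possible design choice. Re-running the split whenever capacity opens up would correspond to pulling below-the-line work in only once a team has room for it.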
[00:09:06.024] In summary, better prioritization ultimately reduced team overcommitment, allowing teams to deliver what they said they would, when they said they would.
Girija Rao
[00:09:14.444] Before we get into more specifics about our outcomes, I would like to spend a few moments clarifying why this transformation was anything but a guaranteed slam-dunk move. As mentioned previously, our organization is focused on network infrastructure, and network delivery has some key differences from application delivery.
[00:09:30.004] First, common roles within our engineering teams include network engineer, wireless engineer, firewall engineer, DNS engineer, load balancing engineer, et cetera. While many of our engineers have some software development skills, most have learned them over the course of this multi-year transformation. Most are not software developers by trade. Consequently, some agile constructs that are well known and commonly used in the software development world were not consistently understood or employed across the entire organization prior to the start of this transformation.
[00:10:03.424] Second, while we have eliminated physical infrastructure wherever we can, many network technologies are not at a point where they can be fully eliminated. As an example, you cannot connect your laptop to a wireless LAN without a physical access point. Consequently, we still have a significant physical footprint.
[00:10:21.364] One of the consequences of having a physical footprint is that many projects are waterfall-like and have heavy interdependencies. For example, establishing connectivity to a new office location involves ordering, delivery, and turn-up of a circuit; extending the circuit from the telco room to the equipment location; provisioning DHCP scopes, DNS entries, and firewall rules; and ordering, configuring, and installing the equipment. Multiple teams are involved in performing these tasks, and some of them require an on-site presence. So there is a lot of scheduling and a lot of coordination.
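One way to make interdependencies like these visible is to write the tasks down as a dependency graph and let a topological sort produce a valid order. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later). The task names follow the new-office example, but the specific dependency edges are assumptions made for illustration, not a description of Capital One's actual process.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
# These edges are assumed for illustration only.
new_office_tasks = {
    "order_circuit": set(),
    "circuit_delivery": {"order_circuit"},
    "circuit_turn_up": {"circuit_delivery"},
    "extend_circuit_to_equipment_location": {"circuit_turn_up"},
    "provision_dhcp_scopes": set(),
    "create_dns_entries": set(),
    "apply_firewall_rules": set(),
    "order_equipment": set(),
    "configure_equipment": {
        "order_equipment",
        "provision_dhcp_scopes",
        "apply_firewall_rules",
    },
    "install_equipment": {
        "configure_equipment",
        "extend_circuit_to_equipment_location",
    },
}


def execution_order(tasks):
    """Return one ordering in which every task follows its prerequisites."""
    return list(TopologicalSorter(tasks).static_order())


order = execution_order(new_office_tasks)
for step, task in enumerate(order, start=1):
    print(step, task)
```

Tasks with no prerequisites, such as ordering the circuit and ordering the equipment, can start in parallel, which is where cross-team scheduling and coordination come in.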
[00:10:57.894] Third, agile delivery methodologies are not a perfect fit for how we have historically operated. Many of our tasks do not neatly fit inside short, iterative, time-boxed sprints or even regular PI cadences. Work tends to oscillate between short-duration tasks and long-duration tasks, and sometimes we have waiting times within the same effort. Work to accomplish a particular objective can span multiple teams and organizations. And finally, we often have a heavy unplanned workload component, which needed a cultural shift to address. It forced us to plan much farther in advance and to learn to say no or not yet, which was a muscle that just was not well exercised or even well tolerated at the time.
[00:11:40.584] Finally, product management constructs often prove challenging, bringing to mind the image of a square peg in a round hole. First, our customers are often not aware of the role our services play in their daily experience, and this makes common product management constructs tricky, such as doing product- or service-focused customer surveys, conducting empathy interviews, developing North Star metrics, and actually measuring against them. Second, there is no customer choice element. There is no competitor that our employees can choose instead. New features and capabilities deployed often are not driven by end-user requests or market-share concerns. Instead, many are security- and automation-driven, and consequently, they are frequently invisible to our customer base. Last but not least, customer experience, meaning how our customers interact with the products and services we provide, was not historically at the forefront of our minds when we designed new solutions.
Jennifer Miles
[00:12:30.918] The third pillar of our transformation relates to reporting. We updated, created, or consolidated multiple reporting mechanisms to provide a better flow of information to our teams and leaders. Specifically, we created more robust operational reporting metrics that measured incident frequency and severity, helping us to understand change impacts. We also adjusted and consolidated agile metrics used at the team, portfolio, and tower levels, helping us identify improvement opportunities. And finally, we created mechanisms to make initiative progress visible at all levels, ensuring accountability for our outcomes.
[00:13:09.088] Primary operational metrics focused on items directly related to network stability, such as incident frequency, severity, and incidents caused by change, which historically is our number one driver of incidents. On the top half of the slide, we are focusing on incident statistics. The left-hand side shows network incident counts for all severities by quarter. The volume, as indicated by the trend line in red, has clearly declined over the course of the past two and a half years. The right-hand side shows network incident counts for only the critical incidents. Historically we have not had a high number of these; each count is in the single digits. When they do happen, though, they have an outsized impact on the organization due to our desire to quickly resolve the impact, perform root cause analysis, and identify and implement remediation actions to prevent recurrence. Critical incidents, as you can see, have also declined significantly, at a somewhat sharper pace than the overall network incident volumes, as indicated by the steeper slope of the red trend line. At this point, we have had more than 300 days without a critical incident, which has been huge for us.
[00:14:19.347] The bottom half of the slide focuses on network changes. The left-hand side shows the percentage of network changes resulting in incidents, and that has declined significantly, nearly a 60% drop over the course of the past two and a half years. The right-hand side shows how network change volume is trending, and you can see that overall, it is relatively steady. That little decrease in Q2 of 2020 is the result of change freezes that were put in place right after the pandemic forced everyone to start working from home. So in conclusion, our testing and change validation procedures have achieved the desired effect of improving network stability.
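The metric on the left-hand side of that slide, the percentage of changes that result in incidents, is simple to compute. The sketch below shows the arithmetic with made-up numbers. None of the figures are Capital One data, and it assumes the "nearly 60% drop" is a relative decline in that percentage.

```python
def change_failure_rate(total_changes, changes_causing_incidents):
    """Percentage of changes in a period that resulted in an incident."""
    if total_changes == 0:
        return 0.0
    return 100.0 * changes_causing_incidents / total_changes


def relative_drop(before, after):
    """Percentage decline from an earlier rate to a later one."""
    if before == 0:
        raise ValueError("earlier rate must be non-zero")
    return 100.0 * (before - after) / before


# Illustrative numbers only: steady change volume, fewer incident-causing changes.
earlier = change_failure_rate(total_changes=1000, changes_causing_incidents=25)
later = change_failure_rate(total_changes=1000, changes_causing_incidents=10)
print(f"Earlier: {earlier:.1f}%  Later: {later:.1f}%")
print(f"Relative drop: {relative_drop(earlier, later):.0f}%")
```

Reporting the rate alongside the change volume, as the slide does, shows that the improvement did not come from simply making fewer changes.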
[00:14:53.368] Two of our agile metrics are sprint commitment reliability and velocity. Sprint commitment reliability indicates the percentage of the work the team committed to accomplishing at the beginning of a sprint that actually was completed during the sprint. Velocity shows the trending number of story points that were completed each sprint and measures the amount of work completed by the team. Both metrics illustrate how the teams were initially disrupted by the transformation and became much more reliable as time progressed. We attribute this to several factors: better strategic alignment on our outcomes, with everyone on the team now working toward the same goal; efficiency gains from organizational restructuring; visibility into previously hidden work, with all work now captured and tracked; permission to say no or not yet, so we can focus on the right priorities; and finally, an enhanced continual improvement mindset.
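Both metrics follow directly from the definitions given here. The sketch below is a minimal illustration: the `Sprint` fields and the sprint figures are hypothetical, chosen only to show a team that is disrupted at first and then stabilizes.

```python
from dataclasses import dataclass


@dataclass
class Sprint:
    name: str
    committed: int            # story points committed at sprint planning
    committed_completed: int  # committed points finished within the sprint
    total_completed: int      # all points finished, including work added mid-sprint


def commitment_reliability(sprint):
    """Percentage of committed points that were completed in the sprint."""
    if sprint.committed == 0:
        return 0.0
    return 100.0 * sprint.committed_completed / sprint.committed


def velocity_trend(sprints):
    """Story points completed in each sprint, in order."""
    return [s.total_completed for s in sprints]


# Made-up sprints showing early disruption followed by stabilization.
sprints = [
    Sprint("Sprint 1", committed=40, committed_completed=20, total_completed=24),
    Sprint("Sprint 2", committed=40, committed_completed=30, total_completed=32),
    Sprint("Sprint 3", committed=40, committed_completed=36, total_completed=38),
]
for s in sprints:
    print(f"{s.name}: reliability {commitment_reliability(s):.0f}%, "
          f"velocity {s.total_completed}")
```

Keeping committed work separate from work added mid-sprint is what lets reliability and velocity tell different stories: a team can post a high velocity while still missing most of what it committed to.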
[00:15:42.148] So let us talk briefly about our initiative progress tracking improvements. Our updated prioritization cycles created an environment of greater visibility for all of our associates. We implemented initiative progress tracking through monthly leadership reviews and quarterly planning readouts. By doing this, our teams and leaders were able to see tangible progress occurring. Readout conversations and views included not only milestones achieved, but more importantly, what value was delivered to support the overall organizational goal.
[00:16:13.828] Previously, teams had difficulty delivering on the predicted timelines, and slippages were a regular occurrence. Also, leadership visibility into team challenges was limited to team members self-identifying an issue and then having to search for the right forum to raise it in. This process was not easy to navigate or transparent in any way. In the new organization, regular review cycles created an open environment that fostered transparent communication between teams and leadership. Leaders engaged early and often to help in issue resolution, leading to a reduction in delivery date movement overall. We also have made better use of our agile tools and applications to help teams raise visibility on dependencies and impediments so leaders and teams can see where work is at any time in the cycle. The most important outcome of our enhanced progress tracking is that our teams, leaders, and partners all have greater clarity into when work is planned to be delivered. They are also able to address challenges before they become critical.
Denée Ferguson
[00:17:14.189] We have talked about all the great things we accomplished. However, we also learned some valuable lessons. What we found out is that the devil is really in the details.
[00:17:23.608] Our first lesson learned was that it is very important to get buy-in at all levels, not just with the leadership team. We made this organizational change with input primarily from several levels of leadership and basically pulled everyone else along the change curve with us. In hindsight, we should have invested additional time to explain the why of the change to everyone. A few examples of the why include articulating to all levels a clear vision of the target state from the very beginning; determining at the outset how agile would be adapted for use in a non-software development organization like our own; explaining why it is important to make all work visible; illustrating how projects that normally span weeks to months can work in a sprint-based model and why it makes sense to do this; and finally, better explaining the you-build-it-you-own-it model and why we created on-call groups that included all team members and not just operations. With all that being said, at some point our teams had to make the leap of faith, and they did.
[00:18:26.976] Our second lesson was that the product model may not fit all teams. We encountered this challenge with select support teams. These included circuit provisioning and, as Girija mentioned, tier-one and tier-two operations, where historically workload is heavily ticket-based. To that end, we did keep those teams, plus our architecture and agile delivery, as horizontal services.
[00:18:44.996] Our third lesson was that we did not make enough skill retooling assumptions. It would have been better to assume that all roles would need some sort of skill retooling, whether in operations, engineering, or delivery. This lack of skill enhancement led to some confusion around role responsibilities.
[00:19:02.476] Our fourth lesson was that focusing on reporting needs from the start would have been much more helpful. Incorporating required status reporting into planning sessions is a key lesson learned. The right Jira structure can make status reporting a breeze, and the wrong Jira reporting structure can make it extraordinarily painful. It is far better to invest the time early on to determine what structure makes the most sense before you start the work than to try to adapt your structure while work is in flight.
[00:19:36.276] For example, I had a recent effort with my team to perform wireless site surveys at a large number of locations. This involved performing site surveys, analyzing the data generated from the site surveys, and then identifying required remediation actions, which either took the form of deploying additional access points at each location or simply making tuning changes to the RF parameters. Initially, we had a single epic that focused on all of this work, and we rapidly learned that that was not going to work for us. So we had to course-correct during the course of the project to shift our stories into four different epics: one focused on actually performing the site surveys, a second on analyzing the data, and then a third and fourth focusing on the remediation actions, one for access point deployments, and then one for actually doing the RF optimization.
[00:20:28.376] Our fifth lesson learned involved agile training. While we did have some agile training, more robust agile training was really necessary. We needed to have a refresher for those who were already familiar with the agile construct and a bit of a deeper dive for those who were not. We also needed more mock-up discussions that focused on how our work would change on a day-to-day basis, from what it historically was to what it needed to be. We found initially that it was a struggle to convince our engineers that the additional work we were asking them to do by writing stories for all their work provided value.
Girija Rao
[00:21:03.396] So looking back at this two years later, wow, we did achieve the outcomes we had envisioned. The Enterprise Connectivity organization has a unified sense of mission, clear lines of accountability, and as you saw from the metrics, improved delivery with higher quality of work.
[00:21:22.916] Our partners and stakeholders have also weighed in with their feedback on how much they appreciate the clarity of ownership and accountability and the stronger partnerships that we have been able to form with them as a result. This model has stood the test of time and also enabled us to easily incorporate additional infrastructure functions over the past two years. And Jen, I believe you said this the other day, that we would never choose to go back to the old model. This foundational structure that we have created is one that we continue to iterate and evolve upon.
[00:21:59.416] Ultimately, it is important to note that this transformation was by us, for us, and purely driven by a strong desire to address our pain points, improve how we were operating, and enable ourselves to be the best that we could be. I would encourage anyone considering something similar to take the leap.
[00:22:17.996] And finally, the help that we are looking for is that we are actively hiring. So come on over and check out our Capital One career website.