Intentional Architecture in Agile Organizations

San Francisco 2017

Senior Principal Engineer · CA Technologies

A well-functioning technology stack is crucial for maintaining high availability, low defect rates, performance, scalability, quick feature delivery, and developer morale. How do teams make architecture decisions in an environment where technologies and priorities change weekly?

How can teams adopt new features without adding to the technical debt dumpster fire? Intentional architecture is a careful balance of technical planning woven into the agile product development process. This talk will cover integrating intentional architecture within agile organizations.

Full transcript

The complete talk, organized by section.

Dave Thompson

Okay. So, looks like we're getting started here.

First off, I just want to appreciate you all for being here for an after-lunch session. My name is Dave Thompson, and I work for FreshTracks.io, which is a small startup working on a Kubernetes monitoring project. But we're a little bit unique because we are actually an accelerator within CA Technologies.

So CA Technologies' Office of the CTO has a program where people can pitch ideas to an investment board, and some of those, like FreshTracks, get funded.

But today, I'm actually going to be talking about my experience working as an architect for Rally Software. Rally Software is a company that builds agile management tools, software tools. We were acquired by CA Technologies a couple of years ago and are now known as CA Agile Central.

I was hired by Rally in 2011. We went public in 2013, and we were acquired by CA in 2015. So I have a broad range of experience from a startup-size company, to a midsize company, to a product line inside of a large company.

I worked on the architecture team with other architects to align technology decisions between about 100 employees and 15 teams across three offices. So kind of a mid-sized company.

Today at Agile Central, most of our services are continuously deployed, and the ones that aren't are released once a day. I've been in an environment where we're doing automated testing, doing continuous integration, doing continuous deployment for a while, so I take it for granted.

So when this talk was accepted to DevOps, I thought, "Uh-oh, I better figure out what the heck DevOps actually means." So I asked around my office, "Hey, what does DevOps mean to you?" And I think the best answer was from my colleague, Adam, who said, "DevOps is when the product team can break production."

Obviously, that's just a joke, but you could probably figure out who the engineers are because they were just like, "Yeah, definitely product team."

But for me, DevOps means that the development teams are responsible for not only developing the feature, but also deploying the feature, and when and how it gets out to the customers.

And if that's a definition that you use for DevOps, too, I don't think you can be very successful unless your organization is also agile.

And by agile, I don't mean specific processes like SAFe or Scrum. Those are really just tools to help you get to the core goals of agile, which are: code is more valuable than plans and documentation, rapid iterations are more valuable than large releases, people are more valuable than processes, and trust is more valuable than control.

When teams adopt agile DevOps, things will start moving more quickly. And when you're in that environment, it can be difficult to plan technical architecture for the future.

So where most teams end up, or at least where Rally ended up at the beginning, was what I call emergent architecture. That is, architecture decisions are made within development teams. They decide what tools, what technologies, what frameworks, what languages they're using.

Some of the success requirements for emergent architecture are a small number of teams, limited number of communication paths, and a high level of trust in the dev team leads.

When Rally was startup-sized, we met all of those requirements. Our R&D department looked like this: we had a director. We had a handful of teams. Each team had a manager and was a cross-functional team with all of the tools to write a feature.

Communication was really easy. You could get up from your desk, walk across the room, ask a question, and you also knew everyone, so you had a good sense of people's strengths and weaknesses, and that could inform decisions around your technology decisions.

But as we grew, the organization transformed into something more like this, where we had multiple directors, each director managing multiple teams, and communication became more difficult. You couldn't just get up and walk across the room anymore. Your colleague might be in a different time zone, or you might have never even met the person you need to ask a question to.
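To put rough numbers on that growth: if every team may need to talk to every other team, the number of possible team-to-team paths is n(n-1)/2. The sketch below assumes fully pairwise communication, which is a simplification and not something measured at Rally. The 15-team figure comes from the talk; the 5-team figure is only an illustrative stand-in for "a handful of teams."

```python
def communication_paths(teams: int) -> int:
    """Number of distinct pairs among `teams` groups: n * (n - 1) / 2."""
    if teams < 0:
        raise ValueError("teams must be non-negative")
    return teams * (teams - 1) // 2


# 5 is an assumed startup-sized team count; 15 is the count mentioned in the talk.
for n in (5, 15):
    print(f"{n} teams -> {communication_paths(n)} possible team-to-team paths")
```

Tripling the number of teams multiplies the possible paths by roughly ten, which is one way to see why walking across the room stops working.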

You've probably all heard of Conway's Law. Conway's Law is simply that the design of a system will match the communication patterns within an organization. So if your communication patterns look like that, that's probably what your system will look like as well.

And we found that was true for ourselves because as we grew, we found ourselves in a situation where we had one product using three different JavaScript frameworks. And if you've ever done any JavaScript work, you know that this is a really bad place.

It's bad for your users because it results in bad performance, loading all these frameworks and executing them. And it's bad for developers because it kills productivity when they have to touch three different frameworks to get a feature out the door.

So I think this was the point where we realized we need to do something different. What we were doing as a small-scale startup was no longer working after we had grown larger.

We wanted to implement more of an intentional architecture, actually thinking about where we were going and bringing some of those decisions out of the teams. So some of the goals for intentional architecture: to manage the communication channels so you know who to talk to when you have a question, and also who actually has the ability to make some of these important decisions.

We want to make sure we're planning our scaling runway, making sure we can support user growth and expansion into new markets. And of course, we've all worked on systems with a lot of tech debt. So how do we plan to manage tech debt so it doesn't go out of control?

And kind of along with managing tech debt and planning the scaling runway, we want to improve R&D productivity. We don't want our developers always stuck fixing problems. We want them developing new features.

So we put together an architecture team, and the architecture team was a group of people who worked with individual teams. Each architect kind of had a number of teams assigned to them.

Now, if you've been in the industry for a while, architect might be kind of a bad word because you may think about 15 years ago when architect meant somebody who went in their office and closed the door and made a bunch of great UML diagrams, and then a couple of weeks later came out and said, "Please implement my glorious UML."

That's kind of the ivory tower approach, and that's not what we wanted to do. Teams are really good at writing code. You need to be able to let them do their job.

What teams aren't always good at is understanding the ramifications of their decisions within the rest of the organization. So that's where the architecture team can step in.

The five main things that our architecture team focused on were relying on teams to facilitate innovation, shepherding new tech rollouts, maintaining R&D team productivity and morale, planning for the future, and ensuring successful deployments.

One of the trickiest parts of being on the architecture team is balancing team autonomy versus team alignment. When we had a small number of teams, it was easy to reach consensus. And as we grew, there was more conflict in the organization. It was more difficult to reach consensus.

In that situation, the easy thing to do was to say, "Okay, teams, just do whatever you want to do as long as you get the job done." You'll end up too far on the autonomy side, and that's how you end up with three JavaScript frameworks in the same product.

On the other hand, if you go too far towards the alignment end, you're going to be using more of a command-and-control structure where you're telling the development teams exactly what they should be using, what languages they should be using, what frameworks they should be using.

And that's going to kill your innovation, and it's also going to have an impact on the R&D team morale. It's kind of soul-crushing if you can never make your own decisions.

So to strike that balance, a good way to do it is to think of architecture as a bottom-up process. The teams produce the innovation and let that bubble up to the architecture team, or someone else who can then help evaluate the ideas and spread them across the rest of the organization.

Some of the organizational impacts that your individual teams might not be thinking about are licensing and hosting costs, hardware costs, training and hiring requirements, and all kinds of little details like disaster recovery, security, internationalization, accessibility, performance.

We want to have cross-functional teams, right? But this is a lot of stuff to know. It's really hard to know all of this stuff. So when you're developing a new technology that's ready to roll across the organization, it's helpful to have kind of a second check to make sure you have all these details in place before doing a rollout.

So a case study for us: as I said, we had three different JavaScript frameworks, and our developers said, "Well, we really actually like the React framework the best, but when we're using React, we keep developing the same components over and over. We really need a composable system so that we can put UI components on our page and make a page without having to do all the nitty-gritty details of things like accessibility."

So we decided to fund a team called Mineral UI to put together a UI library for our teams to use. And when you're doing something like this, it's important to do it in a way where you're not forcing it on teams, so you're not saying, "You must use this library." It's a much better option if you can say, "We have a compelling library for you that solves some of your issues. We think you'll enjoy using it."

And that's what we did with Mineral UI. It's developed as a product. So like you would do user research with any other product, we do that with Mineral, except the users are developers. And this particular one is actually also open source, so it's out there if anyone wants to use it as well.

Once you've decided that you want to take a technology and spread it across the organization, you need to have a plan for shepherding that new technology. Teams can often get kind of caught up in hype-driven development, the next newest thing out there. It's all over Twitter. "I really want to use it."

And it's kind of the architecture role to sit down with the team and say, "Hey, we actually need to do some due diligence before we start using this."

You've probably all seen this chart before. This is kind of a graph of technology maturity. It goes along. And it can be dangerous if you bring a technology into your organization when it's at that peak of inflated expectations stage, because you may lose support for it, and it may rapidly turn into tech debt.

So if you can get it into the trough of disillusionment or after that, that may be a better option, because the community understands that technology a little bit better, and you better understand the trade-offs.

One piece of the diligence you can do when evaluating a technology is to look at its level of community support. If you have higher levels of community support, it's going to be easier for your developers to do their jobs, and it's also going to be easier to find people to hire who already know how to use this technology.

Any technology, you have to make sure it meets your performance and scalability guidelines. You don't want to introduce a new technology and then three months from now find out that it can't meet your requirements, and you need to pull it out or spend a lot of time optimizing it.

Backwards-compatible upgrades are a big thing. This contributes to maintenance costs quite a bit. If you adopt a technology where you have to rewrite the whole code base every time you do an upgrade, you're either going to spend a lot of money maintaining this or you're not going to maintain it, and it's going to be instant tech debt.

The other issue that I think architects really need to focus on is simplicity. A lot of times, I see both processes and tools billed as enterprise, and they're very complicated and complex. And to me, that's kind of the opposite, because the larger your organization is, the more difficult it is to scale processes and tools across it. So really, the simpler things can be, the easier it's going to be to do that scaling.
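The four criteria above (community support, performance and scalability, backwards-compatible upgrades, simplicity) could be captured as a simple checklist. This is a minimal sketch, not the process Rally actually used: the `DueDiligence` class, its field names, and the `open_concerns` helper are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DueDiligence:
    """One yes/no answer per criterion named in the talk (field names are hypothetical)."""
    community_support: bool
    meets_performance_and_scalability: bool
    backwards_compatible_upgrades: bool
    simple_to_scale_across_teams: bool


def open_concerns(review: DueDiligence) -> list[str]:
    """Return the names of the criteria that the candidate technology does not meet."""
    return [f.name for f in fields(review) if not getattr(review, f.name)]


# Illustrative candidate: everything checks out except the upgrade story.
candidate = DueDiligence(
    community_support=True,
    meets_performance_and_scalability=True,
    backwards_compatible_upgrades=False,
    simple_to_scale_across_teams=True,
)
print(open_concerns(candidate))
```

An empty list would mean no open concerns; anything else is a topic for the conversation between the architect and the team before a rollout.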

Once you've done your due diligence, you need to work on a rollout plan. And of course, you're never just adding new technology. You're usually integrating it with some existing process. So you need an integration plan to figure out how that's going to work.

And when you're evaluating technologies, it can be important to determine not only how well this technology works, but how easy it is to integrate. And hopefully, in addition to integration, you can also do some deprecation and delete some of your old technology. So you can add something and remove something to keep the tech debt under control.

Developer training is really important in your rollout plan. How are you going to train the whole organization? It's likely that the team that pitched this idea to you originally is really excited about it, but the rest of the teams might not be. So getting those teams involved and into the training process is an important part.

So a case study for this: we adopted Clojure for most of our back-end services. It started with a team spike. A team said, "Hey, we really like Clojure. It has a large community. We like how the functional aspects of it work."

It was then brought forward to a larger group of people, including the architects, who did some analysis on this, said, "Yep, there's a community there. We can hire people for it. In fact, potential candidates are really passionate about this. It runs on the JVM, so we don't have to change a lot of our ops systems that are built around the JVM. So let's go forward with it."

So the next step was we built a template service, and the goal with this was to encourage teams to use this. The template service included things like authentication and hooks into our metric system, hooks into Kafka, and other things that are kind of boilerplate code that you need to write for every single service that you would want to write.

So by providing those, we kind of offered an incentive to say, "Hey, if you're writing a new service, you can pick this up and get it for free."

And then once we rolled it out, our training consisted of primarily a book club, where we bought books for the whole team, and then every week we went through a chapter and reviewed the exercises together. And it was important for us for that to be voluntary, for people who might have felt like this was being pushed upon them.

Once you roll out your code, it's always important to evaluate how that worked. What work is left that we need to invest in? How can we do better on the next tech rollout, and is it working as expected?

And on that last one, "Is it working as expected?" if the answer is no, you want to make sure that you have a rollback plan so that you can do something about it, because this will eventually happen. You'll have some rollout that just didn't work well.

An example for us was Flowtype. Flowtype is a way to add static typing to JavaScript. We invested in adding Flowtype for one of our code bases. We invested in some developer training, and then when we got to the retro step, "Is this providing value?" The answer was, not as much as we expected.

We had expected that this would reduce runtime errors and increase developer productivity by adding code hinting and autocomplete to their IDEs. But for this particular project, it turned out that we weren't getting all the value we wanted out of it, and it was a little bit more difficult than we expected to implement.

So we had a plan for this. We had a JS codemod where we could rip this out, and that's what we did. We went ahead and ripped it out.

Some people in the organization thought this was a failure because we invested in something that we didn't end up using, and other people in the organization thought this was a failure because we didn't keep investing in it until we hit that point where we were getting enough value out of it.

But I thought it was healthy in terms of an agile DevOps organization, where you can honestly evaluate your decisions and admit that it didn't work and pivot.

I also want to note that this is in no way a condemnation of Flowtype. We use it on a ton of projects, and it's actually a really great technology. It just didn't work for this one particular legacy service due to some specific issues for that particular service.

I think the health of your tech stack can also have a big impact on developer or R&D morale. Everyone in R&D wants to work on career-relevant tech, and everyone wants to get features shipped.

So some of the frustrations that R&D teams can feel are they're working with tech debt or legacy tech that is either not relevant to their career or is just slow and cumbersome to deal with. If you're working on JSPs all day or something like that, it can feel like you're in kind of a pit that you can't get out of.

Obviously, we want to focus on lead times. That's a lot of the focus of DevOps in general: lead times, and cumbersome and slow tooling. If it takes 10 minutes to build your application and 15 minutes to test your application, so that you have a 25-minute cycle every time you need to test a change or push it to CI or something like that, that adds up really quickly and can really kill productivity.
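A quick back-of-the-envelope calculation shows how fast that adds up. The 10-minute build and 15-minute test times are from the talk; the number of cycles per day, the team size, and the five-day week are assumptions picked only for illustration.

```python
BUILD_MINUTES = 10  # from the talk
TEST_MINUTES = 15   # from the talk


def waiting_hours(cycles_per_day: int, developers: int, days: int = 1) -> float:
    """Total hours spent waiting on build-and-test cycles."""
    cycle_minutes = BUILD_MINUTES + TEST_MINUTES  # the 25-minute cycle
    return cycle_minutes * cycles_per_day * developers * days / 60


# Assumed: 6 cycles per developer per day.
print(waiting_hours(cycles_per_day=6, developers=1))          # 2.5 hours per developer per day
# Assumed: a team of 8 developers over a 5-day week.
print(waiting_hours(cycles_per_day=6, developers=8, days=5))  # 100.0 hours per week
```

Under those assumed numbers, the team spends the equivalent of two and a half full-time working weeks every week just waiting on tooling.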

And any time you kill productivity, you're also going to kill R&D morale.

So for a case study for us: the architecture team this spring did a survey and we asked our developers, "Hey, how happy are you? And what's the most frustrating part of your day?"

And for one particular service, the developers responded, "Testing is just too difficult. Testing takes up too much of my time, and it's really frustrating."

So we took a look, and it turned out with this particular service, we had five different automated test suites. Each one used different technologies, some of them different languages, different frameworks, and completely different patterns of how you write a test.

So if you were working in this environment and you wrote a feature, you may have to touch all five of those test frameworks just to get your code tested and deployed.

So we sat down and we said, "Okay, well, what's the ideal solution?" And we came up with this testing pyramid, which has hot sauce in the bottom and tequila, and then a taco at the top, which is like the best type of pyramid.

And to get from point A to point B, we created a deprecation plan to deprecate a couple of those test suites. We identified high-value tests within those two test frameworks, moved them over, and then with only three test frameworks, it gave us a little bit more bandwidth. So we were able to invest in some of the tooling and libraries around the remaining three, making them easier to use.

And the result was the developers were happier spending less time writing tests and getting their features shipped a little bit easier.

Individual teams are often focused on delivering the features that have been assigned to them, and they don't often have the time or the bandwidth to plan what's coming in the future.

So that's another place where the architecture team can help: to plan for managing or hopefully eliminating tech debt, planning for upcoming features and growth, and evaluating emerging technologies that are coming down the road.

So at Agile Central, we used the same agile planning process for architecture as we did for product. Basically, what would happen would be the architecture team, with help from all of R&D, would prioritize the top technical items.

So it could be things like upgrade to the next version of Oracle, or investigating if we want to use GraphQL, or things like that. And then once we have that backlog that has been defined by the architects, it gets merged with the product backlog into an overall stack-ranking prioritization of all of the things we want to work on for the particular product.
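The merge described above can be sketched as combining two lists into one ordered list. The talk does not say how items were scored, so the numeric `priority` field, the `BacklogItem` class, the `stack_rank` function, and the two product items are all hypothetical; only the two architecture items come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BacklogItem:
    title: str
    source: str    # "architecture" or "product"
    priority: int  # hypothetical score agreed at planning; higher is more important


def stack_rank(architecture, product):
    """Merge both backlogs into a single list, highest priority first.

    sorted() is stable, so items with equal priority keep their input order.
    """
    return sorted([*architecture, *product], key=lambda item: -item.priority)


architecture = [
    BacklogItem("Upgrade to the next version of Oracle", "architecture", 80),
    BacklogItem("Investigate GraphQL", "architecture", 40),
]
product = [
    BacklogItem("Example feature A", "product", 90),  # placeholder product work
    BacklogItem("Example feature B", "product", 60),  # placeholder product work
]

for rank, item in enumerate(stack_rank(architecture, product), start=1):
    print(rank, item.source, item.title)
```

The point of a single ranked list is that technical work competes for the same slots as feature work instead of living in a separate queue that never gets scheduled.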

And of course, this is a DevOps conference, so another job that the architecture team can help with is ensuring rapid and safe deployments. Rapid deployments reduce lead time, and safety improves the user experience.

So at Agile Central, I would say we have two types of DevOps teams. With smaller services, which are usually our newer services, kind of all the DevOps operation is usually built into the team itself. So the team that writes the service is also responsible for writing the CI and CD and deployment pathways for that particular service.

But for some of our larger services that get dozens and dozens of commits every day, maintaining that CI/CD pipeline is kind of a job in and of itself. So we have teams who are dedicated, who we call DevOps teams. These teams don't deploy code, they don't test code, but they write the tooling to help feature teams do that work.

Once you deploy your code out to production, you need some infrastructure for observability to make sure that deploy was successful. And there are two broad aspects of observability.

One is system metrics. So is the deploy working? Are the servers running? Is my latency okay?

And the second set of observability is user metrics. Are people actually using this feature that I deployed? Are people confused by it? What's going on there?

At Agile Central, we had a team dedicated to collecting user metrics and system metrics and analyzing these. And that's also what my team today does. At FreshTracks, we're working on monitoring for Kubernetes, and we're hoping to increase the level of observability in today's more modern container space, where machines are more ephemeral and coming and going constantly.

So once you've rolled out, you'll inevitably have outages. And the way our architecture team dealt with this is we retroed every outage, and we want to always identify solutions instead of blame.

I think one thing that you'll notice is not on this slide is root cause analysis, and I think that kind of goes with Sidney's talk this morning in that you don't want to just solve the problem you have. You want to see how you can do better for the next thing that might be coming down the road.

But even with all of that, it's never perfect. And I think it's important to... When you make an architecture decision, it's the best decision that you can make for a particular point in time, in the context of your system and your users and your product.

But technology continuously evolves over time, just like your product. So the architecture process is really more about putting a process in place that can successfully adapt to change, rather than making any one decision the right decision.

So one of the continuing issues we have in our agile DevOps and architecture is proving DevOps architecture ROI. I think oftentimes product comes out with numbers, "Hey, this new feature is going to have an ROI of X." And for some reason, sometimes those are seen as more believable than the ROI figures generated by the architecture team, even though I don't know if they are any more accurate.

We collect a lot of metrics, and we don't have enough data scientists to analyze it all. So we have some simple KPIs we look at and some moderately complicated KPIs. But I think there's a lot more value there that we could really extract out of this information.

And we have too many functional silos. We have cross-functional teams, and a team that writes a particular service writes it in a cross-functional manner, but the other teams may not feel like they can also work on that service, or may not have the knowledge to do that. So there's still some siloing there, even though we have cross-functional teams.

So thanks.

The slides are on GitHub if you search for React Present. Also, it looks like we have two minutes if we have any really short questions.

Q&A

Q: Have you tried to standardize processes across your teams?

A: Yeah. So they're slightly standardized. There are some aspects that are standardized and some that aren't. We use a standardized estimating system so that when we do quarterly planning, we have a little bit better grasp of what's going through.

But in terms of the weekly activities of how a team does their work week to week, that's entirely up to the teams.

Q: So how do you ensure consistency and quality across teams?

A: Yeah. So I think that's the whole kind of role of the architects and the directors, of ensuring that the work they do is backed up by the proper testing strategies before it gets to the point where we're ready to deploy.

Q: When I submit my user feedback, where does that go?

A: Okay. So it goes into a database that mostly the product managers read, but the engineers can also read as well. And some of it, depending on if it's associated with the beta feature, actually gets emailed directly to some people.

Thanks.

Okay. Thank you.