Adapting the Squad Model at IBM: DevOps and the IBM Marketplace

San Francisco 2016

Senior Software Engineering Manager · IBM

IBM is a company that loves to re-invent itself, but adopting a new culture within a company as large as IBM is a gradual process. I'll talk about my personal experience with our transformation of the IBM Marketplace, starting in May of 2015. We started with an ugly website running on traditional enterprise web infrastructure, with four-week sprints and a remote operations team.

Within about six months we built a far more agile, squad-based organization that uses true continuous delivery. The Bluemix infrastructure that runs the new Marketplace platform is also lightweight and modern, with very high availability, and our developers manage their own infrastructure.

Finally, I'll talk briefly about the design evolution of the website, which proceeded in parallel with the organizational and engineering infrastructure changes.

Full transcript

Ann Marie Fred

I'm Ann Marie Fred. I'm a senior engineering manager and DevOps coach in the IBM Digital Business group, specifically in the Marketplace Engineering organization.

So what I'll cover today is a bit of history: where the marketplace started and why it needed an overhaul, how we restructured our organization, how we modernized the website infrastructure, how we integrated design into our continuous delivery process, and some of the big challenges that we're facing going forward.

So here I am within the IBM org chart. I'm in the Digital Business group. The vice president level is the Digital Platform Engineering organization. My manager runs Commerce Platform Engineering, and I'll define that for you.

That organization develops and runs the website at ibm.com/marketplace. That also includes the checkout experience, the subscription management, actually deploying the free trials and SaaS offerings, support, and so on.

Specifically, I manage two squads in that organization. There's the Discovery Navigation and Search UI squad and the Storefront Experience squad. And you can always go check out our pages at ibm.com/marketplace.

So the Discovery Search UI and Navigation squad owns the marketplace landing page in all the various languages and locales that we render, as well as various search and navigation pages. The Storefront squad owns the product details pages, and these include product information, pricing information, links to get free trials and to actually purchase the product, and things like that.

So taking a step back, why is DevOps important to IBM?

Well, first of all, we want to be more competitive. We want to streamline our engineering and operations practices. We want to foster highly skilled and engaged teams who love coming to work, and we want to further our culture of innovation and excellence.

So here's a little bit of history from at least my part of IBM. I started out in 2011 on an advanced technology team that was a collaboration between Rational and Tivoli, and we went out with the mission to learn what DevOps was and how we could apply it to our own organization.

Also, in 2012, IBM acquired Green Hat, which we integrated into the product that we were developing at the time, which was SmartCloud Continuous Delivery. And we started IBM's first DevOps consulting practices.

In 2013, IBM acquired UrbanCode. At that point, we sunset the team that I was on, and I moved to a legacy software team, as did other people I was working with, and we started driving the DevOps practices into those legacy software teams. Like, could they do continuous deployment, more automated testing, and so on?

In 2014, I was on the Service Engage team, which was running a website that sold about a dozen IBM software-as-a-service offerings, and it was built on a pure DevOps platform. And we also started scaling up our DevOps enablement training across IBM, and we started doing IBM Design Thinking, which I'll talk about just a little bit later.

In 2015, we started the Whitewater organization, which is another DevOps team that runs the tools that enable other groups within IBM to be more efficient in adopting DevOps practices. So they run things like our installation of GitHub Enterprise and Travis CI and our Slack accounts.

And also in 2015, we rebooted the IBM Marketplace.

So where was the IBM Cloud Marketplace in early 2015, and why did we decide that we needed to reboot it?

It was called the IBM Cloud Marketplace because it covered about 150 products from the IBM Cloud division. We had some business agility issues. Our product onboarding process was slow and manual and took at least two months to complete.

We also had a four-week sprint cycle on the rendering stack. So from the time you requested a change to the time it was delivered took at least four to eight weeks. And we had some usability issues. Only a few of the products were actually available to purchase online. The content quality was poor. A lot of it was text only. We had garbage characters in there because people had copied and pasted from PDF files and so on. And the rendering was difficult to change. It was pretty ugly and drab.

So here's an example of one page on that old rendering stack. This is Watson Analytics from November of 2014.

Sorry. Got a cough.

And this is the same page in June of 2015, right before we cut over to the new stack. It's not very inspiring, if you ask me.

So what did we do to start our transformation?

Well, step one was to build a more agile organization. So we started a new marketplace engineering organization, more or less from scratch. We brought in dozens of new hires. More than half the organization was new to IBM.

We pulled in the Service Engage development team for their expertise with DevOps and continuous delivery. We also pulled in the partner marketplace team so we could continue to sell third-party products in the marketplace. And we accelerated the DevOps transformation of our legacy back-end systems.

So we reorganized them into squads that owned the entire life cycle of their service, and we shortened the sprint cycles from the four- to six-week range down to the one- to two-week range.

So here are some of the DevOps practices that have worked really well for us.

We have DevOps boot camps for our leadership and also for the members of the squads. We use continuous delivery to the IBM Bluemix platform as a service. We have many small deployments per day. Basically, every time a developer commits a change into the master branch in GitHub, it gets automatically deployed, usually using Travis CI if it's a simple deployment. And the deployments are zero downtime using a blue-green deploy.
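The blue-green pattern mentioned above can be sketched roughly as follows. This is a minimal illustration in Python, not the team's actual Travis CI or Bluemix configuration; the `Router` class, the slot names, and the `health_check` callable are all hypothetical.

```python
from typing import Callable, Dict


class Router:
    """Hypothetical router that sends all live traffic to one of two slots."""

    def __init__(self, versions: Dict[str, str], live: str = "blue") -> None:
        self.versions = dict(versions)  # slot name -> deployed version
        self.live = live

    @property
    def idle(self) -> str:
        """The slot that is not currently receiving traffic."""
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version: str, health_check: Callable[[str], bool]) -> bool:
        """Deploy to the idle slot; switch traffic only if it is healthy."""
        target = self.idle
        previous = self.versions[target]
        self.versions[target] = version
        if not health_check(version):
            # The live slot was never touched, so visitors see no downtime.
            self.versions[target] = previous
            return False
        self.live = target  # one switch moves all traffic to the new version
        return True


router = Router({"blue": "v1", "green": "v0"})
router.deploy("v2", health_check=lambda version: True)
print(router.live, router.versions[router.live])
```

The point of the design is that a failed deployment only ever affects the idle slot, which is what makes frequent small deployments safe.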

And we have 100% automated unit test coverage. We use UrbanCode Deploy if we have complex deployments. So for example, myIBM, which is another one of the marketplace components, interfaces with our payment system, so it has to be able to connect to the staging checkout system for testing purposes. So they use UrbanCode Deploy.

We have fast deployments, usually in the five- to 20-minute range, which allows us to make repairs very quickly. And we instituted social coding on GitHub Enterprise, where the old IBM practice was treating product code as jewel code that was to be protected and limited to only the product team that was working on it. Now we have social coding and encourage people to collaborate across squads.

I don't have time to get into those in any detail, unfortunately, but I would like to refer you to the Bluemix Garage Method website, which is at ibm.com/devops/method. And it basically has an article on each of these practices, and it's really great. And also, if you stop by the IBM booth, they have a Bluemix Garage Method field guide that you can get for free.

The next few slides list some more of the practices that have been very successful for us. Again, I don't have time to dig into these in any detail, but I'll run through them really quickly.

We have the squad model, which was inspired by Spotify. We try to have a just culture, which is inspired by Etsy and includes things like blameless postmortems. We have microservices where we can, which is based on the Netflix model. We use lean and Kanban practices, which come from Toyota. So this includes things like minimum viable products, limiting the work in progress, and using the pull model for work.
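As a rough illustration of two of the Kanban ideas named above (limiting work in progress and pulling work rather than having it pushed), here is a small Python sketch. The `Board` class and its method names are hypothetical and are not taken from any tool the squads use.

```python
from collections import deque
from typing import Deque, List, Optional


class Board:
    """Toy Kanban board with a backlog and a WIP-limited 'in progress' column."""

    def __init__(self, wip_limit: int) -> None:
        self.wip_limit = wip_limit
        self.backlog: Deque[str] = deque()
        self.in_progress: List[str] = []
        self.done: List[str] = []

    def pull(self) -> Optional[str]:
        """Pull the next story only if there is capacity under the WIP limit."""
        if len(self.in_progress) >= self.wip_limit or not self.backlog:
            return None
        story = self.backlog.popleft()
        self.in_progress.append(story)
        return story

    def finish(self, story: str) -> None:
        """Finishing a story frees capacity, which allows the next pull."""
        self.in_progress.remove(story)
        self.done.append(story)
```

With a limit of two, a third `pull()` returns `None` until one of the stories in progress is finished.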

We also use Scrum, which has a long history of success in IBM, and we use this to manage work across squads. So we've gotten all the squads onto these weekly or sometimes biweekly sprints, and they can open stories against each other in Rational Team Concert, which is a tool we use to coordinate between the squads.

And finally, design thinking, which is an IBM-wide initiative. And this is not just the designers who use design thinking, it's all IBMers. So it's teaching us to take a step back and think at a higher level, make sure that we're building the right thing, and then DevOps is how we build the thing right.

So I'm going to talk a little bit more about the squad model and how we try to make a huge company feel like a small one.

So our version of the squad model is not exactly the same as Spotify's, but what we have is we have small and focused teams. They're co-located. They're independent from each other. They're accountable for their own product deliveries. They're autonomous, and they're self-managing.

And to make that happen, we have people who fulfill each of these roles in each of the squads. So we have the squad lead, which is the product manager who controls what features get into the component and in what order. We have a development manager like myself. We have a technical lead. We have several full-stack developers, and they're responsible for developing, testing, deploying, operating, and supporting their components.

We have dedicated visual and user experience designers on each squad, at least the ones that have a front end to them. Most squads have a project manager, and then where needed, we have subject matter experts.

So let's talk about the scale of this squadification. The IBM Digital Business group is several hundred people, and it's at least 75 squads. We just added some more. The Marketplace organization within that is about 20 of those squads. And the Marketplace uses on the order of 100 services to run that website when you include all the back-end systems and operational tools. The level of DevOps adoption and squadification varies even within Marketplace Engineering.

All right, so now that we had a more agile organization, the next step was to do a bake-off. So people were not completely convinced that the old platform wasn't going to work for us. So we ran both in parallel.

I heard that they had about 90 people running each of the platforms for those six months. So in May of 2015, we started building a new rendering platform. We used the legacy WebSphere content management system, but we used it in a headless mode. So we got them to export all the content to a Cloudant NoSQL database.

We kept the legacy systems for pricing, checkout, and payment, subscription management, SaaS offerings, and the automated deployments of those, but we put new REST APIs in front of them so we could use them more effectively. And we took the Nautilus rendering engine from the Service Engage website, which is based on Node and Express.

And we enabled the team with best-of-breed tools, including GitHub Enterprise, Jenkins, Bluemix, New Relic, PagerDuty, Slack, and so on. And these are the tools from the Whitewater team.

The other thing we had to do before we cut over to the new stack was enable globalization and localization of the new stack. So we added new countries gradually throughout 2016. We now have nine languages, which was a massive, massive translation effort.

We have product blacklists for each country, so each country has a different list of products that can be sold there. Each country has its own pricing and currency information. They have unique financing and SaaS security information for each country. We can localize which products are featured on the landing and navigation pages, and what the page banners are, on a per-locale basis.

We can route our live chat. So if you request a live chat on the website, we can route it to the right country and language based on where you are and the right product area. And we also auto-detect the country you're coming from when none is specified.
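The per-country exclusion lists and the country fallback described above could look something like the following sketch. All names (`CATALOG`, `EXCLUDED_BY_COUNTRY`, `resolve_country`, `visible_products`) and the sample data are invented for illustration, and the default country is an assumption; the real Marketplace data model is not described in the talk.

```python
from typing import Dict, List, Optional, Set

# Invented sample data, for illustration only.
CATALOG: List[str] = ["product-a", "product-b", "product-c"]
EXCLUDED_BY_COUNTRY: Dict[str, Set[str]] = {
    "fr": {"product-b"},
    "kr": {"product-a", "product-c"},
}
DEFAULT_COUNTRY = "us"  # assumed fallback when nothing else is known


def resolve_country(requested: Optional[str], detected: Optional[str]) -> str:
    """Prefer an explicitly requested country, then the auto-detected one."""
    return (requested or detected or DEFAULT_COUNTRY).lower()


def visible_products(country: str) -> List[str]:
    """Return the catalog minus whatever is excluded for this country."""
    excluded = EXCLUDED_BY_COUNTRY.get(country, set())
    return [product for product in CATALOG if product not in excluded]


country = resolve_country(requested=None, detected="FR")
print(country, visible_products(country))
```

Keeping the exclusions as data rather than as per-country templates means a new country can be added without touching the rendering code.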

So once we proved that we could reach feature parity with the old marketplace and have a better design experience, it was time to cut over to the new platform and start to scale up.

So if you go back to 2014, you had the Cloud Marketplace with about 150 products and Service Engage was about 12. We started that reboot in May of 2015. We had our bake-off period in October. We reached feature parity and started the cutover in November. And basically we just took the old URLs and used Akamai to route them to the new platform.

And by early 2016, we had about 450 products on the new marketplace. IBM decided that we weren't just going to put our cloud products on there, we were going to put all of our products on that marketplace, and also our third-party products that we sell. So we renamed it to the universal marketplace, and we continued our globalization efforts, and we expanded the catalog to include things beyond software, like hardware, services, and online courses.

Oops.

So again, the scale, it's more than 700 products: roughly 500 IBM, roughly 250 third-party products. More than 300 of them have free trials. More than 200 can be purchased online. We have more than 100 online courses. So all in all, that's roughly 50,000 different pages that we're rendering across all the different locales. And the number of marketplace visitors has tripled from a year ago to today.

And so what's the infrastructure for this? Well, it's smaller than you might think. So the product details and navigation pages, which are most of the 50,000, are running on about 10 Bluemix Node.js instances in two data centers for failover. And we're able to achieve almost 100% uptime, even during the DNS DDoS attack a couple of weeks ago, because we aggressively cache using Akamai as our content delivery network.

So we do occasionally have problems in the underlying system, but we just make sure that we fix them before they get out through the cache. And during the DDoS attack, we just told Akamai not to clear the cache for 72 hours, and we were fine.
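The idea of continuing to serve cached pages while the underlying system has a problem can be sketched as below. This is a toy model, not Akamai's API or configuration; `StaleTolerantCache` and `fetch_origin` are hypothetical names, and the time-to-live value is arbitrary.

```python
from typing import Callable, Dict, Tuple


class StaleTolerantCache:
    """Toy cache that keeps serving an old copy when the origin fails."""

    def __init__(self, fetch_origin: Callable[[str], str], ttl_seconds: float) -> None:
        self.fetch_origin = fetch_origin
        self.ttl_seconds = ttl_seconds
        self._store: Dict[str, Tuple[float, str]] = {}  # url -> (stored_at, body)

    def get(self, url: str, now: float) -> str:
        cached = self._store.get(url)
        if cached is not None and now - cached[0] < self.ttl_seconds:
            return cached[1]  # fresh hit, origin is not contacted
        try:
            body = self.fetch_origin(url)
        except Exception:
            if cached is not None:
                return cached[1]  # origin is down: serve the stale copy
            raise  # nothing cached, so the failure is visible
        self._store[url] = (now, body)
        return body
```

The current time is passed in as `now` so the behavior is easy to test. A page that was cached before an outage keeps being served during it, which is why problems fixed quickly enough never reach visitors.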

We also run the search UI on Bluemix, and that's a similar number. It's on the order of 10 Bluemix nodes. And we can't cache that because it's search, it's interactive. And it works.

So the new platform has 99.4% uptime since the beginning of this year. The old platform is still running, and it's had 96.5% uptime. And actually, the only major outage on the storefront was all within one day and was only in French.
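To put those two percentages in perspective, they can be converted into hours of downtime. The short calculation below assumes a 30-day window purely for illustration; the talk does not state the exact measurement period.

```python
def downtime_hours(uptime_percent: float, period_hours: float) -> float:
    """Hours of downtime implied by an uptime percentage over a period."""
    return (100.0 - uptime_percent) / 100.0 * period_hours


HOURS_PER_30_DAYS = 30 * 24  # 720 hours; the 30-day window is an assumption

for label, pct in [("new platform", 99.4), ("old platform", 96.5)]:
    hours = downtime_hours(pct, HOURS_PER_30_DAYS)
    print(f"{label}: {hours:.1f} hours down per 30 days")
```

Under that assumption, 99.4% uptime works out to roughly 4.3 hours of downtime per 30 days, against roughly 25.2 hours at 96.5%.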

All right, so taking a step back, I want to talk about the design improvements and how we've integrated that into our continuous delivery cycle.

So back in 2014, we codified and socialized the IBM Design Language, which is sort of a high-level view of what the IBM look and feel is going to be. And we hired and trained hundreds of designers.

In 2015, we updated our navigation standards, what we call V18, across all of IBM.com, and we started assigning designers to individual squads. And these designers create both grand visions of what the page should look like six months from now, as well as incremental changes that show the developers how to make changes from week to week.

We also trained thousands of people in IBM Design Thinking. And then in 2016, around the beginning of this year, we started using Optimizely to do a lot of A/B testing on our website. And we also do total customer experience audits, where we have people walk through the user journey and make sure it's as smooth as possible, and then they open stories against the squads to clean up problems.
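One common way to split visitors between page variants in an A/B test is deterministic hashing, sketched below. This is a generic illustration and not Optimizely's API; the function name `assign_variant` and the experiment and variant names are hypothetical.

```python
import hashlib
from typing import Sequence


def assign_variant(visitor_id: str, experiment: str, variants: Sequence[str]) -> str:
    """Deterministically map a visitor to one variant of an experiment."""
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the hash as an integer bucket number.
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


print(assign_variant("visitor-123", "landing-banner", ["control", "new-design"]))
```

Because the assignment depends only on the visitor and experiment identifiers, a returning visitor sees the same variant on every page load without any stored state.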

And the result of all this is that we won Gold awards from the W3 Awards in four categories this year. And that's for design, website design awards.

Here's just an example of what iterative design looks like. So the designer has taken one part of the screen and showed a few changes that a developer can make in a couple of hours.

And the next few slides just kind of show how the page has evolved over time. So on the upper left, you see what the Marketplace search UI looked like six months ago, and what it looks like now. You see there's a lot more information about each product, and it's also translated. That's Korean.

If you remember what Watson Analytics looked like in June 2015 before the cutover, this is what it looked like in February of 2016. So it's the same content, just rendered with a different template. And this is what it looks like today.

And the landing page is another example. So you can see on the upper left what it looked like in September of 2015 and what it looks like now. It's a much more flexible template now than it was before.

We learned a few things about putting designers in our squads. One is that getting the balance between design and development right takes some practice. So if you don't have any designers, you're going to have really ugly user interfaces. At one point, we had seven designers on the Storefront squad to nine developers. That was also a major mistake because the designers got way ahead of development and just made the developers look horribly slow.

And the right ratio is going to depend on the squad and its mission. But starting with one designer to every four developers is a good rule of thumb.

So what's next?

Well, we have to become a more data-driven organization. So we need to do more usability studies, continue doing A/B testing. We also are learning how to use our traffic and heat map data more effectively, and other analytics that we can gather on the website. And finally, we use our monitoring feedback.

Sorry.

So what are some of the major challenges and things that I'm hoping to learn from you guys?

Well, one is how to better plan for one-week sprints. It's really difficult. It's hard to break stories down into small stories with small designs. The stories have to be clear and well-defined with really good acceptance criteria. All of your dependencies have to be ready to go before you start, and if you miss any of these steps, then you have false starts and unnecessary delays, right?

And also, it's really difficult to coordinate changes across these squads since they're independent. They have their own missions, conflicting priorities, and so on. And when you have dependencies, it's hard to get them scheduled at the right time, especially if you have a chain of dependencies where you have to have a handoff from squad to squad. That can make your delivery time take a long time.

Also, we have urgent requirements that are coming in from every direction, and this can lead to burnout, so it's something that we have to continually address.

So that's the main content of my talk. I did want to encourage everybody to submit DevOps talk proposals to InterConnect. The call for papers ends this Friday. So that's at ibm.com/interconnect.

And here's my contact information. I would love to talk to you after this. And remember to stop by the IBM booth if you want to get one of those Bluemix Garage Method handouts.

Great. How much time?

Any questions?

Q&A

Q: Since the squads are independent, as you said, how do you ensure that they stick to your platform standards, that they don't go off and do whatever they want? And also, how do you prevent them from reinventing the wheel and creating solutions for the same problems that the other teams have faced?

A: Right. So we do allow the squads quite a bit of leeway in the technologies they use. So for example, Storefront uses React. myIBM uses Angular to render.

We have some things that are standardized across squads, like using New Relic for monitoring so that we can drill down through the monitoring system for the various components. We also have to make sure that they're integrated into our operations dashboard. They don't all have to use PagerDuty, but they all have to have a way for us to page out to the other squads.

So it's kind of like a lightweight set of requirements that each squad has to meet, and then within that, they're able to do what they want to.

Most of the new squads use JavaScript as a programming language, and that's helpful from a skills perspective so that people can contribute to the various squads.

And the other thing that we standardized is the release management practices for continuous delivery. So there are a few things that you have to do, like accessibility testing, the 100% code coverage, security testing, and things like that. But we also tell squads what they don't have to do, like gate production deployments with a human and things like that. So we put a lot of effort into getting just the right level of process there.
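The release rules in this answer amount to a set of automated gates with no manual approval step. A minimal sketch of that idea follows; the `ReleaseCandidate` type, its field names, and the two functions are hypothetical, and only the three checks themselves come from the answer above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReleaseCandidate:
    coverage_percent: float
    accessibility_passed: bool
    security_scan_passed: bool


def failed_gates(rc: ReleaseCandidate) -> List[str]:
    """Return the names of the automated gates this candidate fails."""
    failures = []
    if rc.coverage_percent < 100.0:
        failures.append("code coverage")
    if not rc.accessibility_passed:
        failures.append("accessibility testing")
    if not rc.security_scan_passed:
        failures.append("security testing")
    return failures


def can_deploy(rc: ReleaseCandidate) -> bool:
    # No manual approval step: passing every automated gate is sufficient.
    return not failed_gates(rc)


print(can_deploy(ReleaseCandidate(100.0, True, True)))
```

Returning the list of failed gates, rather than a bare yes or no, gives a squad something actionable when a deployment is blocked.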

Q: Hi. I took a picture of your slides there. Thanks a lot, by the way, for sharing your squad roles. So you mentioned just now about release manager. Was that actually one of the squad roles, or is one of the people in the squad role responsible for that release? Because the thing I was wondering about was the integration of the infrastructure needs, security needs, those other outlying things. Are you putting that in your squad, or is that still an outlying type of consulting service that you guys have to go through?

A: Right. So that's a collaboration between the project management office, the PMO, and the development managers like myself. So the PMO and the development managers defined what those release management processes and standards were going to be, and then we, the development managers, are responsible for getting the stories into the product backlog to implement those changes. Does that make sense?

So we have one more.

Q: Hi. How much rotation do you have among the resources on your squads?

A: Not as much as you would think. So we always say that people are free to change squads, but they very rarely ask to do it.

We have had squads where big squads were split into two or three squads. We've had squads where small ones were merged. We have taken squads that were distributed across locations and co-located them, and we have some squads that have spun up and some that have shut down in that two-year period.

But generally, we don't have a lot of developers shifting squads just because they asked for it. It is possible, though.

Anyone else?

Okay. Yeah. All right. Well, thank you very much.