SAP’s DevOps Journey: From Building an App to Building a Cloud
SAP has been using a DevOps & Continuous Delivery approach for building its web and mobile apps for several years, and is now building and running a global cloud at the scale needed to support the digital transformation needs of its customers. This talk recaps the story of how SAP originally adopted DevOps practices before moving on to describe how the Cloud Infrastructure Services team is building and operating its 3rd generation cloud automation system using microservices, containers and open-source software.
Marc Ng
My name is Marc Ng.
Thank you very much for joining me today for my talk on SAP's DevOps journey.
Now, I'm assuming you guys are all familiar with SAP, but by way of an introduction anyway, SAP is a German software company. We've been around for nearly 45 years now. 80,000 employees, 130 locations worldwide, and around half of those employees, around 40,000, are actually in technology-related roles. So think developers, engineers, architects.
Now, even if you're not an end user of SAP directly, you may still inadvertently have come into contact with SAP, since over three-quarters of the world's entire transaction revenue touches an SAP system at some point in its workflow.
As for me, I've been with SAP for about 16 years now. I'm currently based, and have been for the last couple of years, out of Berlin in Germany. Prior to that, I was based out of London in the UK.
So you can all imagine my dismay when I fly halfway around the world to avoid the Brexit madness, and then I come here, and you guys have just as crazy people as we do over there. People are people.
So I'm currently with the Cloud Engagement and Consulting team. Prior to that, I was part of SAP's IT, in a team which was responsible for delivering, building, and running internal and external web applications. And this is really where most of this journey takes place.
So about the journey that I want to take you on today: it spans about six years, and I'm hoping that it will illustrate to you that it's possible for a small team within a huge corporation such as SAP to drive the change towards a DevOps culture. So the environment has to be right. There has to be a willingness to change. There has to be a willingness for collaboration. And of course, you have to get management buy-in.
So this is all the good stuff that we've heard the other speakers talk about as well during the rest of the summit.
Now let me take you back to 2010. In the web team, we'd already given some consideration to some of the Lean principles, and we'd already got some nice tooling in place. So we had source code version control with Perforce. We were using the Atlassian suite of products for issue tracking and for build automation, and we were already doing monthly releases.
Some of the things that we weren't doing so well: we had a long lead time for hardware. So from requesting the hardware to actually getting hold of anything could take anywhere between three and six months. We had a very labor-intensive QA cycle, all manual, done by a pool of QA engineers. And the code deployment was done from spreadsheets: step-by-step instructions in a sheet, which had to be followed.
And as I said, physical hardware, so the actual production deployment required some sort of downtime.
We also committed the deadly sin of working in silos. That dirty word again that people have spoken about at this summit as well. So development, operations, and infrastructure, they were all in completely separate units: separate managers, separate strategies, and they really didn't want to talk to each other unless they absolutely had to.
So the release cycle that we had, or that we would've liked to have had at that point, looked a little bit like this. Four weeks of development, followed by two weeks of QA, then the production go-live ideally over a weekend. And that basically gave you a six-week cycle from the business request coming in to the functionality actually going live on production.
As I said, that's the theory. Now, in practice, it looked a little bit more like this. The initial four-week development cycle was fine. You would have a development close. The system would be given to QA. They would typically see the system and the functionality for the first time at that point. They would start testing, and then they would start doing this really annoying thing of finding bugs.
And of course, those bugs need to be fixed. So those would go back through feedback into the development. And then you'd get the business coming in last minute now with things like, "Marc, remember the URL that we talked about, that we discussed, that we got approved, that we've talked about for the last four weeks? Well, we printed some marketing material and there was a typo in it, but we've already publicized it, so it's now out in the wild. We can't change it anymore. But you guys haven't gone live yet. How difficult can it be to change a URL?"
So all of this was happening, and eventually you'd get to the point where you're ready to release.
And so you'd have operations coming in on a Saturday morning, maybe 4:00 a.m. They'd pull this sheet of deployment steps. Again, it's pretty much the first time they've seen it. There's a bunch of amendments in there from the bug fixes and from the late requirements that came in. So they start working through this sheet. They get to maybe step five, and it's a little bit ambiguous, so they make the call, "Can't be too bad if we just skip that step and just run through to the end."
Well, of course, lo and behold, surprise, surprise, the system doesn't come back up at the end.
Now, remember I said that there was a downtime? So they'd have actually taken the production environment down at this point. It's offline. So now there's frantic calls to developers asking, "What do we do?"
Development now spends the weekend as well trying to fix the issue, and eventually they get the system back online. We're now at Monday morning. The developers come in, and now they're faced with the realization that they've got four weeks of development work which they need to deliver, but there's only two weeks left in the cycle, with a development close coming up, which can't be moved.
So what was the straw that broke the camel's back?
Well, in 2010, within the web team, we had an estimated capacity of 20,000 person-days amongst 100 staff. The actual demand that came in there was three times that, and of course, there's no way you're going to get budget to hire another 200 staff.
So this theme will repeat itself throughout the journey. It's about the people. And you might hear me mention it once or twice, so keep an eye out for it.
And at this point, it's about a change in the mindset of the people. So the chief architect at that time, he comes across this book by Jez Humble, and he really sees this as the solution to the issues that we're having. And he takes away from it two key aspects: automate everything and version control everything.
In fact, he believes so strongly that this is the solution that he makes reading this book a bonus-relevant objective.
Now, the director of the unit, his boss, trusted the chief architect, and the chief architect was able to get his buy-in into his vision as well. So the director agreed to take 10% of all of the project budgets that came in for that year and to put them to one side, without necessarily being overly transparent about it, and to allocate that to this DevOps initiative.
Now, rather than roll this out across the whole department, a pilot project was chosen: the SAP ID Service, as it was called back then. And I'm hoping this will resonate with some of the other larger organizations. At that point, SAP had a lot of web applications and websites, all with different user IDs and different passwords. And the idea of this project was to create a seamless single sign-on between all of those applications so that you would only have one password.
But this application would eventually be rolled out to millions of users, since it would also go external. So it was definitely business critical.
And it's about the people, and it's about a change in the way people work. So we removed the silos, and we put together this cross-functional team of 12 people, so very much in Scrum mode: product owner, Scrum master, QA specialists. And at this point, we now had dev and ops together working in one team.
Now, it was still a virtual team, which is not ideal, but the good thing was that the five countries they were spread across are relatively close in terms of time zone, so you're talking about maybe two or three hours. So it's not like the problem that we sometimes have when you're working from Europe with the States, where you ask a question in the evening and you have to wait a whole night before you get the answer.
And task one really was to increase the test coverage. And to do that, we used this tool called Cucumber, which allows you to describe and document system behavior without mentioning anything about the implementation method at this point, before writing a single line of code.
And it uses this language called Gherkin, which you see at the bottom there, which is really understandable. It's understandable by everyone. It's just normal spoken language using this given-when-then format.
And at this point, the product owner, so the business, was able to collaborate with the technology guys, the QA and the development, at a really early stage. And everybody was on the same page. Everybody was speaking the same language.
Now, the really cool thing about this Gherkin language is you're able to map the individual lines in the scenario to Java or Ruby methods in the background. The Java or Ruby language is only the language of the test, so your application can still be written in whatever else you like.
But what this gives you is an executable specification. It gives you the possibility to run an agreed-upon requirement, so the requirement that you agreed with the business, and automatically test whether the functionality works as expected. And it's the QA and the developers, again, collaborating early to write this test logic.
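To make the idea of an executable specification a little more concrete, here is a minimal sketch. It is written in Python rather than the Java or Ruby mentioned in the talk, and it is not Cucumber itself: the `step` decorator, the scenario text, and the toy `FakeSso` login service are all assumptions made up for illustration. It only shows how each given-when-then line could be matched to a function in the background.

```python
import re

# Registry of (compiled pattern, function) pairs, filled by the @step decorator.
STEPS = []

def step(pattern):
    """Register a function as the implementation of one scenario line."""
    def register(func):
        STEPS.append((re.compile(pattern), func))
        return func
    return register

class FakeSso:
    """Hypothetical stand-in for the application under test."""
    def __init__(self):
        self.users = {}
        self.logged_in = None

    def login(self, user, password):
        if self.users.get(user) == password:
            self.logged_in = user

@step(r'a registered user "(\w+)" with password "(\w+)"')
def given_user(ctx, user, password):
    ctx.users[user] = password

@step(r'"(\w+)" logs in with password "(\w+)"')
def when_login(ctx, user, password):
    ctx.login(user, password)

@step(r'"(\w+)" is signed in')
def then_signed_in(ctx, user):
    assert ctx.logged_in == user, f"{user} is not signed in"

# The business-readable requirement (invented example, not from the talk).
SCENARIO = '''
Given a registered user "alice" with password "secret"
When "alice" logs in with password "secret"
Then "alice" is signed in
'''

def run_scenario(text, ctx):
    """Run every non-empty line against its matching step; return the lines run."""
    executed = []
    for line in filter(None, (raw.strip() for raw in text.splitlines())):
        # Drop the leading keyword (Given/When/Then) before matching.
        body = line.split(" ", 1)[1]
        for pattern, func in STEPS:
            match = pattern.fullmatch(body)
            if match:
                func(ctx, *match.groups())
                executed.append(line)
                break
        else:
            raise LookupError(f"no step matches: {line}")
    return executed

if __name__ == "__main__":
    print(len(run_scenario(SCENARIO, FakeSso())), "steps passed")
```

The scenario text stays readable by the product owner, while the step functions stay in the hands of QA and development, which is the split described above.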
But it's about the people and some of the challenges that we had around the human aspect of this. So the developers at that point didn't really have a culture of test-driven development. So asking them to write all of these tests, to get all of this test coverage up front, was a challenge.
So what we did was we asked them mainly to focus on the successful tests, on the expected outcomes of the functionality, so on the happy path. The edge cases would be taken care of when they found bugs: they would write the tests for those at that point.
The other challenge we had was with the different languages that we had. And I don't just mean the spoken languages. As you saw, we weren't really co-located. With this, I mean the difference in language between the business and the technologists, which again is huge.
But again, using this Gherkin language, as you saw before, it put everybody on the same page. Everybody understood what we were talking about, and the early collaboration really helped again. And it really gave everybody a shared sense of ownership of this particular functionality.
You'll remember I spoke about the spreadsheets or the Word documents we had with all of the deployment steps in it. Clearly, that was prone to huge human error. So we really needed a configuration management tool at this point.
We went for Chef purely because it's the first thing we looked at. It fit the requirement that we had, and we didn't look anywhere else. But of course, there are other tools out there, such as Puppet and Ansible. They all do this thing really well.
And what Chef allows you to do is, it allows you to create scripts, which they call recipes and cookbooks, which give you a real consistency and take away this human error. So once you've written your script, you can run it over and over again and get the same result at the end of it. And it gives you a real predictability and it gives you real confidence.
The other thing it gives you is this concept of infrastructure as code. So you've now got these recipes, which of course go into your version control system. And at that point, you've actually documented and version-controlled your entire landscape.
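The property being described here, running the same script repeatedly and getting the same result, can be sketched in a few lines. This is Python, not Chef's Ruby DSL, and the `Node` model, the `RECIPE` data, and the `converge` function are hypothetical names chosen for illustration. The sketch only demonstrates the idea of declaring a desired state and acting only where the current state differs.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Hypothetical in-memory model of a server's state."""
    packages: set = field(default_factory=set)
    files: dict = field(default_factory=dict)

# Desired state, written as plain data so it can live in version control.
RECIPE = [
    ("package", "nginx"),
    ("file", "/etc/motd", "managed by automation\n"),
]

def converge(node, recipe):
    """Bring the node to the desired state; return the list of changes made."""
    changes = []
    for resource in recipe:
        kind = resource[0]
        if kind == "package":
            name = resource[1]
            if name not in node.packages:        # act only when state differs
                node.packages.add(name)
                changes.append(f"installed {name}")
        elif kind == "file":
            path, content = resource[1], resource[2]
            if node.files.get(path) != content:  # act only when content differs
                node.files[path] = content
                changes.append(f"wrote {path}")
        else:
            raise ValueError(f"unknown resource type: {kind}")
    return changes

if __name__ == "__main__":
    node = Node()
    print(converge(node, RECIPE))  # first run applies both resources
    print(converge(node, RECIPE))  # second run finds nothing left to change
```

Because the recipe is just data in a file, every change to the landscape shows up as a commit, which is what makes the landscape documented and version-controlled.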
Now, the other thing which is possible using these cookbooks now is this concept of blue-green deployment. If you have a cloud available to you, that's ideal because what you can do at this point is, when it actually comes time to release, you simply spin up an exact copy of the landscape that you have in production, which is offline. You deploy to that landscape. You'd run some basic tests to make sure everything is running. And then to go live, all you do is you switch the load balancer over to direct traffic to the new instance. The new instance goes online. Your old instance goes offline.
Typically, you'd hold onto the old instance two, three days, maybe a week, depending. And once you're happy with the live instance, now you can just destroy that old instance.
But what's really cool is, and the reason you're holding onto the old instance is, if, let's say, in the first week now you have an incident and the person on support at 2:00 in the morning can't figure out what's going on, all they need to know how to do is to switch the load balancer back over. You know you've got a robust backup system there for now. You switch it over, the person can go back to bed and troubleshoot the issue in the morning.
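The switch-over and the 2:00 a.m. rollback can be sketched as follows. This is a toy model in Python under stated assumptions: the `LoadBalancer` class, the environment dictionaries, and the `smoke_test` check are invented for illustration and do not correspond to any particular load balancer's API.

```python
class LoadBalancer:
    """Hypothetical load balancer that routes all traffic to one environment."""

    def __init__(self, live):
        self.live = live
        self.previous = None

    def switch_to(self, target):
        # Keep a reference to the old environment so it can be restored.
        self.previous, self.live = self.live, target

    def roll_back(self):
        if self.previous is None:
            raise RuntimeError("no previous environment to roll back to")
        self.live, self.previous = self.previous, self.live

def smoke_test(env):
    """Stand-in for the basic checks run against the offline copy."""
    return env["healthy"]

def release(lb, new_env):
    """Switch traffic to new_env only if its smoke test passes."""
    if not smoke_test(new_env):
        return False          # production is left untouched
    lb.switch_to(new_env)
    return True

if __name__ == "__main__":
    blue = {"name": "blue", "version": "1.0", "healthy": True}
    green = {"name": "green", "version": "1.1", "healthy": True}
    lb = LoadBalancer(live=blue)

    release(lb, green)
    print("live after release:", lb.live["name"])

    # Incident during the first week: one action restores the old instance.
    lb.roll_back()
    print("live after rollback:", lb.live["name"])
```

The point of the design is that rollback needs no knowledge of the application: the person on support only has to reverse the last switch.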
So at this point, if the automated testing gave us confidence to release, Chef really gave us confidence to deploy.
But again, it's about the people, and the challenge that we had was that everybody on the team now had to learn Chef and Ruby, which, with Chef particularly, can be quite a steep learning curve initially. What you do end up with at the end of it, though, is real cross-functional expertise within your team.
The other challenge was with all of those manual scripts that we had within the spreadsheets, et cetera, converting all of that now into Chef recipes or cookbooks.
But again, the result is that dev and ops are collaborating now on these recipes. They're working on them together. They're owning them together. And what's really cool is when you then later on look at these recipes within your source code version control system, you actually see the commits that have gone in, one by dev, one by ops, one by dev, one by ops. And it's like they really have a shared sense of ownership now of this code.
Now, we'd already made some moves towards virtualization at this point. So we had pools of VMs which were allocated to projects, but there was still some red tape involved. You still had to email people to get approval for this, to get hold of it. It took a couple of days.
Once you had your VM, there were still manual tasks to do. So you would still have to register it with a Chef server. You'd have to do an initial Chef client run in order to enable the instance for automation, and you'd have to then validate the installation.
So one of the really clever guys on the team wrote this tool called Cocktail, which enabled us to automate all of those actions and create a really complex landscape with very few commands. And at this point, we were able to reduce the deployment time from hours to minutes. And now we're much closer to the original vision that we had of the monthly deployment.
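The talk does not show how Cocktail worked internally, so the following is only a sketch of the general pattern it describes: chaining the manual steps (register, initial client run, validate) so that a whole landscape can be built with one call. Every function name and the dictionary-based VM model here are assumptions for illustration.

```python
def allocate_vm(name):
    """Hypothetical allocation step; returns a simple record for the new VM."""
    return {"name": name, "registered": False, "converged": False}

def register_with_config_server(vm):
    """Stand-in for registering the VM with the configuration server."""
    vm["registered"] = True

def initial_client_run(vm):
    """Stand-in for the first client run that enables automation on the VM."""
    if not vm["registered"]:
        raise RuntimeError(f"{vm['name']} must be registered before the first run")
    vm["converged"] = True

def validate(vm):
    """Fail loudly if any earlier step did not complete."""
    if not (vm["registered"] and vm["converged"]):
        raise RuntimeError(f"{vm['name']} failed validation")

# The previously manual steps, in the order they have to happen.
BOOTSTRAP_STEPS = [register_with_config_server, initial_client_run, validate]

def build_landscape(names):
    """Allocate and bootstrap one VM per name; return the finished landscape."""
    landscape = []
    for name in names:
        vm = allocate_vm(name)
        for bootstrap_step in BOOTSTRAP_STEPS:
            bootstrap_step(vm)
        landscape.append(vm)
    return landscape

if __name__ == "__main__":
    for vm in build_landscape(["web-1", "web-2", "db-1"]):
        print(vm["name"], "ready")
```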
So we now have some really nice tools in place. We've really improved there. But now we mustn't forget about the most important part, and that's the people.
So it's about fostering a culture of self-improvement in the people as well. So we were already doing Agile. We were already in a Scrum team, but we were doing Agile in terms of the Agile that you would read on the internet or from blogs.
And what we needed now was to invest in the people and to get some budget for some real Agile coaching. So we were lucky in this that the product owner in Berlin, he actually knew of a coaching company where the head coach was very well versed in software development. He had something like 20 years of experience, and he was really able to come on board with us to really understand the problems that we had and to really coach us through it.
Now, the other thing is we obviously got the whole team together in Berlin for this workshop. And this is really important, to get the team together. And they worked together for about three weeks, so in working mode as well. They learned how each other worked, how each other talked. And there's also the element of the time you spend together outside of work, so with team events and things like that. So you're really getting to know these people that you need to trust within your team, essentially.
And the pilot project was successful, and it was now time to roll it out to the rest of the IT department.
So in 2011, a new tool was created, and it's called Barkeeper. And this did away with the pools of VMs that we had before, and it allowed us to allocate VMs directly via communication to the vSphere Cloud API. So this had a really nice web UI, and it was essentially a project self-service. You could create your own projects. You could create your own landscapes in there.
And at this point now, the scheduled deployment went to every two weeks, but we would frequently deploy two to three times a week due to bug fixes and things like that.
But it's also about disseminating this DevOps culture to the rest of the people in the department. So there were more projects now on Barkeeper, dozens of projects were using it, but it was about taking this Scrum and Agile coaching now to the rest of the department. And at this point, it became the norm for ops engineers to be embedded within the teams.
And the success was recognized throughout the company. And you'll remember I mentioned that half of the employees within the company are in technology-related roles, so the potential audience for a platform like this, a DevOps platform for the company, was huge.
So in 2013, a new project started called the Monsoon Project. And the idea was really to create this custom-developed private cloud and automation platform, an infrastructure-as-a-service layer, roughly equivalent to OpenStack, and an automation framework using Chef and MCollective.
Now, it was custom developed in-house, Ruby on Rails components, all running as microservices. And within this project, it was a two-week development cycle with continuous integration and testing, and at this point, daily automated deployments.
And as of late 2015, hundreds of internal and external applications within SAP were running on this Monsoon platform. There was a particularly large adoption around the cloud-native applications, so some of the acquisitions that we'd made, cloud companies such as SuccessFactors and Ariba. So thousands of developers now using this platform, 30,000 VMs, 60,000 storage volumes, et cetera.
But of course, we were careful not to forget the things that we had learned early on. And the team has grown somewhat. It's now 20, 25 people. You'll notice there's only half a QA effort involved now, using ChatOps via an internal IRC server.
And what's important, again, is every three to six months, we make the time to get together for maybe a week. Maybe we throw a workshop in there as well, but it's really about, again, as I said, that time outside of work, re-interfacing with the people and refreshing those connections that really get lost when you're working in a virtual team.
Is that the end? Well, not quite, because today we are developing a whole new platform. It is now based on OpenStack. It uses the core services of OpenStack, and it follows a whole new strategy which we're calling the Converged Cloud.
There are actually a couple of services which we're developing ourselves as well, which are, as I've listed here, for instance, an automation framework and also a user dashboard.
So this is now OpenStack running in containers on Kubernetes across 13 regions and 18-plus data centers. That number of data centers is changing all the time, more and more coming on board. And it allows us to go from bare metal to scaled OpenStack cluster in less than 60 minutes now.
Now, what's important is down here on the process. This team now actually does over 90% of its work on open source enhancements, so it's actually contributing back to the community. Less than 10% of the development is now done on our own stuff. And that's because either it's very SAP-centric, very SAP-specific, not really interesting to the community, or it's on stuff which we are planning to open source but haven't yet. So that number will come down.
And within this project now, we're doing continuous automated deployment of both infrastructure and code.
And again, we mustn't forget, it's about the people. The team has grown somewhat now, to maybe 30 or 40 people. Still only half a QA effort. And due to the sheer cognitive overload, we've had to split into three sub-teams. Now, we're careful not to create new silos, and we really try to keep these sharing sessions across the teams very frequent.
And of course, the journey continues. So the current challenges we have now are dealing with the sheer technology explosion and the knowledge explosion that goes with it.
There's this really cool periodic table of DevOps tools, which is published by XebiaLabs, who are also here exhibiting. And I don't know whether you guys have seen this, but I'd encourage you to take a look because the breadth of tools which you might potentially have to come into contact with is terribly scary. So yeah, definitely take a look at that.
But the biggest challenge really that we have now is a new journey which we're embarking on, and that's the one of adopting a whole new culture now, the culture of open source. But that really is another story for another time.
So in conclusion, the key takeaway that I have for you, and just in case you missed it, it really is about the people. Thank you for your attention.
Q&A
Questions?
Oh, well, hey. He sent us over here. Pick me.
Q: Yeah, pick me. Thank you. Mark Landy from Johnson & Johnson. I saw a lot of the products that you had there, but I didn't see the core: ECC, Suite on HANA, those elements. Are you going to get to there, too, to get those on board?
A: Yeah. With the new platform, we will.
Q: Good.
A: So part of the reason to go with the new platform is to be able to support some of those tools as well. It would've been more of a challenge with our in-house-developed platform, I think.
Q: Oh, yeah. No, I... so congratulations.
A: Thanks.
Q: Terrific.
A: Any? Yep.
Q: I'd like to expand on the previous question a bit. For us trying to do DevOps in that environment, SAP is one of those big monolithic applications that it's really hard to do DevOps around. A lot of times it even requires physical hardware because of the memory and CPU requirements. So what's the timeline to be able to get the large ERP systems to more of the DevOps approach, where we can use some of the tools we've been talking about at this conference?
A: What's the length of a piece of string?
It's difficult. That point is very difficult, I think. As you say, with the old ERP systems, unless they're one of the systems which are moving in some way to microservices or reinventing themselves a little bit, I don't know, to be honest with you.
I don't want to give you a timeline, and it's definitely going to be a challenge, absolutely. Yeah. This really lends itself to all of the cloud-native applications very much so. They work pretty much out of the box on this. But yeah, with things like that, it's... Yeah. Sorry, I can't give you an answer on that one.
Q: My question is, when a team has a different product or application that it needs to put out, when is the infrastructure sizing decision made, and how is it made in terms of how many servers I need to spin up for my new product?
A: You mean from the project side or in terms of how much can they request from the private cloud now?
Q: From the project side. So if I have a new product that I'm rolling out, how do I size or scale my infrastructure footprint for that?
A: Well, I don't think that's changed. You still have to do an initial architecture and decide on that. But the beauty of using this cloud now is you can start small, and then you just expand.
So all of this stuff is billed by the hour, I believe, and by some sort of resource allocation, I think. So you can really start small, and then as you decide you need to elastically scale something, or even just need to scale something whichever way because you need more resources, then it's just really about the money, I think, at that point.
Anyone else?
Thank you very much.
Thank you, guys.