Learnings from a DevOps Organization in the Making
GE Software is building a platform for the Industrial Internet. Over the last few months we have transformed our services into a cloud-only offering. To achieve this we have had to ramp up our DevOps practices at an unprecedented rate, and we continue to do so as we move forward. Since we are building a platform for applications that monitor industrial assets such as wind turbines and jet engines, security and compliance are extremely critical. While these requirements force us to keep a clear delineation between operations and development, we continue to push the boundaries and blur the lines between dev and ops. This talk covers the organization model and the development and operations practices that we have adopted:
Top Challenges & insights:
-Access control to production instances & policies on how to manage them
-How to prevent developers from throwing code over the wall?
-How to take real-time feedback from deployed services to improve development practices?
Learnings:
-True collaboration is driven by having a critical mass of people across the teams who share some core beliefs.
-Talk of pager duty is generally not a motivator for developers; sometimes it can turn them off from thinking about operations.
-Giving developers more control over the outcome of the service they have built is a much better way of driving true collaboration.
Full transcript
The complete talk, organized by section.
Vineet Banga
Thank you. Just want to take an opportunity to thank the folks at DevOps Enterprise Summit for giving us the opportunity to share our thoughts and experiences with you.
I'm Vineet Banga. I'm a development manager at GE Software, GE Digital actually now. And my partner in crime today is Jake Johnson. He's director of cloud operations.
Here's our plan. We want to talk about what GE is doing in software, just to set the context, and where did we start. We've been doing this for about two or three years, so I just want to take you back a year or so ago as to where we were and the changes that we've made over the last year or so. The lessons that we've learned, and then top takeaways and challenges, and hopefully, if time permits, we'll try to do a demo.
What is GE building? At GE, we are building the Predix platform, which is a hosted platform-as-a-service offering, which allows industrial assets to connect to the cloud and amongst each other, so we can produce business insights to reduce unplanned downtime. And the goal is to build a platform which streamlines development and deployment of applications in the industrial IoT space.
Specifically, I spend most of my time working on the middle layer, and more specifically on the Predix services layer, which are the boxes on the top. So I'm from the development organization, and I basically build services and manage them once they're deployed.
Just want to take you back a year or so ago. At GE, we've always been an agile shop. The Predix platform team has always been agile, and we've used all the buzzwords that you can think of. We've been agile. We've been using Lean methodology. We've been following the best practices in terms of CI/CD. We've been using test-driven development. We've automated most of our testing.
And just to specifically point out one project that we were working on, we were building a service last year. So we put together a Scrum team of three developers and two folks from the operations team to have this complete DevOps team, to make sure that this team has complete responsibility and ownership of the service that they were building.
And it went pretty well. We actually delivered the product and the service in two and a half months. It was a fairly successful project. But when we did a retrospective of that project, what we saw was that here was a team which was working together. They were really collaborating well. The devs got some insight into how operations works, and operations was aware of how this service was built from day one.
But there was still some disconnect. You could see that in the beginning part of the project, the operations team was not as busy as development was ramping up and understanding the requirements and figuring out how to build this thing. And then towards the end of the project, the dev team actually started to slow down once their stuff was complete. Their part of the pipeline was green. The artifact was pushed to Artifactory and things like that. And that's when the operations team was really busy, making sure their Chef recipes and their monitoring for that service was set up and everything was working properly.
So there was a disconnect. It wasn't flowing like a seamless pipeline. And although the team was responsible for the whole thing, it was more a mindset of developer is responsible for any issues with the code itself, and operations team is responsible for actually deploying it and making sure there are no issues in deployment of it.
So we were trying to brainstorm internally as to how we can make this a little bit better, and so that's the transformation that we'd made over the last year or so. And we want to share those experiences by means of asking a few questions.
The first question that I want to bring up is, is DevOps equal to Dev plus Ops? Is that all there is to DevOps? Are we just putting developers and operators together in one team? And the fact that operations resembles software development a lot these days, is that the whole crux of it?
And I think you would agree that that's all means to an end. Our goal eventually is to make sure that we can get code from a developer's laptop to production in a seamless, fast, agile, and deterministic fashion. So that's the end goal.
And we were trying to figure out what's the best way to get this thing going even faster. And what we've learned at GE, and what our experience has shown us, is that we can make another jump, another exponential jump in productivity and in speed by actually adopting a platform. By having a platform in the middle, which gives the right abstractions to the developers so that they can build their service, and without really understanding the details of infrastructure in terms of VMs and networking, they can reliably push their application into production and have confidence that this thing works because the platform that is exposed by the operations team implements a certain contract. And as long as that contract is true, they can rely on the pipeline that they have built to push their application.
So the main lesson that we learned was that putting a platform in the middle actually accelerates your DevOps practice in the organization. It gives the right abstractions for the developers to focus on what they are good at, at building that application and the code. And having the right abstractions, they don't have to understand the VMs and the infrastructure and all that. And operations can focus on building this platform and enhancing it.
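Editorial illustration: the idea of a contract between the platform and the developer's pipeline can be sketched in a few lines of Python. This is a minimal sketch and not the Predix or Cloud Foundry API; the `Platform` protocol, the `InMemoryPlatform` class, the `deploy` function, and the app and artifact names are all hypothetical.

```python
from typing import Dict, List, Protocol


class Platform(Protocol):
    """Hypothetical contract the operations team exposes to developers."""

    def push(self, app: str, artifact: str) -> None: ...

    def map_route(self, app: str, hostname: str) -> None: ...


class InMemoryPlatform:
    """Toy implementation; a real one would manage VMs and networking."""

    def __init__(self) -> None:
        self.apps: Dict[str, str] = {}          # app name -> artifact
        self.routes: Dict[str, List[str]] = {}  # hostname -> app names

    def push(self, app: str, artifact: str) -> None:
        self.apps[app] = artifact

    def map_route(self, app: str, hostname: str) -> None:
        if app not in self.apps:
            raise KeyError(f"unknown app: {app}")
        self.routes.setdefault(hostname, []).append(app)


def deploy(platform: Platform, app: str, artifact: str, hostname: str) -> None:
    """Developer-side pipeline step that depends only on the contract."""
    platform.push(app, artifact)
    platform.map_route(app, hostname)
```

Because `deploy` only calls the two methods named in the contract, the operations side could swap the implementation underneath without the pipeline changing, which is the kind of independence the speakers describe.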
Another benefit that we found was that in our earlier project, when we built our pipeline for our application, like I said, there was a disconnect. You had a development team which really understood the first half of the pipeline, which was going through the tests and the integration test and through the deployment to staging and putting it in Artifactory, that kind of thing. And the operations team understood the rest of the pipeline. So there was a disconnected pipeline.
With a platform in the mix, the development team could actually own a pipeline which took the code from code commit, check-in test, to integration test, to a staging environment, to a deployment environment, to a production environment. So they owned the full lifecycle of that service.
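Editorial illustration: a developer-owned pipeline of this kind can be pictured as an ordered list of stages that halts at the first failure. The stage names follow the talk; the `run_pipeline` function and the stand-in stage callables are hypothetical and are not GE's actual tooling.

```python
from typing import Callable, List, Tuple

# Stage names follow the talk; the callables are hypothetical stand-ins
# for real test and deployment steps.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and return the names of the stages that passed.

    Execution stops at the first failing stage, so later environments
    never receive a build that failed an earlier step.
    """
    passed: List[str] = []
    for name, step in stages:
        if not step():
            break
        passed.append(name)
    return passed


def always_ok() -> bool:
    return True


def always_fail() -> bool:
    return False


PIPELINE: List[Stage] = [
    ("check-in test", always_ok),
    ("integration test", always_ok),
    ("staging", always_ok),
    ("production", always_ok),
]

if __name__ == "__main__":
    print(run_pipeline(PIPELINE))
```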
The next thing I want to discuss is, once we adopted this platform, and this is where as a part of development team, I find myself guilty as well, that once development takes over this responsibility of managing the service or application end-to-end, from when they start coding to an integration test all the way to production, one big thing that we missed out, and we are improving on this, but we missed out in the early part of adopting this approach, was the monitoring aspect.
So this is one aspect that does not come naturally to developers, and it's really an important aspect of DevOps. In our platform, we've made sure that there is an ability to integrate your application and your services with monitoring tools and with a log analysis tool. And we've made sure that any time anybody's pushing an application or service to staging or to production, their service or application has to be bound to the right monitoring service as well as the right log analysis service.
This is a learning for the development side of the organization, that monitoring has to follow the CI/CD. It has to be part of each deployment, and it's not just an ops function. Monitoring tools actually allow the developers to take feedback from how their service or application is being used, and it helps them evolve that application. So it's not an ops function. It's not a support function. Developers have to take responsibility for making sure their application is bound to monitoring tools and then continuously monitor it to understand the usage of their service or app and improve on it as they move forward.
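Editorial illustration: the rule that every pushed app must be bound to a monitoring service and a log analysis service can be expressed as a simple pre-push check. This is a sketch under assumptions; the talk does not say how the platform enforces the rule, and the kind labels, `missing_bindings`, and `can_push` are hypothetical names.

```python
from typing import Dict, List, Set

# Hypothetical service categories; the talk names the two kinds of service
# but not how the platform labels them.
REQUIRED_KINDS: Set[str] = {"monitoring", "log-analysis"}


def missing_bindings(bound_services: Dict[str, str]) -> List[str]:
    """Return the required service kinds that the app is not bound to.

    `bound_services` maps a service instance name to its kind.
    """
    bound_kinds = set(bound_services.values())
    return sorted(REQUIRED_KINDS - bound_kinds)


def can_push(bound_services: Dict[str, str]) -> bool:
    """Allow a push only when every required kind of service is bound."""
    return not missing_bindings(bound_services)
```

A pipeline step could call `can_push` before deploying to staging or production and fail the build with the list from `missing_bindings` when something is absent.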
With that, I'm going to hand it over to Jake, who's going to cover the rest of the presentation. Thank you.
Jake Johnson
Hi, everyone. My name's Jake Johnson, and I'm the cloud operations director for GE. I work with Vineet over at GE Software, and specifically, I run the platform that runs Predix Cloud.
Quick show of hands: how many people would characterize themselves as an ops in this room? And how many devs do we have here?
Not too bad. Pretty even mix.
Yeah.
I thought you'd be outnumbered.
I thought so, too.
I guess not.
First let me talk a little bit about the platform. We're building Predix on top of a PaaS called Cloud Foundry. That's sort of the platform that we selected, and a lot of reasons behind it, and I'm not going to share all of them, but I think there's a couple of key things that we get out of using Cloud Foundry as a platform.
So let me advance my slide.
I think you want to go down.
Thanks.
There you go.
There's a lot of reasons that we chose Cloud Foundry, but specifically there's one item that Vineet touched upon, and that's that it abstracts away the programmable interface from the infrastructure itself. I think it's super important. It enables a lot of things, but most importantly, it lets the development teams work with higher-level concepts that they're used to, like applications and services and things of that nature.
And they don't have to think about all the things they used to have to think about when they came to the ops team before. Like, "Hey, I need a server with this much memory and this much CPU." That's my team's job. We run the platform, and so it's up to us to make sure that not only is the platform up and running, but do we have the right match of compute resource to make sure that the capacity that we have underneath is meeting the demand being put on us by the application teams and their apps.
So I think it's great. It allows us to work independently, and of course there's a lot of other tools and services that we've brought in along the way, like CI/CD tools and monitoring and log analysis and things like that. But fundamentally, I think the platform is what allows us to work independently, effectively because it creates that division between the teams.
Lesson three for us was platform is key, says the platform guy. It sounds a little self-serving, but allow me to convince you why that's the truth.
One is a well-defined interface, like I said before. I think you can approach DevOps and you can approach automation with a lot of tools. I've seen people approach it in a bad way in the past where it's like, "Let's just automate the whole thing end to end. We'll write a big deploy script." I don't care what tool you use: Chef, Ansible, Puppet, BOSH, whatever. You can automate the thing and it can still be a nightmare, usually because you haven't thought about what that contract is between you and the development team.
That contract's an important part of it, and if you don't think about it up front, it doesn't matter what tool you use. You'll end up creating this really awkward collaboration point inside of your DevOps automation scripts. So I think that strict interface that Cloud Foundry provides us is key. You don't need Cloud Foundry to do it, but I think having that contract is a really important step. It's one that we undertook early on.
I think the second big benefit we get out of the platform is that there's a standardization that comes with it. And by standardization I mean, of course, everybody would love to just bring every message queue under the sun and offer that as a service, but one's probably good for most use cases. Same thing goes for relational databases. I don't want to have a zillion. MySQL, Postgres are probably enough for right now.
So I think allowing us to sort of standardize on a platform and a set of services that are offered through that platform creates a lot of operational efficiency on my team specifically because we don't have to operate that many different things.
The next question that we kind of considered when we started off in this DevOps journey was should we change our organization to align to this new way of thinking? So the answer is no, we didn't, and I don't think you need to.
I think people approach DevOps and think automatically that they need to create a DevOps team to go do the DevOps. And I think that's not necessarily the right way to think about it. DevOps is more a way of approaching the problem. It's creating a culture that allows dev teams and operations teams to work more effectively with one another. And I think that your org structure really becomes irrelevant. It's more important that you set up the right tools and processes to facilitate that.
At the end of the day, I think, frankly, it doesn't really matter who reports to who when you're talking about creating a DevOps culture. I think simple things like workstation proximity are a much better way to achieve that. Making sure your ops team is co-located and your dev team is co-located and that they know how to find one another so they can work together is, I think, what you need to do. And at least in our case, this DevOps transformation didn't come along with a big reorganization within GE.
This is a fun one: who's in control? So I think of this as the big red button question. Who is allowed to push the deploy button? Or who is allowed to log into prod and fix stuff, right? I think that's the question we have here, and the answer is not Vineet. No, I'm kidding.
Never.
The reality is the machines are what's in control. So the approach that we take is, first of all, automate everything. Create a solid CI/CD pipeline. Ensure that it isn't humans that have to get into the system to do those changes. There shouldn't be a human pushing the big red button.
Now, the reality is GE works in a lot of pretty heavily regulated industries, and nobody's going to be comfortable with the concept of a software developer committing something to GitHub and then that getting deployed to an MRI machine right away. That just won't happen. So there's always going to be this need for human beings using their judgment, looking at the process, and sort of serving as a gate as you move your software from dev to stage to prod.
But I don't think that means you have to toss away all the benefits of DevOps. Sure, we do very simple things like setting up approval steps, ensuring that the right people are on a call with incident management. So it gets interesting. It's not perfect, but the reality is what we've found is that we can achieve a pretty successful DevOps organization, even working in a highly regulated environment. It's just a matter of understanding everybody's role, having clarity on who does what, and making sure that either you build automation and tools around those processes or making sure that the right people are engaged when you're dealing with an incident.
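Editorial illustration: an approval step between environments can be modeled as a promotion that fails unless a sign-off has been recorded. This is a minimal sketch, not GE's process; the `Release` class, the `NEEDS_APPROVAL` policy, and the approver name in the usage note are hypothetical. The environment names follow the "dev to stage to prod" wording in the talk.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

ENVIRONMENTS: List[str] = ["dev", "stage", "prod"]

# Hypothetical policy: which target environments need a human sign-off.
NEEDS_APPROVAL: Set[str] = {"prod"}


@dataclass
class Release:
    version: str
    environment: str = "dev"
    approvals: Dict[str, str] = field(default_factory=dict)  # target env -> approver

    def approve(self, target: str, approver: str) -> None:
        """Record that a person has signed off on promotion to `target`."""
        self.approvals[target] = approver

    def promote(self) -> str:
        """Move to the next environment, enforcing the approval gate."""
        index = ENVIRONMENTS.index(self.environment)
        if index == len(ENVIRONMENTS) - 1:
            raise ValueError("already in the last environment")
        target = ENVIRONMENTS[index + 1]
        if target in NEEDS_APPROVAL and target not in self.approvals:
            raise PermissionError(f"promotion to {target} needs approval")
        self.environment = target
        return target
```

In this sketch, `Release("2.0").promote()` moves the release to `stage` automatically, while a second `promote()` raises `PermissionError` until `approve("prod", "release-manager")` has been called. The automation still carries the change; the human only supplies the judgment at the gate.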
I'm going to try to get through this quickly so that we can get to the demo and then also have time for Q&A.
Top takeaways for us. First of all, again from the platform guy, a platform can accelerate your DevOps journey. It provides great tools to ensure that teams work independently, successfully. Continuous monitoring is huge, making sure that your services are built to have that monitoring capability in place day one, so that you can get good visibility to how they operate.
Giving teams control over their outcomes, so making sure that teams feel empowered to control their own destiny and that they aren't being blocked by multiple levels of red tape, and then, like I said, co-locating them so they can work well together.
Challenges that we faced. Our CI/CD tooling is probably a weak spot. You need to make sure it integrates well with your platform. That's an area where we're getting better. Platform maturity is just something that we're working through over time, and as you build a platform, there's just sort of a natural curve to it.
Building security and compliance in the pipelines: this is a specific challenge for GE. We want to have pipelines. We want to have automation, but it's important to have the right checks and balances in place as well.
And then the last two points are just finding people who can bridge the divide. So whether it's a dev that thinks like an ops or an ops that thinks like a dev, I think it's important for those teams to have empathy for one another and understand the other side of the fence really well, especially having done that job perhaps, or just at least being able to put themselves in one another's shoes.
So that's the takeaways and challenges. I'm going to turn it over to Vineet to run us through a demo, and I think we'll do Q&A after that.
Vineet Banga
Sure. All right, so everything crossed.
I didn't want to see that. And just going to make these things a bit bigger.
So what I wanted to show you in the demo is, as a developer, what I've done is I have a simple app which exposes an endpoint as a service, and it just says, "Greetings from Predix." So it's just one line of code.
And so if you go hit this endpoint, this is where it's deployed. We call it DOES demo. And if I just hit this endpoint, it returns, "Greetings from Predix 1.0."
What I want to show is that as a developer, let's say I get an issue, I need to update my service or my app. What would it take for me to do this without help from operations? How can I rely on this platform to do this? So we've talked a lot about abstraction, and it's easier for developers to do this by themselves. So that's what I just want to give you a flavor of what it feels like.
This is my 1.0 application, and then I change this code to, say, 2.0. And what I've done so far is that I have deployed this in my staging area. So if I just do `cf a`, which is CF applications, I will see that I have two applications. If I can find my mouse cursor here, I'll just highlight this. So there is a DOES demo running here, and then there's the DOES demo two. So that's my second version of my app.
And just to show you that I'm not lying, if I go to the...
Live.
Yeah. Have to show that, right? If I go to DOES demo two, it'll say, "Greetings from Predix 2.0."
Another check that we can do is, I have written a script which hits this endpoint continuously, like 10 times, just to see that that's where it's going. So right now it's hitting on this endpoint called DOES demo. We're just going to assume for this demo that this is the endpoint that I'm sharing with my customers to hit.
So the DOES demo two is my, let's say, in my Blue/Green deployment, is my blue deployment, that I just want to test out whether this 2.0 version of my service is working or not. So that has a URL too, which is DOES demo two, but that's not what's exposed to my customers.
Right now you can see my current version is exposed at this endpoint. My number two version is exposed at DOES demo two endpoint. And just to prove it, we're going to run through the script, which hits this endpoint in a continuous fashion. So you'll see that it's returning, "Greetings from Predix 1.0."
So now what I want to do is, I want to expose my second version of the service at the same endpoint. So what I'm going to say is `cf map-route`. So this is the CLI which helps us talk to our platform, let me do things like create routes. I'm not dealing with networking and DNS and all that, but a simple command like this.
So `cf map-route does-demo-2`. I'm going to take my domain, which is this guy, and I'm going to say that, "Route this application to this particular host name, DOES demo," which is the one which is externally visible to customers.
So I'm going to run that command, and it says, "Creating the route," and it says, "Already exists." So I'm going to `cf a`, and I'll see that now, again, if I can find my mouse cursor here, that I have the two apps that I had. My number two app is also pointed to this particular URL that is exposed externally. And if I hit this endpoint in the browser, well, not this one, the original one which I've shared with my customers, I'm going to see 1.0 here because keep in mind, there are two apps which are sitting behind this one URL. So which one is it going to hit? It's round-robin load balanced.
So let's try the script that I had to see. That's going to hit it at the same time concurrently. So we might see some load balancing. So you can see that, I hope it's big enough. I hope you can see that at some point it actually hit 1.0 version, at some point it hit 2.0 version of the application.
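Editorial illustration: the behavior seen here, where repeated requests to one URL alternate between two app versions, can be reproduced with a small simulation. This sketch does not contact any real endpoint; `make_round_robin_route` and `tally` are hypothetical helpers, and the strict alternation is a simplification of the round-robin behavior the speaker describes.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Dict, List


def make_round_robin_route(responses: List[str]) -> Callable[[], str]:
    """Return a callable that answers with each backend's response in turn."""
    backends = cycle(responses)
    return lambda: next(backends)


def tally(fetch: Callable[[], str], requests: int = 10) -> Dict[str, int]:
    """Call `fetch` repeatedly and count how often each response appears."""
    return dict(Counter(fetch() for _ in range(requests)))


if __name__ == "__main__":
    route = make_round_robin_route(
        ["Greetings from Predix 1.0", "Greetings from Predix 2.0"]
    )
    print(tally(route))
```

With two backends and ten requests the simulated counts split evenly. Against a live route under concurrent load the split would only be approximately even, which matches the mixed 1.0 and 2.0 output in the demo.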
So now I have a situation where I have both versions of my service running available at the same time. And let's say now I'm pretty confident that the second version looks good enough, so I can tear down my first version of the service.
So all I need to do now is I'm going to do `cf`. Let's make this up here. `cf unmap-route does-demo`. And I'm going to put that domain name, and I'm going to say that, "Unmap this application from this particular host name."
So it says, "Okay, I've removed it." Let me just check that. So I'll do `cf a` again. That's my application list. So now you'll see that the application that I had originally, the 1.0 version of my service, does not have any routes mapped to it anymore.
And if I hit this endpoint now, which said 1.0, now it's going to say 2.0. And if I ran my script again, now I'm going to consistently see 2.0.
So all I've done is simple CLI commands, `cf map-route`, `unmap-route`, and I'm able to do a Blue/Green deployment in a matter of minutes as part of a demo. So this is what we talk about when we are saying, as developers, we have the right level of abstraction to work with the infrastructure.
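Editorial illustration: the two-step switch from the demo can be written down as the `cf` commands it corresponds to. The sketch below only builds the command lines and does not run them. The app names follow the commands quoted in the talk; the domain `example.com` and the hostname `does-demo` are placeholders, since the transcript does not give the real values. The `--hostname` option is the cf CLI's flag for choosing the host part of a route.

```python
from typing import List


def blue_green_commands(old_app: str, new_app: str,
                        domain: str, hostname: str) -> List[List[str]]:
    """Build the cf CLI calls for the switch, in the order used in the demo."""
    return [
        # Put the new version behind the customer-facing route as well.
        ["cf", "map-route", new_app, domain, "--hostname", hostname],
        # Once the new version looks good, take the old version off the route.
        ["cf", "unmap-route", old_app, domain, "--hostname", hostname],
    ]


if __name__ == "__main__":
    # "example.com" and the hostname are placeholders, not values from the talk.
    for command in blue_green_commands("does-demo", "does-demo-2",
                                       "example.com", "does-demo"):
        print(" ".join(command))
```

Between the two commands both versions serve traffic on the same hostname, which is the window in which the new version can be checked before the old one is taken off the route.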
I don't need to be a networking expert, or a virtualization expert, or VM expert to understand what's happening. I'm relying on the contract that the containers provide, that my service will run the way I coded it on my machine. And then I'm relying on these commands, which allow me to use the rest of the infrastructure to make sure I can do things like Blue/Green deployment in an easy fashion.
And if I were to do this using Chef and all these other tools, I would be running behind Jake and his team. But this is something I can do by myself, and that's what allows the development team to run in a faster manner and own that end-to-end pipeline that we referred to earlier in the presentation.
So that's the demo that I wanted to show you. And I think we have a couple of minutes left. If there are any questions and answers, we'd be happy to take them.
Q&A
Any questions? Questions?
Okay. There's one in the back.
Whoa. That would've been great if I'd have tripped.
You're welcome.
Q: So I had a question about the platform. You spoke a lot about the platform allowing developers and operation folks to kind of do their own thing, and it's kind of a common playground for those guys, right? And so my question is, do you feel like that needed to happen because typically there's a sandbox there, that they don't like to play in each other's backyards? So is that a way to remove the barrier there that's between the dev team and the ops team?
A: Yeah, it's interesting. It sometimes seems counterintuitive because people hear DevOps and they think, "Let's bring the devs and the ops together," right? And so what we're talking about here is building a platform that in some ways creates a boundary between the teams.
I don't think it's so much about creating a sandbox because we have production. For me, this is a production environment that I use, and it's a service that we provide to the dev teams. I think it's more about having the right level of abstraction for them to work in, and a simple, very well-defined contract with that team, and that allows us to change at different pace. So they can work in their own SDLC, in their own cycle, with zero dependency on my team. My team can roll deployments underneath without affecting them.
And so I think about it as a dependency management problem, similar to what you would find in software. It's like having a simple interface.
A: And just to add to that, what I would add is, as the developers, when we were trying to run that model last year with a single team, which had the service coded in Java and then the deployment was actually done through Chef recipes, on the development side, we had very limited understanding of how the Chef was working, and how we were setting up the routes, and what VMs were allocated, and how we were scaling that.
It was hard to find developers who could cross that boundary and have a full depth of understanding of development, as well as of the operational details of the things. So, yeah.
So we have one more question. The networking hall is open. Please go down and support the sponsors that make this happen for us. And another question for you guys.
Q: So on the operations side, do you still have the traditional configuration management processes where you have to initiate and field change requests, et cetera?
A: Yeah, absolutely. We have infrastructure. We have a CMDB. We have to attest to those sorts of change control processes in order for us to achieve the certifications that are necessary for our customers' peace of mind, right? So yeah, those things happen now. Thankfully, Vineet does it. Dev teams don't need to worry about all that. That sort of happens underneath this abstraction layer that we're describing.
Q: So when you go to a production deployment, an update for an application, how does that happen between the two groups that you've described?
A: So the change control process has to be well-defined, but I think the most important thing is to have the right separation of responsibilities. At the end of the day, the pipeline will carry the change so far, and then some other team who has responsibility for carrying it to the next step, steps in. So, I don't know, does that answer your question?
Q: Yes.
A: Yeah.
Anybody else? Any other questions? Yeah. There's one in the middle here.
Q: So my other question was, this DevOps model, does it work across distributed teams? You talk a lot about co-location of the teams. Do you think that that works across distributed teams or no?
A: I'd say we haven't experimented with it much. We do have a few examples, and most of us are located here in the Bay Area. There's pockets of people that are working in other sites, but I don't think we have a ton of experience really to draw upon. I suspect it can, but I don't have any facts to back that up.
Was there any other? There was a gentleman in the middle of the room.
I'm sure I'll ask a question.
Now that I put him on the spot.
Q: You did put me on the spot. Thanks for that. Now I feel awkward. I hope it's a good question.
I know that GE used to work with Six Sigma, right? So big picture from the business point of view, you guys are part of the whole process that sits here, and DevOps kind of fits into it, process leading to value. Has that helped you guys out get where you guys are today, where you guys have support from the executives and stuff like that? Because some of the stuff, we all know, you got to get buy-in. Were you guys originally there when this started, or are you guys sort of coming out afterwards, and what kind of support do you get from the executives?
A: So let me repeat the question back and make sure I understand it correctly. So is it a question about how does DevOps fit into a company like GE with this history of Six Sigma?
Q: Yes.
A: Okay. I think San Ramon's a little different. I don't know what your take is.
A: Yeah. I don't know. I think from an executive standpoint and the management standpoint, they've been evolving since we've been doing more and more of software. They've actually adopted the practice of FastWorks, which is more of an agile-based process methodology, and that's what we are mainly running with at GE Digital in San Ramon.
We're still influenced by the Six Sigma practice. We have plenty of black belts at GE in San Ramon. And believe it or not, they actually fully endorse the agile and the FastWorks methodology, and this whole experiment and transformation that we underwent was with their blessing, and I think we have pretty good backing from our execs and management.
All right. Anyone else?
Q: I have a question. Could you talk a little bit more about the CI/CD tools and monitoring tools that you guys use?
A: Sure. So we are using Jenkins for our CI/CD pipelines. For our log analysis, we are using Logstash and Kibana as a UI. And for monitoring, we're using New Relic.
A: I would add to that, we use Concourse in some places for CI/CD of the platform itself, and we have some infrastructure monitoring using a tool called Sensu. So those are there, too.
Q: A tool called what?
A: Sensu.
Q: Okay. But each of these tools more or less covers a particular area. For instance, Jenkins is usually used for CI, for build and automated testing. Monitoring is New Relic. To put everything together, is that something you do through in-house-built tools, or manually, by just defining the process, or what?
A: So we have tried to tie these things together in the platform that we refer to a lot. And the idea is that just like I was able to do a Blue/Green deployment with the platform here, I can do things like create a new pipeline in Jenkins, and we have some templates which are preconfigured, and you can get started off pretty quickly.
So we've tried to integrate these things and bring them under a common umbrella so that all our application and services get the same consistent model that we would like them to have.
Thanks. That was a great question.
All right, anybody else?
Okay, then let's go drink.
All right.
All right.
Perfect. Thank you.
Thank you.