Service Ownership: DevOps for Salesforce

San Francisco 2017

Arun Singh

Director Customer Advocacy · salesforce.com

Ray Finn

Director Site Reliability · salesforce.com

This presentation details our migration away from handing work over to our NOC and site reliability team, toward infrastructure as a service and service ownership for our development teams. The story is told from both the engineering and the business side, explaining why this was the right model for the company going forward.

Salesforce model and service ownership: Why we as a company decided to adopt Service Ownership, and how this cultural shift impacts our customers.

Acquisitions and Service Ownership: What does becoming Salesforce mean, and how does the Service Ownership model help with that?

Full transcript

Arun Singh

I'm Arun Singh, and I'll kick off the presentation.

Before we do that, thank you, each one of you, for coming to this session. We'll be sharing both high level, and then Ray will be going deeper into some of the concepts and technical deep dives.

Before I get started, I'll give Ray the opportunity to introduce himself, who he is, and what he does.

Ray Finn

Oh, yeah. My name is Ray Finn. I run Global Services Management. I know it says Director of Site Reliability outside, but I think that was an old title that still stuck with me. I can never get rid of it.

What we do is we manage the ITSM systems within Salesforce, which is incident, change, problem, and also service ownership, which is what we'll get in-depth there. I've been in Salesforce now for seven years. It's the longest I've ever been at any company, so it must be good.

Rumor has it we're hiring as well. There's careers at salesforce.com if people are interested.

Arun?

Arun Singh

All right. Like I said, I'm Arun Singh, and I'm not part of ITSM or site reliability, but I'm part of the infrastructure team, and I talk to a lot of our customers. So the part that I do is I bring the voice of the customer. Because Salesforce is running a service, and at the end of the service, there's a customer. So anytime we have an incident or we have some kind of a performance degradation, it affects a customer.

To roll all of that, funnel it back into strategies that Ray's team is developing to make a better customer experience, I help out in that. Hopefully some of you, while you're doing your day-to-day job, you have this team that you talk to which really brings the business aspects of things.

What I'm going to do in the next few minutes is really go at a high level: why are we doing what we are doing, and what's the business aspect of it? And then quickly hand it over to Ray to talk a little bit more on the technical side.

One of the key things for service ownership, and how we started thinking about it, is the complete proliferation of data that's happening in the industry today. And we are seeing it not just at Salesforce. I'm pretty sure you're seeing it in your business. Even in the technical roles, you're seeing that the data is growing, and it's over decades and technology, right?

There was a time when we were on mainframes, and data was just this much, because there were just 100K machines talking to each other, or some people owned those machines, and it was very, very cordoned off. And then we went into client-server and things changed. A lot more data is being projected that way.

And coming into how Salesforce changed certain things, which is called the cloud, now it's bring your own device. There are mobile devices, there are IoT devices, all of this. Now devices talking to each other, which just brings more and more data. So data growth is something that we are seeing in the industry, we are seeing at Salesforce.

But from our lens, if I may just step down a little bit from the industry lens to our lens, one of the big motivations for service ownership is the amount or the velocity of growth that we have had. And the velocity of growth that we have had is not just from a data perspective and transactions that we are doing. So we do about 1.1 trillion transactions per year on our platform, just the core side of things, that is Sales Cloud, Service Cloud.

But over time, to make sure that our business objectives are met, we have acquired a lot of new technologies. We have acquired a lot of new companies to cater to the business needs of our customers.

For example, now we have Commerce Cloud, which was an acquisition that we did of a company that was running not on Salesforce data centers, but on a third-party data center. They were having data sets or data objects that were very different from what traditionally Salesforce had been doing.

The other one is Marketing Cloud, which is the ExactTarget acquisition that we did. And ExactTarget runs on a completely different software stack, completely different hardware stack than traditionally what Salesforce was running on.

So as we are doing this from a business perspective and bringing all of these businesses in, it poses a challenge for teams like Ray's and others, where they go, "Okay, now this person is running their own service, this person is running their own service. What is going on?" Right? So it becomes very difficult to look at the disparate amount of data, the vast amount of data, and the different technologies that are in play, and different teams having their own different methods of looking at it.

Based on that growth, we started to look at things. If you look at Salesforce 10 years back, the customer experience, the customer service that was there, to what Salesforce is today in terms of the customer experience and service, is a lot different, right?

It's no shame to say that in our older years or in our younger years, we had a lot of things that were done manually, and that was not the best way, not the most optimal way of doing things, because we know humans make errors, and that causes problems. And when humans make errors, one of the other philosophies I'm pretty sure you have seen is that we do finger-pointing, right? We're like, "Well, my service was okay. It was Ray's thing that completely broke, and that probably broke the whole system."

All that is okay, and things will get sorted out, but from a business objective, from a customer experience perspective, the customer does not really... Our customers have become much, much smarter. They're much more demanding. They don't really care what you're doing behind the scenes. They want the thing to work, right?

So our vision from the microcosm of service ownership is to have software-defined everything. From manual operations, from an incident perspective, who sets the goal till the end? Who handles exceptions? When you're in the manual side of things, everything is human. Life is not good. People like me have to go out and talk to a lot of our customers and explain, "Why did this break? What happened? Why won't this happen again?" And that really hits our number one value, which is trust.

So coming to the automated operations, we are actually between column two and column three today, and Ray will explain a little bit more of that. But having something which has an autonomous framework, which can have the right guardrails. The setting of the goal and the handling of the exception when an incident happens, that should be human.

So that site reliability piece that Ray mentioned, when an incident happens, we have the right people on the call, we have the right people taking a look at it.

I'll leave you with that thought. It's a high-level objective. I'll be here for Q&A, and I'll be here for probably the cocktail hour as well, so feel free to grab and talk a little bit more. But let's dive deeper into service ownership at Salesforce, and Ray will talk about that.

Ray Finn

So the question we're always asked is, how did you get started? And like most engineers, we went off and we talked to... As Arun alluded to, we've purchased all these new products, new technology, and we were following the hand-the-work-over-to-site-reliability model at some point. You know what I mean? So we're saying, "This is not working for us going forward, so what are we going to do?"

So we went out, interviewed our client companies out there, saw how they handle it. We've all read the white papers and stuff that they have out there on how they handle it. Then we thought, "Okay." We came to these kinds of conferences. Last year we were at DOES 2016, and we said to ourselves, myself and two others, "We've figured all this out now. We're going to go back in, and we're going to write it out for the whole of Salesforce. Like for acquisitions, for stuff."

So I went off and I created this, and what I thought was all-inclusive. This is service ownership. This is DevOps for Salesforce. It's all sorted.

But then I realized very quickly, the moment I go off and tell everybody what to do is the moment that nobody will do it. You know what I mean? So I went back to them. I said, "Hey, guys, why is nobody interested in this?" And they're going, "Why should we? You just made it all up. You just wrote it out for everybody. You decided what it was, never involved us, so we've no reason to take part."

So I go, "Okay, maybe I made a slight mistake there." And I went back to them and said, "So what do you want to do?" And they go... They just wanted to be involved. They wanted to be, "If I own part of it, then I should be designing what's right for my part."

So we said, "Okay, then. We'll go back and we'll create it out."

That's not changing for me now.

So the first thing we discovered is nobody understood what we meant when we said service ownership, and nobody understood what a service is. And we thought to ourselves, "How could you not know what a service is if you're running it?"

So this is our definition, and I'm sure you all can pick many holes in it, because I can pick holes in it all the time when I look at it. But in the end, it's the best we could come up with to understand what a service is. And it's always fun because even our own teams, like site reliability, they once told me that they don't run any services.

I was, "What are you talking about? You do incident response, you do alert response. Are they not services?"

And they're going, "Yeah, but you know, it's not real services."

Well, the business relies on them. Everyone else relies on them. They are very important services for the company.

So we had to go back and then... This is our definition, and it's wishy-washy, but we'll get there.

And then we had to go and say, why are we adopting service ownership? What is the point of that? What do I as a product team gain from it, and what do I as a site reliability team gain from it?

So if you start at the product side, originally we thought we just wanted them to take on site reliability work. We want you guys to own it. We want you to build it. We want you to manage it. We want you to go on call 24/7. We want you to deal with customers. We want you to take over everything, is what they had in their mind.

So they said, "Okay, let's build our own site reliability team in the product side, and let's just ignore all this, and we won't have to deal with them."

So we had to go in and convince them. The idea was, there has to be a point in time where the return on investment is not there. So for product teams, they need to own part of it, but they can't own it all. And that's the thing, the most important part I always came to. They can't own everything.

So we started off, and we broke it up into smaller parts. What do you want from the company? And most product teams said, "I want to get rid of... I don't want to have to worry about release management. I don't want to have to have the release schedule. I want them out of my way. I want these people... I want you to make my life easy. I want you to help me when I have an idea. I want to be able to put that idea into a product straight away."

And so we said, "Okay, then."

And then we went back to site reliability. What's going on here? And they were saying, "The complexity. We can't handle the complexity. There's nobody who's a technical expert in all of these areas. We can't keep expanding the team because we don't have the budget. So we need the teams to take ownership of part of this." You know what I mean? Up to a point.

So then we defined it. Let's come up with a model. It's not an operational model. Because every time I say an operation... The SOM model is a maturity model, not an operational model. Because the big difference from an operational model is we're not telling you how to do it. You know what I mean? We're telling you what it should look like. Which is slightly different.

So we broke it up into lanes. We went off to the individual teams and we said, "Okay, monitoring team, what do you want? What do you think service owners should be from your side?"

And it's very simple. We want diagnostics. We want visibility. We want analytics. If you have a service, you should be telling us what's happening. You should have the ability to monitor it. You should have the ability to tell me if it's up or down. And that was one of the most important parts there. So it's just simple key metrics, key thresholds, just ensure that you're doing that.
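As an illustration of the monitoring lane described above (key metrics, key thresholds, and the ability to say whether a service is up or down), here is a minimal sketch. It is not taken from the talk: the `Threshold` class, the `service_is_up` function, and the metric names and limits are all hypothetical, and a real service would choose its own metrics.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """A key metric and the highest value still considered healthy (hypothetical)."""
    metric: str
    max_value: float


def service_is_up(readings: dict[str, float], thresholds: list[Threshold]) -> bool:
    """Return True only if every key metric is reported and within its threshold.

    A missing metric counts as unhealthy: a service that cannot report
    its own state cannot claim to be up.
    """
    for threshold in thresholds:
        value = readings.get(threshold.metric)
        if value is None or value > threshold.max_value:
            return False
    return True


# Assumed example thresholds, chosen only for illustration.
THRESHOLDS = [
    Threshold("error_rate", 0.01),
    Threshold("p95_latency_ms", 500.0),
]
```

Treating a missing reading as "down" is a design choice in this sketch; it mirrors the point that a service owner should be the one telling everyone else what is happening.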

And then site reliability, what do you want? We only want the alerts to come to us that matter. We want to have proper severity definition. We want to have the alerts. We want the alerts to make sense, and we want to have knowledge base articles, if you're expecting site reliability to do some work for you.

So we defined it as what percentage of manual actions you have. So if you want site reliability to do something, you have to tell them what you want them to do. They have to know what to do. They should not have to go and call somebody and ask them.
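One way to picture the "percentage of manual actions" measure is the sketch below. It assumes a simple alert catalog where each alert is either automated or needs a human, and where manual alerts should link to a knowledge base article. The `Alert` class, its fields, and both functions are hypothetical names for illustration, not the scoring Salesforce actually uses.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alert:
    """One entry in a service's alert catalog (hypothetical structure)."""
    name: str
    automated: bool                    # True if remediation needs no human
    kb_article: Optional[str] = None   # identifier of the runbook, if any


def manual_action_percentage(alerts: list[Alert]) -> float:
    """Percentage of alerts that still require a manual action."""
    if not alerts:
        return 0.0
    manual = sum(1 for alert in alerts if not alert.automated)
    return 100.0 * manual / len(alerts)


def undocumented_manual_alerts(alerts: list[Alert]) -> list[str]:
    """Names of manual alerts with no knowledge base article attached.

    These are the alerts where site reliability would have to call
    somebody and ask what to do.
    """
    return [a.name for a in alerts if not a.automated and not a.kb_article]
```

For a catalog of four alerts where two are manual and only one of those has an article, the first function gives 50.0 and the second returns the single undocumented alert name.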

And then incident response. We want to be able to get the right people. If you run like we used to do, we used to have a product on-call team. So you call a person, and he goes, "Well, I don't even work in that area. I'm on the Service Cloud. I have no idea what you're talking about." You'll have to go back up the chain and call again and get the right person.

Because you need the right people on call. You need the right people to answer the phone, and you need those people to be there. And that's what they wanted. So they own alert and incident response to site reliability.

Okay, solutions management, which is really problem management, but we hate the word problem because we want people to talk about solutions, not problems. What do we expect from people? We want a single source of truth. We want to know what workarounds are there. We want to know what people are doing. We want to know when they're going to deliver it. And that's what we want from the solutions side.

So we want people to get together, run their retros. We don't say postmortems. We have all these very strange things that we like to get rid of the word postmortem because it suggests everyone's dead. So we moved it to retros, or after-action reviews at first, and then retros.

But we want everybody to be coming to them. We want them to make sure that they leave the retro with a list of work items, a list of things they're going to do, and a time when they will deliver it.

And then release process and release management. They just really want you to test your releases, and they want to make sure you're automating your releases, and they don't want releases to break, because when they have to roll back or roll forward, they really just don't want any involvement.

So most of the time, the release team will say to you guys, "Just check it in. Just make sure it's working. Make sure you've tested it. And when we release it out there, don't have us sit there and get everyone together and go back to incident management, and then go back into solutions management, spend all day talking about something that you guys have promised us you're going to test and we're going to work together."

And then, obviously, change management.

I was asked by our executive leadership recently enough to go out and interview companies and ask them, "What percentage of their failures are related to change?"

And I foolishly thought that would be an easy task until I went out and I reached out to these companies and said, "Hey, guys, can I ask you a question? I can't give you any information about our company, but can you tell me what percentage of your failures or your incidents are related to changes?"

And surprisingly enough, I got no answer. So I rang up Gartner and said, "Hey, Gartner, any chance you can give me any information about what the industry standard is?" No answer from them. I went to Python, asked them. I thought they might know. Puppet, and even asked PagerDuty if they could give me anything.

But it turns out nobody wants to answer that question. But from Salesforce's point of view, it is one of the main areas we focus on. We constantly change, and I'll use the example of some of the changes we did.

In the last, I think it was three years now, we have almost increased the rate of change by 10 times. And generally, it's just from implementing proper change management procedures and policies. I know everyone hates policies, but they're not there to make things worse for you. They're actually there to make your life slightly easier. And if you do actually break something, which I would never do, but some of you guys might, then you're covered. I try not to swear. I was going to use a swear word there, but I won't use it.

Okay. Capacity management.

We had a centralized capacity management team, and their job was capacity for Salesforce, which is crazy. Salesforce is now a company of 30,000 people, of such a multitude of different products. How can we have this one centralized team who's forecasting the future for everybody?

Their job is business forecasting. Your job is forecasting your usage. Simple things like CPU usage, memory usage. You guys need to forecast that. You need to be planning it. And we pushed back on the teams, saying, "Relying on a centralized business team to decide your future is crazy. Why would you want to be in a situation where the business is defining how you're going to build your services?"

I know they have all the money, and they control all the purse strings and stuff, but in the end, do you really want the business defining that? So if you can prove your future and show what you're going to do and build your systems for that future, it's great to take control of that for yourself.
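To make "forecasting your usage" concrete, the following sketch fits a straight line to evenly spaced usage samples (for example, weekly peak CPU percent) and estimates how many more periods remain before a limit is reached. The linear trend is an assumption made for illustration; the talk does not say which forecasting method teams use, and `linear_fit` and `periods_until_limit` are hypothetical helper names.

```python
from __future__ import annotations

from typing import Optional


def linear_fit(samples: list[float]) -> tuple[float, float]:
    """Least-squares line through (0, s0), (1, s1), ...; returns (slope, intercept)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until_limit(samples: list[float], limit: float) -> Optional[float]:
    """Periods after the latest sample until the fitted trend reaches `limit`.

    Returns None when usage is flat or falling, since the trend never
    reaches the limit in that case.
    """
    slope, intercept = linear_fit(samples)
    if slope <= 0:
        return None
    crossing = (limit - intercept) / slope
    return max(0.0, crossing - (len(samples) - 1))
```

With samples of 40, 45, 50, 55, and 60 percent and a limit of 80 percent, the fitted trend grows by 5 points per period, so the estimate is 4 more periods.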

If anyone has any questions, you can stop me in the middle of things as well, by the way.

Okay, so traffic management. Again, this is another one that is unusual, but it kind of relates to Salesforce. What we're saying here is your service should not come on or stay on if it's going to make things worse. You know what I mean? Either you should be able to throttle, you should be able to turn yourself off, you should be aware of the environment. If you turn on your service and you're making things worse for the company, and your service doesn't know this, it's just sitting there and hoping for the best.

I'll use an example from sometime in my past in a previous company where the Kerberos servers or such like that, where when they stay open, everything else goes down. You still can't get into boxes, so somebody has to go in and switch off the Kerberos servers.

Back in the day, I'd go, "Can someone go in the data center just and pull those plugs so I can get into the root, so I can actually do stuff?"

It's just some of those things, and that's a prime example. And people don't really realize because they don't understand where their service lives in the infra... well, the whole environment. So you need to know what's happening.

Another side as well, you need to know the traffic that's flowing through you. If I give you a 10-gig network bandwidth, and you're constantly hovering around that 10 gigs, we all know what happens when you start reaching the end of your bandwidth of your network. You're going to make things worse for everybody. Because very seldom you get everything yourself, and if you're utilizing all the bandwidth, nobody else can.

So they're the kind of things. Understand where you live. Understand the environment. Understand how you're sending your data, and what you're doing, and who you're managing.
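The traffic management idea (throttle or switch yourself off rather than make things worse) can be sketched as a small decision function. This assumes the service can observe its own bandwidth use and receives some signal that the wider environment is degraded. The `Action` enum, the `decide` function, and the 80 percent threshold are hypothetical and only for illustration.

```python
from enum import Enum


class Action(Enum):
    RUN = "run"
    THROTTLE = "throttle"
    STOP = "stop"


def decide(used_gbps: float, link_gbps: float,
           environment_degraded: bool, throttle_at: float = 0.8) -> Action:
    """Choose what a service should do given its bandwidth use and its surroundings.

    - Heavy use of the shared link while the environment is already
      degraded: stop, because the service is making things worse.
    - Heavy use or a degraded environment alone: throttle.
    - Otherwise: keep running normally.
    """
    if link_gbps <= 0:
        raise ValueError("link capacity must be positive")
    utilization = used_gbps / link_gbps
    heavy = utilization >= throttle_at
    if heavy and environment_degraded:
        return Action.STOP
    if heavy or environment_degraded:
        return Action.THROTTLE
    return Action.RUN
```

On the 10-gig link from the example above, a service using 9.5 Gbps would be told to throttle, and to stop if the environment is also degraded.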

So there are the seven streams that we came up with.

See, this is backwards. I did fix it, but what can you do? But I posted it up, but they didn't have it for me.

So we still have a schism, and that's why we have the little earthquake. I thought it'd be useful for San Francisco. But we still have a schism across as well between macro and micro. Because everything I talked about there sounded very macro, that we're expecting the teams to do all this.

But then we all talk about Dockerization, Kubernetes, and micro-servicing, whatever you want to call it, and we're all saying, "Look, we want to be inno..." I hate saying innovative. That's how I would say it, but innovative would be how you would say it here. But we want to be on the cutting edge, and the only way to do that is to be able to bring things live fast. And if you can't bring things live fast, then someone's going to beat you to it. So you want to bring these small services.

You start off, you have your infrastructure as a service, then you have your platform as a service, then you have your software service. So you have code, low code, no code kind of deliverables, and we want the teams to be able to do that.

But these teams are only two or three people. So if we follow the policy I just did, they're going to go, "There's no way we're going to be 24/7 on call. There's no way that I'm going to do capacity management. There's no way I'm going to do all this stuff with my team of three people who you're expecting me to deliver all this software for."

So then it comes back to where does that fit in? And even on the macro side, you have the large teams, the agile delivery, all that kind of stuff that we're currently doing. And their plan always is to hand it over to site reliability, which is their level two on what Arun showed earlier.

The plan is always, "Here's an operational readiness doc. Site reliability, you now monitor. It's live. Have a great time. Call us when it breaks."

So then we had to break it down. They need a hybrid model. This is why I joked earlier that it's not an operational model. It's a maturity model. Because we don't want to get to every level of maturity. So there has to be a hybrid model. If you're going to build small teams and you're going to Dockerize your microservices, you have to hand over work to a large team at some point. They can still own the code, they can still own the stuff, but somebody eventually is going to have to handle it later.

And then the problem also is end of life. So you do an amazing job with your SOM maturity model, and you get up to your level five everywhere, which is future kind of amazing stuff that nobody ever achieves. But then eventually you have to cut off support for that service. So you have to reduce the amount of effort you're putting in to keep that service up. So how does that work?

Again, they say the model doesn't really help them with that. It just, can we jump back levels? People just generally don't like to reduce down.

And then we have existing services. Services who are already in production there a long time, and who followed the old model of, "Hey, site reliability, I'm going to send you all these alerts. Can you please acknowledge them? Use your alert correlation engine to manage them for me because you've always done it. Just keep doing it. It works for me. I never have to worry about it. I never talk to customers, and so nobody ever tells me there's a problem."

And there's all those teams as well. You're working with those teams. You're trying to convince those teams that this makes sense for the company as a whole and for them as a whole.

And changing culture is always the most complex thing. I found anytime you come into a place and you want to change culture, you really, really are in for a fight a lot of the times. You know what I mean? Because if they're used to doing something, you can't take it away from them.

It's the gray vote that they use in elections. You can't touch the gray vote because you take something away from pensioners, and they'll all turn on you very fast. It's that kind of thing. You've got to be careful. Taking something away from somebody without giving them something in return is extremely difficult.

Okay. So there was what we're looking for from here. We know that we don't have all the answers, and I'm hoping some of you have solved the problems I've just said because I haven't solved those problems. That's why I put them up there. And what we're really looking for is we're looking for you guys to come back and tell us how you dealt with these issues, how you deal with putting in the DevOps. We're not allowed to use that word, but service ownership. You know what I mean? How you guys are doing.

And we want you guys to come and ask us questions, and we want you to be able to work with us, figure it all out together. And that's why we're here.

And obviously, we're planning to write a white paper as well, which will be coming out later in the year. It's probably not true. It's probably next year at this stage. But we're definitely looking to talk to people.

And if anyone ever wants to discuss what percentage of their failure is related to change, I would love to discuss that as well. You know what I mean? If you want to tell me.

Okay. Jeez, I'm after talking for... I was supposed to give five minutes for chat. Thank you, everyone, for coming. If anyone has any questions or something they want to point out, or even just wants to say, "Hey, guys. Here's what I would do here," we'll take that as well.

Look around you for questions.

Oh, no questions.

See, I told you we had all the answers.

Yeah. Clearly not.

Thank you very much, everybody.