How Swarming Transforms Enterprise Support to Work Better with DevOps
The growth of DevOps in large enterprises is driving service organisations to re-think established approaches to technical and customer support.
This presentation explores the issues being encountered by DevOps teams as their output becomes more significant in enterprises, particularly where they are drawn into traditional support structures which often conflict with the principles of DevOps. We will discuss the ongoing emergence of Swarming as an alternative to 3-tier support structures, showing the latest examples from a diverse range of industries including software, telecommunications, and automotive.
The presentation will show how Swarming enables smarter interaction between DevOps teams and established ITSM functions, and will show, using the Cynefin framework as an example, how Swarming enables provision of adaptive support for complex distributed systems in a way that tiered support can’t.
Jon is a Principal Product Manager at BMC with nearly two decades of experience in enterprise IT. As a senior product leader, his role is focused on the creation of a new generation of innovative, UX-focused support tools which meet the evolving challenge of providing technical customer service, at scale, in transforming enterprises.
As a passionate DevOps advocate, Jon has spent a number of years working closely with the DevOps community, aiming to help DevOps to succeed amid the realities of complex enterprises.
Jon has regularly presented at DevOps events, including DOES18 in Las Vegas, about the challenges awaiting DevOps groups as they establish themselves in enterprises, while also presenting at numerous IT Service Management events about that community's opportunity and responsibility to be an enabler for its success.
Full transcript
Jon Hall
It's really great to be here, not least because I went to my first DevOps conference by accident.
In about 2015, I saw a conference called Configuration Management Camp, which some of you might know, over in Belgium.
I'm from the ITSM world. That's my whole background, and configuration management means CMDBs, and while I was quite surprised to see a whole conference about it, it was easy to drive to. Of course, I was a little bit wrong, but I went anyway because it looked interesting, looked kind of connected.
It's sort of doubly profound for me being here because when I look across the river, I can see where my dad started his career, and of course it looked a little bit different back then. Canary Wharf was a banana warehouse, and that's where he turned up at 16 years old to go to sea.
So it's a little bit... I like coming here and seeing that.
These days, though, what I want to talk about is the way complexity in our enterprises makes DevOps and ITSM have a kind of mutual need.
When I went to that conference in Belgium in 2015, most of the people I was talking to were from startups and from the open source community, and mainly not focused on enterprises.
It was cool hanging out with Spotify back then and other people, but I went to the same conference this year, and a lot of the people there were my customers. I work in the enterprise service management space, so we deal with the biggest companies in the world.
And in that world, digital transformation is the buzzword.
I think what it really means to us is that the systems that we're helping people to support have got a whole lot more complex, and there's many more of them.
So these days, if we're the customer of a service, we might, for example, be onboarding a new employee at our company, and a lot of that stuff might've been done by email and paper trail before, but now we touch a whole lot of different services.
In that example, we're talking about HR and facilities and IT.
Let's take a more general example. We change our airline seat.
For us these days, that's click one button, but when you do that, you're probably hitting 50 or more systems, because they've got to worry about your status and your billing, and they've got to think about operational systems for the airline, and somewhere there's a dusty mainframe that stops travel agents assigning that seat to somewhere else.
So it's a pretty difficult world, and in reality, these big enterprise services are very, very complex. The DevOps picture of 2015 translated to the enterprise gets a lot harder to deal with. But what we see when we look at some of the numbers is that DevOps is, of course, changing the world.
We know that, and in the IT service management industry, it's been a little hard at times for those of us who are reform-minded and looking to the future to convince some of our peers that they've got to think differently. The docks are gone, but they're not really gone. What's happened is that beating heart of the shipping industry of 40 years ago has now moved up the way to Tilbury, and to Harwich, and to Felixstowe, and instead of my dad sitting watching people unload a ship for a month, you now get these cranes that do it all in half a day, and the ships are 400 meters long and enormous and look great from the beach where I live.
So the world is really different, and yet it's sort of the same, because here, even if we look at the DevOps report, even the very best companies, that low percentage who are achieving this elite status, they're still spending half of their time being reactive.
Now, what service management should be doing, what the people in these organizations that you deal with should be doing, is taking some of that work away.
I'm a product manager on an enterprise product, and I cannot get drawn into every single support ticket because I have thousands of corporate customers.
I have to have a support framework in place that can deal with a lot of those things. I can work with them to make sure that the same issues don't keep coming to me, that we fix every problem once, and after that, they're equipped to do that.
And that, I think, is really what the service management world should be bringing to you as DevOps practitioners.
We should be focused on making that 50% turn into 60, to 70, to 80, and taking away some of that toil.
But the problem is, a lot of what happens in ITSM is still built around structures that don't really adapt, and one of the classic examples is this tiered support system. It's rooted in scientific management.
It probably made a lot of sense back in the day of simple stacks, maybe.
And part of the challenge with this is that to get to experts, you're already introducing several queues. So hopefully level one support is pretty sophisticated, a lot of self-service, a lot of knowledge management.
Hopefully they can fix things.
But as we'll discuss a little later, stuff's still going to break, and it's going to break in complex and novel ways, and you are going to need to have complex specialists.
Your DevOps teams tend to find themselves at the bottom of that stack and dealing with support tickets more than they want to, and especially they're dealing with support tickets that have spent time going through two queues, so they're also dealing with a pissed-off customer, which isn't a really good way to start.
And that's why a lot of us have been focused on new ways of dealing with this, and swarming has coalesced as a very different response.
I don't think it's a coincidence that this looks a lot like a slide that was presented by Jonas Elmquist when he was talking about the Phoenix framework on Tuesday.
Effectively, we've had organizations set up around tiering, and we're needing to break that down and do something that looks a little different.
And so swarming, I think this is one of the best definitions, and it's from the Consortium for Service Innovation, who also do a framework called Knowledge-Centered Service, which is also very useful in this context, but that's for another day. In a sense, in this context, swarming is about taking away those tiers and being much more dynamic and much more flexible.
Now, what's been interesting for me is we've been doing this in our own customer support organization. So for me, it's almost customer zero.
They're the customer who's across the office who can yell at me.
They've been doing this for now some five years.
And one of my favorite quotes from one of the first times I sat down with one of the people who've been doing it was...
This chap was one of the most senior support people on the Remedy product line. He'd been working with Remedy more than 10 years.
The kind of person you think, "This man probably couldn't learn much more.
He can do everything."
And yet he started out the conversation by telling me that swarming had doubled his knowledge in a year, which of course, made me very curious.
So I'd like to just talk about how we do it, emphasizing this is a very self-organized process.
Different companies look different, and I'll show you some other examples in a minute. But our own customer support team, they start out with a typical level one support framework, the service desk, a self-service system.
This really concerns the things that come through that can't be fixed at that level.
And a few of those things will go to what they call a severity one swarm, and these are the situations that are really on fire.
And that swarm is just a bunch of agents who are on call-out on rota, and they will pull together the war room, and they'll do whatever needs doing to fix it. And this is nothing new. We've just called this swarming, but I haven't told you anything that you're probably not all doing.
The novelty comes somewhere else.
The novelty comes in the fact that for everything else, instead of trickling its way through perhaps a second tier of support and then maybe finding its way to a third, we have this thing called a dispatch swarm. Some teams call it a triage swarm.
If you're going to let people self-organize, they're going to come up with different names for similar things. So that has to be okay.
And most of the time, these people are meeting in small groups every hour or so, and all they're doing is they're watching the stream of incoming tickets.
We often pair an experienced person with a less experienced person.
That works really well for us.
And really, they're cherry-picking.
They're thinking, "Okay, never mind the old school sort of SLA thinking.
Never mind the fact that this is low priority and we've got five days.
Can we fix it now? Let's fix it now. Do we need to get back in touch with the customer just to clarify something? Let's do that.
Let's also make sure that anything that goes beyond this level gets properly quantified, clarified, all the data's right." Because one of the other big problems with that tiered support system is that things bounce up and down a lot.
That said, we still have a lot of products.
We have a lot of subgroups to those products.
My own product runs on cloud and it's on-prem, and we have obviously databases and servers and lots of different aspects of the product to support.
So we still have those specialist teams, and we're also a global company, so those teams live in different places.
And this is effectively the level three situation.
And not every ticket is an issue. Sometimes they can pick those things up and fix them normally. But what we've tried to avoid, what we've set out to get rid of is the ticket tennis between those teams that's all too common again, because it's not always clear who needs to do something.
So you're not allowed to do that.
What you have to do is if something is challenging, you bring it to what we call a backlog swarm. And how those things are organized is entirely up to the teams.
But typically, they'll meet at a scheduled slot, maybe a couple of hour-long slots a day. And these are focused on these challenging issues.
So they can't assign to each other.
The dispatch swarm can't pick out that individual they know is really, really good at that thing because what that tends to lead to is those people getting burned out pretty quickly.
They work together in a group and solve these issues.
One of my favorite things about this, and I like to sit in on these sessions sometimes, is that although the people with issues get prepared, bring materials, and bring them to the group, a lot of the time the rest of the team come along even though they don't have any issues.
They are really enthusiastic about getting into these groups because they're so valuable to them. So that in itself is a really interesting effect of this.
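As a rough illustration of the routing described above, here is a minimal Python sketch. It is not BMC's tooling; the `Ticket` fields, the `Swarm` names as code identifiers, and the `route` function are all hypothetical, and real teams self-organize these decisions rather than following a fixed rule.

```python
from dataclasses import dataclass
from enum import Enum


class Swarm(Enum):
    SEVERITY_ONE = "severity one swarm"  # on-call agents who pull together a war room
    DISPATCH = "dispatch swarm"          # small group watching the incoming ticket stream
    BACKLOG = "backlog swarm"            # scheduled group session for challenging issues


@dataclass
class Ticket:
    summary: str
    on_fire: bool = False      # hypothetical flag: a severity one situation
    challenging: bool = False  # hypothetical flag: a specialist team is stuck on it


def route(ticket: Ticket) -> Swarm:
    """Pick a swarm for a ticket that level one support could not resolve."""
    if ticket.on_fire:
        return Swarm.SEVERITY_ONE
    if ticket.challenging:
        # No ticket tennis: a hard issue goes to the group session,
        # never to another team's queue or to a named individual.
        return Swarm.BACKLOG
    # Everything else is looked at by the dispatch swarm, which fixes
    # what it can now and clarifies the rest before it goes any further.
    return Swarm.DISPATCH
```

The point of the sketch is what is missing from it: there is no branch that assigns a ticket to a second or third tier, or to a specific person.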
Now, to make this work, we had to change quite a lot of things.
Firstly, the top two points are really important because they need a lot of executive backing.
In service organizations, quite frequently, people are buried in existing service level agreements.
For the service provider market, this is particularly complex because often there's contracts and payments tied to that.
So sometimes you have to just abandon that sequence of metrics that you measured before because they're often focused on individuals and single teams, and now we're not working like that.
And we also needed to give people a lot of freedom to feel they could organize their own way. And the way we actually did that was by starting with a few products and letting them get a footing, get established, and then start to disseminate the message both upwards and across other teams, and that went really well. It does actually put more people in touch with customers if you organize the way we do.
And some people needed a bit of help with that if they've always been back-end support people. Not everyone has that experience, and that's worth knowing about.
As we said, we had to ban tennis. You've got to stop people pinging tickets around to each other. And we also use this as an opportunity to develop more chat ops and introduce other tooling like mobile.
It works really well with people who are on the road.
The results, even in the first year for one of our major support teams, they saw really big quantitative improvements, mean time to resolution and customer satisfaction. But also the human aspects.
People got up to skill more quickly.
Even those experienced people developed more skills.
And we actually found that because they had more free time, because they weren't mired in queues, they started building more service offerings, which ended up actually improving our own revenues.
So final quote from our BMC people.
This is a quote I got a few months ago, having sort of revisited with one of those original teams.
And this analyst talked about the fact that it's good in two ways.
Firstly, you might well find someone on that call has experienced this problem before, and that's going to be really hard to find if you're just hunting people out by bouncing a ticket around.
But secondly, for new problems, you're bringing together that collective experience of problem-solving. So even though you might not have seen it before, the group can help.
So the good news is we've been banging this drum and some of our customers and people involved with the Consortium for Service Innovation have been pushing this.
And sometimes early on, it felt a little difficult to break down the entrenched social structures of service management, but now it's starting to appear in things.
I actually got involved as an author on the new ITIL framework, I think possibly to be a little bit of a troublemaker, but a lot has had to change, and it's definitely headed the right way.
And we're seeing in other service management things like a new framework that came out called VeriSM, which has some people involved who have spoken at this conference, that again is really starting to push this.
So now our customers are picking up the phone and asking us about what we're doing, which is good news because we've got some nice stuff coming.
But we also deal with other companies who do it, and I wanted to share a few other examples.
This example is a telco who are in Central America. Well, this is a service desk for the telco in Central America, and they're doing chat-based support rather than phone-based support.
And what's interesting for them is these agents can put the chat on hold, and they've got access to what I kind of...
I like the phrase always-on swarm, as one of their employees called it.
They have these groups of people who are typically assigned from specialist teams and on rotation, and they'll be sitting in a Slack channel, and the service desk agent can sort of say to the customer, "I won't be long." And they can go in for up to three minutes and get advice from these teams.
And the benefit of that is by donating a few members of staff on a rota basis to these swarms, those teams then find fewer things actually being escalated to them because it enables that service desk to fix things in a way that it wouldn't have done otherwise.
Another great example, I spoke with a car company, a major American car manufacturer, and their challenge is interesting because their connected cars division, a few years ago, were responsible for the systems in about 200,000 cars.
And then their head of department got promoted to a very senior role, and now that company is putting this technology in every single car, and they make many millions of cars a year.
So suddenly they've got to be able to scale both the first-line support and the back-end support. And so these first responders, as they call them, can convene a swarm, and they can pull people in from other teams.
And again, those teams are putting somebody on rotation.
So that brings together this group of people.
Again, they've cut level two and three assignments out of this.
They also use this to bring in third parties.
Of course, increasingly, that big, complex picture of systems includes a lot of third parties, and one of the biggest pain points has always been assigning to third parties because that usually involves throwing the ticket over a very high fence into someone else's system.
This kind of format enables you to bring people into these things, and more and more, we're seeing representatives of Microsoft and Amazon and Google and other major providers being brought into this kind of support structure.
And then one of the other benefits they called out was they can shrink the thing as well. So if it turns out that somebody's not needing to be involved, for them, it's just a short conversation instead of having to pick a ticket out of a queue and explain why they're not involved and find someone else to reassign it to.
So this all sounds really good, right?
And mostly it is.
But it would be wrong not to call out some of the things that people are talking about as potential issues.
Firstly, again, one of the things entrenched in tiered support is the fact that things are cheaper at the first line than the third line.
You're paying those people less, and they take less time, in theory, per ticket.
So a lot of organizations work on benchmarks that might say it's $20 for a first-line fix, and it's $250 for a third-line fix. We're starting to bring in second and third-line type people a little sooner, or at least that's the perception.
I think that needs to be balanced against the fact that stuff's actually working a lot better. But of course, I'm talking about perceptions here, and this is a significant one that people face from managers.
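The cost perception can be made concrete with a small calculation. This is only a sketch: the $20 and $250 figures are the example benchmarks quoted above, while the shares of tickets reaching specialists are invented purely for illustration and are not measurements from any organization.

```python
def blended_cost(first_line_cost: float, third_line_cost: float,
                 specialist_share: float) -> float:
    """Average cost per ticket when a given share of tickets involves specialists."""
    if not 0.0 <= specialist_share <= 1.0:
        raise ValueError("specialist_share must be between 0 and 1")
    return (1 - specialist_share) * first_line_cost + specialist_share * third_line_cost


# Hypothetical shares, chosen only to show the shape of the argument.
tiered = blended_cost(20, 250, specialist_share=0.10)
swarming = blended_cost(20, 250, specialist_share=0.20)
print(f"tiered: ${tiered:.2f} per ticket, swarming: ${swarming:.2f} per ticket")
```

On a per-ticket benchmark like this, involving specialists sooner always looks more expensive. The model leaves out everything the benchmark does not measure, such as time lost in queues and tickets bouncing between teams, which is the balance being argued for here.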
Also, because we're stuck in a world of a lot of reporting and metrics being focused on individuals, each individual's fix rate is a common one.
Which I think is a really bad metric in a world where very few complex issues are solved by one person. But it's out there, and they're quite dominant, and you've got to break organizations out of that mindset. Even simple things like time zones can be interesting challenges if, instead of a one-to-one assignment, you have to shift one thing from one entire group to another. And then there are some human problems.
Firstly, any conversation can be dominated by individuals.
We all know this, and that can apply in many contexts, and it applies here. And also, it can be interesting to find the right people. If you've just joined a big company, how do you know who to invite? And that intelligent swarming model that the Consortium for Service Innovation talk about, which I've tweeted a link to, hopefully any minute now, is very focused on that challenge.
And for us, we're looking at the way machine learning and other cognitive technologies can help with that.
But anyway, we're at a DevOps conference, so what has this got to do with DevOps?
I've mostly talked ITSM. Well, firstly, as I explained, I think we need each other.
I think we have something to offer from our community to the DevOps community, and I've been presenting now at DevOps conferences for years, and that sentiment seems to get stronger and stronger.
Which is good, but I like this quote. It's years old now, but I don't think it's changed.
"If you try and just adapt existing stuff to DevOps, it doesn't work."
So DevOps is really surprising service desks because essentially this comes down to the fact that as companies transform digitally and as they adopt DevOps, there is a lot more software being built in-house, and it happens a lot more quickly, and it happens in ways they'd never really seen before. And even just the fact that those developers, they're not interested in using your ITSM ticketing system.
Thank you, we've got Jira, we've got other SDLC tools.
We're doing our stuff, and it works great, and if you look at the DevOps report, you notice our change results are better. Fair point.
And there's also, because we are, as companies, building more digital services, we're suddenly seeing the external customer more and more as well. So again, service management has always been quite inward, employee-focused. That's all changing.
But as I say, how do you scale? We see companies, DevOps teams emerge from little startups within the company, experiments, proofs of concept, and they go big.
And when they go big, as I say, this is now where we've got something to offer.
We see people needing to deal with being on call as developers.
Classic problem discussed at this conference.
It's sometimes hard. As a development team, you've got 50 tickets sitting in a queue, but you've got 50 enhancement requests sitting in another queue.
How do you know what's what? And one thing service management is very good at is defining and setting and showing context in an enterprise setting.
We've got good technologies, and we've got good practices.
We just have to make sure we bring them to you in the right way.
And that's the problem.
This is something I present at ITSM conferences a lot.
How am I going to annoy somebody in the DevOps world?
I'm going to implement a lot of queues.
I'm going to set up silos and make sure they can only talk to each other down horrible asynchronous paths.
I'm going to burn some people out, and we're not going to share knowledge very well. That would be pretty annoying to everybody here.
I think these are the things that the DevOps world has done a great job from the ground up of getting out of the industry.
Oops.
So this is the message I present, I tell ITSM all the time as well.
This aligns really well to DevOps.
Because it is about autonomy, it's self-organizing systems.
It really builds knowledge transfer and skills, and that means that we can take some of that stuff off your hands.
We're not using emails and asynchronous assignments.
We're using tools like ChatOps. We're busy building on Microsoft Teams now because actually that's what a lot of our enterprise customers are focused on, and it will just bring that better experience.
The ticketing system is still there, but it's just adding value where everyone's actually working.
So it's the right thing to do, I think, and I think it's just from a point of view of people, it's making a big difference in our sector, and it's helping us connect better, which is why I think people like me are increasingly attending and talking at these conferences. But I'd also go further and argue that we can't adapt what we do to a lot of the new ways of thinking without doing this. So just taking one example.
Do you know the Cynefin framework? I see a lot of nodding heads.
I'll do a quick overview.
Cynefin is effectively a reasoning and decision support framework, which starts on the basis that different types of issues have different natures and need different responses. Dave Snowden is its creator.
I will tweet, hopefully I've already tweeted a link to a really great podcast that he put together.
But essentially it defines each issue as being in one of these categories, and the green areas are sort of liminal areas, which are the things which are not quite in one or the other.
So obvious things, clear cause and effect.
This breaks, so this happens.
Complicated things are those where it's similar. There's still a linear cause and effect, but it takes a lot more investigation to find out. I think the fun really starts in complex and chaotic.
Complex may not have any linear path between cause and effect.
Chaotic are things that are just plain on fire, and we're dealing with the impact as much as we're dealing with the problem.
And especially, we saw that picture at the beginning. Systems are very complex.
Complex systems fail in complex ways, and that needs a particular type of response.
So as I said, I went back to Configuration Management Camp this year, and I saw an amazing talk by Charity Majors which, again, I've tweeted a link to as well.
Check it out. She was talking about different failures, and she was talking about this in the context of observability versus monitoring. So related to what I'm talking about.
But these things apply equally when we're supporting systems, and I love these.
The first one, 10% of things crash unpredictably. By the time we go and look at it, it's cleared up.
We run a platform. I know this well because we sell a platform.
How can you tell whether someone's broken something or whether the platform itself is broken? It can be really hard to get to that.
And I think this is my favorite because this sounds like so many issues I've seen over the years. Stuff gets slower every Tuesday, normally in Bulgaria.
We're not sure why. And then it clears up.
Now, the thing with all these is that these systems now are probably underpinned by tens of thousands of microservices.
Again, there's probably mainframe over here, there's brand-new stuff over there, there's old cloud, new cloud, internal servers.
Occasionally, still a machine under someone's desk with a beeping light on.
Charity was talking about observability, but put yourself in the position of a first-line support agent. Where do I go with that?
If I've got to assign to a specific team, what the hell do I do?
Cynefin's approach to complex systems is really quite interesting because, again, it's quite theoretical, but very practical.
And they do this in not just IT, but a lot of other areas of the world, government response and things, disasters.
If we don't have a linear, obvious cause, we have to try different things.
We have to have parallel safe-to-fail experiments, as Cynefin puts it.
We need to be able to figure out some things that it could be, experiment on them.
Often, then we just have to observe the effect of those experiments, complicated by the fact that in doing the experiments, we're also having an effect on this complex system. We dig up a road in a city to fix one problem, and we probably create five new problems because a city street landscape is a good example of a complex system.
So all we can do is figure out what works quite well, keep doing it, monitor, check, experiment.
How the hell could you do any of those things if you've got to assign it to a support team and they have to assign it to another support team?
Completely impossible.
Whereas if we're swarming, well, firstly, we can do our dispatch swarm, figure out what the nature of this problem is.
We can start to do that probing, that experimenting, what's actually going on, what could this be?
We can then start to build parallel work streams with different people, experiment, observe, monitor. And then hopefully at the end, when we've done enough of that and we can figure out how to actually respond to this, we can again let people go, pull together the right team and do that.
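The idea of parallel work streams can be sketched in code. This is a minimal Python illustration using the standard library's `concurrent.futures`, not something taken from Cynefin or from any support tool; `run_probes` and the probe names in the usage example are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_probes(probes):
    """Run independent probes in parallel and collect what each one observed.

    `probes` maps a hypothesis name to a zero-argument callable that returns
    an observation. A probe that raises is recorded as failed rather than
    stopping the others, which is the safe-to-fail part.
    """
    def safe(probe):
        try:
            return ("ok", probe())
        except Exception as exc:
            return ("failed", repr(exc))

    with ThreadPoolExecutor(max_workers=max(1, len(probes))) as pool:
        futures = {name: pool.submit(safe, probe) for name, probe in probes.items()}
        return {name: future.result() for name, future in futures.items()}


# Hypothetical usage: each lambda stands in for one person's line of inquiry.
observations = run_probes({
    "cache is stale": lambda: "hit rate looks normal",
    "regional network issue": lambda: "latency elevated in one region",
})
for hypothesis, (status, detail) in observations.items():
    print(f"{hypothesis}: {status} ({detail})")
```

In a real swarm the probes are people and conversations rather than functions, but the structure is the same: several hypotheses are explored at once, each result is observed, and no single failed line of inquiry blocks the rest.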
Again, it looks like theory, but this is being implemented in a lot of different contexts, and a swarming approach really works well with it and enables that.
And in fact, if we look at the way other domains can be sorted, well, obvious stuff. To be honest, if you're taking more than one call on an obvious problem in a modern ITSM setting, you're doing everything wrong.
Or at least taking more than one set of calls, you should be automating that stuff. So not a lot to see here.
Complicated, though, as we said, this is where it might take some work to identify what that linear cause is.
This really does echo our dispatch swarm.
We're bringing together several sets of skills, giving them a bit of time to investigate and work on it. Chaotic stuff, again, this is where we might actually have to be containing the situation before we even really get on with resolving it. So again, if we have somebody coordinating a swarming response, this works much better than some kind of ticket assignment model because you can have people mollifying the screaming customers.
You can have people figuring out a way to just get things up and running again.
These complex systems are constantly failing in little ways, so often we can reroute around the problem, and then we can start getting on with the analysis once the fire is only smoldering rather than blazing.
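The pairing of Cynefin domains with support responses described above can be summarised in a small lookup. This is an illustrative Python sketch rather than part of the Cynefin framework or of any product; the `Domain` enum, the `RESPONSE` table, and the wording of each response are assumptions paraphrased from the talk.

```python
from enum import Enum


class Domain(Enum):
    OBVIOUS = "obvious"
    COMPLICATED = "complicated"
    COMPLEX = "complex"
    CHAOTIC = "chaotic"


# Paraphrased from the talk; the exact wording is illustrative.
RESPONSE = {
    Domain.OBVIOUS: "automate it; no swarm needed",
    Domain.COMPLICATED: "dispatch swarm: several skill sets investigate the linear cause",
    Domain.COMPLEX: "parallel safe-to-fail experiments, then observe and monitor",
    Domain.CHAOTIC: "coordinated swarm: contain the impact first, analyse afterwards",
}


def respond(domain: Domain) -> str:
    """Return the kind of support response suggested for a Cynefin domain."""
    return RESPONSE[domain]


for domain in Domain:
    print(f"{domain.value}: {respond(domain)}")
```

None of the four responses involves handing a ticket from one team's queue to another, which is the argument being made about tiered support.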
So again, this stuff is only possible if we change some of these entrenched working methods.
And if we do that, then we can really start to offer a lot of value to you in the DevOps world, who are building these complex systems, changing the industry, changing the world.
Hopefully, we can start taking some of that toil away and let you do more of it.
And this is where I think service management is having to go.
This is where those of us who are working on ITIL 4 are trying to change people's thinking and make them think about value streams and make them think about innovations and practices rather than defined processes. And I think it's actually going pretty well.
I was pretty pleased when I saw the ITIL 4 Foundation, which is why I volunteered my time to carry on with the next step.
So what's next? Well, in doing that, we need to be able to talk to you.
I've got so much value over the last five years from coming to DevOps conferences and just sitting down and talking with people.
Increasingly, now I get access to DevOps teams through our enterprise customers because it's going so big. But you've not stopped. This industry itself is evolving so rapidly that I would encourage you to find opportunities to talk with the teams in your business who are providing that kind of enterprise support framework because there are ways they can help you, but they have to know how.
And I think you're in a really good position to do that because the DevOps community are probably stronger agents of change in enterprises than ITSM are these days. Your CIOs read that State of DevOps Report. We see these numbers. We see what's happening.
And really, I think by working together, we've got a great opportunity to build these much stronger relationships.
Finally, then, a few more things. I've tweeted these links, hopefully.
I once did that wrong and tweeted them about six o'clock the next morning, but I'm at least in my own time zone here, so can probably work Buffer's time zone system. So on the left there, the Consortium for Service Innovation have a lot of information on their model of swarming.
It's kind of the one we work to at BMC in our customer service division and contribute to.
And other companies involved in that are companies like Ericsson and Cisco.
You see all those names on there.
That podcast I mentioned, Dave Snowden, there's lots of opportunity to see Dave Snowden talk about it.
This is just one great example on a podcast called "Boss Level." I like this because he explains it all in 10 minutes better than I've seen anyone explain it in an hour.
Charity Majors' talk from Config Management Camp is brilliant on all levels, I think. I sat there and felt I was learning and learning.
And then it was all just so apparent to me that we had exactly the same problems addressing these complex systems in old ways.
And then finally, there's a blog of mine, which is kind of the trigger that brought me into this conference, I think, was when Gene saw this. But it's effectively a long-form blog that just breaks down swarming, talks about why it works well with DevOps.
So thank you very much. I hope it's been interesting.
I'm always around to be contacted.
Like I say, I love learning from what you're doing, so please do get in touch, and thank you very much.