DevOps Patterns & Antipatterns for Continuous Software Updates


Las Vegas 2019


So, you want to update the software for your user, be it the nodes in your K8s cluster, a browser on a user's desktop, an app on a user's smartphone, or even a user's car. What can possibly go wrong?

In this talk, we'll analyze real-world software update fails and how multiple DevOps patterns that fit a variety of scenarios could have saved the developers. Manually making sure that everything works before sending an update, and expecting the user to do acceptance tests before they update, is most definitely not on the list of such patterns.


Full transcript

The complete talk, organized by section.

Stephen Chin

[~0:02] So, someone in the audience called this out. I have a small typo on my slide. It should say, "Steve on Java" rather than the missing J. So apologies on the typo on the first slide. It gets better from here.

[~0:18] So I run the developer relations team at JFrog. So the plan for today is we're going to chat a little bit about the reasons why you want to do continuous updates and update your software.

[~0:30] Some anti-patterns for things which can go wrong. I think anti-patterns are always the fun part. You look at other people's mistakes and you're like, "I'm not that guy. I can do better."

[~0:39] But then some practical advice on what you guys can do with your process to avoid some of these mistakes and common practices that'll save you from hitting issues with your updates and affecting your customer base.

[~0:53] So, first question we all want to ask ourselves is why do we care about software updates?

[~1:01] So I hear a lot of good people in the audience shouting out questions. So, security. What?

[~1:07] Could be some new features coming in so that- Features. Okay. So that's my number one reason, because we have these annoying folks, users,

[~1:17] and of course, what do they want?

[~1:20] Features.

[~1:21] And when do you think they want the features? As soon as you release. Yeah. Now. So you guys all have customers I can see in the audience. This is not a new problem for you guys. And it wasn't always like this, right? So-

[~1:34] No. What?

[~1:35] No customers for you? No, I said, do they even know what they want? No!

[~1:40] No. Yes. You're very right. So it wasn't always like this. So it used to be like you got a feature phone, and when you wanted to update your device, you brought it back to the store because maybe you wanted Snake on your phone, which is clearly why Nokia won the smartphone war or the feature phone war. And you took it back to the store and you got a new phone.

[~2:02] Maybe eventually you could update your software by cable.

[~2:05] Although initially, if you remember on the early feature phones, all the cable was good for was transferring contacts. You couldn't even update anything or connect it to the internet.

[~2:16] Then this iPhone came around. This was the big game changer, but you had to actually get prompted for updates, so you had to make a choice about updates. And hopefully, we're in a world now where your phone is constantly being updated. You don't even think about it. The apps on your phone, what you're using for software, all of this stuff is continuously getting pushed down as updates, and this is what users expect now. This is the expectation on any software application, is that you're going to continuously get updates.

[~2:48] Okay. So now the problem here is all of these updates, this is the new oil spill, because there are security vulnerabilities which can destroy your customers and destroy your business. And this is actually something someone else -- this is kind of the second big reason why you want to do updates, is because you have to patch security vulnerabilities, otherwise you're impacting your users.

[~3:14] Okay, so we have features, we have updates. We all know that we want to do updates, but the question is how long does it take us to actually get updates? And a good model for looking at this is, you guys all drive cars, right? So this is the braking distance that you need to stop your car in the case of an accident or to avoid a collision.

[~3:35] So there's two different things going on here. One is your thinking time, that there's an obstacle, I need to stop the car. The second one's the braking distance. Now I've pressed the brake pedal, now the car needs some time to actually mechanically stop. And these are two separate actions. One of them is a thought process, so it happens pretty much linearly as you have to stop at shorter distances.

[~4:03] The other one is a mechanical process. So the faster you're going, the harder it is to stop a big hunk of iron.

[~4:10] And when you add these two up, this is the actual stopping distance that you need to stop your car, and you can think of fixing software as a very similar process.

[~4:21] So first you identify that you have a reason why you need to update, then you fix the issue,

[~4:29] and finally you deploy that so your end users can actually take advantage of this new fix. And there's a bunch of examples of this where this has not been a very quick process. So we're getting into the anti-pattern process of this.
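The stopping-distance analogy can be written out as simple arithmetic. The sketch below uses the textbook physics model (thinking distance grows linearly with speed, braking distance with the square of speed) and then adds up the three phases of a software fix in the same way. The reaction time, the deceleration, and the day counts are illustrative assumptions, not figures from the talk.

```python
def stopping_distance_m(speed_m_s: float,
                        reaction_time_s: float = 1.5,
                        deceleration_m_s2: float = 7.0) -> float:
    """Total stopping distance = thinking distance + braking distance."""
    thinking = speed_m_s * reaction_time_s                # linear in speed
    braking = speed_m_s ** 2 / (2 * deceleration_m_s2)    # quadratic in speed
    return thinking + braking


def time_to_remediate_days(identify: float, fix: float, deploy: float) -> float:
    """Time users stay exposed = time to identify + time to fix + time to deploy."""
    return identify + fix + deploy


if __name__ == "__main__":
    for kmh in (50, 100):
        speed = kmh / 3.6  # convert km/h to m/s
        print(f"{kmh} km/h -> {stopping_distance_m(speed):.1f} m to stop")
    # Hypothetical day counts: a quick fix still leaves users exposed for
    # a long time if identification or deployment is slow.
    print(time_to_remediate_days(identify=30, fix=1, deploy=30), "days exposed")
```

The point of the second function is that shortening any one phase is not enough; the sum is what users experience.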

[~4:44] So one example is back in 2017, there was actually a shutdown of a UK hospital due to ransomware attacks. So this is horrible stuff.

[~4:55] So they have X-ray machines, they have dialysis machines, they have patients who depend upon healthcare, and their machines were hacked.

[~5:07] They had to shut down the hospital. They had to patch this, and it took them a very long time to -- Obviously, they identified it pretty quickly, but then actually doing the OS upgrade and then deploying this took a very long time where the hospital was not operational. And the problem in this case was that they were running an outdated version of Windows. They were running Windows XP.

[~5:33] So clearly they weren't doing a good job staying up to date on their updates.

[~5:38] That probably was the first set of issues.

[~5:40] Obviously, the fix is easy, upgrade to a modern version of Windows, which has a bunch of security patches. But it's not that easy when you haven't done updates for decades.

[~5:51] Okay. So, another example. Probably all of us remember Equifax and probably were impacted in some way when they lost all this personal data. It was worse than just a credit card leak because attackers knew your Social Security number.

[~6:06] They could open up new accounts in your name. And actually, there was so much data put out on the market on personal information that Social Security numbers went down to being worth only $10 a piece. So it actually pushed down the price of illegal information because so much of it was available.

[~6:22] So the case here was, it took them a while to even identify that they had a vulnerability. So hackers had access to the security vulnerability for a couple of months.

[~6:32] It turns out that they were running an outdated version of Struts, so they had to do an upgrade of this. Fortunately, it was already fixed in the current version, so it was just an upgrade to an existing library version. But they weren't using a continuous update or a continuous deployment process, so it took them two months to just get the update out to the market.

[~6:52] So again, huge security vulnerability, which was open for several months to identify and then to fix it.

[~7:00] And then, so another example, Spectre and Meltdown. So this hit in January 2018. So Meltdown is the easier of the two because it simply requires a few JavaScript lines of code to hack, but at the same time, it can be fixed systematically. Spectre is a bigger problem because it relies upon branch execution strategies inside the processor.

[~7:22] So basically what you're doing is you're trying to get programs using predictive branch execution to do things, which then give you information about what's happening in memory in another process.

[~7:31] Not only can it be another process running on the same machine, you can also target a virtual machine running inside of it, so it doesn't isolate you between containers.

[~7:40] So this is really, really bad stuff, and in a lot of cases, since there's not simply a fix the hardware manufacturers can put out for it because it's inherent in the design, you need to do this in software. So you need to patch your software to account for Spectre. A bunch of libraries were updated specifically for this. There are constantly new attacks and new ways of exploiting, which need to be addressed. And then you get in the situation where you need to update your software very, very fast if there's a new Spectre exploit or a new way of attacking your software, which isn't accounted for. So you need to identify this as fast as possible, fix it as fast as possible, and then deploy it as fast as possible, right? So this is the reason why you need continuous updates. Stand by, please.

[~8:26] We'll stand by for just a sec.

[~8:29] So now, hopefully, I've convinced you guys you want to update your software faster.

[~8:35] And as you guys know, I'm a Java hacker, and the Java guys actually recently changed their release model, right? So Mark Reinhold announced back in 2017 that they were going to move to a release model where Java, rather than being shipped every five years, or more practically every seven or eight years, they were going to release it every six months.

[~9:01] So this is great stuff, right? So, the core platform that a lot of our enterprise software is built on is actually going to be updated on a more regular cadence. They're going to get more features out more quickly. They're going to help us developers with patching security vulnerabilities more reliably.

[~9:19] And so there's a study which actually looks at the usage of different Java libraries and how well they're adopted.

[~9:27] So this is the State of the Developer Ecosystem report in 2019, 7,000 developers, so a very credible poll.

[~9:35] And, yeah, we're not doing so good on adoption past Java 8.

[~9:40] So Java 8 was the version before the announcement was made on the release cadence.

[~9:44] Java 9, 10, 11, 12, we're up to 13 now, are the more recent releases.

[~9:52] And so we have a problem here, right? What happened?

[~9:57] Okay, so to understand where we are, this is a graph which shows you the thought process people go through when they want to do updates.

[~10:06] And so an update's available.

[~10:09] Do we want it?

[~10:10] So if we don't want the update, if it doesn't have features we want, if there's no security issues, maybe we don't even care about the update, but hopefully, we want it. And then we ask ourselves, are there any high risks with this?

[~10:22] If there's no risk at all, we're probably willing to update it because it's easy.

[~10:25] But if there are risks, then you ask yourself, do you trust the update? Because potentially this could cause issues for me, it could cause downtime, it might need to be retested, and

[~10:36] why?

[~10:39] So when you ask yourself, why do I want to update? Do we trust the updates? Really, the answer to this is best shown in a comic.

[~10:53] So really, the problem is we don't trust the process, right?

[~10:58] So if we trusted the process, we might be willing to go along with updates, but typically, we don't trust large companies to QA the software. We know there's going to be issues. We have a track record in our industry of releasing software which has bugs in it. And this is a complexity problem. If you look at the complexity of software, the complexity of software keeps increasing over time.

[~11:22] So, we start out with agile processes.

[~11:25] So you're releasing software faster. Continuous integration, now we have builds running, we're continuously updating code, we're continuously delivering code.

[~11:37] Next level is you have infrastructure as code, so now everything inside of your organization is treated as code to deploy to servers.

[~11:44] Then you have microservices and serverless and smaller bits of code.

[~11:50] Containers, runtimes like Docker and Kubernetes. And then finally, in the IoT world, everything needs to be updated.

[~11:58] So as the complexity of the system goes up, it's harder and harder to actually determine that you're not going to hit any software bugs when you update.

[~12:05] And the other aspect of this is data.

[~12:13] So the amount of data in the world is increasing exponentially. This is some data from Seagate. So arguably, they sell storage, they predict high. But this isn't too far off, right? So we already know in 2017 that we have over 20 zettabytes. Does anyone know what a zettabyte is?

[~12:33] A lot of zeros. A lot, yeah. Okay. So the answer is a lot of zeros. If you know what a petabyte is, 1,000 petabytes is an exabyte, 1,000 exabytes is a zettabyte. So it's a lot of storage. And the prediction is by 2025, we're going to get to 175 zettabytes. So that's a lot of data, and if you do an update, you need to make sure it works with a lot of data. The question is, how do you test with that much data?

[~12:57] Well, the answer is probably you don't, because it's simply impossible to exactly mirror what's happening in production with large data sets inside of a QA region, inside of a test region.

[~13:10] So, one example of this is some people get these unsolicited letters from China. And then inside the letter, you get a red sock or a little bit of red tape or a black cloth or a little ring, and they're just sent out to random people. There's actually a thread on Facebook about empty envelopes from China. So why are people getting these random envelopes?

[~13:36] Is it some sort of government plot, maybe China's trying to spam the US or destroy the federal shipping system?

[~13:47] So the answer is, this is quality control.

[~13:53] So in a large shipping system where you have to actually verify end-to-end, if you want to do an end-to-end test, the end-to-end test is shipping something to somebody. So in China, like Alibaba and these large companies, they actually test their end-to-end verification by shipping out random packages sometimes to make sure that your package is going to get to the final destination. And so this is the challenge. In extremely large systems, it's hard to check that you actually are not introducing any problems.

[~14:21] Okay, so getting back to how do we update. So, we go back to, do we trust the update? If we trust it, yes, we'll update. If we don't trust it, we have another option. Can we verify the update? So can we verify that this is something which won't break production? And if the answer is yes, then we might trust it, but probably increasingly, the amount of time needed to verify the update is going to be very long.

[~14:47] So it's very long and labor-intensive to verify the update and make sure that it doesn't introduce any problems.

[~14:53] And of course, if we can't verify it, then we're back to no, which is not great. So this is a problem, right? So we have a lot of food on our table. We have a lot of features as a user.
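The decision flow described so far (want it, high risk, trust, verify) can be sketched as a small function. This is only a reading of the diagram as it is described in the talk; the names `Decision` and `decide` are made up for this illustration.

```python
from enum import Enum


class Decision(Enum):
    SKIP = "skip the update"
    UPDATE = "update"
    VERIFY_THEN_UPDATE = "verify first, then update"


def decide(want_it: bool, high_risk: bool, trust_it: bool, can_verify: bool) -> Decision:
    """Walk the update decision flow one question at a time."""
    if not want_it:
        return Decision.SKIP                # no features or fixes we care about
    if not high_risk:
        return Decision.UPDATE              # easy case: nothing to lose
    if trust_it:
        return Decision.UPDATE              # risky, but we trust the process
    if can_verify:
        return Decision.VERIFY_THEN_UPDATE  # slow and labor-intensive path
    return Decision.SKIP                    # cannot trust, cannot verify


if __name__ == "__main__":
    print(decide(want_it=True, high_risk=True, trust_it=False, can_verify=False))
```

Only one path ends in a cheap update of a risky change, the one where the process is trusted, which is where the patterns later in the talk aim.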

[~15:04] Do we need more features at the risk of the update? And the balance here is, is the feature more valuable for doing the update, or is the cost higher for doing the update? So this is what we're doing as either individual users or as consumers of open source libraries or other packages which we're importing into our own projects, is we make this trade-off, like how long is it going to take me to test versus is it a feature I actually need?

[~15:33] And this is a problem, right? This means that you're often going to choose not to update. You might have security vulnerabilities, you might not be exposing features, you might not be taking advantage of the latest libraries, and we need to find a solution to this. So one way of doing this is to look at what the industry does as an example. So we're going to look at some folks in the industry who cheat the system and see what they're doing. How are they getting updates out to their users, or how are they getting folks to make updates without doing the time-intensive verification?

[~16:05] Okay, so there's actually some examples of this you guys probably are very familiar with just from using your computers. So your browser is one example.

[~16:14] Who knows which version of the browser you're running on your desktop?

[~16:19] Okay, so a couple folks do. You're probably developers.

[~16:23] For the rest of us, Firefox started incrementing the version number so quickly on your browser, it's actually hard to keep track. Every time you open the browser, it's like you have a new update. Chrome does the same thing. Safari does the same thing. So basically, you're probably not really keeping track of the version of your browser unless you're doing software development and testing specific browser issues.

[~16:46] Second one is Twitter in your browser. Do you guys even know what version of Twitter?

[~16:51] Well, this one's an easy one, you can't because it's a software as a service, and you probably don't care as long as it works. As long as it's continually being updated and it works, it's fine. Twitter on your smartphone is a similar story. It has a version.

[~17:05] I don't even know how you get access to it because they push app updates to you, but you probably don't care about that either.

[~17:12] What about your smartphone OS? Who knows what version of their smartphone OS? Okay, so a lot more hands, and there's a reason for this.

[~17:20] Updating your smartphone OS is risky.

[~17:24] When you go from one major iPhone version to the next, we all know that the first version of the software is buggy, has issues.

[~17:34] My wife was actually complaining to me on a trip recently because the latest iOS update messed up her to-do list. And I checked online, and there was a known bug with Apple where they put a new to-do app out. The migration of to-do items didn't work in the first version. It's getting fixed in a subsequent update, right? So this is the problem. When you make very large, not granular changes, then the chance of having a high-risk update is much higher.

[~18:03] Okay. So what we'd like to happen is we'd like to have small updates that continually get pushed, but then the question is, what can possibly go wrong with this model? And there's a bunch of things which can go wrong, actually. So OnHub, which is now owned by Google, is a Wi-Fi router that's self-updating. So this is awesome stuff, right? So just like you have your thermostat, you have your Wi-Fi hub, it automatically downloads new software, it updates online. It's a self-improving Wi-Fi hub. You get new features pushed to it constantly. What could possibly go wrong?

[~18:36] Well, of course, a lot, because the Wi-Fi hub is how you get access to the internet. Google pushed an update where they actually broke and reset the settings on the routers. Because the router is your only access to the internet, they couldn't then push an update to fix it.

[~18:53] So this is a problem.

[~18:55] There is a fix to this, but it is slightly complicated. If you're doing something like this on edge devices, you want to have a local rollback strategy.

[~19:04] So basically, in this case, for a Wi-Fi router, what you'd want to happen is it would have a local copy of the last version of the OS. It would do a self-health check to see if it can connect to the internet. If after a certain time the update's not working, it automatically rolls back and calls home. Right? So that's ideally what you'd like to have in a situation like this. The caveat with the local rollbacks is if you don't need it, often the implementation complexity of local rollbacks outweighs the benefit.
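The local rollback flow described here (keep the last version on the device, health-check after updating, roll back on timeout, then call home) might look roughly like the sketch below. Every name in it is hypothetical: `install`, `health_check`, and `report_home` stand in for whatever the device firmware actually provides, and the timeout values are arbitrary.

```python
import time
from typing import Callable


def update_with_local_rollback(
    install: Callable[[str], None],
    health_check: Callable[[], bool],
    report_home: Callable[[str], None],
    current_version: str,
    new_version: str,
    timeout_s: float = 300.0,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Install new_version; fall back to the locally kept current_version
    if the device does not become healthy before the timeout."""
    install(new_version)
    deadline = clock() + timeout_s
    while clock() < deadline:
        if health_check():          # e.g. "can I still reach the internet?"
            return new_version
        sleep(poll_interval_s)
    # The old image is stored on the device, so this step needs no network.
    install(current_version)
    if health_check():
        report_home(f"rolled back from {new_version} to {current_version}")
    return current_version
```

Passing `sleep` and `clock` in as parameters is a design choice for this sketch only: it lets the timeout logic be exercised in a test without waiting in real time.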

[~19:35] So then getting to the Internet of Things, pretty much everything's in the Internet of Things now, and there's a whole bunch of different devices, possibly even smart cars, which get updated constantly, hopefully not while we're driving.

[~19:49] And there's actually an example of why it's important to be doing updates to your car continuously, which is Jaguar had an issue with their cars where they had to do a massive recall, and it was a problem with the braking system.

[~20:06] So obviously, if you're driving a car, you want to make sure that... This is a safety feature. Brakes should obviously be working. Fortunately, in this case, the core braking function was fine. The car stopped. It was the regenerative braking which was broken. So you could take it back for the recall, they would fix it. But the problem with this is it's extremely expensive to then take cars physically back and to do manual recalls of them. So, the answer here is over-the-air updates.

[~20:33] Tesla does this. A bunch of other car companies do this. Jaguar is doing this as well now.

[~20:39] And this helps you to avoid the problem of users not doing updates and also pushing updates when something critical comes out.

[~20:47] So continuous updates are even better than over-the-air updates.

[~20:51] And even though Tesla is doing over-the-air updates, they're not doing continuous updates. And one of the problems it introduces is stuff like this. So there was an issue with Tesla with phantom braking.

[~21:03] The way phantom braking works is you're cruising down the freeway, it has all those automatic collision detection systems working, and it thinks there's an obstacle when there's actually not one, and the car suddenly stops.

[~21:15] So this is incredibly dangerous. It's dangerous for a different reason, right? Not that you can't stop the car, but that somebody might crash into you because the car thinks that it needs to stop for an obstacle.

[~21:25] This was a trending thread on the Tesla forums. It was a big issue. They identified that it was a software issue. They fixed it in a patch. The patch for the phantom braking was in red there. "This release contains minor improvements and bug fixes." It took a couple of weeks to come out, Tesla updates about every two weeks, because it was waiting for a very important feature, which is chess.

[~21:49] So you don't want critical fixes waiting on large features. So the answer here is do granular updates, do continuous updates.

[~22:01] Do batch updates in small sizes, and then this way your end users are getting important fixes, and they're not waiting for large features to come out.

[~22:12] Okay. So, another example of this in terms of the mobile space.

[~22:17] So most of you auto-update applications, right?

[~22:21] And there's a game called Noob's Adventure, which is done by a developer, where he actually documented the process of building the game as part of the game itself. So it's kind of a cute developer story built into a game.

[~22:35] And one of the challenges he ran into on the game was a new feature update which broke a certain percentage of users. Basically, the problem was, with the feature update, some of the Apple servers would return prices without dollar signs, some would return prices with dollar signs. Dollar signs was the templating character used in some of the scripts in the game.

[~23:00] Therefore, even though it was tested and shouldn't have broken, it would break randomly for a certain number of end users.
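This class of bug is easy to reproduce with any template language that uses the dollar sign as its placeholder character. The sketch below uses Python's `string.Template` purely as a stand-in; the game's real scripting system is not described in the talk, and the function names and label text are invented for illustration.

```python
from string import Template


def render_price_label(price_text: str) -> str:
    # Bug: server-supplied text is spliced into the template source itself,
    # so a "$" inside the price is parsed as the start of a placeholder.
    source = "Unlock for " + price_text + " ($item)"
    return Template(source).substitute(item="level pack")


def render_price_label_safe(price_text: str) -> str:
    # Fix: keep the template source fixed and pass the price in as data.
    # Substituted values are not parsed again, so "$" is harmless here.
    return Template("Unlock for $price ($item)").substitute(
        price=price_text, item="level pack"
    )


if __name__ == "__main__":
    print(render_price_label("4.99"))        # works: no "$" in the data
    print(render_price_label_safe("$4.99"))  # works with or without "$"
    try:
        render_price_label("$4.99")          # breaks only for some responses
    except ValueError as error:
        print("template error:", error)
```

Testing against a server that happens to omit the dollar sign would pass every time, which matches the talk's point that the failure only showed up for some users in production.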

[~23:09] And so it took a while to even identify that this was an issue.

[~23:13] It took time to fix the issue, and one of the challenges with the App Store is there's no way to do a rollback.

[~23:21] So you can't say, "I have a bug. I need to roll back to the previous version." You need to push a new version of your software, and then Apple will then go through the whole validation process and let you push the new version of the software. So the update pattern for this is do canary updates.

[~23:38] So when you push a new feature that might affect users, you want to push it to a few users at first, let them test it, and observe the problem, and then if there's an issue, then you can revert back before you affect your entire customer base.
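One common way to pick "a few users at first" is to hash each user ID into a stable bucket and enable the feature only below a percentage threshold. The sketch below assumes that approach; the function name, the feature name, and the 5% figure are illustrative and not taken from the talk.

```python
import hashlib


def in_canary(user_id: str, feature: str, percent: float) -> bool:
    """Return True if this user falls in the first `percent` (0-100) of buckets.

    The same user always lands in the same bucket for a given feature, so
    raising `percent` only ever adds users; nobody flips back and forth.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000   # 0..9999
    return bucket < percent * 100


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    exposed = sum(in_canary(u, "new-store-screen", 5) for u in users)
    print(f"{exposed} of {len(users)} users see the new feature")
```

Including the feature name in the hash means different features get different canary groups, so the same unlucky users are not always first in line.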

[~23:52] Another pattern here is observability.

[~23:55] So some problems are really hard to trace. If you build observability and monitoring into your application, that makes it easier to identify when there's an issue.

[~24:04] And then another thing is rollbacks.

[~24:07] So in the case of the App Store, you can't do rollbacks within the App Store, but what you can do is you can do feature flags. So feature flags allow you to release a version, have some features turned on or turned off via configuration. And then if you hit an issue, you can roll back to the previous implementation inside your own code base rather than relying on the app store to do the rollback for you, which in this case, Apple doesn't support. And then once you get a few versions down and you realize that code's not needed, you can take out the code, which is surrounded by a feature flag.
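A feature flag used as a rollback mechanism can be sketched as follows. It assumes the app fetches a small JSON configuration from a server the developer controls; the flag name, the JSON shape, and the function names are all hypothetical.

```python
import json

# Shipped defaults: the new code path is in the binary but switched off.
DEFAULT_FLAGS = {"new_store_screen": False}


def load_flags(remote_config_json: str) -> dict:
    """Merge remotely supplied flags over the defaults, ignoring bad input."""
    flags = dict(DEFAULT_FLAGS)
    try:
        remote = json.loads(remote_config_json)
    except json.JSONDecodeError:
        return flags                      # unreadable config: stay on defaults
    if isinstance(remote, dict):
        for name, value in remote.items():
            if name in DEFAULT_FLAGS:     # unknown flags are ignored
                flags[name] = bool(value)
    return flags


def store_screen(flags: dict) -> str:
    if flags["new_store_screen"]:
        return "new store screen"         # new implementation
    return "old store screen"             # previous implementation, kept as fallback


if __name__ == "__main__":
    print(store_screen(load_flags('{"new_store_screen": true}')))
    # "Rolling back" is a config change, not a new app store submission:
    print(store_screen(load_flags('{"new_store_screen": false}')))
```

Defaulting to the old path when the configuration cannot be read is a deliberate choice in this sketch: a broken config server should not switch users onto untested code.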

[~24:41] Okay, so another example of this is an entirely different space. We've been talking a lot about IoT devices and mobile devices, but the same thing applies to server-side software, which a lot of us build. And Knight Capital, this is a historic example of a huge fail in our industry. John Willis loves to use this example as well.

[~25:05] And basically what happened is a company disappeared overnight by a big IT failure in how they do their DevOps tool chain.

[~25:16] And what happened in this case is they had a bug which was introduced because one out of eight of their servers wasn't updated. So there's a manual process for doing the server updates.

[~25:29] They made changes in the API between the client and the server.

[~25:33] If you hit one of the servers which wasn't updated, it failed.

[~25:37] When they tried to debug this problem, they rolled back the servers.

[~25:43] So now all eight out of eight servers were running old code. The APIs, the clients are running the new code. All of the API requests are failing now. If you're in the trading industry, where you have millions of dollars every minute exchanging, you can imagine debugging a situation like this, where you're losing money, your servers are down, incredibly stressful. They finally identified the issue and figured it out, but by that time it was too late. They lost $400 million and went out of business the next day.

[~26:13] So this is a classic example of failure, and in this case, automated deployments would've helped them out, right?

[~26:20] We're really, really bad at repetitive tasks. If you can automate tasks which humans do, you're less likely to have problems with this.

[~26:29] It's going to be easier to debug and troubleshoot.
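One small piece of an automated deployment is a check that every server actually reports the version that was meant to go out. The sketch below assumes the deployment tool can collect a host-to-version mapping; the host names and version strings are made up, and nothing here reflects Knight Capital's real tooling.

```python
from typing import Mapping


def stale_servers(deployed: Mapping[str, str], expected_version: str) -> list[str]:
    """Return the hosts that are not running the expected version."""
    return sorted(host for host, version in deployed.items()
                  if version != expected_version)


def verify_fleet(deployed: Mapping[str, str], expected_version: str) -> None:
    """Fail loudly instead of letting a partially updated fleet take traffic."""
    stale = stale_servers(deployed, expected_version)
    if stale:
        raise RuntimeError(
            f"{len(stale)} of {len(deployed)} servers not on "
            f"{expected_version}: {', '.join(stale)}"
        )


if __name__ == "__main__":
    fleet = {f"srv{i}": "2.0" for i in range(1, 8)}
    fleet["srv8"] = "1.9"              # the one server somebody missed
    print(stale_servers(fleet, "2.0"))
```

A script runs this check identically on the eighth server as on the first, which is the property a manual process cannot promise.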

[~26:33] And another one is to do frequent updates. If you only update infrequently, when you actually go to do the updates, you don't have the muscle memory needed to actually effectively do software updates in a repeatable way that's reliable.

[~26:48] And then finally, state awareness.

[~26:50] So something to keep in mind, and it affected this case, is that when you're deploying code, you have to be very careful about the state of the system, the APIs which you're using, because that can also affect how the updates happen.

[~27:03] And rolling back might not fix things if you have state involved in the system.
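State awareness can be reduced to one question to ask before a rollback: can the version being rolled back to still talk to everything that is already live? The sketch below models that with sets of API version numbers. It is a deliberately simplified assumption; real systems also have to consider database schemas, queued messages, and other persisted state.

```python
def rollback_is_safe(rollback_server_apis: set[int],
                     live_client_apis: set[int]) -> bool:
    """A rollback is only safe if the older server build still speaks
    every API version that currently deployed clients are using."""
    return live_client_apis <= rollback_server_apis


if __name__ == "__main__":
    old_server_build = {1}     # the old build only understands API v1
    live_clients = {2}         # clients were already moved to API v2
    if not rollback_is_safe(old_server_build, live_clients):
        print("rollback would break live clients; fix forward instead")
```

In the scenario as told in the talk, a check like this would have flagged that rolling the servers back left new clients with nothing compatible to call.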

[~27:09] Okay, and then one final example.

[~27:12] So Verizon had an outage which was caused by some of their upstream providers, or rather Cloudflare had an issue, and they blamed Verizon and Noction. So he was the CEO of Cloudflare, blaming some of the folks downstream from them.

[~27:32] And so this is kind of how we work in the business, right? So when we have an issue and we see our competitors fall down, we like to point a finger and say, "Oh, look at those guys. They screwed up."

[~27:43] A couple of weeks later, Cloudflare went down by themselves.

[~27:49] And the internet was not kind to them.

[~27:51] So if you bash other people, it comes back to you in spades.

[~27:56] So, this was an example where first of all, there's a real-life pattern. Be kind.

[~28:06] And then there's also a technical problem here. So basically, the whole cloud went dark as a result of Cloudflare having issues, and the reason why this occurred is because they had a bunch of regular expressions which were used for filtering.

[~28:21] One bad regular expression caused the entire system to spike on load, and that took down the entire system. So it's very simple. As you guys know, if you've done any regular expressions, they're very easy to write, impossible to read, impossible to validate the correctness of, and you can destroy your entire deployment with a single misconfigured regular expression, and in this case, affect the entire Earth.
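How a single regular expression can spike CPU load is easiest to see by timing one. The pattern below is a textbook case of catastrophic backtracking and is not the expression from the Cloudflare incident; the input sizes are arbitrary and absolute timings will vary by machine.

```python
import re
import time

# Nested quantifiers: on a near-miss input the engine tries every way of
# splitting the run of "a"s between the inner and outer "+".
PATTERN = re.compile(r"(a+)+$")


def match_seconds(n: int) -> float:
    """Time one failed match against n "a"s followed by a "b"."""
    text = "a" * n + "b"       # the trailing "b" guarantees the match fails
    start = time.perf_counter()
    result = PATTERN.match(text)
    elapsed = time.perf_counter() - start
    assert result is None
    return elapsed


if __name__ == "__main__":
    for n in (12, 14, 16, 18, 20):
        print(f"n={n:2d}  {match_seconds(n):.4f}s")
```

The work roughly doubles with each extra character, so an input only a few dozen characters long can pin a CPU core. That is why a new expression deserves the same staged rollout as new code.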

[~28:48] So again, the pattern here is canary releases. Don't ship your code. Don't ship your regular expressions on all the servers. Test it on a few.

[~28:57] Netflix, Facebook, a whole bunch of companies are very good at doing this, where they test all their new releases on us, the unwitting end user.

[~29:05] But just a few of us, and then they roll back if there's an issue.
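A canary release across a server fleet can be sketched as a loop over growing stages with a health check between them. The stage fractions and the `deploy`, `healthy`, and `rollback` callables are placeholders for whatever a real deployment system provides; nothing here describes how Netflix, Facebook, or Cloudflare actually do it.

```python
from typing import Callable, Sequence


def staged_rollout(
    servers: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    stages: Sequence[float] = (0.05, 0.25, 1.0),
) -> bool:
    """Deploy to a growing fraction of servers; undo everything on failure."""
    done: list[str] = []
    for fraction in stages:
        target = max(1, int(len(servers) * fraction))
        for server in servers[len(done):target]:
            deploy(server)
            done.append(server)
        # Check every updated server before widening the blast radius.
        if not all(healthy(server) for server in done):
            for server in reversed(done):
                rollback(server)
            return False
    return True


if __name__ == "__main__":
    fleet = [f"edge-{i}" for i in range(20)]
    ok = staged_rollout(
        fleet,
        deploy=lambda s: print("deploy", s),
        healthy=lambda s: False,            # simulate a bad release
        rollback=lambda s: print("rollback", s),
    )
    print("rollout succeeded:", ok)
```

With a bad release, only the first stage (one server out of twenty in this example) is ever touched before the rollout stops and reverts.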

[~29:09] And to summarize, these are all of the different patterns we talked about. So doing frequent updates, doing automatic updates, making sure that it's all tested, doing canary releases, being state-aware in your system, having observability, and in some cases, doing local rollbacks. It doesn't always make sense, but if you have edge devices where you think there might be problems updating them if they have an issue, having a local rollback strategy will help you on the edge.

[~29:39] Okay, so getting back to our diagram. So do we want it?

[~29:42] Sure. Yes. And even if not, we're going to auto-update you.

[~29:46] Are there high risks?

[~29:48] If there aren't, great, but if yes, hopefully, we now trust the update, and we can actually get folks to effectively update our software and take incremental updates to what you're working on.

[~30:01] So, a quick quote.

[~30:03] This is from the "Liquid Software" book.

[~30:06] "Our goal is to transition from bulk and rare software updates to extremely tiny and extremely frequent software updates, so tiny and so frequent they provide an illusion of software flowing from development to the update target."

[~30:22] This book is co-authored by our founders, Yoav Landman and Fred Simon, and Baruch Sadogursky, who is one of our developer advocates.

[~30:32] And you can come by our booth and find out more, but this is kind of the overall picture we're trying to paint. So all of that stuff in the top right, where you need updates or you need automation, that's liquid software. There are some cases where you might need to do manual updates, or you might need to avoid updates in critical situations like...

[~30:53] Does anyone want their plane updated while they're flying on it?

[~30:58] Okay, so perhaps not, but what about if there's a hacker on your plane? Would you want a security vulnerability patched? Yeah.

[~31:06] Okay, so maybe a few folks want the update in that case.

[~31:09] So hopefully, we're all moving towards a world of continuous updates. Thank you guys very much for coming to see the presentation, and enjoy the rest of the conference.