You Wouldn’t Drive Kids to School in a Golf Cart. Why Run Business Applications with a Build Tool?
Back in the day – five to ten years ago – business applications were orchestrated with purpose-built enterprise solutions owned and managed exclusively by IT Operations.
Along came the concept of DevOps, and the SDLC transitioned to the fully automated delivery pipeline you know today, where everything is expected to be embedded into the code. Since developers know and love scripting, or cron, or Jenkins, or whatever automation tool they have access to, those became the tools they used to build their operational instrumentation, and they called it a day!
How would DevOps folks react if you suggested they should manage their delivery pipeline with cron?
Yeah, that’s how everyone should react to managing payroll or inventory or payments or any sophisticated business service with Jenkins or Puppet or Chef or ANY tool not purpose-built for such functionality.
Come to learn how PayPal, Amadeus and similar enterprises are orchestrating their critical business applications with a DevOps-enabled, Jobs-as-Code approach.
Joe Goldberg is an IT professional with several decades of experience in the design, development, implementation, sales and marketing of enterprise solutions to Global 2000 organizations. Joe has been active in helping BMC products leverage new technology to deliver market-leading solutions with a focus on Workload Automation, Big Data, Cloud and DevOps.
Full transcript
The complete talk, organized by section.
Joe Goldberg
If you look at this title, you might think that I have a slightly different intention than I do, so I want to be clear. I am not here to bash build tools. I think build tools are great. What I want to talk about, however, is domain specificity.
There are a lot of different categories of automation. If we look at our SDLC or our pipelines, however you want to look at the set of tools you have, you've got build tools, configuration management tools, security tools, testing tools, networking tools, database tools -- tons and tons of different tools. All of them provide some measure, probably a great deal, of automation. If we're talking about an automated pipeline, obviously a great deal of automation. But they are all a little bit different, and in some cases a lot different.
There is certainly some measure of overlap when you're dealing with automation, but I would argue that each one of them exists because it has a set of functions that are specific to its particular domain. If you ever have questions about how specific they are, if you're a build person, imagine having to do builds with, let's say, Ansible. Or if you're a config management person, imagine managing your infrastructure with Jenkins, and so forth. Each one of these tools has its place.
I would argue that in the operational world, there's a great deal of tool availability as well. They all have their place because they have specific operational capabilities, such as the ability to maybe visualize and perform operational activities and interact with stuff in your environment, and we'll talk about those. In the operational world, I just selected a few. I put Control-M first -- that's our particular solution -- but there's a bunch. If you have any doubts about how many, every single cloud vendor, lots of commercial tools, lots of organizations have built and open sourced their own tools. There are just a ton of operational workflow management tools.
What has existed in the past, and the reason this is perhaps a subject of conversation today when it wasn't in the past, whether it's the recent past or if you're still taking a traditional approach to how you manage your entire SDLC, is that there have been these silos that imposed that domain specificity. Not that it was a good thing that these silos existed, but if you were a build person, you kind of threw stuff over the wall to your system administrators or config management. They used their tools. They didn't even have access to your tools if they wanted to. If you're in config management or security management or database management, et cetera, you had your own tools, and probably you didn't have access to anybody else's tools. That was certainly the case in the operational world.
The kinds of tools we're talking about that manage workflows and orchestrate applications still today, in many organizations, are the domain and ownership of operations and the operations folks. There is either a request mechanism or whatever, and this is the nature of the situation as it existed for quite a long time.
As these walls began to disappear, I think we today suffer from what I would describe as the hammer syndrome. It's really who's wielding the hammer as to what tool they want to use. If you're a build person, everything looks like a build problem. If you're a config management person, everything looks like a config management problem, and so forth and so on. This is the argument for domain specificity. There's good reason why we have these different tools. There's a great set of capabilities that they each have for their particular domain, and the argument simply is: choose the right tool for the job.
If we're talking about orchestrating applications in a production environment, what are some of the capabilities that I think are critical for us to have in this kind of environment? You want to be able to have a view that is abstracted and elevated above technology for the purposes of running applications. You're running business services. You have to deliver or support business services. When your customers or your customer's customers, or whoever is consuming those services, interact with them, they have either little or possibly no interest in the underlying technology. They don't care how complex your stack is. They don't care whether you're on-prem or in the cloud. They don't care whether you're using containerization. They have a service that they want to consume, and you, from an operational perspective, have to be able to support it.
If there's a problem anywhere in that complex, arguably ever-increasingly complex environment, you need to be able to find out where it is. You need to understand what is the impact of a problem in one place, downstream or upstream. This notion of end-to-end view across a complex technology stack becomes really, really important.
Another characteristic is that once you get into production, or into the data center, or however you're exposing your services to the world, then there are certain things that are taken for granted and expected. Things like auditing and security and scale just have to be there. A tool may be great for relatively low-volume activity, but when you get to high volume in a production scenario, there are certain expectations that have to be met.
One that is frequently overlooked, especially for a technical audience and from a technical perspective, is the diversity of users that exist in the operational production environment. Going back to that hammer picture: if you're a developer, an engineer, or even an operations analyst, your tool, your technology, is something that you work with and live with all the time. It may be highly technical and complex, but you're perfectly comfortable with it, and that may be great at some stage in the SDLC. But when that facility is exposed to the world, when you get into an enterprise or web-scale deployment, you've got people who may be highly technical and savvy, who can consume whatever you put out there. And you can have people at the opposite end of the scale who are not very technically savvy, or even if they are, they're not interested in becoming expert in a particular technology. Their goal, their need, is to consume a particular service for the purposes of either the business or the transaction that they're performing. It is this diversity that frequently is not apparent until you get to at least some kind of user testing or fairly far down the SDLC. So, a really important consideration.
I mention cloud and containers today because those are the conversations, the environments and technology stacks that we talk about today. Obviously, whatever tooling you use has to be able to support the complex environments and the technology stacks we have today. But it really has to be open to a degree that it can evolve and support future stuff. Today we're talking about cloud and containers. A couple of years ago, we were talking about virtualization. A couple of years before that was internet and web technology, and before that it was something else. I am sure a couple of years from now we'll be talking about other things. Containers and cloud will be old hat, everybody will be doing it, and there will be a need to be doing or supporting other things in addition.
Another characteristic that's a function of enterprise is that rarely do things disappear entirely. The need to orchestrate and have dependencies and visualization across complexity means that complexity is not just the stuff you have today, but probably stuff that you've had in the past that you may be bringing forward, and certainly stuff that you're going to have in the future.
The final thing, depending on your perspective, is either, well, duh, yeah, or a major challenge. If you're looking at traditional operational tools, the ones that have lived and still today exist in the operational world, many of them were not built for, and today still are challenged by, integration into an SDLC. Given that we're talking about DevOps and CI/CD, whatever tooling you have must be something that can be embedded in an automated pipeline.
The term that I'm using here is jobs as code, but if you expand that thinking, what we're talking about here is operational instrumentation, application workflows, however you want to define it, that are a logical part of the application. That means just like you have Java or Python or whatever language of choice you're coding your business logic in, and whatever tool or language of choice you're coding your infrastructure in, you need to be able to code your operational instrumentation similarly in code. Commit it to version control. Submit it to all facets of your automated delivery pipeline, so that in order to realize the real benefits of DevOps or CI/CD, where eventually however you choose to deploy into production becomes a non-event, that can only happen if the entire application, including this kind of instrumentation, has also been riding along the application, has been built together, has been tested together, and has been inserted or embedded into whatever test environments you've constructed as you move down the line.
That becomes a really critical component. Coming at this world from an operations perspective, the tools that in the past have lived in operations are challenged by this. I would argue this is absolutely mandatory: you have to have these kinds of characteristics in whatever tooling you're going to select.
A couple of stories from customers that have been using our tooling reflect this kind of journey. Both of these are fairly large companies. Hopefully one at least is a household name. The other one is perhaps a little bit less known, but certainly a huge enterprise.
In the case of PayPal, they are really a financial services organization, so a lot of their applications still include things like moving money around, doing reconciliation, and payment processing. Fundamentally, that's what they are as a company. They've been using this kind of tooling to run those kinds of applications for a very long time. Several years back, they embarked on their journey to move from the traditional ways that they were developing applications -- which, as you can see, were time-consuming and slow -- to a position where they now have this developer portal that lets them automate every aspect of how they build applications, including how they build their workflows. Those workflows now are constructed in code, ride along with the entire application, and are completely tested as they go from inception and construction or development all the way through to production.
Similarly, Amadeus is a company that provides IT services to the travel industry, and they too have embarked and are well along on the DevOps journey that has seen them move from a traditional data center to a private cloud with a mix of public. They also need to manage and move all of their workload and workflows dynamically among those environments, and the only way they could meet that test was to move to a DevOps model where the workflows were constructed in code, rigorously tested throughout, and deployed to production. Because if they didn't do that, most of us probably wouldn't be here. We'd be somewhere else stuck in an airport. Amadeus touches about 95% of all commercial airline traffic in the world. If they have any kind of outages, they're reflected in airline traffic literally around the world. So, a highly demanding environment.
What I'd like to do for the rest of the time is give you a little bit of a demo and a flavor of what I'm talking about. Let's say I'm a developer, or an architect, or part of a team that's embarking on creating a new application. We sketch out, obviously, a very simplistic flow, and hopefully this gives you insight into what I'm talking about, what kind of instrumentation we're talking about, and where it fits in.
I'm talking about predictive maintenance for trucks in this case. This is actually based on a real customer use case. I've got IoT data streaming from my vehicles, and it's all streaming in real time. It lands in a public cloud, serviced and delivered by a telematics company. The goal that I have here is to identify potential vehicle failures and get them repaired while the vehicle is still mobile, so I don't have to find it and tow it and all this other stuff -- obviously with a goal toward significantly reducing the amount of time it takes to maintain my vehicles.
These are some of the things I have to think about. I'm getting the data in. I have a regression model that's going to determine whether a particular set of information is indicating a possible failure. However, I've got a model that I need to train, so I've got to do that every so often. As maybe my predictive capabilities change, I need to train that model once a week, once a month, or when I see that the predictive levels maybe are dropping. That's something I need to do on a deferred basis: standard housekeeping and maintenance.
In addition, once I get an indication of potential failure, I have got to enrich it and marry it with a whole bunch of traditional system-of-record data. I need to find out who the customer is, what the vehicle warranty information is, what kind of parts I need, and where my service center is. I can determine where the truck is. I could book a service appointment, but maybe the distribution or repair center doesn't have the part. I need to order it and get it to that particular location. Maybe that service center doesn't have space for a few hours or a few days. These are all the things that I have to take into account. I may have to order the part using my inventory system, and I have to wait until that's going to become available, all keeping in mind how long I am predicting the vehicle is still going to be operational before it fails.
You can see that there are a lot of different components to such an application, many of which happen in real time based on either data streaming from a vehicle or what's going on with the vehicle location, but also other things that are either deferred or somehow long-running or interacting with other applications and systems.
Let's take a quick look at the components that I'm dealing with. I'm using Eclipse here as a model, but it really doesn't matter which IDE or development environment I'm using. You can see up in the number of tabs that I have, I'm dealing with a whole bunch of different components to my application. I've got some Scala code because I've got a linear regression model. I've got some sort of make files or build information. And I have this JSON here that is my workflow. This workflow is that diagram or flowchart that you just saw: when I get some data, I need to extract from my traditional systems, marry and enrich that data, do some file transfers, and may need some additional resources and so forth. It's all here.
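To illustrate what a workflow expressed as JSON can look like, here is a minimal sketch in Python. The key names ("jobs", "flow", "command", "run_as") and the job names are hypothetical, chosen only to mirror the flow described in the talk; they are not the schema of the tool shown in the demo.

```python
import json

# Hypothetical workflow definition. The structure and key names are
# illustrative assumptions, not the schema of any real product.
workflow = {
    "name": "predictive_maintenance",
    "jobs": {
        "extract_customer_data": {"command": "extract_customers.sh", "run_as": "etl"},
        "enrich_telematics": {"command": "enrich.py", "run_as": "etl"},
        "transfer_results": {"command": "transfer.sh", "run_as": "ops"},
    },
    # Each pair means "the first job must finish before the second starts".
    "flow": [
        ["extract_customer_data", "enrich_telematics"],
        ["enrich_telematics", "transfer_results"],
    ],
}

# Serialize to JSON text, the form that would sit next to the application
# source and be committed to version control with it.
workflow_json = json.dumps(workflow, indent=2, sort_keys=True)
print(workflow_json)
```

Because the definition is plain text, it can be diffed, reviewed, and versioned like any other source file in the application.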
I, as a developer, probably am not super familiar with this kind of stuff. Like any other language, I need something that will allow me to validate this. There's a service that we provide that will let me validate this stuff. I can validate my syntax, and this is telling me that it's correct. Let's say if we make an error, just to highlight that we have the potential to make errors and see how they fall out. I just introduced an error, and I get told that I get an error. I can fix that and make sure that I get it correct. I iterate through this process of building this.
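The kind of check a validation step performs can be sketched as follows. This is a simplified stand-in, not the validation service used in the demo: the function name `validate_workflow` and the "jobs"/"flow" layout are assumptions made for the example.

```python
import json


def validate_workflow(text):
    """Return a list of error messages; an empty list means the definition passed."""
    try:
        workflow = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} at line {exc.lineno}"]
    if not isinstance(workflow, dict):
        return ["top level must be a JSON object"]

    errors = []
    jobs = workflow.get("jobs", {})
    if not jobs:
        errors.append("workflow defines no jobs")
    for name, job in jobs.items():
        if "command" not in job:
            errors.append(f"job '{name}' has no command")
    # Every job named in the flow must actually be defined.
    for upstream, downstream in workflow.get("flow", []):
        for name in (upstream, downstream):
            if name not in jobs:
                errors.append(f"flow references unknown job '{name}'")
    return errors


good = (
    '{"jobs": {"extract": {"command": "extract.sh"},'
    ' "enrich": {"command": "enrich.py"}},'
    ' "flow": [["extract", "enrich"]]}'
)
# Introduce a deliberate typo in the flow, as in the demo.
bad = good.replace('"enrich"]]', '"enrihc"]]')

print(validate_workflow(good))
print(validate_workflow(bad))
```

The first call reports no errors; the second reports that the flow refers to a job that was never defined, which is the kind of mistake that is cheap to catch before anything runs.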
Even this process of building it has a challenge in large organizations: I need to understand what my standards are. There's a whole standards and rules definition under the covers that tells me what the standards in the operational environment are and helps me along.
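A standards check of this kind can be sketched in a few lines. The two rules below (a naming pattern and a list of approved run-as accounts) are invented for illustration; real site standards would be defined by the operations team.

```python
import re

# Hypothetical site standards, assumed for this example only.
JOB_NAME_PATTERN = re.compile(r"^[a-z][a-z0-9_]{2,39}$")
ALLOWED_RUN_AS = {"etl", "ops"}


def check_standards(jobs):
    """Return one message per violation in a mapping of job name -> settings."""
    violations = []
    for name, settings in sorted(jobs.items()):
        if not JOB_NAME_PATTERN.match(name):
            violations.append(f"{name}: name does not match {JOB_NAME_PATTERN.pattern}")
        run_as = settings.get("run_as")
        if run_as not in ALLOWED_RUN_AS:
            violations.append(f"{name}: run_as '{run_as}' is not an approved account")
    return violations


jobs = {
    "enrich_telematics": {"run_as": "etl"},
    "TransferResults": {"run_as": "root"},
}
for message in check_standards(jobs):
    print(message)
```

Here the second job breaks both rules, so the developer learns about the operational standards at build time rather than at a hand-off to operations.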
Once I validate the syntax, if I'm a developer, I want to make sure this thing actually works. I wouldn't just code Java or Python and then, as soon as I had no errors, assume it was correct. I need to execute it to make sure that it's correct. Similarly here. You may have seen that I have an option to actually test. What will happen with that is that this is going to get submitted to an environment to actually run this stuff. The flowchart that you saw that we have now built via this JSON, I can run this to make sure that it runs successfully. This is kind of my unit testing.
I have a personal test environment. I could either run that personal test environment as a virtual appliance in VMware or VirtualBox. In my case, I'm using an environment on AWS, and you can see that I got a response, that I've got a run ID here. I can even look at this environment and see what that set of jobs looks like. There's my flow, and this is actually going to run.
In this case, this thing is waiting for user confirmation. I didn't want it to take off and run before I got here, so you can see it's waiting for me to confirm and it'll let it run. I'm not going to bother with doing that because I don't have the time, but the point is that I, as a developer, can not only validate this syntactically, but logically execute it and make sure that it's correct.
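The test-run behavior described here (submit a flow, get a run ID back, and have a job wait for confirmation) can be modeled with a small in-memory simulation. The `Sandbox` class, its method names, and the status strings are all hypothetical; they stand in for a personal test environment and are not a real API.

```python
import itertools


class Sandbox:
    """In-memory stand-in for a personal test environment (hypothetical)."""

    def __init__(self):
        self._run_ids = itertools.count(1)
        self._runs = {}

    def submit(self, jobs, flow, hold=()):
        """Register a run and return its ID; jobs in `hold` wait for confirmation."""
        run_id = f"run-{next(self._run_ids)}"
        status = {
            job: "waiting_for_confirmation" if job in hold else "waiting"
            for job in jobs
        }
        self._runs[run_id] = {"status": status, "flow": list(flow)}
        return run_id

    def confirm(self, run_id, job):
        """Release a job that was held for user confirmation."""
        status = self._runs[run_id]["status"]
        if status[job] == "waiting_for_confirmation":
            status[job] = "waiting"

    def advance(self, run_id):
        """Run every job whose upstream jobs ended OK; repeat until nothing can start."""
        run = self._runs[run_id]
        status = run["status"]
        progressed = True
        while progressed:
            progressed = False
            for job, state in status.items():
                upstream = [u for u, d in run["flow"] if d == job]
                if state == "waiting" and all(status[u] == "ended_ok" for u in upstream):
                    status[job] = "ended_ok"
                    progressed = True
        return dict(status)


sandbox = Sandbox()
run_id = sandbox.submit(
    ["extract", "enrich", "transfer"],
    [("extract", "enrich"), ("enrich", "transfer")],
    hold={"extract"},
)
before = sandbox.advance(run_id)   # nothing runs: the first job is held
sandbox.confirm(run_id, "extract")
after = sandbox.advance(run_id)    # the whole chain can now complete
print(run_id, before, after)
```

The held first job blocks everything downstream until it is confirmed, which is the behavior shown in the demo.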
At that point, I can pass it on. The way I would pass it on is I would commit to my version control. I'm using Git in this case, but whichever one you happen to use. So my developer work is done. Once I commit, then the things that you would normally expect are going to happen. Whether it's just updating the flow itself or whether I'm updating other parts of my application, I am then going to trigger a build. Everything here is done a little bit slowly so that we can take a look at it.
Here is a very simplistic Jenkins pipeline. First I'm going to build my application. I'm using the Scala build tool because I've got Scala code. I am then going to use a service to do the validation of the syntax, as I showed you interactively. I would then deploy it to my first stage of testing, and then I might use a testing framework like Robot to run tests against it. If that was successful, I then may deploy it or push it to my next environment, and so forth down the line. All the kinds of things that you would expect to do with any other component of your application, I can do here.
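The shape of such a pipeline, with stages run in order and the pipeline stopping at the first failure, can be sketched in Python. This is not Jenkins pipeline syntax; the stage names follow the talk, and the lambdas are placeholders for the real build, validation, deployment, and test steps.

```python
def run_pipeline(stages):
    """Run stages in order and stop at the first failure.

    Returns (names of completed stages, name of the failed stage or None).
    """
    completed = []
    for name, action in stages:
        if not action():
            return completed, name
        completed.append(name)
    return completed, None


# Placeholder actions: each returns True on success. The test stage is made
# to fail here to show that later stages are never reached.
stages = [
    ("build application", lambda: True),
    ("validate workflow", lambda: True),
    ("deploy to test", lambda: True),
    ("run tests", lambda: False),
    ("promote to next environment", lambda: True),
]

completed, failed = run_pipeline(stages)
print("completed:", completed)
print("failed at:", failed)
```

The point of the structure is that workflow validation and testing are ordinary stages that gate promotion, in the same way as the application build.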
In our particular tool, we have exposed all this functionality both via RESTful web services and via a Node.js CLI, which is a thin wrapper around those same REST APIs. I'd like to speak a little bit about the different services and how they map to different phases in your pipeline. Our thinking here, and I think this is the way you must think about this, is to deconstruct your pipeline and then make sure you have services for the tooling that you're using that will support each one of those phases.
We saw the build service. The build service validates syntax, so I create my objects and I can validate them. The next service that we used was the run service, which lets me execute. It's not just simply running the jobs, but being able to interact with them. I can retrieve their output. I can perform operational actions. This is important in an operational world, but even in the context of running a pipeline.
I'm using Robot in this example for my test scenarios. In these tests, I'm performing the testing using those functions that I described. When I talk about running stuff over here, I want to be able to perform those same kinds of tests or those same kinds of operations in my test environment. I want to be able to run a job. I want to be able to look at the output to make sure it's okay. I may want to sort of haphazardly or randomly kill some of my jobs and see what the effect is going to be downstream.
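The "kill a job and see what happens downstream" test can be reasoned about with a simple graph walk. The sketch below is plain Python rather than Robot Framework, and the job names and dependencies are assumed for illustration from the predictive maintenance example.

```python
from collections import deque


def downstream_of(flow, failed_job):
    """Return every job that directly or transitively depends on failed_job."""
    children = {}
    for upstream, downstream in flow:
        children.setdefault(upstream, []).append(downstream)

    affected, queue = set(), deque([failed_job])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


# Hypothetical dependencies: (upstream job, downstream job).
flow = [
    ("extract_customer_data", "enrich_telematics"),
    ("enrich_telematics", "order_part"),
    ("enrich_telematics", "book_service"),
    ("train_model", "score_vehicle"),
]

# Report the blast radius of killing each job in turn.
all_jobs = sorted({job for pair in flow for job in pair})
for job in all_jobs:
    print(job, "->", sorted(downstream_of(flow, job)))
```

A test can then kill a job in the test environment and assert that exactly the jobs in this set, and no others, fail to run.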
Whatever complex scenario you want to construct, you have the ability in your testing framework to use these services to perform those kinds of operations. The intent is to make sure that as you move down the pipeline, just as you would with your business logic, you have the same kind of capabilities to perform the same kind of operations and the same level of testing on your operational instrumentation because it is an equal participant in your application.
This contrasts significantly with an approach we see lots of people take, where frequently the first time this instrumentation is actually invoked is when jobs get into either production or very close to production. Imagine if you are a developer writing a bunch of Java code and then, before you push to production, somebody changes half of it. You would say that's absolute insanity. But when it comes to the instrumentation that runs the application, those are the kinds of things that are frequently done. This is the argument against that.
If you do want to find out more, another thing I think is important is not just to have the capabilities, but also to be able to access them. For example, there is some information available on GitHub for this virtual appliance. If you want to get familiar with this particular solution, you can download what we call the workbench. It is about a one-gig download, but once you get it downloaded, from the time you fire it up you can be writing and running jobs and testing this stuff in a matter of minutes.
In addition to the actual appliance itself, you have a bunch of samples. If you go to github.com/control-m, there are several repos with samples and everything from a "Hello, world" to much more complex and sophisticated examples. If you go back to that first page that I showed you, there's a bunch of other resources here, such as all of the code reference and documentation. There's a Swagger UI reference to show you all of the API calls that are available and the functions. The intent is to provide you a rich set of capabilities that perform operational actions but can be consumed by developers and engineers within the SDLC or their CI/CD pipelines, just like they do for any other component of the application.
Q&A
At this point, I have, I think, a couple of minutes before we end. I'll open it up for any questions if anyone has any. And if not, I'll let you take off for lunch. Oh, sorry, did you?
Audience member: We have the same type of jobs. The problem that we face is we have a lot of ETL-related jobs, and our departments are geographically distributed. In the job space, our concern is: my application runs at this stage, and there will be a job coordinator or someone who's defining that process flow. With this flow, it looks like the person who writes the application also owns the flow.
Joe Goldberg: Could absolutely, yeah. What you're describing, I think, is a much more traditional operational environment where there is a central group, or IT ops owns it, and you submit some kind of request. If that works, that's great, but everything that we have been hearing over the last several years from customers is that it slows down the entire process. We had an internal event just last week where a large health insurance company spoke about their experience in moving to this kind of model. When they did their analysis of what was slowing down their ability to deliver, they found that their application was sometimes ready to go for anywhere from one to three weeks while it waited on operations. Because operations was backed up, and because they're a health insurer that has, I don't know, tens of thousands of developers but a central group of 10 people doing this, it was just simply slowing them down.
This is really one of the aspects of DevOps and CI/CD: democratizing this kind of functionality. If your organization can sustain it, this is certainly an alternative. Again, I would argue the reason that you're probably doing what you're doing is because of standards, compliance, and governance, and all of these can be addressed in an automated world. It shouldn't really be any different for the operational instrumentation than it is for the business logic and all the other components of that application.
Audience member: How do you handle database changes, and specifically DDL changes?
Joe Goldberg: The question was how do we handle database changes, and specifically DDL changes. From our perspective, we don't have to deal with that. That's part of the application; they have to take care of it. What we deal with is just the execution layer. But I would say in general, the way that that has to be addressed is that your test environment has to have all of the same components that your production environment has, and you have to be able to apply those changes and then execute whatever queries or applications may be dependent on those changes to make sure that they've been done correctly as you move down the pipeline.
Well, thank you very much. I think we're out of time, so thank you.