Oracle Corporation: 'oRE - DevOps Transformation at OMC
Ajay Chankramath will talk about how his team went about creating an SRE model at a place with very rigid ideas on who should do what. This has been challenging not just because of cultural issues but also because of significant cost cutting as Oracle was forced to move to Cloud from a traditional capex model.
Ajay Chankramath has more than 20 years of experience in Development and DevOps leadership roles in various industries. He is currently the Director of Platform & Release Engineering at the Marketing Cloud division of Oracle. Prior to that, he was Vice President of Development at Broadridge, a leader in Fintech, and held several senior DevOps management roles at Xilinx, the pioneer of fabless semiconductors. He is passionate about Developer Productivity, Self Sufficiency, SRE, cultural transformation and breaking down silos in organizations using lightweight self-enforceable processes.
Full transcript
Ajay Chankramath
Okay, let's get started. I have an experience report, the way that Gene puts it every time. I have a classical experience report I want to talk about.
Did any of you attend Gian's presentation yesterday on site reliability engineering? Okay, we got a couple of people there. That's good. If you did, I'm sure you're going to appreciate this a little bit more. How about the presentation from Douglas at Rundeck this morning, the plenary session? I think that was also very relevant to what I'm going to talk about today.
Let's jump right in. You see the name here is `’oRE`. I do see some Oracle folks there. Are there any other Oracle folks in this audience? I see a few. I don't represent all of Oracle. I'm sure all of you know a lot of people at Oracle, or who have maybe worked at Oracle in the past. I represent a specific line of business at Oracle, and that's called Oracle Marketing Cloud.
Before I start `’oRE`, I want to tell you what `’oRE` is. That stands for Oracle Reliability Engineering. It is not a classical site reliability engineering model. We are talking about an Oracle Reliability Engineering model, and even in that, I want to talk about our specific implementation. That's why I'm calling it `’oRE`. I don't represent all of Oracle; this is about how Oracle Marketing Cloud has gone through the transformation of using reliability engineering principles.
To understand the context, think about how this $40 billion company is transitioning from where it was, this big database giant, into all these cloud-first policies and strategies you keep hearing about. To do that, I am breaking that $40 billion company down into a $200 million company. That is what we are in Oracle Marketing Cloud.
Marketing Cloud is a fairly easy concept to understand. If your companies are trying to sell a product or service to somebody else, the first thing you need is to set up a marketing campaign and generate sales leads out of that. That is where marketing automation tools come in. This field has exploded over the past 10 years or so. There are many companies and players. Oracle is one of the market leaders; Adobe and Salesforce are huge players too.
There are lots of components in Oracle Marketing Cloud: B2B, B2C, social marketing, content, and analytics-driven marketing. Out of that, the biggest component, and the one I am going to talk about, is B2B Marketing Cloud. Our business is about $200 million. We have about 2,000 customers and about 225 employees, plus sales folks. Think of this as a classical SMB with fairly profitable margins, not the whole large Oracle company.
What do I do there? I do platform and release engineering for Oracle Marketing Cloud. Our goals are fairly straightforward. First, we provide platforms for developers and possibly SREs to build from: tools and systems they can build on top of. Second, like all of you, we are moving to the cloud; in our case, Oracle Cloud. How do we make that transition process seamless? That is another focus area. Third is the classical DevOps value stream: tools for CI/CD, monitoring, and the deployment process end to end from a lifecycle point of view.
This gives you context for what we call platform engineering. It might be different from what you have. We look at the platform engineering team as the glue between pretty much all the different services needed for this product to get deployed. On the bottom are product services, like Scrum teams who need the services. On the left are cloud services, another set of teams and organizations that use some of the tools we have. We try to make the whole thing work as one process instead of multiple processes.
When I first talked about this at my company, the first question I got was, "Why don't you use the classical SRE model?" I have had trouble getting the classical SRE model working. You read about all the success at the FAANGs and everywhere, but I am not really seeing that at my place. There are mismatches.
There are a bunch of reasons. Oracle has grown tremendously in cloud over the past several years. I talked about a $40 billion company; 15% of the revenues come from cloud, and that 15% has had 2x growth over the past couple of years. When Oracle buys companies, it brings the whole stack in and does not make attempts to change that order. That was surprising to me when I joined Oracle. The belief is that if a business is successful, keep that success intact instead of disrupting it by trying to migrate it. That means multiple stacks with different technologies all playing together.
That brings challenges with skill sets. You are not going to have many people with uniform skill sets who could be turned into an SRE or join an SRE team. Because we are migrating from a classical SaaS platform to more of a cloud platform in the OCI paradigm, there is a significant lack of interest among people in understanding what is happening today. They fear that when we move to OCI, things will be different, so they would rather focus on the product now than do SRE-type work.
The last challenge is compliance. I was at Nike's presentation yesterday, and I liked the concept of MVC, minimum viable compliance, but this has been hurting all of us. We can try to reduce compliance to the lowest possible level, but what if we have too many compliance activities pulling it down? That is always a challenge we work through.
In that context, let's talk about specific use cases. The first is DevOps as sticky glue. What I saw was a tacit expectation that every team's roles were well-defined, and there had to be some team or activity that glued everything together. I would call it duct tape or Band-Aid. Every step, from product definition to product development to deployment to monitoring to customer success, was going through DevOps. It is not about understanding what DevOps is; it is using DevOps as the de facto box that glues everything together. At first that felt great, because I was important. But the issue is that you get spread too thin, you become the critical path for everything, and overall efficiency comes down.
Why does that glue happen? Organizationally, the work splits into multiple lines of business. My SVP not even having a regular conversation with the SVP of the operations team makes my job much harder, because their priorities are misaligned with our world. If we are going to put a team in between, at least the lines of business have to talk to each other.
The next eye chart is the number one reason I thought we had to do something differently. To deploy one line of code to our customers on the cloud takes 21 different handoffs and six different teams. Think about that: one line of code. A customer comes back with an issue in a release. It might be a configuration issue, a code issue, or anything simple enough to be one line of code. The handoffs themselves are not bad when they are inside one team, like coding something and handing it to a tester. But here they transcend organizational boundaries. It always has to go to some approval somewhere, built into the Jira workflow. After every step, somebody has to ping somebody else: "Did you see my Jira request? Can you approve that?" It inherently slows down the work and becomes a morale issue for developers. They have heard about the problem, solved the problem, and want customers to have it right now, but they have to wait a week because it has to go through the approval board.
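The distinction between in-team handoffs and handoffs that cross an organizational boundary can be made concrete with a small sketch. The step and team names used with it are hypothetical; the real 21-step workflow is not reproduced in the talk.

```python
from typing import List, Tuple

Step = Tuple[str, str]  # (step name, owning team); both values are hypothetical


def count_handoffs(steps: List[Step]) -> Tuple[int, int]:
    """Return (total handoffs, handoffs that cross a team boundary).

    A handoff is the transition between two consecutive steps. It counts
    as cross-team when the owning team changes at that transition.
    """
    total = max(len(steps) - 1, 0)
    cross_team = sum(
        1
        for (_, team_a), (_, team_b) in zip(steps, steps[1:])
        if team_a != team_b
    )
    return total, cross_team
```

For a made-up four-step flow such as code (dev), test (dev), approve (change board), deploy (ops), this reports three handoffs, two of which cross teams. The cross-team count is the one that maps to the waiting and pinging described above.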
That was the context for what we did. The traditional SRE model is familiar from Google and the DevOps Handbook. In that model, SRE is a permanent role or group that takes over a product from developers when it is ready to launch. Development gets the product to launch readiness and handoff readiness, and then SRE takes it over. As Douglas said this morning, not every feature is SRE-ready; it has to reach a maturity level.
We flipped that around. In `’oRE`, we created a rolling role within the development community, within Scrum teams, so the work done by developers is done by `’oRE`. Typical developers do not all have to get to the skill level of using the platform to build what they need. Once the product reaches maturity, we do not need to go look for an SRE team because we do not have one. We use the traditional operations model there. We keep as many steps, silos, and processes as need to be there, but use our existing leverage to make it happen.
The primary difference is that in the SRE model, developers create and self-run for six months or some amount of time, and then SRE takes it over. In the `’oRE` model, `’oRE` does that work until the product reaches the needed maturity. Operations teams are very skilled at operating things, making sure they are monitored, and reporting back issues. By the time something reaches that maturity, operations has the SOPs needed to troubleshoot and make it work well. That is the fundamental difference between SRE and the `’oRE` model we created.
What does an `’oRE` need? They need many things from my platform engineering team. If they want to create a node type or an environment for builds, they need to know the cookbook and how to build that node, so they need Chef skills. For build configuration, local builds are not going to cut it; they need to integrate into CI builds as soon as possible. Their ability to use TeamCity, set up configs, and roll into the CI build process is critical.
Our platform team has created a generic ELK stack that anyone can plug into by having their own workflows. In the past, before `’oRE`, any team wanting telemetry would come to the platform team and ask us to do it. Now the `’oRE` can take the platform and design a workflow. They do not need to be expert in anything other than knowing how the product works.
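One common way for a service to plug into a shared ELK stack is to emit one JSON object per log line so the pipeline can index fields without custom parsing. The sketch below uses only the Python standard library. The field names and the `service` label are assumptions made for illustration, not the schema the platform team actually uses.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON line."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service  # hypothetical label identifying the emitting service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            # ISO 8601 timestamp in UTC, derived from the record's creation time
            "@timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "service": self.service,
            "message": record.getMessage(),
        }
        return json.dumps(payload)
```

Attaching it is one line per handler, for example `handler.setFormatter(JsonFormatter("campaign-api"))`, where the service name is again made up.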
Similarly, we are moving headfirst into containers and Kubernetes. If you need to deliver or deploy a service, you need to know how to run kubectl, set up Helm charts, and run them. That is another place `’oRE` can help.
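As a rough illustration of the kubectl and Helm steps mentioned here, the sketch below builds the two command lines a deploy would typically run: `helm upgrade --install` to apply a chart, then `kubectl rollout status` to wait for the deployment to settle. The function names and any release, chart, or namespace values passed in are hypothetical. The commands are only constructed, not executed.

```python
from typing import List, Optional


def helm_upgrade_cmd(
    release: str,
    chart: str,
    namespace: str,
    values_file: Optional[str] = None,
) -> List[str]:
    """Build a `helm upgrade --install` command for the given release."""
    cmd = ["helm", "upgrade", "--install", release, chart, "--namespace", namespace]
    if values_file:
        cmd += ["--values", values_file]  # optional per-environment overrides
    return cmd


def rollout_status_cmd(
    deployment: str, namespace: str, timeout: str = "120s"
) -> List[str]:
    """Build a `kubectl rollout status` command that waits for a deployment."""
    return [
        "kubectl",
        "rollout",
        "status",
        f"deployment/{deployment}",
        "--namespace",
        namespace,
        f"--timeout={timeout}",
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)` so that a failed upgrade or a rollout that times out stops the pipeline.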
We use Sensu not just for infrastructure monitoring, but for application monitoring too. Platform engineering provides containers that developers can check out, add application monitoring checks to, and push through the pipeline so all the elements get monitored in production. We expect `’oRE` to do that. For real-time metrics, primarily time-series data, we provide a TICK stack and `’oRE` can hook into that.
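An application monitoring check of the kind described here is usually a small program that prints one line and reports its result through the exit status: 0 for OK, 1 for WARNING, 2 for CRITICAL, and anything else for UNKNOWN. This is the convention Sensu checks follow. The sketch below assumes a hypothetical metric, a queue depth passed as the first argument, with made-up thresholds.

```python
from typing import List, Optional, Tuple

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_queue_depth(
    depth: Optional[int], warn: int = 100, crit: int = 500
) -> Tuple[int, str]:
    """Map a queue depth (hypothetical metric) to a status code and message."""
    if depth is None or depth < 0:
        return UNKNOWN, "QueueDepth UNKNOWN: no valid reading"
    if depth >= crit:
        return CRITICAL, f"QueueDepth CRITICAL: {depth} >= {crit}"
    if depth >= warn:
        return WARNING, f"QueueDepth WARNING: {depth} >= {warn}"
    return OK, f"QueueDepth OK: {depth}"


def main(argv: List[str]) -> int:
    """Parse the reading from argv, print the message, return the status."""
    try:
        depth: Optional[int] = int(argv[1])
    except (IndexError, ValueError):
        depth = None  # missing or non-numeric input becomes UNKNOWN
    status, message = evaluate_queue_depth(depth)
    print(message)
    return status
```

A wrapper script would finish with `sys.exit(main(sys.argv))` so the monitoring agent sees the status as the process exit code.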
The way we have defined `’oRE` is as a consumer of platform services in a more efficient and somewhat elevated manner than a typical developer who is not initiated into those things.
Information is key for `’oRE` or developers to succeed. We provide a real-time dashboard that tells `’oRE`s and developers, at any point in time, the status of builds on each branch and the status of each pod where they have deployed code. That information is critical for making decisions without coming back to the platform team to ask how to do it.
As I mentioned, we do not use AWS, Azure, or Google, we do not have elastic resources in-house, and we do not even have access to OCI to get elastic resources. We take larger servers, containerize them, and run Kubernetes to create elastic resources on them. Platform engineering provides those solutions, enabling `’oRE`s to do what they need to do.
There are other cases where a team might need to access databases and run queries on shared database environments. Those things are provided as a service to `’oRE`s so they do not have to figure them out independently.
Everybody does CI, and I hope everybody does pre-flights. We do a significant number of pre-flights, and they are automated to the extent that merges are typically auto-merged. If pre-flights and tests pass, commits get merged. This ensures the trunk is never broken.
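The gating rule described here, merge only when every pre-flight passes, can be sketched in a few lines. The `Check` type and function names are hypothetical and not tied to TeamCity or any specific CI tool.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    name: str
    run: Callable[[], bool]  # returns True when the check passes


def preflight(checks: List[Check]) -> List[str]:
    """Run every check and return the names of the ones that failed."""
    failed = []
    for check in checks:
        try:
            ok = check.run()
        except Exception:
            ok = False  # a check that crashes is treated as a failure
        if not ok:
            failed.append(check.name)
    return failed


def should_auto_merge(checks: List[Check]) -> bool:
    """Allow the merge only if at least one check ran and none failed."""
    return bool(checks) and not preflight(checks)
```

Treating an empty check list as "do not merge" is a deliberate choice in this sketch: a commit that ran no pre-flights has not been shown safe for trunk.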
ChefDK has been a huge godsend for us. Without ChefDK, I do not think we would ever be as successful with `’oRE` as we have been. We provide cookbooks for various node types. `’oRE`s can make tweaks or add recipes so a new service can work on those nodes. If they had to come back to the platform team for that, it would be much less efficient. The ability for `’oRE`s to download ChefDK, set up an environment using Vagrant and VirtualBox, get whatever OS they want installed, and test it is extremely useful.
Self-sufficiency is important. We need to empower developers and `’oRE`s instead of having them come back and open tickets. We built a seamless, self-sufficient reporting system for the services a typical `’oRE` needs: CI/CD, Kubernetes deployments, alerts, and similar things. Those are integrated into a simple in-house system built on Sinatra and MySQL.
These are the things the platform team provides. But we still have problems, and I would love feedback. The number one issue is `’oRE` lifecycle ownership. We started from a workflow with DevOps boxes in between; now it can look like we changed all of that to put PE everywhere. It is not as bad as that, but there is still dependency. Eventually our goal is for these things to be owned by `’oRE`s or a traditional SRE so there are no heavy and costly transitions.
If you look at systems architecture, infrastructure architecture, or launch plans, PEs are involved every step of the way. What if the actual developers, the people who code, owned these things? That is our eventual goal. We are still working through it. It is a progression and an evolutionary path. The next step is to start replacing PE with many `’oRE`s and see to what extent we can do it: 100%, or somewhat less?
To summarize, we still have many problems. As we migrate to cloud and OCI, Oracle Cloud Infrastructure, we are seeing many things that are different. That is one reason we get hesitance from people to become SRE-like. Is it possible to eliminate the whole operations team with services? We are considering that. I do not know how practical it is. Sometimes it becomes a self-fulfilling prophecy: when operations teams see these changes, they move away from supporting some services. We have to think about a model in which those operational activities are supported by a true SRE model or by automated activities.
Training is another issue. An `’oRE` is typically a developer with ops skills who continues to be a developer. They are not changing groups. In a typical sprint, about 50% is on actual product features and the other 50% is on `’oRE` activities. Is that the right mix, and can we sustain it? If we have multiple `’oRE`s in a Scrum team, one developer may be better versed in Chef and another in Docker and Kubernetes. Depending on that, we may end up with multiple `’oRE`s. How sustainable is that model? I do not know. We are still figuring it out.
Going ahead, we want to see if we can automate out of this mess. One suggested idea we are exploring is super containers. I talked about monitoring, logging, telemetry, and the services platform engineering provides. What if we created containers that include all of those things out of the box? It becomes easier for developers to do it themselves instead of needing an `’oRE` or a separate set of people responsible for it. Those are some of the thoughts I have.
Q&A
Ajay Chankramath: We have another two-plus minutes. Any questions, any thoughts you want to share? Maybe I should ask this: are there people using SRE today in your organizations? Considering using SRE? Great. What is the biggest challenge you are seeing right now if you are trying to use an SRE model?
Audience member: We are not really a service organization. We are a bank. We are an internal part of the bank, so we do not offer a retail service. We are just maintaining a system. I think SRE is geared toward that type of operation. Google invented it for their type of operation. Your way of modifying it to `’oRE` is probably the kind of concept we would have to look at.
Ajay Chankramath: That is something I found the hard way. We tried to do SRE in the traditional way. It was not working. I keep reading everywhere that it works great. Sure, it works great for places where you have the right kind of infrastructure to support it. How do we get there? I hope this has given you insight into the thought process that goes into trying to get to that point.
Anybody else want to share? If you are considering SRE, go ahead.
Audience member: For us, the main challenge is how to get developers interested in acquiring ops skills. We are trying the other way, where we have people who have ops skills and are trying to develop them as developers.
Ajay Chankramath: I have seen that. That is always a lot easier. People on the ops side say, "I want to learn some of these things," and a lot of ops people these days are fairly savvy with the toolsets out there. It is a good point. The way we have tried to address that is by providing developers the right tools so they do not have to go beyond a certain extent, which is always their concern: am I doing more core versus context? You focus on your core, but at the same time you understand what you need to do to make sure your core reaches customers as fast as possible. That is the mindset flip we have been able to at least start. There is a lot more to be done to get where we want to be, but that transition has started happening and we want to continue pursuing it.
All right. Thanks, everyone. Appreciate it.