When Failure Is Not an Option: From Fully Manual to Slack-based Deployments at ITIVITI

London 2019

Antoine Moreau

SVP - Head of Infrastructure Services · ITIVITI

Lisa Wells

Vice President of Marketing · XebiaLabs

Every day, ITIVITI executes the orders of 29,000+ institutional clients on their popular securities trading platform, supporting 3,600+ physical connections and more than 38M messages daily. In 2015, the company embarked on a mission to transform their completely manual, unscalable deployment process into a fully automated deployment pipeline that is kicked off by users and driven by Slack. Given the high stakes, deployment failures simply could not be tolerated. And with more than 2,000 deployments a month, speed and efficiency were crucial too.

From people challenges to technology choices to build vs. buy decisions, this talk will describe ITIVITI’s DevOps journey: what happened, why it happened, how it succeeded, lessons learned, and what challenges remain. Come learn how one innovative enterprise streamlined software delivery while reducing risk, and how you can too.

Antoine started his career in the financial industry more than 12 years ago, working in trading rooms where he was involved in many different functions, such as commando development, functional support, and management. Antoine joined Ullink (merged with ITIVITI in 2018) almost 7 years ago to bring this trading room experience to their EMEA client service organization. After managing this division for more than 5 years, he moved into a program management role where he was in charge of DevOps initiatives and other integration programs. Antoine was recently appointed as the SVP - Head of Infrastructure Services, reporting to the CEO. In this role, he is tasked with further transforming the infrastructure into a true infrastructure-as-a-service model, leveraging both his client-facing experience and his DevOps knowledge.

Lisa K. Wells is the Vice President of Product Marketing at XebiaLabs. Leveraging more than 25 years of experience in software, Lisa spends her time helping the company create value for the software development pipeline, crafting strategy, honing messaging, and generally herding cats. An Electrical Engineer who started her career in Applications Engineering, Lisa wrote three books on graphical programming before she moved to the "Dark Side": product marketing. Here she found her niche helping companies effectively build, market, and sell technology that provides life-changing value to customers. Over the last 12 years, Lisa has been at the forefront of building winning product strategies and messaging for some of the world’s leading software tooling companies, working with SmartBear and CloudBees before joining XebiaLabs in 2015.

Full transcript

Opening Video

We exist for one simple reason: to build technology that complements what traders do. Harnessing the power of people and technology so traders everywhere can do what it takes to succeed, faster, smarter, with greater insight to speed-read the market and expertly navigate the regulatory map. We build the most powerful platforms in trading and connectivity. We are Itiviti. Capture tomorrow.

Lisa Wells

Good morning, and thank you for coming to our talk. My name is Lisa Wells. I work for XebiaLabs running product marketing, and this is my friend and customer, Antoine Moreau. He is SVP and Head of Infrastructure Services at Itiviti.

Before we get started, I want to ask you all a personal question about your relationship with failure. How many people are like me, are totally afraid to fail? It messes you up. Raise your hand if you're just not okay with it. And then there's the maybe more sane way: failure is just a part of life. It's what happens. It's how you learn. How many people are good with that?

Pretty evenly split. In some business situations, the cost is too high and the business impact is too much to fail. That is the situation Itiviti was in as they wanted to transition from a very painful manual deployment process thousands of times a month to something better. What they ended up with was something fully automated, driven by Slack, and a team that was much happier being able to go home and rest easy at night.

Who here has heard of Itiviti? How about ULLINK? The two companies merged in 2018. Both were leading providers of securities trading platforms. They sell to investment companies who need to execute trades, and for those clients a trade that does not go through is a mission-critical failure.

Itiviti itself is a software company, not a reluctant software company. Their business is software. They have 1,000 employees all over the world, 400 of them in development alone, plus another 250 in operations and client services. They have more than 1,650 clients, all with their own individual platform, and more than 30 years of experience.

How many people know XebiaLabs? We orchestrate your end-to-end DevOps toolchain if you have a complex pipeline. That is what we excel at: release orchestration and deployment automation. There are a lot of tools in DevOps and they all have to work together. You have to do the right things in the right order, make sure the right processes are followed, especially in mission-critical and regulated environments. You want to report on that automatically without digging through log files, and you want to do it without scripting. We are leaders in Forrester and Gartner reports for release orchestration and deployment automation.

Itiviti met XebiaLabs in 2015 at DevOpsDays in Paris, just as they were starting their DevOps journey. Antoine was there. I was not. He is going to take it from here.

Antoine Moreau

Thank you, Lisa. In this talk I am going to talk especially about the people in client services: software engineers, implementation engineers, support engineers, not exactly the engineering department. These 250 people service our managed clients. They deliver an infrastructure that works 24/6 and has to route millions of orders every day, so it is important that we do not fail.

Why is failing not an option? For my clients, if they cannot route orders on behalf of their own clients, their clients will take their trading away for at least the rest of the day. Those are tangible financial losses. In some cases we can fix it; in other cases we cannot, because the deployment was so massive or the changes were database schema changes you cannot roll back. If it impacts our clients, it can have regulatory consequences. As a vendor I am not regulated, but my clients are regulated, so a regulator coming to them can affect how I operate and service them. It really is not an option to fail.

In 2015, before our DevOps initiative, we were already at a 98% success rate in deployments. That is pretty good, but not enough. Those 2% were creating problems. We were spending about 13 minutes on average on any given change; the most complex changes took hours and hundreds of steps, so the risk of failure was high. We already had DevOps-compatible tools used by engineering, like Ansible and Jenkins, but they were not used by client services at all.

By 2015 we could see problems with scalability. We were at the end of what we could do with manual processes. We went to conferences across Europe and ended up at DevOpsDays in Paris, where we met XebiaLabs. We sat down and said we had to fix problems. Some vendors were outside, some tools were open source, and we needed to do something.

Why did we want DevOps? Some people wanted DevOps because it was trendy and good on a resume. That was not the problem I wanted to solve. We had actual, real problems. Our environments were not under source control. Developers used source control, Jenkins, Git, Gerrit, and similar tools, but our client platforms were not under source control. Every time an implementation engineer made a change, he had to document it so it could be handed over to production and applied in the same order. If you make changes in a given order, they need to be replicated in the same order in production, or the result may differ.

What was bound to happen happened: dev, UAT, and production environments did not look alike. You test something in UAT, move it to production, think it will work, and it does not, because someone before you did something in UAT and never moved it to production. We also had tedious work and a strong dependency on the production team. The production team primarily operates in production, but we also needed them to deliver in UAT. If they were busy in production, we had to wait, so projects were delayed and bottlenecks formed. Those were the main reasons we wanted to review our delivery pipeline and how client services operated.

The challenges were expected in some ways: no tools within client services, no resources, no infrastructure to host the new stack, and people fearing for their positions. If you tell the production team, "I'm going to automate everything about delivering to production," they say, "This is exactly my job, so I am not sure about this initiative." The challenge I did not expect was that clients were not ready for the change. We told them we would change the way we operate, make it better and more robust. They said, "You fail once or twice a year, and then you fix it quite rapidly. Overall, I'm fine. I'd rather you spend your time developing new functionality for me." We had these challenges, but top management understood that we had no choice. Without the change, we would prevent the company from continuing to grow or sustaining its growth rate.

In 2015 we worked with whiteboards, paperwork, and many things, and ended up with a stack. At every step of the DevOps journey we asked whether to buy or build. There is no point building something already built elsewhere by people who solved the same problem. In other cases, when something is specific to us, we had to build. We ended up with a mixed solution: open source tools, tools from XebiaLabs, and things we built.

The first central piece is called UL Clone. We developed it. It interconnects through our Git repository and connects to production backup databases so we can grab everything not under source control. Its purpose is to create, in a few minutes, an exact replica of our client production environment. The implementation engineer can download it, click a few buttons, make and test changes locally, then push the changes into Git and Gerrit.

Then we use Gerrit for peer review and Gerrit and Jenkins for automated checks, code quality checks, and security checks. Once everything is done, Jenkins triggers a job, builds a DAR file, which is the platform as a whole: the JAR file, configuration, and so on. Jenkins moves it to the DevOps platform from XebiaLabs. That takes three or four minutes. Once it is in the platform, anyone in the company can deploy the package into dev or UAT. The only difference with production is that only a production engineer can decide to deploy something into production because we have best practices and other constraints.
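The deployment rule described above (anyone may deploy to dev or UAT, only production engineers may deploy to production) can be sketched in a few lines of Python. This is an illustration only, not Itiviti's or XebiaLabs' implementation: the `User` type, the `is_production_engineer` flag, and the environment names are assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical environment names, chosen for illustration only.
PRODUCTION = "production"
OPEN_ENVIRONMENTS = {"dev", "uat"}


@dataclass(frozen=True)
class User:
    name: str
    # Assumed flag; the talk says group membership in the tool carries this right.
    is_production_engineer: bool = False


def can_deploy(user: User, environment: str) -> bool:
    """Return True if the user may deploy a package to the environment."""
    env = environment.lower()
    if env in OPEN_ENVIRONMENTS:
        return True  # anyone in the company may deploy to dev or UAT
    if env == PRODUCTION:
        return user.is_production_engineer  # production is restricted
    raise ValueError(f"unknown environment: {environment!r}")
```

Rejecting unknown environment names, instead of silently allowing or denying them, keeps a typo from being mistaken for a permission decision.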

This is the solution stack we built over the years. Slack is in the middle, though I have not mentioned it yet. First, we provide bespoke solutions hosted and managed by us. They are not SaaS, so we need flexibility across the chain and during deployments. One reason we chose XebiaLabs was that the UI is comprehensive and easy to get into, with little onboarding time. It also let us keep our production engineering best practices. Production engineers build rules in the tool for how to deliver a change in UAT or production. We can inject our own scripts, like Ghostbuster. We also trust our engineers. If they decide a step does not make sense for a specific delivery, they can right-click and skip it. We save time and space.

Automation is key. People do not like doing the same thing twice or three times, and this is where Slack kicks in. We already used Slack for instant messaging. Slack has open APIs, and we have good developers, so we developed JavaScript bots. The one I use most in our DevOps initiative is the deploy bot. It is connected to the entire stack: QYC, Jenkins, Git, XebiaLabs, and everywhere else.

From a single command you can schedule a change. You pushed it into Git, you are happy, and you want to schedule it in ten days. You take the commit ID, provide the platform name, and it schedules the change. It creates all the Jira tickets with the proper template, information, time, and everything you need. Then you are done and can move on. If you want to restart a platform, you say "notify clients" or "notify restart" and pick the platform name. It goes into QYC, finds the client and contacts, and sends an email saying the platform will restart in five minutes. The engineer does not have to work out who to email or how to notify them. We change this bot weekly, injecting new features and code because it is JavaScript and many people can collaborate.
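As a rough sketch of the scheduling step described above, the following Python parses a chat command into a change request and derives the fields a ticket template might need. The real bots are written in JavaScript and their command syntax is not given in the talk, so the `schedule <commit-id> <platform> in <N> days` grammar, the `ChangeRequest` type, and the ticket field names are all assumptions. No Slack or Jira API is called here.

```python
import re
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical command syntax: "schedule <commit-id> <platform> in <N> days"
_SCHEDULE = re.compile(
    r"^schedule\s+(?P<commit>[0-9a-f]{7,40})\s+(?P<platform>[\w-]+)"
    r"\s+in\s+(?P<days>\d+)\s+days?$",
    re.IGNORECASE,
)


@dataclass(frozen=True)
class ChangeRequest:
    commit: str
    platform: str
    due: date


def parse_schedule(text: str, today: date) -> ChangeRequest:
    """Turn a chat message into a change request, or raise ValueError."""
    match = _SCHEDULE.match(text.strip())
    if match is None:
        raise ValueError("expected: schedule <commit-id> <platform> in <N> days")
    return ChangeRequest(
        commit=match["commit"].lower(),
        platform=match["platform"],
        due=today + timedelta(days=int(match["days"])),
    )


def ticket_fields(change: ChangeRequest) -> dict:
    """Build ticket fields for the change (field names are made up)."""
    return {
        "summary": f"Deploy {change.commit[:7]} to {change.platform}",
        "due_date": change.due.isoformat(),
        "platform": change.platform,
    }
```

For example, `parse_schedule("schedule 1a2b3c4d client-a in 10 days", date(2019, 6, 1))` yields a request due on 2019-06-11. Passing `today` in explicitly keeps the function deterministic and easy to test.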

This is the stack we started building four years ago. The outcomes at Itiviti are that we achieved what we wanted. We went from 98% to 100% deployment success. We more than doubled the number of changes delivered in UAT every month, without doubling the number of platforms. The production team headcount increased by only one, so the scalability issue was solved. It is faster: five minutes now, compared with half an hour before.

We also got unexpected benefits. We increased traceability. The stack uses documented things: Git, Jira, everything documented from start to end. If a client asks why we did something on a date, we have the entire stack. Another benefit is that we can now talk with engineering about more than side topics: we use Jenkins and Gerrit, and they use Jenkins and Gerrit. When they bring new best practices from an event, we share them, and they also share the costs.

What is next? First, I want to expose the whole stack to clients. I already have a beta running. I want clients to have an extranet where they can log in, see all the commits we have done and packages we have built for them, and decide when to deploy them into UAT, to start conservatively. We spend a lot of time coordinating with clients about when they want to test functionality, and we add no value in that process. I want them to take it over.

I also want more testing and more automated tests, especially integration tests. In my business, many things are interconnected. I want to start a container journey and move the infrastructure toward infrastructure as a service. That will be challenging because we are not stateless and we are not microservices. Finally, when we designed the solution, we designed it with a monolithic, client-by-client approach: how can I handle this platform? We did not design it as: how can I handle all my platforms at once if I want to make the same change everywhere? That is the problem we have to address for mass deployments across the environment.

What did I learn? First, do not listen to your clients. If I had listened to my clients, I would not be here, and my company probably would not be growing at the same rate. I hope no clients are in the room; otherwise I will buy you a beer. Second, spend time thinking about what problem you really want to solve, not just being fancy. That was essential to our success. Third, in any major change, identify champions in teams that are reluctant. We found a person who helped, started to move, and understood that we did not want to replace him. We wanted to give him more value: review the content of what he delivers rather than how he delivers it.

Another thing I learned is that we got lucky. When we started in 2015, we did not know that on January 3, 2018, a major finance regulation called MiFID II would kick in. It forced us to re-engineer, redevelop, and upgrade every single client. If we had not had this solution then, we would not have made it. Even with the tool it was challenging; without it we would have been crushed. Try to look ahead: what new regulation is coming, how will the landscape change in the next year or two, and what will force your organization to evolve? Otherwise you might already be dead.

The last thing I learned is that, as mentioned, we have been monolithic. When you do these initiatives, have someone from the outside challenge you: why did you not think about this feature? That probably would have saved me time for a few months to come. Thank you. That was it.

Q&A

Lisa Wells asks how Antoine overcame client resistance after saying he did not listen to clients. Antoine says: through pains and tears, mainly. Clients were against the change because they did not see the value and because, working with banks and brokers, source control, Git, and Gerrit did not make sense to them. Itiviti had to explain the value. Clients worried they would lose flexibility, and moving to source control does mean they cannot fiddle as much as before, but they gain. Itiviti started with clients who were keen, then used them to advocate to reluctant clients. That was key; the community helped clients tell other clients that it was a good thing and not a waste of time.

An audience member asks how the development team was involved and whether there was resistance. Antoine says there was no resistance. Development had moved to Git and Gerrit a few years before and had suffered through it, so they helped client services use best practices, structure the work, and build automated rules. A few people moved from engineering to client services for the initiative, then moved back when it was mostly done. Client services has also hired DevOps engineers directly.

An audience member asks about the claim that deployments went from 98% to 100%. Antoine says every deployment is now successful within the organization. That does not mean clients are always happy about the deployment. Before, deployments failed because everything was manual and a missed step jeopardized the deployment. Now, if something does not work in production, it does not work in UAT either; it was not properly tested or the specifications were not clear, but the deployment itself does not fail.

The same exchange turns to integration testing. Antoine agrees that upstream and downstream systems matter. Itiviti collects messages from clients and sends them to exchanges, brokers, back offices, and middle offices. Sometimes a change seems isolated but has a waterfall effect, hence the need for more integration tests. The challenge is functional testing: any client routes thousands of messages daily, so isolating every unit test and making it viable is hard. They are working to narrow the daily traffic down to the smaller set of message types that are unique.

An audience member asks about reluctance and skill gaps in the traditional operations team. Antoine expected more problems than he had. Implementation engineers are basically developers, so they were comfortable with the stack. Support people are more functional because they talk to traders and clients, and they embraced it because it empowered them. Previously they depended on implementation engineers and SSH access to make changes. Now they can check out, make a small CSV change, commit it, build a package, and send it out. They needed some training, but they wanted to get on board because it gave them flexibility and autonomy.

An audience member asks about the financial business case. Antoine says, beyond XebiaLabs cost, there was not much other money invested. There are people resources: one full-time employee maintains the stack, makes it sustainable, upgrades it, and keeps it running. That person is the one additional production-team headcount and is nested in the production team because they should own best practices for deploying and delivering changes into production. Antoine estimates this as roughly $100K.

Asked about payback, Antoine says productivity is higher and client satisfaction is higher, though client satisfaction is hard to value. Productivity is higher because projects are delivered on time more often, there is less drift, and they can pile up more projects. If he commits to a client that a project will start on a given date, he is now more confident it will actually start then, which also increases client satisfaction.

An audience member asks about audit and compliance rules. Antoine says the first rule is that only production engineers can deploy in production. XebiaLabs has groups of users, and only those users can click the button to move something to production. Other rules enforce compatibility: if you change one software component, you may need to change another. Compatibility matrices live in repositories and release centers. Slack bots constantly check those compatibility matrices and best practices. These rules change daily or biweekly as they learn from failures and add new checks, which is where the DevOps engineer spends a lot of time.
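A compatibility check of the kind described above might look like the following Python sketch. The matrix layout, the component names, and the version numbers are invented for illustration; the talk does not describe how Itiviti stores its matrices or how the bots evaluate them.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical matrix: (component, version) -> versions of other components
# that the given component version is known to work with.
Matrix = Dict[Tuple[str, str], Dict[str, Set[str]]]

MATRIX: Matrix = {
    ("gateway", "2.0"): {"router": {"5.1", "5.2"}},
    ("gateway", "1.9"): {"router": {"5.0", "5.1"}},
}


def incompatibilities(deployed: Dict[str, str], matrix: Matrix = MATRIX) -> List[str]:
    """List every rule in the matrix that the deployed versions violate."""
    problems = []
    for component, version in sorted(deployed.items()):
        requirements = matrix.get((component, version), {})
        for other, allowed in sorted(requirements.items()):
            actual = deployed.get(other)  # None if the component is absent
            if actual not in allowed:
                problems.append(
                    f"{component} {version} requires {other} in "
                    f"{sorted(allowed)}, found {actual}"
                )
    return problems
```

An empty list means the planned set of versions passes every rule in the matrix. A bot could run a check like this before scheduling a change and refuse to proceed when the list is not empty.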

Asked whether they plan to move away from only approved production engineers clicking the button, Antoine says he would like to but cannot. His clients are regulated and Itiviti has ISO 27001 constraints. They must keep strong controls over who can access production and deliver changes into production. In some cases he thinks they should automate further, but he is not allowed.

Lisa closes by saying they will be in the speakers' corner that afternoon for more questions.