In Search of DevOps - the Evolution of a Data Warehouse Team Towards DevOps

Las Vegas 2019

Software Engineering · National Bank of Canada

This is the story map of our journey through our Corporate Data Department's cultural transformation to create autonomous DevOps squads managing multiple data warehouses.

We went through the classic transformation axis (people, process and tools) while continuously delivering projects, supporting 24/7 production environments and having fun.

We will share stories from more than two years of iterations, improvements, failures, and learnings by our team, with the help of other internal groups at National Bank of Canada.

We learned a lot, from redefining our roles to storing our "source code" (ETL) in Git repositories, taking control of our own software release life cycle, and experimenting with database virtualization and automated testing.

Full transcript

The complete talk, organized by section.

Maxime Clerk-Lamalice

When I joined the data warehouse team, two things were said to me. First, Agile was not really working in this group, and also automating the deployment of software components in this team was really not possible. So we did both, and this is our story.

Before going further, who else in the audience is working in a data warehouse environment or data-related work? All right.

My name is Maxime Clerk-Lamalice. I'm a software engineer by training, based in Montreal, Canada. Yes, we do speak French in Canada. Most of my time is focused on software engineering practices and building high-performance teams, while obviously having fun.

A quick intro about my journey. I started in the startup world. I was there for 10 years. We built software and hardware for the healthcare industry. Then I went to e-commerce, then banking since 2016. I'm at National Bank right now. In 2018, so last year, I did attend the London edition, and since then it has been a personal goal for me to present at this conference, so I'm very excited.

National Bank of Canada, based in Montreal, is one of the six major banks in Canada and the leading one in Quebec. It's also very active in the small and medium-sized business sector. Founded in 1859, it grew organically and through acquisition. Obviously this has an impact on the IT infrastructure, but also on the richness of the data that is available.

The bank's mission is to have a positive impact on people's lives. It's doing it by building long relationships with the clients throughout four lines of business.

A couple of numbers to give you a sense of scope. More than 24,000 employees, of which around 2,000 are in IT, and 2.6 million clients. Keep in mind that National Bank is part of the Canadian banking system, which is highly regulated and very stable.

A disclaimer before going further. This story is not about big data, data lakes, visualization, or even the cloud. It's about corporate data. We're a team of around 65 employees maintaining multiple ODS, operational data stores, and EDW, enterprise data warehouses. We're doing it by having a pretty classical technological stack, which is DataStage, Oracle, and Control-M. Keep in mind that the data producers and data consumers are never sleeping, so we're always open, 24 hours.

A bit more about the context. We are one of the many teams at the bank using Agile and DevOps to deliver quicker, faster solutions. What is specific about us is that we are managing data, which means that we're very popular. We're involved in very small to very large programs, and the only thing in common that they all have is they are very anxious and impatient to have access to our data. Also, since we grew organically, our IT ecosystem is a mix of legacy and new systems, obviously.

Going back to 2016, a typical ETL deployment required lots of interaction between different profiles. More than three teams were involved. We were releasing every three months or so, and the deployment was manual.

Where did we start our transformation? Where did we put most of our focus? Obviously, the classical three axes: people, process, technology. We started everywhere, but in iteration and always involving the team.

Let's go through some highlights. About the people: having an Agile mindset was key, so the bank invested in training, not only for the IT people but also for the business people. That moved people from a "this is my job" mindset to "I'm going to help the team." It was a big change in mindset, and it made a "you build it, you run it" approach possible at the squad level.

Secondly, we removed the QA people that came from a centralized group at the bank and moved the quality to the team level using different tactics: peer review, test coverage, and different built-in steps in the pipeline.

Lastly, we merged the dev and ops teams, which was a big move for us. We did this more than six months ago, and those are autonomous squads right now. They own the work from the request to the production environment, which was a major change for us.

About the process: to support the new autonomous squads, we changed the process. We used a unified backlog. We came from three different backlogs, dev, ops, and even management backlog, to a unified one. We were able to have a unique view of what was in the pipeline and what was coming next. Obviously, refinement happened with different POs at either the program level or the project level.

Secondly, we are a data team, so obviously we had lots of data: data about the commit, data about the time to release. It grew from only acquiring the data to having a conversation with the team about the performance, about the reflexes, about the different best practices. That was a nice shift for us.
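
As an illustration of the kind of measurement described here, the following is a minimal Python sketch that computes commit-to-release lead time. It is not the team's tooling: the change records, their layout, and the function names are all hypothetical.

```python
from datetime import datetime
from statistics import median

# Hypothetical records: (change id, commit time, production release time).
CHANGES = [
    ("etl-101", datetime(2019, 3, 4, 9, 0), datetime(2019, 3, 8, 17, 0)),
    ("etl-102", datetime(2019, 3, 5, 14, 0), datetime(2019, 3, 8, 17, 0)),
    ("etl-103", datetime(2019, 3, 11, 10, 0), datetime(2019, 3, 15, 10, 0)),
]


def lead_times_hours(changes):
    """Return the hours elapsed between commit and release for each change."""
    return [
        (released - committed).total_seconds() / 3600
        for _, committed, released in changes
    ]


def summarize(changes):
    """Aggregate lead times into a few numbers a team can discuss."""
    hours = lead_times_hours(changes)
    return {
        "count": len(hours),
        "median_hours": median(hours),
        "max_hours": max(hours),
    }


if __name__ == "__main__":
    print(summarize(CHANGES))
```

A summary like this is only a conversation starter; which events count as "commit" and "release" is a choice each team would have to make for itself.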

Lastly, new requests. Basically everything that came into the group, we were able to focus it through a service desk and have different workflows to dispatch it to the right squad.

Technology. Obviously, DevOps is often associated with tech tools. So we built pipelines. We built pipelines to deploy ETL. We built pipelines to deploy our scheduler's schedules. Most of the new tools we built, we built within our team, so people grew accustomed to maintaining them.

Earlier I talked about our classical stack. Right now there's a need for speed to have access to the data. New integration patterns are coming in with the new projects and programs. We're building the team; we're adding new profiles to the squad to have APIs, to have streaming capabilities. This is a great shift for us also.

Lastly, all our lower environments were switched from physical to virtual environments. With this we saved lots of disk space and gained lots of speed when refreshing the data. But most importantly, we were able to have self-serve capabilities at the squad level so the developers could request and manage their own data pods, which was a big time saver for all the programs the team is currently working on.

What did change? Basically, the interaction around a deployment now happens at the squad level. There are no other teams involved. There are no multiple coordinators involved. There's no problem getting the right information about what is currently being deployed. This fixed capability has also allowed us to have new interns coming in through mentorship, but also to bring the architect closer to the implementation. Obviously, the pipelines that we talked about earlier are being used at the squad level. So a big shift for us.

Key results. This new approach is scalable for us: we can scale up to multiple squads or scale down based on the work required. We went from monthly to weekly deployments. Production is more stable, with 55% fewer incidents. Performance improved: the overall time it takes to ingest and manage all the data is much shorter. The same team, but obviously more fun, which was very important for us.

The timeline: there was no magic. We went from a classical banking model with a dev team, an ops team, and more of a data science BI background for the developers. We went to Agile and the DevOps transformation. Right now, we did the shift to autonomous, and the future for us is very exciting because the speed at which the team wants to change is getting faster and faster. That's very interesting for us. That means we can add new concepts and virtualize more environments, since we're using multiple tracks per squad.

Another tool we used was the team meeting, as a snapshot in time to see what had changed since the last meeting, what we could improve, and to get feedback.

Obviously, there were some challenges. We're sharing three today with you. There was lots of fixing forward. Obviously prod was stable, but the knowledge of how it was fixed was not shared, so we made it public. We had what we call a wall of fail. Everyone in the group was able to see what went wrong, but most importantly, what was the solution, so it was shared with the whole group.

Secondly, we did custom software and we did pipelines. Obviously, we had to train our employees. But what we did not expect is we had to train neighboring groups also. That required more time than expected. Obviously this was a surprise, but the end goal was that the general knowledge of software engineering practices went up across the group and also neighboring groups. That was positive.

Lastly, we grew dependency on those DevOps experts. They were doing the coding; they were doing the pipeline. We switched this mentality. We made sure that our employees understood that those DevOps experts were doing development. Right now those experts are more coaches for our employees, making sure that the newly hired, but also the senior ones, are having the right information.

DataOps. DataOps is still a concept for us. We're not yet done with the DevOps transformation, obviously. But it's now built into our mission. We want to have open discussions about the complexity of managing data, the complexity of making data available throughout different groups, and making sure that people focus on the data, not only the code. Obviously, we are doing experimentation with new tools. For example, data as code, data catalog. We are using multiple vendors, and it's still a work in progress. Next year, obviously, I'll give you an update for sure.

A couple of must-dos. Be bold. Share your wins. Make sure that everyone knows about it. It's super important to inspire. We did develop a startup mindset at the department level. Obviously there's a tax in running this startup in the big corporation, so plan for it. Complexity, processes, it comes with it.

Get closer to your end users, and especially define data owners, who owns the data. This will simplify discussion with POs, with PMs, and the actual team about what to do with the data.

Work on hard and complex issues and problems. Adding automated regression testing in a complex data warehouse environment is very hard. You should be working on it right now, not only tackling quick wins.

Lastly, set the department mission, but stay aligned with the enterprise guidelines.

Let's talk more. What I want to hear about from you is: how did you tackle the automated testing in a complex data warehouse environment? Secondly, the conjunction of data governance and release management, how are you doing it? Those are two topics that I would like to hear more about from you. Thank you.

Q&A

Audience: When you combine the teams, let's say a BA engineer, how are they feeling? Are they able to use their niche skills? Are they learning from others? Because I usually get a complaint when you merge the teams: I have the niche skills. I want to only work on this. I don't want to learn that. How do you overcome when you bring the teams together?

Maxime Clerk-Lamalice: We have the notion of major, like the major and minor. Your major skills are this, but obviously we expect you to learn something else to help the team. Some people are not open to it. You cannot force them, but we can encourage it. That's what we're doing right now, and making it available through training and pairing also.

Audience: I think you described data pods earlier. What is that?

Maxime Clerk-Lamalice: We're using Delphix, and they have the notion of data pods. Basically, it applies a branching model to data. You have a source and you create pods of data that the developers or any employees in your team can use, and they can play with time. You can run, for example, an ETL; it will do its transformation, then you can go back in time. That's one of the notions that we're using right now.
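
To make the branching idea concrete, here is a toy Python sketch of a pod that can be bookmarked and rewound around an ETL run. It is not the Delphix API and was not shown in the talk: the DataPod class, its methods, and the sample rows are hypothetical, and deep copies stand in for whatever storage mechanism a real product uses.

```python
import copy


class DataPod:
    """Toy stand-in for a virtual data copy that supports bookmarks and rewind."""

    def __init__(self, source_rows):
        # Copy the rows so the pod never mutates the source it was created from.
        self.rows = copy.deepcopy(source_rows)
        self._bookmarks = {}

    def bookmark(self, name):
        """Remember the current state under a name."""
        self._bookmarks[name] = copy.deepcopy(self.rows)

    def rewind(self, name):
        """Restore the state saved under the given bookmark."""
        self.rows = copy.deepcopy(self._bookmarks[name])

    def branch(self):
        """Create an independent pod starting from the current state."""
        return DataPod(self.rows)


def uppercase_names(pod):
    """Hypothetical ETL step that modifies the pod's rows in place."""
    for row in pod.rows:
        row["name"] = row["name"].upper()


source = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
pod = DataPod(source)
pod.bookmark("before_etl")

uppercase_names(pod)
after_etl = [row["name"] for row in pod.rows]

pod.rewind("before_etl")
after_rewind = [row["name"] for row in pod.rows]
```

The point of the sketch is the workflow (bookmark, transform, rewind), not the mechanism; copying every row would not scale to a real warehouse.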

Audience: Is that production data they would use or is it masked?

Maxime Clerk-Lamalice: It's a mix. It's based on the use case. It could be.

Audience: Can you talk a little bit more about your data virtualization, what you're using that for?

Maxime Clerk-Lamalice: Basically, we were using all physical environments, Oracle physical environments, and then we switched to this provider, Delphix, which is allowing us to replicate quickly an environment, a database, and to clone a database. Based on the need of the project, the lifecycle of this database will match the project. Once the project is done, we remove the environment, and then we create a new one based on the specs of the project. Developers are only accessing what they need to have for the project.

Audience: Do you use external tools to create those virtual copies of the databases?

Maxime Clerk-Lamalice: The Delphix environment is managing lots of the logic, but we also have pipelines to facilitate the creation of those different environments. We stick it together with...

Audience: Like Redgate or what's the other one?

Maxime Clerk-Lamalice: Not the Oracle case, but we could use this.

Audience: We have a very similar environment, DataStage and Delphix for virtualization. I'm curious how you're plugging in Delphix into your pipeline. That's one thing that we're trying to figure out.

Maxime Clerk-Lamalice: First, it's for non-prod environments. How we are using it: when we start a project, we create an environment specific to it. We configure DataStage and the scheduler, also the ecosystem, to have a self-maintained environment so people can use it. Then it's self-serve. We use a self-serve facility so developers can do the modification or roll back in time. I don't know if it is answering the question.

Audience: Right now we're using it to refresh our physical servers and test. But we're trying to figure out how do we give the developers that capability without them going crazy and standing up...

Maxime Clerk-Lamalice: At first it needs some coaching and DBAs are still involved to make sure that part of it is standardized, but we're starting to give access more and more.

Audience: Do you give access to end users in your data warehouse environment?

Maxime Clerk-Lamalice: Yes. We create specific environments for exploration or data engineers, which are locked.

Audience: So like an R&D environment?

Maxime Clerk-Lamalice: Kind of, yeah. So they can have their fun.

Audience: What type of tools do you provide in that type of environment?

Maxime Clerk-Lamalice: It's self-serve, so they plug it with whatever they want. We just make sure that it's secure and only the right data is accessed.

Audience: Can you discuss your schema and DDL promotions as you go through? I imagine you're branching to test those. What does that look like for lifecycle?

Maxime Clerk-Lamalice: It's not all automated. The DDLs are in Git and the DBAs are matching based on the requirements of the project. It's a pretty standard evolution of the data warehouse. There's no magic right now. We are integrating DBmaestro, but we're still too early to give you an integrated answer of how we are using it.

Audience: Regarding testing. Since you're refreshing your data, how do you validate your mapping? In application development, they'll give you a scenario saying this input was put into this function to get that output. But since you're refreshing your data, you don't really know what's in there, and then you do have to build something. What's the process to validate to make sure that whatever detailed code you're writing is correct?

Maxime Clerk-Lamalice: You mean regression testing?

Audience: Or unit testing.

Maxime Clerk-Lamalice: Unit testing is done at the DataStage level, so the small transformations are covered. We rely more on higher-level functional or almost end-to-end testing.

Audience: You get a story that says this table needs to have this data, and there's some type of formula...

Maxime Clerk-Lamalice: How we match the data for the project?

Audience: How do you match whatever requirements you have since you have refreshed data? I guess in some scenarios if a BA or someone gives you test data...

Maxime Clerk-Lamalice: In our case, the BA is generating synthetic data, so they're creating data to match the test cases, and this synthetic data is living inside the pod. Then it gets refreshed or recycled at the end of the project. They will create it, and obviously the end goal will be to version it, so when we destroy the pods we can reuse it in a different context, build a library of automated tests. Kind of regression. This is where we are working right now.
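
One way such a synthetic-data regression case could look is sketched below in Python with an in-memory SQLite database. This is an illustration under assumptions, not the team's DataStage setup: the table names, the sample rows, and the transform function standing in for an ETL job are all hypothetical.

```python
import sqlite3

# Hypothetical synthetic input rows, written by hand to match one test case.
SYNTHETIC_ACCOUNTS = [
    (1, "CHQ", 100.0),
    (1, "SAV", 250.0),
    (2, "CHQ", 40.0),
]

# Expected result for that test case, versioned alongside the input.
EXPECTED_TOTALS = [(1, 350.0), (2, 40.0)]


def load_synthetic(conn):
    """Create the source table and load the synthetic rows into it."""
    conn.execute(
        "CREATE TABLE accounts (client_id INTEGER, kind TEXT, balance REAL)"
    )
    conn.executemany("INSERT INTO accounts VALUES (?, ?, ?)", SYNTHETIC_ACCOUNTS)


def transform(conn):
    """Stand-in for the ETL job under test: total balance per client."""
    conn.execute(
        "CREATE TABLE client_totals AS "
        "SELECT client_id, SUM(balance) AS total "
        "FROM accounts GROUP BY client_id"
    )


def run_case():
    """Load the inputs, run the transformation, and return the output rows."""
    conn = sqlite3.connect(":memory:")
    try:
        load_synthetic(conn)
        transform(conn)
        return conn.execute(
            "SELECT client_id, total FROM client_totals ORDER BY client_id"
        ).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    assert run_case() == EXPECTED_TOTALS
    print("regression case passed")
```

Because the inputs and expected outputs live in code rather than only inside a pod, they survive when the pod is destroyed and can be replayed in a different environment.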

Audience: Our code base is about 90% DataStage or ETL. Are you guys looking to maybe do something else outside of DataStage?

Maxime Clerk-Lamalice: Yeah, for sure. There's no final decision right now, but obviously, 10 years of development, that's a lot of ETL that you cannot shift in a day. But yeah, obviously we want to migrate to a new, simpler approach to more DevOps.

Audience: In development, the customer feels that bringing new functionality into the application is quicker than creating a report. In your experience, how do you bridge that gap? The way the application writes data and the way the ETL extracts it may be totally different. How do you minimize that effort so that it is quicker to deliver to the customer?

Maxime Clerk-Lamalice: It all comes down to how we, I would say on the architectural level, at the designer level, split the different portions of normalization of the data. Smaller ETL transformations that we need to do and push progressively. It's more at the design level: less big transformation, smaller blocks that can be deployed independently. Thank you very much, guys. Thank you.