Blackboard Learn - Keep Your Head in the Clouds

San Francisco 2014

Vice President, Chief Architect · Blackboard

How do you implement DevOps in a software company that has 16 years of established culture and processes? What if this organization is the industry leader and has everything to lose by changing?

Over the last two years, Blackboard has gone through an enormous change, from a company delivering enterprise software once every 18 months to one on the verge of delivering Cloud enabled education software through continuous deployment.

My presentation will talk about the triumphs and challenges of taking a group entrenched in years of legacy to a new vision of faster delivery of high quality software.

Full transcript

The complete talk — auto-generated from the talk's captions.

So my name is David Ashman. I'm the chief architect of cloud architecture at Blackboard, and I'm going to talk a little bit about the transition that Blackboard has taken from enterprise to cloud, and through DevOps. So what do I do at Blackboard? Well, I build things.

I think a lot. Generally, I wear clothes when I do that, especially at the office. I manage people sometimes. But really, I build things for Blackboard, and I generally try to make things better for everybody.

And this presentation is about just that, making things just a little bit better at Blackboard through DevOps. So, a little bit about Blackboard. Blackboard is the industry leader in educational technology and services. We've been around for 17 years in this space.

We are currently privately held, but we spent many years as a public company. Prior to going private, we had revenues of about $450 million a year. All I can say now is we have more than that. And we are headquartered in Washington, DC.

We have an extensive portfolio of teaching and learning, communications, analytics, security, and transactional services and products. These products serve all stages of education from K through 16, and on into adult education. But today, I'm going to focus on our flagship Blackboard Learn product, our teaching and learning platform. So, some more about Blackboard Learn.

So it is 17 years old. It is our first product. It was started by a small team at Cornell University, and it has its roots in Perl. Over the years, it has evolved into millions of lines of Java code.

Many of those years, it actually spent as a hybrid of a Java and a Perl application. And anybody that's dealt with hybrid applications knows that's not an easy task to deal with. We service this application through development and operations in seven offices worldwide, several in the US, and going as far as Australia and China. And we have 700 people working on this product in development, testing, and operations.

Now, that's all of our products. We have a lot of people that sort of filter in and out of Learn and other products, so 700 total. We service about 1,000 clients in our hosting organization, and that's accomplished on about 3,000 virtual machines. Excuse me.

And we have about eight petabytes of content and data storage across those 1,000 clients there. Large system. And we are a horse. And like so many of you in this room, we have dealt with long lead times, six-plus month lead times.

In fact, we had one release that spanned 18 months before we could get it out the door. Like any product that's 17 years old, we carry a lot of technical debt. Now, we do whatever we can to try and get that debt down with each release, but inevitably, features fight technical debt, and we end up carrying debt release over release over release. We have experienced some high update failure rates in the past.

We've had releases that have resulted in hours of downtime for our clients and outright failures that have required them to roll back to older releases. Obviously, something that no company wants to put any client through. And we've dealt with the same communication issues that many of you have had in both directions. The story I tell about a lack of communication (obviously, I don't like to tell this story) is about a time when we were releasing a new version of Learn that had a queuing mechanism in it to deal with synchronization in a cluster and cache invalidation.

And this queuing technology depended on multicast networking. And so we built it, we tested it, everything worked great. We pushed it out to our operations team, neglecting to tell them that we needed multicast and the network infrastructure was not designed for it, and it brought down the whole system. Nobody knew what was going on.

It created a lot of fires. Bad communication. And we had bad feedback loops in the past, too. The operations team would spend a lot of time and energy building scripts, building tools, building various things to make the application behave well in a production environment, but never tell us, and we were never able to put this into our product and make it easier for the operations team to end up operating the product when we were done with it.

But through the art of DevOps, we are now a better horse. So what changed? A lot changed. A lot has changed at Blackboard over the last few years, but I'll focus on three key ones, automation, cloud infrastructure, and culture, of course.

Oops. So first, automation. This chart here, as Gene likes to point out, this is the graph he really liked. This is the lines of code in our mainline product, in our Learn product.

This is the Java code ever since we started tracking it around 2005. What's the problem here? Well, the problem is, over time, this product has grown and grown and grown, especially when you start looking around the middle here. It is growing at such a pace that it's becoming this enormous product with so much complexity, so much insurmountable debt, that we were running into problems, both in development and operations, of significant failures in releases and problems with developers taking far too long for these products to get built out.

We needed a way to improve this. We tried using different tools. We introduced better code management tools, better build tools, but in the end, there was no way to look at it any other way than to say that we were building a monolith. And with that monolith comes all kinds of problems.

Poor code quality that was making it out into the field, slower release times, clients that would have to wait longer and longer and longer for releases to get out there with fixes or with new features. More instability. When you have larger products like this, you have more dependencies. You have the butterfly effect of a change on one side of the product causing an error on the other side of the product that you would never expect to happen. And internally, it was causing a lot slower developer productivity.

The times that a developer would have to wait for any kind of feedback on a change that they would make was getting longer and longer and longer. So we built more and more complex machinery inside to try and make it more and more lean and get better and better feedback. But all we were doing is building more, bigger, complex stuff to deal with a bigger and more complex application. In the end, it was resulting in 24 to 36-hour feedback loops on integration testing.

So a developer would write code, they would commit it, and they would have to wait 24 to 36 hours to get any feedback as to whether their change broke with somebody else's change. But the problem also was this was a bundling. This was the nightly build. So this was a bundling of all the changes that had happened during that day.

And that meant that when there was a failure, which happened fairly frequently, we had to rip it apart and figure out whose code change actually caused the problem and to get the ticket back to them to try and fix it. Enter Mike McGarr, sitting right there. Mike was hired at Blackboard by Steve Feldman, another colleague that's here this week, to kickstart DevOps at Blackboard. And he came in with a whole fresh set of eyes.

We had a large organization that had gone through a lot of change, but we needed new eyes. We needed new ideas to come in and really force us to think outside the box and think of new ways to approach build pipelines, to approach release cycles, to approach how we were building our product. The first problem we took on was that problem that I talked about earlier. We had this monolith.

How were we going to deal with making this monolith easier to manage, get a better release cadence, and get fixes out faster into the field? And to do this, we took on modularizing the product. What's that in the top corner? How do we get our product to go down at the end?

Well, we actually had something built into the Learn platform that we could leverage. We had a module technology, a component technology built in that had traditionally been used by third parties to extend our platform. Well, we leveraged that same technology. We ate our own dog food and started building our product using that same technology.

And this allowed us to start breaking apart this monolith and start understanding the dependencies between these components much, much better. And it also allowed us to start building these components independently and getting better feedback to our developers. We also introduced new technologies, better tools, more modern tools like the ones you've heard so far today, Jenkins, Chef, Vagrant for virtualization, and Gradle as a new build tool. And all of these combined together allowed us to take that 24 to 36-hour integration cycle feedback time down to 15 to 30 minutes.

Now, a developer could commit their code, and by the time they got back from getting a cup of coffee, they could know whether their change would integrate well with everybody else's change. And additionally, because now we were doing commit-level builds, we could know that if a build failed, we would know exactly who broke it and when they broke it and be able to automatically route back to them that the issue exists and tell them to go fix it. But of course, there's more. That was only the first phase.
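The difference between a nightly bundle and commit-level builds can be illustrated with a small sketch. This is not Blackboard's pipeline: `build_passes`, the commit IDs, and the authors are hypothetical stand-ins, and the search assumes that once a breaking commit is included the build keeps failing. With a nightly bundle, someone has to search the day's commits for the first one that breaks the build; with one build per commit, the first failing build already names its author.

```python
from typing import Callable, Optional, Sequence, Tuple

Commit = Tuple[str, str]  # (commit id, author); hypothetical data shape


def first_breaking_commit(
    commits: Sequence[Commit],
    build_passes: Callable[[int], bool],
) -> Optional[Commit]:
    """Search a nightly bundle for the first commit that breaks the build.

    build_passes(n) is a hypothetical callback that builds the first n
    commits of the day and reports success. Assumes build_passes(0) is
    True (yesterday's build was good) and that the build keeps failing
    once the breaking commit is included.
    """
    if build_passes(len(commits)):
        return None  # the whole bundle is fine
    lo, hi = 0, len(commits)  # passes with lo commits, fails with hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if build_passes(mid):
            lo = mid
        else:
            hi = mid
    return commits[hi - 1]


def commit_level_culprit(
    commits: Sequence[Commit],
    build_passes: Callable[[int], bool],
) -> Optional[Commit]:
    """With one build per commit, the first red build identifies the author."""
    for n, commit in enumerate(commits, start=1):
        if not build_passes(n):
            return commit
    return None
```

Both functions return the same commit; the point is that the second needs no after-the-fact investigation, because each build result is already tied to exactly one change.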

That's only dealing with integration testing. There's always, on the other side of that, functional acceptance testing. Now, again, any of you guys that have a long history know that functional testing really starts with manual testing. We, as an organization, had already, years before we started doing this, had already taken on the effort of automating that manual functional testing.

And we had done a good job. We had taken a manual functional testing process that would take anywhere from a week to two weeks to run down to running in about 14 hours, which was a great improvement on feedback, but it wasn't enough. If you look at the actual pipeline of what was going on, we had the great work that we were doing around integration testing at the beginning. A developer is working on a feature, spends a half a day working on it, commits that code, and within 15 minutes, he's going to know that it integrates well with everybody else's code and it's working fine.

But after that, you have your acceptance tests. Each suite would run about 40 minutes to an hour, and in this case, I'm assuming that one suite covers the code that was changed. But the main problem was, even though we had automated these tests, the feedback from these tests was not a simple green/red result. It required human intervention, human analysis of what the failures were to determine if they were real failures or if they were environment or some other issue.

And that took an additional hour. And then it has to loop back. Again, these are nightly builds, so now it's a culmination of all the changes that had happened during that day, and now you have to unwind it and figure out who broke things. It gets much, much worse when you start thinking about the wait times.

So again, we're doing okay at the beginning. Five minutes, we're going to get some feedback to the developer, but then you start throwing in this problem in the middle. Thirty-six hours we need to wait for that nightly build. We need to wait to make sure that it's all okay.

We need to ship it and install it on all these servers. And then we need to get these test suites running. And then you got to wait for all the test suites to finish running because there's more than just one. There's dozens of them before you can get all the analysis done and then feed that back into the cycle.

And this was resulting in 9% efficiency around our development. That's horrible. This loop was taking three to six days for a developer to know that at the acceptance test level, their code was working. Additionally, because these tests were not red/green tests, we couldn't know for sure whether a failure was a real failure.
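The 9% figure reads as a flow-efficiency ratio: hands-on time divided by total elapsed time. The sketch below shows how a number in that range can arise. The four-hour "half a day" and the three-day loop are assumptions picked from the ranges mentioned in the talk, not figures taken from the slide.

```python
def flow_efficiency(work_hours: float, elapsed_hours: float) -> float:
    """Fraction of the elapsed time that was spent on actual work."""
    if elapsed_hours <= 0:
        raise ValueError("elapsed_hours must be positive")
    return work_hours / elapsed_hours


# Hands-on steps described in the talk, in hours (values are assumptions).
work = {
    "coding (half a day)": 4.0,
    "integration build": 0.25,
    "one acceptance suite": 1.0,
    "manual failure analysis": 1.0,
}
work_hours = sum(work.values())
elapsed_hours = 3 * 24  # low end of the three-to-six-day loop

print(f"{flow_efficiency(work_hours, elapsed_hours):.1%}")
```

Under these assumptions the ratio comes out just under 9%; stretching the loop to six days roughly halves it. Almost all of the elapsed time is waiting rather than work.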

We were generating hundreds of failures. Sixty percent of the failures were actually attributable to scripting issues. Either they were invalid tests or tests that have fallen out of sync with the code. Thirty percent were data or environment issues, so data was left behind by a previous test that was breaking a subsequent test.

Seven percent were preexisting issues. And again, because we had this butterfly effect of a change on one side breaking the other side of the application, the same root cause could cause multiple different tests to break. And though we might be able to avoid running a test that we know is going to fail, some other one might fail down the line. You need to figure out, well, is this an issue we already know about, or do we have to open a new ticket?

And in the end, we were seeing 3% newly discovered issues. Again, not a very highly efficient process. And it was all evidence that we really had this inverted testing triangle. We were far too dependent on end-to-end GUI tests, a lot of browser clicking, and not enough at the integration and the unit testing level.

So, long story short, we had these elongated test cycles. We were running at three-plus months of testing. So those six-month cycles, half of it was being spent testing. It resulted in longer time to market.

We were struggling against competitors that could get out there much, much faster than us with features and fixes. So we were losing customers to these competitors. And we had reduced visibility back to our development team, and that was evident in a couple different ways. Longer delays between coding and fixing, of course, and just far too much noise.

We had to wade through all this noise to understand what was going on. And what are we doing now to try and take this on? And we're actively working on this right now. We're starting to adopt test-driven development internally in Blackboard, which is great.

We're also working on a fully automated acceptance test pipeline. So we're taking out that middleman. We're taking out the human part that has to worry about analyzing and understanding the issues. We have a target of getting our acceptance test pipeline down to below 30 minutes, which is a pretty significant change coming from 14 hours.

So we have to start looking at new technologies and new ways of doing this testing, that it isn't a browser click. So we're rewriting our application using a JavaScript front-end and a REST back-end. So we're looking at using newer technologies like Jasmine and Protractor at the JavaScript level, and Rest Assured to be able to test our REST interfaces. All of this is with the goal of trying to get our six-plus month lead time down to one to two-week lead times.

A very aggressive push internally that we're in the middle of right now, but we're making great progress towards getting to that. So next I want to talk about cloud infrastructure. So developers, they had their development environments. They had environments running on their machines that were really only tuned for them to run by themselves.

We had our test environments that could scale a little bit more, had more data in them, and were good for testing for multiple clients and multiple environments. And then we had our production environments that were really built for scaling, really built for production environments that needed to run for clients in the field. But none of these environments were the same. And what made it even worse is the environments were owned by different teams.

We had our production environment that was owned by our operations team, our production operations team, and we had product development that owned development and testing environments. This created the typical, well, it worked in dev issue, where we would do the work, we would push it out there, everything looked great, our testing was wonderful, but everything would go up in flames in production. And it really came back to that our environments were snowflakes. None of them were the same.

And so we had different deployment models where we had developers running build scripts on their own machines, and that would deploy into their own workstations. We had automated installers that would be running in our test environments, but those were individual environments for clean data sets for testing. And then we had gold masters and virtual machines that we would roll out for production environments. So, totally different ways of deploying our application and, additionally, different architectures.

Developers were typically working on Windows machines. We do support Windows for our self-hosted clients, but none of our clients in our production hosting environments are using Windows. They're all on Linux and Oracle. So we have our developers working on Windows, and we have our production running on Linux.

And then right there in the middle, we had testing that would run both, but they weren't even running clusters. So we had clusters running out in production. And anybody that does cluster development knows that there are a lot of issues that creep up because of clustered environments that we weren't seeing in testing. We introduced Chef, which was a great tool to start standardizing on how we were doing configuration management within our environments.

But again, we had two different efforts going on. We had our operations team that was independently doing Chef development from our development organization, both trying to get to that goal of configuration management to deploy our environments, but both doing it completely separately. Then came the cloud, cloud computing to save all. I was given the opportunity to build a super team of development and operations, our first true DevOps team, if you want to call it that.

It was going to be focused on a shadow project that would take our teaching and learning platform, Blackboard Learn, and move it into the cloud. We had run it on traditional hosting for years and years and years, but we wanted to see the benefits of cloud architecture. We wanted to see what cloud computing could do for our clients in terms of scalability and reliability of the platform. And of course, automation was our goal.

We started from the very ground. Everything was automated. And what this allowed us to do is truly implement infrastructure as code. Everything from orchestrating the environment up through installing and running the environments was automated, and it's the same automation that's used across all deployment environments.

Whether you're running in development, testing, or production, it is the same deployment code. And of course, because we're using AWS, which makes it super easy to get environments to use, we're able to provide those same environments as self-service to our developers. They don't have to go to an operations team and get systems provisioned for them to be able to do testing. Now they can run those exact same scripts.
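As a minimal sketch of what "the same deployment code" for every environment can look like: one function produces the deployment steps, and only the parameters differ between development and production. The `Environment` fields, sizes, and step names below are illustrative assumptions, not Blackboard's actual tooling or real AWS or Chef calls.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Environment:
    """Per-environment settings; every field here is illustrative."""
    name: str
    app_servers: int
    instance_size: str


def deployment_plan(env: Environment, version: str) -> List[str]:
    """Return the ordered deployment steps; the logic is shared by all environments."""
    return [
        f"provision {env.app_servers} x {env.instance_size} for {env.name}",
        "configure nodes with shared cookbooks",
        f"install application version {version}",
        "run smoke tests",
    ]


# Hypothetical environments: same code path, different parameters.
DEV = Environment(name="dev", app_servers=1, instance_size="small")
PROD = Environment(name="prod", app_servers=8, instance_size="large")

for env in (DEV, PROD):
    print(env.name, deployment_plan(env, "1.2.3"))
```

Because the environments differ only in data, a developer running the plan for `DEV` exercises the same steps that production will run, which is what removes the "it worked in dev" class of surprises described earlier.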

All they need to do is give us an SSH key. We give them an AWS key back, and now they are able to spin up real production quality environments for them to do their own testing. It's a great environment. Our next phase is what we're calling BB Cloud, which is going to take what we've learned in AWS and move it to our own data centers.

We have, as an education company, an international education company, we have regulations and we have clients that have issues with public cloud deployments. And so we want to be able to have deployment that works for all of our clients worldwide. And so we're going to have some data centers, some places in the world that'll be running OpenStack. And we're also going to continue deploying to AWS where it makes logical sense.

But we want to be able to abstract away that cloud architecture. So we're building a layer on top of this that allows our application developers to focus on application development. They provide an abstract definition of what their orchestration should look like, and we'll take care of orchestrating it for them. We've centralized and standardized our Chef automation so that we have one team that's stewarding Chef within the organization to make sure that we have good, clean cookbooks that everybody can use, no matter whether they're in development or production operations.

And of course, the most important piece is increased visibility for development. It's a top-tier requirement in BB Cloud to have monitoring, specifically performance monitoring using New Relic, StatsD, and centralized logging that will be completely transparently visible to all developers at any moment that want to go on and see what's going on, not only in their development environments, but in their production environments, too. I obviously want to talk about culture. Everybody up here is going to say culture is the biggest part of instilling DevOps in an organization.

And as an organization, we were very siloed. Like everybody that's going to get up here and talk today is going to say, we had a development team that would build their product, package it up, hand it over to QA to do testing, wash their hands, and move on to the next thing. QA would run all their tests on this thing. Once they got a GA quality one, hand it over to operations, wash their hands, and move on to the next thing.

Now, operations was left here with this box, and they had to figure out what to do with it. We had documentation, the same documentation we would give to our self-hosted clients, but it wasn't enough. There wasn't enough feedback coming from development, and there wasn't a shared ownership of, "Well, this is our development product. We want to help you get it into operations efficiently." No, we would just hand it off and give it to them.

It's not a very healthy organizational structure. Here's some ideas of some of the quotes that you would see even within the development organization from developers or from testers, that QA is responsible for defining test strategy, not development. And QA is responsible for checking the quality, not developers. And unit testing is not enough to verify that a feature is there.

A lot of FUD around what testing is and what quality is. And that branched out beyond testing and development and into operations. A lot of finger-pointing between the organizations of, "This version isn't performing well." "Well, it's because your deployment architecture isn't right for this." "No, it was working fine until we deployed the last version. The software is broken.

You guys go fix it." A lot of finger-pointing, not very helpful. So in the end, the key to the cultural change at Blackboard ended up being executive buy-in. Once the team saw that the leadership at the top wanted DevOps, the tide shifted. People started to get on board and started to really believe that it was the way that we wanted to do things.

The single biggest change that was made was bringing all those groups together. Operations and product development all became one organization. We had people in leadership positions that truly cared about development and operations at the same time. We started placing development groups into the traditional IT and infrastructure organizations, and we started moving all of the application operations teams back within the application development organizations, so they would be sitting side by side with the developers and working on the same problem together.

And of course, we automated all the things. Actually, you cannot automate cultural change. It's just not possible. That takes human capital.

And that's what it took from our executive staff to decide that this is what we were going to do. So what have we achieved? Well, development teams are now deploying their own code into production. That's a huge step for an organization that was used to just handing off code to somebody else and saying, "Go do it.

I don't care." We have developers that are solving operational issues without ticket escalation. It doesn't require going through tier one, two, three, four, all the way up. It's literally, "Oh, there's something wrong. I'm going to hop on there and I'm going to fix it right now." We have open feedback loops coming back on operational issues.

Now that we have a shared leadership, those issues are all working their way up to the same leadership that are saying, we need to work on this. It isn't one group over here trying to solve it in their own bubble without working with the other team to solve the problem. And we're making data-driven decisions. We now have the tool sets and the infrastructure in place to start gathering data.

We're gathering an enormous amount of data that we've never gathered before from our production environments. And we're using that data to drive decision-making within our development organizations. And most importantly, we're talking finally. So I am honored to have been invited to speak alongside some of the leaders in the DevOps movement.

I hope that I can offer something back to the community here. Some things I might be able to help with, understanding the impact of cloud computing and what kind of impact it can have on DevOps at a traditionally enterprise company. I can help understand a little bit about cost models of traditional hosting and enterprise hosting versus cloud. We've done a lot of cost modeling at Blackboard around this.

And how to frame a pitch to an executive team if you're really struggling to get buy-in from the senior leadership as to why a DevOps culture could help push an organization forward. I also would love to learn from some of you about effective testing strategies for traditional manual testing. We still have to do some UI-level testing. What's the best way to do that with the minimal impact on a pipeline to be able to achieve a high level of throughput and still be able to do that type of UI testing? And also how to apply DevOps strategies for ship-to-premise software.

We do still have clients that won't even deal with a hosted cloud structure. They want to be able to run it on their own campuses. So how do we do DevOps for a product that we don't even have in our own data centers? So anybody that has experience with that, I would love to hear about it.

And that's it. Thank you very much. Thank you, Gene, for inviting me. If you have any questions, go ahead and email me.

I actually have some time, so... Okay. Thank you, David. And we have a couple minutes for Q&A, and I've been reminded to repeat.

I will repeat the question so everyone can hear it. Yes. So how do you create a no-fault culture? So you had that nice picture of dev pointing fingers at ops and vice versa.

If something breaks, we traditionally want to find fault, point the finger at somebody, but how do you do that in such a way that's not threatening and doesn't stifle the culture you're trying to create? David, how do you transition from blame culture to high-trust culture? Well, I'll be completely honest and say we're still trying to get through that. It's fairly recent that we brought the organizations together, and we're finally sitting at the same table talking instead of blaming.

But there's still that. After years and years (the hosting organization has been part of Blackboard for 15 years, I believe) of being a separate organization and really having no recourse other than that blame game, it takes a long time to retrain those people to think, "Well, let's stop thinking about who was at fault and start thinking about how to solve the problem." So we're better now than we ever have been before, but we still have a long way to go.

What was the biggest action that you took that made the biggest difference, you think?

Again, moving these teams together and finally having these operational escalation issues, the fire drills, that weren't just the operations team sitting in their room in a dark corner of the data center, but literally the main conference room in our headquarters with the senior VP of product development and all the teams that worked on the code that's having problems all sitting in that room and saying, "Well, what's going on?

Why is this happening?" Awesome. Thank you. By the way, can you share how many lines of code was in that mainline repo? Order of magnitude.

I don't know if I'm allowed. It's a big number. Tens of millions. Tens of millions of J2EE and embedded Perl.

Yep. Wow. That was actually just the Java. That wasn't even the Perl.

Awesome. Any other questions? Over here. Sure.

Did you have to hit rock bottom for the executives to get buy-in? Did you have to hit rock bottom for the execs to buy in? That's a really good question. I would say yes.

Personally, I feel like we did hit rock bottom. I felt like we had enough failed releases, just enough of a client backlash of problems, and honestly, enough clients looking at competitors that really made us think about what we were doing and how we could do things differently to remain competitive in the marketplace. To follow on from that question, so why did your executives not step in wanting to solve that problem and let you instead solve the problem? Question, why didn't the execs jump in and solve the problems themselves?

Why did they enable and empower you to solve the problem? Well, not being the executive, I can't really answer for him. But I think some of it was visibility. I think some of it was just maintaining a level of separation from the people that were seeing the problems and how they were being reported up.

We got a new executive team at Blackboard, and they are much, much more engaged. They are much more interested. They are much more technical than some of the ones we've had in the past. And they wanted to get in there, and they wanted to learn more, and they wanted to know what was going on.

Executives that were willing to get on calls in the middle of the night to deal with issues. It was a new experience for us. Awesome. One more, and then we'll conclude.

Speaking as a developer, I loved the slide where you said your developers were jumping in to do... Mm-hmm. But I have legal regulations like Sarbanes-Oxley and then other types of government regulations that keep my developers from having access to those systems. Right. And I'm curious if you've had to deal with that problem.

So the question is, dev wants to but can't. It's compliance, security. So education has a lot of compliance in it, too. FERPA is the big one.

It's a lot of privacy issues around student information and not being able to divulge any identifying information about a student. But FERPA has some fine print in it about how an organization that is acting as a sort of technical consultant to an institution can have access to that data. So we actually don't have to deal with that, at least on our traditional higher education and K-12 side of things. We do have a government side of things that we're starting to move more and more into the FedRAMP and DIACAP areas of compliance that certainly will start to change that landscape about who can access certain environments.

But as of right now, most of the regulations that apply to us don't really impact our ability to get into production systems. I do HIPAA compliance with healthcare, and one of the things that we came up with was to build some scripts where we could export data and automatically obfuscate the information so that our logs have the first two characters of each field so that you can verify that you're looking into data that was inputted, but can't really see exactly what it was. Right. And that sort of building these little tools into the background because you know that you're not going to be able to expose that data.
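The masking approach described in that comment, keeping only the first two characters of each field, could look roughly like the sketch below. The function names, the `*` fill character, and the sample record are assumptions for illustration; the talk does not describe how either team's scripts were actually written.

```python
from typing import Dict


def mask_value(value: str, keep: int = 2, fill: str = "*") -> str:
    """Keep the first `keep` characters of a value and replace the rest."""
    return value[:keep] + fill * max(len(value) - keep, 0)


def mask_record(record: Dict[str, str], keep: int = 2) -> Dict[str, str]:
    """Mask every field of a record before it is logged or exported."""
    return {field: mask_value(value, keep) for field, value in record.items()}


# Hypothetical record; the values are made up.
print(mask_record({"first_name": "Jordan", "city": "Springfield"}))
```

This version preserves each field's length, which helps when checking that input arrived intact but also reveals a little about the original value. A fixed-length mask would be the stricter choice where that matters.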

Right. So the comment was about obfuscation of data as it was being brought into the development organization. We did some of that at Blackboard, too, where we wanted to be able to bring in more production data sets to be able to do testing on there. But there were concerns from our clients.

"Well, there's information there we don't want you to have access to." And so we built a series of tools and scripts that were able to siphon that data through an obfuscation that would remove any identifiable information, because all we were really more interested in was the relationships of the data and the volumes of data and what the data looked like, not necessarily who it was or what they were doing. Awesome. Thank you so much, David. Thank you.