Continuous Deployment in Telecom

Europe 2021

Senior Consultant of DevOps Practice for Telco Business · Eficode

Gabor is a Senior Consultant of DevOps Practice for Telco Business at Eficode. Eficode is the market-leading DevOps brand and trusted automation partner across a number of industries in the Nordics, the German-speaking market, and the Benelux region.

He is helping telco companies implement vendor-agnostic CI/CD pipelines for carrier-grade software, focusing on 5G. His team is working closely with major Mobile Network Operators, providing automation, coaching, and training solutions.

Gabor has a proven track record across all aspects of large-scale transformation, including technology, processes, and commercials, and well over a decade of experience in product development and automation.

Full transcript

The complete talk, organized by section.

Gábor Megyaszai

Hello, my name is Gábor Megyaszai, and I am very happy to be here and present today.

I will start with a little bit about me. Throughout my career, I spent around nine years in various IT fields dealing with databases, networks, and other sysadmin work. Then I spent another nine years working with one of the major telecom vendors. There I had the opportunity to experience how telecom is a much more rigid, slow-moving animal compared to other enterprise IT fields, and what that means to the business and the problems it leads to.

In my current role at Eficode, I have the opportunity to work with the largest telecom providers and see the other side of the coin from my vendor years. This is what I want to talk about today: how the telecom industry is striving for a change in its operations, and how the fact that vendors and providers depend on each other, and that both are highly regulated, makes things difficult. Luckily, we are starting to see a way out of this situation, and things are in motion for more agile operations with closer collaboration and what some might even call DevOps.

Let us start with what I learned and what we did while I was working with a vendor.

While I was working in product development, we frequently saw that significant time passed between the completion of a release and the moment it was actually taken into use at the premises of a CSP (communication service provider). It could be anywhere between 6 and 18 months. This caused problems in many ways, including going back to old releases to make corrections, maintaining environments to verify changes to old releases, and testing compatibility. Many times the person who created the original code had moved to another team or even another company, so someone had to pick things up and figure out what the original code was meant to do.

As we investigated the root causes and what we could do about them, we received input from many different stakeholders and drew our own conclusions. First, and probably the biggest problem, was the release strategy. The company had maybe one or two releases per year for its applications. In the best case it was up to four, but it could also be far fewer. Each release therefore had huge change content: dozens of features and even more corrections. It could introduce significant changes on different interfaces or on the platform itself.

That meant the effort to take releases into operational use was huge and, in many cases, quite manual. The different constellations of neighboring network elements had to be verified, and CSP-specific configurations had to be tried and tested. No R&D organization could test all these different scenarios, so sometimes huge errors were uncovered in the CSP acceptance cycle.

This meant a long back and forth between the CSP, services, and R&D. It took significant effort from all parties involved. In many cases, penalties were even involved. Most importantly, this long cycle time eventually slowed development because so much effort was tied down in correcting failures and old errors.

As said, these acceptance cycles could take up to 12 or even 18 months in a support model where a release was maintained and supported for 36 months. Continuous negotiations took place so CSPs could use a given release longer than granted by the support model. That led to even more R&D work, for example custom upgrade paths, which also had to be tested.

Business-wise, this model caused problems too. It was rigid enough that it did not leave room for introducing new business models instead of the old way of permanent licensing. It slowed the R&D machinery enough that there was not enough capacity to experiment with new features and capabilities, and it tied down enough sales and services resources that new market opportunities slipped away. There was also not enough automation or digital support to boost the efficiency of sales, services, or R&D.

As we went deeper and mapped what actually happens when R&D creates new software and the CSP buys it, we uncovered an ugly spider web of nightmare. It is not a problem that you cannot read the fine print; it would just be even scarier. The picture showed an unreasonable number of subsystems, tools, and teams involved in simple order fulfillment. In most cases, these tools and teams were disconnected, relying on manual data transfer, manual data validation, people triggering actions, and maybe even people executing them.

On average, if somebody ordered an already completed software release, it would be delivered in 52 days. Roughly two months to get software delivered. Any given R&D engineer would already have forgotten what they wrote in a single function by the time any user of the software got their hands on it.

What should we do about this? What should we enable for things to be much better?

We should enable pretty much anything, anytime to be delivered. Any merge done to trunk could immediately be delivered to customers who are willing and able to accept the software packages, so fewer changes would be delivered at once. Quality assurance on the customer side could be done in parallel with R&D internal testing. Feedback on any issue would be much more timely and relevant, and the software could go to production much sooner.
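
The delivery rule described here can be sketched in a few lines. This is only an illustration, assuming a hypothetical per-customer flag; the names `Customer`, `accepts_continuous`, and `recipients` are not from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    name: str
    # Hypothetical flag: the customer is willing and able to take every trunk build.
    accepts_continuous: bool


def recipients(customers, is_release_build):
    """Return the names of customers that should receive a build.

    Continuous customers receive every merge to trunk. All other
    customers only receive builds marked as a full release.
    """
    return [
        c.name
        for c in customers
        if c.accepts_continuous or is_release_build
    ]
```

With this rule, an ordinary trunk build goes only to the customers who opted in, while a release build goes to everyone.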

Customers who were not willing or able to take frequent deliveries would still get the usual big-bang software drops, but already in a much more stable and battle-tested shape. They would encounter fewer problems and could go live much sooner. This meant process changes, not only technical enablement, but that would be an entirely different presentation. I will focus on the technical side for now.

Technically, the decision was quite obvious: make a pipeline. Make a pipeline that extends to customer premises, integrates the digital software supply chain, integrates with digital sales components, and lets the software glide through from plan through build and release to deploy and eventually operate, while feedback and metrics are collected. Simple, is it not? Of course not. Things are rarely this easy, especially in business-to-business scenarios.

Immediately there were red flags along our pipeline plans. R&D work and CSP operations were in decent shape; we could not complain about those. But how releases were managed, made available, ordered, delivered, invoiced, and paid for needed significant improvement. How software was deployed and accepted also needed significant improvement.

Software supply chain and sales were highly manual and highly disconnected in both tooling and teams. We also had problems with how releases were created, managed, distributed, and made available for customers. These functions relied on people and manual actions, including manual data input for license-key generation or download from an FTP server. Deployment and acceptance were manual too, driven by PDF-based methods of procedure and test plans or HTML-based technical notes.

Three teams were tasked to do something about this, with the metric of bringing order fulfillment time down from 52 days to something much more reasonable.

The teams worked on three interworking subsystems. The first team worked on digital sales, including the new sales platform together with the marketplace, handling customer and contract data, managing contract life cycles, and providing input to supply chain so deliveries would follow contracts, legal requirements, and global trade agreements.

The second team worked on the software supply chain, including the industrial and operational data space for managing releases, creating release bundles, creating license keys, distributing release bundles to appropriate delivery endpoints, validating customer access, and pushing release bundles to the automation platform deployed on customer premises.

The third team worked on the DevOps automation platform, responsible for pulling software release bundles, deploying the software, and orchestrating acceptance testing based on machine-readable input from R&D about change content and from CSPs about their individual requirements. Because these subsystems were designed and developed as a joint effort, very little manual input was needed, and highly reusable components were shared between the solutions.
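
One way to picture acceptance testing driven by machine-readable input is a selection step that combines the change content from R&D with the CSP's own requirements. The following is a minimal sketch under assumed data shapes; `select_tests`, the area names, and the test catalog are hypothetical and do not describe the actual platform.

```python
def select_tests(changed_areas, csp_required_areas, test_catalog):
    """Pick the acceptance tests to run for one release bundle.

    changed_areas: areas touched by the release, as declared by R&D.
    csp_required_areas: areas the CSP always wants verified.
    test_catalog: mapping of test name -> area the test covers.
    """
    relevant = set(changed_areas) | set(csp_required_areas)
    # Sorted so that the resulting test plan is reproducible.
    return sorted(name for name, area in test_catalog.items() if area in relevant)
```

A release that touches only one area then triggers the tests for that area plus whatever the CSP requires on every delivery, instead of the full manual test plan.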

How did the solution perform? The main metric was delivery time. During the pilot phase, we went from an average of 52 days to an average of around six minutes. Six minutes from order to having the software ready to be deployed and tested in any environment. Feedback from CSPs came much more timely. R&D could receive field-test results before closing the release, improving turnaround time for corrections and eventually decreasing errors.
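
To put the two figures side by side, here is a quick calculation that takes the 52-day and six-minute averages at face value:

```python
from datetime import timedelta

before = timedelta(days=52)    # average order fulfillment time before the pilot
after = timedelta(minutes=6)   # approximate average during the pilot

# Dividing one timedelta by another yields a plain float ratio.
speedup = before / after
print(f"{before // timedelta(minutes=1):,} minutes -> 6 minutes, about {speedup:,.0f}x faster")
```

Because both inputs are averages, and the six minutes is approximate, the ratio is an order-of-magnitude indication, not a precise measurement.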

The system made it possible to create specific and generic release bundles targeting any customer group. With different customer needs and targeted release bundles, new business models with different service levels could be introduced. Such a system also enabled both application R&D and our teams to continuously evolve and develop our solutions. We could start forgetting the software-is-done-now-it-is-your-problem situation because of near real-time feedback and requirement management.

At this point during the pilot, I changed careers to join Eficode and work directly with major CSPs, independently from any vendor. This is where I realized, and realized harshly, the naivete of the vendor view. We imagined that we now had a functioning pipeline and could pump software out swiftly and easily, so it would be smooth sailing. The software could be delivered, deployed, and tested easily; everyone should be happy, and we were in a new mode of operation.

But the reality is that pretty much any major CSP has an army of vendors delivering different software to them. When all of them create such a pipeline, the result is chaotic for the CSPs. During one of our webinars last year, polls showed that the majority of CSPs have over 15 different pipelines comprised of over 50 different tools that they need to manage and maintain so they can work with their vendor army. This does not include vendors who are still mostly manual in their operations. CSPs have pipelines all over the place, sometimes serving highly specific purposes or even a single application software.

What CSPs really need is their own flexible and adaptable pipeline fed by all the different vendor pipelines. With such a setup, they could focus effort where it is needed most: further improving time to market, accelerating new service creation and rollout, or generally focusing on higher value-adding work.

Luckily, I had a chance to work on just that with Deutsche Telekom. We created a prototype pipeline for their future way of working. We built a system where we split the different layers required for 5G applications to operate. We defined the infrastructure, virtual infrastructure, CaaS, PaaS, and CNF or application layer, where required tenants, volumes, networks, host nodes, tooling, and eventually applications are deployed for acceptance and rollout. Everything is declared in a single source of truth originating from the architecture description.

This architecture consists of the master template and master configuration, broken down into each layer's individual descriptor: operating system image, Terraform or Ansible script, or Helm chart with its respective configuration.
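
The breakdown from master template and master configuration into per-layer descriptors might look roughly like the sketch below. The layer names follow the talk; which descriptor kind belongs to which layer, and the names `LayerDescriptor` and `break_down`, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerDescriptor:
    layer: str    # e.g. "infrastructure", "caas", "cnf"
    kind: str     # e.g. "os_image", "terraform", "ansible", "helm" (assumed mapping)
    config: dict  # the slice of the master configuration for this layer


def break_down(master_template, master_config):
    """Split the single source of truth into one descriptor per layer.

    master_template: mapping of layer -> descriptor kind, in deployment order.
    master_config: mapping of layer -> configuration values.
    """
    return [
        LayerDescriptor(layer, kind, dict(master_config.get(layer, {})))
        for layer, kind in master_template.items()
    ]
```

Keeping the template ordered means the resulting list can double as the deployment order, from infrastructure up to the application layer.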

The process model distinguishes four phases with related activities. Design is the first phase, where automation templates are developed, day-zero configuration parameters are defined, and test activity templates are designed based on the QA strategy. The next phase is continuous integration, where external software artifacts are resolved, deployment artifacts are built, and deployment templates and scripts are rendered along with test activities.

Integration is followed by continuous deployment, where closed-loop deployment of the different technology layers happens. Then dynamically defined test activities are executed, quality fulfillment is checked, and eventually a go/no-go decision is made for go live. Deployment is followed by operations, where continuous monitoring of service quality and execution of different policies are made possible.
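
The go/no-go step at the end of continuous deployment can be thought of as a simple quality gate over the executed test activities. This is a hedged sketch: the function name, the pass-rate threshold, and the rule that an empty result set means no-go are assumptions, not the criteria used in the project.

```python
def go_no_go(results, min_pass_rate=1.0):
    """Decide whether a deployment may go live.

    results: mapping of test activity name -> True (passed) or False (failed).
    min_pass_rate: fraction of activities that must pass (assumed policy).
    """
    if not results:
        # No executed tests means no evidence of quality, so do not go live.
        return False
    passed = sum(1 for ok in results.values() if ok)
    return passed / len(results) >= min_pass_rate
```

A real gate would likely weigh test severity and monitoring data as well; the point is only that the decision is computed from machine-readable results and not taken by hand.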

Architecture-wise, first we needed to expose various APIs for vendors and their supply chain so they could deliver artifacts and gain access to test systems for close collaboration. Then we had to secure this access, because each vendor should be able to access their, and only their, software.
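
The "their, and only their, software" rule can be illustrated with a path-based check. This sketch assumes artifacts are stored under a top-level directory named after the vendor; that layout and the function `can_access` are hypothetical.

```python
from pathlib import PurePosixPath


def can_access(vendor_id, artifact_path):
    """Allow access only to artifacts below the vendor's own top-level directory.

    Assumes a layout of '<vendor_id>/<product>/<version>/<file>' (hypothetical).
    """
    parts = PurePosixPath(artifact_path).parts
    if ".." in parts:
        # Reject path traversal instead of trying to resolve it.
        return False
    # Require at least one component below the vendor directory.
    return len(parts) > 1 and parts[0] == vendor_id
```

In practice this kind of rule would sit behind authenticated APIs; the sketch only shows the isolation check itself.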

For delivered artifacts, we had to agree on expected format and content so they are machine-readable but human-understandable. We also had to settle on branching, tagging, and release strategy so it can be controlled what and where can be deployed automatically without compromising production, security, integrity, or operations. Finally, we had to set up distribution of software and code to various sites and subsidiaries.
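
An agreed, machine-readable delivery format makes it possible to reject malformed deliveries automatically while still giving a human-understandable reason. The following sketch assumes a JSON manifest with a small set of required fields; the field names are illustrative and not the format agreed in the project.

```python
import json

# Hypothetical required fields for a delivery manifest.
REQUIRED_FIELDS = {"vendor", "name", "version", "artifacts"}


def validate_manifest(text):
    """Return a list of problems; an empty list means the manifest is accepted."""
    try:
        manifest = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(manifest, dict):
        return ["manifest must be a JSON object"]
    missing = sorted(REQUIRED_FIELDS - set(manifest))
    return [f"missing field: {name}" for name in missing]
```

Returning readable messages instead of a bare pass or fail keeps the check useful for the vendor who has to fix the delivery.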

From this scope, delivered software artifacts and their format require vendors to do their due diligence. I expect a long negotiation in the near or mid future, and I would expect some standardization around these formats and delivery artifacts.

For the actual pipeline architecture, we mostly followed the popular GitOps approach, with extensions for backlog management, test management, security scanning of artifacts, infrastructure component deployment, and monitoring and logging components. This way, if artifacts arrive in the desired format, the same tools and process can apply. Templates and configurations can be adapted easily to the application. Separation of layers and tasks for specific tools also lets us swap tools if the landscape changes.
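
At the core of the GitOps approach is a reconciliation step: compare the desired state declared in Git with what is actually running, then act on the difference. The sketch below shows that general idea only; the data shapes and action names are assumptions and do not describe the Deutsche Telekom pipeline.

```python
def reconcile(desired, actual):
    """Compute the actions needed to make the actual state match the desired state.

    desired: mapping of component -> version, as declared in Git.
    actual: mapping of component -> version, as observed in the environment.
    """
    actions = []
    for name in sorted(desired):
        if name not in actual:
            actions.append(("deploy", name, desired[name]))
        elif actual[name] != desired[name]:
            actions.append(("upgrade", name, desired[name]))
    # Anything running that is no longer declared gets removed.
    for name in sorted(set(actual) - set(desired)):
        actions.append(("remove", name, actual[name]))
    return actions
```

Because the comparison is independent of the tool that carries out each action, the tools behind "deploy" or "upgrade" can be swapped without changing the declared state.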

With this setup, vendors can have their own pipelines directly feeding into Deutsche Telekom's pipeline, providing the required speed and feedback, while DT can operate a pipeline that handles applications from many vendors without the overhead of multiple pipelines.

Putting these two subsystems together, the delivery capability from my vendor years and the deployment capability from the Deutsche Telekom project, would enable actual continuous deployment in telecom. Actual continuous deployment means the created software is delivered to the CSP almost immediately and can be deployed almost immediately, bringing down the acceptance cycle time significantly.

This definitely will not happen on its own. Neither CSPs nor vendors can complete such operations alone. It requires close collaboration in shaping the future mode of operations for both parties.

To summarize what I learned during these years: first and foremost, CI/CD is unavoidable. All parties experience the pain of the decades-old rigid mode of operations and strive to change it. Given 5G momentum, this can be an ideal time to introduce such practices.

Building a multi-vendor pipeline is not easy. If it is, you surely missed something important, so go back and take a look. With the right mindset it can be done, but it requires collaboration between vendors and CSPs and rigorous definition of handover points and formats. Everyone building their own pipeline for their own purpose will fall short in the grand scheme of business-to-business service creation, but a flexible and adaptive pipeline can serve everyone well.

What has been achieved, and what we can be proud of, is that software can now be delivered in near real time, within minutes. It can then be taken into use almost immediately, effectively bringing the infamous 6-to-12-to-18-month acceptance cycle down to weeks or even days.

If you wish to hear more about this specific case, we will have a webinar with colleagues from Deutsche Telekom, so please visit our website for details. Thank you for your attention. You can find me on Slack, and if you have any questions, please feel free to shoot.