Journey from PCF to AWS EKS

Las Vegas 2022

Executive Director, Digital Marketing Technology, JPMC Asset Management · JPMC

Imagine doing one cloud transformation - and now imagine doing it all over again. I would like to share the rationale and lessons learned during these 2 cloud migrations. We started our cloud migration with Pivotal Cloud Foundry (PCF) in 2015 but for reasons that I will share later, our team decided to migrate our applications to the public cloud using AWS and EKS. In this session, I would like to share how we used AWS, Terraform and EKS and when it makes sense to use Platform as a Service vs using IAAS like AWS and how it's possible to live harmoniously in a hybrid cloud.

Full transcript

The complete talk, organized by section.

Sheela Shankar

[00:00:17] Welcome, everybody. Let's get started. This talk is about my team's journey from Pivotal Cloud Foundry to AWS and EKS. Before we get started, how many of you all have been part of one cloud migration? And how many have been part of multiple? Okay, so I'm going to walk you through two of our cloud migrations. I hope there'll be more, but I'm going to talk to you about two of our migrations.

[00:00:49] A quick introduction. I'm from JP Morgan, and JP Morgan is the premier investment management firm. Our clients trust us to invest $2.5 trillion in assets under management. Our team is Morgan Advisor. We develop applications for our financial professionals to manage their portfolios. We have a lot of data for about 300,000 instruments, with performance data running into billions of numbers, and we use quantitative analytics to predict the performance of your portfolio. Our team is a mix of full-stack professionals, backend engineers, cloud engineers, etc. We build, manage, deploy, and support our applications end to end.

[00:01:47] This is me. I'm Sheela Shankar, and I joined JP Morgan about seven years ago. I'm a full-stack developer and now I lead the team. I'm a lover of all things cloud, SRE, DevOps, etc.

01 Where the Cloud Migration Journey Began

[00:02:05] Where did our cloud migration journey first begin? When I joined in 2015, we were doing manual deployments. We had physical servers, and the way to deploy an application was to write notes to the operations team saying: go to this server, copy this artifact using this jump host, turn off the traffic to this data center, etc. That was the manual deployment way.

[00:02:35] When Pivotal Cloud Foundry was introduced to us, our team welcomed it with open arms, because now there was no more hand-off to operations. The developers were in charge of the entire process. What made PCF so easy to get started with? PCF is a very opinionated framework, and it has a very low entry barrier. All you need is one `cf push` command, which can get your application running in the cloud. That was the most attractive thing about PCF.

[00:03:11] The second is it had a GUI console that you could use to start and stop your application and to increase the number of instances, memory, CPU, etc. And there was no infrastructure to maintain, so all that the developer had to do was write the application code, and the runtime, the buildpack, would take care of detecting what was needed to get the app bundled up and deployed. It would take care of the routing and services. That was how PCF was introduced in JP Morgan, and every team adopted it with open arms.

[00:03:51] How long did this honeymoon last? I would say within about seven months every team was deploying on Pivotal Cloud Foundry. Since it was a private cloud, we were running out of space and running out of pools. That was one of the first and main complaints, and it did get better.

[00:04:19] Second is there was a limited choice of services. We consumed services through a JPMC Marketplace, and we had exactly one cache, which was Cloud Cache, one database, which was MariaDB, and one queuing system, which was RabbitMQ. So there was a limited choice of technology, and we learned to live with that.

[00:04:40] Third is we had pool stability issues. Just like in AWS you have availability zones, in PCF you had something called pools. When an application was going down in one pool, we had to manually turn off traffic to that pool. Things did get better, but in the beginning these were some of the struggles that we had with PCF.

[00:05:03] Some of the other issues were IP whitelisting and IP caching. The way PCF services work is you have to whitelist either the IPs or the DNS name of the service that you're consuming. But if those IPs were added or removed, you would get a 503 Service Unavailable error. The way to rectify that was to rebind and restage so that the application picked up the new IP addresses. These were some of the issues with PCF.
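To make the IP caching problem concrete, here is a minimal Python sketch. It is a toy model, not PCF's actual mechanism: the `ServiceBinding` class, the `registry` dictionary, and the service name and IP addresses are all hypothetical. It only illustrates why a snapshot of allowed IPs taken at bind time goes stale when the service's IPs change, and why rebinding fixes it.

```python
class ServiceBinding:
    """Hypothetical binding that snapshots a service's IPs at bind time."""

    def __init__(self, registry, service_name):
        self._registry = registry
        self._service_name = service_name
        self.allowed_ips = set()
        self.rebind()

    def rebind(self):
        # Rebinding (followed by a restage) refreshes the snapshot.
        self.allowed_ips = set(self._registry[self._service_name])

    def call(self, target_ip):
        # Only IPs known at bind time are reachable; anything else fails.
        return 200 if target_ip in self.allowed_ips else 503


# Hypothetical service registry: service name -> current IPs.
registry = {"pricing-service": ["10.0.0.1", "10.0.0.2"]}
binding = ServiceBinding(registry, "pricing-service")
status_before = binding.call("10.0.0.1")

# The service's IPs change underneath the consumer.
registry["pricing-service"] = ["10.0.0.3"]
status_stale = binding.call("10.0.0.3")

# Rebinding picks up the new addresses.
binding.rebind()
status_after_rebind = binding.call("10.0.0.3")
```

In this model the call fails between the IP change and the rebind, which mirrors the manual rebind-and-restage step described in the talk.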

[00:05:36] Let's talk about some of the numbers. JPMC has an annual technology spend of about $14 billion, we have 50,000 technologists, and we have the largest PCF installation in the world. PCF is widely adopted in JP Morgan, and there are about 1,500 production applications.

02 Moving to Public Cloud

[00:06:00] Given that, why did our team decide to move to the public cloud? Our team adopted it in 2021, but the public cloud was already available in JPMC and adopted by other teams. One of the main reasons was that we could finally be truly elastic. There was no fear of running out of space or running out of memory, and we wouldn't need to over-provision just in case we ran out of memory or space.

[00:06:30] Second is we also wanted to increase our developer productivity and give developers more choice in the technology they use. Third is lower TCO. Total cost of ownership would go down with the public cloud because, for example, you don't need to run your servers 24 by 7. If all you have is a batch job which runs once a day, you can convert it to serverless. Also, you don't have to provision all your infrastructure up front. You can do it as your load increases.

[00:07:09] The public cloud also had better data solutions. Our team wanted to use DynamoDB, ElastiCache, etc., where we could have end-to-end encryption, multi-region replication, etc.

[00:07:26] What was the four-step process that we followed in this cloud modernization journey to the public cloud? The first step was to adopt Terraform. Terraform is infrastructure as code. Terraform is the only way we can provision infrastructure in AWS. That was our JPMC policy. This meant that you couldn't go to the AWS Console and just order an EC2 instance. You had to do it via Terraform. Doing it this way allowed us to enforce certain policies. If we wanted every EBS volume to be encrypted, we could set up a Terraform policy that would enforce that. And if a security audit has to be done, all the auditors have to do is look at the Terraform files in dev, test, and prod, and that tells them what infrastructure we're using and whether we're meeting the security policies.
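As an illustration of the kind of check such a policy performs, here is a minimal Python sketch. The talk does not say how JPMC implemented its policies, so this is an assumption: it inspects a dictionary shaped like Terraform's JSON plan output (`resource_changes`, `type`, `address`, `change.after`) and flags `aws_ebs_volume` resources whose `encrypted` attribute is not true. The function name and the sample plan are hypothetical.

```python
def find_unencrypted_ebs_volumes(plan):
    """Return addresses of planned EBS volumes that are not encrypted."""
    violations = []
    for resource in plan.get("resource_changes", []):
        if resource.get("type") != "aws_ebs_volume":
            continue
        after = resource.get("change", {}).get("after")
        if after is None:
            # The resource is being deleted, so there is nothing to check.
            continue
        if after.get("encrypted") is not True:
            violations.append(resource.get("address", "<unknown>"))
    return violations


# Hypothetical plan data, hand-written for this example.
sample_plan = {
    "resource_changes": [
        {"address": "aws_ebs_volume.data", "type": "aws_ebs_volume",
         "change": {"actions": ["create"], "after": {"encrypted": True, "size": 100}}},
        {"address": "aws_ebs_volume.scratch", "type": "aws_ebs_volume",
         "change": {"actions": ["create"], "after": {"encrypted": False, "size": 20}}},
        {"address": "aws_ebs_volume.old", "type": "aws_ebs_volume",
         "change": {"actions": ["delete"], "after": None}},
        {"address": "aws_s3_bucket.reports", "type": "aws_s3_bucket",
         "change": {"actions": ["create"], "after": {"bucket": "example-reports"}}},
    ]
}

violations = find_unencrypted_ebs_volumes(sample_plan)
```

A check like this could run in a pipeline before the plan is applied, failing the run whenever `violations` is non-empty.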

[00:08:21] This was the first step. Terraform also uses a declarative interface. You just describe the state you want, and it compares the current state with the desired state and makes the changes as needed.
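The declarative idea can be sketched in a few lines of Python. This is not Terraform's implementation, only a toy version of "compare current state with desired state": the `plan_changes` function and the resource names are hypothetical.

```python
def plan_changes(current, desired):
    """Compare two {resource_name: config} mappings and list needed actions."""
    return {
        # In the desired state but not yet present.
        "create": sorted(name for name in desired if name not in current),
        # Present in both, but the configuration differs.
        "update": sorted(
            name for name in desired
            if name in current and current[name] != desired[name]
        ),
        # Present today but no longer desired.
        "delete": sorted(name for name in current if name not in desired),
    }


current_state = {
    "cache": {"node_type": "small"},
    "legacy_queue": {"size": 1},
    "table": {"billing": "on_demand"},
}
desired_state = {
    "cache": {"node_type": "large"},
    "table": {"billing": "on_demand"},
    "cluster": {"version": "1.23"},
}

changes = plan_changes(current_state, desired_state)
```

The caller never states how to get from one state to the other; the diff determines which resources to create, update, or delete.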

[00:08:37] Second is we adopted AWS, as I said, and we used whichever AWS tools each service needed. For example, our performance database had to hold billions of rows. We refactored that to use DynamoDB so that there's no degradation of performance as the number of rows increases. We also used ElastiCache as a caching provider.

[00:09:07] The third, and most important, modernization step was to adopt Kubernetes for all our microservices and web applications. Using Kubernetes allowed us to overcome all the issues that we were having with PCF IP whitelisting and IP caching. Amazon EKS is a managed Kubernetes service, so Amazon takes care of managing the control plane and replicating it across three availability zones. All that the developers have to do is manage the data and the worker nodes. Adopting Kubernetes is the number one step for our modernization of microservices and web applications.

[00:09:53] Fourth, for observability and monitoring, we adopted Datadog. Previously in PCF we were using Dynatrace and Splunk. In AWS we decided to use Datadog because in Datadog you can look at all your Kubernetes clusters. You can look at all your databases. You can look at all your services, and the API trace can walk through each of your API calls and show how much time it takes. Doing all this in one tool makes it very easy. Datadog is expensive, but it is worth the price. We still use Splunk for all our exception monitoring. If you want to see what happened five seconds before and after an exception took place, you can use Splunk for that.

03 Team Frustrations and Adoption Practices

[00:10:47] Was it all hunky-dory in AWS once we had migrated our applications from PCF? What were the team's comments? I would say the number one thing the developers always said was: why should I care about where my infrastructure runs? I just want to write UI code. It's a mindset change, and it takes time for them to get accustomed to that.

[00:11:19] During our journey, when we were half in PCF and half in AWS, it was always difficult for the developers to know which applications were still in PCF and which had migrated to AWS. Also, there are too many dashboards to look at. There's always Splunk and Datadog, and if you're managing a PCF application, you have Dynatrace as well, so the person on support has to look at multiple dashboards. Fourth is there is no UI console. Developers have to learn `kubectl` to manage the cluster or to look at nodes and pods. That's again a mind shift. These were some of the frustrations with AWS.

[00:12:04] What helped in driving adoption? I would say the number one factor was having access to a whole lot of blueprints and tech primers. Tech primers were end-to-end applications that help you order your infrastructure and run it. For example, if you want to know a pattern to connect to S3, you could follow that particular tech primer. It would show you from beginning to end how to do it. A blueprint is an established pattern. For example, if you wanted to securely connect to a vendor application, it would show you the steps. If you wanted to connect from AWS to PCF, what steps we needed, that would also be there in the blueprint. Having access to these blueprints and tech primers helped accelerate the adoption, and I used a lot of them to get familiar with all these patterns.

[00:13:04] Second is to join cloud parties. Cloud parties are two- to three-day events where the main goal is to get your app into production. You would start with the PCF service, migrate it to Terraform, deploy it in dev and test, and then take it to prod. If you have a big app, it would take about two to three cloud parties to get it into production. But once you have the confidence of getting one app into production, it's easy to rinse and repeat for multiple apps. That was another factor which helped.

[00:13:43] Third is to have self-service everything. Because we use Terraform, we no longer had to raise requests. We could always self-service and help ourselves, so using Terraform helped adoption. Also, we wanted to build a culture of self-help as well as teams helping each other. The cloud parties, the blueprints, and the tech primers all helped to build a community where people could share their thoughts. The operations team and the platform team always treated the developers as their biggest clients, so any problems that we had were always fixed by the platform team.

04 DevOps Learnings and Production Issues

[00:14:32] In these six months, what were some of our DevOps learnings? The first one I would say is to refactor. You rarely get to do it once you finish your cloud migration, so do it right the first time. Take the time, and that's what we did. I think that was one of the most important learnings.

[00:14:56] Second is infrastructure. It's not like PCF, where you don't have to worry about the infrastructure. You have to take into account that there is a lot for the team to learn to adopt Terraform, Kubernetes, etc., and also to control and maintain that infrastructure. You're now responsible for the security of that infrastructure, so you have to set aside some time in each of your sprints to address those issues.

[00:15:26] Also, you have to watch out for high costs. It's very easy to spin up POCs and leave them running, and that incurs a big cost. Constantly monitor the cost. Make sure that you are destroying resources that are not needed and turning off resources at the end of the day or on the weekends, when dev and test are not needed. Follow those practices to keep the costs under control.
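The "turn off dev and test outside working hours" practice can be expressed as a small scheduling rule. The following Python sketch is an assumption about how such a rule might look, not the team's tooling: the function name, the environment names, and the 08:00 to 19:00 working window are hypothetical.

```python
from datetime import datetime


def should_be_running(environment, now, workday_start=8, workday_end=19):
    """Decide whether an environment's resources should be up at `now`."""
    if environment == "prod":
        # Production stays up around the clock.
        return True
    is_weekend = now.weekday() >= 5  # Monday is 0, so 5 and 6 are the weekend
    in_working_hours = workday_start <= now.hour < workday_end
    return not is_weekend and in_working_hours


monday_morning = datetime(2022, 10, 17, 10, 0)
monday_night = datetime(2022, 10, 17, 22, 0)
saturday_morning = datetime(2022, 10, 22, 10, 0)

dev_monday_morning = should_be_running("dev", monday_morning)
dev_monday_night = should_be_running("dev", monday_night)
test_saturday = should_be_running("test", saturday_morning)
prod_saturday = should_be_running("prod", saturday_morning)
```

A scheduled job could evaluate this rule and scale non-production resources down whenever it returns `False`.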

[00:15:54] Also, know your dependencies up front. Before you start your migration, know what you are connecting to and how it's going to be in the public cloud world. For example, if we had to connect from AWS to our internal PCF on-prem, we had to open firewalls. To know all those dependencies is very important up front.

[00:16:21] Then, of course, data migration. When you're migrating from one database to another, have multiple dress rehearsals for schema migration and data migration, and constantly test that the new application behaves the same as the old one. Performance testing also has to be done. We did not necessarily find increased performance after moving to AWS. We actually found performance degradation, and that's because many of our services were still in PCF. Though our team was in AWS, we still had our dependent services in PCF. Connecting from AWS to PCF did introduce latency, so we had to refactor to make sure performance got better and not worse.
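One check that a data migration dress rehearsal might include is comparing row counts and content fingerprints between the old and new stores. The Python sketch below is an illustration under stated assumptions, not the team's process: rows are represented as plain dictionaries, and the function names and sample data are hypothetical.

```python
import hashlib


def canonical(row):
    """Stable text form of a row, independent of key order."""
    return repr(sorted(row.items()))


def table_fingerprint(rows):
    """Return (row_count, sha256 over the rows in a stable order)."""
    digest = hashlib.sha256()
    for text in sorted(canonical(row) for row in rows):
        digest.update(text.encode("utf-8"))
    return len(rows), digest.hexdigest()


def compare_tables(old_rows, new_rows):
    old_count, old_hash = table_fingerprint(old_rows)
    new_count, new_hash = table_fingerprint(new_rows)
    return {
        "row_count_matches": old_count == new_count,
        "content_matches": old_hash == new_hash,
    }


old_rows = [{"id": 1, "price": 10.5}, {"id": 2, "price": 11.0}]
# Same data, different row order and key order: should still match.
migrated_ok = [{"price": 11.0, "id": 2}, {"id": 1, "price": 10.5}]
# Same number of rows, but one value was corrupted.
migrated_bad = [{"id": 1, "price": 10.5}, {"id": 2, "price": 0.0}]

ok_result = compare_tables(old_rows, migrated_ok)
bad_result = compare_tables(old_rows, migrated_bad)
```

Matching counts alone would miss the corrupted value in `migrated_bad`, which is why the sketch compares content as well.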

[00:17:15] What were some of the production issues that we encountered? I would say the first one was configuration. It's very easy to miss a few keys or to have prod pointing to test, so always test your app. You never know what happens.
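Both configuration mistakes mentioned here, missing keys and prod pointing to test, can be caught by a simple pre-deployment check. The Python sketch below is a hypothetical example: the required key names, the endpoint values, and the substring heuristic for spotting non-production endpoints are all assumptions made for illustration.

```python
# Hypothetical set of keys every environment's config must define.
REQUIRED_KEYS = {"database_url", "cache_url", "api_base_url"}


def validate_config(environment, config, forbidden_markers=("test", "dev")):
    """Return a list of human-readable problems found in `config`."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(config))]
    if environment == "prod":
        for key, value in sorted(config.items()):
            lowered = str(value).lower()
            for marker in forbidden_markers:
                # Crude heuristic: a prod value mentioning "test" or "dev"
                # probably points at the wrong environment.
                if marker in lowered:
                    problems.append(f"{key} looks like a {marker} endpoint: {value}")
    return problems


prod_config = {
    "database_url": "postgres://db.prod.example.internal/app",
    "cache_url": "redis://cache.test.example.internal:6379",
}

problems = validate_config("prod", prod_config)
```

The substring match is deliberately simple and can produce false positives (for example, a value containing "latest"), so a real check would need a stricter rule.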

[00:17:33] Data migration: I would say we did a lot of dry runs of migration, but even towards the end we did find some issues. Data migration is always something to look out for.

[00:17:47] Continuous testing: it's easy to let your guard down after you do a few migrations. It's important to continuously test, and then at the end of it, you can celebrate. Of course, you have to test all your consumers, because if your consumers depend on you and you have migrated to AWS, you have to let them know and let them factor that in their testing. Then always test for edge cases.

05 PCF and EKS Together

[00:18:16] When would we use PCF and when would we use EKS? For us, I would say the number one thing to remember is the applications in PCF are not going away. We have 1,500 applications, as I said, so it's important for us to only refactor as needed. Both have their pros and cons. PCF is a fairly stable cloud, and if your application is working well, there is really no need to refactor it. In our case, we did have certain advanced data cases, etc., which we needed to refactor. Right now I would say we are living harmoniously with a mix of the private cloud and the public cloud.

[00:19:07] It is important to realize that in PCF we really didn't have to worry about infrastructure. In EKS we do, so only if it is worth that amount of extra effort does it make sense to really go to the public cloud.

06 Ongoing Challenges and Closing

[00:19:26] The last slide is about areas where we are constantly facing challenges. The number one, I would say, is that a lot of toil goes into Kubernetes upgrades. Some upgrades are in-place and some require a blue-green cluster. Within about six months, we've already done two upgrades, and both were blue-green. You have to set up a new cluster, deploy all your apps in the new cluster, then destroy the old one, and then the new one is live. That requires a lot of toil. This is in addition to the application features, etc., so it's important to keep that in mind. I would like to see if people have the same problems and if they have any solutions.
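The blue-green upgrade sequence can be sketched as a small orchestration function. This Python example is illustrative only: the `deploy` and `is_healthy` callables and the app names are hypothetical stand-ins, and the health check before switching traffic is an added assumption rather than something stated in the talk.

```python
def blue_green_upgrade(apps, deploy, is_healthy):
    """Return the ordered list of steps taken during the upgrade."""
    steps = ["create green cluster"]
    for app in apps:
        deploy(app)
        steps.append(f"deploy {app} to green")
    unhealthy = [app for app in apps if not is_healthy(app)]
    if unhealthy:
        # Assumption: keep the old (blue) cluster live if anything fails.
        steps.append("abort, blue stays live; unhealthy: " + ", ".join(unhealthy))
        return steps
    steps.append("switch traffic to green")
    steps.append("destroy blue cluster")
    return steps


deployed = []
good_run = blue_green_upgrade(
    ["portfolio-api", "web-ui"],
    deploy=deployed.append,
    is_healthy=lambda app: True,
)
bad_run = blue_green_upgrade(
    ["portfolio-api", "web-ui"],
    deploy=lambda app: None,
    is_healthy=lambda app: app != "web-ui",
)
```

Every app has to be redeployed on each upgrade, which is where the toil described above comes from.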

[00:20:17] Second is the pipeline becoming slower, because now you have to scan your container code and your application code, and security is the number one concern. We don't deploy to prod until all our tests and security scans are clear, so that introduces a lot of delay in our pipelines. I would also like to hear if you all have the same issues and if there are any ways to mitigate them.
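The rule that nothing goes to prod until every test and scan is clear amounts to a gate in the pipeline. A minimal Python sketch follows; the check names and the `prod_deploy_blockers` function are hypothetical and do not represent JPMC's pipeline.

```python
# Hypothetical checks that must all pass before a prod deployment.
REQUIRED_CHECKS = ("tests", "application_code_scan", "container_image_scan")


def prod_deploy_blockers(results):
    """Return the required checks that have not explicitly passed."""
    # A check that is missing from `results` counts as a blocker too.
    return [check for check in REQUIRED_CHECKS if results.get(check) is not True]


all_clear = {"tests": True, "application_code_scan": True, "container_image_scan": True}
scan_pending = {"tests": True, "application_code_scan": True}

blockers_clear = prod_deploy_blockers(all_clear)
blockers_pending = prod_deploy_blockers(scan_pending)
can_deploy = not blockers_clear
```

Treating a missing result as a failure keeps the gate closed when a scan has not finished, which is also why each added scan lengthens the pipeline.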

[00:20:44] Third is security vulnerabilities. It's a constant challenge to stay on top of them. If you are managing a whole set of applications and a vulnerability is identified, you have to fix all of them. This also takes a lot of toil. I would say these are the three areas where we spend a considerable amount of work just maintaining the infrastructure to keep the lights on.

[00:21:11] If you have any similar areas or concerns or stories to share, please reach out and we can discuss. That leaves about three minutes for questions and answers.