Mastering the Art of Juggling between Application Development and Support

Las Vegas 2023

Executive Director, Digital Marketing Technology in JPMC Asset Management · JPMorgan Chase & Co.

As an Engineering Manager, one of my primary challenges lies in achieving a delicate balance within the software development spectrum. On one end, we have the excitement of pioneering new application development, and on the other, the responsibility of sustaining these applications once they are operational.

Developers love to experiment with new technologies. However, after these systems go live, the enthusiasm dwindles, and their focus shifts to the next project.

In this session, I aim to share effective strategies that my team has employed while walking the fine line between developing and supporting applications. I will also share the times we faced challenges and how we regained control.

Full transcript

Sheela Shankar

Welcome, everybody, to this session. The topic of the session is mastering the art of juggling between application development and production support.

I'm Sheela Shankar, and I have been building and supporting applications for the last 20 years, even before DevOps became a thing. I'm very passionate about being hands-on. And being an engineering manager, I still like to be hands-on because that's the only way you can do production support.

I like to share my passion for building and supporting applications. I'm always ready to dig into a complicated issue and get my hands dirty, looking at Datadog logs or Splunk logs. I'd like to share that passion with the team and hopefully with you all as well.

I'm from JPMorgan Asset Management. JPMorgan Asset Management is a global leader in investment management, with more than $3 trillion in assets under management. And we have a very large developer team, almost 50,000 developers.

Let's talk about my team, Morgan Advisor. We build applications for financial professionals to analyze their portfolios, to compare funds, to select funds for their target date portfolio, et cetera. As you can see in some of the screenshots, it's heavy on the UI as well as on the data.

Let's talk about our tech stack, what tools we use, and our platform. As I said, we build a suite of portfolio construction tools. These use data coming from different funds and from different data providers like Morningstar, FactSet, and Bloomberg. Our cloud platform is based on AWS and Kubernetes.

We build and operate our platform, manage our infrastructure, manage the cloud costs, keep it running. Most of our services are based on Kubernetes. Our technology stack on the backend is mainly Java, Python, Node.js, and on the frontend we use React, Next.js.

Coming to our team: we have about 30 developers, plus a mix of product managers and UX designers. Our entire toolset is managed by a central platform team, so the continuous integration platform and the deployment platform are both managed by that team.

Our support model is, "You build it, you run it." Our user base is mainly financial professionals from different companies, and we also have internal users using our tools. They actually use our tools 24 by 7, so it's important that we keep our tools running all the time.

Coming to the topic: why the analogy between juggling and application development? Developers think that once they've deployed code to production, that's the end of the lifecycle. But in fact, that is the beginning of the lifecycle. Constantly adding features, fixing bugs, and listening to user feedback is what completes the entire lifecycle.

While doing this, while keeping the lights on, developers are juggling multiple tasks. We have asks from the business to add new features. We have to keep our existing users happy, so existing features have to run reliably. And the security and infrastructure teams require that you always have the latest versions of the software and the infrastructure. As a result, you are juggling multiple tasks.

Just like juggling, it requires the same skills: speed, agility, creativity, adaptability, and most of all, practice.

Now let's talk about our support model, and why embrace "you build it, you run it"? Why not have a dedicated SRE team? Our team is a very close-knit team where the developers work in close conjunction with the product managers and business sponsors because this is a financial suite. So we have to work with our product managers and our end users to listen to their feedback. We want to have ownership and accountability and get that user feedback to constantly improve our systems. And that makes sense in this particular model, where we can quickly adapt to user feedback.

Second, we want improved fungibility between the different teams. We want any application developer to be able to work on any tech stack. And by supporting applications, that is the best way they can learn the system. So we want to improve the fungibility, and that's one advantage of supporting the applications that we build.

And then the third is that you want to improve team communication. A junior reaches out to a senior to ask for help, and you improve your communication skills by writing up root cause analyses.

And then the most important is to get actual feedback. Feedback from users is really useful, and you get that for free when you work on application support.

Those are some of the reasons why we as leaders want to adopt "you build it, you run it." But are our developers happy with "you build it, you run it"? When we interviewed our developers, they were happy to support the applications that they have built. But when it comes to applications that they have not built, they are not very happy.

Let's talk about why application development is so attractive to the developers.

The first one is, you get a chance to work with an empty slate. It's new code. Instead of supporting applications that others have built, you get a chance to build something new from scratch. And you're in the flow, deeply engrossed, cranking out features. Your day is predictable. You have a goal. You're going to work on a few features, you're going to submit a PR and get it reviewed, and you know what your day is going to look like, and you're in the flow.

You contrast that with being on support, and your flow is basically nonexistent. You don't know what issues you're going to face. And you also have this cognitive overload by having to look at so many services.

This is a very common question which I ask my team: can you please look into this batch process? And all the instructions are there, laid out in Confluence. And then the developer is like, "Where do I even, how do I even search for this Confluence document?"

Those were some of the problems that we were trying to solve: how do we get application developers to enjoy application support and to understand its benefits?

For this, we had to have multiple retros with our product managers, our business sponsors, and our developers to see what can be done, what were the most common issues, and what can be done to improve them.

The first issue was that there are tons of issues coming in as user requests. It could be something like, "Can you please investigate why this batch process has failed?" Or, "Why is this data point missing?" Or, "Can you please make this fund active or refresh this scenario?" So multiple user requests come in at the same time, and the on-call developer does not know what to prioritize.

The number one thing that we implemented is joint ownership of production systems between the developers and the product managers. The product managers know the product in and out, so they can quickly assess whether something is a critical feature or something which can wait, and give that direction to the developers. That has really helped with prioritizing issues.

Then what we did is, we made our product managers tech savvy. All along, developers have been learning the business and the domain knowledge from our product owners. So now it's time for the developers to share the knowledge with our product managers.

My product manager is now an expert with the Chrome developer toolbar. He sees a screen on the UI, he knows which API is behind that particular call. He knows the request. He's able to see the request, the response, the amount of time it takes, and then able to do the level zero production support just using the developer toolbar.

This has not only helped with production support, even accepting a story is now made easier because now the product manager knows how to see whether the API and the UI are matching, and he's able to see the performance. So training our PMs on developer tools is really good.

Another example is ADA compliance. Getting our stories to be ADA compliant was a big problem because issues were always detected late in the game. By teaching the product managers how to use the axe plugin, they're able to detect whether a particular UI story is ADA compliant or not. Getting the product managers involved in our level zero production support has been really useful.

The second one is that we have too many manual tasks, and we want to reduce them. That's the biggest concern of developers. They get an issue from the user saying a particular chart is not showing the right data. How do they even start looking into that issue?

What we did was improve the documentation so that, for every user interface, we document which APIs are behind it, along with all the different Swagger endpoints. What is the request? What is the response? They're all laid out, and developers can try them out and see which API needs to be fixed.

Now that they know which API has to be fixed, or which database command has to be run, they have to do it in production, which requires break-glass access. And as we all know, we want to reduce access to production. The way to do that is by building a self-service admin tool or a developer portal for application support, where the most common tasks requested by users are all automated.

And so this has a dual benefit. You're not only reducing toil, you're also helping to reduce production access. These are some of the steps that we took for reducing the manual toil.
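A minimal sketch of the self-service idea, assuming a small registry of pre-approved tasks that records who ran what. The names `task`, `run_task`, `activate_fund`, and `AUDIT_LOG` are hypothetical and are not taken from the team's actual tool; a real portal would add authentication and approvals.

```python
from datetime import datetime, timezone

AUDIT_LOG = []  # hypothetical in-memory audit trail
_TASKS = {}     # registry of pre-approved support tasks


def task(name):
    """Register a function as an allowed self-service task."""
    def register(func):
        _TASKS[name] = func
        return func
    return register


@task("activate_fund")
def activate_fund(fund_id):
    # Placeholder for the real change; here it only reports what would happen.
    return f"fund {fund_id} marked active"


def run_task(user, name, **params):
    """Run a registered task and record the call, so no direct production access is needed."""
    if name not in _TASKS:
        raise ValueError(f"unknown task: {name}")
    result = _TASKS[name](**params)
    AUDIT_LOG.append({
        "user": user,
        "task": name,
        "params": params,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return result


outcome = run_task("on-call-dev", "activate_fund", fund_id="F123")
```

Only registered tasks can run, and every run leaves an audit entry, which is one way to get both the reduced toil and the reduced production access.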

The next point is, how do we track all the work which is spent on production support? You have to have a mindset change: production support is not a chore, it's a feature. Keeping the lights on, listening to the users, and answering their queries is not something to be taken lightly. It is a feature, and hence we have to allocate enough points for it.

We differentiate between planned work and unplanned work. Planned work would be something like running the month-end jobs. At the beginning of the month, you have a few batch jobs or reporting jobs, and those are all planned work. And you have well-documented processes for the planned work.

The unplanned work is the ad hoc requests that come in. We make sure that we are logging those in Jira and also linking them to the ServiceNow ticket, so that if somebody wants to look at what was worked on previously, they can find it through Jira.

These practices have really helped in documenting what steps developers took.

Also, I think we've had multiple sessions by Christophe on how we do blameless postmortem and incident management. So that's also some of the things that we have implemented, where you don't ask a question as to why something broke. You ask, "How did it even break in the first place? Why was there no unit test to catch it? And how can we prevent this from happening?"

And then incident management. Before an incident occurs, you want to have a good process documented for how you handle an incident: whom do you reach out to, and which teams do you escalate to? With all this documented, the developer knows what to do in case of an incident.

The next problem we had was alert fatigue: too many issues, too many alerts, and the developer doesn't know which alert to look at. For example, there is an error on the UI, and that error is caused by a service having a failure. And then that service is dependent on multiple services. You see all the services are now throwing alerts.

So we consolidated the alerts so that just the one user-facing error generates an alert, and everything else is recorded as an exception. You can use the trace ID to debug through all those exceptions, but you don't get alerted multiple times.

Alert consolidation was the first thing that we did.
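A minimal sketch of this kind of consolidation, assuming every error event carries the trace ID of the user request that triggered it. The `ErrorEvent` and `Alert` types, the `user_facing` flag, and the service names are hypothetical illustrations, not the team's Splunk or Datadog configuration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ErrorEvent:
    trace_id: str      # shared by every service call in one user request
    service: str
    message: str
    user_facing: bool  # True only for the error the user actually saw


@dataclass
class Alert:
    trace_id: str
    summary: str
    related_exceptions: list = field(default_factory=list)


def consolidate(events):
    """Return one Alert per trace_id instead of one per failing service."""
    by_trace = defaultdict(list)
    for event in events:
        by_trace[event.trace_id].append(event)

    alerts = []
    for trace_id, group in by_trace.items():
        # Prefer the user-facing error as the headline; fall back to the first event.
        headline = next((e for e in group if e.user_facing), group[0])
        others = [f"{e.service}: {e.message}" for e in group if e is not headline]
        alerts.append(Alert(trace_id, f"{headline.service}: {headline.message}", others))
    return alerts


events = [
    ErrorEvent("t-1", "ui", "chart failed to load", True),
    ErrorEvent("t-1", "portfolio-api", "upstream timeout", False),
    ErrorEvent("t-1", "pricing-service", "connection refused", False),
    ErrorEvent("t-2", "ui", "export failed", True),
]
alerts = consolidate(events)
```

Four error events across two user requests become two alerts. The downstream failures stay attached to each alert for debugging but do not notify anyone on their own.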

Second is classification of errors. There are two types of errors: actionable and informational. The actionable ones are something that you actually have to take an action on. So when we created our Splunk alerts, we would state whether you need to take a particular action, whether this is critical, or whether this is just informational.

An example of an actionable alert could be something like your database connections are running low, so please check your Datadog dashboard to see which service is causing that. Then the developer would go to the Datadog dashboard and probably restart a particular service. Those are the actionable alerts.

Contrast that to an informational alert, which is something like a Kubernetes pod is restarting, which is informational, but you don't have to take action on it because Kubernetes is self-healing. Similarly, if a certain service is gracefully degrading, you don't need to take action right now. You can look at it later. By classification of alerts and errors, we are able to reduce the noise.
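A minimal sketch of this classification, assuming a simple lookup table from alert name to class and next step. The alert names and the choice to treat unknown alerts as actionable are assumptions made for illustration.

```python
from enum import Enum


class AlertClass(Enum):
    ACTIONABLE = "actionable"
    INFORMATIONAL = "informational"


# Hypothetical rule table: alert name -> (class, suggested next step).
RULES = {
    "db_connections_low": (
        AlertClass.ACTIONABLE,
        "Check the dashboard to find the service holding connections.",
    ),
    "pod_restarted": (
        AlertClass.INFORMATIONAL,
        "No action needed; the pod restarts on its own.",
    ),
    "service_degraded_gracefully": (
        AlertClass.INFORMATIONAL,
        "Review later.",
    ),
}


def classify(alert_name):
    # Assumption: unknown alerts default to actionable so nothing new is silently ignored.
    return RULES.get(alert_name, (AlertClass.ACTIONABLE, "Unclassified alert; triage manually."))


def should_notify_on_call(alert_name):
    alert_class, _ = classify(alert_name)
    return alert_class is AlertClass.ACTIONABLE
```

With a table like this, `should_notify_on_call("pod_restarted")` is false while `should_notify_on_call("db_connections_low")` is true, so only alerts that need a human reach the on-call developer.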

The next is to know what good looks like. We have training videos which show what a good process looks like, so that the developer can see all the steps that a good process goes through. Then, when an error occurs, they're able to quickly identify where the break is. So knowing what good looks like is very important.

Lastly, we had these weird user errors which can only be seen in production and which nobody can reproduce under normal circumstances. We use Datadog session replay to actually replay the session, see what actions the user took, and debug the error using that. More often than not, it's a back-button issue. And then you have to take the steps to fix that.

So alert fatigue was also reduced by following these steps. These were some of the practices. There are many more. In the interest of time, I just wanted to stick to a few of these.

Now let's go to the lessons learned. These are the mistakes that we've made and the lessons that we've learned, and I would like to share them with you.

The first one is, do it right the first time. In a hurry to get the MVP out, we cut corners, but then you end up paying the price in production support. When building features, it's always important to keep application support in mind because you need to know whether this is going to cost you more in production support. So that was the first lesson.

The next is, let the on-call rota team figure out the issue. Don't try to solve the issue for them because then they're always going to expect you to jump in and try to help. Follow the protocol. Let the on-call rota team do their work, and then if they need help, they will reach out to you.

Jumping in is something which I always do, and I'm trying to learn to step back and let the team follow the protocol.

The next is to pick the right tech stack for the job. This has happened to us, that a developer has picked something new and shiny and then put it into production. But when it comes to supporting it, the team is not comfortable with it or is not familiar with it.

So now when we want to try out something new, we always ask these three questions. One is, are the developers interested in learning this and supporting it? Two, is this technology going to stand the test of time? Because, as we saw yesterday, open source funding can stop, and then you'll have to find something new. And third is, is this technology approved in JPMorgan?

Very often, it's better to stick to the JPMorgan-approved tech stack because they have blueprints and best practices. So it's better to stick to the tried and tested technologies.

Next is to keep our applications evergreen. What this means is that you want to keep your applications green, meaning you can run them through the pipeline at any time to deploy a hotfix.

JPMorgan actually has a guardrail in place where it's mandated that every application is put through the pipeline at least once in three months. By doing this, you ensure that if there is any security vulnerability or any language update that needs to be made, the pipeline will fail until you fix it.

If a hotfix needs to be applied, you can now apply it through the pipeline. And that has helped us a couple of times because you never know when a zero-day vulnerability is declared, and you may have to run it through the pipeline. It's a good practice to have these guardrails in place and to also follow them.
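A minimal sketch of how such a guardrail could be checked, assuming you can list the date each application last went through the pipeline. The 90-day window is an approximation of "three months", and the application names and the `stale_applications` helper are hypothetical.

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=90)  # assumption: "three months" approximated as 90 days


def stale_applications(last_pipeline_run, today):
    """Return names of applications whose last pipeline run is older than MAX_AGE."""
    return sorted(
        name for name, ran_on in last_pipeline_run.items()
        if today - ran_on > MAX_AGE
    )


# Hypothetical data: application name -> date of last successful pipeline run.
last_runs = {
    "fund-compare": date(2023, 9, 1),
    "report-batch": date(2023, 5, 1),
    "target-date-tool": date(2023, 7, 3),
}
overdue = stale_applications(last_runs, today=date(2023, 10, 1))
```

Here only `report-batch` is overdue. `target-date-tool` was run exactly 90 days before the check date, so it is still within the window.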

Next is about metrics. We all want to gather user feedback and the important metrics like your SLIs, SLOs, and SLAs. But it's also important not to be overly concerned with them, because if you are only measuring for the sake of showing it in a dashboard, it doesn't help.

So we only gather metrics for the sake of improvement. You want to measure something only if measuring it is going to improve it. Otherwise, it makes no sense to just have it on a dashboard.

And also to be aware of the observer effect. If developers know that you're going to measure them on their error budget, they may try to fudge the error budget, exclude a few error codes. So it's always important to not put too much focus on the metrics. It's important to measure it, but don't try to compare teams and other applications for the metrics.
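A small worked example of how excluding error codes can distort an error budget. The 99.9% target, the request volume, and the failure counts are hypothetical numbers chosen for illustration.

```python
def error_budget_remaining(total_requests, failed_requests, slo):
    """Failures still allowed under the SLO; negative means the budget is overspent."""
    allowed_failures = round(total_requests * (1 - slo))
    return allowed_failures - failed_requests


TOTAL = 1_000_000
SLO = 0.999  # hypothetical target: 99.9% of requests succeed

# Hypothetical failure counts by HTTP status code.
failures = {500: 700, 503: 200, 504: 300}

honest = error_budget_remaining(TOTAL, sum(failures.values()), SLO)
# Dropping one error code from the count makes the same system look healthy.
fudged = error_budget_remaining(
    TOTAL, sum(n for code, n in failures.items() if code != 504), SLO
)
```

With every failure counted, the budget of 1,000 allowed failures is overspent by 200. With the 504s excluded, the same system appears to have 100 failures to spare, even though nothing about the user experience has changed.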

And then finally, it's most important to have the human connection and to encourage pairing, user feedback, and a growth mindset. Very often, the developer feels overwhelmed and says, "I can't do this production support. It's too much to learn." Always encourage them to have that growth mindset: today you may not be there, but you will get there with practice. That's important.

And then do not expect all developers to have the same liking for production support, and do not expect everybody to be at the same level. It's important to have that human connection and to give developers that freedom. So let them support to the level that they can.

Also, it's important to remember that this entire juggle of software development is not done by a single person. There's no hero. It's a team effort, with everyone working together to build and support the applications.

So what were some of our success stories by following these practices?

Number one is, the developers feel more relaxed when they're on call because they know that the issues are prioritized. Second is, business is happy with the speed of issue resolution because if they say something is a showstopper, they know people will work on it. So the business is also happy.

The developers and the PMs are happy with the increased skillset because developers are learning more skills from the PMs, and the PMs are learning skills from the developers. There's also increased empathy between the developers and the PMs.

We also follow up on the issues during our weekly handoff meetings. We make sure that we don't discuss just the issues for the week, but also the issues for the previous weeks to ensure that nothing falls through the cracks.

And then documentation is vastly improved because both product managers and developers are actively using it and actively contributing.

Those were some of the success stories.

Next, have we reached our final goal? There is still a lot more that we can do. The number one complaint from developers is that they're not able to search for the content they want in Confluence because the current Confluence search is a keyword-based search, and it only works on one particular dataset, which is Confluence.

This is something which I'm interested in working on. It's a POC which I work on during the weekends, and I'm hoping to implement it within JPMorgan when it gets approved. Basically, I'm trying to use AI to help in searching content. That means not only searching content across Confluence, because we also have a wealth of information in our service desk Jiras and ServiceNow tickets.

The idea is to combine the data sources from all these places, index them into Elasticsearch, and create the embeddings that will help with contextual search. Then you build a user interface where the user can type in a natural language query, which is vectorized and matched against the closest content.
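A minimal sketch of the search flow, assuming a toy bag-of-words vector as a stand-in for real embeddings and an in-memory list as a stand-in for the Elasticsearch index. A real embedding model would also match queries that share meaning but not words, which this stand-in cannot do. The sample documents and the `embed`, `cosine`, and `search` helpers are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical content pulled from several sources into one index.
DOCUMENTS = [
    ("confluence", "how to rerun the month end batch job"),
    ("jira", "fund missing from comparison screen after data refresh"),
    ("servicenow", "user cannot export portfolio report to pdf"),
]
INDEX = [(source, text, embed(text)) for source, text in DOCUMENTS]


def search(query, top_k=1):
    """Vectorize the query and return the closest documents across all sources."""
    q = embed(query)
    ranked = sorted(INDEX, key=lambda item: cosine(q, item[2]), reverse=True)
    return [(source, text) for source, text, _ in ranked[:top_k]]


best = search("batch job failed at month end")
```

The query shares the most terms with the Confluence runbook entry, so that entry ranks first even though the index also contains Jira and ServiceNow content.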

If this can work, it would really help in doing better search. Also, I think AI has got definite potential in many more areas of production support. People talk about AI in application development. I think it has a lot of potential in production support.

Finally, areas where I need help. Two areas. If you are using AI in your current role to help in production support, I would like to hear from you. And secondly, if you are also following "you build it, you run it," and your team is calm and composed, just like this person, I would like to hear from you and get your ideas.