Modernization of NOAA’s Operational Weather Modeling Suite
The National Weather Service (NWS) provides weather forecasts for US and global communities using a suite of complex numerical models run on HPC platforms operated by the US Government's National Oceanic and Atmospheric Administration (NOAA). Weather Service products range from high-resolution hourly forecasts for the aviation community to months-long forecasts for resilience management. Our models have historically been developed internally and run on internal platforms.
Over the last five years we have embarked on an ambitious project to transform the development environment for these models from an in-house, stovepiped approach to one that embraces open community modeling. Our models are already benefiting from contributors beyond our traditional partners, even this early in the journey; one initial result is that implementing parallel compression algorithms reduced our model I/O footprint sixfold, allowing us to increase resolution and improve forecast skill.
In this talk I will briefly introduce our modeling suites and walk through our efforts to change NOAA’s model development culture and modernize our suites.
I will talk about the challenges we faced in embracing DevOps processes that include testing and development on a range of HPC platforms both within and outside NOAA.
Finally, as cloud adoption is speeding up, I will close with our efforts to successfully migrate models that require massively parallel distributed memory tasks with fast networks to cloud infrastructure.
Full transcript
The complete talk, organized by section.
Host Intro (Gene Kim)
It was such a pleasure to meet the next speaker, Dr. Arun Chawla, who I met through one of our programming committee members, Dr. Topo Pal from Fidelity Investments.
By training, Dr. Chawla is a computational oceanographer, but due to an incredible confluence of events, he is now spearheading a new way of working for scientists and engineers at the Environmental Modeling Center, which is part of NOAA, the National Oceanic and Atmospheric Administration, where he serves as Chief of Engineering and Implementation.
I admit that I'm at a loss to properly describe the amazing work that this group does because it impacts our lives in so many different ways. You've benefited from their work if you've ever gotten an inclement weather warning, such as a tornado or flood; if you've ever read or seen a news story about the predicted path of a hurricane; or if you've ever been on a boat or a plane, or ever gotten a package that's been shipped by one, because those vessels receive real-time data from this agency to enable them to avoid dangerous weather.
It was so amazing to learn about all these modern miracles that enable these capabilities, which in so many ways have shaped so much of the history of computation. Dr. Chawla will describe how a compelling event that happened 10 years ago set the stage for demanding a better, newer way of working, and what Dr. Chawla and a group of like-minded thinkers did about it. I know you'll appreciate this glimpse of the amazing work that they do and how they want to make it even better. Here's Arun.
Arun Chawla
Thank you for that very generous introduction, Gene. My name is Arun Chawla, and I appreciate this opportunity to talk about NOAA's operational weather modeling suite and the changes that we have made over the last seven to eight years.
I'm going to give you a brief introduction of what we do in weather modeling, the problems that we face, the direction that we decided to go, our success, and where we want to go next.
The National Weather Service is a federal agency that sits under NOAA, the National Oceanic and Atmospheric Administration. NWS's mission is to provide weather forecasts to save property and life. We are headquartered in the DC area, but our agency has many offices distributed across the continental United States, and we have partnerships with agencies worldwide. We gather and process data from around the globe.
Modern life depends on weather data, and our customers are correspondingly diverse. The aviation industry is one: anytime you take a flight, the flight plan includes the weather forecast that decides how things will go. You need weather data for search and rescue operations. Everyone has gotten severe weather alerts on their phones. Any packages that you receive go through a big shipping industry which relies on weather forecasts. Marine traffic looks at weather forecasts and decides which routes to take and how to make sure packages reach their destinations safely. Firefighting units rely on weather forecasts; the big fires we have seen so often in California are driven by strong winds, so firefighters look at weather forecasts when they are fighting fires. There is a big customer base, including recreation. The key part is that with such a big, diverse set of people that rely so much on weather forecasting, the infrastructure is critical, and failure is not an option for us.
For weather forecasting infrastructure, we rely on observations. We rely on big, complex numerical models that simulate the entire planet and need to run on big HPC platforms. Since we create a lot of data, we have to talk about data dissemination, storing that data, and archival. These are all part of the infrastructure required for weather forecasting. This is not a cheap endeavor that can be taken on easily.
When we talk about weather forecasting, there are three critical layers. One is observation data. Weather forecasting is an initial-value problem, so what you require is very good initial conditions, which are then carried forward by very complex models run on big supercomputers, generating forecasts that are disseminated across the world. On a daily basis, we have to get about two terabytes of observations. These observations include weather balloons launched by the agency at many sites, big satellites, and in situ observations in the ocean. They have to cover the entire globe because weather doesn't respect national boundaries. So we have big collaborations with the Europeans and the Australians, using all of their satellites, and we exchange data all the time.
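Why initial conditions matter so much in an initial-value problem can be shown with a toy system. The sketch below is only an illustration and is not one of the operational models: it uses the three-variable Lorenz-63 equations as a stand-in for the atmosphere, runs two forecasts whose starting states differ by one part in a hundred million, and records how far apart they drift. The function names, step size, and perturbation size are all assumptions chosen for the example.

```python
# Toy illustration only: Lorenz-63 stands in for a real weather model.

def lorenz_rhs(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Time derivative of the three-variable Lorenz-63 system."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)


def rk4_step(state, dt):
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    def shifted(base, slope, scale):
        return tuple(b + scale * s for b, s in zip(base, slope))

    k1 = lorenz_rhs(state)
    k2 = lorenz_rhs(shifted(state, k1, dt / 2))
    k3 = lorenz_rhs(shifted(state, k2, dt / 2))
    k4 = lorenz_rhs(shifted(state, k3, dt))
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def separation_history(initial, perturbation, steps, dt=0.01):
    """Run two forecasts from nearly identical states and record their distance."""
    a = initial
    b = (initial[0] + perturbation, initial[1], initial[2])
    history = []
    for _ in range(steps):
        a = rk4_step(a, dt)
        b = rk4_step(b, dt)
        history.append(sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5)
    return history


if __name__ == "__main__":
    history = separation_history((1.0, 1.0, 1.0), 1e-8, steps=4000)
    print(f"separation after the first step: {history[0]:.2e}")
    print(f"largest separation seen:         {max(history):.2e}")
```

The two runs start out indistinguishable and end up on entirely different parts of the attractor, which is the toy-scale version of why so much effort goes into gathering observations before each forecast cycle.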
At the core of a weather forecast is a big HPC computer. These are the statistics of the latest computer that we have: big nodes packed together with lots of cores, and we need a high-speed interconnect that can actually move data fast between these because we're using these to run big, complex numerical models.
One thing we learned the hard way is that you cannot do it with one supercomputer. There was a news item that came out in 1999 where the National Weather Service's only supercomputer caught fire. That was a crazy four months where we had to move all of our forecasts onto smaller backup systems and essentially try to get a new computer back online.
From then on, we've realized that at any given time you need two HPC platforms, completely identical, placed in two different parts of the country, so that if one goes down because of technical or weather reasons, the second one can do the forecast. We essentially work on moving data from one to the other between forecasts. We run four forecast cycles a day, so between one cycle and the next we have about five to six hours, and within an hour we can move our entire operation from one machine to the other. The two platforms will always be linked to each other. They will be exactly the same architecture and everything works the same. It's not just one supercomputer. When we buy, we buy two.
This is a brief introduction about what weather forecasting is and what we do with numerical models. I'm going to go over the problem that we were facing.
Our problem was known to us for a while, but it really culminated in Hurricane Sandy. Hurricane Sandy was a big event that hit the eastern United States in 2012. It was one of the biggest hurricanes around the time. The white track on the slide is the actual track of the hurricane: it came up the East Coast and then turned left. It made landfall between New Jersey and New York. Wall Street was flooded. It was a big event.
Very early in the hurricane, we were all tracking it and trying to simulate it. When the hurricane was still off the coast of Florida, we had two models: the European model and our model, the American one, which was in blue. The European one tracked that the hurricane would hit land and make landfall, whereas the American one would say that it was going to go off into the ocean.
In subsequent forecasts, a day or two later, both the American and European models were tracking exactly where it would fall, and they became very accurate. But the big miss that we had in the early stages galvanized opinion. It got everybody interested in how the American model system lost it so early compared to the European one. That essentially was an eye-opener for Congress, which decided that they needed to fund weather modeling, and it was an eye-opener for all of NOAA to look at our models, look at our close race, and ask how we not only improve the model but also change the way we do modeling.
When we look at numerical guidance that the weather forecast agency gives out, America is a big country with all kinds of weather events. We get tornadoes, flooding, hurricanes, and we have to worry about climate change. Our forecast products go from minutes, when we're talking about dispersion, to hours when we're giving tornado warnings, to days where we're looking at a slow-moving hurricane, or even seasons and years when we're giving out a big outlook as to how the hurricane season or the El Niño/La Niña cycle might go. We have a range of forecasts that come out with a range of products. Of course, the further out you look into the future, the greater the uncertainty; that's always the case.
By about 2012, slowly over time, we had built this complex quilt of numerical models. Each box was a model talking about a particular thing. For example, the Global Forecast System was our global base engine that provided boundary conditions for high-resolution regional nests run over the continental United States, which in turn provided boundary conditions for very high-resolution local areas. We're looking at air quality, smoke dispersion, and things like that. Over time we built all of these models. They were all talking to each other, running behind each other in small and big jigsaw puzzles. This became known as our quilt.
The problem was that they had all been developed independently over time. We had fractured model development. The same software infrastructure was being repeated over different models. We had limited resources of engineers being spread across all the modeling systems. All of our model development used to happen in-house because we felt we needed to reinvent everything, and they would only run on NOAA platforms.
The end result was that if an academician or someone in private-sector industry wanted to work on a problem, they had to come into our HPC platforms and get accounts. Since these are considered FISMA High systems, they are heavily protected and you need to get clearance, so it was not easy. It was also not easy just to move these models out onto other platforms because they were not very portable.
Some of these models are really old. They cannot keep up with software upgrades. For example, I think one of our dispersion models was using Python 2, and when Python 2 was going to be removed in the new HPC platform, we were scrambling because nobody knew how to convert that to Python 3 easily. They did not know the code. In many cases there were multiple models that were very similar and doing just slightly different things. The models were not able to keep up with science, and the problem felt like we had to either innovate or we were going to keep getting left behind and perish.
It was not just that we had too many models. We had a very convoluted way of getting innovation into our operational systems. It was dubbed the Valley of Death because new ideas would go to die in the drawn-out process of transitioning models into operations. Any agency, research lab, or anybody that wanted to work would come up with an idea. They would take a copy of the code, figure out how to run it on their platform, push in a way, and then send the idea to us. We would have to try to code it up, run it in an operational environment, and tell them that their idea didn't work because it was too expensive or too difficult. It could take hours and everything.
My wife sent me the sketch here, which I thought was fantastic because it embodied what would happen. We had hacks that we thought we would need to put there until a good solution came out, and those hack jobs would become permanent and stay for 10 or 15 years. Great ideas would just go to die because we did not have enough people to work on them, or the person who came up with the idea had no way to put that in the codes and test it out in our environment. That was becoming a big problem for us to innovate.
The vision was that we would try to build a unified model system that would clean up our range of multiple different models and build a community -- not just build a new model, but build a community that can change the way we do model development in NOAA.
The modernization plan was to build a modeling system that would share common infrastructure. It would unify across different scales -- from hours to days, maybe even seasons -- and it would couple across different components. That's another thing we've learned in the last 15 to 20 years: what happens in the atmosphere is also dependent upon the ocean. We've heard the stories that the ocean is warming up, and that's providing all the energy that makes storms bigger and stronger. The idea was that we would build basically a single model to rule them all. We would remove legacy models from operations and, at the same time, not rely on all the people being under one agency. We would grow the pool of development partners and remove the artificial barriers that being part of an agency puts in place.
The simple goal was to first take this complex system of independent models that talk to each other and make a system that still looks complicated because it's made up of many modular pieces, but they're all unified. This big piece of component systems can be configured to be either the short-range, hurricane, global, or seasonal system -- in a sense, make the same products that we were getting from the previous one, but have the core of that system be one common block that we could all work with together. The advantage would be that if we came up with a new concept or new software infrastructure for, say, the global, it could be applied for hurricane, regional, and all the other systems without having to recode all the pieces.
But the plan was not just to build another model, because if we just created another model and continued to do things the way we had been doing them for the last 10, 15, 20, or 30 years, we would face the same problem again. It's just that the clothes would be changed. The idea was to shift the paradigm of model development. We would go from in-house, where only a few people understand, to a whole community system, so that people who are working in NOAA, in the private sector, and in academia could work together at the same time.
To do that, we also wanted to modernize the process. Some of the things that scientists don't do very well are documentation and testing from the start. We were pushing a lot that all documentation would have to happen early. We would start doing continuous integration and continuous development. We also wanted to do modular development and make sure that the pieces we build are very portable and can be tested on multiple platforms at the same time. The big picture, our gold standard, was to reduce the time from innovation to operation and also build and nurture a community.
The goal was to design a system around community development needs. The complicated part is that you've got a coupled system: ocean, waves, ice, atmosphere. How do we get all of these to talk to each other? The idea was to leverage existing communities and minimize disruption. Some of these modeling systems already had open communities existing. We said, we're not going to change the way you do it. What we're going to do is add a translation layer on top of your model. In essence, it comes down to what we call a cap. It is external code that says, these are all the variables that the ocean understands; we'll translate that to what a coupled model expects it to be, and then do it vice versa.
That way the individual communities that have been created can still continue to work, and we bring everybody together. We leveraged already existing communities. Where no community existed, we created one; the atmospheric modeling system did not have an open community, so we helped create that. We worked with NASA to create a community around the GOCART model that they have. We did that over time.
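The cap idea described above, a thin layer of external code that renames a component's variables into what the coupled system expects and back again, can be sketched in a few lines. This is a minimal illustration and not the actual coupling code: the field names and function names are hypothetical, and a real cap would also have to deal with units, grids, and timing.

```python
# Hypothetical field names for illustration; a real cap also handles units, grids, and timing.
OCEAN_TO_COUPLED = {
    "sst": "sea_surface_temperature",
    "ssh": "sea_surface_height",
    "sss": "sea_surface_salinity",
}
COUPLED_TO_OCEAN = {coupled: ocean for ocean, coupled in OCEAN_TO_COUPLED.items()}


def translate(fields, name_map):
    """Rename every field using name_map; fail loudly on a name the cap does not know."""
    unknown = sorted(set(fields) - set(name_map))
    if unknown:
        raise KeyError(f"cap has no translation for: {unknown}")
    return {name_map[name]: value for name, value in fields.items()}


def export_from_ocean(ocean_fields):
    """Ocean model names -> names the coupled system expects."""
    return translate(ocean_fields, OCEAN_TO_COUPLED)


def import_to_ocean(coupled_fields):
    """Coupled system names -> names the ocean model understands."""
    return translate(coupled_fields, COUPLED_TO_OCEAN)


if __name__ == "__main__":
    exported = export_from_ocean({"sst": 290.5, "ssh": 0.12})
    print(exported)
    print(import_to_ocean(exported))
```

The point of the design is that the ocean model's own code and naming never change; only the mapping tables in the cap know about both sides.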
Our challenges can be divided into three parts. They were cultural: people were ambivalent about community development. Scientists can be very secretive sometimes, and we had to explain that people aren't trying to steal your ideas. One of the biggest selling points was that if you put all your models on GitHub, then everything that has been put in place is already available. If you wanted to challenge somebody for stealing your idea, it would be very easy to see in the repository. That helped, but it was difficult. These are scientists, not software engineers. They have worked for 15 or 20 years just doing their own stuff and not working in groups, and that's been a change. The younger ones are more used to it. The older ones have been harder to convince, but we've been working on them for about three to four years now. We've done many workshops and talked about things.
The technical aspect is that we didn't control everything in one shop. When you control everything in one shop, it's easier to pass rules. But when components are distributed across different agencies -- for example, GOCART is an aerosol model developed primarily by NASA, and the ice model has been developed primarily by the Department of Energy -- you have to build a coalition of the willing. That involved a lot of conversations. We had a hands-off approach. We let people build their own rules of development and then worked together to see how this could work across.
The third challenge is scientific. When we are trying to do continuous integration and continuous development, sometimes there is no correct answer for science. There is a nonlinear feedback loop. You change something in the ocean and then the atmosphere can react in very slow and unpredictable ways. The only way to find that is over long, big experiments, and that's one of the things that we keep working on. There is no easy "here's a test" that says you've got the right answer. We have to keep doing some of these things.
Our approach has been collaborative. We had many of these challenges and said, okay, we're going to distribute it into four bits. First we're going to look at our development community, which is divided across NOAA and our partners in the Navy and NASA, and see if we can combine and make a unified development community, without worrying about how that will come into operations. That's what we've been focusing on: build a community that can work together from day one and can rapidly bring changes in. Then we'll transition some of these core model components into operations in our old way, then try to unify our operations and development environments, which is a big challenge, and then finally, hopefully, create a true DevOps culture.
When building a unified development environment, we found that communication is key. We had a core team that we set up -- what we call the integration team, which reports to me. It would do the full-system testing, planning, and all of the approaches that we needed for code management. They would work with code managers for each of these components. Each component has a code manager associated with it, set up by the community it belongs to. Sometimes the funding came from NOAA directly, sometimes it was a collaboration between NOAA and the agency involved, and in other places the agencies involved saw the value and provided the resources themselves. We also have code managers for each application, who decide the science that needs to be done and what the biggest challenges are.
The entire team and all the code managers meet every two weeks and plan out all the upgrades that are going to come. This took a while to set, and it was a whole lot of collaboration, talking, and convincing people. One of the things that turned out to be very useful was that we had a core team of software engineers and said: we'll help you solve your problems. If the community, say WaveWatch, said they needed to set up a new compression algorithm, we would come in and help them code that into their model. What we found was that when you're willing to provide assistance, a lot of people become happier to be part of the group. That meeting every two weeks is absolutely critical for us. It helps us set the goal for the next two weeks and also helps us expand and see what the next five or six months of commits are going to be in the modeling system.
The other thing we did was set up a rigorous testing infrastructure. The HPC environment for NOAA is set up with two HPC platforms that we use for operations, one backup and one production. The backup is what we use for science development sometimes, but apart from that we also have multiple HPC platforms available for science and testing. In the past, some scientists would have access to one platform and others to another, but our modeling systems would not always be available on all the platforms, and people struggled to port model systems over.
From the start, when we set up this Unified Forecast System infrastructure, we said that for every commit we'll make sure all of the tests and regression tests are run on all the platforms at the same time. At any given time, if somebody wants to make a new change or work with the latest version of the model, they just need to download the latest code knowing fully well that it will already work. We also created a Docker container for the stack on AWS. Having a model available for scientists to run on their HPC platform of choice was a key ingredient that convinced a lot of people to work with us.
We pushed this automated testing. It took us almost a couple of years to get it done properly, and at any given time we run 600 tests in different configurations. Somebody puts in an upgrade and says, I've got a new idea, and pushes it; we make sure that it runs for the global configuration, the regional, the hurricane, so that nobody gets surprised. If it breaks down, we get the whole team together and figure out why it broke down. That has been a game changer for us.
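The testing approach described above, every change exercised in every configuration on every platform, amounts to running a test matrix. The sketch below is an assumption-laden illustration, not the actual test harness: the platform names are placeholders, and `run_test` is a hypothetical callback standing in for whatever submits a regression test and reports pass or fail.

```python
from itertools import product

# Configurations named in the talk; platform names are placeholders.
CONFIGURATIONS = ["global", "regional", "hurricane"]
PLATFORMS = ["onprem_a", "onprem_b", "cloud"]


def build_matrix(configurations, platforms):
    """Every configuration paired with every platform."""
    return list(product(configurations, platforms))


def run_matrix(matrix, run_test):
    """Run each (configuration, platform) pair and return the ones that failed."""
    return [(config, platform) for config, platform in matrix
            if not run_test(config, platform)]


if __name__ == "__main__":
    matrix = build_matrix(CONFIGURATIONS, PLATFORMS)

    # Stand-in runner: pretend the hurricane configuration breaks on the cloud platform.
    def fake_runner(config, platform):
        return not (config == "hurricane" and platform == "cloud")

    failures = run_matrix(matrix, fake_runner)
    print(f"{len(matrix)} tests run, failures: {failures}")
```

A change is accepted only when the list of failures is empty, which is what lets someone download the latest code and expect it to work on their platform.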
Another major game changer was essentially a common infrastructure stack. Our modeling systems had a series of libraries that we use, ranging from third-party libraries to in-house libraries. Because these are parallel-processor computers, the libraries can be built either serially or in parallel depending upon different compiler options. At any given time, none of the third-party libraries easily worked on different HPCs; they would be built differently, and that would be a big problem. Any time we needed to port these things, it could take anywhere between three weeks and a month depending upon having a subject matter expert available. We said, we can't do that.
We moved all our libraries to the Spack package manager, which comes out of Lawrence Livermore National Laboratory. It's essentially a package manager that helps you build everything. We set it up to be an automated build, and now about 140 packages can be built in less than an hour. You do a single command and that builds up the entire infrastructure. The entire infrastructure, from end-to-end modeling, can be set up in an hour on any HPC platform. We test it on multiple platforms that we have access to. We build it on macOS, on AWS, and on several of our on-prem HPC platforms.
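What makes a single-command build of a whole library stack possible is that the package manager knows which packages depend on which and builds them in a valid order. The sketch below illustrates only that ordering idea using Python's standard library; it is not how Spack is implemented or invoked, and the small dependency map is a simplified, illustrative example rather than the real stack.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Illustrative dependency map: each package lists what must be built before it.
DEPENDENCIES = {
    "zlib": set(),
    "mpi": set(),
    "hdf5": {"zlib", "mpi"},
    "netcdf": {"hdf5"},
    "forecast_model": {"netcdf", "mpi"},
}


def build_order(dependencies):
    """Return the packages in an order where dependencies always come first."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for position, package in enumerate(build_order(DEPENDENCIES), start=1):
        print(f"{position}. build {package}")
```

With the order computed automatically, porting to a new platform no longer depends on a subject matter expert remembering which library has to be built before which.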
This is a collaboration effort between NOAA, our lab, and JCSDA. We started with a small team reporting to me to build this out, and then other centers got excited about it and started adding their libraries into our package system. We now have about 40-plus contributors and it keeps growing every day.
We pushed this repository out about four years ago, and since then the growth has exploded. The slide is a simulation of how the repositories have grown over time. It shows people making contributions and individual files clustered together. We started with just the atmospheric component. Since then we've added 10 new component systems. Adding a new component system means making sure that it couples together and builds together, and also creating a repository for each one of those components in the open. Since we started doing this, looking at some GitHub stats, we've had about 315 new enhancements, 600 issues resolved, 175 development forks, and 66 contributors. My team is only about four people. People have leveraged their own work to work with this together because they've seen the value of doing all of this together.
Some of the early successes and testimonials that we have seen with this approach: around 2016, we went with the atmosphere-only implementation of the Unified Forecast System, not fully coupled. This was the first implementation. We used to run in operations with 64 vertical levels. We moved to 127 vertical levels, so we doubled the resolution, and that helps with the large underlying flows and addresses some of the problems we were having with Hurricane Sandy.
The problem we had was that at that point, we had an inbuilt binary data format for writing everything out and it had no compression. We had built it in-house, all the libraries. When we tried to put this in operation, it was going to take about seven terabytes of space and write in about 79 seconds. The operational community came to us and said, you don't have that much space, and if you take so much time to write your forecast out, it is doubling the time, and this will not work for us. They basically gave a thumbs down and said, you can't put this in operation.
At that point we started scrambling. We looked around and found there was another format system for file I/O, NetCDF, which was a community format system. It wasn't built in-house, and there was a large community to it. We reached out to those developers and told them about our problem. They got excited, started working with us, and the two teams tried out a few different ideas. In a couple of months we came up with a solution which reduced our file size to 1.3 terabytes, which was actually smaller than what we had when we were at just 64 vertical levels. We doubled the vertical levels and even reduced the file size, which operations loved, and our time was down to 34 seconds.
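For reference, the improvement implied by the figures quoted above can be worked out directly. The numbers come from the talk; the variable names are just for the example.

```python
# Figures quoted in the talk for the 127-level output.
size_before_tb, size_after_tb = 7.0, 1.3
write_before_s, write_after_s = 79.0, 34.0

size_ratio = size_before_tb / size_after_tb    # how many times smaller the output became
write_ratio = write_before_s / write_after_s   # how many times faster the write became

print(f"output size reduced by a factor of {size_ratio:.1f}")
print(f"write time reduced by a factor of {write_ratio:.1f}")
```

That is a reduction of roughly 5.4 times in size and 2.3 times in write time, obtained while doubling the number of vertical levels.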
We had to go into NetCDF, do compression and parallel I/O, and the NetCDF community was trying out new ideas with us. We were working with parts of the libraries that were not available for public release. They would release that out publicly once we had some ideas, and then we would test the public release with our models. They were testing in their environment, we were testing in the operational environment, and the whole thing was tested, released, and deployed in operations in under two months. That was a big success, without which we would never have had our initial operational capability with this modeling system.
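The combination of compression and parallelism mentioned above can be illustrated with a small standard-library sketch. This is not the NetCDF implementation and does not use the NetCDF API; it only shows the general idea that data split into independent chunks can be compressed by several workers at once and reassembled losslessly. The chunk size, worker count, and sample data are assumptions for the example.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_chunks(data, chunk_size, workers=4):
    """Split data into fixed-size chunks and compress them concurrently."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves chunk order, so the result can be reassembled directly.
        return list(pool.map(zlib.compress, chunks))


def decompress_chunks(compressed):
    """Reverse compress_chunks and return the original bytes."""
    return b"".join(zlib.decompress(chunk) for chunk in compressed)


if __name__ == "__main__":
    # Stand-in for a smooth model field: highly repetitive bytes compress well.
    field = b"\x00\x01\x02\x03" * 250_000
    compressed = compress_chunks(field, chunk_size=64 * 1024)
    packed_size = sum(len(chunk) for chunk in compressed)
    print(f"{len(field)} bytes -> {packed_size} bytes in {len(compressed)} chunks")
    assert decompress_chunks(compressed) == field
```

Because each chunk is compressed independently, the work scales across workers, and a reader can later decompress only the chunks it needs.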
The second capability is the concept of moving nests. When you're running hurricanes, hurricanes require very high resolution to resolve the core of the hurricane. But you can't have a model running only over the hurricane, and high resolution over the entire globe can be very difficult. The idea developed by a group in Princeton was building telescopic nests, so you could have higher-resolution nests that are talking to each other. A second group based in Miami worked on an algorithm that would move the highest-resolution nest with the hurricane, so you could follow the hurricane with the highest-resolving nest. A third group, partly reporting to me and partly working with NCAR, based in Colorado and DC, worked on asynchronous I/O. That means as the data is created, it will automatically send it out to different processes dedicated for writing, so you don't lose time by stopping the model to write output.
All four of these organizations, working together, had two HPC platforms for testing, and we were all working in the same repository and integrating it into the main repository in a few weeks. That was a big success, and everybody really liked the new way that we were able to work, test, and build that whole system out.
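The asynchronous I/O idea mentioned above, handing output to a dedicated writer so the model never stops to write, can be sketched with a queue. This is a simplified illustration under stated assumptions: it uses a single writer thread where the talk describes dedicated writer processes, and the "model step" and "write" are trivial stand-ins with hypothetical names.

```python
import queue
import threading


def writer_loop(outbox, written):
    """Dedicated writer: drain the queue until the None sentinel arrives."""
    while True:
        item = outbox.get()
        if item is None:
            break
        step, field = item
        written.append((step, sum(field)))  # stand-in for a slow write to disk


def run_model(steps):
    """Advance a toy model, handing each output to the writer without waiting for it."""
    outbox = queue.Queue()
    written = []
    writer = threading.Thread(target=writer_loop, args=(outbox, written))
    writer.start()

    field = [0.0] * 8
    for step in range(steps):
        field = [value + 1.0 for value in field]  # stand-in for one model time step
        outbox.put((step, list(field)))           # hand off a copy and keep computing

    outbox.put(None)  # tell the writer there is no more output
    writer.join()
    return written


if __name__ == "__main__":
    print(run_model(3))
```

The model loop only pays the cost of placing a copy of the field on the queue; the slow part of writing happens elsewhere while the next time step is already being computed.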
A few testimonials for us: we started with just a small handful of systems on GitHub in open repositories. We've now managed to convince almost all the agencies, and we have over 50 packages available in open development environments on GitHub. NOAA leadership has endorsed this idea of community development, and they have actually understood that you don't just fund science; you have to fund dedicated code managers. They have provided resources for us. It's never enough, but it's a good start.
One of the big things that we've found and really like is that we've moved from this paradigm of model development with the Valley of Death, where somebody would develop a modeling system in their own place and hand the code off to us and we would recode it. Instead, we have moved to all of us working off the same repository. People can make forks, test ideas out, and tell us about them. We get involved early on. We do the testing, make sure it works in all the HPC platforms, and look at the timing. That has improved the development environment.
Operations has always been a little off-key, but operations is suddenly very keen to adopt our Spack-Stack to equate the dev and ops environments. That's what we're having conversations about, and that's where we're headed next.
So where do we go from here? One of the things we've been looking at in building a community is that while we've got other NOAA players to play along, and they have access to NOAA HPC platforms, a lot of people who might be interested in working with us will not have access to our compute platforms and will probably not easily get clearance to get access. One pilot project was: can you actually run your entire system on the cloud? We did that with our regional forecasting system, and we found that on the cloud we get performance that is comparable, and even better -- about 15 to 20% better than on-prem systems.
This was a demonstration project where we not only ran a single model; we took part in a weather forecasting experiment where the models would run every day for about a month, and forecasters would look at the results and compare them to on-prem platforms to see how well they did. We've got a report coming out on it. The idea here was that this is a viable resource if people want to use it.
Where do we go from here? We transitioned to an open development environment. That was not easy to do. We had partnerships with multiple institutions, but we managed to pull that off. We've got faster development happening. We've got new infrastructure coming in. Our biggest challenges have not really been the technical aspects, but cultural: people changing the way they want to work, and getting operations to buy into this is still a work in progress. One of the biggest problems we have is that, because there is a safety-of-life issue, rapid change is something operations is not very keen on. But they see the value of having an environment that is completely parallel between ops and dev, and the operational team has started to work with us a little bit. The operational head is keen on what DevOps would mean.
What we would like to learn here is: we did a lot of this on our own. I read a few things online and got some ideas, but can we be more efficient? We have multiple models talking to each other, and we do a lot of testing on six or seven platforms. If there are ways we could do this faster or better, more people would be interested in doing it. We still get complaints from people that this is too complicated.
One of the things that we would love to learn about is how can we apply DevOps when safety of life is a big concern, when sometimes we need long experiments to show success of models, and when there are many downstream dependencies. That's all I have. Again, I thank you for your time and appreciate any questions that people might have. Thank you.