WRF scaling and timing

Scaling results | Run time results

The Consulting Services Group (CSG) ran numerous Weather Research and Forecasting (WRF) modeling system jobs on the Yellowstone supercomputer. The jobs covered a range of domain sizes and time steps, and ranged in size up to 4,096 nodes.

While comparable documentation for optimizing WRF performance on Cheyenne is not yet available, the Yellowstone results below may be helpful in answering such questions as:

  • "Is it possible to solve a problem with such-and-such resolution in a timely manner?"
  • "If I use more cores I will get results more quickly, but at this resolution will my run be in the efficient strong-scaling regime, an intermediate one, or the very inefficient one dominated by I/O and initialization rather than computation?"

The numbers in the figures below can help you develop back-of-the-envelope estimates of what will happen if you increase or decrease the core count of your runs, so you can find a count that is optimal both for time-to-solution and for your allocation.
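For example, a rough extrapolation from a single measured run might look like the following sketch. It assumes ideal (linear) strong scaling of the compute kernel, as described in the scaling results below; the timing numbers are hypothetical placeholders, not measurements.

```python
# Back-of-the-envelope extrapolation from one measured WRF run,
# assuming ideal strong scaling (compute time ~ 1/cores, no I/O overhead).
# The inputs below are hypothetical placeholders.

measured_cores = 512   # cores used in a test run
measured_hours = 6.0   # wall-clock hours of that test run

def ideal_strong_scaling(target_cores):
    """Estimate wall time and core-hour cost at a new core count,
    assuming perfect strong scaling of the computation."""
    wall_hours = measured_hours * measured_cores / target_cores
    core_hours = wall_hours * target_cores  # constant under ideal scaling
    return wall_hours, core_hours

for cores in (512, 1024, 2048):
    wall, cost = ideal_strong_scaling(cores)
    print(f"{cores:5d} cores: ~{wall:.1f} h wall time, ~{cost:.0f} core-hours")
```

Real runs will cost more than this at larger core counts, because scaling departs from linear and initialization and I/O overhead grow in relative terms.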

If you're preparing an allocation request, these plots provide some guidance, but you still need to run tests on Yellowstone or a comparable system to support your core-hour estimate for your own physics parameterization and to make sure you account for the overhead of initialization and I/O. (Different I/O settings and output frequency will affect your runs differently.)

Also see Optimizing WRF performance for related documentation and Determining computational resource needs for additional information on preparing allocation requests.

Scaling results

Figure 1 shows scaling results from the Katrina simulations at two different resolutions, along with the official CONUS benchmarks from http://www2.mmm.ucar.edu/wrf/WG2/bench/ at 12 km and 2.5 km resolution. For comparison, a New Zealand case with a different physics parameterization and a larger number of vertical levels is also included. When expressed this way, all of the cases scale similarly. Note that both axes are logarithmic, so a small distance between points corresponds to a large difference in values.

Figure 1 - WRF scaling

As you can see, there are three regimes:

  • large number of grid points per core - total grid points / core > 10^5 (small core count)
  • intermediate number of grid points per core - 10^4 < total grid points / core < 10^5 (intermediate core count)
  • small number of grid points per core - total grid points / core < 10^4 (large core count)
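These thresholds can be turned into a quick regime check. A minimal sketch (the function name is illustrative; the thresholds come from the list above, and the domain size from the Katrina 1 km case described later):

```python
def scaling_regime(total_grid_points, cores):
    """Classify a WRF run by grid points per core, using the
    rough 10^4 and 10^5 thresholds described above."""
    points_per_core = total_grid_points / cores
    if points_per_core > 1e5:
        return "strong scaling (efficient)"
    elif points_per_core > 1e4:
        return "intermediate"
    else:
        return "I/O- and initialization-dominated (inefficient)"

# Example: the Katrina 1 km case used roughly 372 million grid points.
print(scaling_regime(372e6, 1024))   # ~363,000 grid points per core
print(scaling_regime(372e6, 16384))  # ~22,700 grid points per core
```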

For a small number of cores, the WRF computation kernel is in a strong scaling regime. Increasing the core count will make the simulation go faster while consuming approximately the same number of core-hours (ignoring time spent in initialization and I/O). Time-to-solution will also depend on the wait in the queue, which may be longer for larger jobs.

For an intermediate number of cores, WRF scaling increasingly departs from linear strong scaling. Running the same simulation on larger core counts will require more core-hours even though it will still run faster (again, ignoring time spent in initialization, I/O, and wait in queue).

We do not recommend running WRF on extremely large core counts, because in this regime the speed benefits diminish, the time will be dominated by initialization and I/O (as well as wait in queue), and there will be larger core-hours charges for solving the same problem.

Most WRF jobs on Yellowstone use less than 4,096 cores.

Run time results

Figure 2 below shows the total run time for WRF jobs using increasing numbers of cores. Initialization time, computation time, and writing time are also shown for runs using up to 8,192 cores. Initialization and writing time rendered larger jobs impractical. Using a different output algorithm (CSG used the default) may yield better results for large jobs. Compare this to Figure 1, which was plotted using only the computing time (shown in Figure 2 as green triangles).

These results are based on simulations of Hurricane Katrina (2005) at 1km resolution.

As illustrated, initialization and output-writing times can exceed computing time at larger core counts. Times shown are for a single restart and a single output file of the Katrina 1 km case, whose domain had about 372 million grid points. With more restarts and output files your numbers will differ, but the trend will be similar.
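As a quick check against the regimes in Figure 1, you can compute grid points per core for this domain at the core counts shown in Figure 2 (using the approximate 372-million-point domain size quoted above):

```python
# Grid points per core for the Katrina 1 km domain at a few core counts.
total_points = 372e6  # approximate domain size from the text

for cores in (1024, 4096, 8192):
    print(f"{cores:5d} cores -> {total_points / cores:,.0f} grid points per core")
```

At 8,192 cores this works out to roughly 45,000 grid points per core, well inside the intermediate regime, which is consistent with initialization and I/O taking up a growing share of the run time.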

Figure 2 - WRF timing


Related training courses