Optimizing WRF performance


These recommendations for optimizing the performance of the Weather Research and Forecasting (WRF) modeling system are based on numerous jobs that the CISL Consulting Services Group ran on the Yellowstone system. The jobs varied in domain size and time step, and they ranged from small runs up to runs on 4,096 nodes.

Comparable documentation for optimizing WRF performance on Cheyenne is not yet available.

Compiling and linking

We recommend using the default Intel compiler and the compiler's default settings as contained in the script for creating the configure.wrf file.

PGI and GNU compilers were tested and worked, but using the Intel compiler resulted in substantially better performance. We do not recommend trying to compile WRF with the PathScale compiler.

The best results were achieved with a Distributed-Memory Parallelism (DMPar) build, which enables MPI. Depending on the individual case, advanced WRF users may find some improvement in performance with a hybrid build, which combines DMPar with Shared-Memory Parallelism (SMPar, which enables OpenMP).

We do not recommend SMPar alone or serial WRF builds.

Runtime options

We recommend using the following when running WRF jobs:

  1. Hyper-threading
  2. Processor binding

Hyper-threading improved computing performance by 8% in a test MPI case using 256 Yellowstone nodes. In other cases, hyper-threading had negligible impact.

Tests of hybrid MPI/OpenMP jobs, both with and without hyper-threading, showed that hybrid parallelism can provide marginally higher performance than pure MPI parallelism, but run-to-run variability was high.

Processor binding was enabled by default when running MPI jobs on Yellowstone. We used the default binding settings in test runs.

Scaling and core count

WRF is a highly scalable model, demonstrating both weak and strong scaling, within limits defined by the problem size. We do not recommend running WRF on extremely large core counts (relative to the number of grid points in the domain), because the speed benefit of each added core diminishes as communication costs come to exceed the computational work per core. Extremely large core counts for WRF are defined as those for which cores > total grid points/10^4.

Weak scaling: When you increase the size of the WRF domain, you can usually increase the core count to keep a constant time to solution, provided that the core count is small enough that the time spent in I/O and initialization remains negligible. You will need to run some tests with your own settings (especially input files and I/O frequency and format) to determine the upper limit for your core count.
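As a rough illustration of the weak scaling guidance, the sketch below picks a core count for a larger domain by keeping the number of grid points per core about the same as in a run you have already timed. The function name and the example numbers are hypothetical, and the calculation ignores I/O and initialization time, so treat the result only as a starting point for your own tests.

```python
def weak_scaled_cores(base_points: int, base_cores: int, new_points: int) -> int:
    """Return a core count that keeps grid points per core roughly constant.

    base_points / base_cores describe a run you have already timed;
    new_points is the total grid point count of the larger domain.
    """
    if min(base_points, base_cores, new_points) <= 0:
        raise ValueError("all arguments must be positive integers")
    # Ceiling division in integer arithmetic, so the larger domain never
    # gets more grid points per core than the baseline run had.
    return -(-(base_cores * new_points) // base_points)


if __name__ == "__main__":
    # Hypothetical case: a 1,000,000-point domain timed on 64 cores,
    # then enlarged to 4,000,000 points.
    print(weak_scaled_cores(1_000_000, 64, 4_000_000))  # 256
```

The suggested count is only a candidate. It still needs to be checked against the upper limit set by your own I/O settings.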

Strong scaling: When running WRF on relatively small numbers of cores (namely cores < total grid points/10^5), the time to solution decreases in inverse proportion to increases in core count if the problem size is unchanged. In such a strong scaling regime, increasing only the core count has no performance downside. We recommend making some runs to confirm that you are using the optimal core count for your problem size.
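The two core count thresholds in this section can be turned into a quick check before submitting a job. The sketch below assumes the thresholds are total grid points/10^4 and total grid points/10^5, and that you supply the total grid point count for your own domain. The function names and the label "intermediate" for the range between the two thresholds are hypothetical. The second function computes parallel efficiency from two timed runs, which is one way to confirm whether a larger core count still pays off.

```python
def classify_core_count(total_grid_points: int, cores: int) -> str:
    """Classify a core count against the two thresholds in the text."""
    if total_grid_points <= 0 or cores <= 0:
        raise ValueError("total_grid_points and cores must be positive")
    # Integer comparisons avoid floating-point rounding at the boundaries.
    if cores * 10**4 > total_grid_points:
        return "extremely large"   # cores > total grid points / 10^4
    if cores * 10**5 < total_grid_points:
        return "strong scaling"    # cores < total grid points / 10^5
    return "intermediate"          # hypothetical label for the range between


def parallel_efficiency(base_cores: int, base_seconds: float,
                        cores: int, seconds: float) -> float:
    """Efficiency of a run relative to a baseline run; 1.0 is ideal scaling."""
    if min(base_cores, cores) <= 0 or min(base_seconds, seconds) <= 0:
        raise ValueError("core counts and times must be positive")
    # Ideal strong scaling keeps (cores * time) constant.
    return (base_cores * base_seconds) / (cores * seconds)


if __name__ == "__main__":
    points = 1_000_000  # hypothetical domain size
    for n in (5, 50, 500):
        print(n, classify_core_count(points, n))
    # Hypothetical timings: 100 s on 64 cores, 62.5 s on 128 cores.
    print(parallel_efficiency(64, 100.0, 128, 62.5))
```

An efficiency close to 1.0 suggests the run is still in the strong scaling regime. A value that drops well below 1.0 as cores are added suggests the core count is approaching the point where communication costs dominate.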

See WRF scaling and timing for more information.
