Conversation
adcroft
commented
Jun 2, 2025
- Moved MOM_ANN.F90 to src/framework/
- Removed unused modules
- Removed unused MOM_memory.h
- Added input and output means which default to 0 and do not need to be present in the weights file
- Gave defaults to means, norms, tests so that they do not need to be present in file
- Added missing array notation "(:)"
- Minor formatting
- Added ANN_allocate, set_layer, set_input_normalization, and set_output_normalization methods to allow reconfiguration during unit tests
- Added ANN_unit_tests with some simple constructed-by-code networks with known solutions
- Added config_src/drivers/unit_tests/test_MOM_ANN.F90 to drive unit tests
- Added config_src/drivers/timing_tests/time_MOM_ANN.F90 as a rudimentary driver for timing inference
- Added module dox
- Renamed _v1, _v2 etc to labels
- Added interface, ANN_apply, to span ANN_apply_vector_oi and ANN_apply_array_sio
- Alternate versions of code temporarily existed for purpose of evaluating performance
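The constructed-by-code unit-test idea above can be sketched compactly. Below is a minimal Python analogue (not the Fortran in the PR; names and the two-layer identity network are illustrative assumptions): hand-built weights with a known solution, so the test can assert an exact expected output.

```python
def relu(v):
    # ReLU activation applied elementwise to a vector
    return [max(x, 0.0) for x in v]

def matvec(A, x, b):
    # y = A x + b for a row-major list-of-lists matrix A
    return [sum(a * xi for a, xi in zip(row, x)) + bi
            for row, bi in zip(A, b)]

def mlp_apply(layers, x):
    # Apply an MLP given as a list of (A, b) pairs; ReLU between
    # hidden layers, linear final layer
    for i, (A, b) in enumerate(layers):
        x = matvec(A, x, b)
        if i < len(layers) - 1:
            x = relu(x)
    return x

# A network constructed in code with a known solution: identity
# weights and zero biases, so non-negative inputs pass through unchanged.
eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
zeros = [0.0, 0.0, 0.0]
layers = [(eye, zeros), (eye, zeros)]
print(mlp_apply(layers, [1.0, 2.0, 3.0]))  # [1.0, 2.0, 3.0]
```

The same pattern (build weights in code, check against a hand-computed answer) is what a unit-test driver can loop over for several small networks.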
- Adds inference operating on an array (instead of a single vector of features)
- Implements several different versions of inference with various loop orders
- Involves storing the transpose of A in the type
- Tested by checking that inference on the same inputs is identical between variants
- Added randomizers to assist in unit testing
- Adds timing of variants to config_src/drivers/timing/time_MOM_ANN.F90
- Adds an interface (ANN_apply) to select the preferred version of the inference subroutine
- Added command line args to time_MOM_ANN.F90 to allow more rapid evaluation of performance
Variants explored, timed with gfortran (13.2) -O3 on a Xeon (ratios are run times relative to vector_v1):
- vector_v1:
- original inference from Pavel
- vector_v2:
- allocate work arrays just once, using widest layer
- loop over layers in 2's to avoid pointer calculations and copies
- speed up, x0.8 relative to v1
- vector_v3:
- transpose loops
- slow down, x1.54 relative to v1
- vector_v4:
- transpose weights with same loop order as v1
- slow down, x1.03 relative to v1
- array_v1:
- same structure as v2, working on x(space,feature) input/outputs
- speed up, x0.41 relative to v1
- array_v2:
- as for array_v1 but with transposed loop order
- apply activation function on vector of first index while in cache
- speed up, x0.35 relative to v1
- array_v3:
- same structure as v2, working on x(feature,space) input/outputs
- speed up, x0.58 relative to v1
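The loop-order question the variants explore can be illustrated with a small sketch. This is pure Python with tiny illustrative sizes, not the Fortran in the PR (and Fortran is column-major, so which index should be innermost differs there); it shows the two accumulation orders that storing A versus its transpose makes natural:

```python
def apply_rowmajor(A, x):
    # Inner loop runs over the input index: each output element
    # is a dot product of a row of A with x
    return [sum(A[j][i] * x[i] for i in range(len(x)))
            for j in range(len(A))]

def apply_transposed(At, x, nout):
    # Inner loop runs over the output index: accumulate scaled
    # columns of A (rows of At), touching y contiguously
    y = [0.0] * nout
    for i, xi in enumerate(x):
        for j in range(nout):
            y[j] += At[i][j] * xi
    return y

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 outputs, 2 inputs
At = [[1.0, 3.0, 5.0], [2.0, 4.0, 6.0]]    # transposed copy of A
x = [1.0, 1.0]
print(apply_rowmajor(A, x))        # [3.0, 7.0, 11.0]
print(apply_transposed(At, x, 3))  # [3.0, 7.0, 11.0]
```

Both orders give identical results, which is exactly the check the PR uses across variants; only the memory access pattern, and hence the timing, differs.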
- Added module dox
- Renamed _v1, _v2 etc to labels
- Added ANN_apply_array_sio to ANN_apply interface
- Replaced "flops" with "MBps" in timing output
@Pperezhogin There's a suggestion that we stop using "ANN" to describe MLPs or FFNNs. Perhaps we should change the names in the code now, before it becomes part of the main branch?
@adcroft By using "ANN", we are unlikely to mislead anybody. It is reasonable to assume that the most general term, such as ANN, implies the most common architecture, which is MLP. In applied science, I believe a major problem of ML is the number of newly emerging terms, while the actual neural-net architectures are not that diverse. For example, I would be happy if people referred to "FNO" as "ANN in Fourier space"; that way a lot of confusion could be eliminated. So, if we stick to only one term ("ANN") instead of producing a multitude of them to refer to the same thing, it is good for us. The opinion of the purely-ML community on this topic may be different, but I assume we are not part of that community. In terms of software, using "MLP" may be reasonable if we want to implement 10+ architectures in the future and want to clearly distinguish between them. However, even in this case, "ANN" can hardly be misunderstood as anything other than MLP. One can take some inspiration in naming conventions from pytorch: that package uses none of the terms "ANN", "CNN", "MLP", "FFNN", "FNO". This suggests that the actual neural-net architecture and its software implementation are different things. Let's discuss in person.
src/framework/MOM_ANN.F90
Outdated
  real, dimension(CS%layer_sizes(CS%num_layers)), &
-   intent(out) :: y !< output [arbitrary]
+   intent(inout) :: y !< output [arbitrary]
Just curious, can this change (out to inout) be explained?
Even though all the old values of y are ignored, it ensures the same bit of memory is being used. We've found empirically that the compiler sometimes decides to use new memory and then copy the result which is slower. inout seems to be faster. 🤷
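The intent(inout)-for-reuse point has an analogue in other languages: writing results into a caller-owned buffer avoids a fresh allocation (and a possible copy-back of a temporary) on every call. A hypothetical Python sketch of the two patterns (names are illustrative, not from the PR):

```python
def infer_alloc(A, x):
    # Allocates a new output list on every call: the analogue of the
    # case where the compiler builds a temporary and copies it back
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def infer_inplace(A, x, y):
    # Reuses the caller's buffer y (analogue of intent(inout)): the
    # same memory is written on every call, so no per-call allocation
    for j, row in enumerate(A):
        y[j] = sum(a * xi for a, xi in zip(row, x))

A = [[1.0, 0.0], [0.0, 2.0]]
x = [3.0, 4.0]
y = [0.0, 0.0]          # allocated once, reused across calls
infer_inplace(A, x, y)
print(y)                # [3.0, 8.0]
```

In a tight inference loop the in-place form lets the hot buffer stay resident in cache, which is the empirical effect described above.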
src/framework/MOM_ANN.F90
Outdated
  public set_layer, set_input_normalization, set_output_normalization
  public ANN_random, randomize_layer

  !> Applies linear layer to input data x and stores the result in y with
src/framework/MOM_ANN.F90
Outdated
  !! of size A(output_width, input_width) [nondim]
  real, allocatable :: b(:) !< bias vector of size output_width [nondim]
  real, allocatable :: Atranspose(:,:) !< Matrix in column-major order
  !! of size A(output_width, input_width) [nondim]
Atranspose does not seem to be used in the latest commit
Good catch. Removed (force push).
  tstd = tstd - tmean**2 ! convert to variance
  tstd = sqrt( tstd * real(nsamp) / real(nsamp-1) ) ! convert to standard deviation
  flops = ANN%parameters / tmean
  words_per_sec = ANN%parameters / ( tmean * 1024 * 1024 )
Seems that this metric changed its meaning. Before, it was measuring the number of operations per second (flops), and now it is the memory throughput. I am not sure which metric is the most relevant, as it is unclear a priori whether this code will be compute-bound or memory-bound. Is this words_per_sec metric supposed to be compared to L1/L2/L3 cache throughput?
I find Gflops to be the somewhat more relevant metric, as its range on a single CPU core is clearly defined: from approximately 0.5 Gflops for scalar operations (typically, in scalar code there are ~5 service operations per floating-point operation; matmul benchmark) up to approximately 50-100 Gflops for FMA instructions in the longest vector registers, assuming no memory transfer between registers and cache. A typical ocean model has 3 Gflops performance on average, which is much better than scalar code but still far from the compute bound. I would say a metric of success for the ANN module is to be more efficient than the ocean model on average, i.e. to be in the range of 3-100 Gflops.
Maybe, if we want Gflops, we need to estimate the number of floating-point operations, which for matmul is not the number of parameters in the matrix but approximately twice that (one add and one multiply per element).
As @marshallward pointed out, what I'm calculating is the number of words for storage. When reporting Gflops, I had made the assumption that this was likely the number of multiply-adds, but as you say there is ambiguity in whether an FMA should count as one or two ops. Switching to memory processed avoided the ambiguity but in truth, it's probably better to just return the times (as we do in the other tests).
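The "approximately twice the parameter count" flop estimate for an MLP can be written out explicitly. A small sketch with hypothetical layer sizes (not from the PR), counting one multiply and one add per weight plus one add per bias, and ignoring activation cost:

```python
def mlp_params(layer_sizes):
    # Weights plus biases, summed over consecutive layer pairs
    return sum(nin * nout + nout
               for nin, nout in zip(layer_sizes, layer_sizes[1:]))

def mlp_flops(layer_sizes):
    # One multiply and one add per weight (2*nin*nout), plus one
    # add per bias element (nout), per layer
    return sum(2 * nin * nout + nout
               for nin, nout in zip(layer_sizes, layer_sizes[1:]))

# e.g. a 2-16-16-1 network (sizes are illustrative)
sizes = [2, 16, 16, 1]
print(mlp_params(sizes), mlp_flops(sizes))  # 337 641
# Gflops = mlp_flops(sizes) * inferences_per_second / 1e9
```

As the comment above notes, the flop count is close to, but not exactly, twice the parameter count (biases contribute only one op each), which is one source of the FMA-counting ambiguity.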
|
Hi @adcroft I am happy to accept this code with a few comments above. For the future (not this PR):
I tried to measure timing (on Perlmutter, AMDs) for the ANN with a typical use case. In this case we have: and so the acceleration of the new algorithm is about twice (note, it may depend on the cluster). I would say the inference time per grid point is very close to pytorch that was
I found that inference time of
Using
The fastest version of the code in
- Deleted variants of ANN that did not perform as well as the two versions that remain.