Conversation
adcroft
commented
Jun 2, 2025
- Moved MOM_ANN.F90 to src/framework/
- Removed unused modules
- Removed unused MOM_memory.h
- Added input and output means which default to 0 and do not need to be present in the weights file
- Gave defaults to means, norms, tests so that they do not need to be present in file
- Added missing array notation "(:)"
- Minor formatting
- Added ANN_allocate, set_layer, set_input_normalization, and set_output_normalization methods to allow reconfiguration during unit tests
- Added ANN_unit_tests with some simple constructed-by-code networks with known solutions
- Added config_src/drivers/unit_tests/test_MOM_ANN.F90 to drive unit tests
- Added config_src/drivers/timing_tests/time_MOM_ANN.F90 as a rudimentary driver for timing inference
- Added module dox
- Renamed _v1, _v2 etc to labels
- Added interface, ANN_apply, to span ANN_apply_vector_oi and ANN_apply_array_sio
- Alternate versions of code temporarily existed for purpose of evaluating performance
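The constructed-by-code unit-test idea above can be sketched compactly. Below is a minimal Python analogue (not the Fortran in the PR; names and the two-layer identity network are illustrative assumptions): hand-built weights with a known solution, so the test can assert an exact expected output.

```python
def relu(v):
    # ReLU activation applied elementwise to a vector
    return [max(x, 0.0) for x in v]

def matvec(A, x, b):
    # y = A x + b for a row-major list-of-lists matrix A
    return [sum(a * xi for a, xi in zip(row, x)) + bi
            for row, bi in zip(A, b)]

def mlp_apply(layers, x):
    # Apply an MLP given as a list of (A, b) pairs; ReLU between
    # hidden layers, linear final layer
    for i, (A, b) in enumerate(layers):
        x = matvec(A, x, b)
        if i < len(layers) - 1:
            x = relu(x)
    return x

# A network constructed in code with a known solution: identity
# weights and zero biases, so non-negative inputs pass through unchanged.
eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
zeros = [0.0, 0.0, 0.0]
layers = [(eye, zeros), (eye, zeros)]
print(mlp_apply(layers, [1.0, 2.0, 3.0]))  # [1.0, 2.0, 3.0]
```

The same pattern (build weights in code, check against a hand-computed answer) is what a unit-test driver can loop over for several small networks.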
- Adds inference operating on an array (instead of a single vector of features)
- Implements several different versions of inference with various loop orders
- Involves storing the transpose of A in the type
- Tested by checking that inference on the same inputs is identical between variants
- Added randomizers to assist in unit testing
- Adds timing of variants to config_src/drivers/timing/time_MOM_ANN.F90
- Adds an interface (ANN_apply) to select the preferred version of the inference subroutine
- Added command line args to time_MOM_ANN.F90 to allow more rapid evaluation of performance
Variants explored, timed with gfortran (13.2) -O3 on a Xeon (ratios are run times relative to vector_v1):
- vector_v1:
- original inference from Pavel
- vector_v2:
- allocate work arrays just once, using widest layer
- loop over layers in 2's to avoid pointer calculations and copies
- speed up, x0.8 relative to v1
- vector_v3:
- transpose loops
- slow down, x1.54 relative to v1
- vector_v4:
- transpose weights with same loop order as v1
- slow down, x1.03 relative to v1
- array_v1:
- same structure as v2, working on x(space,feature) input/outputs
- speed up, x0.41 relative to v1
- array_v2:
- as for array_v1 but with transposed loop order
- apply activation function on vector of first index while in cache
- speed up, x0.35 relative to v1
- array_v3:
- same structure as v2, working on x(feature,space) input/outputs
- speed up, x0.58 relative to v1
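The loop-order question the variants explore can be illustrated with a small sketch. This is pure Python with tiny illustrative sizes, not the Fortran in the PR (and Fortran is column-major, so which index should be innermost differs there); it shows the two accumulation orders that storing A versus its transpose makes natural:

```python
def apply_rowmajor(A, x):
    # Inner loop runs over the input index: each output element
    # is a dot product of a row of A with x
    return [sum(A[j][i] * x[i] for i in range(len(x)))
            for j in range(len(A))]

def apply_transposed(At, x, nout):
    # Inner loop runs over the output index: accumulate scaled
    # columns of A (rows of At), touching y contiguously
    y = [0.0] * nout
    for i, xi in enumerate(x):
        for j in range(nout):
            y[j] += At[i][j] * xi
    return y

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 outputs, 2 inputs
At = [[1.0, 3.0, 5.0], [2.0, 4.0, 6.0]]    # transposed copy of A
x = [1.0, 1.0]
print(apply_rowmajor(A, x))        # [3.0, 7.0, 11.0]
print(apply_transposed(At, x, 3))  # [3.0, 7.0, 11.0]
```

Both orders give identical results, which is exactly the check the PR uses across variants; only the memory access pattern, and hence the timing, differs.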
- Added module dox
- Renamed _v1, _v2 etc to labels
- Added ANN_apply_array_sio to ANN_apply interface
- Replaced "flops" with "MBps" in timing output
@Pperezhogin There's a suggestion that we stop using "ANN" to describe MLPs or FFNNs. Perhaps we should change the names in the code now, before it becomes part of the main branch?
@adcroft By using "ANN", we are unlikely to mislead anybody. It is reasonable to assume that the most general term, such as ANN, implies the most common architecture, which is MLP. In applied science, I believe a major problem of ML is the number of newly emerging terms, while the actual neural-net architectures are not that diverse. For example, I would be happy if people referred to "FNO" as "ANN in Fourier space"; that way a lot of confusion could be eliminated. So, if we stick to only one term ("ANN") instead of producing a multitude of them to refer to the same thing, it is good for us. The opinion of the purely-ML community on this topic may be different, but I assume we are not part of that community. In terms of software, using "MLP" may be reasonable if we want to implement 10+ architectures in the future and want to clearly distinguish between them. However, even in this case, "ANN" can hardly be misunderstood as anything other than MLP. One can take some inspiration in naming conventions from pytorch: that package uses none of the terms "ANN", "CNN", "MLP", "FFNN", "FNO". This suggests that the actual neural-net architecture and its software implementation are different things. Let's discuss in person.
src/framework/MOM_ANN.F90
Outdated
  real, dimension(CS%layer_sizes(CS%num_layers)), &
-   intent(out) :: y !< output [arbitrary]
+   intent(inout) :: y !< output [arbitrary]
Just curious, can this change (out to inout) be explained?
Even though all the old values of y are ignored, it ensures the same bit of memory is being used. We've found empirically that the compiler sometimes decides to use new memory and then copy the result which is slower. inout seems to be faster. 🤷
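The intent(inout)-for-reuse point has an analogue in other languages: writing results into a caller-owned buffer avoids a fresh allocation (and a possible copy-back of a temporary) on every call. A hypothetical Python sketch of the two patterns (names are illustrative, not from the PR):

```python
def infer_alloc(A, x):
    # Allocates a new output list on every call: the analogue of the
    # case where the compiler builds a temporary and copies it back
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def infer_inplace(A, x, y):
    # Reuses the caller's buffer y (analogue of intent(inout)): the
    # same memory is written on every call, so no per-call allocation
    for j, row in enumerate(A):
        y[j] = sum(a * xi for a, xi in zip(row, x))

A = [[1.0, 0.0], [0.0, 2.0]]
x = [3.0, 4.0]
y = [0.0, 0.0]          # allocated once, reused across calls
infer_inplace(A, x, y)
print(y)                # [3.0, 8.0]
```

In a tight inference loop the in-place form lets the hot buffer stay resident in cache, which is the empirical effect described above.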
src/framework/MOM_ANN.F90
Outdated
  public set_layer, set_input_normalization, set_output_normalization
  public ANN_random, randomize_layer

  !> Applies linear layer to input data x and stores the result in y with
src/framework/MOM_ANN.F90
Outdated
  !! of size A(output_width, input_width) [nondim]
  real, allocatable :: b(:) !< bias vector of size output_width [nondim]
  real, allocatable :: Atranspose(:,:) !< Matrix in column-major order
  !! of size A(output_width, input_width) [nondim]
Atranspose does not seem to be used in the latest commit
Good catch. Removed (force push).
  tstd = tstd - tmean**2 ! convert to variance
  tstd = sqrt( tstd * real(nsamp) / real(nsamp-1) ) ! convert to standard deviation
  flops = ANN%parameters / tmean
  words_per_sec = ANN%parameters / ( tmean * 1024 * 1024 )
Seems that this metric changed its meaning. Before, it was measuring the number of operations per second (flops), and now it is the memory throughput. I am not sure which metric is the most relevant, as it is unclear a priori whether this code will be compute-bound or memory-bound. Is this words_per_sec metric supposed to be compared to L1/L2/L3 cache throughput?
I find Gflops to be the somewhat more relevant metric, as its range on a single CPU core is clearly defined: from approximately 0.5 Gflops for scalar operations (typically, in scalar code there are ~5 service operations per floating-point operation; matmul benchmark) up to approximately 50-100 Gflops for FMA instructions in the longest vector registers, assuming no memory transfer between registers and cache. A typical ocean model has 3 Gflops performance on average, which is much better than scalar code but still far from the compute bound. I would say a metric of success for the ANN module is to be more efficient than the ocean model on average, i.e. to be in the range of 3-100 Gflops.
Maybe, if we want Gflops, we need to estimate the number of floating-point operations, which for matmul is not the number of parameters in the matrix but approximately twice that (one add and one multiply per element).
As @marshallward pointed out, what I'm calculating is the number of words for storage. When reporting Gflops, I had made the assumption that this was likely the number of multiply-adds, but as you say there is ambiguity in whether an FMA should count as one or two ops. Switching to memory processed avoided the ambiguity but in truth, it's probably better to just return the times (as we do in the other tests).
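The "approximately twice the parameter count" flop estimate for an MLP can be written out explicitly. A small sketch with hypothetical layer sizes (not from the PR), counting one multiply and one add per weight plus one add per bias, and ignoring activation cost:

```python
def mlp_params(layer_sizes):
    # Weights plus biases, summed over consecutive layer pairs
    return sum(nin * nout + nout
               for nin, nout in zip(layer_sizes, layer_sizes[1:]))

def mlp_flops(layer_sizes):
    # One multiply and one add per weight (2*nin*nout), plus one
    # add per bias element (nout), per layer
    return sum(2 * nin * nout + nout
               for nin, nout in zip(layer_sizes, layer_sizes[1:]))

# e.g. a 2-16-16-1 network (sizes are illustrative)
sizes = [2, 16, 16, 1]
print(mlp_params(sizes), mlp_flops(sizes))  # 337 641
# Gflops = mlp_flops(sizes) * inferences_per_second / 1e9
```

As the comment above notes, the flop count is close to, but not exactly, twice the parameter count (biases contribute only one op each), which is one source of the FMA-counting ambiguity.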
|
Hi @adcroft I am happy to accept this code with a few comments above. For the future (not this PR):
I tried to measure timing (on Perlmutter, AMDs) for the ANN with a typical use case. In this case we have: and so the acceleration of the new algorithm is about twice (note, it may depend on the cluster). I would say the inference time per grid point is very close to pytorch that was
I found that inference time of
Using
The fastest version of the code in
- Deleted variants of ANN that did not perform as well as the two versions that remain.