https://fgiesen.wordpress.com/2023/03/19/notes-on-ffts-for-implementers/ notes that radix-4 and radix-2^2 schemes need half as many passes over memory as radix-2, and are therefore less bottlenecked by memory bandwidth.
Radix-4 would require a more complex reordering step (base-4 digit reversal instead of plain bit reversal), so I'm not convinced it's worth it. The radix-2^2 variant keeps the radix-2 bit-reversed output order, so it should be investigated.
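To make the idea concrete, here is a minimal Python sketch of what I understand radix-2^2 to mean: two consecutive radix-2 decimation-in-frequency stages fused into a single pass over the data. It is an illustration under my own assumptions, not code from the linked article or from this project; the names `fft_radix2x2` and `bit_reverse` are hypothetical. Because each fused pass computes exactly what two radix-2 stages would, the output lands in ordinary bit-reversed order and the existing bit reversal can be reused.

```python
import cmath


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i (hypothetical helper)."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_radix2x2(values):
    """Forward DFT of a power-of-two-length sequence.

    Returns (spectrum, passes), where `passes` counts sweeps over the
    data, excluding the final bit-reversal permutation.
    """
    x = [complex(v) for v in values]
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")

    passes = 0
    m = n  # current sub-transform size
    while m >= 4:
        q = m // 4
        for base in range(0, n, m):
            for j in range(q):
                w1 = cmath.exp(-2j * cmath.pi * j / m)
                w2 = w1 * w1
                w3 = w2 * w1
                i0 = base + j
                i1, i2, i3 = i0 + q, i0 + 2 * q, i0 + 3 * q
                a0, a1, a2, a3 = x[i0], x[i1], x[i2], x[i3]
                # First radix-2 stage; its only twiddle here is -j.
                s02, d02 = a0 + a2, a0 - a2
                s13, d13 = a1 + a3, (a1 - a3) * -1j
                # Second radix-2 stage, with the remaining twiddles
                # applied once at the end.
                x[i0] = s02 + s13
                x[i1] = (s02 - s13) * w2
                x[i2] = (d02 + d13) * w1
                x[i3] = (d02 - d13) * w3
        m //= 4
        passes += 1

    if m == 2:
        # log2(n) is odd: one plain radix-2 pass is left over.
        for base in range(0, n, 2):
            a, b = x[base], x[base + 1]
            x[base], x[base + 1] = a + b, a - b
        passes += 1

    # Same bit-reversal permutation as a plain radix-2 DIF FFT.
    bits = n.bit_length() - 1
    out = [0j] * n
    for i in range(n):
        out[bit_reverse(i, bits)] = x[i]
    return out, passes
```

For a 16-point input this makes 2 passes instead of the 4 a plain radix-2 loop needs; for 8 points it makes 2 instead of 3. Whether that translates into a real speedup here would still need measuring.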