Are you so sure about that? The floating-point performance per watt of modern FPGAs is much better than GPUs'. Even in the 28nm Virtex-7 days TFLOPs were roughly on par; it's neck and neck now, and the next-gen FPGAs are pulling ahead on the AI/half-precision stuff. That floating-point performance gap was real several years ago but has rapidly closed since.
The types of instructions you're listing also take many, many clock cycles on GPUs and CPUs, and can almost always be implemented faster in FPGAs.
I've never seen an honest comparison involving actual verification of accuracy, not even bit-accuracy. I've seen some very skewed benchmarks built on very ugly code that conflated FPU performance with memory bandwidth/latency limitations.
https://en.wikipedia.org/wiki/False_sharing seems to be in fashion nowadays for obfuscation purposes.
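To make the false-sharing point concrete, here's a minimal sketch of the layout issue (struct and field names are hypothetical, not from any particular benchmark):

```cpp
#include <atomic>

// Hypothetical per-thread benchmark counters. If two threads hammer
// 'a' and 'b' below, both fields share one 64-byte cache line, so each
// increment invalidates the other core's copy: the benchmark ends up
// measuring cache-coherence traffic, not FPU throughput.
struct SharedCounters {
    std::atomic<long> a;
    std::atomic<long> b;
};

// Padding each counter to its own cache line removes the contention.
// "Fixing" only one side of a CPU-vs-FPGA comparison this way is an
// easy route to a skewed result.
struct PaddedCounters {
    alignas(64) std::atomic<long> a;
    alignas(64) std::atomic<long> b;
};

static_assert(sizeof(SharedCounters) <= 64, "counters share one line");
static_assert(sizeof(PaddedCounters) >= 128, "one line per counter");
```

The static_asserts just document the layouts; the performance difference only shows up when two threads actually contend on the fields.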
Frequently the comparisons don't even use real floating point but some extended-precision fixed point in the inner loops, because the original CPU/GPU implementation was just generic library code versus carefully optimized special-purpose code for the FPGA. That makes business sense, especially with regard to time-to-market, but I wouldn't call it science, even when it's published in an ostensibly scientific journal.
Do you recall where you've seen those comparisons?
I'll see if I can dig up recent ones. A lot of people pull up the old CUDA-vs-FPGA academic papers that focus on very old architectures.
As to GPU floating-point performance, you don't need a benchmark; the figures are right in the ISA documents. Single-precision TFLOPs are usually given in terms of FMA-unit operations, though, which is a bit misleading.
FPGA TFLOPs numbers are a bit harder to come by given the flexibility, but since most of the performance comes from the DSP blocks you can calculate them. If you've never read them, Xilinx gives extremely detailed performance metrics for every chip for most IP blocks, as well as frequency numbers for the hard blocks in the DC and AC switching characteristics docs. Agner Fog publishes a very detailed set of performance specifications for those units on nearly every CPU/APU available as well.
The main resource CPUs and GPUs have is instruction flexibility. Until a PoW hash truly requires most of the full instruction set to be supported, it will be hard to keep ASICs/FPGAs out.