I'll see if I can dig up recent ones. A lot of people pull up the old CUDA-vs-FPGA academic papers that focus on very old architectures.
Thanks in advance.
I'll put the blame squarely in the vendor's lap. Intel, which has since acquired Altera, still lists "An Independent Analysis of Altera's FPGA Floating-point DSP Design Flow" from 2011 as the only source mentioning "accuracy". I've found several other, newer papers, but they all repeat the same old bullshit methodology: using only single precision and only estimating the errors. At most they'll show fused multiply-add, as if double precision or Kahan summation (https://en.wikipedia.org/wiki/Kahan_summation_algorithm) never existed, or didn't apply.
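For reference, Kahan (compensated) summation is only a few lines; here's a minimal C sketch (the test data is made up, just to show the effect):

#include <stddef.h>
#include <stdio.h>

/* Kahan summation: recovers most of the low-order bits that naive
   single-precision accumulation throws away. */
float kahan_sum(const float *x, size_t n)
{
    float sum = 0.0f;
    float c = 0.0f;                /* running compensation for lost low-order bits */
    for (size_t i = 0; i < n; i++) {
        float y = x[i] - c;        /* apply the correction from the previous step */
        float t = sum + y;         /* low-order bits of y can be lost here... */
        c = (t - sum) - y;         /* ...but this algebraically recovers them */
        sum = t;
    }
    return sum;
}

int main(void)
{
    /* 1.0 followed by many tiny values: naive float summation loses them all,
       because 1.0f + 1e-8f == 1.0f in single precision. */
    float data[1001] = { 1.0f };
    for (int i = 1; i <= 1000; i++) data[i] = 1e-8f;
    printf("kahan: %.9g\n", kahan_sum(data, 1001));  /* ~1.00001, as it should be */
    return 0;
}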
As to GPU floating-point performance, you don't need a benchmark. The figures are right in the ISA documents. Single-precision TFLOPS are usually given in terms of FMA unit operations, though, which is a bit misleading: each FMA is counted as two floating-point operations.
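Just to make the counting explicit, here is how those headline numbers are derived (the unit count and clock below are hypothetical, not any specific chip):

#include <stdio.h>

int main(void)
{
    /* Vendor "peak TFLOPS" = lanes * clock * 2, because an FMA (a*b+c)
       is counted as TWO operations. Numbers below are illustrative only. */
    double fma_units = 2560;       /* single-precision FMA lanes (hypothetical) */
    double clock_hz  = 1.5e9;      /* boost clock (hypothetical) */
    double flops     = fma_units * clock_hz * 2.0;
    printf("peak: %.2f TFLOPS\n", flops / 1e12);   /* -> 7.68 TFLOPS */
    return 0;
}

A workload that can't be expressed as back-to-back FMAs sees half that figure at best.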
FPGAs are a bit harder to get TFLOPS numbers for, given their flexibility, but since most of the performance actually comes from the DSP blocks, you can calculate those. If you've never read them, Xilinx gives extremely detailed performance metrics for every chip for most IP blocks, as well as frequency numbers for the hard blocks in the DC and AC switching characteristics documents. Agner Fog publishes a very detailed set of specifications for the performance of those units on nearly every CPU/APU available as well.
The funny thing is that the closest-to-honest comparison of Xilinx's FP performance I've found is on Altera's own site:
https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/wp/wp-01222-understanding-peak-floating-point-performance-claims.pdf

The main resource CPUs and GPUs have is instruction flexibility. Until a PoW hash truly requires most of the full instruction set to be supported, it will be hard to keep out ASICs/FPGAs.
I think this claim is true, but somewhat pessimistic. I think it will become fairly easy once a wider range of cryptocurrency programmers starts to appreciate floating point and chaos theory (https://en.wikipedia.org/wiki/Chaos_theory) as useful building blocks for proof-of-work algorithms.
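To illustrate the idea (this is a toy sketch of mine, not any deployed PoW): iterate a chaotic map like the logistic map in double precision. The result is so sensitive to the initial value that the only way to reproduce it is to perform the bit-exact IEEE-754 arithmetic, which favors hardware with full double-precision support.

#include <stdio.h>
#include <stdint.h>

/* Derive a starting point in (0,1) from a hypothetical block nonce. */
static double seed_from_nonce(uint64_t nonce)
{
    return (double)(nonce % 1000000007ULL) / 1000000007.0;
}

int main(void)
{
    double x = seed_from_nonce(0xDEADBEEFULL);   /* nonce value is arbitrary */
    for (int i = 0; i < 1000; i++)
        x = 3.9999 * x * (1.0 - x);   /* logistic map in its chaotic regime */
    printf("%.17g\n", x);             /* flipping one input bit scrambles this */
    return 0;
}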
I've only skimmed the currently available literature on the subject, but it is next to trivial to demolish all the current claims of FPGA superiority that I was able to find today:
1) use double precision
2) use division or reciprocal (either accurate or approximate)
3) use square-root or reciprocal square-root (either accurate or approximate)
and I haven't even gotten into transcendental functions (on CPUs) or into using the newer, pixel-oriented hardware in the shaders (on GPUs). See the sketch below for points 1-3.
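Here are all three pain points from the list above in one kernel-sized loop (purely illustrative; the data and loop bounds are made up). CPUs and GPUs execute every operation here in dedicated hardware; an FPGA has to burn a large amount of fabric and DSP resources to match the precision:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double acc = 0.0;
    for (int i = 1; i <= 1000; i++) {
        double d = (double)i;         /* point 1: double precision throughout */
        acc += 1.0 / d;               /* point 2: correctly rounded division */
        acc += 1.0 / sqrt(d);         /* point 3: reciprocal square root */
    }
    printf("%.17g\n", acc);
    return 0;
}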
You did, however, motivate me to reconsider Altera/Quartus for certain future projects. They are now shipping limited but fully hardware-implemented single-precision floating point in their DSP blocks, and their toolchain has improved in terms of supported OSes and device drivers.