..
I figured out how he got 17 GH/s. It's 34 cores at 35K LUTs each, with a reduced bus width (not the full 1600 bits), running at 500 MHz. He only placed registers at the start of each round, not inside the round. That, plus some floorplanning to keep the fmax high. I could probably hit the same number now that I've seen it.
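For anyone checking the arithmetic, here is a rough sketch. It assumes each core retires one hash per clock once the pipeline is full; the numbers imply that, but it was not stated outright. The function name and the `hashes_per_clock` parameter are made up for illustration.

```python
def pipeline_throughput_hps(cores, fmax_hz, hashes_per_clock=1.0):
    """Aggregate hash rate in hashes per second.

    hashes_per_clock=1.0 is an assumption (fully pipelined core,
    one result per cycle); it is not confirmed by the design's author.
    """
    return cores * fmax_hz * hashes_per_clock


cores = 34              # from the message above
fmax_hz = 500e6         # 500 MHz, from the message above
luts_per_core = 35_000  # 35K LUTs each, from the message above

rate = pipeline_throughput_hps(cores, fmax_hz)
total_luts = cores * luts_per_core

print(f"{rate / 1e9:.1f} GH/s using about {total_luts:,} LUTs")
```

So 34 x 500 MHz lands exactly on 17 GH/s, and the design needs roughly 1.19M LUTs in total, before any routing or control overhead.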

You're supposed to keep the cats in the bag! Floorplanning seems to be totally overlooked in a lot of the published crypto work I've seen on big FPGAs, which is very unfortunate for performance. Place and route is not magic; giving it guidance is quite helpful.
@tromp - now that I found it, I see it is quite old, so perhaps it was already implemented in the latest solvers:
http://www.cs.cmu.edu/~dga/crypto/cuckoo/analysis.pdf