I won't say it's impossible, but I would be genuinely surprised. Each hash needs about 32 MB of total memory traffic (read + write), plus 2 MB or so of scratchpad per hash core.
You have 1280 URAM blocks of 288 Kb each, with a dual-ported 72-bit interface, in the biggest configuration. That's an incredible amount of internal bandwidth, but you can only store 22 or so simultaneous CryptoNight v7 2 MB blocks in that. The absolute biggest part (which isn't on the 1525 board) has 360 Mbit of URAM, 96 Mbit of BRAM, and 48 Mbit of distributed RAM, holding a theoretical 63 MB of pipelines, assuming you didn't need a single bit of that for the rest of your logic (you do).
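To sanity-check the storage numbers above, here is a rough sketch of the arithmetic. It assumes binary units (288 Kb = 288 * 1024 bits) and one full 2 MiB scratchpad per hash in flight; the function and variable names are just for illustration. With those assumptions the URAM alone holds 22 whole scratchpads (22.5 before rounding down).

```python
# Back-of-the-envelope check of the on-chip storage figures quoted above.
# Assumptions: binary units, one full 2 MiB scratchpad per hash in flight,
# and none of the RAM spent on anything else (which is unrealistic).

SCRATCHPAD_BITS = 2 * 1024 * 1024 * 8  # 2 MiB expressed in bits


def scratchpads_that_fit(total_mbit):
    """Whole 2 MiB scratchpads that fit in total_mbit megabits of RAM."""
    return (total_mbit * 1024 * 1024) // SCRATCHPAD_BITS


uram_mbit = 1280 * 288 // 1024        # 1280 blocks of 288 Kb each
all_ram_mbit = uram_mbit + 96 + 48    # URAM + BRAM + distributed RAM

print("URAM only:", uram_mbit, "Mbit,",
      scratchpads_that_fit(uram_mbit), "scratchpads")
print("All on-chip RAM:", all_ram_mbit / 8, "MB,",
      scratchpads_that_fit(all_ram_mbit), "scratchpads")
```

Even the all-RAM figure is an upper bound only, since it leaves nothing for FIFOs, pipeline state, or the rest of the design.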
The external memory, at say 4x 64-bit DIMMs @ 2666 MT/s, is only 85 GB/s, or 2.6 kH/s worth of bandwidth with a perfect access pattern.
Even if you could hypothetically use all 2000+ balls on the FPGA at 2666 MT/s DDR-style speeds, you'd still only clear 20 kH/s against external memory, and that isn't even real bandwidth.
Even if you took the biggest part with 128 x 32 Gbps transceivers and used them as SERDES links to memory, you'd only have a 16 kH/s limit from bandwidth.
Unless you break the algorithm itself, there's nowhere to find the bandwidth + storage space for 64 kH/s on a single FPGA.
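The bandwidth ceilings above can be reproduced with a few lines of arithmetic. This is only a sketch under stated assumptions: 32 MB of traffic per hash, decimal units (1 GB = 1e9 bytes), a perfect access pattern, and zero protocol overhead. The names `BYTES_PER_HASH` and `hashrate_ceiling` are made up for the example.

```python
# Rough upper bounds on hashrate from raw memory bandwidth alone.
# Assumptions: 32 MB of scratchpad traffic per hash, decimal units,
# perfect access pattern, no protocol or refresh overhead.

BYTES_PER_HASH = 32e6
TARGET_HS = 64_000  # the 64 kH/s figure being questioned


def hashrate_ceiling(total_gbit_per_s):
    """Hashes per second that a raw link rate (in Gbit/s) could feed."""
    return total_gbit_per_s * 1e9 / 8 / BYTES_PER_HASH


scenarios = {
    "4 x 64-bit DDR4-2666": 4 * 64 * 2.666,
    "2000 balls at 2666 MT/s": 2000 * 2.666,
    "128 x 32 Gbps transceivers": 128 * 32,
}

for name, gbit_per_s in scenarios.items():
    hs = hashrate_ceiling(gbit_per_s)
    print(f"{name}: {gbit_per_s / 8:.0f} GB/s, about {hs / 1000:.1f} kH/s, "
          f"{hs / TARGET_HS:.0%} of target")
```

None of the three scenarios gets within a factor of three of 64 kH/s, and real access patterns would land well below these ceilings.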
You're missing a really big part of UltraRAM, one of the most attractive things it has to offer: true dual-port, single-clock read/write. Also, when you chain UltraRAMs together, the bus width increases in proportion to the added latency. I never completed Monero, but my estimates were in the 4-8 kH/s per board range at 100 W.
The one I'm having a hard time believing is Keccak. The VCU1525 only has 160 A on VCCINT. I was hitting 2.5 GH/s at 140-150 A VCCINT with 12 cores operating at 225 MHz. This wasn't optimized, like at all, but I'm having a hard time finding 7x worth of hashrate with optimization. Then again, I'm only using Vivado and didn't spend a great deal of time on optimization.