I will publish an X16R bitstream that can mine Ravencoin at an incredible rate; however, these dynamically changing algorithms are extremely complicated to implement in FPGAs, so it will be early 2019 by the time I have that one ready.
Why don't you do one hash function at a time and store the intermediate results in SRAM? That would be 1/16 the speed or less, but still faster than the GPUs, right?
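To make the one-at-a-time idea concrete, here is a rough Python sketch, not real mining code. It assumes a batch of candidate headers is pushed through one stage, the intermediate results are kept in a buffer (playing the role of the SRAM), and the next stage is then loaded. `stand_in_hash`, `chained`, and `one_stage_at_a_time` are made-up names, and the hash is a BLAKE2b stand-in rather than the 16 actual X16R functions.

```python
import hashlib

NUM_ALGOS = 16  # X16R chains 16 hash stages per candidate


def stand_in_hash(algo: int, data: bytes) -> bytes:
    # Stand-in only: real X16R uses 16 different 512-bit hash functions.
    # Here each "algorithm" is BLAKE2b personalised with its index.
    return hashlib.blake2b(data, person=bytes([algo])).digest()


def chained(order, header: bytes) -> bytes:
    # Reference path: all stages back to back for a single input.
    state = header
    for algo in order:
        state = stand_in_hash(algo, state)
    return state


def one_stage_at_a_time(order, headers):
    # Batch path: only one stage is "loaded" at any moment.
    buffer = list(headers)  # plays the role of the SRAM
    loads = 0
    for algo in order:
        loads += 1  # one stage (bitstream) load per entry in the order
        buffer = [stand_in_hash(algo, item) for item in buffer]
    return buffer, loads
```

Both paths give the same digests; the batch path just pays one stage load per entry in the order, which is where the "1/16 or less" estimate comes from.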
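You can also make 256 kernels, one for each ordered pair of the 16 hashing functions (16 × 16 = 256), with two functions combined in each kernel. Then the 16 stages take 8 passes instead of 16, so you get 1/8th of the speed.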
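A quick sketch of the pair-kernel arithmetic, assuming the 16-stage order is already known as a list of algorithm indices from 0 to 15. `pair_kernel_id` and `pair_schedule` are hypothetical helpers, and the numbering of the kernels is just one possible convention.

```python
NUM_ALGOS = 16  # number of distinct hash functions in X16R


def pair_kernel_id(first: int, second: int) -> int:
    # One kernel per ordered pair (first, second); ids run from 0 to 255.
    return first * NUM_ALGOS + second


def pair_schedule(order):
    # Split the stage order into consecutive pairs and map each to a kernel id.
    if len(order) % 2 != 0:
        raise ValueError("order must contain an even number of stages")
    return [pair_kernel_id(order[i], order[i + 1])
            for i in range(0, len(order), 2)]
```

For a 16-entry order this returns 8 kernel ids, so 8 kernel loads per batch instead of 16.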