If all else fails, I'm thinking of running it on top of a Nios II implementation. Completely pointless, but it's got to be worth a shot just for the laughs.
You could certainly start with a pure software approach and then gradually optimize it with custom instructions. You wouldn't ever get the speed of a pure RTL design, but it's a good (rewarding) exercise. With the reduced resource utilization, you could end up with the fastest implementation on the BeMicro.