Note that field elements are stored internally in a denormalized representation in which the limbs can overflow, so the raw limbs of two equal elements may differ. If you want to convert an element to a portable format, use fe_get_b32.
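To make that concrete, here is a rough Python sketch of why comparing raw limbs can mislead. It is only a toy model, not the library's code: the 10-limb, 26-bit layout, the secp256k1 prime, and the helper names `limbs_to_int` and `to_b32` are all assumptions made for illustration.

```python
# Toy model only: layout, prime, and helper names are assumptions,
# not taken from the library's actual implementation.
P = 2**256 - 2**32 - 977   # assumed field prime (secp256k1)
LIMB_BITS = 26             # assumed limb width


def limbs_to_int(limbs):
    """Value represented by the limbs: sum(limb[i] * 2**(26*i)).

    Limbs are allowed to exceed 26 bits (pending carries).
    """
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def to_b32(limbs):
    """Hypothetical stand-in for a get_b32-style export.

    Reduces the value mod P and emits 32 big-endian bytes.
    """
    return (limbs_to_int(limbs) % P).to_bytes(32, "big")


# Two limb arrays for the same field element, 5 + 3 * 2**26.
a = [5, 3] + [0] * 8
b = [5 + (1 << LIMB_BITS), 2] + [0] * 8  # limb 0 overflows; carry not propagated

print(a == b)                  # raw limbs differ
print(to_b32(a) == to_b32(b))  # portable encodings agree
```

The point is to compare the 32-byte output across implementations, never the internal limbs.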
That must be why I was getting different results while testing. I'll try this function and rerun my C++ and Python mod-mul tests to see whether the results match.