Can this be further optimized in terms of speed?

The current implementation is already highly optimized, so further gains would likely be in the 10-30% range rather than order-of-magnitude improvements. The most promising areas would be hash function optimization and fine-tuning batch sizes for specific hardware.
These batch sizes could be tuned based on:
CPU cache sizes (L1/L2/L3)
Available SIMD registers
Benchmark results for different sizes (256, 512, 1024)
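
One way to compare the candidate batch sizes is a small timing harness like the sketch below. It is only an illustration: `process_batch` is a hypothetical stand-in for the real per-batch kernel, and `BatchTiming`, `time_batch_size`, and `sweep_batch_sizes` are names invented here, not part of the existing code.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the real per-batch kernel: mixes each element
// with a cheap multiply/xor-shift step and sums the results.
inline std::uint64_t process_batch(const std::uint64_t* data, std::size_t n) {
    std::uint64_t acc = 0;
    for (std::size_t i = 0; i < n; ++i) {
        std::uint64_t x = data[i] * 0x9E3779B97F4A7C15ULL;
        acc += x ^ (x >> 29);
    }
    return acc;
}

struct BatchTiming {
    std::size_t batch_size;
    double ns_per_item;
    std::uint64_t checksum;  // consumed so the compiler cannot discard the work
};

// Times `rounds` passes over `data`, walking it in chunks of `batch_size`.
inline BatchTiming time_batch_size(const std::vector<std::uint64_t>& data,
                                   std::size_t batch_size, int rounds) {
    assert(batch_size > 0 && rounds > 0 && !data.empty());
    std::uint64_t checksum = 0;
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < rounds; ++r) {
        for (std::size_t off = 0; off < data.size(); off += batch_size) {
            const std::size_t n = std::min(batch_size, data.size() - off);
            checksum += process_batch(data.data() + off, n);
        }
    }
    const auto stop = std::chrono::steady_clock::now();
    const double ns =
        std::chrono::duration<double, std::nano>(stop - start).count();
    const double items = static_cast<double>(data.size()) * rounds;
    return {batch_size, ns / items, checksum};
}

// Runs the timing once per candidate size, in the order given.
inline std::vector<BatchTiming> sweep_batch_sizes(
        const std::vector<std::uint64_t>& data,
        const std::vector<std::size_t>& sizes, int rounds) {
    std::vector<BatchTiming> results;
    for (std::size_t s : sizes) {
        results.push_back(time_batch_size(data, s, rounds));
    }
    return results;
}
```

Calling `sweep_batch_sizes(data, {256, 512, 1024}, rounds)` on the target machine and comparing `ns_per_item` gives a first indication. The checksum is the same for every batch size, which doubles as a sanity check that each run processed the same data.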
Current DP Table Structure:
#pragma pack(push,1)
struct DPSlot{ fp_t fp; Scalar256 key; };
#pragma pack(pop)
static_assert(sizeof(DPSlot)==40);
8 bytes for the fingerprint (fp_t)
32 bytes for the scalar key (Scalar256)
Total: 40 bytes per slot
The fingerprint could potentially be reduced from 64 bits to 32 bits, with these trade-offs:
More frequent collisions (manageable)
Secondary verification when matches occur
Savings: 4 bytes per slot (40 down to 36 bytes, a 10% reduction)
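
A rough sketch of what the 32-bit variant might look like follows. It rests on several assumptions: `Scalar256Sketch` stands in for the real `Scalar256`, `DPSlot32`, `truncate_fp`, and `find_verified` are hypothetical names, and the linear scan is a placeholder for however the real table is indexed. The secondary verification is left as a caller-supplied callback, because what counts as a genuine match depends on the surrounding code.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Assumption: stand-in for the real Scalar256 (32 raw bytes).
struct Scalar256Sketch {
    std::array<std::uint8_t, 32> bytes;
};

#pragma pack(push, 1)
struct DPSlot32 {
    std::uint32_t fp;     // truncated fingerprint: 4 bytes instead of 8
    Scalar256Sketch key;  // 32-byte scalar key, unchanged
};
#pragma pack(pop)
static_assert(sizeof(DPSlot32) == 36, "4-byte fingerprint + 32-byte key");

// Keeps the low 32 bits of the original 64-bit fingerprint.
inline std::uint32_t truncate_fp(std::uint64_t full_fp) {
    return static_cast<std::uint32_t>(full_fp);
}

// Returns the first slot whose 32-bit fingerprint matches AND that passes
// the secondary check, or nullptr if there is none. The linear scan is for
// illustration only.
inline const DPSlot32* find_verified(
        const std::vector<DPSlot32>& table, std::uint64_t full_fp,
        const std::function<bool(const DPSlot32&)>& verify) {
    assert(verify != nullptr);  // a verifier is mandatory with short fingerprints
    const std::uint32_t fp32 = truncate_fp(full_fp);
    for (const DPSlot32& slot : table) {
        // A fingerprint hit is only a candidate; verify() rejects false positives.
        if (slot.fp == fp32 && verify(slot)) {
            return &slot;
        }
    }
    return nullptr;
}
```

The extra cost of the verifier is only paid on fingerprint hits, so it should stay small as long as false matches remain rare.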