maybe put some SSE or AVX in there somehow
https://en.cppreference.com/w/cpp/experimental/simd
A zero-overhead abstraction for the high-level language you are already using is so much nicer than spending a bunch of time writing your own architecture-specific routines in assembly or compiler intrinsics (std::simd itself being a simple template library implemented using intrinsics). Generally speaking, at least.
This is very useful, thanks!
Is this part of the base standard, or a C++11/14/17/20 extension?
Meanwhile I have already made an optimized matcher for consecutive character matches:
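static bool char16_is_match(RegexNode *node, const char *orig, const char *cur, const char **next)
{
    Char16Node char16 = node->chr16;
    int result;
    /* Intel syntax, so GCC needs -masm=intel */
    __asm__ __volatile__ (
        "movdqu xmm0, [%1] \n"     /* unaligned loads: neither pointer is guaranteed to be 16-byte aligned */
        "movdqu xmm1, [%2] \n"
        "pcmpeqb xmm0, xmm1 \n"    /* 0xFF in every byte lane that is equal */
        "pmovmskb %0, xmm0 \n"     /* one bit per byte lane (movmskps would only give 4 bits) */
        : "=r"(result)
        : "r" (char16.chr), "r" (cur)
        : "xmm0", "xmm1", "memory"
    );
    *next = cur+16;
    return result == 0xFFFF;       /* match only if all 16 bytes are equal */
}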
This actually only uses SSE2. I'm not sure why JLP compiled with SSSE3, is he using some of those newer instructions in his SECP256k1 class?
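For comparison, here is a rough sketch of the same 16-byte comparison written with SSE2 compiler intrinsics instead of inline assembly, which avoids depending on the assembler syntax flag. The names Char16Pattern and char16_matches and the len parameter are made up for illustration and are not part of the code above. The length mask is also an assumption on my part: it ignores the padding lanes of a short pattern so that the input would not need NUL padding. A plain memcmp fallback is included for targets without SSE2.

```cpp
#include <cstddef>
#include <cstring>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#define HAVE_SSE2 1
#endif

// Hypothetical stand-in for the Char16Node payload: up to 16 pattern bytes.
struct Char16Pattern {
    alignas(16) char chr[16];
};

// Returns true if the first `len` bytes (1..16) at `cur` equal the pattern.
// The caller must guarantee that 16 bytes are readable at `cur`.
static bool char16_matches(const Char16Pattern &pat, std::size_t len, const char *cur) {
#ifdef HAVE_SSE2
    // The pattern is aligned by construction; the input pointer may not be.
    __m128i p = _mm_load_si128(reinterpret_cast<const __m128i *>(pat.chr));
    __m128i s = _mm_loadu_si128(reinterpret_cast<const __m128i *>(cur));
    __m128i eq = _mm_cmpeq_epi8(p, s);  // 0xFF in each equal byte lane
    unsigned bits = static_cast<unsigned>(_mm_movemask_epi8(eq));  // one bit per lane
    unsigned mask = (len >= 16) ? 0xFFFFu : ((1u << len) - 1u);    // keep only the first len lanes
    return (bits & mask) == mask;
#else
    return std::memcmp(pat.chr, cur, len) == 0;
#endif
}
```

With len = 16 this behaves like the full-register comparison; with a shorter len, only the leading lanes decide the result.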

Anyway, I haven't benchmarked this yet, but the way this is supposed to work is that the regular expression is already represented as a linked list like this:
^1tryme.*
start --> char ---> char ---> char ---> char ---> char ---> char ---> quant
                                                                        v
                                                                       any
so you can actually go down in some of these nodes if necessary.
Now, instead of comparing each of these char nodes one at a time as is presently done, which is inefficient, we can take advantage of the fact that XMM registers can hold up to 16 char values. So we can combine runs of 2-16 consecutive char nodes into a single char16 node, and pad it (as well as the input string!) with NULs if necessary:
start --> char16 --> quant
                       v
                      any
This has roughly the same overhead as a single old-school "char" comparison. The "char" node match function is called hundreds of millions of times in the span of a few seconds, and most people will usually search for just a fixed pattern of characters, so this should translate to 6-8x fewer of these calls being made, depending on the regex string.