Here is the description I was in the process of writing up while awaiting your reply:
The first part of your algorithm only looks back in the buffer by 32 or 256 bytes, so you only need a window of approximately 256 bytes to implement that part.
It seems you are counting on the second part of your algorithm which calls CityHashCrc128 on the entire 128MB buffer to require a full 128MB of memory. But CityHashCrc128 can be computed incrementally. Looking at the code for CityHashCrc128, it calls CityHashCrc256, which calls CityHashCrc256Long, which (aside from a bit of initialization code) reads and processes the input data completely sequentially. Therefore you can compute the first and second parts of your algorithm at the same time, and only a buffer of about 256 bytes is required. You'll have to execute the first part sequentially rather than in 8 parallel threads. It should achieve extremely good parallelization on GPUs. Perhaps the whole thing can even fit in the registers with no memory usage at all.
I'm not sure why you say it is "inherently single threaded and thus not faster on a GPU". I see no problem running as many computations in parallel as an AMD GPU has "stream processing units" -- e.g. 3200 on a 5970. The conditionals don't create much of a problem -- GPUs can implement conditionals by suppressing execution of instructions per stream processing unit, so the slowdown will be no worse than if you enter the conditional 100% of the time. It should still be way faster than a CPU.
Here is an address you can send to: [removed]