I focused on a small table because I wanted to leave as much free space as possible for storing the biggest bloom filter possible.
A bigger bloom filter is not necessarily better. A better approach is to use multiple smaller bloom filters, keyed on different parts of the hash. If a 512MB bloom filter has a false-positive rate of 0.01, a single 1024MB bloom filter (with the same number of hash probes) will have a rate of about 0.005, but two independent 512MB bloom filters will have a combined rate of 0.01^2 = 0.0001.
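This can be checked against the standard bloom-filter false-positive formula. A minimal sketch, with two assumptions not in the thread: a single hash probe per filter (k=1), and a made-up key count chosen so the 512MB filter lands near a 0.01 false-positive rate:

```python
import math

def fp_rate(m_bits: float, n_items: float, k: int = 1) -> float:
    """Classic bloom-filter false-positive estimate: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k * n_items / m_bits)) ** k

MB = 8 * 1024 * 1024   # bits per megabyte
n = 43_000_000         # hypothetical number of stored keys (assumption)

p512 = fp_rate(512 * MB, n)    # ~0.01
p1024 = fp_rate(1024 * MB, n)  # ~0.005: doubling the size only halves it (for k=1)
p_two = p512 ** 2              # ~0.0001: a hit must pass both independent filters

print(f"{p512:.4f} {p1024:.4f} {p_two:.6f}")
```

The squaring only holds if the two filters are genuinely independent, i.e. indexed by disjoint parts of the hash.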
I have to experiment with that. It's interesting.
Indeed, I will try that too, in another project.
On the other hand, there is a number of false positives which we must accept and verify ourselves.
Coming back to your previous question - by default I use 112 blocks (proc*4) and 384 threads, but I cannot guarantee those values are optimal.
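For reference, the 112 follows from multiplying the GPU's multiprocessor count by 4; a tiny sketch of that arithmetic, where the SM count of 28 is an assumption (the thread doesn't tie these numbers to a specific card):

```python
sm_count = 28            # assumed multiprocessor count ("proc"), not stated in the thread
blocks = sm_count * 4    # "proc*4" -> 112 blocks
threads_per_block = 384
total_threads = blocks * threads_per_block  # threads in flight per launch

print(blocks, threads_per_block, total_threads)
```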
Just now I switched from a static table (size declared at compile time) to a dynamically allocated one, which makes it possible to create bigger tables - 24 bits is the max for my 12GB card (the table takes more than 60% of memory, and the whole program uses 77%). The performance gain between 22 and 24 bits is not dramatic, but still worth considering.
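A rough sizing sketch of why the jump from 22 to 24 bits is costly: each extra index bit doubles the entry count, so two extra bits quadruple the table. The per-entry size below is a placeholder, since the real record layout isn't given in the thread:

```python
def table_bytes(bits: int, entry_bytes: int) -> int:
    """Memory for a table of 2**bits fixed-size entries."""
    return (1 << bits) * entry_bytes

ENTRY = 16       # bytes per entry: an assumption, not the program's actual layout
MB = 1024 ** 2

for bits in (22, 24):
    print(bits, table_bytes(bits, ENTRY) // MB, "MB")
```

Whatever the true entry size, the 4x growth factor between the two settings is the same, which is why 24 bits pushes a 12GB card near its limit.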