Also, I was getting 1000 MKey/s on a 1080 Ti with VanitySearch but only 300 MKey/s with Bitcrack. Why is it that slow? As far as I know, VanitySearch only has to compare the first 4-6 characters, while Bitcrack has to compare the whole address. Is that the only reason for the slowness, or is it something else in the implementation?
I can't help you choose parameters because I don't have any of those cards; I'm still trying to find a way to lease a GPU for bitcoin (turns out it's very hard to find someone who's not trying to sell you a miner). But the reason VanitySearch is much faster is not the number of characters being compared. It's that Jean_Luc's code generates a "jump table" of secp256k1 points (see e.g. GPU/Group.h), so he avoids redoing most of the elliptic curve math at runtime, unlike brichard's code. This is despite both engines being written in CUDA.
In VanitySearch he also writes all his math in assembly code while Bitcrack uses C for its math functions.
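To make the jump-table idea concrete, here is a minimal pure-Python sketch. It is not Jean_Luc's or brichard's code: the names make_table and scan_with_table are made up for illustration, it runs on the CPU, and it assumes Python 3.8+ for pow(x, -1, P). It precomputes small multiples of G once, pays for one full scalar multiplication at the starting key, and then reaches every following key with a single point addition.

```python
# secp256k1 field prime and generator point
P = 2**256 - 2**32 - 977
G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)

def add(a, b):
    """Add two affine points; None stands for the point at infinity."""
    if a is None:
        return b
    if b is None:
        return a
    (x1, y1), (x2, y2) = a, b
    if x1 == x2 and (y1 + y2) % P == 0:
        return None                                      # a == -b
    if a == b:
        lam = 3 * x1 * x1 * pow(2 * y1, -1, P) % P       # tangent slope
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P  # chord slope
    x3 = (lam * lam - x1 - x2) % P
    y3 = (lam * (x1 - x3) - y1) % P
    return (x3, y3)

def mul(k, pt=G):
    """Double-and-add scalar multiplication: hundreds of additions per key."""
    acc = None
    while k:
        if k & 1:
            acc = add(acc, pt)
        pt = add(pt, pt)
        k >>= 1
    return acc

def make_table(size):
    """Precompute [0*G, 1*G, ..., size*G] once (hypothetical helper)."""
    table = [None]
    for _ in range(size):
        table.append(add(table[-1], G))
    return table

def scan_with_table(start, count, table):
    """Public keys for start, start+1, ... using one addition per key."""
    size = len(table) - 1
    keys = []
    base = mul(start)                    # the only full multiplication
    while len(keys) < count:
        for offset in table[:size]:      # base + 0*G ... base + (size-1)*G
            keys.append(add(base, offset))
        base = add(base, table[size])    # jump to the next batch
    return keys[:count]

table = make_table(8)
# The table walk yields the same keys as one multiplication per key.
print(scan_with_table(1000, 20, table) == [mul(1000 + i) for i in range(20)])
```

Each addition here still costs a modular inverse, so this sketch is nowhere near GPU speeds. It only shows why walking a key range with a precomputed table is far cheaper than a fresh multiplication for every key.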
Several pages back somebody posted here that they used -b 112 -t 512 -p 512 for their 1080 and got speeds similar to yours.
BTW, kind of off-topic: my Python implementation (<50 lines) using external libs does only 100 keys/sec (on a single CPU core), and I noticed the official C/Python implementations also do around 1000 keys/sec while generating. What makes these generators thousands or millions of times faster? Is it just a simple trick or years' worth of dev? Give some pointers, maybe.
Are your Python files compiled? Try precompiling them to bytecode so the source doesn't have to be parsed and compiled each time it runs, using
python -m compileall YOUR_PY_FILES
Also, while running on Windows, it is not showing any CPU/GPU/memory usage. Why is that?
What do you mean? Bitcrack never printed how much CPU or memory it uses in the console. Do you mean in Task Manager? If so then you'll usually find it as a child process of Windows Command Processor (and Task Manager is going to show that using a lot of resources but it's actually the child process running inside it using all that).
You also have to realize that VanitySearch is searching more than just the point. It is searching Point + endo1 + endo2 + symmetry (pt.y.ModNeg; p1.y.ModNeg; and p2.y.ModNeg). It's a different animal than Bitcrack, hence what seems like super speed by comparison. They are just different designs with different purposes: one checks point by point, the other checks 6 points at a time.
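Here is a rough Python sketch of where those 6 points come from. It is not VanitySearch's code; six_points and six_keys are names invented for this example, and it assumes Python 3.8+. BETA and LAMBDA are the usual secp256k1 endomorphism constants, with LAMBDA * (x, y) = (BETA * x, y). From one computed point you get two more by multiplying x by BETA and BETA squared, and three more by negating y, so 6 candidate keys cost one point computation plus a few field multiplications.

```python
# secp256k1 field prime, group order and generator
P = 2**256 - 2**32 - 977
N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141
G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)
# Endomorphism constants: cube roots of unity mod P and mod N
BETA = 0x7AE96A2B657C07106E64479EAC3434E99CF0497512F58995C1396C28719501EE
LAMBDA = 0x5363AD4CC05C30E0A5261C028812645A122E22EA20816678DF02967C1B23BD72

def add(a, b):
    """Add two affine points; None stands for the point at infinity."""
    if a is None:
        return b
    if b is None:
        return a
    (x1, y1), (x2, y2) = a, b
    if x1 == x2 and (y1 + y2) % P == 0:
        return None
    if a == b:
        lam = 3 * x1 * x1 * pow(2 * y1, -1, P) % P
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P
    x3 = (lam * lam - x1 - x2) % P
    return (x3, (lam * (x1 - x3) - y1) % P)

def mul(k, pt=G):
    """Plain double-and-add, used here only to check the results."""
    acc = None
    while k:
        if k & 1:
            acc = add(acc, pt)
        pt = add(pt, pt)
        k >>= 1
    return acc

def six_points(x, y):
    """Point, endo1, endo2, then the same three with y negated (symmetry)."""
    xs = [x, BETA * x % P, BETA * BETA % P * x % P]
    return [(xi, y) for xi in xs] + [(xi, P - y) for xi in xs]

def six_keys(k):
    """Private keys matching six_points(), in the same order."""
    ks = [k, LAMBDA * k % N, LAMBDA * LAMBDA % N * k % N]
    return ks + [N - ki for ki in ks]

k = 123456789
points = six_points(*mul(k))    # one scalar multiplication, six points
keys = six_keys(k)
# Every derived key really does map to its derived point.
print(all(mul(ki) == pt for ki, pt in zip(keys, points)))
```

This is why the two tools are hard to compare key for key: a vanity search can use all six points because any matching address will do, while a sequential range search only cares about the keys inside its range.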