I tried it on Linux, and the cudaMalloc on this line fails:
error = cudaMalloc((void **)&dev_countbits, sizeof(uint32_t)*NUM_COUNTBITS_WORDS);
My card is a compute capability 2.1 GTX 460 with only 768 MB of RAM, and it is currently driving two screens. I have changed the arch to sm_21.
FYI: "Device has 647233536 free of 804454400 total bytes of memory".
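As a rough sanity check, the free-memory figure above puts an upper bound on how many uint32_t words a single allocation could hold. This is only a sketch in plain C: the helper names are made up for illustration and are not part of the program, and the real limit is likely lower because the allocation has to come out of one block of device memory.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: largest element count whose total size
 * still fits in free_bytes (0 if elem_size is 0). */
uint64_t max_elements_that_fit(uint64_t free_bytes, size_t elem_size)
{
    return elem_size ? free_bytes / elem_size : 0;
}

/* Hypothetical helper: 1 if count elements of elem_size bytes fit in
 * free_bytes, else 0. Uses division, so count * elem_size never overflows. */
int request_fits(uint64_t count, size_t elem_size, uint64_t free_bytes)
{
    return count <= max_elements_that_fit(free_bytes, elem_size);
}
```

With the numbers reported above, max_elements_that_fit(647233536, sizeof(uint32_t)) gives 161808384, so NUM_COUNTBITS_WORDS would have to be below that (about 617 MiB worth) for the cudaMalloc to have any chance of succeeding.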
Suggestions?