I precompute a table of kG values, 16*65535 points in total (16 windows of 16 bits, 65535 entries each), with x and y indexed by
0001->FFFF
0001 0000->FFFF 0000
0001 0000 0000->FFFF 0000 0000
...
I load these 2 tables, one for x and one for y, into the shared memory of the GPU at the start of the program.
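
For anyone following along, here is a rough CPU-side sketch of how I read the table scheme described above. It is only an illustration, not the poster's GPU code. The curve constants are the standard secp256k1 parameters; the function names (point_add, build_tables, mul_g, mul_naive) are made up for this example, the points are stored as (x, y) tuples instead of two separate x and y tables, and the demo uses 8-bit windows so it builds quickly in plain Python (the post uses 16-bit windows).

```python
# Standard secp256k1 parameters: field prime and generator point G.
P = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEFFFFFC2F
GX = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798
GY = 0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8
G = (GX, GY)


def point_add(a, b):
    """Affine addition on y^2 = x^3 + 7 over F_P. None is the point at infinity."""
    if a is None:
        return b
    if b is None:
        return a
    x1, y1 = a
    x2, y2 = b
    if x1 == x2:
        if (y1 + y2) % P == 0:
            return None  # a + (-a)
        lam = 3 * x1 * x1 * pow(2 * y1 % P, -1, P) % P  # doubling
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P
    x3 = (lam * lam - x1 - x2) % P
    y3 = (lam * (x1 - x3) - y1) % P
    return (x3, y3)


def build_tables(window_bits, scalar_bits=256):
    """tables[i][j - 1] = (j << (i * window_bits)) * G for j = 1 .. 2^window_bits - 1."""
    windows = -(-scalar_bits // window_bits)  # ceiling division
    tables = []
    base = G  # 2^(i * window_bits) * G for the current window i
    for _ in range(windows):
        row = []
        acc = None
        for _ in range((1 << window_bits) - 1):
            acc = point_add(acc, base)
            row.append(acc)
        tables.append(row)
        base = point_add(acc, base)  # step to 2^window_bits * base
    return tables


def mul_g(k, tables, window_bits):
    """k*G with one table lookup per non-zero window, so at most len(tables) - 1 additions."""
    mask = (1 << window_bits) - 1
    result = None
    for i, row in enumerate(tables):
        digit = (k >> (i * window_bits)) & mask
        if digit:  # index 0 is not stored, matching the 0001->FFFF indexing
            result = point_add(result, row[digit - 1])
    return result


def mul_naive(k):
    """Plain double-and-add, used here only to cross-check the table version."""
    result, point = None, G
    while k:
        if k & 1:
            result = point_add(result, point)
        point = point_add(point, point)
        k >>= 1
    return result


WINDOW_BITS = 8  # demo value; the post uses 16
TABLES = build_tables(WINDOW_BITS)
```

With these definitions, mul_g(k, TABLES, WINDOW_BITS) should agree with mul_naive(k) for any k from 1 up to the group order minus 1. Setting WINDOW_BITS = 16 gives the 16 rows of 65535 entries from the post, but building that table in pure Python takes a while.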
Interesting approach. It seems other solutions focus more on larger tables. Does the speed advantage of having the tables in shared memory outweigh the disadvantage of doing 15 additions vs, for example, 11 additions with 22-bit tables?
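
To put numbers on the trade-off, here is a small back-of-the-envelope helper. It assumes 256-bit scalars, one addition per window after the first, and 64 bytes per stored point (32-byte x plus 32-byte y, uncompressed); window_cost is a hypothetical name for this example, and real implementations may pack the tables differently.

```python
def window_cost(window_bits, scalar_bits=256, bytes_per_point=64):
    """Return (additions, table entries, table bytes) for a fixed-window kG table.

    Assumption: each window stores every non-zero digit, and the last window
    only covers the bits left over after the full windows.
    """
    windows = -(-scalar_bits // window_bits)  # ceiling division
    additions = windows - 1
    last_bits = scalar_bits - (windows - 1) * window_bits
    entries = (windows - 1) * ((1 << window_bits) - 1) + ((1 << last_bits) - 1)
    return additions, entries, entries * bytes_per_point


for w in (16, 22):
    adds, entries, size = window_cost(w)
    print(f"{w}-bit windows: {adds} additions, {entries} entries, {size / 2**20:.0f} MiB")
```

The addition counts come out as 15 and 11, matching the figures in the question, so the open point is whether saving 4 additions is worth the much larger table and the slower memory it has to live in.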