Yes, this is my first CUDA project, and I think an optimisation is possible by using the different types of GPU memory (global and shared memory), because their access times differ.
To copy my tables into GPU memory I use the standard function:
cudaMemcpy(b, a, ..., cudaMemcpyHostToDevice);
so I'm not quite sure which type of GPU memory the tables end up in.
It depends on how you declare the destination in the device code.
Normally you write to global memory (anything allocated with cudaMalloc lives there). If the data is declared with the __constant__ qualifier, reads will be faster because they are cached, but the size is limited (64 KB total).
__shared__ memory is visible only within a single block, so each block must repopulate it every kernel launch. It is also limited in size (typically 48 KB per block).
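A minimal sketch showing all three memory spaces together; the names (coeffs, scale, tile) are hypothetical, not from the original post:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// __constant__ memory: read-only from kernels, cached, 64 KB total.
__constant__ float coeffs[256];

__global__ void scale(const float *in, float *out, int n) {
    // __shared__ memory: per-block scratch space, gone after the kernel ends.
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = in[i];        // stage a chunk from global memory
        __syncthreads();                  // wait for the whole block
        out[i] = tile[threadIdx.x] * coeffs[threadIdx.x];
    }
}

int main() {
    const int n = 1024;
    float h_coeffs[256], h_in[n], h_out[n];
    for (int i = 0; i < 256; ++i) h_coeffs[i] = 2.0f;
    for (int i = 0; i < n; ++i)   h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));   // cudaMalloc allocates GLOBAL memory
    cudaMalloc(&d_out, n * sizeof(float));
    // cudaMemcpy to a cudaMalloc'd pointer lands in global memory:
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    // copying into constant memory uses a different call:
    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));

    scale<<<n / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[10] = %f\n", h_out[10]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

So your cudaMemcpy(b, a, ..., cudaMemcpyHostToDevice) writes to global memory whenever b came from cudaMalloc; constant memory needs cudaMemcpyToSymbol, and shared memory is only ever filled from inside a kernel.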