How about within the miner program running on a machine with multiple mining devices (ASIC, GPU, etc.): does the work get divided up further? I would suppose so, right?
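Yes. Most pools use an extraNonce2 size of 4 bytes, which gives each miner 2^32 different extraNonce2 values for a single workload. Each value produces a different coinbase transaction and therefore a different merkle root, so each one yields a distinct block header whose full nonce range can be searched. The mining program can then split all this work up among the different devices connected to it.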
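
As a rough illustration of how that split could work, here is a minimal Python sketch that divides the extraNonce2 space into contiguous ranges, one per device. The function names (`split_extranonce2_space`, `extranonce2_hex`) are made up for this example, and the big-endian encoding is an arbitrary choice. Real mining software may hand out work quite differently, for example by giving each device the next unused extraNonce2 value on demand.

```python
def split_extranonce2_space(num_devices, extranonce2_size=4):
    """Partition the extraNonce2 space into half-open ranges, one per device.

    Hypothetical helper for illustration; not taken from any real miner.
    """
    if num_devices < 1:
        raise ValueError("need at least one device")
    total = 1 << (8 * extranonce2_size)  # 2^32 values for a 4-byte extraNonce2
    base, extra = divmod(total, num_devices)
    ranges = []
    start = 0
    for i in range(num_devices):
        # The first `extra` devices take one additional value so nothing is left over.
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length))
        start += length
    return ranges


def extranonce2_hex(value, extranonce2_size=4):
    """Encode an extraNonce2 counter as fixed-width hex (byte order assumed here)."""
    return value.to_bytes(extranonce2_size, "big").hex()


if __name__ == "__main__":
    for device, (lo, hi) in enumerate(split_extranonce2_space(3)):
        print(device, extranonce2_hex(lo), extranonce2_hex(hi - 1))
```

Each device would then build a coinbase and merkle root for every extraNonce2 value in its range and iterate the header nonce for each one, so no two devices ever hash the same header.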