How far are you with testing heat dissipation? AFAIK most chips are designed to dissipate heat through the bottom. High-power chips are usually soldered onto a thermal pad, so heat transfer through the solder balls (or pads) is better on the bottom side. I don't know if this is the case for the BM1384. You can see this on A1 chips and older BM chips... Big-die chips are a different situation: they are mounted upside-down, with direct or indirect contact to the heatsink.
I thought that a string design needs bigger capacitors closer to the Vcore pins to bypass transients caused by variation in the chips' current draw.
I am nowhere with testing heat dissipation. All I know is, every existing BM1384 miner uses heatsinks on the chip tops. I'm assuming the manufacturer knows what it's doing, and since the S5 seems to have no problem drawing 10W per chip out the top (and supplemented by airflow over the board with the aid of the side panels) I'm gonna assume it works that way until I have the setup to do direct testing.
String design does need large capacitors to buffer current transients (and therefore voltage transients) at a node level, and the S5 does have smaller caps tied immediately to the VDD pads, probably to compensate for trace and lead inductance and the ESR of the node-level caps. Keeping a constant node voltage and current is a stiffer requirement for string designs than for parallel VRM designs.
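For a rough feel of the node-level cap sizing, here's a back-of-envelope sketch using the ideal capacitor relation C = dI * dt / dV. It ignores ESR and trace inductance entirely, and the step size, duration and allowed droop are made-up illustration values, not anything from a BM1384 datasheet or S5 measurement.

```python
def min_node_capacitance(delta_i_amps, duration_s, max_droop_v):
    """Smallest ideal capacitance (farads) that keeps the node voltage
    within max_droop_v while it alone supplies a current step of
    delta_i_amps for duration_s. Ideal cap only: no ESR, no inductance."""
    if max_droop_v <= 0:
        raise ValueError("max_droop_v must be positive")
    return delta_i_amps * duration_s / max_droop_v


# Hypothetical numbers: 2 A step lasting 10 us, 20 mV allowed droop.
c_farads = min_node_capacitance(2.0, 10e-6, 0.02)
print(f"{c_farads * 1e6:.0f} uF")
```

With those assumed numbers it comes out around 1000 uF per node, which is why the bulk caps end up large and the small caps at the pads only handle the fast edges.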
I used an IR thermometer to check S5 chip temps and found big differences between chips, and also between the chips and the temp sensor. Chip-to-chip variation is about 15 degrees, but the difference between the temp sensor and chip temp is up to 25C, which points to high thermal resistance (Rthjc) and also imperfect heatsink mounting. I'm sure a small USB miner won't see such troubles.
I see that when one pair of chips draws less current (and its voltage rises), the whole chain goes down. I've seen a diode array, which I guess is OV protection for each pair. Do you plan to add this protection too, or is it not necessary for a 2-chip string? How about adding a balancer like the ones used for Li-Po batteries? Or a simple capacitor divider? My S5 has voltage variation from 0.77V to 0.81V, which I think makes it less efficient.
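Rough numbers on why I think the spread costs efficiency. This assumes the usual CMOS approximation that dynamic power scales with V^2 at a fixed clock; it is not something I measured on the BM1384, and it ignores leakage.

```python
def relative_dynamic_power(v_node, v_ref):
    """Ratio of dynamic power at v_node to dynamic power at v_ref,
    assuming P is proportional to V^2 at the same clock frequency
    (textbook CMOS approximation, leakage ignored)."""
    if v_ref <= 0:
        raise ValueError("v_ref must be positive")
    return (v_node / v_ref) ** 2


# Highest vs lowest node voltage I saw on my S5.
ratio = relative_dynamic_power(0.81, 0.77)
print(f"highest node burns about {100 * (ratio - 1):.1f}% more than the lowest")
```

So under that assumption the 0.81V nodes burn roughly 10 to 11% more power than the 0.77V nodes for the same hashrate, since every chip in the chain runs at the same clock.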