Interesting: it seems that time.time() on Windows only has 15-16 ms resolution, so all of the short function calls get reported as taking 0 ms, 15 ms, or 16 ms, even though each is probably actually taking about 4 ms.
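If you want to check the granularity on your own machine, here's a quick sketch (my own throwaway code, not anything from p2pool) that measures the smallest step each timer reports. On Python 2 on Windows, time.clock() is the usual high-resolution alternative, while time.time() steps in ~15.6 ms increments:

```python
import time

def timer_step(timer, samples=50):
    """Return the smallest nonzero increment the given timer reports."""
    deltas = []
    last = timer()
    while len(deltas) < samples:
        now = timer()
        if now != last:
            deltas.append(now - last)
            last = now
    return min(deltas)

# On Windows, expect roughly 0.0156 s for time.time() and
# microsecond-level steps for time.clock() (Python 2).
print('time.time() step:  %.6f s' % timer_step(time.time))
print('time.clock() step: %.6f s' % timer_step(time.clock))
```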
I also see a few lines like this one:
2017-09-20 07:00:17.936000 453.000 ms for 0 txs in p2p.py:handle_remember_tx (453.000 ms/tx)
453 ms for handle_remember_tx when no transactions are being added seems strange and significant. I'll have to reread the code and see if I can figure out what could be taking so long.
There's also this:
2017-09-20 07:00:37.877000 750.000 ms for work.py:get_work()
2017-09-20 07:00:37.877000 1734.000 ms for 1 shares in handle_shares (1734.000 ms/share)
That's about 1.7 seconds of latency in just those two functions whenever your node switches work. (IIRC, handle_shares() calls get_work(), so the 750 ms is already included in the 1734 ms.) There may be a few other functions I didn't add benchmarking code to that also play a role, but that latency alone should account for about 5.8% of your DOA+orphan rate.
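(That figure is just the latency divided by p2pool's 30-second share interval: 1.734 s / 30 s ≈ 5.8%.)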
2017-09-20 07:02:24.241000 Decoding transactions took 31 ms, Unpacking 1188 ms, hashing 16 ms
The 1188 ms (unpacking) is something I think I know how to fix via a CPU/RAM tradeoff. That will probably be my next task. However, that will only reduce the CPU usage in processing new getblocktemplate responses, which I think is not in the critical code path for share propagation, so I don't expect it to help with DOA+orphan rates much. It's a pretty easy and obvious optimization, though. The node I do my testing on (4.4 GHz Core i7, pypy) only shows about 40 ms for unpacking instead of 1188 ms, so I wasn't sure it would be worthwhile. But if it's stalling your CPU for over a second, that sounds like it could be interfering with things.
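To sketch the kind of tradeoff I mean (illustrative names only; decode_tx stands in for the real deserialization step, and this isn't p2pool's actual code): memoize unpacked transactions keyed by their raw bytes, so that consecutive getblocktemplate responses, which mostly repeat the same transactions, skip re-unpacking at the cost of keeping the decoded objects in RAM:

```python
import collections

# Illustrative only: decode_tx stands in for the real (expensive)
# transaction deserialization; it is not p2pool's actual function.
def decode_tx(raw_tx):
    return {'raw': raw_tx}  # placeholder for the real unpacking work

class TxCache(object):
    """Trade RAM for CPU: memoize unpacked transactions so consecutive
    getblocktemplate responses, which mostly repeat the same
    transactions, skip re-deserialization."""
    def __init__(self, max_entries=10000):
        self.max_entries = max_entries
        self.entries = collections.OrderedDict()

    def get(self, raw_tx):
        if raw_tx in self.entries:
            return self.entries[raw_tx]  # cache hit: no unpacking work
        unpacked = decode_tx(raw_tx)     # cache miss: pay the CPU cost once
        self.entries[raw_tx] = unpacked
        if len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)  # bound RAM: evict oldest entry
        return unpacked
```

The max_entries cap is the RAM side of the tradeoff: it bounds how many decoded transactions stay resident between template updates.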
I'm not sure what the issue with pypy is on your machine. Does your CPU show any activity when you run it? How much?