Re: [1500 TH] p2pool: Decentralized, DoS-resistant, Hop-Proof pool

Quote from: Xantus on September 24, 2017, 08:32:53 PM

"Warning: LOST CONTACT WITH BITCOIND for" is displayed ... perhaps these linux bash box dont runs cleanly.

That's what happens when your CPU gets overloaded and stalled. Let's say you have enough incoming traffic in your TCP buffers to keep p2pool's CPU busy for 1.1 minutes. P2pool first sees that it needs a new block template, so it sends a getblocktemplate request to bitcoind. Bitcoind may respond immediately, but the response goes to the end of that 1.1 minute queue. P2pool then works through its queue, gets to the bitcoind response, and processes it, but notices that the round trip took 1.1 minutes and so complains about it.

Same thing for web UI requests. When you point your browser to localhost:9332, that request goes into p2pool's queue of jobs to work on, and p2pool doesn't get around to it until e.g. 1.1 minutes have passed.
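If it helps to see the queueing effect in numbers, here is a rough sketch of a single worker that handles jobs strictly in arrival order. It is only a toy model, not p2pool's actual event loop, and the job names, arrival times and costs are made-up values chosen to match the 1.1 minute example.

```python
def completion_times(jobs):
    """Return (name, arrival, finish) for jobs handled one at a time, in order.

    jobs: list of (name, arrival_time_s, cpu_cost_s), sorted by arrival time.
    """
    clock = 0.0
    results = []
    for name, arrival, cost in jobs:
        # A job cannot start before it arrives or before earlier jobs finish.
        start = max(clock, arrival)
        clock = start + cost
        results.append((name, arrival, clock))
    return results

# Hypothetical numbers: 66 s (1.1 minutes) of backlog is already queued at t=0.
# bitcoind's reply arrives 50 ms later and needs only 10 ms of CPU to handle.
jobs = [
    ("backlog of p2p traffic", 0.0, 66.0),
    ("getblocktemplate response", 0.05, 0.01),
    ("web UI request", 1.0, 0.01),
]
results = completion_times(jobs)
for name, arrival, finish in results:
    print("%-28s finished %.2f s after it arrived" % (name, finish - arrival))
```

The reply itself is cheap to process, but it still finishes about 66 seconds after it arrived because it has to wait behind everything queued ahead of it.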

Quote

by the way, P2Pool is using at linux bash with pypy 1,3Gb Memory an betwen 0 and 15% cpu

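The 15% CPU is the problem. P2pool does its work on a single core, and 15% of total CPU is above the 12.5% threshold for fully utilizing a single core on your machine (12.5% is one core out of eight). That means p2pool can't keep up, and the work in its to-do queue is snowballing.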
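The 12.5% figure is just one core's share of the total. A quick sketch of the arithmetic, assuming the machine exposes eight logical cores (that count is inferred from the 12.5% figure, not read from the logs):

```python
def single_core_share(logical_cores):
    """Percent of total CPU that corresponds to one fully busy core."""
    return 100.0 / logical_cores

cores = 8        # assumption: inferred from the 12.5% threshold
observed = 15.0  # percent of total CPU, as reported in the quote

threshold = single_core_share(cores)
print("one core = %.1f%% of total CPU" % threshold)
print("at or above one full core: %s" % (observed >= threshold))
```

On a real machine you could replace the hard-coded 8 with multiprocessing.cpu_count() from the standard library.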

The --bench output is somewhat helpful, but it seems that the bulk of the load is happening in a function that I didn't put benchmarking code into. In your log files, I frequently see line pairs like this one (note the timestamps):

Code:

2017-09-24 21:52:37.814726 0.000 ms for 7 txs in p2p.py:handle_remember_tx (0.000 ms/tx)
2017-09-24 21:52:58.864085 0.000 ms for 11 txs in handle_losing_tx (0.000 ms/tx)
...
2017-09-24 21:52:59.239189 > >>> Warning: LOST CONTACT WITH BITCOIND for 1.6 minutes! Check that it isn't frozen or dead!

The "Lost contact" message means that your CPU was stalled for 1.6 minutes. 1.6 minutes before the "Lost contact" message at 21:52:59 would be 21:51:20 or something like that. The gap between the two log lines above also shows that your CPU was busy between 21:52:37 and 21:52:58. However, we don't know what was happening at that time, since the code that was running was something that I didn't add benchmarking code to.
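For anyone who wants to check their own logs for stalls like this, here is a minimal sketch that does the timestamp arithmetic and looks for silent gaps between consecutive log lines. It assumes every log line of interest starts with a 26-character timestamp in the format shown above; the 5 second cutoff and the function names are arbitrary choices of mine.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S.%f"

def parse_ts(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.ffffff' timestamp of a log line."""
    return datetime.strptime(line[:26], FMT)

def gaps(lines, min_gap_s=5.0):
    """Return (start, end, seconds) for consecutive lines at least min_gap_s apart."""
    found = []
    prev = None
    for line in lines:
        try:
            ts = parse_ts(line)
        except ValueError:
            continue  # skip lines without a leading timestamp, e.g. "..."
        if prev is not None:
            delta = (ts - prev).total_seconds()
            if delta >= min_gap_s:
                found.append((prev, ts, delta))
        prev = ts
    return found

log = [
    "2017-09-24 21:52:37.814726 0.000 ms for 7 txs in p2p.py:handle_remember_tx (0.000 ms/tx)",
    "2017-09-24 21:52:58.864085 0.000 ms for 11 txs in handle_losing_tx (0.000 ms/tx)",
    "...",
    "2017-09-24 21:52:59.239189 > >>> Warning: LOST CONTACT WITH BITCOIND for 1.6 minutes! Check that it isn't frozen or dead!",
]

found = gaps(log)
for start, end, seconds in found:
    print("no log output from %s to %s (%.1f s)" % (start.time(), end.time(), seconds))

# 1.6 minutes is 96 seconds; subtract that from the warning's timestamp.
stall_start = parse_ts(log[-1]) - timedelta(seconds=96)
print("stall began around %s" % stall_start.strftime("%H:%M:%S"))
```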

Can you add this to your p2pool startup command line and send me the resulting profile1.log file? Try to run it for about an hour. If you run it for much less than that, the profile data will be dominated by the start-up share loading time.

Code:

pypy -m cProfile -o "profile1.log" run_p2pool.py [other options]

If anyone is curious, you can analyze those cProfile log files with a Python script that looks like this:

Code:

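import pstats
import sys

# Show usage when no profile file is given.
if len(sys.argv) < 2:
    print("Usage: cumtime.py input_file [-t] [lines]")
    sys.exit(1)

p = pstats.Stats(sys.argv[1])

# -t sorts by time spent inside each function itself; the default is
# cumulative time, which also counts time spent in the functions it calls.
if '-t' in sys.argv:
    p.sort_stats('tottime')
else:
    p.sort_stats('cumtime')

# Optional number of rows to print (default 100), given before or after -t.
lines = 100
for arg in sys.argv[2:]:
    try:
        lines = int(arg)
    except ValueError:
        pass  # not a number (e.g. -t), keep the current value

p.print_stats(lines)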
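If you only care about a few functions, print_stats also accepts a regular expression that restricts the output to matching entries. Below is a small self-contained sketch that profiles a stand-in workload, writes the profile to a temporary file, and then prints only the rows matching a name. The busy_work function and the file name are placeholders of mine, not anything from p2pool.

```python
import cProfile
import io
import os
import pstats
import tempfile

def busy_work(n):
    """Stand-in workload; hypothetical, not part of p2pool."""
    return sum(i * i for i in range(n))

# Write a small profile to a temporary file, similar to what -o produces.
path = os.path.join(tempfile.mkdtemp(), "demo_profile.log")
profiler = cProfile.Profile()
profiler.enable()
busy_work(100000)
profiler.disable()
profiler.dump_stats(path)

# Load the file, sort by cumulative time, and keep only matching rows.
out = io.StringIO()
stats = pstats.Stats(path, stream=out)
stats.strip_dirs().sort_stats("cumtime")
stats.print_stats("busy_work", 10)  # regex restriction, then at most 10 rows
report = out.getvalue()
print(report)
```

With a real profile you would pass something like a function or file name from your own logs as the restriction instead of "busy_work".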