Sunday, July 31, 2016

Improving Genoil's ethminer


In my last post about mining Ethereum, I explained why I preferred Genoil's fork of the Ethereum Foundation's ethminer.  After that post, I started having stability problems with one of the newer releases of Genoil's miner.  I suspected deadlocks involving mutexes that had been added to the code.  The mutexes were meant to reduce the chance of the miner submitting stale or invalid shares, but in this case the solution was worse than the problem, since there is no harm in submitting a small number of invalid shares to a pool.  After taking some time to review the code and discuss my ideas with the author, I decided to make some improvements.  The result is ethminer-nr.

A description of some of the changes can be found on the issue tracker for Genoil's miner, since I expect most of my changes to be merged upstream.  The first thing I did was remove the mutexes.  This does open the possibility of a rare race condition that could cause an invalid share submission when one thread processes a share from a GPU while another thread processes a new job from the pool.  On Linux, the threads can be serialized by using the taskset command to pin the process to a single CPU.  On a multi-CPU system, use "taskset 1 ./ethminer ..." to pin the process to the first CPU.
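
The mask argument to taskset is a hexadecimal bitmask in which bit N selects CPU N, which is why a mask of 1 means the first CPU (CPU 0).  As a minimal sketch, the helper below computes the mask for a given CPU index; cpu_mask is a name made up for this example and is not part of taskset or ethminer:

```shell
#!/bin/sh
# cpu_mask: hypothetical helper, prints the hex affinity mask for one CPU index.
# Bit N of the mask selects CPU N, so CPU 0 -> 1, CPU 1 -> 2, CPU 3 -> 8.
cpu_mask() {
    printf '%x\n' "$((1 << $1))"
}

cpu_mask 0   # mask for the first CPU
cpu_mask 3   # mask for the fourth CPU

# Two equivalent ways to pin the miner to the first CPU (miner arguments elided):
#   taskset 1 ./ethminer ...
#   taskset -c 0 ./ethminer ...
```

The -c form takes a CPU list instead of a mask, which may be easier to read on machines with many cores.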

As described in the issue tracker, I added per-GPU reporting of hash rate.  I also reduced the stats output to accepted (A) and rejected (R) shares, including stales, since I have never seen a pool submission fail, and only some pools report rejected shares.  The more compact output lets the stats fit on a single line, even with hash rate reporting from multiple GPUs:
  m  13:28:46|ethminer  15099 24326 15099 =54525Khs A807+6:R0+0

To help detect when a pool connection has failed, I decided to rely on the TCP stack instead of trying to manage timeouts in the code.  I started by enabling TCP keepalives on the stratum connection to the pool.  If the pool server is alive but simply has not had any new jobs for a while, the socket connection will remain open.  If the network connection to the pool fails, there will be no keepalive response and the socket will be closed.  Since the default timeouts are rather long, I reduced them to make network failure detection faster:
sudo sysctl -w net.ipv4.tcp_keepalive_time=30
sudo sysctl -w net.ipv4.tcp_keepalive_intvl=5
sudo sysctl -w net.ipv4.tcp_keepalive_probes=3
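
With keepalives, a dead connection should be detected after roughly the idle time plus one probe interval per unanswered probe.  A quick sketch of that arithmetic, assuming 30 seconds of idle time followed by 3 probes at 5-second intervals (keepalive_detect_secs is a name made up for this example):

```shell
#!/bin/sh
# keepalive_detect_secs: hypothetical helper for this example.
# Arguments: idle time (s), probe interval (s), number of probes.
# Approximate detection time: idle time + interval * probes.
keepalive_detect_secs() {
    echo "$(( $1 + $2 * $3 ))"
}

keepalive_detect_secs 30 5 3      # reduced settings assumed above: 45
keepalive_detect_secs 7200 75 9   # usual Linux defaults: 7875
```

So under these assumptions a failed pool connection is noticed in about 45 seconds instead of a little over two hours.  The live values can be checked with "sysctl -n net.ipv4.tcp_keepalive_time" and its two siblings.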

I wasn't certain whether packets sent to the server reset the keepalive timer even if there is no response (not even an ACK) from the server.  Therefore I also reduced the TCP retransmission count from its default to 5, so the pool connection will close after a packet (e.g. a share submission) is sent 5 times without an acknowledgement.
sudo sysctl -w net.ipv4.tcp_retries2=5
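
To get a feel for how long 5 retransmissions take: as I read the kernel documentation, the timeout follows an exponential backoff that starts from the retransmission timeout (RTO).  The sketch below assumes the Linux minimum RTO of 200 ms and maximum of 120 s, so the result is a lower bound; the real RTO depends on the measured round-trip time to the pool.  retries2_timeout_ms is a name made up for this example:

```shell
#!/bin/sh
# retries2_timeout_ms: hypothetical helper, prints the lower-bound timeout in
# milliseconds for a given tcp_retries2 value.
# Assumes an RTO that starts at 200 ms, doubles with each retry, and is
# capped at 120 s (the cap applies from the 10th retry onward).
retries2_timeout_ms() {
    n=$1
    if [ "$n" -le 9 ]; then
        echo "$(( ((2 << n) - 1) * 200 ))"
    else
        echo "$(( 1023 * 200 + (n - 9) * 120000 ))"
    fi
}

retries2_timeout_ms 5    # 12600 ms, about 12.6 s
retries2_timeout_ms 15   # 924600 ms, the 924.6 s the kernel docs quote for the default of 15
```

Under these assumptions, an unacknowledged share submission closes the connection after roughly 13 seconds at the earliest, compared with more than 15 minutes at the default setting.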

I was also able to make a stand-alone Linux binary.  Until now the Linux builds had made extensive use of shared libraries, so the binary could not be used without first installing several shared library dependencies like boost and json.  I had to do some of the build manually, so to make your own statically linked binary you'll have to wait a few days for some updates to the cmake build scripts.  If you want to try it now anyway, you can add "-DETH_STATIC=1" to the cmake command line.

As for future improvements, since I've started learning OpenCL, I'm hoping to optimize the ethminer OpenCL kernel to improve hashing performance.  Look for something in late August or early September.

3 comments:

  1. Hi Ralph! Really nice article - I'm a bit late, but I'm also digging into altcoins mining on Linux. Thankfully my 12 years as Linux developer helps a lot! I've seen that there's still open issues from you on Genoil's repository: do you recommend to stick with your own fork? Is it fixed for 16.60?

    1. My binary release works OK with kernel 4.8 and 16.40, but not with 16.60. I think 16.60 doesn't like one of the #pragmas in the OpenCL kernel, but I haven't done enough testing yet to push changes to the repo. I'm also planning to merge 110 back into master at that time.

    2. Just received the test hardware today (RX 580) - I'm using Ubuntu 16.04.2 (kernel 4.8.0-51), genoil's master (1.1.7?) but drivers 17.10! No luck so far. Will downgrade to 16.40! Thanks!
