Sunday, July 17, 2016

Diving into the OpenCL deep end

Programs for mining on GPUs are usually written in OpenCL.  It's based on C, which I know well, so a few weeks ago I decided to try to improve some mining OpenCL code.  My intention was to both learn OpenCL and better understand mining algorithms.

I started with simple changes to the OpenCL code for Genoil's ethminer.  I then spent a lot of time reading the GCN architecture and instruction set documents to understand how AMD GPUs run OpenCL code.  Since I recently started mining Sia, I took a look at the gominer kernel code, and thought I might be able to optimize the performance.  I tested with the AMD fglrx drivers under Ubuntu 14.04 (OpenGL version string: 4.5.13399) with an R9 290 card.

The first thing I tried was replacing the rotate code in the ror64 function to use amd_bitalign.  The bitalign instruction (v_alignbit_b32) can do a 32-bit rotate in a single cycle, much like the ARM barrel shifter.  I was surprised that the speed did not improve, which suggests the AMD OpenCL drivers are optimized to use the alignbit instruction.  What was worse was that the kernel would calculate incorrect hash values.  After double and triple-checking my code, I found a post indicating a bug with amd_bitalign when using values divisible by 8.  I then tried amd_bytealign, and that didn't work either.  I was able to confirm the bug when I found that a bitalign of 21 followed by 3 worked (albeit slower), while a single bitalign of 24 did not.
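
For illustration, here is a minimal C model of how a 64-bit rotate right can be built from two 32-bit alignbit operations. It is a sketch, not the gominer kernel code: the names bitalign, ror64_ref and ror64_bitalign are made up for this example, and bitalign only emulates on the CPU what amd_bitalign is documented to compute (the hi:lo pair shifted right by the low 5 bits of the shift count).

```c
#include <stdint.h>

/* Software model of amd_bitalign (v_alignbit_b32): treat hi:lo as one
   64-bit value, shift right by (shift & 31), keep the low 32 bits. */
static uint32_t bitalign(uint32_t hi, uint32_t lo, uint32_t shift)
{
    uint64_t pair = ((uint64_t)hi << 32) | lo;
    return (uint32_t)(pair >> (shift & 31));
}

/* Plain shift-and-or rotate right, valid for 0 < y < 64. */
static uint64_t ror64_ref(uint64_t x, unsigned y)
{
    return (x >> y) | (x << (64 - y));
}

/* Rotate right built from two 32-bit alignbit operations, 0 < y < 64. */
static uint64_t ror64_bitalign(uint64_t x, unsigned y)
{
    uint32_t lo = (uint32_t)x;
    uint32_t hi = (uint32_t)(x >> 32);

    if (y >= 32) {              /* rotating by 32 just swaps the halves */
        uint32_t t = lo;
        lo = hi;
        hi = t;
        y -= 32;
    }

    /* Each output word takes its low bits from one input word and its
       high bits from the other. */
    uint32_t new_lo = bitalign(hi, lo, y);
    uint32_t new_hi = bitalign(lo, hi, y);
    return ((uint64_t)new_hi << 32) | new_lo;
}
```

In this model a rotate by 21 followed by a rotate by 3 gives the same result as a single rotate by 24, which is the property the two-step workaround relies on.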

It would seem there is no longer any reason to use amd_bitalign.  Relying on the driver to optimize the code also keeps it portable to other platforms.  I couldn't find any documentation from AMD saying that bitalign and the other media ops are deprecated, but I did verify that the pragmas make no difference in the kernel:
#pragma OPENCL EXTENSION cl_amd_media_ops : enable
#pragma OPENCL EXTENSION cl_amd_media_ops : disable

After finding a post stating the rotate() function is optimized to use alignbit, I tried changing the "ror64(x, y)" calls to "rotate(x, 64-y)".  The code functioned properly but was actually slower.  By using AMD_OCL_BUILD_OPTIONS_APPEND=-save-temps, I was able to view the assembler .isa files, and could tell that the calls to rotate with 64-bit values were using v_lshlrev_b32, v_lshrrev_b64, and v_or_b32 instead of a pair of v_alignbit_b32 instructions.  Besides using 1 additional instruction, the 64-bit shift instructions apparently take 2 or even 4 times longer to execute on some platforms.
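
The substitution works because OpenCL's rotate() rotates left, and a left rotate by 64-y is the same as a right rotate by y. A small C check of that identity, as a sketch only: rotl64 and ror64 here are stand-in helpers written for this example, not the kernel's functions, and rotl64 merely models rotate() on a 64-bit value with the count reduced modulo 64.

```c
#include <stdint.h>

/* Rotate left, modelling OpenCL rotate() on a ulong; n is taken modulo 64. */
static uint64_t rotl64(uint64_t x, unsigned n)
{
    n &= 63;
    return n ? (x << n) | (x >> (64 - n)) : x;
}

/* Rotate right by n bits, n taken modulo 64. */
static uint64_t ror64(uint64_t x, unsigned n)
{
    n &= 63;
    return n ? (x >> n) | (x << (64 - n)) : x;
}
```

For any y between 1 and 63, rotl64(x, 64 - y) and ror64(x, y) return the same value, so the two forms differ only in the instructions the compiler chooses to emit for them.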

In the end, I wasn't able to improve the kernel speed.  I think re-writing the kernel in GCN assembler is probably the best way to get the maximum hashing performance.

1 comment:

  1. Old post, but for anyone interested: I've found that speed improvements can be had by helping out the compiler. E.g. any for loops should be removed, as the compare in them takes extra clock ticks. Also, eradicating the for loops means you can remove basic maths on the incrementing integer, e.g. i[j] = G[j+6] + M[j+10]: all the additions can be reduced to constants.

    Reading from global memory is OK if there is only one access to the global memory in the calculation; with more than one access, you are better off copying it to local variables.

    See for an optimisation i did