Sunday, March 29, 2015

Fastest AVR software SPI in the West

Most AVR MCUs like the ATmega328p have a hardware I/O shift register, but only on a fixed set of pins.  Arduino's shiftOut function is horribly slow, so a number of faster implementations have been written.  I'll look at how fast they are, and explain an implementation in AVR assembler that's faster than any C implementation, and I'll even claim that it is the fastest software SPI for the AVR.

Adafruit spiWrite

void spiWrite(uint8_t data)
{
 uint8_t bit;
 for(bit = 0x80; bit; bit >>= 1) {
  SPIPORT &= ~clkpinmask;
  if(data & bit) SPIPORT |= mosipinmask;
  else SPIPORT &= ~mosipinmask;
  SPIPORT |= clkpinmask;
 }
}

This code comes from the Adafruit Nokia 5110 LCD library.  It's a bit odd because it doesn't use a loop counting down from 8 to 0 for the bits to be shifted, it shifts a bit through the 8 bits in a byte.  While it is much faster than Arduino's shiftOut from using direct port manipulation instead of the slow digitalWrite, its far from an optimal implementation in C.  I compiled the code using avr-gcc 4.8 with -Os, and disassembled the code using avr-objdump -D:
00000000 <spiWrite>:
   0:   28 e0           ldi     r18, 0x08       ; 8
   2:   30 e0           ldi     r19, 0x00       ; 0
   4:   90 e8           ldi     r25, 0x80       ; 128
   6:   2d 98           cbi     0x05, 5 ; 5
   8:   49 2f           mov     r20, r25
   a:   48 23           and     r20, r24
   c:   11 f0           breq    .+4             ; 0x12
   e:   2c 9a           sbi     0x05, 4 ; 5
  10:   01 c0           rjmp    .+2             ; 0x14
  12:   2c 98           cbi     0x05, 4 ; 5
  14:   2d 9a           sbi     0x05, 5 ; 5
  16:   96 95           lsr     r25
  18:   21 50           subi    r18, 0x01       ; 1
  1a:   31 09           sbc     r19, r1
  1c:   21 15           cp      r18, r1
  1e:   31 05           cpc     r19, r1
  20:   91 f7           brne    .-28            ; 0x6
  22:   08 95           ret

The loop for each bit compiles to 14 instructions, and takes 17 clock cycles for a 0 and 18 for a 1.  Although most AVR instructions take a single cycle, the cbi and sbi instructions for clearing and setting a single bit take two cycles.  Branches, when taken, are also two-cycle instructions.

Generic spi_byte

void spi_byte(uint8_t byte){
    uint8_t i = 8;
    do{
        SPIPORT &= ~mosipinmask;
        if(byte & 0x80) SPIPORT |= mosipinmask;
        SPIPORT |= clkpinmask;  // clk hi
        byte <<= 1;
        SPIPORT &=~ clkpinmask; // clk lo

    }while(--i);
    return;
}

I've seen variants of this code used not just for AVR, but also for PIC MCUs.  It is faster than the Adafruit code in part because the loop counts down to zero, and as experienced coders know, on almost every CPU, counting up from zero to eight is slower than counting down to zero.  The disassembly shows the code to be 40% faster than the Adafruit code, taking 12 cycles for a 0 and 13 for a 1.
00000024 <spi_byte>:
  24:   98 e0           ldi     r25, 0x08       ; 8
  26:   2c 98           cbi     0x05, 4 ; 5
  28:   87 fd           sbrc    r24, 7
  2a:   2c 9a           sbi     0x05, 4 ; 5
  2c:   2d 9a           sbi     0x05, 5 ; 5
  2e:   88 0f           add     r24, r24
  30:   2d 98           cbi     0x05, 5 ; 5
  32:   91 50           subi    r25, 0x01       ; 1
  34:   c1 f7           brne    .-16            ; 0x26
  36:   08 95           ret

AVR optimized in C

Looking at the assembler code, half of the loop time is taken by the cbi and sbi two-cycle instructions.  The key to further speed optimizations is to write code that will compile to single-cyle out instructions instead.  The mosi and clk pins can be cleared by reading the port state before the loop, then writing the 8 bits of the port with mosi and clk cleared:
    uint8_t portbits = (SPIPORT & ~(mosipinmask | clkpinmask) );
    do{
        SPIPORT = portbits;      // clk and data low

This also saves having to clear the clk pin at the end of the loop, for a total savings of 3 cycles.  With this technique, the time per bit can be reduced to 9 cycles.  By using the AVR PIN register, another cycle can be shaved off the loop.  The datasheet does not describe the PIN register in detail, stating little more than, "Writing a logic one to PINxn toggles the value of PORTxn".  What this means, for example, is that writing 0x81 to PINB will toggle the state of bit 0 and bit 7, leaving the rest of the bits unchanged.  Here's the final code:
void spi_byteFast(uint8_t byte){
    uint8_t i = 8;
    uint8_t portbits = (SPIPORT & ~(mosipinmask | clkpinmask) );
    do{
        SPIPORT = portbits;      // clk and data low
        if(byte & 0x80) SPIPIN = mosipinmask;
        SPIPIN = clkpinmask;     // toggle clk
        byte <<= 1;
    }while(--i);
    return;
}

The disassembly shows that although the code size has increased, the loop for transmitting a bit takes only 8 cycles.  More speed can be obtained at the cost of code size by having the compiler unroll the loop (enabled with -O3 in gcc).  This would reduce the time per bit to 5 cycles.
00000050 <spi_byteFast>:
  50:   25 b1           in      r18, 0x05       ; 5
  52:   2f 7c           andi    r18, 0xCF       ; 207
  54:   98 e0           ldi     r25, 0x08       ; 8
  56:   40 e1           ldi     r20, 0x10       ; 16
  58:   30 e2           ldi     r19, 0x20       ; 32
  5a:   25 b9           out     0x05, r18       ; 5
  5c:   87 fd           sbrc    r24, 7
  5e:   43 b9           out     0x03, r20       ; 3
  60:   33 b9           out     0x03, r19       ; 3
  62:   88 0f           add     r24, r24
  64:   91 50           subi    r25, 0x01       ; 1
  66:   c9 f7           brne    .-14            ; 0x5a
  68:   08 95           ret

Assembler

I learned to code in assembler (6502) over thirty years ago, and started to learn C a few years after that.  When gcc was first released in 1987, it generated code that was much larger and slower than assembler.  Although it has improved significantly over the years, what surprises me is that it or any other C compiler still rarely matches hand-optimized assembler code.  You might think that there's nothing left to optimize out of  the 7 instructions that make up the the loop above, but by making use of the carry flag, I can eliminate the loop counter.  That saves a register and reduces the loop time to 7 cycles from 8:
spiByte:
    in r18, SPIPORT     ; save port state
    andi r18, ~(mosipinmask | clkpinmask)
    ldi r20, mosipinmask
    ldi r19, clkpinmask
    lsl r24
    ori r24, 0x01       ; 9th bit marks end of byte
spiBit:
    out SPIPORT, r18
    brcc zeroBit
    out SPIPORT-2, r20  ; PORT address -2 is PIN
    lsl r24
    out SPIPORT-2, r19  ; clk hi
    brne spiBit
    ret

When looking for fast software SPI code, the best I could find was 8 cycles per bit.  I read a couple posts on AVRfreaks claiming 7 cycles is possible, but no code was posted.  Unrolled, the above assembler code is still 5 cycles per bit, the same as the optimized C version.  So to back up my claim about the fastest code and hand-optimized assembler being better than the compiler, I need to reduce the timing to 4 cycles per bit.  I can do it using the AVR's T flag, with the bst and bld instructions that transfer a single bit between the T flag and a register.
spiFast:
    in r25, SPIPORT     ; save port state
    andi r25, ~clkpinmask
    ldi r19, clkpinmask
halfByte:
    bst r24, 7
    bld r25, MOSI
    out SPIPORT, r25    ; clk low + data
    out SPIPORT-2, r19  ; clk hi
    bst r24, 6
    bld r25, MOSI
    out SPIPORT, r25    ; clk low + data
    out SPIPORT-2, r19  ; clk hi
    bst r24, 5
    bld r25, MOSI
    out SPIPORT, r25    ; clk low + data
    out SPIPORT-2, r19  ; clk hi
    bst r24, 4
    bld r25, MOSI
    out SPIPORT, r25    ; clk low + data
    out SPIPORT-2, r19  ; clk hi
    swap r24
    eor r1, r19         ; r1 is zero reg
    brne halfByte
    ret

The loop is half unrolled, doing two loops of 4 bits, with the function using a total of 23 instructions.  Fully unrolled the function would use 35 instructions/cycles (plus return), saving 4 cycles over the half unrolled version.

Conclusion

Including overhead, the spiFast assembler code is just under half the speed of  hardware SPI running at full speed (2 cycles per bit).  With the assistance of a hardware timer to generate the clock, and a port dedicated to just the mosi line, it's theoretically possible to output one bit every two cycles using a sequence of lsl and out instructions.  But for a fully software implementation that doesn't modify anything other than the mosi and clk bits, you won't find anything faster than 4 cycles per bit.  Copies of the code are available on my github repo: spi.S and spiWrite.c.

December 2015 update

I've made some small improvements to the code.  Port register state after reset is 0 (all low), so some minor tweaks were made to the code with that in mind.

15 comments:

  1. What does "SPIPORT-2" mean? I am new to AVR assembly

    ReplyDelete
    Replies
    1. It means subtract 2 from SPIPORT. That will give you the data direction register for the PORT. This works in C and asm.

      Delete
    2. It's not the data direction register, it's the PIN register. On newer parts writing 1 to a bit in the PIN register toggles the associated PORT bit. It won't work on older parts.

      Delete
    3. This comment has been removed by the author.

      Delete
    4. Oops, I'm surprised nobody noticed that. Port address -1 is DDR, and -2 is PIN. It's misleading to say it's for "newer" parts. It works on older (i.e. >5yr old) parts too. There are some discontinued parts that didn't have the feature.

      Delete
  2. While I see everywhere optimized code for fast write, the equivelant read code doesn't seem to exist. Why?

    ReplyDelete
    Replies
    1. Adding read (for full-duplex spi) is 2 more asm lines (sbis & ori), though you can't do the carry flag loop counter trick (at least I can't think of a way).
      Rx-only can be done in the same 7-cycle timing of tx only.

      Delete
    2. Can it be done for receive only using the carry flag for half duplex? I am trying to find the fastest most optimal routine to readonly SPI data. It is a continous stream of 8+2 bits from a source with mosi and clk only. So it would actually be the matching receive only optimised routine to your spiWrite.

      Delete
    3. For 8-bit Rx only it can be done using cary flag trick. If you are receiving 10 bits over 2 bytes an unrolled loop would be more code but simpler.

      Delete
    4. I am receiving a fast stream of 6 bytes at a time, 8 bits each, with 2 bits spacing in between them, so how would you write such a routine? It is actually exactly the opposite of your highly optimized master soft spiWrite routine where another AVR slave receives the clock and data in any pair of pins.

      Delete
    5. software spi slave is a bit more work and is slower since it has to busy loop on the clk transitions. Before your comment I whipped up a spi master rx routine and added it to spi.S on my github repo. I can probably do a slave version as well, but just a standard 8-bit version. You'll have to modify it for your odd 2-bit spacing situation. Is it a synchronous UART?

      Delete
    6. I just pushed a new version of spi.S with a new spiSIn function. Best case timing is 9 cycles per bit. If you want to live dangerously you could remove the check for clock low at the end of the loop and get it down to a 7-cycle minimum. It assembles clean but is untested.

      Delete
    7. No, it is not a UART. It is an older Atmel (89c4051) with possibly a 6 byte shiftout routine. Thank you! This is what I was looking for. I will test and see if it works. 9 cycles should be perfect for what I want but just for the record, it is still not able to catch up with the 4 cycle per bit equivalent write.

      Delete
    8. What just worked, and I am very surprised because I was told it is not possible, is hardware SPI as it is suppose to be 1 byte in, 1 out. However the following seems to work well within the SPI interrupt:

      b1 = SPDR;
      while(!(SPSR & (1<<SPIF)));
      b2 = SPDR;
      while(!(SPSR & (1<<SPIF)));
      b3 = SPDR;
      while(!(SPSR & (1<<SPIF)));
      b4 = SPDR;
      while(!(SPSR & (1<<SPIF)));
      b5 = SPDR;
      while(!(SPSR & (1<<SPIF)));
      b6 = SPDR;

      Delete