Nerd Ralph: Fastest AVR software SPI in the West

Most AVR MCUs like the ATmega328p have a hardware I/O shift register, but only on a fixed set of pins. Arduino's shiftOut function is horribly slow, so a number of faster implementations have been written. I'll look at how fast they are, and explain an implementation in AVR assembler that's faster than any C implementation, and I'll even claim that it is the fastest software SPI for the AVR.

Adafruit spiWrite

void spiWrite(uint8_t data)

{

uint8_t bit;

for(bit = 0x80; bit; bit >>= 1) {

SPIPORT &= ~clkpinmask;

if(data & bit) SPIPORT |= mosipinmask;

else SPIPORT &= ~mosipinmask;

SPIPORT |= clkpinmask;

}

This code comes from the Adafruit Nokia 5110 LCD library. It's a bit odd because it doesn't use a loop counting down from 8 to 0 for the bits to be shifted, it shifts a bit through the 8 bits in a byte. While it is much faster than Arduino's shiftOut from using direct port manipulation instead of the slow digitalWrite, its far from an optimal implementation in C. I compiled the code using avr-gcc 4.8 with -Os, and disassembled the code using avr-objdump -D:

00000000 <spiWrite>:

0: 28 e0 ldi r18, 0x08 ; 8

2: 30 e0 ldi r19, 0x00 ; 0

4: 90 e8 ldi r25, 0x80 ; 128

6: 2d 98 cbi 0x05, 5 ; 5

8: 49 2f mov r20, r25

a: 48 23 and r20, r24

c: 11 f0 breq .+4 ; 0x12

e: 2c 9a sbi 0x05, 4 ; 5

10: 01 c0 rjmp .+2 ; 0x14

12: 2c 98 cbi 0x05, 4 ; 5

14: 2d 9a sbi 0x05, 5 ; 5

16: 96 95 lsr r25

18: 21 50 subi r18, 0x01 ; 1

1a: 31 09 sbc r19, r1

1c: 21 15 cp r18, r1

1e: 31 05 cpc r19, r1

20: 91 f7 brne .-28 ; 0x6

22: 08 95 ret

The loop for each bit compiles to 14 instructions, and takes 17 clock cycles for a 0 and 18 for a 1. Although most AVR instructions take a single cycle, the cbi and sbi instructions for clearing and setting a single bit take two cycles. Branches, when taken, are also two-cycle instructions.

Generic spi_byte

void spi_byte(uint8_t byte){

uint8_t i = 8;

do{

SPIPORT &= ~mosipinmask;

if(byte & 0x80) SPIPORT |= mosipinmask;

SPIPORT |= clkpinmask; // clk hi

byte <<= 1;

SPIPORT &=~ clkpinmask; // clk lo

}while(--i);

return;

}

I've seen variants of this code used not just for AVR, but also for PIC MCUs. It is faster than the Adafruit code in part because the loop counts down to zero, and as experienced coders know, on almost every CPU, counting up from zero to eight is slower than counting down to zero. The disassembly shows the code to be 40% faster than the Adafruit code, taking 12 cycles for a 0 and 13 for a 1.

00000024 <spi_byte>:

24: 98 e0 ldi r25, 0x08 ; 8

26: 2c 98 cbi 0x05, 4 ; 5

28: 87 fd sbrc r24, 7

2a: 2c 9a sbi 0x05, 4 ; 5

2c: 2d 9a sbi 0x05, 5 ; 5

2e: 88 0f add r24, r24

30: 2d 98 cbi 0x05, 5 ; 5

32: 91 50 subi r25, 0x01 ; 1

34: c1 f7 brne .-16 ; 0x26

36: 08 95 ret

AVR optimized in C

Looking at the assembler code, half of the loop time is taken by the cbi and sbi two-cycle instructions. The key to further speed optimizations is to write code that will compile to single-cyle out instructions instead. The mosi and clk pins can be cleared by reading the port state before the loop, then writing the 8 bits of the port with mosi and clk cleared:

uint8_t portbits = (SPIPORT & ~(mosipinmask | clkpinmask) );

do{

SPIPORT = portbits; // clk and data low

This also saves having to clear the clk pin at the end of the loop, for a total savings of 3 cycles. With this technique, the time per bit can be reduced to 9 cycles. By using the AVR PIN register, another cycle can be shaved off the loop. The datasheet does not describe the PIN register in detail, stating little more than, "Writing a logic one to PINxn toggles the value of PORTxn". What this means, for example, is that writing 0x81 to PINB will toggle the state of bit 0 and bit 7, leaving the rest of the bits unchanged. Here's the final code:

void spi_byteFast(uint8_t byte){

uint8_t i = 8;

uint8_t portbits = (SPIPORT & ~(mosipinmask | clkpinmask) );

do{

SPIPORT = portbits; // clk and data low

if(byte & 0x80) SPIPIN = mosipinmask;

SPIPIN = clkpinmask; // toggle clk

byte <<= 1;

}while(--i);

return;

}

The disassembly shows that although the code size has increased, the loop for transmitting a bit takes only 8 cycles. More speed can be obtained at the cost of code size by having the compiler unroll the loop (enabled with -O3 in gcc). This would reduce the time per bit to 5 cycles.

00000050 <spi_byteFast>:

50: 25 b1 in r18, 0x05 ; 5

52: 2f 7c andi r18, 0xCF ; 207

54: 98 e0 ldi r25, 0x08 ; 8

56: 40 e1 ldi r20, 0x10 ; 16

58: 30 e2 ldi r19, 0x20 ; 32

5a: 25 b9 out 0x05, r18 ; 5

5c: 87 fd sbrc r24, 7

5e: 43 b9 out 0x03, r20 ; 3

60: 33 b9 out 0x03, r19 ; 3

62: 88 0f add r24, r24

64: 91 50 subi r25, 0x01 ; 1

66: c9 f7 brne .-14 ; 0x5a

68: 08 95 ret

Assembler

I learned to code in assembler (6502) over thirty years ago, and started to learn C a few years after that. When gcc was first released in 1987, it generated code that was much larger and slower than assembler. Although it has improved significantly over the years, what surprises me is that it or any other C compiler still rarely matches hand-optimized assembler code. You might think that there's nothing left to optimize out of the 7 instructions that make up the the loop above, but by making use of the carry flag, I can eliminate the loop counter. That saves a register and reduces the loop time to 7 cycles from 8:

spiByte:

in r18, SPIPORT ; save port state

andi r18, ~(mosipinmask | clkpinmask)

ldi r20, mosipinmask

ldi r19, clkpinmask

lsl r24

ori r24, 0x01 ; 9th bit marks end of byte

spiBit:

out SPIPORT, r18

brcc zeroBit

out SPIPORT-2, r20 ; PORT address -2 is PIN

lsl r24

out SPIPORT-2, r19 ; clk hi

brne spiBit

ret

When looking for fast software SPI code, the best I could find was 8 cycles per bit. I read a couple posts on AVRfreaks claiming 7 cycles is possible, but no code was posted. Unrolled, the above assembler code is still 5 cycles per bit, the same as the optimized C version. So to back up my claim about the fastest code and hand-optimized assembler being better than the compiler, I need to reduce the timing to 4 cycles per bit. I can do it using the AVR's T flag, with the bst and bld instructions that transfer a single bit between the T flag and a register.

spiFast:

in r25, SPIPORT ; save port state

andi r25, ~clkpinmask

ldi r19, clkpinmask

halfByte:

bst r24, 7

bld r25, MOSI

out SPIPORT, r25 ; clk low + data

out SPIPORT-2, r19 ; clk hi

bst r24, 6

bld r25, MOSI

out SPIPORT, r25 ; clk low + data

out SPIPORT-2, r19 ; clk hi

bst r24, 5

bld r25, MOSI

out SPIPORT, r25 ; clk low + data

out SPIPORT-2, r19 ; clk hi

bst r24, 4

bld r25, MOSI

out SPIPORT, r25 ; clk low + data

out SPIPORT-2, r19 ; clk hi

swap r24

eor r1, r19 ; r1 is zero reg

brne halfByte

ret

The loop is half unrolled, doing two loops of 4 bits, with the function using a total of 23 instructions. Fully unrolled the function would use 35 instructions/cycles (plus return), saving 4 cycles over the half unrolled version.

Conclusion

Including overhead, the spiFast assembler code is just under half the speed of hardware SPI running at full speed (2 cycles per bit). With the assistance of a hardware timer to generate the clock, and a port dedicated to just the mosi line, it's theoretically possible to output one bit every two cycles using a sequence of lsl and out instructions. But for a fully software implementation that doesn't modify anything other than the mosi and clk bits, you won't find anything faster than 4 cycles per bit. Copies of the code are available on my github repo: spi.S and spiWrite.c.

December 2015 update

I've made some small improvements to the code. Port register state after reset is 0 (all low), so some minor tweaks were made to the code with that in mind.

Nerd Ralph

Sunday, March 29, 2015

Fastest AVR software SPI in the West

Adafruit spiWrite

Generic spi_byte

AVR optimized in C

Assembler

Conclusion

December 2015 update

14 comments: