Most AVR MCUs like the
ATmega328p have a hardware I/O shift register, but only on a fixed set of pins. Arduino's
shiftOut function is horribly slow, so a number of faster implementations have been written. I'll look at how fast they are, and explain an implementation in AVR assembler that's faster than any C implementation, and I'll even claim that it is the fastest software SPI for the AVR.
Adafruit spiWrite
void spiWrite(uint8_t data)
{
uint8_t bit;
for(bit = 0x80; bit; bit >>= 1) {
SPIPORT &= ~clkpinmask;
if(data & bit) SPIPORT |= mosipinmask;
else SPIPORT &= ~mosipinmask;
SPIPORT |= clkpinmask;
}
}
This code comes from the
Adafruit Nokia 5110 LCD library. It's a bit odd because it doesn't use a loop counting down from 8 to 0 for the bits to be shifted, it shifts a bit through the 8 bits in a byte. While it is much faster than Arduino's shiftOut from using direct port manipulation instead of the slow digitalWrite, its far from an optimal implementation in C. I compiled the code using
avr-gcc 4.8 with -Os, and disassembled the code using avr-objdump -D:
00000000 <spiWrite>:
0: 28 e0 ldi r18, 0x08 ; 8
2: 30 e0 ldi r19, 0x00 ; 0
4: 90 e8 ldi r25, 0x80 ; 128
6: 2d 98 cbi 0x05, 5 ; 5
8: 49 2f mov r20, r25
a: 48 23 and r20, r24
c: 11 f0 breq .+4 ; 0x12
e: 2c 9a sbi 0x05, 4 ; 5
10: 01 c0 rjmp .+2 ; 0x14
12: 2c 98 cbi 0x05, 4 ; 5
14: 2d 9a sbi 0x05, 5 ; 5
16: 96 95 lsr r25
18: 21 50 subi r18, 0x01 ; 1
1a: 31 09 sbc r19, r1
1c: 21 15 cp r18, r1
1e: 31 05 cpc r19, r1
20: 91 f7 brne .-28 ; 0x6
22: 08 95 ret
The loop for each bit compiles to 14 instructions, and takes 17 clock cycles for a 0 and 18 for a 1. Although most AVR instructions take a single cycle, the
cbi and
sbi instructions for clearing and setting a single bit take two cycles. Branches, when taken, are also two-cycle instructions.
Generic spi_byte
void spi_byte(uint8_t byte){
uint8_t i = 8;
do{
SPIPORT &= ~mosipinmask;
if(byte & 0x80) SPIPORT |= mosipinmask;
SPIPORT |= clkpinmask; // clk hi
byte <<= 1;
SPIPORT &=~ clkpinmask; // clk lo
}while(--i);
return;
}
I've seen variants of this code used not just for AVR, but also for PIC MCUs. It is faster than the Adafruit code in part because the loop counts down to zero, and as experienced coders know, on almost every CPU, counting up from zero to eight is slower than counting down to zero. The disassembly shows the code to be 40% faster than the Adafruit code, taking 12 cycles for a 0 and 13 for a 1.
00000024 <spi_byte>:
24: 98 e0 ldi r25, 0x08 ; 8
26: 2c 98 cbi 0x05, 4 ; 5
28: 87 fd sbrc r24, 7
2a: 2c 9a sbi 0x05, 4 ; 5
2c: 2d 9a sbi 0x05, 5 ; 5
2e: 88 0f add r24, r24
30: 2d 98 cbi 0x05, 5 ; 5
32: 91 50 subi r25, 0x01 ; 1
34: c1 f7 brne .-16 ; 0x26
36: 08 95 ret
AVR optimized in C
Looking at the assembler code, half of the loop time is taken by the cbi and sbi two-cycle instructions. The key to further speed optimizations is to write code that will compile to single-cyle
out instructions instead. The mosi and clk pins can be cleared by reading the port state before the loop, then writing the 8 bits of the port with mosi and clk cleared:
uint8_t portbits = (SPIPORT & ~(mosipinmask | clkpinmask) );
do{
SPIPORT = portbits; // clk and data low
This also saves having to clear the clk pin at the end of the loop, for a total savings of 3 cycles. With this technique, the time per bit can be reduced to 9 cycles. By using the AVR PIN register, another cycle can be shaved off the loop. The datasheet does not describe the PIN register in detail, stating little more than, "Writing a logic one to PINxn toggles the value of PORTxn". What this means, for example, is that writing 0x81 to PINB will toggle the state of bit 0 and bit 7, leaving the rest of the bits unchanged. Here's the final code:
void spi_byteFast(uint8_t byte){
uint8_t i = 8;
uint8_t portbits = (SPIPORT & ~(mosipinmask | clkpinmask) );
do{
SPIPORT = portbits; // clk and data low
if(byte & 0x80) SPIPIN = mosipinmask;
SPIPIN = clkpinmask; // toggle clk
byte <<= 1;
}while(--i);
return;
}
The disassembly shows that although the code size has increased, the loop for transmitting a bit takes only 8 cycles. More speed can be obtained at the cost of code size by having the compiler unroll the loop (enabled with -O3 in gcc). This would reduce the time per bit to 5 cycles.
00000050 <spi_byteFast>:
50: 25 b1 in r18, 0x05 ; 5
52: 2f 7c andi r18, 0xCF ; 207
54: 98 e0 ldi r25, 0x08 ; 8
56: 40 e1 ldi r20, 0x10 ; 16
58: 30 e2 ldi r19, 0x20 ; 32
5a: 25 b9 out 0x05, r18 ; 5
5c: 87 fd sbrc r24, 7
5e: 43 b9 out 0x03, r20 ; 3
60: 33 b9 out 0x03, r19 ; 3
62: 88 0f add r24, r24
64: 91 50 subi r25, 0x01 ; 1
66: c9 f7 brne .-14 ; 0x5a
68: 08 95 ret
Assembler
I learned to code in assembler (6502) over thirty years ago, and started to learn C a few years after that. When gcc was first released in 1987, it generated code that was much larger and slower than assembler. Although it has improved significantly over the years, what surprises me is that it or any other C compiler still rarely matches hand-optimized assembler code. You might think that there's nothing left to optimize out of the 7 instructions that make up the the loop above, but by making use of the carry flag, I can eliminate the loop counter. That saves a register and reduces the loop time to 7 cycles from 8:
spiByte:
in r18, SPIPORT ; save port state
andi r18, ~(mosipinmask | clkpinmask)
ldi r20, mosipinmask
ldi r19, clkpinmask
lsl r24
ori r24, 0x01 ; 9th bit marks end of byte
spiBit:
out SPIPORT, r18
brcc zeroBit
out SPIPORT-2, r20 ; PORT address -2 is PIN
lsl r24
out SPIPORT-2, r19 ; clk hi
brne spiBit
ret
When looking for fast software SPI code, the best I could find was 8 cycles per bit. I read a couple posts on AVRfreaks claiming 7 cycles is possible, but no code was posted. Unrolled, the above assembler code is still 5 cycles per bit, the same as the optimized C version. So to back up my claim about the fastest code and hand-optimized assembler being better than the compiler, I need to reduce the timing to 4 cycles per bit. I can do it using the AVR's T flag, with the
bst and
bld instructions that transfer a single bit between the T flag and a register.
spiFast:
in r25, SPIPORT ; save port state
andi r25, ~clkpinmask
ldi r19, clkpinmask
halfByte:
bst r24, 7
bld r25, MOSI
out SPIPORT, r25 ; clk low + data
out SPIPORT-2, r19 ; clk hi
bst r24, 6
bld r25, MOSI
out SPIPORT, r25 ; clk low + data
out SPIPORT-2, r19 ; clk hi
bst r24, 5
bld r25, MOSI
out SPIPORT, r25 ; clk low + data
out SPIPORT-2, r19 ; clk hi
bst r24, 4
bld r25, MOSI
out SPIPORT, r25 ; clk low + data
out SPIPORT-2, r19 ; clk hi
swap r24
eor r1, r19 ; r1 is zero reg
brne halfByte
ret
The loop is half unrolled, doing two loops of 4 bits, with the function using a total of 23 instructions. Fully unrolled the function would use 35 instructions/cycles (plus return), saving 4 cycles over the half unrolled version.
Conclusion
Including overhead, the spiFast assembler code is just under half the speed of hardware SPI running at full speed (2 cycles per bit). With the assistance of a hardware timer to generate the clock, and a port dedicated to just the mosi line, it's theoretically possible to output one bit every two cycles using a sequence of lsl and out instructions. But for a fully software implementation that doesn't modify anything other than the mosi and clk bits, you won't find anything faster than 4 cycles per bit. Copies of the code are available on my github repo:
spi.S and
spiWrite.c.
December 2015 update
I've made some small improvements to the code. Port register state after reset is 0 (all low), so some minor tweaks were made to the code with that in mind.