Monday, March 30, 2015

ESP8266 SPI flash performance


Despite the popularity of the ESP8266, I have yet to see a detailed datasheet published.  Nava Whiteford, on his blog, has links to a summary datasheet and to the Cadence Tensilica core that the chip is based on.  None of this provides any details on how the memory controller pages in data from the SPI flash, nor on the speed of the communications.  About all that is clear from the datasheet and chip markings is that it uses a quad-SPI serial flash chip.

I decided to find out the performance of the SPI flash, as well as get an idea of what the cache line fill size of the chip is.  By looking at the pin-out of the flash chip, I determined that pin 6 is the clock.  After some probing and playing with the settings on my scope, I captured the clock burst shown above.

Based on the 500ns horizontal scale, the clock burst lasts a little more than 2us.  Zooming in shows that the clock is exactly 40MHz, or half of the ESP8266's 80MHz clock and half of the maximum 80MHz speed rating of the SPI flash.  Given that the burst lasts a little more than 2us, the total number of clock pulses is in the range of 85-90.  Accounting for the overhead of commands to enable quad SPI mode and address setup, it seems the burst corresponds to reading 32 bytes from the flash, and therefore the cache line size is likely 32 bytes.
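The arithmetic behind that estimate can be sketched in a few lines of C.  This is only a back-of-envelope check: the function names are made up for illustration, and the 2125ns and 2250ns burst lengths mentioned below are simply the durations that correspond to 85 and 90 clocks, not separate measurements.

```c
#include <stdint.h>

/* Clock pulses in a burst of burst_ns nanoseconds at clock_mhz MHz. */
static uint32_t burst_clocks(uint32_t burst_ns, uint32_t clock_mhz)
{
    return burst_ns * clock_mhz / 1000u;
}

/* Clocks needed to move n_bytes over four data lines (4 bits per clock). */
static uint32_t quad_data_clocks(uint32_t n_bytes)
{
    return n_bytes * 8u / 4u;
}
```

A 32-byte read needs 64 data clocks, which leaves roughly 21-26 of the 85-90 clocks for command and address overhead.  A 64-byte read would need 128 data clocks, more than the whole burst, so a 64-byte line fill can be ruled out.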

Conclusion

The clock signal is clean, and with a rise + fall time of 11.1ns, it could be increased to 90MHz without significant distortion or attenuation.  With documentation on the registers to change the CPU clock speed to 160MHz, the ESP8266 could be run at double speed without overclocking the SPI flash, assuming the flash clock stays at half the CPU clock and so reaches only the 80MHz the flash is rated for.

Sunday, March 29, 2015

Fastest AVR software SPI in the West

Most AVR MCUs like the ATmega328p have a hardware I/O shift register, but only on a fixed set of pins.  Arduino's shiftOut function is horribly slow, so a number of faster implementations have been written.  I'll look at how fast they are, and explain an implementation in AVR assembler that's faster than any C implementation, and I'll even claim that it is the fastest software SPI for the AVR.

Adafruit spiWrite

void spiWrite(uint8_t data)
{
 uint8_t bit;
 for(bit = 0x80; bit; bit >>= 1) {
  SPIPORT &= ~clkpinmask;
  if(data & bit) SPIPORT |= mosipinmask;
  else SPIPORT &= ~mosipinmask;
  SPIPORT |= clkpinmask;
 }
}

This code comes from the Adafruit Nokia 5110 LCD library.  It's a bit odd because, rather than using a loop counter that counts down from 8 to 0, it shifts a mask bit through the 8 bit positions of a byte.  While it is much faster than Arduino's shiftOut, thanks to direct port manipulation instead of the slow digitalWrite, it's far from an optimal implementation in C.  I compiled the code using avr-gcc 4.8 with -Os, and disassembled the code using avr-objdump -D:
00000000 <spiWrite>:
   0:   28 e0           ldi     r18, 0x08       ; 8
   2:   30 e0           ldi     r19, 0x00       ; 0
   4:   90 e8           ldi     r25, 0x80       ; 128
   6:   2d 98           cbi     0x05, 5 ; 5
   8:   49 2f           mov     r20, r25
   a:   48 23           and     r20, r24
   c:   11 f0           breq    .+4             ; 0x12
   e:   2c 9a           sbi     0x05, 4 ; 5
  10:   01 c0           rjmp    .+2             ; 0x14
  12:   2c 98           cbi     0x05, 4 ; 5
  14:   2d 9a           sbi     0x05, 5 ; 5
  16:   96 95           lsr     r25
  18:   21 50           subi    r18, 0x01       ; 1
  1a:   31 09           sbc     r19, r1
  1c:   21 15           cp      r18, r1
  1e:   31 05           cpc     r19, r1
  20:   91 f7           brne    .-28            ; 0x6
  22:   08 95           ret

The loop for each bit compiles to 14 instructions, and takes 17 clock cycles for a 0 and 18 for a 1.  Although most AVR instructions take a single cycle, the cbi and sbi instructions for clearing and setting a single bit take two cycles.  Branches, when taken, are also two-cycle instructions.

Generic spi_byte

void spi_byte(uint8_t byte){
    uint8_t i = 8;
    do{
        SPIPORT &= ~mosipinmask;
        if(byte & 0x80) SPIPORT |= mosipinmask;
        SPIPORT |= clkpinmask;  // clk hi
        byte <<= 1;
        SPIPORT &=~ clkpinmask; // clk lo

    }while(--i);
    return;
}

I've seen variants of this code used not just for AVR, but also for PIC MCUs.  It is faster than the Adafruit code in part because the loop counts down to zero, and as experienced coders know, on almost every CPU, counting up from zero to eight is slower than counting down to zero.  The disassembly shows the code to be 40% faster than the Adafruit code, taking 12 cycles for a 0 and 13 for a 1.
00000024 <spi_byte>:
  24:   98 e0           ldi     r25, 0x08       ; 8
  26:   2c 98           cbi     0x05, 4 ; 5
  28:   87 fd           sbrc    r24, 7
  2a:   2c 9a           sbi     0x05, 4 ; 5
  2c:   2d 9a           sbi     0x05, 5 ; 5
  2e:   88 0f           add     r24, r24
  30:   2d 98           cbi     0x05, 5 ; 5
  32:   91 50           subi    r25, 0x01       ; 1
  34:   c1 f7           brne    .-16            ; 0x26
  36:   08 95           ret


AVR optimized in C

Looking at the assembler code, half of the loop time is taken by the cbi and sbi two-cycle instructions.  The key to further speed optimizations is to write code that will compile to single-cycle out instructions instead.  The mosi and clk pins can be cleared by reading the port state before the loop, then writing the 8 bits of the port with mosi and clk cleared:
    uint8_t portbits = (SPIPORT & ~(mosipinmask | clkpinmask) );
    do{
        SPIPORT = portbits;      // clk and data low

This also saves having to clear the clk pin at the end of the loop, for a total savings of 3 cycles.  With this technique, the time per bit can be reduced to 9 cycles.  By using the AVR PIN register, another cycle can be shaved off the loop.  The datasheet does not describe the PIN register in detail, stating little more than, "Writing a logic one to PINxn toggles the value of PORTxn".  What this means, for example, is that writing 0x81 to PINB will toggle the state of bit 0 and bit 7, leaving the rest of the bits unchanged.  Here's the final code:
void spi_byteFast(uint8_t byte){
    uint8_t i = 8;
    uint8_t portbits = (SPIPORT & ~(mosipinmask | clkpinmask) );
    do{
        SPIPORT = portbits;      // clk and data low
        if(byte & 0x80) SPIPIN = mosipinmask;
        SPIPIN = clkpinmask;     // toggle clk
        byte <<= 1;
    }while(--i);
    return;
}

The disassembly shows that although the code size has increased, the loop for transmitting a bit takes only 8 cycles.  More speed can be obtained at the cost of code size by having the compiler unroll the loop (enabled with -O3 in gcc).  This would reduce the time per bit to 5 cycles.
00000050 <spi_byteFast>:
  50:   25 b1           in      r18, 0x05       ; 5
  52:   2f 7c           andi    r18, 0xCF       ; 207
  54:   98 e0           ldi     r25, 0x08       ; 8
  56:   40 e1           ldi     r20, 0x10       ; 16
  58:   30 e2           ldi     r19, 0x20       ; 32
  5a:   25 b9           out     0x05, r18       ; 5
  5c:   87 fd           sbrc    r24, 7
  5e:   43 b9           out     0x03, r20       ; 3
  60:   33 b9           out     0x03, r19       ; 3
  62:   88 0f           add     r24, r24
  64:   91 50           subi    r25, 0x01       ; 1
  66:   c9 f7           brne    .-14            ; 0x5a
  68:   08 95           ret
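The PIN-register toggle that spi_byteFast relies on can be checked on a PC with a small model.  This is a sketch, not AVR code: write_port, write_pin and spi_byteFast_sim are hypothetical names, the pin masks are taken from the disassembly above (mosi on bit 4, clk on bit 5), and the "slave" simply samples mosi on each rising clock edge.

```c
#include <stdint.h>

#define MOSI_MASK 0x10  /* bit 4, as in the disassembly above */
#define CLK_MASK  0x20  /* bit 5, as in the disassembly above */

static uint8_t port;      /* simulated PORTx output latch */
static uint8_t received;  /* bits sampled by the simulated SPI slave */

/* Model of "SPIPORT = v": all 8 bits of the port are overwritten. */
static void write_port(uint8_t v)
{
    port = v;
}

/* Model of "SPIPIN = v": every 1 bit in v toggles the matching port bit. */
static void write_pin(uint8_t v)
{
    uint8_t before = port;
    port ^= v;
    /* The simulated slave samples mosi on a rising clk edge. */
    if (!(before & CLK_MASK) && (port & CLK_MASK))
        received = (uint8_t)((received << 1) | ((port & MOSI_MASK) ? 1u : 0u));
}

/* Same control flow as spi_byteFast, using the modelled registers. */
static void spi_byteFast_sim(uint8_t byte)
{
    uint8_t i = 8;
    uint8_t portbits = port & (uint8_t)~(MOSI_MASK | CLK_MASK);
    do {
        write_port(portbits);                  /* clk and data low */
        if (byte & 0x80) write_pin(MOSI_MASK); /* toggle mosi high for a 1 */
        write_pin(CLK_MASK);                   /* toggle clk high */
        byte <<= 1;
    } while (--i);
}
```

Starting with port at 0x0F, write_pin(0x81) leaves it at 0x8E: bits 0 and 7 flip and the rest are untouched.  Sending 0xA5 through spi_byteFast_sim shifts the same byte into the simulated slave while the other six port bits keep their values.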

Assembler

I learned to code in assembler (6502) over thirty years ago, and started to learn C a few years after that.  When gcc was first released in 1987, it generated code that was much larger and slower than assembler.  Although it has improved significantly over the years, what surprises me is that it or any other C compiler still rarely matches hand-optimized assembler code.  You might think that there's nothing left to optimize out of the 7 instructions that make up the loop above, but by making use of the carry flag, I can eliminate the loop counter.  That saves a register and reduces the loop time to 7 cycles from 8:
spiByte:
    in r18, SPIPORT     ; save port state
    andi r18, ~(mosipinmask | clkpinmask)
    ldi r20, mosipinmask
    ldi r19, clkpinmask
    lsl r24
    ori r24, 0x01       ; 9th bit marks end of byte
spiBit:
    out SPIPORT, r18
    brcc zeroBit
    out SPIPORT-2, r20  ; PORT address -2 is PIN
zeroBit:
    lsl r24
    out SPIPORT-2, r19  ; clk hi
    brne spiBit
    ret
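The way the marker bit replaces the loop counter is easier to see in a host-side model.  The following C is a sketch of the register and carry-flag behaviour only, with a hypothetical function name; it does not touch any port.

```c
#include <stdint.h>

/* Model of the spiByte loop control.  Returns the number of trips through
 * the loop and stores the bits, in the order they reach mosi, in *out. */
static unsigned sentinel_shift(uint8_t byte, uint8_t *out)
{
    unsigned carry = (byte >> 7) & 1u;           /* lsl r24: bit 7 -> carry */
    uint8_t reg = (uint8_t)((byte << 1) | 0x01); /* ori r24, 0x01: marker   */
    unsigned count = 0;
    uint8_t sent = 0;

    do {
        sent = (uint8_t)((sent << 1) | carry);   /* carry decides mosi      */
        carry = (reg >> 7) & 1u;                 /* lsl r24: next bit       */
        reg = (uint8_t)(reg << 1);
        count++;
    } while (reg != 0);                          /* brne spiBit             */

    *out = sent;
    return count;
}
```

Whatever the data byte is, the register only becomes zero when the marker itself is shifted out, so the loop always runs exactly 8 times.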

When looking for fast software SPI code, the best I could find was 8 cycles per bit.  I read a couple posts on AVRfreaks claiming 7 cycles is possible, but no code was posted.  Unrolled, the above assembler code is still 5 cycles per bit, the same as the optimized C version.  So to back up my claim about the fastest code and hand-optimized assembler being better than the compiler, I need to reduce the timing to 4 cycles per bit.  I can do it using the AVR's T flag, with the bst and bld instructions that transfer a single bit between the T flag and a register.
spiFast:
    in r25, SPIPORT     ; save port state
    andi r25, ~clkpinmask
    ldi r19, clkpinmask
halfByte:
    bst r24, 7
    bld r25, MOSI
    out SPIPORT, r25    ; clk low + data
    out SPIPORT-2, r19  ; clk hi
    bst r24, 6
    bld r25, MOSI
    out SPIPORT, r25    ; clk low + data
    out SPIPORT-2, r19  ; clk hi
    bst r24, 5
    bld r25, MOSI
    out SPIPORT, r25    ; clk low + data
    out SPIPORT-2, r19  ; clk hi
    bst r24, 4
    bld r25, MOSI
    out SPIPORT, r25    ; clk low + data
    out SPIPORT-2, r19  ; clk hi
    swap r24
    eor r1, r19         ; r1 is zero reg
    brne halfByte
    ret

The loop is half unrolled, doing two loops of 4 bits, with the function using a total of 23 instructions.  Fully unrolled, the function would use 35 instructions/cycles (plus return), saving 7 cycles over the half unrolled version: the swap, eor and brne take 4 cycles on the first pass and 3 on the second.
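The loop control in spiFast, toggling r1 with eor instead of keeping a counter, can also be modelled on a PC.  This is a sketch with a hypothetical function name and an assumed clock mask of 0x20; it tracks only the bit order and the zero register, not the port writes.

```c
#include <stdint.h>

#define CLK_MASK 0x20  /* assumed clkpinmask: bit 5, as in the earlier listings */

/* Model of spiFast's half-unrolled loop.  Returns the bits in the order they
 * would appear on mosi; *passes counts trips through halfByte and *zero_reg
 * receives the final value of the modelled r1. */
static uint8_t spi_fast_model(uint8_t byte, unsigned *passes, uint8_t *zero_reg)
{
    uint8_t r1 = 0;   /* avr-gcc expects r1 to hold zero */
    uint8_t sent = 0;

    *passes = 0;
    do {
        for (int bit = 7; bit >= 4; bit--)            /* bst/bld/out/out x4 */
            sent = (uint8_t)((sent << 1) | ((byte >> bit) & 1u));
        byte = (uint8_t)((byte << 4) | (byte >> 4));  /* swap r24           */
        r1 ^= CLK_MASK;                               /* eor r1, r19        */
        (*passes)++;
    } while (r1 != 0);                                /* brne halfByte      */

    *zero_reg = r1;
    return sent;
}
```

The first eor makes r1 non-zero, so the branch is taken; the second restores r1 to zero, which both ends the loop and leaves the zero register as the compiler expects it.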

Conclusion

Including overhead, the spiFast assembler code is just under half the speed of hardware SPI running at full speed (2 cycles per bit).  With the assistance of a hardware timer to generate the clock, and a port dedicated to just the mosi line, it's theoretically possible to output one bit every two cycles using a sequence of lsl and out instructions.  But for a fully software implementation that doesn't modify anything other than the mosi and clk bits, you won't find anything faster than 4 cycles per bit.  Copies of the code are available on my github repo: spi.S and spiWrite.c.
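To put the cycle counts in perspective, they can be converted to bit rates.  The helper below is a trivial sketch with a made-up name; the 16MHz clock used in the figures that follow is an assumption (a typical Arduino crystal), not something measured here, and per-byte call overhead is ignored.

```c
#include <stdint.h>

/* Peak bit rate, in bits per second, of a loop that needs cycles_per_bit
 * CPU cycles for every bit at a CPU clock of f_cpu_hz. */
static uint32_t soft_spi_bps(uint32_t f_cpu_hz, uint32_t cycles_per_bit)
{
    return f_cpu_hz / cycles_per_bit;
}
```

At an assumed 16MHz, the 17-cycle Adafruit loop peaks at about 0.94Mbps, the 12-cycle spi_byte at about 1.33Mbps, the 8-cycle spi_byteFast at 2Mbps, and the 4-cycle spiFast at 4Mbps, against 8Mbps for hardware SPI at 2 cycles per bit.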

December 2015 update

I've made some small improvements to the code.  Port register state after reset is 0 (all low), so some minor tweaks were made to the code with that in mind.

Thursday, March 26, 2015

Yet another esp8266 article

When I received my ESP-01 module, I didn't expect I'd end up writing a blog post about it.  Unless you've spent the last six months on a research mission in Antarctica, you probably have read about these cheap and relatively easy to program Wifi modules.  But as I learned a few things that I haven't read about, I decided to share my knowledge.

The first thing I learned is that the module worked fine getting its power from a PL2303HX USB-TTL module, despite many claims I've read that the esp8266 modules draw too much power.  This is likely true for FTDI USB-TTL modules, as the internal 3.3V regulator on them is only rated for 50mA.  The regulator on the PL2303HX is rated for 150mA, and my testing revealed the modules will output more than the ~200mA maximum required for the esp8266 with minimal voltage drop.  I did find the high current draw when I connected power to the ESP-01 sometimes caused the USB-TTL module to reset.  This was resolved by soldering a 22nF capacitor between the 3.3V output and Gnd.

As most guides will tell you, CH_PD/CHIP_EN needs to be high in order for the module to boot, so I soldered a short wire to Vcc.  Some people also say to pull RST high; however, like many MCUs, the esp8266 has an internal pull-up on reset, so no external pull-up is required.  Similarly, GPIO0 and GPIO2 need to be high to select boot from SPI flash.  GPIO0 has an internal pull-up on boot, and GPIO2 defaults to high, so nothing needs to be wired to these pins.

An easy way to tell if your ESP is working is to do a Wifi scan.  After I powered it up, I could see an open access point named ESP_XXXXXX, where XXXXXX is part of the MAC address.

To communicate with the ESP-01, I started by using Putty at 9600bps to try using the AT commands.  I wasn't getting any response when I typed, so I reset the module by touching a resistor between Gnd and RST.  This caused the blue LED to flash, and along with a bit of garbage, the following text appeared:
[Vendor:www.ai-thinker.com Version:0.9.2.4]

What I eventually figured out was that the esp8266 requires all AT commands to end with CRLF.  Putty seems to send just CR for [Enter], so I entered <CTRL-M><CTRL-J> in order to send a CRLF.
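When sending AT commands from a program rather than a terminal, the CRLF has to be appended explicitly.  A minimal sketch in C, with a hypothetical helper name; how the resulting buffer gets written to the serial port is left out, as it depends on the platform:

```c
#include <stdio.h>
#include <string.h>

/* Writes cmd followed by CR LF into buf.  Returns the number of characters
 * written (excluding the terminating NUL), or -1 if buf is too small. */
static int at_format(char *buf, size_t buflen, const char *cmd)
{
    int n = snprintf(buf, buflen, "%s\r\n", cmd);
    return (n < 0 || (size_t)n >= buflen) ? -1 : n;
}
```

Formatting the bare "AT" command this way yields the four bytes 'A', 'T', CR, LF, the same sequence as typing AT followed by <CTRL-M><CTRL-J> in Putty.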

I decided to try the nodeMCU eLua firmware, so I jumpered GPIO0 low, and ran the nodemcu-flasher.  After the firmware was flashed, I was able to connect (still at 9600bps) with putty and enter elua code.  The next step was to try an IDE for elua development on the esp8266.

I found three different IDEs: LuaLoader, Esplorer, and Lua Uploader.  I tried LuaLoader first, but had problems with serial communications - I wasn't seeing any text coming from the esp8266.  Its handy DTR & RTS control buttons worked, as I could get the module to reset by wiring RTS to reset on the module, then toggling RTS low then back to high.  I tried running Esplorer, but when I found out it needs Java SE installed first, I moved on to trying Lua Uploader rather than waiting for the Java runtime to download and install.

As can be seen from the screen shot above, Lua Uploader is rather spartan, but it worked fine for me.  By default it loads with a blink program that toggles GPIO2.  Communication with the ESP is rather slow at the default 9600bps, so I saved the following init.lua file to the ESP for 115.2kbps communication:
uart.setup(0, 115200, 8, 0, 1, 1)

Conclusion

I'm impressed with the capability of these little modules.  Next I plan to get another module with more GPIO, and try out the C SDK.

Monday, March 16, 2015

Low-power wireless communication options

I've written a number of blog posts about wireless communications, primarily 315/433MHz ASK/OOK modules and 2.4GHz GFSK modules.  In this article I'll analyze the pros and cons of each.

UHF ASK/OOK modules

ASK modules are the simplest type of communication modules, both in their interface and in their radio coding.  Single-pin operation allows them to be used with a single GPIO pin, or even with a UART.  The 315 and 433MHz bands can be received by inexpensive RTL-SDR dongles, making antenna testing relatively easy.  At 78c for a pair of modules, they are a very inexpensive option for wireless communications for embedded systems.  The bands are popular for low-bandwidth keyless entry systems, so fob transmitters are widely available for 315 and 433MHz.

For a given power output budget, 315MHz communications has the best range since free-space loss increases with frequency.  Using 433MHz means about 3dB more attenuation than 315, and 2.4GHz has about 15dB more attenuation than 433.  The 315 and 433MHz bands can be used for low-power control and intermittent data in the US and Canada, but in the UK it looks like only the 433MHz band is available.  Due to the simple radio encoding involving the presence or absence of a carrier frequency, receiver sensitivity is better than -100dBm.  Superheterodyne receivers have a sensitivity in the -107 to -111dBm range, while super-regenerative receivers have a sensitivity in the -103 to -106dBm range.
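The attenuation figures follow from the standard free-space path loss formula.  As a worked sketch, assuming isotropic antennas and the same distance d at both frequencies, and taking 2400MHz as the representative 2.4GHz frequency:

```latex
\[
L_{\mathrm{fs}}(d, f) = 20\log_{10}\!\left(\frac{4\pi d f}{c}\right)\ \mathrm{dB}
\]
\[
\Delta L = L_{\mathrm{fs}}(d, f_2) - L_{\mathrm{fs}}(d, f_1)
         = 20\log_{10}\!\left(\frac{f_2}{f_1}\right)
\]
\[
20\log_{10}\!\left(\frac{433}{315}\right) \approx 2.8\ \mathrm{dB}, \qquad
20\log_{10}\!\left(\frac{2400}{433}\right) \approx 14.9\ \mathrm{dB}
\]
```

The distance cancels out of the difference, so the extra loss depends only on the ratio of the two frequencies.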

The inexpensive super-regenerative receivers may need tuning, though this can be avoided by using the more expensive superheterodyne receivers which use a crystal for a frequency reference.  High-level protocol requirements such as CRC or addressing must be handled in software, although libraries such as VirtualWire can greatly simplify this task.

Although the specs on most transmitters indicate an input power requirement between 3.5 and 12V, the modules I have tested will work on lower voltages.  I tested a 433MHz transmitter module powered by 2xAA NiMH providing 2.6V.  At that voltage, the transmit power was about 9dB lower than when it is powered at 5V.

As David Cook explains in his LoFi project, it is feasible to use CR2032 coin cells to power projects using these modules.  Depending on the MCU power consumption and the frequency of data transmission, even 5 years of operation on a single coin cell is possible.

2.4GHz modules

I've written a few blog posts before about 2.4GHz modules, both Nordic's nRF24l01 and Semitek's SE8R01.  The nRF chip has the lower power consumption, 13.5mA in receive mode, compared with 19.5mA for the Semitek module.  Either module can still be powered from a CR2032 coin cell.  When using a 50-100uF capacitor as recommended by TI, capacitor leakage can add to battery drain, so use a higher voltage capacitor, such as 16 or even 50V, as these will have lower leakage.

Although Nordic's chip has lower power consumption, you might not get a real nRF chip when you purchase a module.  Besides higher power consumption, some clones are not compatible with the genuine nRF.  This can range from being completely unable to communicate with nRF modules, like the SE8R01, to simply being unable to work with the dynamic payload length feature.  With SE8R01 modules selling for only 41c in small quantity, if you don't require compatibility with the genuine nRF protocol, and aren't planning to try bit-banging bluetooth low energy, then these seem to be the best choice.  One other reason to go with genuine nRF modules (if you can find them for a reasonable price) is for range.  At 250kbps, the nRF has a receiver sensitivity of -94 dBm.  The SE8R01 doesn't have a 250kbps mode, and at both 500kbps and 1Mbps the receiver sensitivity is -86 dBm.

Although the power consumption of 2.4GHz modules is similar to UHF ASK/OOK modules, the higher data rate permits lower average power.  Take for example a periodic transmission consisting of a total of 10 bytes: 1 sync, 3 address, 4 payload data, and 2 CRC bytes.  At 250kbps the radio would be transmitting for only 320us and needs to be powered another 130us for tx settling time, making the total less than half a millisecond.  An ASK/OOK module transmitting at 9600bps would have the radio powered for a little over 8ms.  For a periodic transmission every 10 seconds, and with sleep power consumption of 4uA, the ASK module would consume an average of 21uA, compared to just 5uA for the nRF module.
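The averages above can be reproduced with a short duty-cycle calculation.  The sketch below uses hypothetical function names, and the transmit currents in the note that follows are assumptions: about 20mA for the ASK transmitter (chosen because it reproduces the 21uA figure) and 11.3mA for the nRF at full output power.

```c
/* Time on air, in microseconds, for n_bytes sent at bitrate_bps. */
static double airtime_us(unsigned n_bytes, double bitrate_bps)
{
    return n_bytes * 8.0 / bitrate_bps * 1e6;
}

/* Average current in uA for a node that draws tx_ma milliamps for on_us
 * microseconds once every period_s seconds, and sleep_ua otherwise. */
static double avg_current_ua(double tx_ma, double on_us,
                             double period_s, double sleep_ua)
{
    double duty = (on_us * 1e-6) / period_s;
    return tx_ma * 1000.0 * duty + sleep_ua * (1.0 - duty);
}
```

With those assumed currents, the ASK module's 8.3ms of air time every 10 seconds averages about 20.7uA, while the nRF's 450us (320us on air plus 130us settling) averages about 4.5uA.  In both cases the 4uA sleep current is the floor, and for the nRF it dominates the total.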

Conclusion

For low-speed uni-directional data, a UHF ASK/OOK module will work fine, and is the simplest solution.  For reliable bi-directional communication, or for high bandwidth, 2.4GHz modules are the best choice.