Saturday, April 3, 2021

Honey, I shrunk the Arduino core!


One of my gripes about the Arduino AVR core is that it is not an example of efficient embedded programming.  One of the foundations of C++ (PDF) is zero-overhead abstractions, yet the Arduino core has a very significant overhead.  The Arduino basic blink example compiles to almost 1kB, with most of that space taken up by code that is never used.  Rewriting the AVR core is a task I'm not ready to tackle, but after writing picoCore, I realized I could use many of the same optimization techniques in an Arduino library.  The result is ArduinoShrink, a library that can dramatically reduce the compiled size of Arduino projects.  In this post I'll explain some of the techniques I used to achieve the coding trifecta of faster, better, and smaller.

The Arduino core is actually a static library that is linked with the project code.  As Eli explains in this post on static linking, libraries like libc usually have only one function per .o in order to avoid linking in unnecessary code.  The Arduino doesn't use that kind of modular approach, however by making use of gcc's "-ffunction-sections" option, it does mitigate the amount of code bloat due to the non-modular approach.

With ArduinoShrink, I wrote more modular, self-contained code.  For example, the Arduino delay() function calls micros(), which relies on the 32-bit timer0 interrupt overflow counter.  I simplified the delay function so that it only needs the 8-bit timer value.  If the user code never calls micros() or millis(), the timer0 ISR code never gets linked in.  By using a more efficient algorithm and writing the code in AVR assembler, I reduced the size of the delay function to 12 instructions taking 24 bytes of flash.

In order to minimize code size and maximize speed, almost half of the code is in AVR assembler.  Despite improvements in compiler optimization techniques over the past decades, on architectures like the AVR I can almost always write better assembler code than what the compiler generates.  That's especially true for interrupt service routines, such as the timer0 interrupt used to maintain the counters for millis() and micros().  My assembler version of the interrupt uses only 56 bytes of flash, and is faster than the Arduino ISR written in C.

One part that is still written in C is the digitalWrite() function.  The Arduino core uses a set of tables in flash to map a given pin number to an IO port and bit, making for a lot of code to have digitalWrite(13, LOW) clear PORTB5.  Making use of Bill's discovery that these flash memory table lookups can be resolved at compile time, digitalWrite(13, LOW) compiles to a single instruction: "cbi PORTB, 5".

ArduinoShrink is also designed to significantly reduce interrupt latency.  The original timer0 interrupt takes around 5us to run, during which time any other interrupts are delayed.  The first instruction in my ISR is 'sei', which allows other interrupts to run, reducing the latency impact to a few cycles more than the hardware minimum.  The official Arduino core disables interrupts in several places, such as when reading the millis counter.  My solution is to detect if the millis counter has been updated and re-read it, thereby avoiding any interrupt latency impact.

The only limitation compared to the official AVR core is that the compiler must be able to resolve the pin number for the digital IO functions at compile time.  Although the pin may hard-coded, even with LTO enabled, avr-gcc is not always able to recognize the pin is a compile-time constant.  Since AVR is not a priority target for GCC optimizations, I can't rely on compiler improvements to resolve this limitation.  Therefore I plan to write a version of digitalWrite that is much smaller and faster, even when avr-gcc can't figure out the pin at compile time.

Although ArduinoShrink should be compatible with any Arduino sketch, given some of the compiler tricks I've used it's not unlikely I've missed a potential error.  If you do find what you think is a bug, open an issue in the github repository.

Tuesday, March 2, 2021

Writing USB firmware on the CH55x MCUs

Over the last several months, I've been familiarizing myself with the CH552 and CH551 MCUs.  Most recently, I've been learning how to program the USB serial interface engine on these devices.  The USB interface is powerful and flexible enough to implement many different kinds of USB devices, from HID to CDC serial.  The highlights are:

  • support for endpoints 0 through 4, both IN and OUT
  • 64-byte maximum packet size
  • DMA to/from xram only
  • multiple USB interrupt triggers
One of the first requirements for writing USB firmware is writing the descriptors.  The examples from WCH are difficult to use as a template due to the descriptors being uint8_t arrays instead of structures.  There are USB structure and constant definitions in ch554_usb.h, which I recommend using instead of arrays.  For instance, I changed the CDC serial example from :

__code uint8_t DevDesc[] = {0x12,0x01,0x10,0x01,0x02,0x00,0x00,DEFAULT_ENDP0_SIZE,

__code USB_DEV_DESCR DevDesc = {
.bLength = 18,
.bDescriptorType = USB_DESCR_TYP_DEVICE,
.bcdUSBH = 0x01, .bcdUSBL = 0x10,
.bDeviceSubClass = 0,
.bDeviceProtocol = 0,
.bMaxPacketSize0 = DEFAULT_ENDP0_SIZE,
.idVendorH = 0x1a, .idVendorL = 0x86,
.idProductH = 0x57, .idProductL = 0x22,
.bcdDeviceH = 0x01, .bcdDeviceL = 0x00,
.iManufacturer = 1, // string descriptors
.iProduct = 2,
.iSerialNumber = 0,
.bNumConfigurations = 1

Once the descriptors are written, the code to handle device enumeration is mostly boilerplate and can be copied from one of the examples.  During the firmware development stage, I recommend adding a call to disconnectUSB() near the start of main().  It's a function I added to debug.h which forces the host to re-enumerate the device.  This way I don't have to unplug and re-connect the USB module after flashing new firmware.

Setting up the DMA buffer pointers requires special attention when multiple IN and OUT endpoints are used.  Even though five endpoints are supported, there are only four DMA buffer pointer registers: UEP[0-3]_DMA.  When the bits bUEP4_RX_EN and bUEP4_TX_EN are set in the UEP4_1_MOD SFR, the EP4 OUT buffer is UEP0_DMA + 64, and the EP4 IN buffer is UEP0_DMA + 128.  Endpoints 1-3 have even more complex buffer configurations, with optional double-buffering for IN and OUT using 256 bytes for four buffers starting from the UEPn_DMA pointer.

When I first started writing USB firmware for the CH551 and CH552, I was concerned that it may be difficult to meet the tight timing requirements, particularly for control and bulk packets that can have multiple in a single 1ms frame.  For example, with small data packets, the time between the end of one OUT transfer and the end of the next OUT transfer can be less than 20uS.  If the USB interrupt handler is too slow, the 2nd OUT transfer could overwrite the DMA buffer before processing of the first has completed.  This situation is avoided by setting bUC_INT_BUSY in the USB_CTRL SFR.  When this bit is set, the SIE will NAK any packets while the UIF_TRANSFER flag is set.  Therefore I recommend setting bUC_INT_BUSY, and clear UIF_TRANSFER at the end of the interrupt handler.

I am currently working on the CMSIS_DAP example.  It implements the DAPv1 (HID) protocol supporting SWD transfers, and works well with OpenOCD and pyOCD.  I'm working on adding CDC/ACM for serial UART communication.  The first step is creating the descriptors for the composite CDC + HID device.  The second step will be integrating the usb_device_cdc code.  The final step, although not absolutely necessary, will be optimizing the CDC code for baud rates up to 1mbps.  The current code uses transmit and receive ring buffers with data copied to and from the IN and OUT DMA buffers.  With double-buffering, the transmit and receive ring buffers can be omitted.  The UART interrupt will copy directly between SBUF and the appropriate USB DMA buffer.

Tuesday, January 26, 2021

Quirks of the CH55x MCUs

Over the past several months, I've been been learning to use the CH551 and CH552 MCUs.  Learning generic 8051 programming was the easy part, as there is lots of old documentation available, with Philips having written some of the best.  The learning curve for WCH's additions to the MCS-51 architecture has been steeper, requiring careful reading of the datasheets, and reading the SDK headers and examples.  I've found that the CH55x chips have some quirks that I've never encountered on any other MCUs.

The GPIO modes are controlled by two registers: MOD_OC and DIR_PU.  The register values are explained in the datasheet and in ch554.h in the SDK.  Figure 10.2.1 in the datasheet shows a schematic diagram for the GPIO.  Modes 0, 1, and 2 are for high-Z input, push-pull, and open-drain respectively.  Mode 3, "standard 8051 mode" is the most complicated.  It's an open drain mode with internal pullup, but with the output driven high for two cycles when the GPIO changes from a 0 to a 1.  This ensures a fast signal rise time.  The part that took me the longest to figure out was the operation of the pullup.  The GPIO diagram shows 70k and 10k, but section 10 of the datasheet does not explain their operation.  Therefore I've highlighted a part of the schematic in green.  When the pin input schmitt trigger output is 1, the inverter in the top right of the diagram will output a low signal to turn on the pFET activating the 10k pullup.  When port input value is 0, only the weak 70k pullup is active.

The pullups aren't actually implemented as resistors on the IC.  They are specially-designed FETs with a high drain-source resistance (RDS).  Since RDS varies with gate-source voltage (Vgs), the pullup resistance will vary inversely with Vcc.  Using a 5V supply, the pullup resistance will be close to the 70k shown in the schematic.  Using a 3.3V supply, the pullup resistance is close to 125k.  Although it is not obvious, this information can be found in section 18 of the datasheet, with the specifications for IUP5 and IUP3.  These numbers are the amount of current a grounded pin will source when the pullup is enabled.

The reset pin has an internal pulldown, which seems to be weak like the GPIO pullups.  At times when working with a CH552 running at 3V3, the chip reset when I inadvertently touched the RST pin with my finger.  This was easily solved by keeping the RST pin shorted to ground.

The last issue I encountered is more of a documentation issue than a quirk.  The maximum reliable clock speed of an IC is depended on the supply voltage.  All of the AVR MCUs I've worked with have a graph in the datasheet showing the voltage required to ensure safe operation at a given speed.  For the CH55x MCUs, there is a subtle difference in the electrical specs at section 18 of the datasheet.  At 5V, total supply current at 24MHz is specified, whereas the specs for 3.3V specify total operating current at 16Mhz.  When I tried running a CH552T at 24MHz with a 3.3V supply, it never worked.  The same part worked perfectly at 16MHz.

Despite the quirks, I think the CH55x MCUs are still a good value.  Current quantity 10 pricing at LCSC is 36c for the CH552T, and 26c for the CH551G.  I recently purchased a small tube of the CH552T, and have plans to test the touch, ADC, PWM, and SPI peripherals.

Tuesday, January 19, 2021

GD32E230: a better STM32F0?


On my last LCSC order, I bought a few GD32E230 chips, specifically the GD32E230K8T6.  I chose the LQFP parts since I have lots of QFP32 breakout boards that I've used for other QFP32 parts.  Gigadevice is much better than many other Chinese MCU manufacturers when it comes to providing English documents.  After my past endeavors trying to understand datasheets from WCH and CHK, going through the Gigadevice documentation was rather pleasant.

Although Gigadevice makes no mention of any STM32 compatibility, but the first clue is the matching pinouts of the STM32F030 and GD32E230.  To prepare for testing, I tinned the pads on a couple of breakout boards, applied some flux, and laid the chips on the pads.  I laid the modules on a cast-iron skillet, and heated it up to about 240C.  The solder reflowed well, however I noticed some browning of the white silkscreen.  Next time I'll limit the temperature to 220C.  After testing for continuity and fixing a solder bridge, I was ready to try SWD.  I connected 3.3V power and the SWD lines, and ran "pyocd cmd -v":

0000710:INFO:board:Target type is cortex_m
0000734:INFO:dap:DP IDR = 0x0bf11477 (v1 MINDP rev0)
0000759:INFO:ap:AHB5-AP#0 IDR = 0x04770025 (AHB5-AP var2 rev0)
0000799:INFO:rom_table:AHB5-AP#0 Class 0x1 ROM table #0 @ 0xe00ff000 (designer=4 3b part=4cb)
0000812:INFO:rom_table:[0]<e000e000:SCS-M23 class=9 designer=43b part=d20 devtyp e=00 archid=2a04 devid=0:0:0>
0000823:INFO:rom_table:[1]<e0001000:DWT class=9 designer=43b part=d20 devtype=00 archid=1a02 devid=0:0:0>
0000841:INFO:rom_table:[2]<e0002000:BPU class=9 designer=43b part=d20 devtype=00 archid=1a03 devid=0:0:0>
0000848:INFO:cortex_m_v8m:CPU core #0 is Cortex-M23 r1p0
0000859:INFO:dwt:2 hardware watchpoints
0000866:INFO:fpb:4 hardware breakpoints, 0 literal comparators

I did little probing around the chip memory.  The GD32E23x user manual shows SRAM at 0x20000000, like STM32 parts.  The contents looked like random values, which I could overwrite using the pyocd "ww' command.  Writing to 0x20002000 resulted in a memory fault, indicating the part does not have any "bonus" RAM beyond 8kB.

Next, I tried using the built-in serial bootloader.  After connecting BOOT0 to VDD and connecting power, PA9 and PA10 were pulled high, indicative of the UART being activated.  However my first attempt at using stm32flash was not successful:

After attaching my oscilloscope, and writing a small bootloader protocol test program, I was able to determine that the responses did seem to conform to the STM32 bootloader protocol.  I did notice that the baud rate from the GD32E230 was only 110kbps, so it wasn't perfectly matching the 115.2kbps speed of the 0x7F byte sent for baud rate detection.  To avoid the potential for data corruption, I switched to 57.6kbps.  Before resorting to debugging the source for stm32flash, my test of stm32loader gave better results:
$ stm32loader -V -p com39
Open port com39, baud 115200
Activating bootloader (select UART)
*** Command: Get
    Bootloader version: 0x10
    Available commands: 0x0, 0x2, 0x11, 0x21, 0x31, 0x43, 0x63, 0x73, 0x82, 0x92, 0x6
Bootloader version: 0x10
*** Command: Get ID
Chip id: 0x440 (STM32F030x8)
Supply -f [family] to see flash size and device UID, e.g: -f F1

Next, I was ready to try flashing a basic program.  I first checked for GD32E support in libopencm3.  No luck.  Then as I read through the user manual, I noticed GPIOA starts at 0x4800 0000 on AHB2, the same as STM32F0 devices.  The register names didn't match the STM32, but the function and offsets were the same.  For example on the GD32E, the register to clear individual GPIOA bits is called GPIOA_BC, rather than GPIOA_BRR as it is called on the STM32.  The clock control registers, called RCU on the GD32E, also matched the STM32 RCC registers.  Since it was looking STM32F0 compatible, I tried flashing my blink example with stm32loader, and it worked!

The LED was flashing faster than it did with the STM32F030.  A little searching revealed that the ARM Cortex-M23, like the M0+, has a 2-stage pipeline.  The STM32F030 with it's M0 core has a 3-stage pipeline.  My delay busy loop needs to be four cycles per iteration, and on the M23, the bne instruction only takes two cycles.  My solution is adding a nop instruction based on an optional compile flag.

One problem I have yet to resolve with the GD32E is support for the bootloader Go/0x21 command.  With the STM32F0, I left BOOT0 high, and used DTR to toggle nRST before uploading new code.  The stm32flash "-g 0" option made the target run the uploaded code after flashing was complete.  I went back to debugging stm32flash, and discovered that it is hard-coded to use the "Get Version"/0x01 command, and silently fails if the bootloader responds with a NAK.  After a few mods to the source, I was able to build a version that works with the GD32E230, however the Go command still doesn't work.  Perhaps a task for a later date will be to hook up a debug probe to see what the E230 is doing when it gets the Go command.

Overall, I'm quite happy with the GD32E230K8T6.  They cost less than half the equivalent STM32 parts, and are even cheaper than other Chinese STM32 clones I've seen.  They are lower power and their maximum clock speed is 50% faster than the STM32F0.  In addition to the shorter 2-stage pipeline, the GD32E devices support single-cycle IO, making them faster for bit-banged communications than the STM32F0 which takes 2 cycles to write to a GPIO pin.  The GD32E230 also has some new features, which might be worth discussing in a future blog post.

Saturday, January 2, 2021

Trying to test a "ten cent" tiny ARM-M0 MCU part 2

After my first look at the HK32F030MF4P6, I wondered if the HK part, unlike the STM32F030 it is modeled after, does not have 5V tolerant IO.  I changed the solder jumpers to 3V3 on the CH552 module I'm using as a CMSIS-DAP adapter, which caused it to stop working.  This was because the CH552 requires a 5V supply in order to run reliably at 24Mhz.  After re-flashing the CMSIS-DAP firmware set to run at 16MHz, the module worked, and I was finally able to talk to the HK MCU via SWD.

In the screen shot above, I chose the stm32f051 target because pyocd does not have the HK MCU nor the STM32F030 among it's builtin targets.  For basic SWD communications, the target option is not even necessary.  With the target specified, it's possible to specify peripheral registers by name, rather than having to specify a memory address to read or write.

In the screen shot above, I'm using the "connect_mode" option to bring the nRST line low on the target device when entering debug mode.  Usually this is not necessary for SWD, however some of the probing I did would cause the MCU to crash.  This required a power cycle or reset to restore communications via SWD.

The first tests I did with the HK MCU were to probe the flash and RAM.  The HK datasheet shows the flash at address 0.  In the STM32F0, the flash is at address 0x8000000, and is mapped to address 0 when the boot0 pin is low.  Although the HK MCU doesn't have a boot0 pin, data at address 0x8000000 is mirrored at address 0 as well.  What was most unusal about the HK MCU is that the flash was not erased to all 0xFF as is typical with other flash-based MCUs.  Most of the flash contents was zeros, except for some data at address 0x400, which was the same on the 2 MCUs I checked:

By writing to memory starting at 0x20000000 using the 'ww' command, I discovered that the MCUs I received have 4kB or RAM, rather than the 2kB specified in the datasheet.  Writing to 0x20001000 (beyond 4kB) results in a crash.

For writing and erasing the flash, I initially tried using the pyOCD 'erase' and 'flash' commands.  Since the MCU flash interface is not part of Cortex-M specification, the flash interface peripheral will vary from one MCU vendor to the next.  The flash interface on the STM32F051 is almost identical to the flash interface on the STM32F030, however the 'erase' and 'flash' commands caused the HK MCU to crash when I ran them.  Testing on a genuine STM32F030 crashed as well, and after some debugging and reading through the pyOCD code, I realized the STM32F051 flash routines need 8kB of RAM.  Even after downloading and installing the STM32F0 device pack, I could not erase or flash the HK MCU.

Next I reviewed the STM32F030 programming manual, and tried to access the flash peripheral registers directly.  This was when I found a pyOCD bug with the wreg command.  I was able to unlock the flash by writing the magic sequence of 0x45670123 followed by 0xCDEF89AB to flash.keyr.  I tried erasing the first page at address 0, and although and updated as expected, the memory contents did not change.  What did work was erasing the page at address 0x8000000, which cleared the contents at address 0 as well.  I still find it strange that the erase operation sets all bits to 0 instead of 1.  The HK datasheet says a flash page is 128 bytes, and erasing a page resulted in 128 bytes set to all zero.

I was only partially successful in writing data to the flash.  Writing to 0x8000000 did not work, however writing a 16-bits to address 0 using the 'wh' command was successful.  Trying to write 16-bits to address 2 updated the and as expected, but did not change the data.  Writing to any 4-byte aligned address in the erased page worked, but writing to addresses that were only 2-byte aligned left all 16 bits at zero.  I tried writing bytes with 'wb' and full words with 'ww', both of which crashed the MCU, likely from a hard fault interrrupt.  I even made sure there isn't a bug with the 'wh' command by writing 16-bits at a time to RAM.

While searching the CHK website for more documentation, I found a page with IAR device packs.  Although pyOCD uses Kiel device packs, I downloaded the HK32F0 pack, which is a self-extracting RAR file, which saves the uncompressed files in AppData\Local\Temp\RarSFX0.

Since .pack files are just zip files with a different extension, I zipped the files back up as a .pack file.  However pyOCD couldn't read it: "0000731:CRITICAL:__main__:CMSIS-Pack './HK32F0.pack' is missing a .pdsc file".  Manually examining the files confirmed some of my earlier discoveries, such as flash at address 0x8000000, remapped to address zero.  I found a file named HK32F030M.svd, which contains XML definitions of the peripheral registers.  pyOCD's builtin devices appear to use svd files, so it may be possible to add HKD32F0 support to pyOCD.

Copies of the IAR support pack, datasheet, and pyocd page erase sequence can be found in my github repository.