One of my gripes about the Arduino AVR core is that it is not an example of efficient embedded programming. One of the foundations of C++ is zero-overhead abstractions, yet the Arduino core has a very significant overhead. The Arduino basic blink example compiles to almost 1 kB, with most of that space taken up by code that is never used. Rewriting the AVR core is a task I'm not ready to tackle, but after writing picoCore, I realized I could use many of the same optimization techniques in an Arduino library. The result is ArduinoShrink, a library that can dramatically reduce the compiled size of Arduino projects. In this post I'll explain some of the techniques I used to achieve the coding trifecta of faster, better, and smaller.
The Arduino core is actually a static library that is linked with the project code. As Eli explains in this post on static linking, libraries like libc usually have only one function per .o file in order to avoid linking in unnecessary code. The Arduino core doesn't use that kind of modular approach; however, by making use of gcc's "-ffunction-sections" option, it does mitigate the code bloat that the non-modular approach would otherwise cause.
With ArduinoShrink, I wrote more modular, self-contained code. For example, the Arduino delay() function calls micros(), which relies on the 32-bit timer0 interrupt overflow counter. I simplified the delay function so that it only needs the 8-bit timer value. If the user code never calls micros() or millis(), the timer0 ISR code never gets linked in. By using a more efficient algorithm and writing the code in AVR assembler, I reduced the size of the delay function to 12 instructions taking 24 bytes of flash.
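To illustrate the idea of delaying with nothing but the 8-bit timer value, here is a minimal host-side C++ sketch. It is not the ArduinoShrink assembler routine, and the algorithm is only an assumption about how such a delay could work. It assumes timer0 ticks every 4 us (16 MHz clock with a prescaler of 64), so 250 ticks make one millisecond. The names read_timer(), fake_tcnt0, fake_reads, and delay_ms() are hypothetical, and the fake timer simply advances one tick per read so the code can run on a PC.

```cpp
#include <cassert>  // only needed for host-side checks
#include <cstdint>

// Hypothetical stand-in for the 8-bit TCNT0 register.
// Each read advances the simulated timer by one tick.
uint8_t fake_tcnt0 = 0;
uint32_t fake_reads = 0;

uint8_t read_timer() {
    ++fake_reads;
    return fake_tcnt0++;
}

// Assumption: 16 MHz / 64 prescaler = 4 us per tick, so 250 ticks = 1 ms.
constexpr uint8_t TICKS_PER_MS = 250;

// Busy-wait for `ms` milliseconds using only the 8-bit timer value.
// No overflow counter (and therefore no timer ISR) is required.
void delay_ms(uint16_t ms) {
    uint8_t start = read_timer();
    while (ms > 0) {
        // Unsigned 8-bit subtraction gives the elapsed ticks even when
        // the timer has wrapped from 255 back to 0.
        if (static_cast<uint8_t>(read_timer() - start) >= TICKS_PER_MS) {
            start = static_cast<uint8_t>(start + TICKS_PER_MS);
            --ms;
        }
    }
}
```

A loop like this only works if the timer is polled at least once in the 6-tick window between 250 and 256 elapsed ticks, which is easy for a tight polling loop but could be violated by a long-running interrupt.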
In order to minimize code size and maximize speed, almost half of the code is in AVR assembler. Despite improvements in compiler optimization techniques over the past decades, on architectures like the AVR I can almost always write better assembler code than what the compiler generates. That's especially true for interrupt service routines, such as the timer0 interrupt used to maintain the counters for millis() and micros(). My assembler version of the interrupt uses only 56 bytes of flash, and is faster than the Arduino ISR written in C.
One part that is still written in C is the digitalWrite() function. The Arduino core uses a set of tables in flash to map a given pin number to an IO port and bit, making for a lot of code to have digitalWrite(13, LOW) clear PORTB5. Making use of Bill's discovery that these flash memory table lookups can be resolved at compile time, digitalWrite(13, LOW) compiles to a single instruction: "cbi PORTB, 5".
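As a rough illustration of resolving the pin mapping at compile time, the following host-side C++ sketch maps an Arduino Uno pin number to a port and bit with a constexpr function. It is not the ArduinoShrink source: it uses a template parameter to force the pin to be a compile-time constant, whereas the library keeps the normal digitalWrite(pin, value) call and relies on the optimizer. The names fake_ports, PinMap, pin_map(), and digital_write() are hypothetical, and the array stands in for the real PORTB, PORTC, and PORTD registers.

```cpp
#include <cassert>  // only needed for host-side checks
#include <cstdint>

// Simulated I/O registers so the sketch runs on a PC.
constexpr uint8_t PORT_D_INDEX = 0;
constexpr uint8_t PORT_B_INDEX = 1;
constexpr uint8_t PORT_C_INDEX = 2;
volatile uint8_t fake_ports[3] = {0, 0, 0};

struct PinMap {
    uint8_t port;  // index into fake_ports
    uint8_t bit;   // bit position within that port
};

// Arduino Uno layout: pins 0-7 on PORTD, 8-13 on PORTB, 14-19 on PORTC.
constexpr PinMap pin_map(uint8_t pin) {
    return pin < 8  ? PinMap{PORT_D_INDEX, pin}
         : pin < 14 ? PinMap{PORT_B_INDEX, static_cast<uint8_t>(pin - 8)}
         :            PinMap{PORT_C_INDEX, static_cast<uint8_t>(pin - 14)};
}

// The pin is a template parameter, so the port and mask are constants
// by the time the function body is compiled; no table lookup remains.
template <uint8_t Pin>
inline void digital_write(bool high) {
    static_assert(Pin < 20, "pin out of range for this sketch");
    constexpr PinMap m = pin_map(Pin);
    constexpr uint8_t mask = static_cast<uint8_t>(1u << m.bit);
    if (high) {
        fake_ports[m.port] = static_cast<uint8_t>(fake_ports[m.port] | mask);
    } else {
        fake_ports[m.port] = static_cast<uint8_t>(fake_ports[m.port] & ~mask);
    }
}
```

With this sketch, digital_write<13>(false) clears bit 5 of the simulated PORTB, which mirrors the pin 13 to PORTB5 mapping described above.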
ArduinoShrink is also designed to significantly reduce interrupt latency. The original timer0 interrupt takes around 5us to run, during which time any other interrupts are delayed. The first instruction in my ISR is 'sei', which allows other interrupts to run, reducing the latency impact to a few cycles more than the hardware minimum. The official Arduino core disables interrupts in several places, such as when reading the millis counter. My solution is to detect if the millis counter has been updated and re-read it, thereby avoiding any interrupt latency impact.
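The re-read approach can be sketched in a few lines of C++. This is an assumption about the general technique rather than the ArduinoShrink code, and the names timer_millis and read_millis_nolock() are hypothetical. The counter is read twice, and the read is repeated if the two values differ, which means an interrupt updated the counter part-way through.

```cpp
#include <cassert>  // only needed for host-side checks
#include <cstdint>

// Hypothetical stand-in for the 32-bit counter the timer0 ISR increments.
// On an 8-bit AVR a read of this variable takes several instructions,
// so the ISR can fire in the middle of it.
volatile uint32_t timer_millis = 0;

// Read the counter without disabling interrupts.
// If an update lands between (or during) the two reads, they disagree
// and the loop simply reads again.
uint32_t read_millis_nolock() {
    uint32_t first;
    uint32_t second;
    do {
        first = timer_millis;
        second = timer_millis;
    } while (first != second);
    return second;
}
```

Because the counter changes only about once per millisecond, the retry path should be taken rarely, and interrupts are never masked while reading.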
The only limitation compared to the official AVR core is that the compiler must be able to resolve the pin number for the digital IO functions at compile time. Although the pin may be hard-coded, even with LTO enabled, avr-gcc is not always able to recognize that the pin is a compile-time constant. Since AVR is not a priority target for GCC optimizations, I can't rely on compiler improvements to resolve this limitation. Therefore I plan to write a version of digitalWrite that is much smaller and faster, even when avr-gcc can't figure out the pin at compile time.
Although ArduinoShrink should be compatible with any Arduino sketch, given some of the compiler tricks I've used, it's quite possible I've missed a potential error. If you do find what you think is a bug, open an issue in the GitHub repository.