Saturday, April 19, 2014

gcc link time optimization can fix bad programming practices

I like to write efficient code, but realize that many people relatively new to C programming don't understand some of the simple ways to write fast and efficient code.  Even some popular sites don't demonstrate best practices in their code, such as the blinking LED Arduino tutorial.  I've written a generic AVR C version:
#include <avr/io.h>
#include <util/delay.h>

int LEDPIN = 1;
void main(void)
{
  while (1) {
    PINB |= (1<<LEDPIN);
    _delay_ms(500);
  }
}

I build it with avr-gcc 4.8.0:
avr-gcc -mmcu=attiny85 -DF_CPU=8000000 -Os -Wall -Wno-main    blink.c   -o blink

Then I check the size:
$ avr-size blink
  text    data     bss     dec     hex filename
   122       2       0     120      7e blink

To show what the problem is here's part of the disassembled code (avr-objdump -D blink):
0000002a <__do_copy_data>:
  2a:   10 e0           ldi     r17, 0x00       ; 0
  2c:   a0 e6           ldi     r26, 0x60       ; 96
  2e:   b0 e0           ldi     r27, 0x00       ; 0
  30:   e4 e7           ldi     r30, 0x74       ; 116
  32:   f0 e0           ldi     r31, 0x00       ; 0
  34:   02 c0           rjmp    .+4             ; 0x3a <__do_copy_data+0x10>
  36:   05 90           lpm     r0, Z+
  38:   0d 92           st      X+, r0
  3a:   a2 36           cpi     r26, 0x62       ; 98
  3c:   b1 07           cpc     r27, r17
  3e:   d9 f7           brne    .-10            ; 0x36 <__do_copy_data+0xc>
  40:   02 d0           rcall   .+4             ; 0x46 <main>
  42:   16 c0           rjmp    .+44            ; 0x70 <_exit>

00000046 <main>:
  46:   21 e0           ldi     r18, 0x01       ; 1
  48:   30 e0           ldi     r19, 0x00       ; 0
  4a:   46 b3           in      r20, 0x16       ; 22
  4c:   c9 01           movw    r24, r18
  4e:   00 90 60 00     lds     r0, 0x0060
  52:   02 c0           rjmp    .+4             ; 0x58 <main+0x12>
  54:   88 0f           add     r24, r24
  56:   99 1f           adc     r25, r25
  58:   0a 94           dec     r0
  5a:   e2 f7           brpl    .-8             ; 0x54 <main+0xe>
  5c:   48 2b           or      r20, r24
  5e:   46 bb           out     0x16, r20       ; 22
  60:   4f ef           ldi     r20, 0xFF       ; 255
  62:   84 e3           ldi     r24, 0x34       ; 52
  64:   9c e0           ldi     r25, 0x0C       ; 12
  66:   41 50           subi    r20, 0x01       ; 1
  68:   80 40           sbci    r24, 0x00       ; 0
  6a:   90 40           sbci    r25, 0x00       ; 0
  6c:   e1 f7           brne    .-8             ; 0x66 <main+0x20>
  6e:   00 c0           rjmp    .+0             ; 0x70 <main+0x2a>
  70:   00 00           nop
  72:   eb cf           rjmp    .-42            ; 0x4a <main+0x4>

All the code in __do_copy_data is to copy any global variables (in this case LEDPIN) from flash to RAM.  The code from address 46 to 5e is to set the bit in PINB, based on the value of LEDPIN, which is stored at address 0x0060 in RAM.  This could be done with a single sbi (set bit) instruction, but the compiler doesn't do that, because the code from blink.c could be linked with other code that changes the global variable LEDPIN.  This is even when the output is a linked elf file like blink, because other object files can still be added to it.

By making the LEDPIN variable const, the compiler should know the bit to set will always be bit 1.  As expected, with that change, it generates a single sbi instruction, instead of the 12 instructions above:
  46:   b1 9a           sbi     0x16, 1 ; 22

However it still copies the value of LEDPIN to RAM with the __do_copy_data code.  This is because some other code that gets linked in may use the global variable LEDPIN.  By making the variable static, the compiler will know it is not used outside the current file.  So after changing the code as follows:
static const int LEDPIN = 1;
The code is much smaller:
   text    data     bss     dec     hex filename
     74       0       0      74      4a blink-static-const

But we can't easily make every new C programmer a really good programmer, so that's why having a compiler that can figure out that for the blink application LEDPIN is only defined once, is only used in the main function and nowhere else.  That's one of the things that gcc's link time optimization (LTO) is can do.  After adding the -flto compiler flag, the original version now compiles to 74 bytes - the same as when defining LEDPIN as static const:
   text    data     bss     dec     hex filename
     74       0       0      74      4a blink-flto

A code-generation bug was introduced in avr-gcc 4.8.0, so I suggest using 4.7.3 or waiting for 4.8.3 which will contain a fix.

6 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. What if you changed it to '#define LEDPIN 1'?

    ReplyDelete
    Replies
    1. Using a #define compiles to the same size as static const int.

      Delete
  3. This comment has been removed by the author.

    ReplyDelete
  4. Shouldn't the line: PINB |= (1<<LEDPIN); read PINB ^= (1<<LEDPIN); This would use the exclusive or operator rather than the inclusive or operator, so that the LED flashes rather than just turns on?

    ReplyDelete
    Replies
    1. If I was using PORTB, yes. Writing a 1 to a bit in the PINB register toggles the state of the corresponding PORTB bit.

      Delete