cpp and gcc -Os/-O3 abuse for driving pins

by Tibor Palinkas
avr [@] igor2.repo.hu

1. rationale

As most engineering activities, low level programming is known to have its own interesting aspects regarding to optimizing choices between contradictional requirements:

the source code should be reusable (as a module or library);
the API should be convenient and simple;
at the end machine language code should be optimal.

Modern compilers are advanced enough that the natural answer for requirement #3 is not "just write it in assembly". However, programming 8-bit microcontrollers poses further challenges by forcing the programmer to consider other aspects of optimal besides runs fast. Most common additional limitations are size of flash and size of RAM. Typical values for a Harvard-architecture low range 8 bit device are 1024..4096 words of code memory and 100..300 bytes of RAM.

This article describes techniques to handle the above requirements while programming atmel's AVR controllers using avr-gcc and the C language with arbitrary GNU extensions. Combining the above list of requirements, the article will demonstrate two aspects:

how to design C code that is reusable on different hardware setups - while keeping code size low;
how to implement a convenient API - while keeping code size low.

The key ideas are to abuse the precompiler and carefully rely on the optimization capabilities of gcc, to get as much as possible done in compile-time.

2. I/O ports

This section describes how the I/O ports work on AVRs. These ideas are common and are similar or the same for many other microcontroller families.

"Port" usually refers to a parallel digital I/O device. Parallel means a single port device may have multiple pins attached (8 pins at most, on a 8 bit controller). Pins can be controlled independently. If a pin is an input, it goes in high-impedance mode and reports if the external circuit drives the pin low or high (voltage is nearer to zero or power supply voltage). The result is 0 or 1 in a specific bit in a specific byte of the device memory. When the same pin is an output, the controller will drive the pin to low or high depending on how the application sets the same bit in the same device memory byte.

Device memory bytes have symbolic names. For example in AVR or microchip's PIC cores, PORTA means the above described pin status byte of the port device called 'A'. A single device may have multiple port devices labelled with single uppercase letters.

The number of physical pins of the microcontroller is limited and a given pin is shared between multiple functionality from which only one functionality is the port device. For example the controller may have an A/D converter which shares pins with port A. The programmer needs to configure other device memory bits to select which functionality should be attached to the physical pin. Sometimes not all eight bits of a port device are connected to physical pins. For example attiny2313 has only the first 3 bits of PORTA accessible.

For an actual design, this implies further limitations. There are a specific amount of physical pins, the device has specific hardware modules implemented with their possible pin mapping. The programmer needs to map pins in a way that each shared pin has exactly one hardware module connected while all I/O requirements are fulfilled. For example the design may require 4 A/D channels and 14 digital outputs for a display on a device that has all 8 bits of PORTA shared with 8 A/D channels, 6 bits of PORTB and 6 bits of PORTC. In this case the even if all bits of PORTB and PORTC are allocated for digital outputs, 2 pins of PORTA still needs to be used. Furthermore if the communication to the display device is parallel and there is a 8 bit data/address bus and the remaining 6 bits are control bits, it is still impossible to allocate a single 8 bit wide byte in the device control memory to send data. Instead data needs to be split in at least two and different regions of the bits should be copied in different PORT bytes. The situation is even worse considering that different models of the same microcontroller family will have different split-up of port bits and pin sharing rules. On top of this, circuit design often requires the programmer to use specific pins for specific bits even if that combination is a total mixup (i.e. PORTA 0:3 are 4 bits of the 8-bit data, but not wired in 0-1-2-3 order but 3-1-0-2).

Conclusion: a reusable implementation may not depend on availability of specific port pins, availability of continuous blocks of pins or even order of such pins. Instead it must have a per bit PORT:bit mapping

3. optimizing code for per bit mapping

Section 2 reveals that one of the most basic challenges is to manage random bit operations in a way that at the end the code will be optimal. For example let's consider a device that requires 6 output pins. For the simplest examples no reusable code is required, we just assume that those 6 pins can be mapped at PORTA:0...PORTA:5. Furthermore let's say the external device needs to put in initial state by setting the first 3 bits and clearing the second 3 bits. The most naive implementation would look something like this:

### lst bits-1

The resulting code will contain six instructions, one per bit operation. This is not because gcc wouldn't be able to realize that the three bit sets and three bit clears can be translated to a single OR and a single AND. It is because PORTA is declared as volatile, since writing the port has side effects (changing actual state of physical pins). Thus, gcc realizes any change of bit values must take place immediately. This may be useful if the user wanted to generate patters - for example the above code would generate the following pattern, assuming PORTA was an output with uninitialized value (x) before test():

where 7 6 5 4 3 2 1 0
before test() x x x x x x x x
1st set x x x x x x 1 x
2nd set x x x x x 1 1 x
3rd set x x x x 1 1 1 x
1st clear x x x 0 1 1 1 x
2nd clear x x 0 0 1 1 1 x
3rd clear x 0 0 0 1 1 1 x

where	7	6	5	4	3	2	1	0
before test()	x	x	x	x	x	x	x	x
1st set	x	x	x	x	x	x	1	x
2nd set	x	x	x	x	x	1	1	x
3rd set	x	x	x	x	1	1	1	x
1st clear	x	x	x	0	1	1	1	x
2nd clear	x	x	0	0	1	1	1	x
3rd clear	x	0	0	0	1	1	1	x

However, this is not what the initialization requirement stated. We should have something like this:

where 7 6 5 4 3 2 1 0
before test() x x x x x x x x
1st set x x x x x x x x
2nd set x x x x x x x x
3rd set x x x x x x x x
1st clear x x x x x x x x
2nd clear x x x x x x x x
3rd clear x x x x x x x x
after test() x 0 0 0 1 1 1 x
which means the initial sequence is set at once. The following code will achieve that:

where	7	6	5	4	3	2	1	0
before test()	x	x	x	x	x	x	x	x
1st set	x	x	x	x	x	x	x	x
2nd set	x	x	x	x	x	x	x	x
3rd set	x	x	x	x	x	x	x	x
1st clear	x	x	x	x	x	x	x	x
2nd clear	x	x	x	x	x	x	x	x
3rd clear	x	x	x	x	x	x	x	x
after test()	x	0	0	0	1	1	1	x

### lst bits-2

To overcome the volatile problem, the port is read and written once and all the arithmetics are done in between, on an internal temp variable. The resulting assembly code is already the shortest possible for this model, 4 instructions, from which 2 are read/write (in/out) between the actual port and the temp variable (which is became an internal register, r24). Since we need to leave bit 0 and 7 alone (they may be output pins for some other device), it is not possible to do the arithmetics in a single instruction.

At this point the code can be rewritten into a more generic form where the caller describes bit assignment.

### lst bits-3

Unfortunately gcc won't be able to do the same optimization, the code needs to be totally dynamic and work for any combination. The resulting code is not short at all. (NOTE: at this point the source may be simplified merging all the sets into one expression and all the clears in another expression. The simplification will not be done, for clarity: a later section of the document will depend on having these expressions separate; thus these expressions will be kept them separate throughout the whole article.) However, programming microcontrollers allows us to depend on some special restrictions: such a device once connected to 6 pins, pin assignment will never change, thus those arguments to test are not really arguments but constants know compile time:

### lst bits-4

The above code still results in the generic implementation. The compiler had the chance to realize the function is called with constants from this object, but there is no guarantee other objects will not call the function and those call arguments can not be verified compile-time, at least not while compiling this object. To overcome this problem, the function may be marked as static which will tell the compiler it will be called only from within the current object. This, combined with obvious constant-only arguments will result in the old 4-instruction output, test() is inlined in main():

### lst bits-5

This is true only as long as all arguments are locally known constants, or in other words, compile-time known constants (and this does not include any information that would be known only during linking). It is also possible to combine constant and non-constant arguments:

### lst bits-6

Function test() is inlined in fun(), the same 4-instruction base is still there for handling 5 of the bits (only the constant value for andi changed) with many instructions added for handling the only one dynamic bit. Looking at the code, it is obvious that we have constant calls only, but as in the previous example, as long as fun() is not static, the compiler needs to assume random external calls.

If fun() is changed to a static, the problem is gone, as there's no external call possible. The next version of this simple example demonstrates how both static functions may be copied in a header file.

### lst bits-7

Another interesting aspect is how the function got inlined twice in main(). This is because total length of the code is still lower this way than with call overhead. Copying the loop a few more times would result in having a fun() function (with test() inlined in it).

4. generic device instances with macros

It is possible to make the above example to be a bit more general, allowing multiple devices connected to the same controller, on different ports. This version will still require that all bits of a device needs to reside in a single PORT byte.

### lst bits-macro-1

Next requirement is that some devices may work inverted. For a generic implementation it makes sense to pass this information as an argument, sort of configuration and decide what bits to set and clear depending on the argument:

### lst bits-macro-2

Unfortunately, gcc did not realize inverted is a constant and could be optimized, the function inlined with only one branch of the if present. This raises a theoretical question: examples of the previous section evolved in a way that assumed how exactly the compiler will be able to optimize; how far optimization can be relied on?

A possible answer is that the programmer shall not rely on optimization at all. There are different ways to solve the above problem; a trivial solution is to simply write two static functions with different names:

### lst bits-macro-3

Unfortunately this variant duplicates the common parts as well, including function declaration and port operation. In a real application common parts would be much longer. To avoid code duplication, the precompiler can be used. It is possible to copy the whole function in a separate header file which is included twice, with different #defines before each include. This would allow the header to use #if or #ifdef while deciding about inverting and the name of the function could be an externally supplied macro as well.

Returning to the original question, how much one may rely on the capabilities of gcc for optimizing code, and the obvious answer would be 'not much'.

There are some safe bets, tho. For example:

				if (1)
					br1;
				else
					br2;

Since 1 is a hardwired constant, we can safely assume the compiler will use br1 only. Using function-like macros, passing in the condition as a macro argument results in exactly the above code after preprocessing.

### lst bits-macro-4

This approach introduces a conceptual shift. In previous examples, a generic function tried to serve all different instances of a device. This, unfortunately, requires more dynamic code that evaluates arguments run-time, resulting in slower, and most often larger code. Instead of that, a separate static function can be generated for each device instance. Since devices are usually sort of hardwired (pins or attributes like inverting will not change runtime), this approach reduces how much one needs to rely on the optimizer, passing any static data in macro arguments. The resulting per device static functions may be inlined, or if there are enough calls, may be kept as separate function by the compiler. The internal code (bit operations in our case) in either case will be as optimal as possible, because every parameter that may be constant will be copied there as literal numbers by the precompiler:

### lst bits-macro-5

Demonstrating the new concept, this example requires all 'hard-wired' parameters to be provided at the time the macro is creating a new device function. Thus, the new function code will contain hardcoded numbers and will be optimized. This concept also makes calling code simpler: if there are multiple calls, wiring information doesn't need to be repeated.

5. named configuration items

A drawback of the macro hacking shown in the last example of section 4 is at calling the macro - it requires the programmer to remember the order of 8..10 arguments, and which is even worse, makes the code harder to read. It would be preferable to construct the macro call in a way that arguments are named, as in the following example (not C syntax):

				create_test(
					name = test_inverted,
					inverted = 1,
					port = porta,
					bits = {1, 2, 3, 4, 5, 6}
				);

This syntax is very similar to how C99 allows initializing fields of a structure using designators:

				struct str {
					int i1, i2;
				} ;

				int main()
				{
					struct str a = {.i1 = 1, .i2 = 2};
				}

A macro could use this feature to set up a struct instance as a local variable inside a newly created function. Such a constant local variable will be treated like inlined literals when optimizing the code (Warning: this is an assumption about the optimizer - true for gcc).

### lst bits-macro-6

As demonstrated on the compiled version, the structure does not show up in the memory at all. Of course bits should be an array - even if index 0 is unused, the same short code is produced:

### lst bits-macro-7

It would be more convenient to describe bit values in a single line, like .bit = {0, 1, 2, 3, 4, 5, 6} - unfortunately it is not possible because of the way macros are processed - those commas would be treated as macro argument separators. A simple way to overcome this problem is using a string instead of an array of characters:

### lst bits-macro-8

This method works only if it is easy to write all possible values of a setting in a single character in textual form. For bits, it is true as the value is between 0 and 7. This goes wrong if the user provided short string:

### lst bits-macro-9

The code does not become constant-only because of reading beyond the allocated string, thus gcc generates real memory access and stores the string in memory. Besides the increase in code size, the real danger is that the code indeed reads beyond the initialized part of the array. To protect against this, the length of the string should be checked compile time:

### lst bits-macro-A

The generated code is still long, but it already contains a call to ERROR__create_test_requires_exactly_6_pins_in_bit(), which is a function that does not exist in the project. In turn, this will cause a link error. It is also obvious that strlen() took place compile time - in other words, the compiler is clever enough to count the number of characters in a constant string and substitute the value instead of really calling strlen(). Therefore the call is generated only if the length of the string was wrong.

Calling a non-existing function is a very crude way of indicating error. Unfortunately #error or #warning will not work in a macro; using abort() or assert() would not work on a microcontroller, or even if it worked, it would not be the preferred. Indicating an error which is known compile time only during execution is especially expensive on microcontrollers.

Finally, as configuration grows, it is more and more convenient to have default values and configure only those items which differ from default.

### lst bits-macro-B

GCC allows the same designator to appear multiple times while initializing a struct, and takes the last one as the final value. There is a restriction about side effects for repeated fields, but in this application that can be safely ignored as the whole point is to use constants. Conclusion: it is possible to use named configuration items instead of depending on order of macro arguments when instantiating a new driver function. However, the technique demonstrated in this section relies on the following features of the C compiler:

support of the C99 standard (for designators)
when optimizing, treating constant local variables the same way as inline literals
for string range checking, the optimizer should know what strlen() would do and substitute the length of constant strings compile time

cpp and gcc -Os/-O3 abuse for driving pins

1. rationale

2. I/O ports

3. optimizing code for per bit mapping

4. generic device instances with macros

5. named configuration items

6. free allocation of ports

7. full example

8. donwload, compiling

9. copyright