A forum discussion on Stellarisiti raised the question of how to achieve short delays on a Cortex-M microcontroller: specifically, delays on the order of a few cycles, where the overhead of calling a vendor-supplied library routine exceeds the desired delay. The difficulty arises from an earlier observation that ARM documents the NOP instruction as being usable only for alignment, and makes no promises about how it impacts execution time. In fact, ARM specifies that its use may decrease execution time, miraculous though that might be.
I felt the lines of argument lacked evidence, and accepted a challenge to investigate. This post covers the details of the experiment and its result; the forum discussion provides additional information including an explanation of “hint instruction”, the effect of “architected hint”, and why the particular alternative delay instructions were selected.
The experiment I proposed was the following:
- Timing will be performed by reading the cycle count register, executing an instruction sequence, then reading the cycle counter. The observation will be the difference between the two counter reads.
- The sequence will consist of zero or one context instructions followed by zero or more (at most seven) delay instructions.
- The only context instruction tested will be a bit-band write of 1 to SYSCTL->RCGCGPIO enabling a GPIO module that had not been enabled prior to the sequence.
- The two candidate delay instructions will be NOP and MOV R8,R8.
- Evaluation will be performed on an EK-TM4C123GXL experimenter board using gcc-arm-none-eabi-4_8-2013q4 with the following flags: -Wall -Wno-main -Werror -std=c99 -ggdb -Os -ffunction-sections -fdata-sections -mthumb -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 -mfloat-abi=softfp
- The implementation will be in C using BSPACM, with the generated assembly code inspected to ensure the sequences as defined above are what has been tested
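The procedure in the list above can be sketched roughly as follows. This is not the code from the gist: the function names are made up for illustration, the counter is assumed to be the DWT cycle counter (DWT_CYCCNT, normally at 0xE0001004) and to be already enabled, and the bit-band arithmetic is the standard Cortex-M mapping of the peripheral region at 0x40000000 onto the alias region at 0x42000000. The registers are passed in as pointers so the same code also compiles on a host, where it only demonstrates the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Cortex-M peripheral bit-band region and its alias region. */
#define PERIPH_BASE_ADDR     0x40000000u
#define PERIPH_BITBAND_ALIAS 0x42000000u

/* Address of the alias word that maps to one bit of a peripheral register. */
static inline uint32_t bitband_alias(uint32_t reg_addr, unsigned bit)
{
    assert(reg_addr >= PERIPH_BASE_ADDR && reg_addr < PERIPH_BASE_ADDR + 0x100000u);
    assert(bit < 32u);
    return PERIPH_BITBAND_ALIAS + (reg_addr - PERIPH_BASE_ADDR) * 32u + bit * 4u;
}

/* Difference of two reads of a free-running 32-bit counter.
 * Unsigned subtraction gives the right answer across a wrap. */
static inline uint32_t cycles_between(uint32_t t0, uint32_t t1)
{
    return t1 - t0;
}

/* One timed sequence: a context store followed by two NOP delay
 * instructions (hypothetical helper, not the gist's macros). */
static inline uint32_t time_store_then_two_nops(const volatile uint32_t *cyccnt,
                                                volatile uint32_t *context_reg)
{
    uint32_t t0 = *cyccnt;          /* first counter read */
    *context_reg = 1;               /* context instruction */
    __asm__ volatile ("nop");       /* delay instruction 1 */
    __asm__ volatile ("nop");       /* delay instruction 2 */
    uint32_t t1 = *cyccnt;          /* second counter read */
    return cycles_between(t0, t1);
}
```

On a target, `cyccnt` would point at the cycle count register and `context_reg` at the address returned by `bitband_alias()` for the chosen enable bit. The compiler is still free to place unrelated instructions between the two counter reads, which is why the generated assembly has to be inspected.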
The predictions I made prior to starting work were:
- Null hypothesis (my bet): There will be no measurable cycle count difference in any test cases that vary only in the selected delay instruction. I.e., there is no pipeline difference on the Cortex-M4.
- “Learn something” result (consistent with my previous claims but not my expectations): For sequences with N>0 delay instructions, one cycle fewer will be measured in sequences using NOP than in sequences using MOV R8,R8. I have no prediction whether the context instruction will impact this behavior. I.e., on the Cortex-M4 only one NOP instruction may be absorbed.
- “Surprise me” result (still consistent with my previous claims but demonstrating a much higher level of technology in Cortex-M4 than I would predict): A difference of more than one cycle will be observed between any two cases that vary only in the selected delay instruction, but the difference has an upper bound less than the sequence length. I.e., the pipeline is so deep that multiple decoded instructions can be dropped without impacting execution time.
- “The universe is borked” result (can’t happen): The duration of a sequence involving NOP is constant regardless of sequence length while the duration of the sequence involving MOV R8,R8 is (in the limit) linear in sequence length. I.e., the CPU is able to decode and discard an arbitrary number of NOP instructions in constant time.
Naturally, things turned out to be a little more complex, but I believe the results are enlightening. The code is available in this GitHub gist.
Here’s the output from the test program:
May 22 2014 06:16:24
System clock 16000000 Hz
Before GPIO context insn: 21
After GPIO context insn: 23
After GPIO context restored: 21
Null context, NOP: 1 2 3 4 5 6 7 8 1
Null context, MOV: 1 3 4 5 6 7 8 9 1
GPIO context, NOP: 7 7 10 10 10 11 12 13 7
GPIO context, MOV: 7 7 10 10 10 11 12 13 7
So what does this say?
First, note that I’ve added diagnostics to confirm that the GPIO context instruction does what it’s supposed to do (enable an unused GPIO module), and that the instruction to reset the context works. Second, the results for each test show the cycle times for the context followed by zero, one, two, …, seven, and zero delay instructions.
Let’s expand the empty, one, and two delay versions of each case to see what it is we’ve timed. These are extracted from main.dis-Os in the gist. Here’s the null context with NOP:
35:main.c **** t0 = BSPACM_CORE_CYCCNT();
35 0000 214B ldr r3, .L2
36 0002 5A68 ldr r2, [r3, #4]
36:main.c **** *dp++ = BSPACM_CORE_CYCCNT() - t0;
39 0004 5968 ldr r1, [r3, #4]
40 0006 8A1A subs r2, r1, r2
42 0008 0260 str r2, [r0]
38:main.c **** t0 = BSPACM_CORE_CYCCNT();
44 000a 5A68 ldr r2, [r3, #4]
39:main.c **** DELAY_INSN_NOP();
48 000c 00BF nop
40:main.c **** *dp++ = BSPACM_CORE_CYCCNT() - t0;
53 000e 5968 ldr r1, [r3, #4]
54 0010 8A1A subs r2, r1, r2
56 0012 4260 str r2, [r0, #4]
42:main.c **** t0 = BSPACM_CORE_CYCCNT();
58 0014 5A68 ldr r2, [r3, #4]
43:main.c **** DELAY_INSN_NOP(); DELAY_INSN_NOP();
62 0016 00BF nop
65 0018 00BF nop
44:main.c **** *dp++ = BSPACM_CORE_CYCCNT() - t0;
70 001a 5968 ldr r1, [r3, #4]
71 001c 8A1A subs r2, r1, r2
73 001e 8260 str r2, [r0, #8]
Good, so it’s doing what we expect in the most basic case. What about MOV R8,R8?
80:main.c **** t0 = BSPACM_CORE_CYCCNT();
240 0000 214B ldr r3, .L5
241 0002 5A68 ldr r2, [r3, #4]
81:main.c **** *dp++ = BSPACM_CORE_CYCCNT() - t0;
244 0004 5968 ldr r1, [r3, #4]
245 0006 8A1A subs r2, r1, r2
247 0008 0260 str r2, [r0]
83:main.c **** t0 = BSPACM_CORE_CYCCNT();
249 000a 5A68 ldr r2, [r3, #4]
84:main.c **** DELAY_INSN_MOV();
253 000c C046 mov r8, r8
85:main.c **** *dp++ = BSPACM_CORE_CYCCNT() - t0;
258 000e 5968 ldr r1, [r3, #4]
259 0010 8A1A subs r2, r1, r2
261 0012 4260 str r2, [r0, #4]
87:main.c **** t0 = BSPACM_CORE_CYCCNT();
263 0014 5A68 ldr r2, [r3, #4]
88:main.c **** DELAY_INSN_MOV(); DELAY_INSN_MOV();
267 0016 C046 mov r8, r8
270 0018 C046 mov r8, r8
89:main.c **** *dp++ = BSPACM_CORE_CYCCNT() - t0;
275 001a 5968 ldr r1, [r3, #4]
276 001c 8A1A subs r2, r1, r2
278 001e 8260 str r2, [r0, #8]
Good: those differ only in the delay instruction, and it’s the same number of octets in the instruction stream.
Now let’s see what the bitband assignment does to the instruction sequence when followed by NOP:
125:main.c **** t0 = BSPACM_CORE_CYCCNT();
444 0000 394A ldr r2, .L8
126:main.c **** CONTEXT_INSN_GPIO();
446 0002 3A4B ldr r3, .L8+4
447 0004 0121 movs r1, #1
122:main.c **** {
449 0006 30B5 push {r4, r5, lr}
125:main.c **** t0 = BSPACM_CORE_CYCCNT();
455 0008 5468 ldr r4, [r2, #4]
458 000a 1960 str r1, [r3]
127:main.c **** *dp++ = BSPACM_CORE_CYCCNT() - t0;
461 000c 5568 ldr r5, [r2, #4]
462 000e 2C1B subs r4, r5, r4
464 0010 0460 str r4, [r0]
128:main.c **** RESTORE_CONTEXT_INSN_GPIO();
466 0012 0024 movs r4, #0
467 0014 1C60 str r4, [r3]
130:main.c **** t0 = BSPACM_CORE_CYCCNT();
469 0016 5468 ldr r4, [r2, #4]
131:main.c **** CONTEXT_INSN_GPIO();
472 0018 1960 str r1, [r3]
132:main.c **** DELAY_INSN_NOP();
475 001a 00BF nop
133:main.c **** *dp++ = BSPACM_CORE_CYCCNT() - t0;
480 001c 5368 ldr r3, [r2, #4]
481 001e 1B1B subs r3, r3, r4
482 0020 4360 str r3, [r0, #4]
134:main.c **** RESTORE_CONTEXT_INSN_GPIO();
484 0022 324B ldr r3, .L8+4
485 0024 0021 movs r1, #0
486 0026 1960 str r1, [r3]
136:main.c **** t0 = BSPACM_CORE_CYCCNT();
488 0028 5168 ldr r1, [r2, #4]
137:main.c **** CONTEXT_INSN_GPIO();
491 002a 0122 movs r2, #1 // *** OOPS
492 002c 1A60 str r2, [r3]
138:main.c **** DELAY_INSN_NOP(); DELAY_INSN_NOP();
495 002e 00BF nop
498 0030 00BF nop
139:main.c **** *dp++ = BSPACM_CORE_CYCCNT() - t0;
503 0032 2D4A ldr r2, .L8
504 0034 5368 ldr r3, [r2, #4]
505 0036 5B1A subs r3, r3, r1
506 0038 8360 str r3, [r0, #8]
140:main.c **** RESTORE_CONTEXT_INSN_GPIO();
508 003a 2C4B ldr r3, .L8+4
509 003c 0021 movs r1, #0
511 003e 1960 str r1, [r3]
Don’t be misled: although the C source shows the read of the cycle counter occurring before some overhead instructions (e.g. the push), the actual read doesn’t occur until offset 8. So what’s being timed is what we want.
Finally, here’s the bitband assignment with MOV R8,R8 as the delay instruction:
188:main.c **** t0 = BSPACM_CORE_CYCCNT();
726 0000 394A ldr r2, .L11
189:main.c **** CONTEXT_INSN_GPIO();
728 0002 3A4B ldr r3, .L11+4
729 0004 0121 movs r1, #1
185:main.c **** {
731 0006 30B5 push {r4, r5, lr}
188:main.c **** t0 = BSPACM_CORE_CYCCNT();
737 0008 5468 ldr r4, [r2, #4]
740 000a 1960 str r1, [r3]
190:main.c **** *dp++ = BSPACM_CORE_CYCCNT() - t0;
743 000c 5568 ldr r5, [r2, #4]
744 000e 2C1B subs r4, r5, r4
746 0010 0460 str r4, [r0]
191:main.c **** RESTORE_CONTEXT_INSN_GPIO();
748 0012 0024 movs r4, #0
749 0014 1C60 str r4, [r3]
193:main.c **** t0 = BSPACM_CORE_CYCCNT();
751 0016 5468 ldr r4, [r2, #4]
194:main.c **** CONTEXT_INSN_GPIO();
754 0018 1960 str r1, [r3]
195:main.c **** DELAY_INSN_MOV();
757 001a C046 mov r8, r8
196:main.c **** *dp++ = BSPACM_CORE_CYCCNT() - t0;
762 001c 5368 ldr r3, [r2, #4]
763 001e 1B1B subs r3, r3, r4
764 0020 4360 str r3, [r0, #4]
197:main.c **** RESTORE_CONTEXT_INSN_GPIO();
766 0022 324B ldr r3, .L11+4
767 0024 0021 movs r1, #0
768 0026 1960 str r1, [r3]
199:main.c **** t0 = BSPACM_CORE_CYCCNT();
770 0028 5168 ldr r1, [r2, #4]
200:main.c **** CONTEXT_INSN_GPIO();
773 002a 0122 movs r2, #1 // *** OOPS
774 002c 1A60 str r2, [r3]
201:main.c **** DELAY_INSN_MOV(); DELAY_INSN_MOV();
777 002e C046 mov r8, r8
780 0030 C046 mov r8, r8
202:main.c **** *dp++ = BSPACM_CORE_CYCCNT() - t0;
785 0032 2D4A ldr r2, .L11
786 0034 5368 ldr r3, [r2, #4]
787 0036 5B1A subs r3, r3, r1
788 0038 8360 str r3, [r0, #8]
203:main.c **** RESTORE_CONTEXT_INSN_GPIO();
790 003a 2C4B ldr r3, .L11+4
791 003c 0021 movs r1, #0
793 003e 1960 str r1, [r3]
So now that we’ve seen what’s being timed, let’s look at results again:
Null context, NOP: 1 2 3 4 5 6 7 8 1
Null context, MOV: 1 3 4 5 6 7 8 9 1
NOP consistently introduces a one-cycle delay, which is what us old-timers would expect an opcode named “NOP” to do. The MOV R8,R8 instruction also introduces a one-cycle delay but only when it can be pipelined; a single instance in isolation takes two cycles.
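The per-instruction cost behind that reading can be checked by differencing adjacent entries of the output. The following is a minimal sketch rather than part of the test program: the function name `step_costs` and the array names are made up for illustration, and the numbers are the first eight entries of the two null-context rows above (the trailing repeat of the zero-delay case is left out).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Measured cycle counts for 0..7 delay instructions, copied from the
 * null-context rows of the program output. */
static const uint32_t null_nop[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
static const uint32_t null_mov[8] = { 1, 3, 4, 5, 6, 7, 8, 9 };

/* Store in out[i] the extra cycles observed when going from i to i+1
 * delay instructions. out must have room for n-1 entries. */
static void step_costs(const uint32_t *counts, size_t n, int32_t *out)
{
    assert(counts != NULL && out != NULL && n >= 1);
    for (size_t i = 0; i + 1 < n; ++i) {
        out[i] = (int32_t)counts[i + 1] - (int32_t)counts[i];
    }
}
```

For the NOP row every step costs one cycle. For the MOV row the first step costs two cycles and each later step costs one.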
What’s the effect when a complex context instruction is used?
GPIO context, NOP: 7 7 10 10 10 11 12 13 7
GPIO context, MOV: 7 7 10 10 10 11 12 13 7
This result requires a little analysis. If you look at the code, the instruction sequences with zero and one delay instruction are what we want to time. With two delay instructions the compiler happens to have loaded the RHS of the bitband store operation into a register within the timed sequence, at the line marked *** OOPS in the listing above.
From experience with MSP430 I normally use -Os when compiling, since that enables optimizations designed to reduce code size. These optimizations tend to be a little weak; when -O2 is used instead of -Os the compiler is smarter and doesn’t do the load within the timed sequence:
Null context, NOP: 1 2 3 4 5 6 7 8 1
Null context, MOV: 1 3 4 5 6 7 8 9 1
GPIO context, NOP: 7 7 7 7 7 8 9 10 7
GPIO context, MOV: 7 7 7 7 7 8 9 10 7
You can go look at main.dis-O2 to check out what’s being timed here, but I claim it’s exactly what should be timed.
What this shows is that the peripheral bitband write takes six cycles to complete (the measured seven less the one-cycle timing overhead), and that up to four delay instructions get absorbed into that time regardless of which type of delay instruction is used. (Why it takes six cycles is a different question. A bitband write to an SRAM address instead of the peripheral register took five. I don’t know whether the pipeline has six/seven stages, or something else is stalling the CPU.)
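The arithmetic behind those figures can be written out as a short sketch. The names below are hypothetical, the data is the -O2 GPIO-context row from the output above (without the trailing repeat of the zero-delay case), and the one-cycle overhead is the zero-delay entry of the null-context rows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Cycle counts for 0..7 delay instructions after the GPIO context
 * instruction, copied from the -O2 output (identical for NOP and MOV). */
static const uint32_t gpio_o2[8] = { 7, 7, 7, 7, 7, 8, 9, 10 };

/* Cycles measured for an empty timed sequence (null context, no delay). */
#define TIMING_OVERHEAD 1u

/* Cost attributed to the context instruction alone. */
static uint32_t context_cost(const uint32_t *counts)
{
    return counts[0] - TIMING_OVERHEAD;
}

/* Number of delay instructions that can be appended before the
 * measurement rises above the zero-delay value. */
static size_t absorbed_delays(const uint32_t *counts, size_t n)
{
    assert(counts != NULL && n > 0);
    size_t k = 0;
    while (k + 1 < n && counts[k + 1] == counts[0]) {
        ++k;
    }
    return k;
}
```

With the -O2 row this gives a context cost of six cycles and four absorbed delay instructions. Applied to the -Os row it reports only one, because the extra load the compiler placed inside the timed sequence shows up from the two-delay case onward.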
My conclusions:
- Don’t muck about trying to be clever: for a one-cycle delay just use __NOP(), the ARM CMSIS standard spelling for an inline function that emits the NOP instruction. Where it has an effect, it’s a one-cycle effect. Where it doesn’t, other instructions don’t behave any better.
- The effect of the pipeline is much bigger than I anticipated: not only does the Cortex-M take advantage of the permission granted by the architected hint that __NOP() can be dropped from the execution stage, the impact of the peripheral write eliminates the difference between a one- and a two-cycle instruction.
What this really means is that attempts to do small (1-3) cycle delays have fragile dependencies on the surrounding instructions, which in turn depend on the compiler and its optimization flags. If you’re getting a hard fault because you manipulate a module register too quickly after enabling the module, insert a __NOP() or two and see if it works. If the exact cycle count of the code you write is critical, you’re going to have to analyze it in context.
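As a sketch of the "insert a NOP or two" suggestion: the helper below enables a module, executes two NOPs, then touches a module register. Everything here is illustrative. The function and parameter names are invented, `delay_nop()` stands in for the CMSIS `__NOP()` so the fragment builds without vendor headers, and the count of two is a guess that has to be validated on the actual hardware with the actual compiler flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for CMSIS __NOP(): emits a single NOP instruction. */
static inline void delay_nop(void)
{
    __asm__ volatile ("nop");
}

/* Enable a module, give it a moment, then write one of its registers.
 * enable_reg and module_reg are placeholders for real register
 * addresses (for example a bit-band alias of a clock-gating bit). */
static inline void enable_then_configure(volatile uint32_t *enable_reg,
                                         volatile uint32_t *module_reg,
                                         uint32_t config)
{
    assert(enable_reg != NULL && module_reg != NULL);
    *enable_reg = 1;        /* turn the module on */
    delay_nop();            /* short, context-dependent delay */
    delay_nop();
    *module_reg = config;   /* first access to the module */
}
```

Whether two NOPs are enough, too many, or absorbed entirely depends on the surrounding instructions, which is the point of the measurements above.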