Zephyr Two Years Later

Recently I was asked to provide an assessment of Zephyr for startups that might want to use it. This post provides that assessment, and expands on it with some of the background for its conclusions and why I’ve again given up on using Zephyr for my own projects.

Here’s my assessment:

Zephyr will be a de facto success at least for the next few years because it can’t be seen as a failure: too many really large companies have too much invested in it. It will be widely adopted by the member companies and those who follow them. Many vendors already use Zephyr as the basis of their SDKs, and more will join them.

But anybody who expects Zephyr to “just work” will be disappointed. Instead they should expect to run into functionality and quality gaps that will be frustrating and will delay progress toward whatever their actual goal is. Companies that are not members may find it difficult to get the attention of people who are already working on solutions that may only meet the needs of their own member companies. Even member companies can encounter these blocks, though at least they have a voice in the Technical Steering Committee.

So a startup could succeed with Zephyr by going in with open eyes and an expectation of discovering, reporting, and possibly working around or fixing bugs and functional gaps. Fixes might be contributed upstream, or may be more easily maintained in a fork that isn’t destabilized by upstream activity. But expect that core Zephyr will continue to be reworked at deep levels for years to come.

My Experiences with Zephyr

In my previous post on Zephyr I outlined my concerns with Zephyr as an embedded software platform, and noted that despite them I would continue to use it and to contribute small items.

As often happens, small items became larger, and four months later I negotiated a contract with one of the platinum member companies to be paid for work that we both agreed was of value to Zephyr.

My involvement turned out to be pretty heavy: between contracted and personal contributions I’ve been in the top ten contributors (by commits merged) for six releases (v2.0.0 through v2.5.0) since the v1.14.0 Long Term Support release two years ago, and am ranked third overall for that span.

At the time of writing I’ve submitted to Zephyr:

  • 276 issues, of which 60 are still open;
  • 602 pull requests, of which 19 are still open. Of the closed ones 498 were merged, and 83 (14%) were not (superseded or failed to gain support);
  • Uncounted issue comments and PR reviews.

All of which is to say: I have a pretty good idea of the state of Zephyr as a project and as a software system, and what it takes to use it and to contribute to it.

After more than 2000 hours contributing to Zephyr I’ve come to these conclusions:

  • Zephyr is only slightly closer to having the stability and functionality I want for my wireless sensor applications. Some gross bugs have been fixed, and a couple of useful new features added, but relative to the homegrown solution I abandoned in 2018 it’s lacking in sensor API, device state management, C++ support, and system status monitoring (telemetry).
  • There is no doubt that those gaps can be closed from a technical perspective, but I estimate the effort to achieve parity with my previous framework at roughly 1000 hours of my time.
  • While that would get me what I need, the cost of getting those changes into upstream Zephyr would be two or three times higher, and many would never get the support required for them to be merged.

As a result, while my contributions may be of value to Zephyr, Zephyr doesn’t have enough value to me to be worth the pain of making those contributions, so I’ve dissociated myself from the areas I was maintaining and am moving on.

My Concerns with Zephyr

The evangelists will tell you Zephyr is the best open source option for embedded applications, and I can’t say it isn’t. It has support for a wide range of systems, from embedded low-power microcontrollers to multi-core 64-bit networked hosts. It supports a huge number of interfaces. There’s a whole web site telling you all the great things about Zephyr; no need for me to repeat them here.

But that’s high-level promotion. Areas to examine closely include:

  • Does it have the functionality you need? 62% (over 700) of the open issues are enhancements, which through 2.5.0 have not been tracked and include major functionality like pinmux control that has made no progress in over two years.
  • Does it meet your reliability expectations? Coding standard conformance is ongoing and retroactive: guidelines are still being finalized, and existing code is being patched to conform to automated safety and security tests. This is a much lower bar than software that’s developed with safety and security in mind, with requirements, design, testing, and validation artifacts to support its quality claims. As an example: Zephyr is supposed to support symmetric multiprocessing, but while working on the last major feature I contributed I found that much of it has race conditions when preemptive threads or symmetric multiprocessing are in play (a sketch of the failure mode follows this list).
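
To make that concrete, here’s a minimal sketch of the failure mode, using Zephyr’s real spinlock API but names I’ve invented for illustration: an unguarded read-modify-write of shared state is fine with a single cooperative thread, and silently loses updates once preemption or a second core enters the picture.

#include <zephyr.h>

static uint32_t pending_flags;          /* shared driver state */
static struct k_spinlock flags_lock;

/* Racy: a preempting thread or second core can interleave the load
 * and the store, losing one of the updates. */
void set_pending_racy(uint32_t bit)
{
    pending_flags |= bit;
}

/* Safe: k_spin_lock() masks local interrupts and serializes cores. */
void set_pending_safe(uint32_t bit)
{
    k_spinlock_key_t key = k_spin_lock(&flags_lock);
    pending_flags |= bit;
    k_spin_unlock(&flags_lock, key);
}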

To me the biggest Zephyr problems aren’t technical. They are:

  • ad hoc and reactive processes;
  • a lack of coordinated resource management;
  • a lack of consistent support for non-member contributors.

These are mostly unchanged from my assessment two years ago. Some problematic areas are summarized below.

Code Reviews

Though there is now a concept of area maintainers who have a nominal responsibility to ensure contributions make progress, there is no infrastructure to track that progress, and only a very few maintainers are prompt in reviewing and merging. Anybody can block proposed code for any reason, and if no consensus can be found the contribution is left to stagnate.

This can result in a negative experience, for new contributors as well as existing ones. Despite multiple pleas I couldn’t get several PRs reviewed in time for the 2.5.0 merge, and several documentation updates remain stalled six months after I submitted them.

Evolution and Planning

There is no project-level architectural oversight to coordinate and manage change across subsystems, or to ensure stakeholders are identified and given an opportunity to participate. No completed Zephyr software development task I’ve seen has included design artifacts (requirements, use cases, architectural designs, or implementation or test plans) that go deeper than unmaintained and often unpublished slide decks. APIs and implementations can be merged with as little oversight as approval by two people from the same company as the submitter. This means phased efforts can appear to make progress because pieces of them are merged, but then fail because somebody who wasn’t aware of the effort notices them and objects.

Process and Stakeholder Involvement

Where there is a process, it’s not always followed. As one recent example: much Zephyr API was written to return -ENOTSUP as an error where -ENOSYS would have been a better choice, for reasons that are not clear. This has been rediscovered multiple times since 2018, most recently last month when it was again discussed in the API telecon and a conclusion recorded (paraphrased: “yeah, that’s wrong, but it needs to be addressed in a wholesale review of error code consistency after the next LTS”).
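
For anyone who hasn’t internalized the distinction: -ENOSYS means the operation isn’t implemented at all, while -ENOTSUP means it’s implemented but can’t honor the particular request. A sketch, with names invented for illustration:

#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct widget_api {
    int (*set_rate)(uint32_t hz);
};

int widget_set_rate(const struct widget_api *api, uint32_t hz)
{
    if (api->set_rate == NULL) {
        return -ENOSYS;   /* operation not implemented by this driver */
    }
    if (hz > 1000000U) {
        return -ENOTSUP;  /* implemented, but this rate isn't supported */
    }
    return api->set_rate(hz);
}

Conflating the two matters because callers often treat -ENOSYS as “don’t bother trying again” and -ENOTSUP as “try different arguments”.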

After that meeting concluded, discussion continued in another meeting and a PR was introduced to change the usage. That PR was submitted, approved, and merged without inviting all stakeholders from the just-completed discussion to review it, and without following the documented process for API changes that may require existing code to be modified to maintain current behavior.

Conclusion

This is a trenchant comment from a memo I sent earlier this year to a subset of TSC members who are active in the technical aspects of Zephyr evolution:

Reviewing my Zephyr experiences over the last two years fails to reveal a single compelling example of a major Zephyr feature or task for which there was/is a well-defined plan with full stakeholder acceptance that was executed in a timely manner to a successful conclusion that met its documented goals.

The memo also included details of five areas where I felt Zephyr wasn’t meeting reasonable expectations. The memo had no visible practical impact. In combination with the functional gaps that keep Zephyr from meeting my low-power wireless sensor needs this was enough to convince me that my time is better spent elsewhere.

My disengagement is nearing completion. I’ve started updating my software engineering and development skills, which have stagnated while working on an RTOS that can’t move beyond C99. I’m looking forward to taking a deep dive into Rust for application work, and updating my back-end processes for data aggregation to ES2020 and perhaps even TypeScript.

And I’ve grabbed the most recent S140 SoftDevice from Nordic and will be refreshing nrfcxx to add support for the sensor hardware that’s been sitting in boxes for two years while I tried to get Zephyr to a state where I could use it. I won’t have Bluetooth mesh, or OTA firmware updates, or accessible recorded data synchronized to civil time, but I will be able to expand my collection of reliable low-power devices that provide beaconed environmental measurements. And it won’t take me (more) years to get them deployed.

Experiences with Zephyr

In mid-November 2018 I’d gotten nrfcxx to the point where it met my first-level needs: low-power Nordic nRF5-based Bluetooth beacons providing sensor data, acquired at 1 Hz, from ambient and enclosed environments including HVAC systems.

nrfcxx has two significant weaknesses, though:

  • There’s no support for over-the-air firmware updates;
  • There’s no support for Bluetooth central/peripheral roles that would allow configuration at the device level (e.g. to provide calibration values).

Around this time I discovered the Zephyr Project, a real-time operating system that emerged from Intel’s Open Source Technology Center and was subsequently adopted by the Linux Foundation. Many of the major silicon providers for IoT applications are members of this project. Best of all, it included not only support for a boot loader, but also a complete Bluetooth stack contributed by Intel and Nordic Semiconductor, including support for Bluetooth Mesh.

So I decided to devote up to three months full-time effort to a deep dive into Zephyr, to see if it was a better path forward for me than nrfcxx.

I submitted my first patch 2018-11-18. Over the next three months I opened 45 issues and submitted 28 pull requests, of which 24 were merged.

Things got a little rocky from the start. I2C on Nordic didn’t work because the API specified a behavior that the driver didn’t support. The kernel command to wait for short periods (measured in microseconds) was horribly inaccurate on Nordic hardware. The system timer implementation was broken too, in an unrelated way.
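
The microsecond-wait call in question is presumably k_busy_wait(); checking it against the hardware cycle counter takes only a few lines. This harness uses real Zephyr calls in their current spellings (the conversion helper postdates the era described here), with reporting of my own:

#include <zephyr.h>
#include <sys/printk.h>

/* Compare a requested busy-wait against the hardware cycle counter. */
void check_busy_wait(uint32_t request_us)
{
    uint32_t t0 = k_cycle_get_32();
    k_busy_wait(request_us);
    uint32_t delta = k_cycle_get_32() - t0;

    printk("requested %u us, measured %u us\n",
           request_us, k_cyc_to_us_floor32(delta));
}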

All this in the first two weeks. This set the stage for the next ten weeks. Some things worked. More didn’t. Ultimately by early February I stopped actively contributing, primarily due to what I perceived as governance failures.

This is what exhausted my patience: Early on, after discussion with and general approval from participants in the weekly API telecon, I submitted a solution for gaps in GPIO configuration that added the features I needed without changing existing behavior. It took weeks and prodding before reviews came in; among several that were positive were a couple that could uncharitably be described as “I want it done this other way instead”. While both approaches could be justified, I wasn’t willing to relax the requirement for full backwards compatibility in the solution I provided, and the other parties weren’t willing to accept my work as a step towards a more significant rework by somebody else in the future. Deadlock. Two months after submission and ongoing back-and-forth it ended up in the Technical Steering Committee which chose to discard both proposals and revisit the issue for the next stable release.

Let me be clear: I don’t have a problem with that as a resolution. I do have a problem that the pull request sat for nine weeks, with ongoing discussion, and had to be escalated to the highest decision-making body in the project before anybody could decide what should be done.

Linux has been a success in large part because it had a strong architectural lead who decided the bounds of acceptable technical solutions. Zephyr doesn’t have this: it depends on a contributor-based consensus model where merges require at least one approving review, no requests-for-change, and somebody who has commit privileges being interested enough to perform the merge. There is nobody with both the authority and the responsibility to ensure that architectural and process decisions are made in a timely manner.

There are other, related concerns. Several companies have employees who are paid to work on Zephyr, generally providing new features. The review process and lack of project-level architectural oversight have allowed solutions to be proposed, accepted, and merged entirely based on one perspective. There is little evidence of managed early coordination on core technologies like power management, so people can spend a lot of time working on a solution that meets their needs, before finding out that it’s completely unsuitable for other applications. This is exacerbated by the wide range of target platforms, from multi-core DSP engines and X86_64 processors all the way down to ARM Cortex-M1 devices. The needs of a mains-connected table-top voice-activated personal assistant are vastly different from those of a low-cost battery-powered temperature sensor with a minimum five-year expected lifespan bolted into ductwork in the ceiling.

At the end I stepped away, having concluded that key Zephyr capabilities such as GPIO, I2C, SPI, timers, and sensors were functionally incomplete, unpleasant to use, or so abstracted that their overhead made them unsuitable for use in ultra-low-power wireless sensors.

All that said: I really want to use Bluetooth mesh.

Zephyr is a “too big to fail” project: it’s got strong backing, and is around for the long haul. It really doesn’t have any viable competitors. I’m also a strong believer that, when there’s an existing solution to the problem you have, you need a rock solid reason why you won’t use it. And “I just don’t like it” isn’t good enough.

So, six weeks later I’m coming back to Zephyr. But the project as it stands still fails to meet my needs with core capabilities such as timers, GPIO, I2C, sensors, introspection of system status, and other functionality necessary for robust low-power sensors. I’ll submit PRs for small things, but I’m not motivated to speculatively contribute the more significant changes, so I’m doing work in my Zephyr fork.

Some of the patches on those branches are worth discussing, so as time permits I may describe them here, and maybe somebody will be interested enough to assist in getting them into Zephyr.

New C++17 projects on github

Back in 2014 I decided to branch off my bspacm work and investigate whether it was really possible to come up with a usable and efficient embedded development infrastructure using modern C++.

After returning to consulting in 2017 I resurrected this project to cover time between clients.

There are a lot of pieces involved, but I’ve now made available two subsets that might be of interest:

  • pabigot-cxx is a package of generic C++17 utilities supporting embedded and non-embedded software development. The public release has only a small fraction of the full system, but there are some nice tools for handling packed octet buffers with different endianness, and what I think is a very cool solution to compile-time generation of CRC tables.
  • nrfcxx is a C++17 framework that uses the bare-metal peripheral API of the Nordic nRF5 series chips to support ultra-low-power Bluetooth beacons transmitting sensor data.

Although these aren’t designed or supported for general public use the API documentation is fairly complete and nrfcxx has support for several boards with a variety of examples. Anybody interested is welcome to make use of these packages under the terms of the Apache-2.0 software license.

Interactive Web Display of Real-time Sensor Data

The biggest psychological impediment to my work on wireless sensor network frameworks is data. What do I collect? What do I do with it? In short, why do I even want a WSN? (If you can answer that, but want help making it happen, please get in touch through one of the options on my about page.)

Collecting data that’s never looked at is a waste of time, so good tool support for viewing and analyzing time-series data is important. The pieces underlying this are the databases and the graphs.

The traditional database approach is RRDTool. Been around forever, does the job very well.

I’m not in love with its API, which involves providing text representations of the observations through an argc/argv command-line interface even from within C, which on the surface introduces a lot of overhead. But “I just don’t like it” is an inadequate reason to reject a time-proven solution. In fact, a text-based API has certain benefits: I recently got a patch merged to RRD that significantly simplifies specification of archive retention, but the new feature can’t be used in collectd configurations because the parameters are no longer the integer counts that collectd stores and sorts to reformat into RRD arguments.
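
For readers who haven’t used it: even calling librrd from C means rendering each observation to text and handing over an argv. A sketch, with the database name and metric invented:

#include <rrd.h>
#include <stdio.h>
#include <time.h>

/* Store one temperature sample in temps.rrd. */
int store_sample(double celsius)
{
    char value[64];
    snprintf(value, sizeof(value), "%ld:%.2f",
             (long)time(NULL), celsius);

    char *argv[] = { "update", "temps.rrd", value };
    if (rrd_update(3, argv) != 0) {
        fprintf(stderr, "rrd_update: %s\n", rrd_get_error());
        rrd_clear_error();
        return -1;
    }
    return 0;
}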

Where RRDTool falls down is in the display of the data. Well, really, in constructing the specification that’s used to define the graphs. The rrdgraph tool that comes with it actually makes some very powerful graphs, but it’s in desperate need of a GUI to help select among data sets, combine them, change the time bounds, etc. There’s cacti, but it wants to do data collection too, I don’t like kitchen-sink solutions, and it’s not as good as collectd. The days of the fat client are past; there are some web apps like Collectd Graph Panel that make it a bit easier, but not enough to really be nice to use.

Or so I thought until I read this blog and found out about statsd, which led me to graphite.

This is what I’m talking about: dashboards with pre-configured graphs showing the information I want to see, automatically updated as the data comes in. A web application to dynamically construct graphs, adding and removing from all available data sets, applying transformations to each source in turn, etc.

Turns out the graphite project provides three layers:

  • whisper and ceres are back-end databases for time series data, similar to RRDTool
  • carbon is a daemon infrastructure that receives samples over a network connection and dispatches them to whisper or ceres based on a text name, which by convention encodes a hierarchy such as collectd.server1.cpu0.load (a sketch of the wire protocol follows this list).
  • graphite-web is what generates the graphs; it can read data from whisper, ceres, and RRD databases.
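
The nice part is how little ceremony carbon demands: the plaintext listener takes one metric per line in the form “path value timestamp”, conventionally on TCP port 2003. A minimal C sender (host address and error handling pared down; the metric name is an example):

#include <arpa/inet.h>
#include <stdio.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

/* Send one sample to a local carbon-cache plaintext listener. */
int send_metric(const char *path, double value)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }

    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port = htons(2003),     /* carbon plaintext port */
    };
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }

    char line[128];
    int n = snprintf(line, sizeof(line), "%s %g %ld\n",
                     path, value, (long)time(NULL));
    ssize_t rc = write(fd, line, n);
    close(fd);
    return (rc == n) ? 0 : -1;
}

Call send_metric("collectd.server1.cpu0.load", 0.25) and the sample lands in the hierarchy named above; carbon even creates the database on first sight of the name.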

The graphite stack is all written in Python and runs as a virtual host under Django, so it’s pretty easy to configure on Ubuntu 12.04 or 14.04. Even getting Django to run in Apache isn’t nearly as hard as everybody makes it out to be.

Graphite’s capabilities are awesome.

Carbon has some really neat architectural features including automated creation of databases as metrics arrive (with customized retentions based on the metric name), centralized aggregation of metrics, and daemons to perform meta-aggregation and relaying to other servers.

Whisper is…inadequate.

Now whisper exists only because of a couple of issues that the graphite developers had with RRDTool. Paraphrasing the explanation:

  • can’t back-fill data that wasn’t available in a timely manner. Here I have some sympathy. This limitation on RRDTool may explain why collectd’s architecture can’t handle plugins that record multiple observations within a sampling interval. Fixing this in RRDTool would be tough. Probably the only viable solution is to integrate the capability into rrdcached, and enhance rrdcached so it acts as a fetch server without first flushing everything to disk.
  • not designed for irregular updates. In short, incoming data to RRD gets interpolated to align with the primary data point timestamps. If you don’t get data often enough to do that interpolation, RRDTool can’t do the alignment, and data gets dropped. I’m sympathetic to this issue, though it doesn’t affect my use cases as much as it does, say, StatsD and other system-monitoring applications.

On the other hand, whisper has a few problems of its own:

  • Each file stores only one metric, which wastes storage when a sensor provides multiple metrics (e.g. temperature, humidity, pressure, wind direction and speed) and multiple consolidations (AVERAGE, MAX, MIN over various retentions)
  • As a consequence, when aggregating for lower-resolution periods only one consolidation function is allowed (normally “average”). You lose the extreme values (such as daily highs and lows) unless you configure to create separate databases for those. (Carbon does support this if carbon-aggregator is used, but that’s another daemon/point-of-failure.)
  • In my case, sensors have a limited ability to store data locally, so if the off-sensor database stops functioning and I’m told about it in time I can restart things and back-fill the missing material. This is exactly what nagios is for, but figuring out when the last update was received by a whisper database requires an O(n) search because the value isn’t stored in the database header, even though a related problem was a motivation for rejecting RRDTool!

The biggest problem with carbon, and the graphite project as a whole, is lack of leadership and active management. Carbon has over two hundred open issues and pull requests, some of which have apparently been addressed but the issues left open. There was a major rework called “megacarbon” but that’s been dormant for six months. There are incompatible changes being made on the 0.9.x maintenance branch relative to master. whisper is supposed to be superseded by ceres, but requests for information on project status and schedule are left unanswered for months. If you noticed that the link I gave above for graphite points to an outdated page, that’s because the new one doesn’t have the FAQ or any of the pieces that tell people why they should even care about the project.

This is an unfortunate but common failing with open source projects, where nobody’s compensated for their effort and maintenance naturally drops when the original developer can only respond “it works for me” or “I don’t even use that software anymore”.

Regardless of all that, RRDTool is a robust tool that’s worked for years and continues to be maintained: a dozen enhancement patches I submitted were promptly integrated for the next release. As an open source solution for web access to display real-time data I don’t think graphite-web would survive a serious challenge from an actively-managed alternative, but it works well enough that there’s really no motivation to develop a competitor.

I’ve set up permanently running rrdtool, graphite, and collectd systems on both my stable and development servers. They’re already recording whole-house power consumption at 1 Hz from a TED 5000, interior temperature and humidity from a daemon running on a Raspberry Pi (stored in RRD databases because I care about that data), and collectd statistics from internal hosts (stored in whisper databases because it’s easier to customize the retention periods and I don’t care about that data).

Since I finally have a way to visualize the data, and I have all these wireless microcontrollers and sensors, it’s probably time to start collecting more.

The Effect of the ARM Cortex-M NOP Instruction

A forum discussion on Stellarisiti raised the question of how to achieve short delays on a Cortex-M microcontroller. Specifically, delays on the order of cycles, where the overhead of calling a vendor-supplied library routine exceeds the desired delay. The difficulty arises from an earlier observation that ARM documents the NOP instruction as being usable only for alignment, and makes no promises about how it impacts execution time. In fact, ARM specifies that its use may decrease execution time, miraculous though that might be.

I felt the lines of argument lacked evidence, and accepted a challenge to investigate. This post covers the details of the experiment and its result; the forum discussion provides additional information including an explanation of “hint instruction”, the effect of “architected hint”, and why the particular alternative delay instructions were selected.

The experiment I proposed was the following:

  • Timing will be performed by reading the cycle count register, executing an instruction sequence, then reading the cycle counter. The observation will be the difference between the two counter reads.
  • The sequence will consist of zero or one context instructions followed by zero or more (max 7) delay instructions.
  • The only context instruction tested will be a bit-band write of 1 to SYSCTL->RCGCGPIO enabling a GPIO module that had not been enabled prior to the sequence.
  • The two candidate delay instructions will be NOP and MOV R8, R8.
  • Evaluation will be performed on an EK-TM4C123GXL experimenter board using gcc-arm-none-eabi-4_8-2013q4 with the following flags: -Wall -Wno-main -Werror -std=c99 -ggdb -Os -ffunction-sections -fdata-sections -mthumb -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 -mfloat-abi=softfp
  • The implementation will be in C using BSPACM, with the generated assembly code inspected to ensure the sequences as defined above are what has been tested (my reconstruction of the key pieces follows this list).
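
The gist is authoritative, but for orientation here’s my reconstruction of the key pieces. Each delay macro emits exactly one Thumb instruction, and the cycle counter is CMSIS’s DWT->CYCCNT, which is consistent with the “ldr rN, [r3, #4]” loads in the listings below (CYCCNT sits at offset 4 in the DWT block):

/* Reconstruction for illustration; see the gist for the real code. */
#define DELAY_INSN_NOP()     __asm volatile ("nop")
#define DELAY_INSN_MOV()     __asm volatile ("mov r8, r8")
#define BSPACM_CORE_CYCCNT() (DWT->CYCCNT)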

The predictions I made prior to starting work were:

  • Null hypothesis (my bet): There will be no measurable cycle count difference in any test cases that vary only in the selected delay instruction. I.e., there is no pipeline difference on the Cortex-M4.
  • “Learn something” result (consistent with my previous claims but not my expectations): For cases where N>0, one cycle fewer will be measured in sequences using NOP than in sequences using MOV R8,R8. I have no prediction whether the context instruction will impact this behavior. I.e., on the Cortex-M4 only one NOP instruction may be absorbed.
  • “Surprise me” result (still consistent with my previous claims but demonstrating a much higher level of technology in Cortex-M4 than I would predict): A difference of more than one cycle will be observed between any two cases that vary only in the selected delay instruction, but the difference has an upper bound less than the sequence length. I.e., the pipeline is so deep multiple decoded instructions can be dropped without impacting execution time.
  • “The universe is borked” result (can’t happen): The duration of a sequence involving NOP is constant regardless of sequence length while the duration of the sequence involving MOV R8,R8 is (in the limit) linear in sequence length. I.e., the CPU is able to decode and discard an arbitrary number of NOP instructions in constant time.

Naturally, things turned out to be a little more complex, but I believe the results are enlightening. The code is available in this github gist.

Here’s the output from the test program:

May 22 2014 06:16:24
System clock 16000000 Hz
Before GPIO context insn: 21
After GPIO context insn: 23
After GPIO context restored: 21
Null context, NOP: 1 2 3 4 5 6 7 8 1
Null context, MOV: 1 3 4 5 6 7 8 9 1
GPIO context, NOP: 7 7 10 10 10 11 12 13 7
GPIO context, MOV: 7 7 10 10 10 11 12 13 7

So what does this say?

First, note that I’ve added diagnostics to confirm that the GPIO context instruction does what it’s supposed to do (enable an unused GPIO module), and that the instruction to reset the context works. Second, the results for each test show the cycle times for the context followed by zero, one, two, …, seven, and zero delay instructions.

Let’s expand the zero-, one-, and two-delay versions of each case to see what it is we’ve timed. These are extracted from main.dis-Os in the gist. Here’s the null context with NOP:

  35:main.c        ****   t0 = BSPACM_CORE_CYCCNT();
  35 0000 214B     		ldr	r3, .L2
  36 0002 5A68     		ldr	r2, [r3, #4]
  36:main.c        ****   *dp++ = BSPACM_CORE_CYCCNT() - t0;
  39 0004 5968     		ldr	r1, [r3, #4]
  40 0006 8A1A     		subs	r2, r1, r2
  42 0008 0260     		str	r2, [r0]
  38:main.c        ****   t0 = BSPACM_CORE_CYCCNT();
  44 000a 5A68     		ldr	r2, [r3, #4]
  39:main.c        ****   DELAY_INSN_NOP();
  48 000c 00BF     		nop
  40:main.c        ****   *dp++ = BSPACM_CORE_CYCCNT() - t0;
  53 000e 5968     		ldr	r1, [r3, #4]
  54 0010 8A1A     		subs	r2, r1, r2
  56 0012 4260     		str	r2, [r0, #4]
  42:main.c        ****   t0 = BSPACM_CORE_CYCCNT();
  58 0014 5A68     		ldr	r2, [r3, #4]
  43:main.c        ****   DELAY_INSN_NOP(); DELAY_INSN_NOP();
  62 0016 00BF     		nop
  65 0018 00BF     		nop
  44:main.c        ****   *dp++ = BSPACM_CORE_CYCCNT() - t0;
  70 001a 5968     		ldr	r1, [r3, #4]
  71 001c 8A1A     		subs	r2, r1, r2
  73 001e 8260     		str	r2, [r0, #8]

Good, so it’s doing what we expect in the most basic case. What about MOV R8,R8?

  80:main.c        ****   t0 = BSPACM_CORE_CYCCNT();
 240 0000 214B     		ldr	r3, .L5
 241 0002 5A68     		ldr	r2, [r3, #4]
  81:main.c        ****   *dp++ = BSPACM_CORE_CYCCNT() - t0;
 244 0004 5968     		ldr	r1, [r3, #4]
 245 0006 8A1A     		subs	r2, r1, r2
 247 0008 0260     		str	r2, [r0]
  83:main.c        ****   t0 = BSPACM_CORE_CYCCNT();
 249 000a 5A68     		ldr	r2, [r3, #4]
  84:main.c        ****   DELAY_INSN_MOV();
 253 000c C046     		mov r8, r8
  85:main.c        ****   *dp++ = BSPACM_CORE_CYCCNT() - t0;
 258 000e 5968     		ldr	r1, [r3, #4]
 259 0010 8A1A     		subs	r2, r1, r2
 261 0012 4260     		str	r2, [r0, #4]
  87:main.c        ****   t0 = BSPACM_CORE_CYCCNT();
 263 0014 5A68     		ldr	r2, [r3, #4]
  88:main.c        ****   DELAY_INSN_MOV(); DELAY_INSN_MOV();
 267 0016 C046     		mov r8, r8
 270 0018 C046     		mov r8, r8
  89:main.c        ****   *dp++ = BSPACM_CORE_CYCCNT() - t0;
 275 001a 5968     		ldr	r1, [r3, #4]
 276 001c 8A1A     		subs	r2, r1, r2
 278 001e 8260     		str	r2, [r0, #8]

Good: those differ only in the delay instruction, and it’s the same number of octets in the instruction stream.

Now let’s see what the bitband assignment does to the instruction sequence when followed by NOP:

 125:main.c        ****   t0 = BSPACM_CORE_CYCCNT();
 444 0000 394A     		ldr	r2, .L8
 126:main.c        ****   CONTEXT_INSN_GPIO();
 446 0002 3A4B     		ldr	r3, .L8+4
 447 0004 0121     		movs	r1, #1
 122:main.c        **** {
 449 0006 30B5     		push	{r4, r5, lr}
 125:main.c        ****   t0 = BSPACM_CORE_CYCCNT();
 455 0008 5468     		ldr	r4, [r2, #4]
 458 000a 1960     		str	r1, [r3]
 127:main.c        ****   *dp++ = BSPACM_CORE_CYCCNT() - t0;
 461 000c 5568     		ldr	r5, [r2, #4]
 462 000e 2C1B     		subs	r4, r5, r4
 464 0010 0460     		str	r4, [r0]
 128:main.c        ****   RESTORE_CONTEXT_INSN_GPIO();
 466 0012 0024     		movs	r4, #0
 467 0014 1C60     		str	r4, [r3]
 130:main.c        ****   t0 = BSPACM_CORE_CYCCNT();
 469 0016 5468     		ldr	r4, [r2, #4]
 131:main.c        ****   CONTEXT_INSN_GPIO();
 472 0018 1960     		str	r1, [r3]
 132:main.c        ****   DELAY_INSN_NOP();
 475 001a 00BF     		nop
 133:main.c        ****   *dp++ = BSPACM_CORE_CYCCNT() - t0;
 480 001c 5368     		ldr	r3, [r2, #4]
 481 001e 1B1B     		subs	r3, r3, r4
 482 0020 4360     		str	r3, [r0, #4]
 134:main.c        ****   RESTORE_CONTEXT_INSN_GPIO();
 484 0022 324B     		ldr	r3, .L8+4
 485 0024 0021     		movs	r1, #0
 486 0026 1960     		str	r1, [r3]
 136:main.c        ****   t0 = BSPACM_CORE_CYCCNT();
 488 0028 5168     		ldr	r1, [r2, #4]
 137:main.c        ****   CONTEXT_INSN_GPIO();
 491 002a 0122     		movs	r2, #1        // *** OOPS
 492 002c 1A60     		str	r2, [r3]
 138:main.c        ****   DELAY_INSN_NOP(); DELAY_INSN_NOP();
 495 002e 00BF     		nop
 498 0030 00BF     		nop
 139:main.c        ****   *dp++ = BSPACM_CORE_CYCCNT() - t0;
 503 0032 2D4A     		ldr	r2, .L8
 504 0034 5368     		ldr	r3, [r2, #4]
 505 0036 5B1A     		subs	r3, r3, r1
 506 0038 8360     		str	r3, [r0, #8]
 140:main.c        ****   RESTORE_CONTEXT_INSN_GPIO();
 508 003a 2C4B     		ldr	r3, .L8+4
 509 003c 0021     		movs	r1, #0
 511 003e 1960     		str	r1, [r3]

Don’t be misled: although the C source shows the read of the cycle counter occurring before some overhead instructions (e.g. the push), the actual read doesn’t occur until offset 8. So what’s being timed is what we want.

Finally, here’s the bitband assignment with MOV R8,R8 as the delay instruction:

 188:main.c        ****   t0 = BSPACM_CORE_CYCCNT();
 726 0000 394A     		ldr	r2, .L11
 189:main.c        ****   CONTEXT_INSN_GPIO();
 728 0002 3A4B     		ldr	r3, .L11+4
 729 0004 0121     		movs	r1, #1
 185:main.c        **** {
 731 0006 30B5     		push	{r4, r5, lr}
 188:main.c        ****   t0 = BSPACM_CORE_CYCCNT();
 737 0008 5468     		ldr	r4, [r2, #4]
 740 000a 1960     		str	r1, [r3]
 190:main.c        ****   *dp++ = BSPACM_CORE_CYCCNT() - t0;
 743 000c 5568     		ldr	r5, [r2, #4]
 744 000e 2C1B     		subs	r4, r5, r4
 746 0010 0460     		str	r4, [r0]
 191:main.c        ****   RESTORE_CONTEXT_INSN_GPIO();
 748 0012 0024     		movs	r4, #0
 749 0014 1C60     		str	r4, [r3]
 193:main.c        ****   t0 = BSPACM_CORE_CYCCNT();
 751 0016 5468     		ldr	r4, [r2, #4]
 194:main.c        ****   CONTEXT_INSN_GPIO();
 754 0018 1960     		str	r1, [r3]
 195:main.c        ****   DELAY_INSN_MOV();
 757 001a C046     		mov r8, r8
 196:main.c        ****   *dp++ = BSPACM_CORE_CYCCNT() - t0;
 762 001c 5368     		ldr	r3, [r2, #4]
 763 001e 1B1B     		subs	r3, r3, r4
 764 0020 4360     		str	r3, [r0, #4]
 197:main.c        ****   RESTORE_CONTEXT_INSN_GPIO();
 766 0022 324B     		ldr	r3, .L11+4
 767 0024 0021     		movs	r1, #0
 768 0026 1960     		str	r1, [r3]
 199:main.c        ****   t0 = BSPACM_CORE_CYCCNT();
 770 0028 5168     		ldr	r1, [r2, #4]
 200:main.c        ****   CONTEXT_INSN_GPIO();
 773 002a 0122     		movs	r2, #1        // *** OOPS
 774 002c 1A60     		str	r2, [r3]
 201:main.c        ****   DELAY_INSN_MOV(); DELAY_INSN_MOV();
 777 002e C046     		mov r8, r8
 780 0030 C046     		mov r8, r8
 202:main.c        ****   *dp++ = BSPACM_CORE_CYCCNT() - t0;
 785 0032 2D4A     		ldr	r2, .L11
 786 0034 5368     		ldr	r3, [r2, #4]
 787 0036 5B1A     		subs	r3, r3, r1
 788 0038 8360     		str	r3, [r0, #8]
 203:main.c        ****   RESTORE_CONTEXT_INSN_GPIO();
 790 003a 2C4B     		ldr	r3, .L11+4
 791 003c 0021     		movs	r1, #0
 793 003e 1960     		str	r1, [r3]

So now that we’ve seen what’s being timed, let’s look at results again:

Null context, NOP: 1 2 3 4 5 6 7 8 1
Null context, MOV: 1 3 4 5 6 7 8 9 1

NOP consistently introduces a one-cycle delay, which is what we old-timers would expect an opcode named “NOP” to do. The MOV R8,R8 instruction also introduces a one-cycle delay, but only when it can be pipelined; a single instance in isolation takes two cycles.

What’s the effect when a complex context instruction is used?

GPIO context, NOP: 7 7 10 10 10 11 12 13 7
GPIO context, MOV: 7 7 10 10 10 11 12 13 7

This result requires a little analysis. If you look at the code, the instruction sequences with zero and one delay instruction are what we want to time. With two delay instructions the compiler happens to have loaded the RHS of the bitband store operation into a register within the timed sequence, at the line marked *** OOPS in the listing above.

From experience with MSP430 I normally use -Os when compiling, since that enables optimizations designed to reduce code size. These optimizations tend to be a little weak; when -O2 is used instead of -Os the compiler is smarter and doesn’t do the load within the timed sequence:

Null context, NOP: 1 2 3 4 5 6 7 8 1
Null context, MOV: 1 3 4 5 6 7 8 9 1
GPIO context, NOP: 7 7 7 7 7 8 9 10 7
GPIO context, MOV: 7 7 7 7 7 8 9 10 7

You can go look at main.dis-O2 to check out what’s being timed here, but I claim it’s exactly what should be timed.

What this shows is that the peripheral bitband write takes six cycles to complete (subtracting the 1 cycle timing overhead), and the delay instruction gets absorbed into that regardless of which type of delay instruction is used. (Why it takes six cycles is a different question. A bitband write to an SRAM address instead of the peripheral register took five. I don’t know whether the pipeline has six/seven stages, or something else is stalling the CPU.)

My conclusions:

  • Don’t muck about trying to be clever: for a one-cycle delay just use __NOP(), the ARM CMSIS standard spelling for an inline function that emits the NOP instruction. Where it has an effect, it’s a one-cycle effect. Where it doesn’t, other instructions don’t behave any better.
  • The effect of the pipeline is much bigger than I anticipated: not only does the Cortex-M take advantage of the permission granted by the architected hint that __NOP() may be dropped from the execution stage, but the impact of the peripheral write eliminates the difference between a one-cycle and a two-cycle instruction.

What this really means is that attempts to do small (1-3) cycle delays have fragile dependencies on the surrounding instructions, which in turn depend on the compiler and its optimization flags. If you’re getting a hard fault because you manipulate a module register too quickly after enabling the module, insert a __NOP() or two and see if it works. If the exact cycle count of the code you write is critical, you’re going to have to analyze it in context.
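
In practice the advice reduces to a pattern like this sketch, using the CMSIS-style TM4C register names from the experiment; the right number of __NOP()s is something you confirm empirically on your hardware:

#include "TM4C123GH6PM.h"   /* vendor CMSIS device header; brings in __NOP() */

void enable_port_f(void)
{
    SYSCTL->RCGCGPIO |= (1U << 5);  /* enable clock to GPIO port F */
    __NOP();                        /* absorbable one-cycle delays */
    __NOP();                        /* while the clock gate settles */
    GPIOF->DIR |= (1U << 1);        /* now touch the module safely */
}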

On using current toolchains

An excerpt from a discussion on the TI MSP430 Forums regarding trade-offs between sticking with an old toolchain that you’re used to, and continually updating:

You can stick with an existing, “well understood” system, and assume that you’re safe because it passes what you think is important to test. Or you can keep up to date with what’s provided by a vendor (who sees a lot more use cases and variations than you do). This is a management choice.

All I can say is that, in my own multi-decade experience, the biggest long-term source of destabilization comes not from regular updates to the current toolchain, but from staying with old tools until something happens that forces you to make a multi-version jump to a new compiler. (And I agree that a new version is a new compiler and cannot just be assumed to work; this is why one should develop complete regression suites with test harnesses to check the “can’t happen but actually did once” situations.) I can’t see what happens in proprietary systems, but it’s been many years since an update to GCC has resulted in my discovery of an undesirable behavioral change that wasn’t ultimately a bug in my own code, with the fix improving quality for that code and all code I’ve worked on since.

If you’re operating in a regulated environment where the cost of updating/certifying is prohibitive, so be it. The best approach in that case is to stick with the toolchain used for the original release of the product, and release new products with the most recent toolchain so you’re always taking advantage of the best available solution at the time.

I’m not saying there’s a universally ideal policy, e.g. that you should always use the current toolchain. I am saying that a shop that develops and releases new products using old toolchains without a strong reason behind that decision is not using best practices and is likely to produce an inferior product. If management thinks they’re saving money and reducing risk by not updating, there’s a good chance they’re being short-sighted.

EXP430FR5969 Launchpad First Experiences

Two years ago a launchpad version of TI’s ultra-low-power FRAM-based Wolverine chip was demonstrated at Embedded World 2012. The MSP-EXP430FR5969 was finally released a couple of weeks ago.

I first got a Wolverine (MSP430FR5969) chip back in August 2012 by badgering TI to send them to me and Daniel Beer so the open source toolchain comprising mspgcc and mspdebug would support them when the launchpad was released. That lash-up has been in BSP430 since October 2012, but wasn’t really usable. As of today, full support for EXP430FR5969 has been added to BSP430.

The device has a standard 14-pin JTAG header, and also supports the eZ-FET emulator through the micro-B USB connector as the first of the two interfaces (generally showing up as /dev/ttyACM0 on Linux). This emulator can be used with the “ezfet” driver under mspdebug. Which I suppose is good since you don’t need the MSP-FET430UIF, but check this comparison:

llc[287]$ time mspdebug ezfet "prog app.elf" > /dev/null
real    0m34.338s
user    0m0.064s
sys     0m0.236s
llc[288]$ time mspdebug tilib "prog app.elf" > /dev/null
real    0m8.095s
user    0m0.032s
sys     0m0.016s
llc[289]$ msp430-size app.elf 
   text    data     bss     dec     hex filename
  30011     542    1370   31923    7cb3 app.elf

Four times faster with the FET430UIF. 15 seconds is overhead; the rest is just that eZ-FET is slower per byte transferred. I lived with this for six hours before I was comfortable enough to try the same fix I used for the EXP430F5529LP: disconnect everything, plug in only the EXP430FR5969, then run:

# verify you get a notice saying the firmware needs to be updated
mspdebug tilib
# do the update
sudo mspdebug tilib --allow-fw-update

Yes, you do have to do that as root even if you have udev rules allowing user-level access to the device: during the firmware update the vendor/product IDs change and those rules won’t apply. Now I can put the FET430UIF back in the closet.

Other miscellaneous items of interest:

  • In the current silicon and user’s guide, PM5CTL0.LOCKLPM5 powers up set. Unless explicitly cleared, no GPIO configuration takes effect. This includes useful things like “I’m alive” LED blinks. Since that behavior wasn’t present in the original FR58xx user’s guide or the old XSPFR5969 chips I got two years ago, I spent a few minutes wondering why the board didn’t blink after programming. KatiePier at 43oh to the rescue and now BSP430 clears that bit when the board is initialized. (And yes, of course you can prevent it from doing so, or hook in before it happens, so you can properly handle wakeups from LPMx.5.) A minimal sketch of the gotcha follows this list.
  • The micro-B USB port provides a back-channel UART that shows up as the second interface (/dev/ttyACM1). This uses EUSCI A0, while EUSCI A1 is connected to the standard launchpad UART pins (A.3 and A.4).
  • One of BSP430’s board-raising applications sends the MSP430 clock signals to headers so they can be externally validated; that’d be MCLK (the master CPU clock); SMCLK (sub-master for peripheral clocks); and ACLK (auxiliary clock). The EXP430FR5969 does not make any of these easy to get to: they’re only accessible on the JTAG header.
  • I’m unable to set the clock speed above 8 MHz; the next setting is 16 MHz, and an oscillator fault is generated immediately after CSCTL3.DIVM is cleared to produce an undivided MCLK (the power-up setting divides by 8). This is not an erratum listed in SLAZ473E, but 16 MHz is the maximum listed operating speed. I was able to run at 12 MHz by using a 24 MHz DCOCLK with DIVM_2.
  • The CC3000BOOST booster pack works if powered through the DC jack. The Anaren Air CC110L booster pack also works.
  • Each device has a unique 128-bit number in the TLV region with tag 0x15.
  • The ADC12_B is a 32-channel converter, and the standard internal temperature and voltage inputs are disabled by default (and are not on the traditional channels 10 and 11). The new revision E chips are subject to erratum ADC40, described as

    The ADC module may return large errors in its conversion results. The probability of conversion results with large errors varies depending on temperature and VCC.

    and notes that the workaround is “None”. I’m seeing 5% errors randomly on temperature and voltage reads with all three reference voltage levels.
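
Here’s the LOCKLPM5 gotcha from the first item above in minimal form. The register and bit names are TI’s msp430.h spellings; the rest is illustrative:

#include <msp430.h>

int main(void)
{
    WDTCTL = WDTPW | WDTHOLD;   /* stop the watchdog */

    P1DIR |= BIT0;              /* configure the LED pin... */
    P1OUT |= BIT0;              /* ...but the pin stays high-impedance */

    PM5CTL0 &= ~LOCKLPM5;       /* only now does the configuration take effect */

    while (1) {
    }
}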

So: The device works, but is clearly still experimental (as indicated by the X430FR5969 part number, and the warning sheet included in the box). Good enough to bring BSP430 up to twelve supported platforms, though.

Python 2/3 Compatible Source and PyXB

I started work on PyXB just over five years ago. At the time, Python 3.0 had just come out, but was far too new to hassle with, so I made Python 2.4 the minimum required version.

In September 2011 people started to hint they’d like Python 3 support, but it looked like it’d be an awful lot of work, and nobody asked officially, so I just kept it in the back of my mind. In June 2012 the noise was getting harder to ignore, so I logged the request but didn’t take it further.

Over the next year or so PyXB’s unicode support got stronger, and I started understanding exactly how much easier it’d be to do XML with a proper distinction between text (i.e., unicode) and data (i.e., octet sequences). Python 2 did this poorly, but the difference is deeply embedded in Python 3. In September 2013 I finally created a branch for Python 3 off the 1.2.3 release. This involved running 2to3 over the source then running a second script to fix the resulting errors. This was good enough to make available for folks who could build from the repository, but couldn’t support packaging a version because converting the source was too complex to run on an end-user’s machine.

While investigating an installation problem that ultimately turned out to be a bug in pip I discovered six. Six is a single module, released under the MIT license, that can be integrated into a Python package to allow the same source code to work under both Python 2 and Python 3. No more running 2to3. No more fixing up the mess 2to3 makes when it changes pyxb.utils.unicode to pyxb.utils.str.

As of today, the next branch of PyXB passes all tests using Python version 2.6 up through 3.4.0rc1 without source-code changes. Well, ok, some unit tests fail because whitespace in formatted XML changed in 2.7; the unittest.TestCase.assertRaises context manager feature isn’t handled in 2.6, 3.0, or 3.1; and I haven’t tested 3.0.1 because hacking its configure script so it can build a functional hashlib module on Ubuntu 12.04 isn’t worth the effort. Nonetheless, PyXB itself works fine.

There’s more work to be done. A packaged PyXB includes generated bindings for about 186 namespaces. When building from the repository those can be generated with the same Python that’ll be running them, so they might include Unicode literals, which won’t work on Python 3 versions before 3.3, when the u'text' prefix became valid again. But the big hurdle has been overcome, and the next PyXB release should support all Python versions from 2.6 onward.

Editing XML schemas on Emacs with nXML

While looking into DocBook recently, I discovered that GNU Emacs finally has a high-quality XML editing mode that includes validation of the documents. nXML mode is integrated into Emacs 23, and it comes with RELAX NG grammars to support DocBook editing, though only for DocBook 4.2.

For work on PyXB, though, I really need something that handles XML Schema Definition (XSD) documents. emacs nXML doesn’t come with XSD support, but the RELAX NG homepage points to Jeni Tennison’s schema as a candidate.

This is a great start, but when I tried using it with xmllint from libxml2 to validate some schemas supported by PyXB it said they were invalid. There are a variety of subtle issues the original version didn’t quite get right (and a few cases where my example schemas were wrong). I’ve updated the schema to fix those issues, and made it available on github.

emacs nXML comes with an XSLT RELAX NG schema, but only for version 1.0. As XSLT 3 is nearly complete at the time I’m writing this, I was hoping to find support to validate against other XSLT versions as well. Turns out Norman Walsh has provided a unified solution for XSLT 1.0, 2.0, and 3.0 on github.

So: To support XSD and XSLT editing with nXML in Emacs 23, I put this in my .emacs file:

;; nXML mode customization
(add-to-list 'auto-mode-alist '("\\.xsd\\'" . xml-mode))
(add-to-list 'auto-mode-alist '("\\.xslt\\'" . xml-mode))
(add-hook 'nxml-mode-hook
          (lambda ()
            (make-local-variable 'indent-tabs-mode)
            (setq indent-tabs-mode nil)
            (add-to-list 'rng-schema-locating-files
                         "~/.emacs.d/nxml-schemas/schemas.xml")))

I copied the original schema from the rng4xsd and xslt-relax-ng repositories, and used Trang to convert from the standard RELAX NG XML syntax to the compact syntax used by nXML.
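
The conversion itself is a one-liner per schema. Assuming trang.jar is somewhere convenient (its location and the file names are up to you), something like:

java -jar trang.jar xsd10.rng xsd10.rnc
java -jar trang.jar xslt30.rng xslt30.rnc

Then the following goes into ~/.emacs.d/nxml-schemas/schemas.xml: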

<locatingRules xmlns="http://thaiopensource.com/ns/locating-rules/1.0">
  <!-- Extend to support W3C XML Schema Definition Language, which as
       of 1.1 are known as "XSD" rather than "XML Schema" to avoid
       confusion with other XML schema languages such as RelaxNG. -->
  <uri pattern="*.xsd" typeId="XSD"/>
  <namespace ns="http://www.w3.org/2001/XMLSchema" typeId="XSD"/>
  <documentElement localName="schema" typeId="XSD"/>
  <typeId id="XSD 1.0" uri="xsd10.rnc"/>
  <typeId id="XSD" typeId="XSD 1.0"/>

  <!-- Extend to support all three XSLT variants.  These are all in
       the same namespace, but are distinguished by a version
       attribute in the document element.  If unqualified, a catch-all
       version is used. -->
  <uri pattern="*.xsl" typeId="XSLT"/>
  <uri pattern="*.xslt" typeId="XSLT"/>
  <namespace ns="http://www.w3.org/1999/XSL/Transform" typeId="XSLT"/>
  <typeId id="XSLT 1.0" uri="xslt10.rnc"/>
  <typeId id="XSLT 2.0" uri="xslt20.rnc"/>
  <typeId id="XSLT 3.0" uri="xslt30.rnc"/>
  <typeId id="XSLT" uri="xslt.rnc"/>
</locatingRules>

Now when I write my XSD schemas in emacs, the mode line tells me when they’re invalid, and I can use C-c C-n to jump to the error, with the explanation placed in the message line. Very nice.