Zephyr | pabigot

Recently I was asked to provide an assessment of Zephyr for startups that might want to use it. This post provides that assessment, and expands on it with some of the background for its conclusions and why I’ve again given up on using Zephyr for my own projects.

Here’s my assessment:

Zephyr will be a de-facto success at least for the next few years because it can’t be seen as a failure: too many really large companies have too much invested in it. It will be widely adopted by the member companies and those who follow them. Many vendors already use Zephyr as the basis of their SDKs, and more will join them.
But anybody who expects Zephyr to “just work” will be disappointed. Instead they should expect to run into functionality and quality gaps that will be frustrating and will delay progress toward whatever their actual goal is. Companies that are not members may find it difficult to get the attention of the people who are already working on solutions that may only meet the needs of their specific member companies. Even member companies can encounter these blocks, though at least they have a voice in the technical steering committee.
So a startup could succeed with Zephyr by going in with open eyes and an expectation of discovering, reporting, and possibly working around or fixing bugs and functional gaps. Fixes might be contributed upstream, or may be more easily maintained in a fork that isn’t destabilized by upstream activity. But expect that core Zephyr will continue to be reworked at deep levels for years to come.

My Experiences with Zephyr

In my previous post on Zephyr I outlined my concerns with Zephyr as an embedded software platform, and noted that despite them I would continue to use it and to contribute small items.

As often happens small items became larger, and four months later I negotiated a contract with one of the platinum member companies to be paid for work that we both agreed was of value to Zephyr.

My involvement turned out to be pretty heavy: between contracted and personal contributions I’ve been in the top ten contributors (by commits merged) for six releases (v2.0.0 through v2.5.0) since the v1.14.0 Long Term Support release two years ago, and am ranked third overall for that span.

At the time of writing I’ve submitted to Zephyr:

276 issues, of which 60 are still open;
602 pull requests, of which 19 are still open. Of the closed ones 498 were merged, and 83 (14%) were not (superseded or failed to gain support);
Uncounted issue comments and PR reviews.

All of which is to say: I have a pretty good idea of the state of Zephyr as a project and as a software system, and what it takes to use it and to contribute to it.

After more than 2000 hours contributing to Zephyr I’ve come to these conclusions:

Zephyr is only slightly closer to having the stability and functionality I want for my wireless sensor applications. Some gross bugs have been fixed, and a couple useful new features added, but relative to the homegrown solution I abandoned in 2018 it’s lacking in sensor API, device state management, C++ support, and system status monitoring (telemetry).
There is no doubt that those gaps can be closed from a technical perspective, but I estimate the effort to achieve parity with my previous framework at roughly 1000 hours of my time.
While that would get me what I need, the cost of getting those changes into upstream Zephyr would be two or three times higher, and many would never get the support required for them to be merged.

As a result while my contributions may be of value to Zephyr, Zephyr doesn’t have enough value to me to be worth the pain of making those contributions, so I’ve dissociated myself from the areas I was maintaining and am moving on.

My Concerns with Zephyr

The evangelists will tell you Zephyr is the best open source option for embedded applications, and I can’t say it isn’t. It has support for a wide range of systems, from embedded low-power microcontrollers to multi-core 64-bit networked hosts. It supports a huge number of interfaces. There’s a whole web site telling you all the great things about Zephyr; no need for me to repeat them here.

But that’s high-level promotion. Areas to examine closely include:

Does it have the functionality you need? 62% (over 700) of the open issues are enhancements, which through 2.5.0 have not been tracked and include major functionality like pinmux control that have made no progress in over two years.
Does it meet your reliability expectations? Coding standard conformance is ongoing and retroactive: guidelines are still being finalized, and existing code is being patched to conform to automated safety and security tests. This is a much lower bar than software that’s developed with safety and security in mind, with requirements, design, testing, and validation artifacts to support its quality claims. As an example: Zephyr is supposed to support symmetric multiprocessing, but while working on the last major feature I contributed I found that much of it has race conditions if preemptive threads or symmetric multi-processing are present.

To me the biggest Zephyr problems aren’t technical. They are:

ad hoc and reactive processes;
a lack of coordinated resource management;
a lack of consistent support for non-member contributors.

These are mostly unchanged from my assessment two years ago. Some problematic areas are summarized below.

Code Reviews

Though there is now a concept of area maintainers who have a nominal responsibility to ensure contributions make progress, there is no infrastructure to track that progress, and only a very few maintainers are prompt in reviewing and merging. Anybody can block proposed code for any reason, and if no consensus can be found the contribution is left to stagnate.

This can result in a negative experience, for new contributors as well as existing ones. Despite multiple pleas I couldn’t get several PRs reviewed in time for the 2.5.0 merge, and several documentation updates remain stalled six months after submitting them.

Evolution and Planning

There is no project-level architectural oversight to coordinate and manage change across subsystems, or to ensure stakeholders are identified and given an opportunity to participate. No completed Zephyr software development task I’ve seen has included design artifacts (requirements, use cases, architectural designs, or implementation or test plans) that go deeper than unmaintained and often unpublished slide decks. API and implementations can be merged with as little oversight as approval by two people from the same company as the submitter. This means phased efforts can appear to make progress because pieces of them are merged, but then fail because somebody who wasn’t aware of the effort notices them and objects.

Process and Stakeholder Involvement

Where there is a process, it’s not always followed. As one recent example: Much Zephyr API was written to return -ENOTSUP as a error where -ENOSYS would have been a better choice, for reasons that are not clear. This has been rediscovered multiple times since 2018, most recently last month when it was again discussed in the API telecon and conclusion recorded (paraphrased: “yeah, that’s wrong, but it needs to be addressed in a wholesale review of error code consistency after the next LTS”).

After that meeting concluded, discussion continued in another meeting and a PR was introduced to change the usage. That PR was submitted, approved, and merged without inviting all stakeholders from the just-completed discussion to review it, and without following the documented process for API changes that may require existing code to be modified to maintain current behavior.

Conclusion

This is a trenchant comment from a memo I sent earlier this year to a subset of TSC members who are active in the technical aspects of Zephyr evolution:

Reviewing my Zephyr experiences over the last two years fails to reveal a single compelling example of a major Zephyr feature or task for which there was/is a well-defined plan with full stakeholder acceptance that was executed in a timely manner to a successful conclusion that met its documented goals.

The memo also included details of five areas where I felt Zephyr wasn’t meeting reasonable expectations. The memo had no visible practical impact. In combination with the functional gaps that keep Zephyr from meeting my low-power wireless sensor needs this was enough to convince me that my time is better spent elsewhere.

My disengagement is nearing completion. I’ve started updating my software engineering and development skills, which have stagnated working on a RTOS that can’t move beyond C99. I’m looking forward to taking a deep dive into Rust for application work, and updating my back-end processes for data aggregation to ES2020 and perhaps even Typescript.

And I’ve grabbed the most recent S140 soft device from Nordic and will be refreshing nrfcxx to add support for the sensor hardware that’s been sitting in boxes for two years while I tried to get Zephyr to a state where I could use it. I won’t have Bluetooth mesh, or OTA firmware updates, or accessible recorded data synchronized to civil time, but I will be able to expand my collection of reliable low-power devices that provide beaconed environmental measurements. And it won’t take me (more) years to get them deployed.

In mid November 2018 I’d gotten nrfcxx to the point where it met my first-level needs: low-power Nordic nRF5-based Bluetooth beacons providing sensor data at 1 Hz acquisitions from ambient and enclosed environments including HVAC systems.

nrfcxx has two significant weaknesses, though:

There’s no support for over-the-air firmware updates;
There’s no support for Bluetooth central/peripheral roles that would allow configuration at the device level (e.g. to provide calibration values).

Around this time I discovered the Zephyr Project, a real-time operating system that emerged from Intel’s Open Source Technology Center and was subsequently adopted by the Linux Foundation. Many of the major silicon providers for IoT applications are members of this project. Best of all, it included not only support for a boot loader, but also a complete Bluetooth stack contributed by Intel and Nordic Semiconductor which included support for Bluetooth Mesh.

So I decided to devote up to three months full-time effort to a deep dive into Zephyr, to see if it was a better path forward for me than nrfcxx.

I submitted my first patch 2018-11-18. Over the next three months I opened 45 issues and submitted 28 pull requests, of which 24 were merged.

Things got a little rocky from the start. I2C on Nordic didn’t work because the API specified a behavior that the driver didn’t support. The kernel command to wait for short periods (measured in microseconds) was horribly inaccurate on Nordic hardware. The system timer implementation was broken too, in an unrelated way.

All this in the first two weeks. This set the stage for the next ten weeks. Some things worked. More didn’t. Ultimately by early February I stopped actively contributing, primarily due to what I perceived as governance failures.

This is what exhausted my patience: Early on, after discussion with and general approval from participants in the weekly API telecon, I submitted a solution for gaps in GPIO configuration that added the features I needed without changing existing behavior. It took weeks and prodding before reviews came in; among several that were positive were a couple that could uncharitably be described as “I want it done this other way instead”. While both approaches could be justified, I wasn’t willing to relax the requirement for full backwards compatibility in the solution I provided, and the other parties weren’t willing to accept my work as a step towards a more significant rework by somebody else in the future. Deadlock. Two months after submission and ongoing back-and-forth it ended up in the Technical Steering Committee which chose to discard both proposals and revisit the issue for the next stable release.

Let me be clear: I don’t have a problem with that as a resolution. I do have a problem that the pull request sat for nine weeks, with ongoing discussion, and had to escalate to the highest decision-making body in the project before anybody could decide what should be done.

Linux has been a success in large part because it had a strong architectural lead who decided the bounds of acceptable technical solutions. Zephyr doesn’t have this: it depends on a contributor-based consensus model where merges require at least one approving review, no requests-for-change, and somebody who has commit privileges being interested enough to perform the merge. There is nobody with both the authority and the responsibility to ensure that architectural and process decisions are made in a timely manner.

There are other, related concerns. Several companies have employees who are paid to work on Zephyr, generally providing new features. The review process and lack of project-level architectural oversight has allowed solutions to be proposed, accepted, and merged entirely based on one perspective. There is little evidence of managed early coordination on core technologies like power management, so people can spend a lot of time working on a solution that meets their needs, before finding out that it’s completely unsuitable for other applications. This is exacerbated by the wide range of target platforms, from multi-core DSP engines and X86_64 processors all the way down to ARM Cortex-M1 devices. The needs of a mains-connected table-top voice-activated personal assistant are vastly different from those of a low-cost battery-powered temperature sensor with minimum five-year expected lifespan bolted into ductwork in the ceiling.

At the end I stepped away, having concluded that key Zephyr capabilities such as GPIO, I2C, SPI, timers, and sensors were functionally incomplete, unpleasant to use, or so abstracted their overhead made them unsuitable for use in ultra-low-power wireless sensors.

All that said: I really want to use Bluetooth mesh.

Zephyr is a “too big to fail” project: it’s got strong backing, and is around for the long haul. It really doesn’t have any viable competitors. I’m also a strong believer that, when there’s an existing solution to the problem you have, you need a rock solid reason why you won’t use it. And “I just don’t like it” isn’t good enough.

So, six weeks later I’m coming back to Zephyr. But the project as it stands still fails to meet my needs with core capabilities such as timers, GPIO, I2C, sensors, introspection of system status, and other functionality necessary for robust low-power sensors. I’ll submit PRs for small things, but I’m not motivated to speculatively contribute the more significant changes, so I’m doing work in my Zephyr fork.

Some of the patches on those branches are worth discussing, so as time permits I may describe them here, and maybe somebody will be interested enough to assist in getting them into Zephyr.

pabigot

design and implementation of software-intensive systems

Tag Archives: Zephyr

Zephyr Two Years Later