Zephyr Two Years Later

Recently I was asked to provide an assessment of Zephyr for startups that might want to use it. This post provides that assessment, and expands on it with some of the background for its conclusions and why I’ve again given up on using Zephyr for my own projects.

Here’s my assessment:

Zephyr will be a de-facto success at least for the next few years because it can’t be seen as a failure: too many really large companies have too much invested in it. It will be widely adopted by the member companies and those who follow them. Many vendors already use Zephyr as the basis of their SDKs, and more will join them.
But anybody who expects Zephyr to “just work” will be disappointed. Instead they should expect to run into functionality and quality gaps that will be frustrating and will delay progress toward whatever their actual goal is. Companies that are not members may find it difficult to get the attention of the people who are already working on solutions that may only meet the needs of their specific member companies. Even member companies can encounter these blocks, though at least they have a voice in the technical steering committee.
So a startup could succeed with Zephyr by going in with open eyes and an expectation of discovering, reporting, and possibly working around or fixing bugs and functional gaps. Fixes might be contributed upstream, or may be more easily maintained in a fork that isn’t destabilized by upstream activity. But expect that core Zephyr will continue to be reworked at deep levels for years to come.

My Experiences with Zephyr

In my previous post on Zephyr I outlined my concerns with Zephyr as an embedded software platform, and noted that despite them I would continue to use it and to contribute small items.

As often happens small items became larger, and four months later I negotiated a contract with one of the platinum member companies to be paid for work that we both agreed was of value to Zephyr.

My involvement turned out to be pretty heavy: between contracted and personal contributions I’ve been in the top ten contributors (by commits merged) for six releases (v2.0.0 through v2.5.0) since the v1.14.0 Long Term Support release two years ago, and am ranked third overall for that span.

At the time of writing I’ve submitted to Zephyr:

276 issues, of which 60 are still open;
602 pull requests, of which 19 are still open. Of the closed ones 498 were merged, and 83 (14%) were not (superseded or failed to gain support);
Uncounted issue comments and PR reviews.

All of which is to say: I have a pretty good idea of the state of Zephyr as a project and as a software system, and what it takes to use it and to contribute to it.

After more than 2000 hours contributing to Zephyr I’ve come to these conclusions:

Zephyr is only slightly closer to having the stability and functionality I want for my wireless sensor applications. Some gross bugs have been fixed, and a couple useful new features added, but relative to the homegrown solution I abandoned in 2018 it’s lacking in sensor API, device state management, C++ support, and system status monitoring (telemetry).
There is no doubt that those gaps can be closed from a technical perspective, but I estimate the effort to achieve parity with my previous framework at roughly 1000 hours of my time.
While that would get me what I need, the cost of getting those changes into upstream Zephyr would be two or three times higher, and many would never get the support required for them to be merged.

As a result while my contributions may be of value to Zephyr, Zephyr doesn’t have enough value to me to be worth the pain of making those contributions, so I’ve dissociated myself from the areas I was maintaining and am moving on.

My Concerns with Zephyr

The evangelists will tell you Zephyr is the best open source option for embedded applications, and I can’t say it isn’t. It has support for a wide range of systems, from embedded low-power microcontrollers to multi-core 64-bit networked hosts. It supports a huge number of interfaces. There’s a whole web site telling you all the great things about Zephyr; no need for me to repeat them here.

But that’s high-level promotion. Areas to examine closely include:

Does it have the functionality you need? 62% (over 700) of the open issues are enhancements, which through 2.5.0 have not been tracked and include major functionality like pinmux control that have made no progress in over two years.
Does it meet your reliability expectations? Coding standard conformance is ongoing and retroactive: guidelines are still being finalized, and existing code is being patched to conform to automated safety and security tests. This is a much lower bar than software that’s developed with safety and security in mind, with requirements, design, testing, and validation artifacts to support its quality claims. As an example: Zephyr is supposed to support symmetric multiprocessing, but while working on the last major feature I contributed I found that much of it has race conditions if preemptive threads or symmetric multi-processing are present.

To me the biggest Zephyr problems aren’t technical. They are:

ad hoc and reactive processes;
a lack of coordinated resource management;
a lack of consistent support for non-member contributors.

These are mostly unchanged from my assessment two years ago. Some problematic areas are summarized below.

Code Reviews

Though there is now a concept of area maintainers who have a nominal responsibility to ensure contributions make progress, there is no infrastructure to track that progress, and only a very few maintainers are prompt in reviewing and merging. Anybody can block proposed code for any reason, and if no consensus can be found the contribution is left to stagnate.

This can result in a negative experience, for new contributors as well as existing ones. Despite multiple pleas I couldn’t get several PRs reviewed in time for the 2.5.0 merge, and several documentation updates remain stalled six months after submitting them.

Evolution and Planning

There is no project-level architectural oversight to coordinate and manage change across subsystems, or to ensure stakeholders are identified and given an opportunity to participate. No completed Zephyr software development task I’ve seen has included design artifacts (requirements, use cases, architectural designs, or implementation or test plans) that go deeper than unmaintained and often unpublished slide decks. API and implementations can be merged with as little oversight as approval by two people from the same company as the submitter. This means phased efforts can appear to make progress because pieces of them are merged, but then fail because somebody who wasn’t aware of the effort notices them and objects.

Process and Stakeholder Involvement

Where there is a process, it’s not always followed. As one recent example: Much Zephyr API was written to return -ENOTSUP as a error where -ENOSYS would have been a better choice, for reasons that are not clear. This has been rediscovered multiple times since 2018, most recently last month when it was again discussed in the API telecon and conclusion recorded (paraphrased: “yeah, that’s wrong, but it needs to be addressed in a wholesale review of error code consistency after the next LTS”).

After that meeting concluded, discussion continued in another meeting and a PR was introduced to change the usage. That PR was submitted, approved, and merged without inviting all stakeholders from the just-completed discussion to review it, and without following the documented process for API changes that may require existing code to be modified to maintain current behavior.

Conclusion

This is a trenchant comment from a memo I sent earlier this year to a subset of TSC members who are active in the technical aspects of Zephyr evolution:

Reviewing my Zephyr experiences over the last two years fails to reveal a single compelling example of a major Zephyr feature or task for which there was/is a well-defined plan with full stakeholder acceptance that was executed in a timely manner to a successful conclusion that met its documented goals.

The memo also included details of five areas where I felt Zephyr wasn’t meeting reasonable expectations. The memo had no visible practical impact. In combination with the functional gaps that keep Zephyr from meeting my low-power wireless sensor needs this was enough to convince me that my time is better spent elsewhere.

My disengagement is nearing completion. I’ve started updating my software engineering and development skills, which have stagnated working on a RTOS that can’t move beyond C99. I’m looking forward to taking a deep dive into Rust for application work, and updating my back-end processes for data aggregation to ES2020 and perhaps even Typescript.

And I’ve grabbed the most recent S140 soft device from Nordic and will be refreshing nrfcxx to add support for the sensor hardware that’s been sitting in boxes for two years while I tried to get Zephyr to a state where I could use it. I won’t have Bluetooth mesh, or OTA firmware updates, or accessible recorded data synchronized to civil time, but I will be able to expand my collection of reliable low-power devices that provide beaconed environmental measurements. And it won’t take me (more) years to get them deployed.

pabigot

design and implementation of software-intensive systems