IoT, drones, self-driving cars: hard problems that combine hardware, software, and services are all the rage in the tech world today. At Highfive, we’ve been working on problems like these for a while now. A few months before we unveiled the Highfive conference room device, clients, and service to the world, we built a key component spanning all three of those worlds, and that component remains one of my favorite software decisions. This is the story of the problem we were trying to solve and the solution we built to address it.
Coming out of stealth mode
It was the fall of 2014 and our team had just gotten the Highfive conference room device, our software clients, and our service to a state where paying customers could begin to rely on it regularly. Our kickass customer success and sales teams had found early adopter customers to beta test our product. Marketing was gearing up for launch publicity.
Over in engineering, we were fixing bugs and building additional features to expand the variety and scale of customers we could sell to. As a startup, we knew speed was a key advantage. We applied this to our product by shipping a release of every component every other week. Our customers loved us for the speed with which we fixed bugs they reported and built features they wanted. Remember that this was an industry that shipped maybe two updates a year!
There was, however, a little hiccup with shipping updates to the few dozen devices we had deployed in our customers’ conference rooms. In the initial push to get an MVP off the ground, we had punted on building an over-the-air update system into the device for the beta… and the workaround was to send our customer success folks out to rotate devices every couple of weeks so we could manually flash the latest version.
This solution was clearly not going to scale, so we holed up in a conference room (coincidentally named “Fortress of Solitude”) and swore not to come out until we had solved the OTA problem.
Defining the problem
Our team had only a few requirements for our OTA solution:
Requirement 1: We took tests and test automation, feature flagging, and instrumentation of our software very seriously. This helped us catch regressions despite shipping at the speed we were used to. In spite of all of this, we did occasionally screw up and ship something we needed to correct quickly. Back then, it meant we shipped a bug to a few dozen rooms. Today, it would mean breaking rooms in thousands of organizations across continents. So requirement #1 was that our OTA mechanism had to be bulletproof!
Requirement 2: One of the fundamental promises of Highfive was that you got the first 10 minutes of your meeting back. We took pride in conferencing hardware that was easy to set up, easy to administer, and trivial to use. If we made people wait every two weeks while their devices updated, we were breaking that promise. So requirement #2 was that we update invisibly and avoid disrupting meetings whenever possible.
Requirement 3: The whole point of doing all this was speed. So if we kept building new and better software and our customers’ devices stayed behind on old software, we weren’t making forward progress. So requirement #3 was that our devices stayed aggressively up-to-date without our customers worrying about updating the devices themselves.
It would have been hubris to try to build something from scratch only a few months before we launched. However, this was largely before the IoT revolution, so keeping an embedded Linux distribution constantly updated wasn’t exactly top of everyone’s mind. This meant we had limited sources of inspiration, but we did find one project that met all our needs.
Open source to the rescue
We found our inspiration (and a lot more) in a product that shares a lot of our philosophy around updating software: Chrome OS.
Every six weeks or so, Chrome OS ships an update. Millions of Chromebooks around the world quickly and unobtrusively update soon after. The open source sister project, Chromium OS, describes both the tenets and the guts of their update system here and here. More importantly, the code is all open source and licensed under friendly terms. And so our strategy fell into place: we would take the Chromium OS update engine, port it to our little embedded Linux distribution, get it to work with our service and app to deliver updates, and then scale up our little operation into world domination.
For the curious, here’s a little rundown of how the Chrome OS update engine works and what we had to do to get it to work for us:
First, there’s a build-side component that operates on ext3/ext4 filesystem images of the software that goes on the device. It takes both the new and old images, turns them into a full image and a delta image, cryptographically signs them, and generates JSON manifests that describe the images. This part was the easiest to complete. We use Yocto/OpenEmbedded to build our embedded Linux images, and it supports building ext4 images out of the box. Nothing is ever as simple as you think it is, though, and there’s a technicality here too, but I’ll save it for later.
Second, there’s a sample Python service that shows how to serve up manifests in response to RPC requests using the Omaha protocol. Omaha is an XML-based protocol used by both Chrome and Chrome OS clients to request updates. It’s not to be confused with the Omaha binary that Chrome browser clients install on your machine for fetching and applying these updates; Chrome OS uses a component simply called the “update engine” to do this instead, but more on that below. In any case, we used the sample Python service for testing, and built a production version in Java as one of the services on our servers.
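An Omaha-style update check is simple at its core: the client reports its app id and version, and the server answers “noupdate” or points it at a payload. This sketch follows the general shape of the protocol, but the exact attributes and the app id are simplified assumptions, not the full Omaha spec:

```python
import xml.etree.ElementTree as ET

def handle_update_check(request_xml: str, latest_version: str,
                        payload_url: str) -> str:
    """Toy Omaha-style handler: read the client's version out of the
    request and answer with an update response if it's out of date."""
    req = ET.fromstring(request_xml)
    app = req.find("app")

    resp = ET.Element("response", protocol="3.0")
    app_el = ET.SubElement(resp, "app", appid=app.get("appid"))
    check = ET.SubElement(app_el, "updatecheck")
    if app.get("version") == latest_version:
        check.set("status", "noupdate")
    else:
        check.set("status", "ok")
        urls = ET.SubElement(check, "urls")
        # Tell the client where to download the signed payload from.
        ET.SubElement(urls, "url", codebase=payload_url)
    return ET.tostring(resp, encoding="unicode")

request = ('<request protocol="3.0">'
           '<app appid="example-device" version="1.0"><updatecheck/></app>'
           '</request>')
print(handle_update_check(request, "1.1", "https://updates.example.com/"))
```

Our production Java service did the same dance, just backed by the real manifests generated at build time.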
Third, there’s a client-side component: the actual “update engine”. It talks to the update service, downloads the update binaries generated as described earlier, verifies their signatures, and then applies the updates. Chrome OS uses two partitions for what’s called the “root” filesystem, which holds all the software contained in an image. While running off one partition, it always downloads and applies updates to the other partition. This way, you can recover from any kind of screw-up by simply wiping the other partition and starting over, and you *never* affect the software image you’re running. Remember all those times your OS X or Windows machine stopped working because an update process was somehow interrupted? Yup, those never happen to your Highfive device! Just like they never happen to your Chromebook!
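The invariant that makes this scheme bulletproof is worth spelling out: writes only ever land on the inactive slot, and the active slot only flips after the new image verifies. A minimal model (partition names and the `verify_ok` flag are hypothetical stand-ins for real signature checks):

```python
class ABDevice:
    """Model of the A/B root filesystem scheme: the running slot is
    never written to, so an interrupted or corrupt update can only
    ever damage the slot you weren't booted from."""

    def __init__(self):
        self.slots = {"root_a": "v1", "root_b": None}
        self.active = "root_a"

    def inactive(self) -> str:
        return "root_b" if self.active == "root_a" else "root_a"

    def apply_update(self, image: str, verify_ok: bool = True) -> bool:
        target = self.inactive()
        self.slots[target] = image     # write only to the inactive slot
        if not verify_ok:              # bad signature or checksum:
            self.slots[target] = None  # wipe it and try again later;
            return False               # the running image is untouched
        self.active = target           # flip slots on the next reboot
        return True
```

A failed `apply_update` leaves the device exactly where it started, which is the whole point.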
Additionally, the client-side update engine writes a bunch of flags for the bootloader to act on. These are described very well in the Chromium design docs I linked to earlier, but the implication for us was that we needed to modify our bootloader, U-Boot, to read and act on these flags the way your Chromebook’s bootloader does.
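The essence of those flags, per the Chromium OS design docs, is a per-slot priority, a tries counter, and a successful bit: the bootloader prefers the highest-priority slot, burns a try each time it attempts an unproven image, and falls back once the tries run out. Here is a simplified model of the logic we ported into U-Boot (illustrative Python, not the actual bootloader code):

```python
def pick_boot_slot(slots: dict) -> str:
    """Pick the slot to boot: prefer the highest-priority slot that
    is either marked 'successful' or still has boot tries left.
    Decrementing tries on each attempt means a freshly flashed image
    that never boots cleanly eventually exhausts its tries, and the
    bootloader falls back to the old, known-good slot."""
    for name in sorted(slots, key=lambda n: -slots[n]["priority"]):
        slot = slots[name]
        if slot["successful"]:
            return name
        if slot["tries"] > 0:
            slot["tries"] -= 1  # burn one attempt at the new image
            return name
    raise RuntimeError("no bootable slot")
```

Once the new image boots and the update engine is happy, it sets the successful bit so the tries counter stops mattering.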
We encountered two slippery parts in porting the update engine over to our device and having it apply updates successfully. The first was that the engine insisted on bringing over half of Chrome OS as dependencies. Some patient and meticulous unwinding let us keep the really valuable pieces – like compression and decompression libraries – and use replacements from our existing system for all the rest – like date and time utility libraries. The second was the technicality I skipped over when talking about the update image generation. We had to undo all the write assumptions we had made about our filesystem when building our software, and separate it out into software on a read-only updateable filesystem and a writable persistent partition for all manner of settings. The update engine could only apply delta updates if all filesystem images were read-only.
Finally, we used the update engine’s excellent DBus interface to control and query it from our primary Highfive application on the device. In the normal course of events, the update engine would quietly and unobtrusively fetch and apply updates overnight, and then reboot the system. The Highfive app could use the DBus interface to force immediate updates, defer or pause updates when video calls started and receive notifications about the progress of an update.
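In Python terms, driving the engine over DBus looks something like the wrapper below. The method names mirror the Chromium OS update_engine interface of that era (`AttemptUpdate`, `GetStatus` on `org.chromium.UpdateEngineInterface`), but treat the details as illustrative; the stub proxy stands in for a real dbus-python proxy so the sketch runs anywhere:

```python
class UpdateEngineClient:
    """Thin wrapper over the update engine's DBus interface. `proxy`
    is any object exposing the engine's methods: in production, a
    dbus-python proxy for org.chromium.UpdateEngineInterface; here,
    a stub."""

    def __init__(self, proxy):
        self.proxy = proxy

    def force_update(self):
        # Kick off an immediate update check instead of waiting for
        # the nightly cycle (empty args mean "use the defaults").
        self.proxy.AttemptUpdate("", "")

    def status(self) -> str:
        # GetStatus returns (last_checked, progress, operation,
        # new_version, new_size); the operation string is what the
        # app shows, e.g. "UPDATE_STATUS_DOWNLOADING".
        _, _, operation, _, _ = self.proxy.GetStatus()
        return operation

class StubProxy:
    """Stand-in for a DBus proxy, for demonstration only."""
    def __init__(self):
        self.calls = []
    def AttemptUpdate(self, app_version, omaha_url):
        self.calls.append("AttemptUpdate")
    def GetStatus(self):
        return (0, 0.0, "UPDATE_STATUS_IDLE", "", 0)

client = UpdateEngineClient(StubProxy())
client.force_update()
print(client.status())
```

The same handful of calls covered everything the app needed: force an update, pause around calls, and surface progress.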
Happily ever after
I had the pleasure of working with three awesome teammates on this project: Eric, Ed & Sinan. For two months, we scribbled on whiteboards, debated, argued, cheered and sulked in that double-door conference room and finally got the job done! More importantly, it’s now been almost three happy years since we first shipped the update engine, and that was also the last time we needed to manually flash a unit outside the factory and warehouse. We’ve shipped many more awesome things in the years since and also screwed up now and then. Through it all, the OTA engine and service have been our constant friends. In a follow-up post, I’ll share some war stories about how the OTA engine and some work we did around it saved us from recalls and other assorted catastrophes!
Want to join our team to solve problems like these and ship a great product? Then check out our careers page or simply shoot me an email: hemanth at highfive d.o.t com. If you’re interested in reading more blogs from the Highfive engineering team, check out our Medium blog.