At the 2013 GStreamer Conference in Edinburgh, Greg Maxwell from
Xiph.org spoke about the creation of the Opus audio codec, and how that
experience has informed the subsequent development process behind the
Daala video codec. Maxwell,
who is employed by Mozilla to work on multimedia codecs, highlighted
in particular how working entirely in the open gives the project
significant advantages over codecs created behind closed doors. Daala
is still in flux, he said, but it has the potential to equal the
impact that Opus has had on the audio encoding world.
Approaching the battlefield
Mozilla's support for Xiph.org's codec development comes as a
surprise to some, Maxwell said, but it makes sense in light of
Mozilla's concern for the evolution of the web. As multimedia codecs
become more important to the web, they become more important to
Mozilla. And they are important, he said: codecs
are used by all multimedia applications, so every cost associated with
them (such as royalties or license fees) is repeated a million-fold.
On your person right now, he told the audience, "you carry with you
four or five licenses for AAC."
Codec licensing amounts to a billion-dollar tax on communication
software. In addition, it is used as a weapon between battling
competitors, so it even affects people in countries without software
patents.
Moreover, codec licensing is inherently discriminatory. Many come
with user caps, so that an application with fewer than a certain number of
users does not need to pay a fee—but many open source projects cannot
even afford the cost of counting their users (if they are technically
able to do so), much less handle a fee imposed suddenly if the
limit is exceeded. In addition, he said, simply ignoring codec
licensing (as some open source projects do) creates a risk of its own:
license fees or lawsuits that can appear at any time, and usually only
when the codec licensor decides the project is successful enough to
become a target.
"So here you have the codec guy here saying that the way to solve
all this is with another codec," Maxwell joked. Nevertheless, he
said, Xiph does believe it can change the game with its efforts.
Creating good codecs is hard, but we don't really need that many. It
is "weird competitive pressures" that led to the current state of
affairs where there are so many codecs to choose from at once.
High-quality free codecs can change that. The success of the internet
shows that innovation happens when people don't have to ask permission
or forgiveness. History also shows us that the best implementations
of the patented codecs are usually the free software ones, he said,
and it shows that where a royalty-free format is established, non-free
codecs see no adoption. Consider JPEG, he said: there are
patent-encumbered, proprietary image formats out there, like MrSID, "but who here has
even heard of MrSID?" he asked the audience.
Unfortunately, he said, not everyone cares about the same issues;
convincing the broader technology industry to get behind a free codec
requires that the codec not just be better in one or two ways, but
that it be better in almost every way.
Opus day
The goal of being better at everything drove the development of
the Opus codec. Opus is "designed for the Internet," he said, not
just designed for one specific application. It mixes old and new
technologies, but its most interesting feature is its extreme
flexibility. Originally, there were two flavors of audio codec,
Maxwell said, the speech codecs optimized for low-delay and limited
bandwidth, and the music codecs designed for delay-insensitive playback and
lots of bandwidth.
But everyone really wants the benefits of all of
those features together, and "we can now afford the best of both
worlds." Opus represents the merger between work done by Xiph.org,
Skype (now owned by Microsoft), Broadcom, and Mozilla, he said. It
was developed in the open and published as IETF RFC 6716 in
2012. By tossing out many of the old design decisions and building
flexibility into the codec itself, the Opus team created something
that can adapt dynamically to any use case. He showed a chart listing the
numerous codec choices of 2010: VoIP over the telephone network could
use AMR-NB or Speex, wideband VoIP could use AMR-WB or Speex,
low-bitrate music streaming could use HE-AAC or Vorbis, low-delay
broadcast could use AAC-LD, and so on. In the 2012 chart that
followed, Opus fills every slot.
Maxwell briefly talked about the design of Opus itself. It
supports bitrates from 6 to 510 kbps, sampling rates from 8 to 48
kHz, and frame sizes from 2.5 ms to 60 ms. Just as importantly,
however, all of these properties can be dynamically changed within the
audio stream itself (unlike older codecs), with very fine control.
The codec merges two audio compression schemes: Skype's SILK codec,
which was designed for speech, and Xiph.org's CELT, which was designed
for low-delay audio. Opus is also "structured so that it is
hard to implement wrong," he said. Its psychoacoustic model is actually
part of its design, rather than something that has to be considered by
the application doing the encoding. The development process was
iterative with many public releases, and employed massive automated
testing totaling hundreds of thousands of hours. Among other things,
Skype was able to test Opus by rolling it into its closed-source
application releases, but the Mumble open source chat
application was used as a testbed, too.
He then showed the results of several tests, run separately by
Google and HydrogenAudio over a wide variety of samples; Opus scored
better than all of the others in virtually every test. Its quality
has meant rapid adoption; it is already supported in millions of
hardware devices and it is mandatory to implement for WebRTC. The
codec is available under a royalty-free license, he
said, but one that has a protective clause: the right to use Opus is
revoked if one engages in any Opus-related patent litigation against
any Opus user.
One down, one to go...
Moving on to video, Maxwell then looked briefly at the royalty-free
video codecs available. Xiph.org created Theora in 1999/2000, he
said, which was good at the time, but "there's only so far you can get
by putting rockets on a pig." Google's VP8 is noticeably better than
the encumbered H.264 Baseline Profile codec, but even at its release
time the industry was raising the bar to H.264's High Profile. VP9 is
better than H.264, he said, but it shares the same basic
architecture—which is a problem all of its own. Even when a
free codec does not infringe on an encumbered codec's patents, he
said, using the same architecture makes it vulnerable to FUD, which
can impede adoption. VP9 is also a single-vendor product, which makes
some people uncomfortable regardless of the license.
"So let's take the strategy we used for Opus and apply it to
video," Maxwell said. That means working in public—with a
recognized standards body that has a strong intellectual property (IP)
disclosure policy, questioning the assumptions of older codec designs,
and targeting use cases where flexibility is key. The project also
decided to optimize for perceptual quality, he said, which differs
from efforts like H.264 that measure success with metrics like peak
signal-to-noise ratio (PSNR). A metrics-based approach lets companies
push to get their own IP included in the standard, he said; companies
have a vested interest in getting their patents into the standard so
that they are not left out of the royalty payments.
Daala is the resulting codec effort. It is currently very much a
pre-release project—in the audience Q&A portion of the
session, Maxwell said that they were aiming for a 2015 release date, but the
work has already made significant progress. The proprietary High
Efficiency Video Coding (HEVC) codec (the direct successor to H.264)
is the primary area of industry interest now; it has already been
published as a stand-alone ISO/ITU standard, although it has not yet made
it into web standards or other such playing fields. Daala is
targeting the as-yet unfinished generation of codecs that would come
after HEVC.
He listed several factors that differentiate Daala from HEVC and
the H.264 family of codecs: multisymbol arithmetic coding, lapped
transforms, frequency-domain intra-prediction, pyramid vector
quantization, Chroma from Luma, and overlapping-block motion
compensation (OBMC). Thankfully, he then took the time to explain the
differences, first providing a brief look at lossy video
compression in general, then describing how Daala takes new approaches
to the various steps.
Video compression has four basic stages, he said:
prediction, or examining the content to make predictions
about the next data frame; transformation, or rearranging the
data to make it more compressible; quantization, or computing
the difference between the prediction and the data (and also throwing
out unnecessary detail); and entropy coding, or replacing the
quantized error signal with something that can be more efficiently
packed.
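As a rough illustration of how those stages fit together, here is a toy, single-block sketch in Python (using NumPy); the 8-by-8 block size, the plain orthonormal DCT, and the quantizer step are illustrative choices for the sketch, not Daala's actual design.

    # Toy illustration of the four stages for one block; not any real codec.
    import numpy as np

    def dct_matrix(n):
        """Orthonormal DCT-II basis, used here as the 'transformation' stage."""
        k = np.arange(n)[:, None]
        i = np.arange(n)[None, :]
        m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
        m[0] /= np.sqrt(2)
        return m

    def encode_block(block, prediction, qstep, D):
        residual = block - prediction                 # compare the data to the prediction
        coeffs = D @ residual @ D.T                   # transform: concentrate the energy
        return np.round(coeffs / qstep).astype(int)   # quantize: discard fine detail
        # ...the integer coefficients would then go to the entropy coder

    def decode_block(symbols, prediction, qstep, D):
        residual = D.T @ (symbols * qstep) @ D        # undo quantization and transform
        return prediction + residual

    D = dct_matrix(8)
    rng = np.random.default_rng(0)
    frame_block = rng.normal(128, 10, size=(8, 8))
    prediction = np.full((8, 8), 128.0)               # e.g. taken from a neighboring block
    symbols = encode_block(frame_block, prediction, qstep=4.0, D=D)
    reconstructed = decode_block(symbols, prediction, qstep=4.0, D=D)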
The entropy-coding step is historically done with arithmetic
coding, he said, but most of the efficient ways to do arithmetic
coding are patented, so the team looked into non-binary multisymbol
range coding
instead. As it turns out, using non-binary coding has some major
benefits, such as the fact that it is inherently parallel, while
binary arithmetic coding is inherently serial (and thus slow on embedded
hardware). Simply plugging multisymbol coding into VP8 doubled that
codec's performance, he said.
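To make the distinction concrete, here is a minimal multisymbol coder sketch in Python. It is not Daala's entropy coder: it uses exact fractions instead of the fixed-precision renormalization a real range coder performs, and the alphabet and frequencies are made up. The point it illustrates is that each symbol from an N-ary alphabet is coded in a single narrowing step rather than as a series of binary decisions.

    from fractions import Fraction

    def encode(symbols, freqs):
        """Return an interval [low, low + width) that identifies the whole message."""
        total, cum = sum(freqs), [0]
        for f in freqs:
            cum.append(cum[-1] + f)
        low, width = Fraction(0), Fraction(1)
        for s in symbols:
            # One narrowing step per symbol, whatever the alphabet size.
            low += width * Fraction(cum[s], total)
            width *= Fraction(freqs[s], total)
        return low, width

    def decode(value, count, freqs):
        total, cum = sum(freqs), [0]
        for f in freqs:
            cum.append(cum[-1] + f)
        out = []
        for _ in range(count):
            scaled = value * total
            s = next(i for i in range(len(freqs)) if cum[i] <= scaled < cum[i + 1])
            out.append(s)
            value = (scaled - cum[s]) / freqs[s]   # rescale for the next symbol
        return out

    # Example: a three-symbol alphabet with skewed frequencies.
    freqs = [6, 3, 1]
    message = [0, 0, 2, 1, 0, 1]
    low, _ = encode(message, freqs)
    assert decode(low, len(message), freqs) == message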
Lapped transforms are Daala's alternative to the discrete cosine
transform (DCT) that HEVC and similar codecs rely on in the
transformation step. DCT-based codecs start by
breaking each frame into blocks (such as 8-by-8 pixel blocks), and
those blocks result in the blocky visual artifacts seen whenever a
movie is overcompressed. Lapped transforms add a pre-filter at the
beginning of the process and a matched post-filter at the end, and the
blocks of the transform overlap with (but are not identical to) the
blocks used by the filters. That eliminates blocky artifacts,
outperforms DCT, and even offers better compression than wavelets.
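A small one-dimensional sketch (Python with NumPy and SciPy) may help show the structure: an invertible pre-filter mixes samples across each block boundary before a per-block DCT, and the matched post-filter undoes it after the inverse DCT. The filter coefficients and block size here are arbitrary illustrations; Daala's actual filters are designed to maximize coding gain.

    import numpy as np
    from scipy.fft import dct, idct

    BLOCK = 4   # transform block size (illustrative)
    LAP = 2     # samples on each side of a boundary touched by the filter

    def prefilter(x):
        """Mix samples that straddle each block boundary (toy lifting steps)."""
        y = x.astype(float)
        for b in range(BLOCK, len(y), BLOCK):
            left, right = slice(b - LAP, b), slice(b, b + LAP)
            y[right] += 0.4 * y[left][::-1]
            y[left] += 0.2 * y[right][::-1]
        return y

    def postfilter(y):
        """Exact inverse of prefilter(), run after the inverse block transform."""
        x = y.copy()
        for b in range(BLOCK, len(x), BLOCK):
            left, right = slice(b - LAP, b), slice(b, b + LAP)
            x[left] -= 0.2 * x[right][::-1]
            x[right] -= 0.4 * x[left][::-1]
        return x

    def lapped_forward(x):
        y = prefilter(x)
        return np.concatenate([dct(y[b:b + BLOCK], norm='ortho')
                               for b in range(0, len(y), BLOCK)])

    def lapped_inverse(coeffs):
        y = np.concatenate([idct(coeffs[b:b + BLOCK], norm='ortho')
                            for b in range(0, len(coeffs), BLOCK)])
        return postfilter(y)

    x = np.arange(16.0)
    assert np.allclose(lapped_inverse(lapped_forward(x)), x)

Because the pre- and post-filters straddle the boundaries while the DCT stays within its block, the effective basis functions overlap neighboring blocks, which is what suppresses the hard block edges.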
Daala also diverges from the beaten path when it comes to
intra-prediction—the step of predicting the contents of a block
based on its neighboring blocks, Maxwell said. Typical
intra-prediction in DCT codecs uses the one-pixel border of
neighboring blocks, which does not work well with textured areas of
the frame and more importantly does not work with lapped transforms
since they do not have blocks at all. But DCT codecs typically
make their block predictions in the un-transformed pixel domain; the
Daala team decided to try making predictions in the frequency domain
instead, which turned out to be quite successful.
But for this new
approach, they had to experiment to figure out what basis functions
to use as the predictors. So they ran machine learning experiments
on a variety of test images (using different initial conditions), and
found an efficient set of functions by seeing where the learning
algorithm converged. The results are marginally better than DCT, and
could be improved with more work. But another interesting fact, he
said, was that the frequency-domain predictors happen to work very
well with patterned images (which makes sense when one thinks about
it: patterns are periodic textures), where DCT does poorly.
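In code, the core idea is just a linear predictor over transform coefficients. The sketch below (Python/NumPy) fits illustrative weights with ordinary least squares on synthetic data, as a stand-in for the much larger machine-learning search the team actually ran; the block size and the invented correlation are assumptions for the example only.

    import numpy as np

    N = 16   # coefficients per block (e.g. a flattened 4x4 transform)

    def fit_predictor(neighbor_coeffs, target_coeffs):
        """Find weights W so that target ~= neighbors @ W."""
        W, *_ = np.linalg.lstsq(neighbor_coeffs, target_coeffs, rcond=None)
        return W

    rng = np.random.default_rng(0)
    # Synthetic training set: each row holds the (left, up) neighbors' coefficients.
    neighbors = rng.normal(size=(1000, 2 * N))
    targets = 0.5 * neighbors[:, :N] + 0.3 * neighbors[:, N:]   # invented correlation
    W = fit_predictor(neighbors, targets)

    # At coding time, only the residual left after the prediction is coded.
    current = 0.5 * neighbors[0, :N] + 0.3 * neighbors[0, N:]
    residual = current - neighbors[0] @ W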
Pyramid vector quantization is a technique that came from the Opus
project, he said, and it is still in the experimental stage for
Daala. The idea is that in the quantization
step, encoding the "energy level" (i.e., magnitude) separately from
details produces higher perceptual quality. This was clearly the case
with Opus, but the team is still figuring out how to apply it to
Daala. At the moment, pyramid vector quantization in Daala has the
effect of making lower-quality regions of the frame look grainy (or
noisy) rather than blurry. That sounds like a reasonable trade-off, Maxwell
said, but more work is needed.
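A rough sketch of the gain/shape split (Python/NumPy) is below. It codes a block's energy as a scalar gain and its "shape" as an integer vector built from K unit pulses; the naive greedy pulse search and the values of K and x are stand-ins, not the algorithm Opus uses or the one Daala will end up with.

    import numpy as np

    def pvq_quantize(x, K):
        """Split x into a scalar gain and an integer shape vector with K unit pulses."""
        gain = float(np.linalg.norm(x))
        y = np.zeros(len(x), dtype=int)
        if gain == 0:
            return gain, y
        for _ in range(K):
            best_y, best_score = None, -np.inf
            for i in np.flatnonzero(x):
                cand = y.copy()
                cand[i] += 1 if x[i] > 0 else -1          # add one unit pulse
                score = cand @ x / np.linalg.norm(cand)   # match to the direction of x
                if score > best_score:
                    best_y, best_score = cand, score
            y = best_y
        return gain, y   # gain and y would be entropy-coded separately

    def pvq_dequantize(gain, y):
        norm = np.linalg.norm(y)
        return gain * y / norm if norm > 0 else np.zeros(len(y))

    # A small K loses detail but preserves the block's energy, which tends to
    # read as grain rather than blur.
    x = np.array([4.0, -2.0, 1.0, 0.5, 0.0, -0.25])
    gain, shape = pvq_quantize(x, K=4)
    x_hat = pvq_dequantize(gain, shape)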
Similarly, the Chroma from Luma and OBMC techniques are also still
in heavy development. Chroma
from Luma is a way to take advantage of the fact that, in the YUV color space used for
video, edges and other image features that are visible in the Luma
(brightness) channel almost always correspond to edges in the Chroma
(color) channels. The idea has been examined in DCT codecs before,
but is not used because it is computationally slow in the pixel
domain. In Daala's frequency domain, however, it is quite fast.
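The model itself is simple enough to show in a few lines of Python/NumPy: within a block, chroma is treated as an affine function of the co-located luma. The least-squares fit and the synthetic edge below are illustrative, and Daala applies the idea to frequency-domain coefficients rather than the raw pixel values shown here.

    import numpy as np

    def cfl_fit(luma, chroma):
        """Fit chroma ~= alpha * luma + beta over one block (both flattened)."""
        A = np.column_stack([luma, np.ones_like(luma)])
        (alpha, beta), *_ = np.linalg.lstsq(A, chroma, rcond=None)
        return alpha, beta

    def cfl_predict(luma, alpha, beta):
        return alpha * luma + beta

    # Example: a chroma block that follows a hard luma edge, plus a little noise.
    rng = np.random.default_rng(1)
    luma = np.repeat([50.0, 200.0], 8)
    chroma = 0.4 * luma + 30.0 + rng.normal(0, 1, 16)
    alpha, beta = cfl_fit(luma, chroma)
    residual = chroma - cfl_predict(luma, alpha, beta)   # small, so cheap to code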
OBMC is a way to predict frame contents based on motion in previous
frames (also known as inter-prediction). Here again, the
traditional method builds on a DCT codec's block structures, which Daala
does not have. Instead, the project has been working on using the
motion compensation technique used in the Dirac
codec, which blends together several motion vectors predicted from
nearby. On the plus side, this results in improved PSNR numbers, but
on the down side, it can blur sharp
features or introduce ghosting artifacts. It also risks introducing
block artifacts, since unlike Daala's other techniques it is a
block-based operation. To compensate, the project is working on
adapting OBMC to variable block sizes; Dirac already does something to
that effect but it is inefficient. Daala is using an adaptive
subdivision technique, subdividing blocks only as needed.
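The blending itself can be sketched in one dimension (Python/NumPy): each block's motion-compensated prediction is weighted by a window that spills into its neighbors, and the overlapping windows are normalized so that no hard seam appears at the block edge. The triangular window, block size, and motion vectors are illustrative; they are not Daala's (or Dirac's) actual choices.

    import numpy as np

    def motion_predict(reference, mv, start, length):
        """Copy `length` samples from `reference`, displaced by motion vector mv."""
        idx = np.clip(np.arange(start, start + length) + mv, 0, len(reference) - 1)
        return reference[idx]

    def obmc_predict(reference, mvs, block=8):
        """Blend per-block predictions with 50%-overlapping triangular windows."""
        n = len(mvs) * block
        pred, wsum = np.zeros(n), np.zeros(n)
        win = np.bartlett(2 * block)            # window twice as wide as the block
        for b, mv in enumerate(mvs):
            start = b * block - block // 2      # so it straddles the block boundaries
            lo, hi = max(start, 0), min(start + 2 * block, n)
            w = win[lo - start:hi - start]
            pred[lo:hi] += w * motion_predict(reference, mv, lo, hi - lo)
            wsum[lo:hi] += w
        return pred / np.maximum(wsum, 1e-12)   # normalize where windows overlap

    # Example: a previous "frame" and two blocks that moved by different amounts.
    reference = np.arange(32, dtype=float)
    prediction = obmc_predict(reference, mvs=[2, 5], block=8)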
There are still plenty of unfinished pieces, Maxwell said, showing
several techniques based on patent-unencumbered research papers that
the project would like to follow up on. The area is "far from
exhausted," and what Daala needs most is "more engineers and more
Moore's Law." This is particularly true because right now the industry
is distracted by figuring out what to do with HEVC and VP9. He
invited anyone with an interest in the subject to join the project,
and invited application developers to get involved, too. Opus
benefited significantly by testing in applications, he said, and Daala
would benefit as well. In response to another audience question, he
added that Daala is not attempting to compete with VP9, but that
Xiph.org has heard "murmurs" from Google that, if Daala looks good, it
might supplant VP10.
The codec development race is an ongoing battle, and perhaps it
will never be finished, as multimedia capabilities continue to
advance. Nevertheless, it is interesting to watch Xiph.org break out
of the traditional DCT-based codec mold and aim for a generation (or
more) beyond what everyone else is working on. On the other hand,
perhaps codec technology will get good enough that (as with JPEG)
additional compression gains do not matter in the high-bandwidth
future. In either case, having a royalty-free codec available is
certainly paramount.
[The author would like to thank the Linux Foundation for travel assistance to Edinburgh for GStreamer Conference 2013.]