At the 2013 GStreamer Conference in Edinburgh, Greg Maxwell from
Xiph.org spoke about the creation of the Opus audio codec, and how that
experience has informed the subsequent development process behind the
Daala video codec. Maxwell,
who is employed by Mozilla to work on multimedia codecs, highlighted
in particular how working entirely in the open gives the project
significant advantages over codecs created behind closed doors. Daala
is still in flux, he said, but it has the potential to equal the
impact that Opus has had on the audio encoding world.
Approaching the battlefield
Mozilla's support for Xiph.org's codec development comes as a
surprise to some, Maxwell said, but it makes sense in light of
Mozilla's concern for the evolution of the web. As multimedia codecs
become more important to the web, they become more important to
Mozilla. And they are important, he said: codecs
are used by all multimedia applications, so every cost associated with
them (such as royalties or license fees) is repeated a million-fold.
On your person right now, he told the audience, "you carry with you
four or five licenses for AAC."
Codec licensing amounts to a billion-dollar tax on communication
software. In addition, it is used as a weapon between battling
competitors, so it even affects people in countries without software
patents.
Moreover, codec licensing is inherently discriminatory. Many come
with user caps, so that an application with fewer than a certain number of
users does not need to pay a fee—but many open source projects cannot
even afford the cost of counting their users (if they are technically
able to do so), much less handle a fee imposed suddenly if the
limit is exceeded. In addition, he said, simply ignoring codec
licensing (as some open source projects do) creates a risk of its own:
license fees or lawsuits that can appear at any time, and usually only
when the codec licensor decides the project is successful enough to
become a target.
"So here you have the codec guy here saying that the way to solve
all this is with another codec," Maxwell joked. Nevertheless, he
said, Xiph does believe it can change the game with its efforts.
Creating good codecs is hard, but we don't really need that many. It
is "weird competitive pressures" that led to the current state of
affairs where there are so many codecs to choose from at once.
High-quality free codecs can change that. The success of the internet
shows that innovation happens when people don't have to ask permission
or forgiveness. History also shows us that the best implementations
of the patented codecs are usually the free software ones, he said,
and it shows that where a royalty-free format is established, non-free
codecs see no adoption. Consider JPEG, he said: there are
patent-encumbered, proprietary image formats out there, like MrSID, "but who here has
even heard of MrSID?" he asked the audience.
Unfortunately, he said, not everyone cares about the same issues;
convincing the broader technology industry to get behind a free codec
requires that the codec not just be better in one or two ways, but
that it be better in almost every way.
Opus day
The goal of being better at everything drove the development of
the Opus codec. Opus is "designed for the Internet," he said, not
just designed for one specific application. It mixes old and new
technologies, but its most interesting feature is its extreme
flexibility. Originally, there were two flavors of audio codec,
Maxwell said, the speech codecs optimized for low-delay and limited
bandwidth, and the music codecs designed for delay-insensitive playback and
lots of bandwidth.
But everyone really wants the benefits of all of
those features together, and "we can now afford the best of both
worlds." Opus represents the merger between work done by Xiph.org,
Skype (now owned by Microsoft), Broadcom, and Mozilla, he said. It
was developed in the open and published as IETF RFC 6716 in
2012. By tossing out many of the old design decisions and building
flexibility into the codec itself, the Opus team created something
that can adapt dynamically to any use case. He showed a chart listing the
numerous codec choices of 2010: VoIP over the telephone network could
use AMR-NB or Speex, wideband VoIP could use AMR-WB or Speex,
low-bitrate music streaming could use HE-AAC or Vorbis, low-delay
broadcast could use AAC-LD, and so on. In the 2012 chart that
followed, Opus fills every slot.
Maxwell briefly talked about the design of Opus itself. It
supports bitrates from 6 to 510 kbps, sampling rates from 8 to 48
kHz, and frame sizes from 2.5 ms to 60 ms. Just as importantly,
however, all of these properties can be dynamically changed within the
audio stream itself (unlike older codecs), with very fine control.
The codec merges two audio compression schemes: Skype's SILK codec,
which was designed for speech, and Xiph.org's CELT, which was designed
for low-delay audio. Opus is also "structured so that it is
hard to implement wrong," he said. Its psychoacoustic model is actually
part of its design, rather than something that has to be considered by
the application doing the encoding. The development process was
iterative with many public releases, and employed massive automated
testing totaling hundreds of thousands of hours. Among other things,
Skype was able to test Opus by rolling it into its closed-source
application releases, but the Mumble open source chat
application was used as a testbed, too.
He then showed the results of several tests, run separately by
Google and HydrogenAudio over a wide variety of samples; Opus scored
better than all of the others in virtually every test. Its quality
has meant rapid adoption; it is already supported in millions of
hardware devices and it is mandatory to implement for WebRTC. The
codec is available under a royalty-free license, he
said, but one that has a protective clause: the right to use Opus is
revoked if one engages in any Opus-related patent litigation against
any Opus user.
One down, one to go...
Moving on to video, Maxwell then looked briefly at the royalty-free
video codecs available. Xiph.org created Theora in 1999/2000, he
said, which was good at the time, but "there's only so far you can get
by putting rockets on a pig." Google's VP8 is noticeably better than
the encumbered H.264 Baseline Profile codec, but even at its release
time the industry was raising the bar to H.264's High Profile. VP9 is
better than H.264, he said, but it shares the same basic
architecture—which is a problem all of its own. Even when a
free codec does not infringe on an encumbered codec's patents, he
said, using the same architecture makes it vulnerable to FUD, which
can impede adoption. VP9 is also a single-vendor product, which makes
some people uncomfortable regardless of the license.
"So let's take the strategy we used for Opus and apply it to
video," Maxwell said. That means working in public—with a
recognized standards body that has a strong intellectual property (IP)
disclosure policy, questioning the assumptions of older codec designs,
and targeting use cases where flexibility is key. The project also
decided to optimize for perceptual quality, he said, which differs
from efforts like H.264 that measure success with metrics like peak
signal-to-noise ratio (PSNR). A metrics-based approach lets companies
push to get their own IP included in the standard, he said; companies
have a vested interest in getting their patents into the standard so
that they are not left out of the royalty payments.
Daala is the resulting codec effort. It is currently very much a
pre-release project—in the audience Q&A portion of the
session, Maxwell said that they were aiming for a 2015 release date, but the
work has already made significant progress. The proprietary High
Efficiency Video Coding (HEVC) codec (the direct successor to H.264)
is the primary area of industry interest now; it has already been
published as a stand-alone ISO/ITU standard, although it has not yet made
it into web standards or other such playing fields. Daala is
targeting the as-yet unfinished generation of codecs that would come
after HEVC.
He listed several factors that differentiate Daala from HEVC and
the H.264 family of codecs: multisymbol arithmetic coding, lapped
transforms, frequency-domain intra-prediction, pyramid vector
quantization, Chroma from Luma, and overlapping-block motion
compensation (OBMC). Thankfully, he then took the time to explain the
differences, first providing a brief look at lossy video
compression in general, then describing how Daala takes new approaches
to the various steps.
Video compression has four basic stages, he said:
prediction, or examining the content to make predictions
about the next data frame; transformation, or rearranging the
data to make it more compressible; quantization, or computing
the difference between the prediction and the data (and also throwing
out unnecessary detail); and entropy coding, or replacing the
quantized error signal with something that can be more efficiently
packed.
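As a rough illustration of how those stages fit together, here is a toy, single-block sketch in Python (using NumPy); the 8-by-8 block size, the plain orthonormal DCT, and the quantizer step are illustrative choices for the sketch, not Daala's actual design.

    # Toy illustration of the four stages for one block; not any real codec.
    import numpy as np

    def dct_matrix(n):
        """Orthonormal DCT-II basis, used here as the 'transformation' stage."""
        k = np.arange(n)[:, None]
        i = np.arange(n)[None, :]
        m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
        m[0] /= np.sqrt(2)
        return m

    def encode_block(block, prediction, qstep, D):
        residual = block - prediction                 # compare the data to the prediction
        coeffs = D @ residual @ D.T                   # transform: concentrate the energy
        return np.round(coeffs / qstep).astype(int)   # quantize: discard fine detail
        # ...the integer coefficients would then go to the entropy coder

    def decode_block(symbols, prediction, qstep, D):
        residual = D.T @ (symbols * qstep) @ D        # undo quantization and transform
        return prediction + residual

    D = dct_matrix(8)
    rng = np.random.default_rng(0)
    frame_block = rng.normal(128, 10, size=(8, 8))
    prediction = np.full((8, 8), 128.0)               # e.g. taken from a neighboring block
    symbols = encode_block(frame_block, prediction, qstep=4.0, D=D)
    reconstructed = decode_block(symbols, prediction, qstep=4.0, D=D)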
The entropy-coding step is historically done with arithmetic
coding, he said, but most of the efficient ways to do arithmetic
coding are patented, so the team looked into non-binary multisymbol
range coding
instead. As it turns out, using non-binary coding has some major
benefits, such as the fact that it is inherently parallel, while
binary arithmetic coding is inherently serial (and thus slow on embedded
hardware). Simply plugging multisymbol coding into VP8 doubled that
codec's performance, he said.
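To make the distinction concrete, here is a minimal multisymbol coder sketch in Python. It is not Daala's entropy coder: it uses exact fractions instead of the fixed-precision renormalization a real range coder performs, and the alphabet and frequencies are made up. The point it illustrates is that each symbol from an N-ary alphabet is coded in a single narrowing step rather than as a series of binary decisions.

    from fractions import Fraction

    def encode(symbols, freqs):
        """Return an interval [low, low + width) that identifies the whole message."""
        total, cum = sum(freqs), [0]
        for f in freqs:
            cum.append(cum[-1] + f)
        low, width = Fraction(0), Fraction(1)
        for s in symbols:
            # One narrowing step per symbol, whatever the alphabet size.
            low += width * Fraction(cum[s], total)
            width *= Fraction(freqs[s], total)
        return low, width

    def decode(value, count, freqs):
        total, cum = sum(freqs), [0]
        for f in freqs:
            cum.append(cum[-1] + f)
        out = []
        for _ in range(count):
            scaled = value * total
            s = next(i for i in range(len(freqs)) if cum[i] <= scaled < cum[i + 1])
            out.append(s)
            value = (scaled - cum[s]) / freqs[s]   # rescale for the next symbol
        return out

    # Example: a three-symbol alphabet with skewed frequencies.
    freqs = [6, 3, 1]
    message = [0, 0, 2, 1, 0, 1]
    low, _ = encode(message, freqs)
    assert decode(low, len(message), freqs) == message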
Lapped transforms are Daala's alternative to the discrete cosine
transform (DCT) that HEVC and similar codecs rely on in the
transformation step. DCT-based codecs start by
breaking each frame into blocks (such as 8-by-8 pixel blocks), and
those blocks result in the blocky visual artifacts seen whenever a
movie is overcompressed. Lapped transforms add a pre-filter at the
beginning of the process and a matched post-filter at the end, and the
blocks of the transform overlap with (but are not identical to) the
blocks used by the filters. That eliminates blocky artifacts,
outperforms DCT, and even offers better compression than wavelets.
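A small one-dimensional sketch (Python with NumPy and SciPy) may help show the structure: an invertible pre-filter mixes samples across each block boundary before a per-block DCT, and the matched post-filter undoes it after the inverse DCT. The filter coefficients and block size here are arbitrary illustrations; Daala's actual filters are designed to maximize coding gain.

    import numpy as np
    from scipy.fft import dct, idct

    BLOCK = 4   # transform block size (illustrative)
    LAP = 2     # samples on each side of a boundary touched by the filter

    def prefilter(x):
        """Mix samples that straddle each block boundary (toy lifting steps)."""
        y = x.astype(float)
        for b in range(BLOCK, len(y), BLOCK):
            left, right = slice(b - LAP, b), slice(b, b + LAP)
            y[right] += 0.4 * y[left][::-1]
            y[left] += 0.2 * y[right][::-1]
        return y

    def postfilter(y):
        """Exact inverse of prefilter(), run after the inverse block transform."""
        x = y.copy()
        for b in range(BLOCK, len(x), BLOCK):
            left, right = slice(b - LAP, b), slice(b, b + LAP)
            x[left] -= 0.2 * x[right][::-1]
            x[right] -= 0.4 * x[left][::-1]
        return x

    def lapped_forward(x):
        y = prefilter(x)
        return np.concatenate([dct(y[b:b + BLOCK], norm='ortho')
                               for b in range(0, len(y), BLOCK)])

    def lapped_inverse(coeffs):
        y = np.concatenate([idct(coeffs[b:b + BLOCK], norm='ortho')
                            for b in range(0, len(coeffs), BLOCK)])
        return postfilter(y)

    x = np.arange(16.0)
    assert np.allclose(lapped_inverse(lapped_forward(x)), x)

Because the pre- and post-filters straddle the boundaries while the DCT stays within its block, the effective basis functions overlap neighboring blocks, which is what suppresses the hard block edges.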
Daala also diverges from the beaten path when it comes to
intra-prediction—the step of predicting the contents of a block
based on its neighboring blocks, Maxwell said. Typical
intra-prediction in DCT codecs uses the one-pixel border of
neighboring blocks, which does not work well with textured areas of
the frame and more importantly does not work with lapped transforms
since they do not have blocks at all. But DCT codecs typically
make their block predictions in the un-transformed pixel domain; the
Daala team decided to try making predictions in the frequency domain
instead, which turned out to be quite successful.
But for this new
approach, they had to experiment to figure out what basis functions
to use as the predictors. So they ran machine learning experiments
on a variety of test images (using different initial conditions), and
found an efficient set of functions by seeing where the learning
algorithm converged. The results are marginally better than DCT, and
could be improved with more work. But another interesting fact, he
said, was that the frequency-domain predictors happen to work very
well with patterned images (which makes sense when one thinks about
it: patterns are periodic textures), where DCT does poorly.
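In code, the core idea is just a linear predictor over transform coefficients. The sketch below (Python/NumPy) fits illustrative weights with ordinary least squares on synthetic data, as a stand-in for the much larger machine-learning search the team actually ran; the block size and the invented correlation are assumptions for the example only.

    import numpy as np

    N = 16   # coefficients per block (e.g. a flattened 4x4 transform)

    def fit_predictor(neighbor_coeffs, target_coeffs):
        """Find weights W so that target ~= neighbors @ W."""
        W, *_ = np.linalg.lstsq(neighbor_coeffs, target_coeffs, rcond=None)
        return W

    rng = np.random.default_rng(0)
    # Synthetic training set: each row holds the (left, up) neighbors' coefficients.
    neighbors = rng.normal(size=(1000, 2 * N))
    targets = 0.5 * neighbors[:, :N] + 0.3 * neighbors[:, N:]   # invented correlation
    W = fit_predictor(neighbors, targets)

    # At coding time, only the residual left after the prediction is coded.
    current = 0.5 * neighbors[0, :N] + 0.3 * neighbors[0, N:]
    residual = current - neighbors[0] @ W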
Pyramid vector quantization is a technique that came from the Opus
project, he said, and it is still in the experimental stage for
Daala. The idea is that in the quantization
step, encoding the "energy level" (i.e., magnitude) separately from
details produces higher perceptual quality. This was clearly the case
with Opus, but the team is still figuring out how to apply it to
Daala. At the moment, pyramid vector quantization in Daala has the
effect of making lower-quality regions of the frame look grainy (or
noisy) rather than blurry. That sounds like a reasonable trade-off, Maxwell
said, but more work is needed.
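A rough sketch of the gain/shape split (Python/NumPy) is below. It codes a block's energy as a scalar gain and its "shape" as an integer vector built from K unit pulses; the naive greedy pulse search and the values of K and x are stand-ins, not the algorithm Opus uses or the one Daala will end up with.

    import numpy as np

    def pvq_quantize(x, K):
        """Split x into a scalar gain and an integer shape vector with K unit pulses."""
        gain = float(np.linalg.norm(x))
        y = np.zeros(len(x), dtype=int)
        if gain == 0:
            return gain, y
        for _ in range(K):
            best_y, best_score = None, -np.inf
            for i in np.flatnonzero(x):
                cand = y.copy()
                cand[i] += 1 if x[i] > 0 else -1          # add one unit pulse
                score = cand @ x / np.linalg.norm(cand)   # match to the direction of x
                if score > best_score:
                    best_y, best_score = cand, score
            y = best_y
        return gain, y   # gain and y would be entropy-coded separately

    def pvq_dequantize(gain, y):
        norm = np.linalg.norm(y)
        return gain * y / norm if norm > 0 else np.zeros(len(y))

    # A small K loses detail but preserves the block's energy, which tends to
    # read as grain rather than blur.
    x = np.array([4.0, -2.0, 1.0, 0.5, 0.0, -0.25])
    gain, shape = pvq_quantize(x, K=4)
    x_hat = pvq_dequantize(gain, shape)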
Similarly, the Chroma from Luma and OBMC techniques are also still
in heavy development. Chroma
from Luma is a way to take advantage of the fact that, in the YUV color space used for
video, edges and other image features that are visible in the Luma
(brightness) channel almost always correspond to edges in the Chroma
(color) channels. The idea has been examined in DCT codecs before,
but is not used because it is computationally slow in the pixel
domain. In Daala's frequency domain, however, it is quite fast.
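The model itself is simple enough to show in a few lines of Python/NumPy: within a block, chroma is treated as an affine function of the co-located luma. The least-squares fit and the synthetic edge below are illustrative, and Daala applies the idea to frequency-domain coefficients rather than the raw pixel values shown here.

    import numpy as np

    def cfl_fit(luma, chroma):
        """Fit chroma ~= alpha * luma + beta over one block (both flattened)."""
        A = np.column_stack([luma, np.ones_like(luma)])
        (alpha, beta), *_ = np.linalg.lstsq(A, chroma, rcond=None)
        return alpha, beta

    def cfl_predict(luma, alpha, beta):
        return alpha * luma + beta

    # Example: a chroma block that follows a hard luma edge, plus a little noise.
    rng = np.random.default_rng(1)
    luma = np.repeat([50.0, 200.0], 8)
    chroma = 0.4 * luma + 30.0 + rng.normal(0, 1, 16)
    alpha, beta = cfl_fit(luma, chroma)
    residual = chroma - cfl_predict(luma, alpha, beta)   # small, so cheap to code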
OBMC is a way to predict frame contents based on motion in previous
frames (also known as inter-prediction). Here again, the
traditional method builds on a DCT codec's block structures, which Daala
does not have. Instead, the project has been working on using the
motion compensation technique used in the Dirac
codec, which blends together several motion vectors predicted from
nearby. On the plus side, this results in improved PSNR numbers, but
on the down side, it can blur sharp
features or introduce ghosting artifacts. It also risks introducing
block artifacts, since unlike Daala's other techniques it is a
block-based operation. To compensate, the project is working on
adapting OBMC to variable block sizes; Dirac already does something to
that effect but it is inefficient. Daala is using an adaptive
subdivision technique, subdividing blocks only as needed.
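The blending itself can be sketched in one dimension (Python/NumPy): each block's motion-compensated prediction is weighted by a window that spills into its neighbors, and the overlapping windows are normalized so that no hard seam appears at the block edge. The triangular window, block size, and motion vectors are illustrative; they are not Daala's (or Dirac's) actual choices.

    import numpy as np

    def motion_predict(reference, mv, start, length):
        """Copy `length` samples from `reference`, displaced by motion vector mv."""
        idx = np.clip(np.arange(start, start + length) + mv, 0, len(reference) - 1)
        return reference[idx]

    def obmc_predict(reference, mvs, block=8):
        """Blend per-block predictions with 50%-overlapping triangular windows."""
        n = len(mvs) * block
        pred, wsum = np.zeros(n), np.zeros(n)
        win = np.bartlett(2 * block)            # window twice as wide as the block
        for b, mv in enumerate(mvs):
            start = b * block - block // 2      # so it straddles the block boundaries
            lo, hi = max(start, 0), min(start + 2 * block, n)
            w = win[lo - start:hi - start]
            pred[lo:hi] += w * motion_predict(reference, mv, lo, hi - lo)
            wsum[lo:hi] += w
        return pred / np.maximum(wsum, 1e-12)   # normalize where windows overlap

    # Example: a previous "frame" and two blocks that moved by different amounts.
    reference = np.arange(32, dtype=float)
    prediction = obmc_predict(reference, mvs=[2, 5], block=8)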
There are still plenty of unfinished pieces, Maxwell said, showing
several techniques based on patent-unencumbered research papers that
the project would like to follow up on. The area is "far from
exhausted," and what Daala needs most is "more engineers and more
Moore's Law." This is particularly true because right now the industry
is distracted by figuring out what to do with HEVC and VP9. He
invited anyone with an interest in the subject to join the project,
and invited application developers to get involved, too. Opus
benefited significantly by testing in applications, he said, and Daala
would benefit as well. In response to another audience question, he
added that Daala is not attempting to compete with VP9, but that
Xiph.org has heard "murmurs" from Google that, if Daala looks good, it
might supplant VP10.
The codec development race is an ongoing battle, and perhaps it
will never be finished, as multimedia capabilities continue to
advance. Nevertheless, it is interesting to watch Xiph.org break out
of the traditional DCT-based codec mold and aim for a generation (or
more) beyond what everyone else is working on. On the other hand,
perhaps codec technology will get good enough that (as with JPEG)
additional compression gains do not matter in the high-bandwidth
future. In either case, having a royalty-free codec available is
certainly paramount.
[The author would like to thank the Linux Foundation for travel assistance to Edinburgh for GStreamer Conference 2013.]