LiTr: A lightweight video/audio transcoder for Android
December 19, 2019
If a picture’s worth a thousand words, then what about a video?
In 2017, we launched video sharing to give our members the ability to share video content on the feed via the LinkedIn mobile app or a web browser. When posting a video from an Android device, the member could either record it using their device camera app or pick an existing video from the gallery. Once uploaded, a video would then be transcoded into consumption format and appear in the feed as an update.
Once the feature was successfully launched and started gaining popularity, we immediately set off to work on performance improvements. Since video is such a “heavy” consumer of data, any performance gains would significantly improve the user experience. We started with an assumption that users are most likely to share content straight from the mobile device they captured it on. This led us to focus on looking at the typical capture parameters.
Out-of-box video recording resolution on Android cameras, at the time, was about 720p to 1080p with a bitrate of 12 to 17 Mbps. This was very different from our top consumption format of 720p/5Mbps, which meant that a lot of bytes were being sent to the backend just to be discarded by server transcoding.
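To put rough numbers on this, here is a back-of-the-envelope sketch. The 60-second clip length is an assumption chosen purely for illustration, and the math ignores audio and container overhead; only the bitrates come from the figures above.

```java
// Rough upload-size arithmetic for one clip. The 60-second duration is an
// illustrative assumption; the bitrates are the ones quoted in the text.
class UploadSizeEstimate {
    /** Size in megabytes (10^6 bytes) of a stream at the given bitrate and duration. */
    static double megabytes(double bitrateMbps, double durationSeconds) {
        return bitrateMbps * durationSeconds / 8.0; // 8 bits per byte
    }

    public static void main(String[] args) {
        double seconds = 60.0;
        double captured = megabytes(12.0, seconds); // low end of the capture bitrate
        double consumed = megabytes(5.0, seconds);  // top consumption format
        System.out.printf("captured: %.1f MB, consumed: %.1f MB, discarded: %.1f MB%n",
                captured, consumed, captured - consumed);
    }
}
```

Under these assumptions, a one-minute clip captured at 12 Mbps is about 90 MB, while the same minute at 5 Mbps is about 37.5 MB, so more than half of the uploaded bytes would be thrown away.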
The solution to this “throwaway data” problem was straightforward: transcode the video on the device to throw away those bytes before sending the video over the network. In order to do that, we needed an on-device transcoder. We discovered an open source solution in android-transcoder, which performed basic hardware accelerated video/audio transcoding on Android. However, when we estimated the changes we would need to implement, we realized it would entail a major rewrite with an API break.
Furthermore, we wanted to be able to modify video frames, which android-transcoder could not do. We decided to write a library from scratch and collaborate with the android-transcoder project after completion. The popularity of android-transcoder and its forks (editor by selsamman, MP4Composer-android, Transcoder) demonstrated that there is a need in the Android media community for video/audio transcoding/modification tooling. Thus, LiTr was born.
This fall, I presented LiTr at the Demuxed 2019 conference, shortly after open sourcing it. In this post, I’ll provide a high-level overview of that talk, including how we built the LiTr architecture, how you can use it to transform your media, and why we chose MediaCodec to access the hardware encoder. See here for a recording of the talk.
Transcoding on Android can be performed by using software or hardware encoders. Software encoders (such as an Android port of ffmpeg) offer a great variety of supported codecs and containers, as well as the ability to perform editing operations (joining/splitting videos, muxing/demuxing tracks, modifying frames, etc.). However, they can be very battery- and CPU-intensive. Hardware encoders have a limited codec selection, but are much more performant and power efficient.
After some experimentation, we came to the conclusion that a hardware encoder would be a much better fit for our needs and constraints. Our use case was fairly simple: reducing video resolution and/or bitrate to avoid “throwing away” extra pixels. Using a hardware encoder would offer real-time frame rate and lower battery consumption, both important considerations for the mobile device experience. Format compatibility-wise, we decided that the risk existed, but was low. Members normally choose to share videos that are playable on their devices, meaning that those videos can be decoded there. And since most Android devices record video with H.264 compression, that codec would also be available to us for encoding the video.
Lightweight hardware accelerated video/audio transcoder for Android, or LiTr for short
To access encoder hardware, LiTr uses Android’s MediaCodec API. To use MediaCodec, a client must first request the framework to create its instance. For example, a client can tell the framework that it needs a decoder for “video/avc”, at which point, the system can return either a new instance of MediaCodec or null, if that format is not supported. Once a codec instance is created, it must be configured with a set of parameters, such as resolution, bitrate, frame rate, etc. Configuration can fail if the desired parameters are not supported (for example, if we are trying to decode a 4K video on hardware that doesn’t support 4K resolution). Once a MediaCodec instance is created and configured, it can be started and used to process frames.
Interaction with a MediaCodec instance happens through buffer queues, with a client continuously flinging buffers with data at MediaCodec and receiving buffers back:
Client dequeues an input buffer from MediaCodec and receives one if/when it is available.
Client fills the buffer with frame data and releases it back to MediaCodec, along with metadata (start index, byte count, frame presentation time, flags).
MediaCodec processes the data.
Client dequeues an output buffer from MediaCodec and receives one if/when it is available.
Client consumes the output data and releases the buffer back to MediaCodec.
The process is repeated until all frames are processed. A client does not own buffers and has to release them back to MediaCodec once it is done using them. Otherwise, at some point, all dequeuing attempts will consistently fail. When the MediaCodec instance is no longer needed, it is stopped and released.
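The ownership rule is easier to see in a toy model. The sketch below is plain Java and is not android.media.MediaCodec: ToyCodec, BufferQueueDemo, and all of their methods are hypothetical names, frames are just strings, and the output side is collapsed into a single call. It only illustrates that buffers are lent to the client and that a client which never returns them eventually cannot dequeue anything.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy stand-in for a codec with a fixed pool of input buffers (hypothetical API).
class ToyCodec {
    static final int TRY_AGAIN = -1;
    private final Deque<Integer> freeInput = new ArrayDeque<>();
    private final Deque<String> processed = new ArrayDeque<>();

    ToyCodec(int inputBufferCount) {
        for (int i = 0; i < inputBufferCount; i++) freeInput.add(i);
    }

    /** Lends an input buffer index to the client, or TRY_AGAIN if none is free. */
    int dequeueInputBuffer() {
        Integer index = freeInput.poll();
        return index == null ? TRY_AGAIN : index;
    }

    /** Client returns a filled buffer; the codec "processes" it and reclaims the buffer. */
    void queueInputBuffer(int index, String frame) {
        processed.add("processed(" + frame + ")");
        freeInput.add(index);
    }

    /** Hands out one processed frame, or null if nothing is ready yet. */
    String dequeueOutput() {
        return processed.poll();
    }
}

class BufferQueueDemo {
    /** Well-behaved client: returns every buffer it borrows. Returns frames produced. */
    static int transcodeAll(ToyCodec codec, String[] frames) {
        int produced = 0;
        for (String frame : frames) {
            int index = codec.dequeueInputBuffer();
            if (index == ToyCodec.TRY_AGAIN) break; // a real client would retry later
            codec.queueInputBuffer(index, frame);
            while (codec.dequeueOutput() != null) produced++;
        }
        return produced;
    }

    /** Misbehaving client: keeps every buffer. Returns how many dequeues succeeded. */
    static int dequeuesBeforeStarvation(ToyCodec codec) {
        int count = 0;
        while (codec.dequeueInputBuffer() != ToyCodec.TRY_AGAIN) count++;
        return count;
    }
}
```

With a pool of two buffers, the well-behaved client can push any number of frames through, while the client that hoards buffers gets exactly two successful dequeues before every further attempt fails.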
Using MediaCodec for transcoding
To transcode, we will need two instances of MediaCodec: one running as a decoder, and another one as an encoder. The decoder consumes and decodes encoded source frames. For example, a video decoder would take an H.264 encoded video frame and decode it into pixels, while an audio decoder would decode a compressed AAC audio frame into an uncompressed PCM frame. Decoded frames are then consumed by an encoder to produce encoded frames in the desired target format. For example, video frames would be encoded using a video compression codec, such as H.264 or VP9. In some cases, a decoder’s output can be sent to the encoder directly. A good example of such a case is a compression bitrate change without modifying frame contents (for example, recompressing audio without mixing stereo channels down to a mono channel). In other cases, such as resizing video, a rendering layer must be introduced to transform decoder output into encoder input.
When working with video, we can configure MediaCodec to work with ByteBuffer or Surface as an input/output. ByteBuffer is used when access to raw pixels is needed and is generally slower, while Surface is faster, but does not provide direct access to pixels. Surface pixels can, however, be modified using OpenGL frame shaders.
Track transcoder architecture
LiTr uses Surface mode for video codecs and ByteBuffer mode for audio codecs. Video renderer uses OpenGL to resize frames (when the video resolution is changed). And since OpenGL gives us an ability to draw onto video frames, the video renderer has support for custom filters, allowing client apps to modify video frames using OpenGL shaders.
The same can be done when running codecs in ByteBuffer mode, except that instead of using OpenGL, all rendering and frame modifications have to be done in software. At a price of lower performance, this approach allows using software decoders or frame content-aware logic (ML filters, superscaling, etc.).
The transcoding process described above is how an individual track is transcoded. Source data is read using MediaExtractor, and target data is written out using MediaMuxer, both provided by the Android media stack. For each track type (video, audio, other), LiTr uses a specific track transcoder:
A video track transcoder can resize frames and change encoding bitrate. If necessary, it can also modify frame pixels using client-provided filters. It runs both encoder and decoder codecs in Surface mode, and uses OpenGL to render a decoder’s output onto an encoder’s input.
An audio track transcoder can only change bitrate (for now).
All non-video and non-audio frames are written out “as is,” using a passthrough track transcoder.
When transcoding, LiTr continuously iterates through all track transcoders until each track transcoder reports that it has completed its work. A track transcoder considers its work completed when a frame with the END_OF_STREAM flag has traveled through each transcoding step. When transcoding is completed, MediaMuxer is signaled to finalize the target media and MediaExtractor releases the source media.
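The iteration described above can be sketched as a simple round-robin loop. This is a plain Java illustration, not LiTr source code: ToyTrackTranscoder, CountingTrackTranscoder, and TranscodeLoop are hypothetical names, and a frame counter stands in for detecting the end-of-stream frame.

```java
import java.util.List;

// Hypothetical interface; LiTr's real track transcoder classes may look different.
interface ToyTrackTranscoder {
    /** Processes at most one frame and reports whether the track has finished. */
    boolean processNextFrame();
}

class CountingTrackTranscoder implements ToyTrackTranscoder {
    private int remainingFrames;

    CountingTrackTranscoder(int frameCount) {
        this.remainingFrames = frameCount;
    }

    @Override
    public boolean processNextFrame() {
        if (remainingFrames > 0) remainingFrames--;
        return remainingFrames == 0; // stands in for seeing the end-of-stream frame
    }
}

class TranscodeLoop {
    /** Visits every track on each pass until all report completion; returns pass count. */
    static int run(List<ToyTrackTranscoder> tracks) {
        int passes = 0;
        boolean allDone;
        do {
            allDone = true;
            for (ToyTrackTranscoder track : tracks) {
                // &= is not short-circuiting, so every track gets a turn on every pass.
                allDone &= track.processNextFrame();
            }
            passes++;
        } while (!allDone);
        return passes;
    }
}
```

In this model the loop runs for as many passes as the longest track has frames, and shorter tracks simply keep reporting that they are done.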
First, import LiTr into your Android app:
Then, instantiate MediaTransformer (the main entry point class) with a Context that has access to source/target media. Usually, that would be your app’s ApplicationContext.
MediaTransformer mediaTransformer = new MediaTransformer(getApplicationContext());
Now you can transform your media:
A few things to be aware of here:
A client must provide a unique String requestId, a token for a transcoding request. Since LiTr accepts multiple transcoding requests, it needs a way to identify each one of them.
The source video URI should be accessible from the Context used when instantiating MediaTransformer. Source track count and ordering are preserved when transcoding.
Video will be transcoded into H.264 and saved in an MP4 container at the provided file path.
Target Video and Audio formats are instances of Android’s MediaFormat with all your desired parameters set. The format will be applied to all tracks of that type. A null format means that tracks of that type will not be transcoded and will be written out “as is.”
A listener will be called with all transcoding updates: start, progress, completion, error, cancellation. A request token will be provided in each listener callback.
Granularity is the desired number of progress updates. The default value is 100 (to match showing percentage in UI). Passing in 0 will call back on every frame.
An optional list of GlFilters applies your custom modifications to video frames.
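To make the granularity parameter more concrete, here is one way such throttling could work. This is a guess at the behavior described above, not LiTr's actual implementation: ProgressThrottle and its methods are hypothetical, and progress is modeled as a frame count rather than whatever LiTr measures internally.

```java
// Hypothetical sketch of progress throttling; not taken from LiTr source code.
class ProgressThrottle {
    private final int granularity;
    private int lastStep = 0;

    ProgressThrottle(int granularity) {
        this.granularity = granularity;
    }

    /** Returns true if a progress callback should fire after this frame. */
    boolean shouldNotify(int framesDone, int totalFrames) {
        if (granularity == 0) return true; // 0 means "call back on every frame"
        // Which of the `granularity` equal progress steps has been reached so far.
        int step = (int) ((long) framesDone * granularity / totalFrames);
        if (step > lastStep) {
            lastStep = step;
            return true;
        }
        return false;
    }

    /** Simulates a full run and counts how many callbacks would fire. */
    static int countCallbacks(int granularity, int totalFrames) {
        ProgressThrottle throttle = new ProgressThrottle(granularity);
        int callbacks = 0;
        for (int done = 1; done <= totalFrames; done++) {
            if (throttle.shouldNotify(done, totalFrames)) callbacks++;
        }
        return callbacks;
    }
}
```

In this model, a 1,000-frame video with a granularity of 100 produces 100 callbacks (one per percent), while a granularity of 0 produces 1,000.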
Ongoing transcoding can be cancelled using the provided token:
LiTr also provides filter implementations in a companion “filter pack” library. If you want to use filters, import litr-filters library:
There are currently two filters in this library, a static bitmap overlay and a frame sequence animation overlay (such as animated GIFs). We are working on implementing more filters and welcome contributions.
If something goes wrong (MediaCodec initialization fails, Decoder errors out, etc.), MediaTransformer will not throw an exception. Instead, it will fail and call listener’s onError method with a custom exception, which the client can then analyze.
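The general pattern of reporting failures through a listener instead of throwing can be sketched as follows. The names here (ToyTransformationListener, ToyTransformer, RecordingListener) are hypothetical and much simpler than LiTr's real listener interface; the point is only that the caller never needs a try/catch around the transform call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified listener; LiTr's real listener has more callbacks.
interface ToyTransformationListener {
    void onCompleted(String requestId);
    void onError(String requestId, Throwable cause);
}

class ToyTransformer {
    /** Runs the job; failures are routed to the listener rather than thrown to the caller. */
    static void transform(String requestId, Runnable job, ToyTransformationListener listener) {
        try {
            job.run();
        } catch (RuntimeException e) {
            listener.onError(requestId, e);
            return;
        }
        listener.onCompleted(requestId);
    }
}

/** Listener that records events so the client can inspect them afterwards. */
class RecordingListener implements ToyTransformationListener {
    final List<String> events = new ArrayList<>();

    @Override
    public void onCompleted(String requestId) {
        events.add("completed:" + requestId);
    }

    @Override
    public void onError(String requestId, Throwable cause) {
        events.add("error:" + requestId + ":" + cause.getMessage());
    }
}
```

A client would pass its own listener and inspect the exception it receives in onError, using the request token to work out which transcoding request failed.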
Transformation completions may also contain detailed statistics (track metadata, transformation duration, etc.). These are meant to be used in the production environment for tracking or debugging purposes. Note that in the future, LiTr APIs and their behaviors may change, so they are used here mostly for illustration purposes.
“Low level” transformation API
Let’s take a step back and look at the transcoding process more conceptually. We will see that there are five distinct steps:
Reading encoded source data.
Decoding encoded source data.
Rendering decoder output onto encoder input.
Encoding rendered data.
Writing encoded target data.
Each step performs a certain function and has a well-defined interaction with the previous and/or next steps. LiTr abstracts each step in transcoding a video into an interface. We call each such interface a “component.” The abstraction gives clients a powerful ability to modify the transcoding process by plugging in their own component implementations without having to modify LiTr source code. For example, a custom MediaSource can be implemented to read data from a container that Android’s MediaExtractor does not support, or a custom encoder may introduce an ability to transcode into a codec not supported by encoder hardware (e.g., AV1).
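As a rough illustration of why this abstraction is useful, the five steps can be modeled as pluggable pieces wired into one loop. Everything below is a hypothetical, heavily simplified sketch: ToySource, ToyTarget, and ToyPipeline are not LiTr interfaces, frames are strings, and the decoder, renderer, and encoder are reduced to plain functions.

```java
import java.util.function.UnaryOperator;

// Hypothetical stand-ins for the "read" and "write" components.
interface ToySource {
    /** Returns the next encoded frame, or null when the source is exhausted. */
    String readFrame();
}

interface ToyTarget {
    void writeFrame(String frame);
}

class ToyPipeline {
    /** Pushes every source frame through decode, render, and encode, then writes it out. */
    static void run(ToySource source,
                    UnaryOperator<String> decoder,
                    UnaryOperator<String> renderer,
                    UnaryOperator<String> encoder,
                    ToyTarget target) {
        String frame;
        while ((frame = source.readFrame()) != null) {
            String decoded = decoder.apply(frame);
            String rendered = renderer.apply(decoded);
            target.writeFrame(encoder.apply(rendered));
        }
    }
}
```

Because each step is passed in, swapping one of them (say, a source that reads an unsupported container) does not require touching the loop or the other steps, which is the property the component design is after.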
A step-by-step overview of the transcoding process
Straight out of the box, LiTr provides default component implementations, which wrap Android’s MediaCodec classes. To pass in custom component implementations, clients should use the “low level” LiTr API:
Since this API gives a lot more control to a client, it is also more prone to breaking. Clients have to make sure that components can interact with each other successfully. For example, a MediaSource must produce encoded frames in the format that the Decoder expects, and an OpenGL Renderer must not be used with a Decoder and/or Encoder running in ByteBuffer mode.
Contributing to LiTr
LiTr is an open source project and we welcome contributions! Just fork the repo on GitHub, submit pull requests, or open an issue to let us know what new functionalities you’d like to see. We will continue active development on LiTr, but envision its development and evolution to become a community effort.
A huge thank you to Yuya Tanaka (AKA ypresto) for his pioneering of the android-transcoder project, which inspired many amazing Android media projects, including LiTr. A thank you to Google's AOSP CTS team for writing a Surface-to-Surface rendering implementation in OpenGL, which became a foundation for GlRenderer in LiTr. A shout out to my awesome colleagues Amita Sahasrabudhe, Long Peng, and Keerthi Korrapati, for their respective contributions and code reviews. A shout out to our designer Mauroof Ahmed for giving LiTr a visual identity.