
Tech Focus: Video Encoding in Vision Systems


Video encoding is the process of compressing and converting raw video data into a standardized digital format for efficient storage, transmission, and processing.

In computer vision applications such as inspection systems, smart surveillance cameras, and UAVs or ROVs, video input is often continuous and generates huge volumes of data. Managing this data efficiently hinges on video encoding, an indispensable component in capturing, transmitting, and interpreting video streams.

What is a Video Encoder?

A video encoder is a device or software tool that takes raw video data and compresses it into a smaller, more manageable format. This process makes the video easier to store, send over a network, or process further. Without encoding, video files would be extremely large and difficult to work with, especially in systems that need to handle real-time video or high-resolution feeds.

Video encoders can be built into hardware, for example in IP cameras, GPUs, or embedded systems, where fast and efficient processing is needed. They can also be implemented in software, which is more flexible and often used in computers or cloud systems.

In computer vision systems, video encoders are essential for managing video efficiently while still keeping the quality high enough for tasks like object detection, tracking, or inspection.

The encoder works by identifying and removing unnecessary information in the video. It does this both within a single frame (e.g. reducing fine details the human eye may not notice) and across multiple frames (by reusing parts of the image that haven’t changed). The result is a stream of compressed data that follows a specific video compression/decompression standard or “codec”.
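To make the inter-frame idea concrete, here is a minimal sketch (in Python with NumPy; the frames are synthetic stand-ins) that measures how much of one frame survives unchanged into the next, which is exactly the redundancy an encoder reuses:

```python
import numpy as np

def unchanged_fraction(prev_frame, next_frame, threshold=4):
    """Fraction of pixels that differ by no more than `threshold`.

    Inter-frame coding exploits this redundancy: regions that are (nearly)
    identical to the previous frame can be signalled cheaply instead of
    being encoded again from scratch.
    """
    diff = np.abs(prev_frame.astype(np.int16) - next_frame.astype(np.int16))
    return float(np.mean(diff <= threshold))

# Synthetic 8-bit grayscale frames: a static scene with one small moving patch.
prev_frame = np.full((1080, 1920), 128, dtype=np.uint8)
next_frame = prev_frame.copy()
next_frame[500:540, 900:960] += 50   # only this region changes

print(f"{unchanged_fraction(prev_frame, next_frame):.1%} of pixels unchanged")
```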

More About Codecs

A codec (coder-decoder) is the standard, or set of algorithms, that defines how video data is compressed and decompressed. When a video encoder processes raw video, it uses a specific codec to determine how the data should be transformed, quantized, and packed into a stream.

So, the video encoder is the tool, and the codec is the method it uses.

For example, an H.264 encoder follows the rules and compression techniques defined by the H.264 standard. These include how to split video into macroblocks, apply motion estimation, perform transform and quantization steps, and structure the bitstream. When the encoded video is later played or analyzed, a decoder that supports the same codec (i.e., an H.264 decoder) is required to reconstruct the video for display. PC processors (x86) also have built-in hardware for accelerating video decoding, since encoding and decoding in pure software can be slow; the more complex the codec, the more processing power it takes. The older standard, H.264, remains very popular because hardware acceleration of H.264 decoding is found in many processors and systems, so H.265 has been slow to gain traction. Similarly, the time taken to encode can be an issue; sometimes a ‘better quality’ codec simply takes too long to encode the video.
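As a concrete illustration, the sketch below shows how raw frames might be handed to an H.264 encoder using PyAV, a Python binding to FFmpeg (the file name, resolution, and synthetic frame contents are placeholders, not a recommended configuration):

```python
import av          # PyAV: Python bindings to FFmpeg's encoders and muxers
import numpy as np

container = av.open("output.mp4", mode="w")      # placeholder file name
stream = container.add_stream("h264", rate=30)   # H.264 encoder, 30 fps
stream.width, stream.height = 1920, 1080
stream.pix_fmt = "yuv420p"                       # common 4:2:0 pixel format

for i in range(90):                              # three seconds of dummy frames
    rgb = np.zeros((1080, 1920, 3), dtype=np.uint8)
    rgb[:, : (i * 20) % 1920] = 255              # simple moving bar as content
    frame = av.VideoFrame.from_ndarray(rgb, format="rgb24")
    for packet in stream.encode(frame):          # codec handles motion
        container.mux(packet)                    # estimation, GOP structure, etc.

for packet in stream.encode():                   # flush buffered frames
    container.mux(packet)
container.close()
```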

Different codecs have different benefits usually set by the goals the designers had in mind:
• MJPEG compresses each frame as a separate JPEG image; this is simpler, but produces larger files.
• H.264 and H.265 (HEVC) aim for high compression efficiency with good visual quality.
• AV1 and VVC offer state-of-the-art compression but require more processing power.

In computer vision, the choice of codec affects latency, image quality, hardware compatibility, and inference accuracy. For example, aggressive compression in H.265 might reduce bandwidth but blur fine details important for object detection, while MJPEG preserves clarity but uses more data.

How Does Video Encoding Work?

Raw video straight from a camera sensor is extremely large because every pixel of every frame is captured in an uncompressed format. For example, a single second of uncompressed 1080p video at 30 frames/sec requires 30 (frames) × 1920×1080 (resolution) × 2 (bytes per pixel, which varies depending on the pixel format) ≈ 124 MB. Encoding significantly reduces this size, ideally without too much reduction of visual quality or analytic usefulness.
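The same arithmetic, written out as a quick Python check (the resolutions and the 2-bytes-per-pixel figure, typical of formats such as YUV 4:2:2, are illustrative):

```python
def raw_rate_mb_per_s(width, height, fps, bytes_per_pixel):
    """Uncompressed video data rate in megabytes per second."""
    return width * height * fps * bytes_per_pixel / 1e6

print(raw_rate_mb_per_s(1920, 1080, 30, 2))   # 1080p30: ~124 MB/s
print(raw_rate_mb_per_s(3840, 2160, 60, 2))   # 4K60:    ~995 MB/s
```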

The process typically begins by dividing each frame into small blocks, often 8×8 or 16×16 pixels. These blocks are then transformed to shift the data from the spatial domain (pixel values) into the frequency domain. This transformation separates broad, smooth areas (low-frequency data) from sharp edges and fine detail (high-frequency data).

Next comes quantization, where the frequency data is simplified by reducing the precision of less important values. This is where most of the size reduction (and any loss of detail) occurs. In lossy encoding, small variations that won’t be easily seen by human viewers are discarded entirely. In lossless encoding, this step is omitted or done in a reversible way.

After quantization, the simplified data is further compressed using entropy coding techniques (such as Huffman or arithmetic coding), which store frequently occurring patterns more efficiently.
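A minimal sketch of these three steps applied to a single 8×8 block, in pure NumPy (the block contents and the quantization step size are arbitrary illustrative choices, not values from any real codec table):

```python
import numpy as np

N = 8
# Orthonormal 8x8 DCT-II matrix: rows are frequency basis vectors.
n, k = np.meshgrid(np.arange(N), np.arange(N))
C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
C[0, :] = np.sqrt(1.0 / N)

# A smooth 8x8 block (a brightness ramp), typical of low-detail image areas.
x = np.arange(N, dtype=np.float64)
block = 128.0 + 8.0 * np.add.outer(x, x)

# 1. Transform: spatial domain -> frequency domain.
coeffs = C @ block @ C.T

# 2. Quantization: coarsen the coefficients (this is where loss occurs).
step = 32.0                                   # illustrative step size only
quantized = np.round(coeffs / step)

# 3. Entropy coding works on the quantized values; for smooth blocks most
#    high-frequency coefficients become zero, which is cheap to store.
print("non-zero coefficients:", int(np.count_nonzero(quantized)), "of", N * N)

# A decoder reverses the steps: dequantize, then inverse transform.
reconstructed = C.T @ (quantized * step) @ C
print("max pixel error:", float(np.abs(block - reconstructed).max()))
```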

All of this data is then packaged into a bitstream that follows a specific format (also defined by the codec). As explained above, codecs define how the encoding steps are applied and how the bitstream is structured, ensuring that encoded video can be decoded by compatible software or hardware on the other end.

In practical terms, encoding lets video systems send high-resolution video over limited-bandwidth connections or store hours of footage in manageable amounts of disk space. For imaging engineers, it’s important to understand that encoding can affect image quality, latency, and even how well AI models perform if they rely on fine visual features.

Many codecs have parameters that can be set or altered by the user. These affect the features of the compression and the overall amount of data compression; for example, it is possible to improve the visual quality of moving objects in the video or reduce the size of the final bitstream. Therefore, many encoders can be tuned to the application at hand. However, engineers should bear in mind that higher image quality will always increase the data size, so tuning encoding parameters or choosing the right codec for the job is often a critical design decision.
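In practice such tuning is exposed as encoder settings like bitrate, quality factor, or speed preset. As a sketch (assuming FFmpeg is installed; `-crf` and `-preset` are real libx264 options, while the input file is a placeholder), the snippet below encodes the same clip at two quality levels so the size/quality trade-off can be measured:

```python
import subprocess
from pathlib import Path

# CRF (constant rate factor): lower = higher quality and a larger file.
# The preset trades encoding speed against compression efficiency.
for crf in (18, 28):
    out = Path(f"out_crf{crf}.mp4")
    subprocess.run(
        ["ffmpeg", "-y", "-i", "input.mp4",              # placeholder input
         "-c:v", "libx264", "-preset", "medium",
         "-crf", str(crf), str(out)],
        check=True,
    )
    print(out, "->", out.stat().st_size, "bytes")
```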

Lossy and Lossless Encoding

When choosing how to encode video, considering lossy versus lossless encoding is an important step, especially in machine vision and computer vision applications. Lossy encoding reduces video file size by permanently removing some data from the original video. Lossless encoding, on the other hand, compresses video without discarding any information. This is essential in applications where visual fidelity must be preserved for pixel-accurate analysis such as medical imaging, industrial inspection, or scientific research.

Lossy codecs include H.264, H.265, and AV1, although some of these can be configured to run in a lossless mode. Examples of lossless codecs include Motion JPEG Lossless, Apple QuickTime Animation (RLE), and the Autodesk Animator codec.
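For example, the widely used libx264 encoder has a genuinely lossless mode (a minimal sketch assuming FFmpeg is installed; `-qp 0` is a real libx264 option, and the file names are placeholders):

```python
import subprocess

# A quantizer parameter of 0 disables the lossy quantization step,
# giving mathematically lossless H.264 at the cost of a much larger file.
subprocess.run(
    ["ffmpeg", "-y", "-i", "input.mp4",   # placeholder input
     "-c:v", "libx264", "-qp", "0", "lossless.mkv"],
    check=True,
)
```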

Of course, lossless video files are usually significantly larger, which places higher demands on storage, network throughput, and processing. In some advanced systems, hybrid strategies can be used, for example, encoding regions of interest (ROI) in lossless or high-quality modes while compressing the rest of the frame with lossy methods.

H.264 and IP Cameras

Several key features of the H.264 standard are important when selecting autofocus-zoom cameras with IP output.

The compression efficiency reduces bandwidth and storage requirements, meaning that 24/7 operations can better handle the resultant volumes of data. The network abstraction layer, allowing H.264 video to be easily packetized, enables efficient streaming over IP networks. Plus, error resilience tools can help to maintain video quality, even on unstable networks.

As H.264 is compatible with a wider (and older) range of processor devices, and has lower hardware processing requirements, it is likely to remain a well-used standard for some time to come. For cameras without built-in IP encoders, additional processing hardware such as our Harrier IP Camera Interface Board can be added. This interface board is based on a powerful SoC processor that receives LVDS video and delivers a low-latency H.264 video stream over RTP.

HD-VLC™ as an Option to Cover Long Distances

High-Definition Visually Lossless CODEC (HD-VLC™) is an innovative technology developed by Semtech Corporation. It involves a unique codec which encodes/compresses HD video data to the same rate as standard definition video, i.e. 270 Mb/s or 540 Mb/s serial data rate. HD-VLC™ uses cost-effective coax or fiber optic cables and can transmit high-quality data over long distances (several hundred meters) without introducing any additional latency to the system.

Active Silicon has developed HD-VLC™ encoder and decoder hardware to enable long-reach, high-definition digital video transmission. This solution was originally designed to replace SDI video transmission over co-axial cable in pipe inspection systems but can also be applied in many other application areas. Find out more about this option in our white paper, “High-Definition Long-Reach Video Transmission”.

Focus on IP Camera Integration

One of the key advantages of using IP cameras is their ability to stream video over standard-bandwidth Ethernet infrastructure, making them easy and inexpensive to implement. IP cameras come equipped with dedicated hardware encoders capable of producing real-time compressed streams (such as H.264 or H.265); the compressed video can be transmitted and then stored directly to disk, saving computer processing power at the receiving end. When required, the video can be efficiently decoded and displayed using the hardware acceleration built into the PC processor or GPU.

Additionally, their support for standard protocols like RTSP, ONVIF, or HTTP enables easy configuration of the video stream and control of the camera over the existing network infrastructure.
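As an integration sketch, an RTSP stream from an IP camera can be opened like any other video source using OpenCV (the URL, credentials, and stream path below are placeholders; real cameras document their own RTSP paths):

```python
import cv2

# Placeholder RTSP URL; substitute the camera's documented address and path.
cap = cv2.VideoCapture("rtsp://user:password@192.168.1.10:554/stream1")

while True:
    ok, frame = cap.read()       # decoded BGR frame, ready for processing
    if not ok:
        break                    # stream ended or the network dropped
    cv2.imshow("IP camera", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```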

Our Harrier IP autofocus-zoom cameras offer significant benefits to applications in distributed vision systems across smart cities, logistics, and security domains. They are also popular in ROVs for land, air and subsea use.

However, while IP cameras promise ease of integration and scalability, they can introduce hidden complexities into a vision system. The latency, visual quality, and data rate of the video stream produced by the IP camera need to be acceptable for the end application, as does the load the stream will place on the network connection. The performance of any IP camera is only as good as its network connection, so in multi-camera or complex network systems it is best to test for consistent frame rates and latency across varied network conditions. There are also ways to minimize latency across the vision system; our technical note, “Obtaining the lowest latency from your Harrier AF-Zoom IP camera”, explains how to optimize an IP system.
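A simple starting point for such testing is to time frame arrivals at the receiver, as in this sketch (same placeholder RTSP URL as above; a real test would also vary network load and camera settings):

```python
import time
import cv2

cap = cv2.VideoCapture("rtsp://user:password@192.168.1.10:554/stream1")

timestamps = []
while len(timestamps) < 300:          # roughly ten seconds at 30 fps
    ok, _ = cap.read()
    if not ok:
        break
    timestamps.append(time.monotonic())
cap.release()

intervals = [b - a for a, b in zip(timestamps, timestamps[1:])]
if intervals:
    mean = sum(intervals) / len(intervals)
    print(f"mean frame interval: {mean * 1000:.1f} ms (~{1 / mean:.1f} fps); "
          f"worst: {max(intervals) * 1000:.1f} ms")
```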

Optimizing the Video Pipeline

For engineers working in computer vision, video encoding is a pivotal design element that influences performance, fidelity, and system architecture. From sensor interface to codec selection to stream handling, every decision in the encoding pipeline must be carefully considered. Understanding the nuances of video encoding unlocks both efficiency and reliability in next-generation vision platforms.

Active Silicon specializes in providing expert advice and tailored solutions for video capture and transmission. From guidance on selecting the right encoding technology to supplying reliable hardware, we offer a wide range of products and services to meet the unique demands of your application. Get in touch today.
