
Selecting and Designing with an Image Sensor: The Tradeoffs You'll Need to Master

By Brian Dipert
Embedded Vision Alliance
Senior Analyst

A diversity of image sensor options is available for your consideration, differentiated both in terms of their fundamental semiconductor process foundations and of their circuit (and filter, microlens and other supplemental) implementations. Understanding their respective strengths and shortcomings is critical to making an appropriate product selection for your next embedded vision system design.

The image sensor is a critical part of an embedded vision system, since it's at this initial stage of the hardware design that light photons transform into digital 1s and 0s for subsequent processing and analysis. It's not the first stage, mind you; ambient light (in some cases augmented by an integrated LED flash module or other auxiliary illumination source) must first pass through the optics subsystem. However, BDTI engineer Shehrzad Qureshi already tackled the optics topic with aplomb in his article Lens Distortion Correction.

The fundamental purpose of the image sensor, an increasingly prevalent semiconductor device found in digital still and video cameras, mobile phones and tablets, the bezels of laptop and all-in-one computers along with standalone displays, game console peripherals, and other systems, is to approximate the photon-collecting capabilities of the human eye's retina:

If your system only needs to work with black-and-white images, an elementary image sensor (often augmented by an infrared light-blocking filter) may suffice, mimicking the function of the retina's rod photoreceptors. However, many embedded vision systems benefit from (if not require) the capture of full-color frames. The retina contains three kinds of cones, each with a different photopsin chemical composition and therefore a different response curve when exposed to a full-spectrum color source.

With the exception of some color-blind individuals, for whom one or more sets of cones have diminished-to-completely absent function, most humans therefore have trichromatic vision (conversely, a few rare individuals possess four or more distinct cone types). The presence of a multi-color filter array on top of the image sensor's pixel array structure is intended to mimic the retina's L (long: red-dominant), M (medium: green-dominant) and S (short: blue-dominant) cone suite. Due in part to the spectral overlap between L and M cones, evolutionarily explained as enabling early hominids to discern both potential predators and prey in an overhead foliage canopy, the human visual system is disproportionately sensitive to detail in green-spectrum light. Therefore, the Bayer pattern filter array (named after Eastman Kodak's Bryce E. Bayer, its inventor) contains twice as many green filters as either red or blue:

Post-capture processing approximates full-spectrum data for a blue-filter pixel, for example, by interpolating the red- and green-filtered information gathered by nearest-neighbor pixels. Other multi-color combinations are also possible. More recent Kodak sensors, for example, supplement red, green and blue filters with panchromatic (clear) ones in various proportions and patterns, trading off resolution accuracy for improved low-light performance. Subtractive color (CYM, for cyan, yellow and magenta) filter combinations are also possible. And JVC's first-generation GR-HD1 (consumer) and JY-HD10 (professional) HDV camcorder even employed a hybrid complementary/primary matrix of clear, green, cyan, and yellow filters.
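To make the interpolation concrete, here is a minimal Python/NumPy sketch of an RGGB Bayer mosaic and a simple averaging demosaic. This is an illustrative toy, not any vendor's actual pipeline; production demosaic algorithms use far more sophisticated, edge-aware interpolation.

```python
import numpy as np

def make_bayer_mosaic(rgb):
    """Sample an RGB image through an RGGB Bayer filter array.

    Each pixel retains only one color channel: R at even-row/even-column
    sites, G at the two mixed sites (twice as many green samples, per the
    Bayer pattern), and B at odd-row/odd-column sites.
    """
    h, w, _ = rgb.shape
    mosaic = np.zeros((h, w))
    mosaic[0::2, 0::2] = rgb[0::2, 0::2, 0]  # red sites
    mosaic[0::2, 1::2] = rgb[0::2, 1::2, 1]  # green sites
    mosaic[1::2, 0::2] = rgb[1::2, 0::2, 1]  # green sites
    mosaic[1::2, 1::2] = rgb[1::2, 1::2, 2]  # blue sites
    return mosaic

def demosaic_bilinear(mosaic):
    """Approximate full RGB by averaging each channel's nearest samples."""
    h, w = mosaic.shape
    out = np.zeros((h, w, 3))
    masks = np.zeros((h, w, 3), dtype=bool)
    masks[0::2, 0::2, 0] = True  # red sample locations
    masks[0::2, 1::2, 1] = True  # green sample locations
    masks[1::2, 0::2, 1] = True
    masks[1::2, 1::2, 2] = True  # blue sample locations
    for c in range(3):
        chan = np.where(masks[:, :, c], mosaic, 0.0)
        count = masks[:, :, c].astype(float)
        # 3x3 box average of the available samples around each pixel
        # (np.roll wraps at the edges; good enough for a demonstration)
        ksum = np.zeros((h, w))
        kcnt = np.zeros((h, w))
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ksum += np.roll(np.roll(chan, dy, 0), dx, 1)
                kcnt += np.roll(np.roll(count, dy, 0), dx, 1)
        out[:, :, c] = ksum / np.maximum(kcnt, 1)
    return out
```

For a uniformly colored scene this simple average reconstructs the original exactly; the interpolation artifacts the text mentions appear at edges and fine textures, where neighboring samples disagree.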

The Foveon-championed X3 approach is also of interest in this regard (Foveon is now owned by Sigma, its long-standing primary camera customer). X3 sensors rely on the variable-depth absorption of various light frequencies within a semiconductor foundation. Successively located photodiodes capture the red, green and blue spectrum specifics within each pixel's surface-area dimensions:

leading to same-pixel-location capture of red, blue and green spectrum information with no need for interpolation-induced approximation (and consequent arguably-visible artifacts). X3 sensors tend to deliver lower pixel site counts than do conventional Bayer- or other-pattern filter array-based sensors, but supporters claim that their more accurate 'film'-like reproduction of captured images more than makes up for any raw resolution shortcoming.

CCDs versus CMOS image sensors

Until relatively recently, the CCD (charge-coupled device) was the conventional silicon foundation for image sensors, and it remains the technology of choice for some applications. Electron charge packets accumulate in potential wells; CCDs' 'global shutter' approach terminates this accumulation for all pixels at the same point in time, with charge values sequentially read out of the device in a serial manner and converted from the analog to digital domain via external circuitry:

CCDs still find use in ultra-high-resolution applications, and in ultra-low-light environments such as astrophotography. However, they are also power-hungry and require custom semiconductor processing that has proven increasingly expensive versus the conventional bulk CMOS alternative, and that also limits the amount of beyond-sensor circuitry that can be integrated on a single sliver of silicon. Conversely, CMOS image sensors (as is the case with solar cells) can be manufactured in mature, already-amortized semiconductor fabs that previously produced leading-edge RAMs, FPGAs, processors, and the like.

Therefore, the ascendant CMOS sensor quickly achieved widespread adoption once the resolution it could cost-effectively deliver became acceptable for volume applications. As Wikipedia concisely notes:

APS pixels solve the speed and scalability issues of the passive-pixel sensor. They generally consume less power than CCDs, have less image lag, and require less specialized manufacturing facilities. Unlike CCDs, APS [active-pixel] sensors can combine the image sensor function and image processing functions within the same integrated circuit; CMOS-type APS sensors are typically suited to applications in which packaging, power management, and on-chip processing are important.

CMOS sensors' pixel locations are capable of being randomly accessed. The sensors commonly come in two-, three- and four-transistor per-pixel circuit configurations:

Two-Transistor Circuit

Three-Transistor Circuit

While 'global shutter' CMOS sensors have been demonstrated in prototype form, the 'rolling shutter' alternative architecture is dominant (if not universal) in production devices. With it, each pixel row's worth of data transfers sequentially to a buffer register prior to pixel line reset; the buffered (and subsequently A/D-converted) information then passes to the system processor via external I/O cycles. The advantage of this line-scan approach is reduced silicon overhead versus the 'global shutter' alternative, which would require an incremental multi-transistor and light-blocked circuit structure within each pixel to store the accumulated-photon value. The rolling shutter downside, on the other hand, is that different regions of the sensor capture the image at different points in time, leading to objectionable distortion artifacts with fast-moving subjects:




Partial Exposure
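The row-by-row timing that produces such artifacts can be demonstrated with a toy simulation. The scene, dimensions, and one-row-per-time-unit line delay below are arbitrary illustrative choices, exaggerated so the shear is obvious:

```python
import numpy as np

HEIGHT, WIDTH = 8, 16

def scene_at(t):
    """A 2-pixel-wide vertical bar, moving right at 1 pixel per time unit."""
    frame = np.zeros((HEIGHT, WIDTH))
    left = int(t) % WIDTH
    frame[:, left:left + 2] = 1.0
    return frame

def global_shutter(t0):
    # Every row is captured at the same instant, as in a CCD:
    # the moving bar stays vertical.
    return scene_at(t0)

def rolling_shutter(t0, line_delay=1):
    # Row r is exposed and read out at t0 + r * line_delay,
    # so a horizontally moving bar appears sheared (skewed).
    return np.array([scene_at(t0 + r * line_delay)[r] for r in range(HEIGHT)])
```

In the rolling-shutter frame, each successive row "sees" the bar one pixel further right, producing the diagonal skew familiar from photographs of fast-moving vehicles and propellers.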

Pixel pitch issues

Large-pixel conventional image sensors often require anti-aliasing filters ahead of them. The anti-aliasing filter serves a purpose analogous to its audio-processing counterpart; it slightly "blurs" the image striking the sensor in order to suppress scene detail at spatial frequencies above the sensor's Nyquist limit of 1/(2 × pixel pitch), which the optics would otherwise pass through. Aliasing often appears as a moiré pattern in image regions containing high-frequency repetition, such as window screens and tight texture patterns. Alternative pixel structures such as the earlier-mentioned Foveon sensor have less need for resolution-reducing anti-aliasing, as do conventional sensors as their individual pixels decrease in size.
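The Nyquist limit is easy to compute from the pixel pitch. A small helper (the function name is ours, and the 1.4 µm pitch used in the usage note is a hypothetical, smartphone-class figure):

```python
def nyquist_lp_per_mm(pixel_pitch_um):
    """Sensor Nyquist limit, in line pairs per millimeter.

    Scene detail finer than 1/(2 * pixel pitch) cannot be represented by
    the pixel grid, and aliases into moire patterns unless an
    anti-aliasing filter attenuates it first.
    """
    pitch_mm = pixel_pitch_um / 1000.0
    return 1.0 / (2.0 * pitch_mm)
```

For example, a hypothetical 1.4 µm pixel yields a Nyquist limit of about 357 line pairs/mm, while a larger 5 µm pixel yields only 100 lp/mm, which is why large-pixel sensors more often need an optical anti-aliasing filter in front of them.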

Moore's Law-driven pixel dimension shrinks enable cost-effective delivery of increasing image resolution over time. However, they also degrade the sensor's low-light sensitivity by constricting each pixel's ability to collect sufficient photons in a given amount of time. This undesirable tradeoff is particularly evident in low-fill-factor designs, in which the photodiode comprises a small percentage of each pixel's total surface area:


As partial compensation, manufacturers often place a microlens array on top of the sensor:

By "bending" the light as it strikes the sensor, each microlens enables its associated photodiode to capture more photon information:

Future articles in this series will discuss methods of compensating for various image sensor limitations (low-light SNR issues, de-mosaic interpolation artifacts, micro-lens-induced color filter errors, etc.), as well as the image processor (the next notable link in the embedded vision chain) and the various available schemes for interconnecting the sensor and processor.

Implementing Vision Capabilities in Embedded Systems


by Jeff Bier
Founder and President, BDTI
September 29, 2011

This paper was originally published at the 2011 Embedded Systems Conference Boston.

Abstract—With the emergence of increasingly capable processors, it’s becoming practical to incorporate computer vision capabilities into a wide range of embedded systems, enabling them to analyze their environments via video inputs. Products like Microsoft’s Kinect game controller and Mobileye’s driver assistance systems are raising awareness of the incredible potential of embedded vision technology. As a result, many embedded system designers are beginning to think about implementing embedded vision capabilities. In this presentation, we’ll explore the potential of embedded vision and introduce some of the key ingredients for implementing it. After examining some example applications, we’ll introduce processors, algorithms, tools, and techniques for implementing embedded vision.


We use the term “embedded vision” to refer to the use of computer vision technology in embedded systems. Stated another way, “embedded vision” refers to embedded systems that extract meaning from visual inputs. Similar to the way that wireless communication has become pervasive over the past 10 years, we believe that embedded vision technology will be very widely deployed in the next 10 years.

It’s clear that embedded vision technology can bring huge value to a vast range of applications. Two examples are Mobileye’s vision-based driver assistance systems, intended to help prevent motor vehicle accidents, and MG International’s swimming pool safety system, which helps prevent swimmers from drowning. And for sheer geek appeal, it’s hard to beat Intellectual Ventures’ laser mosquito zapper, designed to prevent people from contracting malaria.

Just as high-speed wireless connectivity began as an exotic, costly technology, embedded vision technology has so far typically been found in complex, expensive systems, such as a surgical robot for hair transplantation and quality control inspection systems for manufacturing.

Advances in digital integrated circuits were critical in enabling high-speed wireless technology to evolve from exotic to mainstream. When chips got fast enough, inexpensive enough, and energy efficient enough, high-speed wireless became a mass-market technology. Today one can buy a broadband wireless modem for under $100.

Similarly, advances in digital chips are now paving the way for the proliferation of embedded vision into high-volume applications. Like wireless communication, embedded vision requires lots of processing power—particularly as applications increasingly adopt high-resolution cameras and make use of multiple cameras. Providing that processing power at a cost low enough to enable mass adoption is a big challenge. This challenge is multiplied by the fact that embedded vision applications require a high degree of programmability. In contrast to wireless applications where standards mean that, for example, algorithms don’t vary dramatically from one cell phone handset to another, in embedded vision applications there are great opportunities to get better results—and enable valuable features—through unique algorithms.

With embedded vision, we believe that the industry is entering a “virtuous circle” of the sort that has characterized many other digital signal processing application domains. Although there are few chips dedicated to embedded vision applications today, these applications are increasingly adopting high-performance, cost-effective processing chips developed for other applications, including DSPs, CPUs, FPGAs, and GPUs. As these chips continue to deliver more programmable performance per dollar and per watt, they will enable the creation of more high-volume embedded vision products. Those high-volume applications, in turn, will attract more attention from silicon providers, who will deliver even better performance, efficiency, and programmability.


Computer vision research has its origins in the 1960s. In more recent decades, embedded computer vision systems have been deployed in niche applications such as target-tracking for missiles, and automated inspection for manufacturing plants. Now, as lower-cost, lower-power, and higher-performance processors emerge, embedded vision is beginning to appear in high-volume applications. Perhaps the most visible of these is the Microsoft Kinect, a peripheral for the Xbox 360 game console that uses embedded vision to enable users to control video games simply by gesturing and moving their bodies. Another example of an emerging high-volume embedded vision application is automotive safety systems based on vision. A few automakers, such as Volvo, have begun to install vision-based safety systems in certain models. These systems perform a variety of functions, including warning the driver (and in some cases applying the brakes) when a forward collision is imminent, or when a pedestrian is in danger of being struck. A third example of an emerging high-volume embedded vision application is “smart” surveillance cameras, which are cameras with the ability to detect certain kinds of activity. For example, the Archerfish Solo, a consumer-oriented smart surveillance camera, can be programmed to detect people, vehicles, or other motion in user-selected regions of the camera’s field of view.

Enabled by the same kinds of chips and algorithms powering the above examples, we expect embedded vision functionality to proliferate into a wide range of products in the next few years. There are obvious places where vision can add tremendous value to equipment in consumer electronics, automotive, entertainment, medical, and retail applications, among others. In other cases, embedded vision will enable the creation of new types of equipment.

The purpose of this paper is to introduce some of the practical aspects of embedded vision technology—and to inspire system designers to imagine what can be done by incorporating vision capabilities into their designs.


Algorithms are the essence of embedded vision. Through algorithms, visual input in the form of raw video or images is transformed into meaningful information that can be acted upon.

Computer vision has been the subject of vibrant academic research for decades, and that research has yielded a deep reservoir of algorithms. For many system designers seeking to implement vision capabilities, the challenge at the algorithm level will not be inventing new algorithms, but rather selecting the best existing algorithms for the task at hand, and refining or tuning them to the specific requirements and conditions of that task.

The algorithms that are applicable depend on the nature of the vision processing being performed. Vision applications are generally constructed from a pipelined sequence of algorithms, as shown in Figure 1. Typically, the initial stages are concerned with improving the quality of the image. For example, this may include correcting geometric distortion created by imperfect lenses, enhancing contrast, and stabilizing images to compensate for undesired movement of the camera.

Figure 1. A typical embedded vision algorithm pipeline.

The second set of stages in a typical embedded vision algorithm pipeline are concerned with converting raw images (i.e., collections of pixels) into information about objects. A wide variety of techniques can be used, identifying objects based on edges, motion, color, size, or other attributes.

The final set of stages in a typical embedded vision algorithm pipeline are concerned with making inferences about objects. For example, in an automotive safety application, these algorithms would attempt to distinguish between vehicles, pedestrians, road signs, and other features of the scene.
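The three stages just described can be sketched as a chain of functions. The stage bodies below are placeholders (the detection record and "pedestrian" label are invented for illustration); the point is only the pipelined structure:

```python
def enhance(frame):
    """Stage 1: improve image quality, e.g. lens distortion correction,
    contrast enhancement, stabilization (placeholder: passes through)."""
    return frame

def detect_objects(frame):
    """Stage 2: convert pixels into objects, via edges, motion, color,
    size, or other attributes (placeholder: one fabricated detection)."""
    return [{"bbox": (0, 0, 10, 10), "kind": "unknown"}]

def infer(objects):
    """Stage 3: make inferences about objects, e.g. distinguish vehicles
    from pedestrians (placeholder: labels everything 'pedestrian')."""
    return [{**obj, "label": "pedestrian"} for obj in objects]

def vision_pipeline(frame):
    # The stages run in sequence, each consuming the prior stage's output.
    return infer(detect_objects(enhance(frame)))
```

The structure matters for implementation: each stage has a different ratio of computation to data volume, which is why (as discussed later) the stages often map onto different processor types.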

Generally speaking, vision algorithms are very computationally demanding, since they involve applying complex computations to large amounts of video or image data in real-time. There is typically a trade-off between the robustness of the algorithm and the amount of computation required.

A. Algorithm Example: Lens Distortion Correction

Lenses, especially inexpensive ones, tend to introduce geometric distortion into images. This distortion is typically characterized as “barrel” distortion or “pincushion” distortion, as illustrated in Figure 2.

Figure 2. Typical lens distortion.
(based on “Lens Distortion Correction” by Shehrzad Qureshi; used with permission)

As shown in the figure, this kind of distortion causes lines that are in fact straight to appear curved, and vice-versa. This can thwart vision algorithms. Hence, it is common to apply an algorithm to reverse this distortion.

The usual technique is to use a known test pattern to characterize the distortion. From this characterization data, a set of image warping coefficients is generated, which is subsequently used to “undistort” each frame. In other words, the warping coefficients are computed once, and then applied to each frame. This is illustrated in Figure 3.

Figure 3. Lens distortion correction scheme.

One complication that arises with lens distortion correction is that the warping operation will use input data corresponding to pixel locations that do not precisely align with the actual pixel locations in the input frame. To enable this to work, interpolation is used between pixels in the input frame. The more demanding the application, the more precise the interpolation must be—and the more computation it requires.
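A sketch of that interpolation step, assuming a precomputed warp map that stores, for each output pixel, the fractional source coordinates to sample (the function names are ours, and a production implementation would vectorize this loop or use a library routine such as an image-remap primitive):

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Sample a grayscale image at fractional (x, y) by blending the four
    surrounding pixels, weighted by proximity."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1 = min(x0 + 1, img.shape[1] - 1)
    y1 = min(y0 + 1, img.shape[0] - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * img[y0, x0] + fx * img[y0, x1]
    bot = (1 - fx) * img[y1, x0] + fx * img[y1, x1]
    return (1 - fy) * top + fy * bot

def undistort(img, warp_map):
    """Apply precomputed warp coefficients: warp_map[r, c] holds the
    (x, y) source location whose value lands at output pixel (r, c).
    The map is computed once from the test-pattern characterization,
    then reused for every frame."""
    h, w = img.shape
    out = np.zeros_like(img, dtype=float)
    for r in range(h):
        for c in range(w):
            sx, sy = warp_map[r, c]
            out[r, c] = bilinear_sample(img, sx, sy)
    return out
```

Higher-quality correction substitutes bicubic or Lanczos interpolation for the bilinear blend shown here, at correspondingly higher computational cost.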

For color imaging, the interpolation and warping operations must be performed separately on each color component. For example, a 720p video frame comprises 921,600 pixels, or approximately 2.8 million color components. At 60 frames per second, this corresponds to about 166 million color components per second. If the interpolation and warping operations require 10 processing operations per pixel, the distortion correction algorithm will consume 1.66 billion operations per second. (And that’s before we’ve even started trying to interpret the content of the images!)
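The arithmetic above can be reproduced directly (the function name and parameter defaults are ours, chosen to match the example in the text):

```python
def distortion_correction_ops_per_sec(width=1280, height=720, fps=60,
                                      components_per_pixel=3,
                                      ops_per_component=10):
    """Back-of-envelope compute load for per-frame warping."""
    pixels = width * height                     # 921,600 for 720p
    components = pixels * components_per_pixel  # ~2.8 million per frame
    per_second = components * fps               # ~166 million per second
    return per_second * ops_per_component       # ~1.66 billion ops/sec
```

Changing the defaults shows how quickly the budget grows: moving to 1080p at the same frame rate more than doubles the operation count, before any analysis of image content has begun.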

B. Algorithm Example: Dense Optical Flow

“Optical flow” is a family of techniques used to estimate the pattern of apparent motion of objects, surfaces, and edges in a video sequence. In vision applications, optical flow is often used to estimate observer and object positions and motion in 3-d space, or to estimate image registration for super-resolution and noise reduction algorithms. Optical flow algorithms typically generate a motion vector for each pixel of a video frame.

Optical flow requires making some assumptions about the video content, because the motion observable through any small window is inherently ambiguous (the so-called “aperture problem”). Different algorithms make different assumptions. For example, some algorithms may assume that illumination is constant across the scene, or that motion is smooth.

Many optical flow algorithms exist. They can be roughly divided into the following classes:

  • Block-based methods (similar to motion estimation in video compression codecs)
  • Differential methods (Lucas-Kanade, Horn-Schunck, Buxton-Buxton, and variations)
  • Other methods (discrete optimization, phase correlation)

A key challenge with optical flow algorithms is aliasing, which can cause incorrect results, for example when an object in the scene has a repeating texture pattern, or when motion exceeds algorithmic constraints. Some optical flow algorithms are sensitive to camera noise. Most optical flow algorithms are computationally intensive.

One popular approach is the Lucas-Kanade method with image pyramid. The Lucas-Kanade method is a differential method of estimating optical flow; it is simple but has significant limitations. For example, it assumes constant illumination and constant motion in a small neighborhood around the pixel position of interest. And, it is limited to very small velocity vectors (less than one pixel per frame).

Image pyramids are a technique to extend Lucas-Kanade to support faster motion. First, each original frame is sub-sampled to different degrees to create several pyramid levels. The Lucas-Kanade method is used at the top level (lowest resolution) yielding a coarse estimate, but supporting greater motion. Lucas-Kanade is then used again at lower levels (higher resolution) to refine the optical flow estimate. This is summarized in Figure 4.

Figure 4. Lucas-Kanade optical flow algorithm with image pyramid. Used by permission of and © Julien Marzat.
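A minimal, single-window Lucas-Kanade estimate can be written in a few lines of NumPy. This toy solves the brightness-constancy least-squares problem once for an entire patch; a real implementation computes a vector per pixel over a sliding neighborhood and wraps the pyramid iteration described above around it:

```python
import numpy as np

def lucas_kanade_window(prev, curr):
    """Estimate one (vx, vy) motion vector for a small window.

    Solves the brightness-constancy equation Ix*vx + Iy*vy = -It in the
    least-squares sense over all pixels in the window, as in the
    Lucas-Kanade differential method.
    """
    Iy, Ix = np.gradient(prev.astype(float))   # spatial derivatives
    It = curr.astype(float) - prev.astype(float)  # temporal derivative
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    b = -It.ravel()
    v, *_ = np.linalg.lstsq(A, b, rcond=None)
    return v  # (vx, vy) in pixels per frame
```

Consistent with the limitation noted above, the estimate is only reliable for sub-pixel displacements of a smoothly varying scene; the pyramid exists precisely to bring larger motions back into that regime at coarser resolutions.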

C. Algorithm Example: Pedestrian Detection

“Pedestrian detection” here refers to detecting the presence of people standing or walking, as illustrated in Figure 5. Pedestrian detection might more aptly be called an “application” rather than an “algorithm.” It is a complex problem requiring sophisticated algorithms.

Figure 5. Prototype pedestrian detection application implemented on a CPU and an FPGA.

In Figure 6 we briefly summarize a prototype implementation of a stationary-camera pedestrian detection system implemented using a combination of a CPU and an FPGA.

Figure 6: Block diagram of proof-of-concept pedestrian detection application using an FPGA and a CPU.

In the figure, the Pre-processing block comprises operations such as scaling and noise reduction, intended to improve the quality of the image. The Image Analysis block incorporates motion detection, pixel statistics such as averages, color information, edge information, etc. At this stage of processing, the image is divided into small blocks. The object segmentation step groups blocks having similar statistics and thus creates an object. The statistics used for this purpose are based on user-defined features specified in the hardware configuration file.

The Identification and Meta Data generation block generates analysis results from the identified objects such as location, size, color information, and statistical information. It puts the analysis results into a structured data format and transmits them to the CPU.

Finally, the On-screen Display block receives command information from the host and superimposes graphics on the video image for display.

This prototype system, operating on 720p resolution video at 60 frames per second, was implemented by BDTI on a combination of a Xilinx Spartan-3A DSP 3400 FPGA and a Texas Instruments OMAP3430 CPU. The total compute load is on the order of hundreds of billions of operations per second.


As we’ve mentioned, vision algorithms typically require high compute performance. And, of course, embedded systems of all kinds are usually required to fit into tight cost and power consumption envelopes. In other digital-signal-processing application domains, such as digital wireless communications, chip designers achieve this challenging combination of high performance, low cost, and low power by using specialized coprocessors and accelerators to implement the most demanding processing tasks in the application. These coprocessors and accelerators are typically not programmable by the chip user, however. This trade-off is often acceptable in wireless applications, where standards mean that there is strong commonality among algorithms used by different equipment designers.

In vision applications, however, there are no standards constraining the choice of algorithms. On the contrary, there are often many approaches to choose from to solve a particular vision problem. Therefore, vision algorithms are very diverse, and tend to change fairly rapidly over time. As a result, the use of non-programmable accelerators and coprocessors is less attractive for vision applications compared to applications like digital wireless and compression-centric consumer video equipment.

Achieving the combination of high performance, low cost, low power, and programmability is challenging. Special-purpose hardware typically achieves high performance at low cost, but with little programmability. General-purpose CPUs provide programmability, but with weak performance or poor cost- and energy-efficiency.

Demanding embedded vision applications most often use a combination of processing elements, which might include, for example:

  • A general-purpose CPU for heuristics, complex decision-making, network access, user interface, storage management, and overall control
  • A high-performance DSP-oriented processor for real-time, moderate-rate processing with moderately complex algorithms
  • One or more highly parallel engines for pixel-rate processing with simple algorithms

While any processor can in theory be used for embedded vision, the most promising types today are:

  • High-performance embedded CPU
  • Application-specific standard product (ASSP) in combination with a CPU
  • Graphics processing unit (GPU) with a CPU
  • DSP processor with accelerator(s) and a CPU
  • Mobile “application processor”
  • Field programmable gate array (FPGA) with a CPU

In this section, we’ll briefly introduce each of these processor types and some of their key strengths and weaknesses for embedded vision applications.

A. High-performance Embedded CPU

In many cases, embedded CPUs cannot provide enough performance—or cannot do so at acceptable price or power consumption levels—to implement demanding vision algorithms. Often, memory bandwidth is a key performance bottleneck, since vision algorithms typically use large amounts of memory bandwidth, and don’t tend to repeatedly access the same data. The memory systems of embedded CPUs are not designed for these kinds of data flows. However, like most types of processors, embedded CPUs become more powerful over time, and in some cases can provide adequate performance.

And there are some compelling reasons to run vision algorithms on a CPU when possible. First, most embedded systems need a CPU for a variety of functions. If the required vision functionality can be implemented using that CPU, then the complexity of the system is reduced relative to a multiprocessor solution.

In addition, most vision algorithms are initially developed on PCs using general-purpose CPUs and their associated software development tools. Similarities between PC CPUs and embedded CPUs (and their associated tools) mean that it is typically easier to create embedded implementations of vision algorithms on embedded CPUs compared to other kinds of embedded vision processors.

Finally, embedded CPUs are typically the easiest to use compared to other kinds of embedded vision processors, thanks to their relatively straightforward architectures, sophisticated tools, and other application development infrastructure, such as operating systems.

An example of an embedded CPU is the Intel Atom E660T.

B. Application-specific standard product (ASSP) in combination with a CPU

Application-specific standard products (ASSPs) are specialized, highly integrated chips tailored for specific applications or application sets. ASSPs may incorporate a CPU, or use a separate CPU chip.

By virtue of specialization, ASSPs typically deliver superior cost- and energy-efficiency compared with other types of processing solutions. Among other techniques, ASSPs deliver this efficiency through the use of specialized coprocessors and accelerators. And, because ASSPs are by definition focused on a specific application, they are usually provided with extensive application software.

The specialization that enables ASSPs to achieve strong efficiency, however, also leads to their key limitation: lack of flexibility. An ASSP designed for one application is typically not suitable for another application, even one that is related to the target application. ASSPs use unique architectures, and this can make programming them more difficult than with other kinds of processors. Indeed, some ASSPs are not user-programmable.

Another consideration is risk. ASSPs often are delivered by small suppliers, and this may increase the risk that there will be difficulty in supplying the chip, or in delivering successor products that enable system designers to upgrade their designs without having to start from scratch.

An example of a vision-oriented ASSP is the PrimeSense PS1080-A2, used in the Microsoft Kinect.

C. Graphics processing unit (GPU) with a CPU

Graphics processing units (GPUs), intended mainly for 3-d graphics, are increasingly capable of being used for other functions, including vision applications. The GPUs used in personal computers today are explicitly intended to be programmable to perform functions other than 3-d graphics. Such GPUs are termed “general-purpose GPUs” or “GPGPUs.”

GPUs have massive parallel processing horsepower. They are ubiquitous in personal computers. GPU software development tools are readily and freely available, and getting started with GPGPU programming is not terribly complex. For these reasons, GPUs are often the parallel processing engines of first resort of computer vision algorithm developers who develop their algorithms on PCs, and then may need to accelerate execution of their algorithms for simulation or prototyping purposes.

GPUs are tightly integrated with general-purpose CPUs, sometimes on the same chip. However, one of the limitations of GPU chips is the limited variety of CPUs with which they are currently integrated, and the limited number of CPU operating systems that support that integration.

Today there are low-cost, low-power GPUs designed for products like smart phones and tablets. However, these GPUs are generally not GPGPUs, and therefore using them for applications other than 3-d graphics is very challenging.

An example of a GPGPU used in personal computers is the NVIDIA GT240.

D. DSP processor with accelerator(s) and a CPU

Digital signal processors (“DSP processors” or “DSPs”) are microprocessors specialized for signal processing algorithms and applications. This specialization typically makes DSPs more efficient than general-purpose CPUs for the kinds of signal processing tasks that are at the heart of vision applications. In addition, DSPs are relatively mature and easy to use compared to other kinds of parallel processors.

Unfortunately, while DSPs do deliver higher performance and efficiency than general-purpose CPUs on vision algorithms, they often fail to deliver sufficient performance for demanding algorithms. For this reason, DSPs are often supplemented with one or more coprocessors. A typical DSP chip for vision applications therefore comprises a CPU, a DSP, and multiple coprocessors. This heterogeneous combination can yield excellent performance and efficiency, but can also be difficult to program. Indeed, DSP vendors typically do not enable users to program the coprocessors; rather, the coprocessors run software function libraries developed by the chip supplier.

An example of a DSP targeting video applications is the Texas Instruments DM8168.

E. Mobile “application processor”

A mobile “application processor” is a highly integrated system-on-chip, typically designed primarily for smart phones but also used in other applications. Application processors typically comprise a high-performance CPU core and a constellation of specialized co-processors, which may include a DSP, a GPU, a video processing unit (VPU), a 2D graphics processor, an image acquisition processor, etc.

These chips are specifically designed for battery powered applications, and therefore place a premium on energy efficiency. In addition, because of the growing importance of and activity surrounding smartphone and tablet applications, mobile application processors often have strong software development infrastructure, including low-cost development boards, Linux and Android ports, etc.

However, as with the DSP processors discussed in the previous section, the specialized co-processors found in application processors are usually not user-programmable, which limits their utility for vision applications.

An example of a mobile application processor is the Freescale i.MX53.

F. Field programmable gate array (FPGA) with a CPU

Field programmable gate arrays (“FPGAs”) are flexible logic chips that can be reconfigured at the gate and block levels. This flexibility enables the user to craft computation structures that are tailored to the application at hand. It also allows selection of I/O interfaces and on-chip peripherals matched to the application requirements. The ability to customize compute structures, coupled with the massive amount of resources available in modern FPGAs, yields high performance coupled with good cost- and energy-efficiency.

However, using FPGAs is essentially a hardware design function, rather than a software development activity. FPGA design is typically performed using hardware description languages (Verilog or VHDL) at the register transfer level (RTL)—a very low level of abstraction. This makes FPGA design time-consuming and expensive, compared to using the other types of processors discussed here.

However, using FPGAs is getting easier, due to several factors. First, so-called “IP block” libraries—libraries of reusable FPGA design components—are becoming increasingly capable. In some cases, these libraries directly address vision algorithms. In other cases, they enable supporting functionality, such as video I/O ports or line buffers. Second, FPGA suppliers and their partners increasingly offer reference designs—reusable system designs incorporating FPGAs and targeting specific applications. Third, high-level synthesis tools, which enable designers to implement vision and other algorithms in FPGAs using high-level languages, are increasingly effective.

Relatively low-performance CPUs can be implemented by users in the FPGA. In a few cases, high-performance CPUs are integrated into FPGAs by the manufacturer.

An example FPGA that can be used for vision applications is the Xilinx Spartan-6 LX150T.


Developing embedded vision systems is challenging. One consideration, already mentioned above, is that vision algorithms tend to be very computationally demanding. Squeezing them into low-cost, low-power processors typically requires significant optimization work, which in turn requires a deep understanding of the target processor architecture.

Another key consideration is that vision is a system-level problem. That is, success depends on numerous elements working together, besides the vision algorithms themselves. These include lighting, optics, image sensors, image pre-processing, and image storage sub-systems. Getting these diverse elements working together effectively and efficiently requires multi-disciplinary expertise.

There are numerous algorithms available for vision functions, so in many cases it is not necessary to develop algorithms from scratch. But picking the best algorithm for the job, and ensuring that it meets application requirements, can be a large project in itself.

Today, there are many computer vision experts who know little about embedded systems, and many embedded system designers who know little about computer vision. Many projects die in the chasm between these groups. To help bridge this gap, BDTI recently founded the Embedded Vision Alliance [1], an industry partnership dedicated to providing SoC and embedded system engineers with practical know-how they need to incorporate vision capabilities into their designs. The Alliance’s web site, www.Embedded-Vision.com, is growing rapidly with video seminars, technical articles, coverage of industry news, and discussion forums. For example, the site offers a free set of basic computer vision demonstration programs that can be downloaded and run on any Windows computer. [2]

A. Personal Computers

The personal computer is both a blessing and a curse for embedded vision development. Most embedded vision systems—and virtually all vision algorithms—are initially developed on a personal computer. The PC is a fabulous platform for research and prototyping. It is inexpensive, ubiquitous, and easy to integrate with cameras and displays. In addition, PCs are endowed with extensive application development infrastructure, including basic software development tools, vision-specific software component libraries, domain-specific tools (such as MATLAB), and example applications. In addition, the GPUs found in most PCs can be used to provide parallel processing acceleration for PC-based application prototypes or simulations.

However, the PC is not an ideal platform for implementing most embedded vision systems. Although some applications can be implemented on an embedded PC (a more compact, lower-power cousin to the standard PC), many cannot, due to cost, size, and power considerations. In addition, PCs lack sufficient performance for many real-time vision applications.

And, unfortunately, many of the same tools and libraries that make it easy to develop vision algorithms and applications on the PC also make it difficult to create efficient embedded implementations. For example, vision libraries intended for algorithm development and prototyping often do not lend themselves to efficient embedded implementation.

B. OpenCV

OpenCV is a free, open source computer vision software component library for personal computers, comprising over two thousand algorithms. [3] Originally developed by Intel, it is now maintained by Willow Garage. The OpenCV library, used along with Bradski and Kaehler’s book, is a great way to quickly begin experimenting with computer vision.

However, OpenCV is not a solution to all vision problems. Some OpenCV functions work better than others. And OpenCV is a library, not a standard, so there is no guarantee that it functions identically on different platforms. In its current form, OpenCV is not particularly well suited to embedded implementation. Ports of OpenCV to non-PC platforms have been made, and more are underway, but there’s currently little coherence to these efforts.

C. Some Promising Developments

While embedded vision development is challenging, some promising recent industry developments suggest that it is getting easier. For example, the Microsoft Kinect is becoming very popular for vision development. Soon after its release in late 2010, the API for the Kinect was reverse-engineered, enabling engineers to use the Kinect with hosts other than the Xbox 360 game console. The Kinect has been used with PCs and with embedded platforms such as the Beagle Board.

The XIMEA Currera integrates an embedded PC in a camera. It’s not suitable for low-cost, low-power applications, but can be a good fit for low-volume applications like manufacturing inspection.

Several embedded processor vendors have begun to recognize the magnitude of the opportunity for embedded vision, and are developing processors specifically targeting embedded vision applications. In addition, smart phones and tablets have the potential to become effective embedded vision platforms. Application software platforms are emerging for certain embedded vision applications, such as augmented reality and gesture-based UIs. Such software platforms simplify embedded vision application development by providing many of the utility functions commonly required by such applications.


With embedded vision, we believe that the industry is entering a “virtuous circle” of the sort that has characterized many other digital signal processing application domains. Although there are few chips dedicated to embedded vision applications today, these applications are increasingly adopting high-performance, cost-effective processing chips developed for other applications, including DSPs, CPUs, FPGAs, and GPUs. As these chips continue to deliver more programmable performance per dollar and per watt, they will enable the creation of more high-volume embedded vision products. Those high-volume applications, in turn, will attract more attention from silicon providers, who will deliver even better performance, efficiency, and programmability.


The author gratefully acknowledges the assistance of Shehrzad Qureshi in providing information on lens distortion correction used in this paper.


[1] www.Embedded-Vision.com
[2] https://www.embedded-vision.com/industry-analysis/video-interviews-demos/2011/09/09/introduction-computer-vision-using-opencv
[3] OpenCV: http://opencv.willowgarage.com/wiki/; G. Bradski and A. Kaehler, “Learning OpenCV: Computer Vision with the OpenCV Library,” O’Reilly, 2008.
[4] MATLAB/Octave: P. I. Corke, “Machine Vision Toolbox,” IEEE Robotics and Automation Magazine, 12(4), pp. 16-25, November 2005, http://petercorke.com/Machine_Vision_Toolbox.html; P. D. Kovesi, “MATLAB and Octave Functions for Computer Vision and Image Processing,” Centre for Exploration Targeting, School of Earth and Environment, The University of Western Australia, http://www.csse.uwa.edu.au/~pk/research/matlabfns.
[5] Visym (beta): http://beta.visym.com/overview
[6] “Predator” self-learning object tracking algorithm: Z. Kalal, K. Mikolajczyk, and J. Matas, “Forward-Backward Error: Automatic Detection of Tracking Failures,” International Conference on Pattern Recognition, 2010, pp. 23-26.
[7] Vision on GPUs: GPU4vision project, TU Graz:
[8] Lens distortion correction: Luis Alvarez, Luis Gomez and J. Rafael Sendra. “Algebraic Lens Distortion Model Estimation.” Image Processing On Line, 2010. DOI:10.5201/ipol.2010.ags-alde: http://www.ipol.im/pub/algo/ags_algebraic_lens_distortion_estimation

Automotive Driver Assistance Systems: Using the Processing Power of FPGAs


By Paul Zoratti
Driver Assistance Senior System Architect
Automotive Division
Xilinx Corporation

This is a reprint of a Xilinx-published white paper which is also available here (344 KB PDF).

In the last five years, the automotive industry has made remarkable advances in driver assistance (DA) systems that truly enrich the driving experience and provide drivers with invaluable information about the road around them. This white paper looks at how FPGAs can be leveraged to quickly bring new driver assistance innovations to market.

Driver Assistance Introduction

Since the early 1990s, developers of advanced DA systems have striven to provide a safer, more convenient driving experience. Over the past two decades, DA features such as ultrasonic park assist, adaptive cruise control, and lane departure warning have been deployed in high-end vehicles. Recently, automotive manufacturers have added rear-view cameras, blind-spot detection, and surround-vision systems as options. Except for ultrasonic park assist, deployment volumes for DA systems have been limited. However, the research firm Strategy Analytics forecasts that DA system deployment will rise dramatically over the next decade.

In addition to government legislation and strong consumer interest in safety features, innovations in remote sensors and associated processing algorithms that extract and interpret critical information are fueling an increase in DA system deployment. Over time, these DA systems will become more sophisticated and move from high-end to mainstream vehicles, with FPGA-based processing playing a major role.

Driver Assistance Sensing Technology Trends

Sensor research-and-development activities have leveraged adjacent markets, such as cell-phone cameras, to produce devices that not only perform in the automotive environment, but also meet strict cost limits. Similarly, developers have refined complex processing algorithms using PC-based tools and are transitioning them to embedded platforms.

While ultrasonic sensing technology has led the market to date, IMS Research forecasts (Figure 1) that camera sensors will dominate in the coming years.

Figure 1. Driver Assistance Sensors Market

A unique attribute of camera sensors is the value of both the raw and processed outputs. Raw video from a camera can be directly displayed for a driver to identify and assess hazardous conditions, something not possible with other types of remote sensors (for example, radar). Alternatively (or even simultaneously), the video output can be processed using image analytics to extract key information, such as the location and motion of pedestrians. Developers can further expand this "dual-use" concept of camera sensor data by bundling multiple consumer features based on a single set of cameras, as illustrated in Figure 2.

Figure 2. Bundling Multiple Automotive Features

From such applications, it is possible to draw a number of conclusions regarding the requirements of suitable processing platforms for camera-based DA systems:

  • They must support both video processing and image processing. In this case, video processing refers to proper handling of raw camera data for display to the driver, and image processing refers to the application of analytics to extract information (for example, motion) from a video stream.
  • They must provide parallel datapaths for algorithms associated with features that will run concurrently.
  • Given that many new features require megapixel image resolution, connectivity and memory bandwidth are just as critical as raw processing power.

Meeting DA Processing Platform Requirements

FPGAs are well suited to meet DA processing platform requirements. For example, in a wide-field-of-view, single-camera system that incorporates a rear cross-path warning feature, the system's intent is to provide a distortion-corrected image of the area behind the vehicle. In addition, object-detection and motion-estimation algorithms generate an audible warning if an object is entering the projected vehicle path from the side.

Figure 3 illustrates how the camera signal is split between the video- and image-processing functions. The raw processing power needed to perform these functions can quickly exceed what is available in a serial digital signal processor (DSP). Parallel processing along with hardware acceleration is a viable solution.

Figure 3. Video and Image Processing Functions

FPGAs offer highly flexible architectures to address various processing strategies. Within the FPGA logic, it is a simple matter to split the camera signal to feed independent video- and image-processing intellectual property (IP) blocks. Unlike serial processor implementations, which must time-multiplex resources across functions, the FPGA can execute and clock processing blocks independently. Additionally, if it becomes necessary to make a change in the processing architecture, the FPGA's ability to reprogram hardware blocks surpasses solutions based on specialized application-specific standard products (ASSPs) and application-specific integrated circuits (ASICs), giving FPGA implementations a large advantage when anticipating the future evolution of advanced algorithms. For computationally intensive processing, FPGA devices, such as the new XA Spartan®-6 FPGA Automotive family, offer up to 180 independent multiply-and-accumulate (MACC) units with pre-adders.

Another benefit of FPGA implementation is device scalability. As OEMs look to bundle more features, the processing needs will rise. For example, the rear-view camera might need to host a monocular ranging algorithm to provide drivers with information on object distance. The added functionality requires yet another parallel-processing path. Implementing this in a specialized ASIC or ASSP could be problematic, if not impossible, unless the designers made provisions for such expansion ahead of time.

Attempting to add this functionality to a serial DSP could require a complete re-architecture of the software design, even after moving to a more powerful device in the family (if it is plausible at all). By contrast, an FPGA-based implementation allows the new functional block to be added, utilizing previously unused FPGA logic and leaving existing blocks virtually intact. Even if the new function requires more resources than are available in the original device, part/package combinations frequently support moving to a denser device (that is, one with more processing resources) without the need to redesign the circuit board or existing IP blocks.

Finally, the reprogrammable nature of the FPGA offers "silicon reuse" for mutually exclusive DA functions. In the rear-looking camera example, the features described are useful while a vehicle is backing up, but an FPGA-based system could leverage the same sensor and processing electronics while the vehicle is moving forward, with a feature like blind-spot detection. In this application, the system analyzes the camera image to determine the location and relative motion of detected objects. Since this feature and its associated processing functions are not required at the same time as the backup feature, the system can reconfigure the FPGA logic within several hundred milliseconds based on the vehicle state. This allows complete reuse of the FPGA device to provide totally different functionality at very little cost.

Meeting DA External Memory Bandwidth Requirements

In addition to raw processing performance, camera-based DA applications require significant external memory access bandwidth. The most stringent requirements come from multi-camera systems with centralized processing, for example, a four-camera surround-view system. Assuming four megapixel-class imagers (1,280 x 960), 24-bit color processing, and performance of 30 frames per second (FPS), just storing the images in external buffers requires 3.6 Gb/s of memory access. If the images need to be simultaneously read and written, the requirement doubles to 7.2 Gb/s. With an 85% read/write burst efficiency, the requirement increases to 8.5 Gb/s. This estimate does not include other interim image storage or code access needs. With these requirements, it is clear that camera-based DA applications are memory bandwidth-intensive.
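The bandwidth figures above follow from straightforward arithmetic. The sketch below reproduces them; the function name is ours, and the ~85% burst-efficiency assumption is chosen to match the 8.5 Gb/s figure quoted in the text (unrounded, the values come out slightly lower):

```python
# Estimate external memory bandwidth for a four-camera surround-view
# system: 1,280 x 960 imagers, 24-bit color, 30 frames per second.

def surround_view_bandwidth_gbps(width=1280, height=960, bits_per_pixel=24,
                                 fps=30, num_cameras=4, burst_efficiency=0.85):
    """Return (write-only, read+write, efficiency-adjusted) bandwidth in Gb/s."""
    write_gbps = width * height * bits_per_pixel * fps * num_cameras / 1e9
    read_write_gbps = 2 * write_gbps           # simultaneous read and write
    adjusted_gbps = read_write_gbps / burst_efficiency
    return write_gbps, read_write_gbps, adjusted_gbps

w, rw, adj = surround_view_bandwidth_gbps()
print(f"write only            : {w:.1f} Gb/s")    # ~3.5 Gb/s (text rounds to 3.6)
print(f"read + write          : {rw:.1f} Gb/s")   # ~7.1 Gb/s (text rounds to 7.2)
print(f"85% burst efficiency  : {adj:.1f} Gb/s")  # ~8.3 Gb/s (text rounds to 8.5)
```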

These systems also commonly require memory controllers; however, adding one in a cost-effective manner requires efficient system-level design. Again, developers can leverage the FPGA's flexibility to meet this need. XA Spartan-6 devices offer two hardened memory controller blocks (MCBs) that designers can configure for 4-, 8- or 16-bit DDR, DDR2, DDR3, or LPDDR memory interfaces. They can clock the MCBs at up to 400 MHz, providing 12.8 Gb/s memory access bandwidth to a 16-bit-wide memory device. Furthermore, with two MCBs, the raw bandwidth doubles to 25.6 Gb/s. The two MCBs can run independently or work together using FPGA logic to create a virtual 32-bit-wide data width.
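The controller bandwidth quoted above follows directly from the bus width and clock rate; a minimal check (the function name is illustrative, and DDR is modeled as two transfers per clock):

```python
# Peak bandwidth of one DDR memory controller block (MCB): a 16-bit-wide
# interface clocked at 400 MHz, transferring data on both clock edges.

def ddr_peak_bandwidth_gbps(bus_width_bits=16, clock_mhz=400, transfers_per_clock=2):
    return bus_width_bits * clock_mhz * 1e6 * transfers_per_clock / 1e9

one_mcb = ddr_peak_bandwidth_gbps()
print(f"one MCB : {one_mcb:.1f} Gb/s")      # 12.8 Gb/s
print(f"two MCBs: {2 * one_mcb:.1f} Gb/s")  # 25.6 Gb/s
```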

To summarize, FPGA memory controllers provide customized external memory interface design options to meet DA bandwidth needs and optimize all aspects of the cost equation (memory device type, number of PCB layers, etc.).

DA Image Processing Need for On-Chip Memory Resources

In addition to external memory needs, camera-based DA processing can benefit from on-chip memory that serves as line buffers for processing streaming video or analyzing blocks of image data. Bayer transform, lens distortion correction, and optical-flow motion-analysis are examples of functions that require video line buffering. For a brief quantitative analysis, a Bayer transform function using 12-bit-pixel Bayer pattern intensity information to produce 24-bit color data is examined. Implemented as a raw streaming process, a bicubic interpolation process requires buffering four lines of image data. Packing the 12-bit-intensity data into 16-bit locations requires approximately 20.5 kb of storage per line, or 82 kb for four lines of data.
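The line-buffer sizing above can be reproduced with a short calculation; this is a sketch with an illustrative function name, assuming the 1,280-pixel line width and 16-bit packed storage described in the text:

```python
# Line-buffer sizing for the bicubic Bayer interpolation example:
# 12-bit pixels packed into 16-bit locations, four buffered lines.

def line_buffer_kb(pixels_per_line=1280, bits_per_location=16, num_lines=4):
    per_line_kb = pixels_per_line * bits_per_location / 1000
    return per_line_kb, per_line_kb * num_lines

per_line, total = line_buffer_kb()
print(f"per line: {per_line:.1f} kb")  # 20.5 kb
print(f"4 lines : {total:.1f} kb")     # 81.9 kb (~82 kb)
```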

FPGAs provide on-chip memory resources in the form of block RAM. The XA Spartan-6 family has increased the block-RAM-to-logic ratios to support image-processing needs. The XA Spartan-6 devices offer between 216 kb and 4.7 Mb of block RAM memory structured in dual-port 18 kb blocks capable of 320 MHz clocking.

Transporting Video Data over High-Speed Serial Interfaces

Another DA processing platform issue relates to transport of video data from remotely mounted cameras to central processing or display-capable modules. Most of today's camera installations rely on analog composite video transport (for example, NTSC). However, this method presents several problems for advanced DA systems. Interlaced fields can reduce the effectiveness of object-recognition and motion-estimation algorithms, and analog signals are susceptible to electrical noise, which adversely affects image quality. Finally, with the advent of digital imagers, conversion to or from composite video (CVBS) formats can introduce unnecessary system costs.

A preferred method is to use a digital transport mechanism. Transporting 12 bits of data in parallel can be costly in terms of cable and connectors, so serialization techniques involving low-voltage differential signaling (LVDS) or Ethernet technologies are currently under consideration. Serializing pixel data requires the use of devices with high-speed interfaces. A single 30 FPS megapixel imager with 12-bit pixel depth generates data at greater than 500 Mb/s.
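The 500 Mb/s figure can be sanity-checked as follows: the active pixel payload alone accounts for roughly 442 Mb/s, and blanking intervals plus framing overhead push the serialized link rate past 500 Mb/s. The ~15% overhead below is our assumption for illustration, not a figure from the text:

```python
# Raw serialized data rate for a single 30 FPS megapixel imager
# (1,280 x 960, 12-bit pixel depth).

def serial_rate_mbps(width=1280, height=960, bits_per_pixel=12, fps=30,
                     blanking_overhead=0.15):  # assumed ~15% blanking/framing
    payload = width * height * bits_per_pixel * fps / 1e6
    return payload, payload * (1 + blanking_overhead)

payload, with_overhead = serial_rate_mbps()
print(f"active pixels : {payload:.0f} Mb/s")        # ~442 Mb/s
print(f"with overhead : {with_overhead:.0f} Mb/s")  # >500 Mb/s
```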

XA Spartan-6 devices offer differential I/O that can operate at speeds exceeding 1 Gb/s, and several members of the family also offer serial transceivers that can be clocked at better than 3 Gb/s. It is possible to leverage these high-speed I/O capabilities along with the FPGA logic to implement emerging LVDS SerDes signaling protocols within the FPGA device itself, eliminating external components and reducing system cost.

Functional Partitioning of Parallel and Serial DA Processes

For the single-camera system with rear cross-path warning example, the video- and image-processing functions clearly benefit from parallel processing and hardware acceleration, while the cross-path warning generation is a serial decision process. So a platform that can support both types of processing is clearly an advantage.

Xilinx FPGAs support instantiation of soft processors such as the MicroBlaze™ 32-bit RISC embedded processor, available in XA Spartan-6 devices. Combining full-function processors with FPGA logic allows for optimized functional partitioning: the functions that benefit from parallel processing or hardware acceleration are implemented in FPGA logic, while those more suited to serial processes are implemented in software and executed on the MicroBlaze processor. While the MicroBlaze processor is capable of supporting system-on-chip (SoC) architectures, Xilinx's 7 series devices include an Extensible Processing Platform with hardened dual-core ARM® Cortex™-A9 processors along with a hardened set of peripherals. Xilinx is targeting these 7 series devices at the most complex DA systems.


System designers working on DA processing platforms must consider architectural flexibility, platform scalability, external memory bandwidth, on-chip memory resources, high-speed serial interfaces, and parallel/serial process partitioning. The challenge is to strike an appropriate balance between meeting these needs and maintaining a competitive product cost structure. In this quest, FPGA technology is a viable alternative to standard ASSP and ASIC approaches. In particular, the resource attributes of the XA Spartan-6 family offer unique options and capabilities in meeting the DA processing platform requirements. With today's FPGAs utilizing 40 nm process nodes and 7 series devices moving to 28 nm, their competitive position as a DA processing platform of choice will remain strong for some time to come.

To learn more about the XA Spartan-6 family and the benefits it can offer, go to:


For a demonstration of a four camera surround view system based on Spartan-6 FPGAs, go to:


Build A FaceBot


By Eric Gregori
Senior Software Engineer and Embedded Vision Specialist

Face detection and tracking are exciting areas of embedded vision. With a simple web camera, some free open-source software, and a fun animatronic head kit from Robodyssey Systems, FaceBot will introduce you to face detection and tracking using the easy-to-learn software from EMG Robotics.

EMG Robotics is Eric Gregori, Senior Software Engineer and Embedded Vision Specialist at BDTI. Click here to read the remainder of this informative article, from the May/June 2011 edition of Robot Magazine and reproduced with the generous permission of the publisher.

Introduction To Computer Vision Using OpenCV (Article)



See a sample of this page's content below:

By Eric Gregori
Senior Software Engineer and Embedded Vision Specialist

The name OpenCV has become synonymous with computer vision, but what is OpenCV? OpenCV is a collection of software algorithms put together in a library to be used by industry and academia for computer vision applications and research (Figure 1). OpenCV started at Intel in the mid 1990s as a method to demonstrate how to accelerate certain algorithms in hardware. In 2000, Intel released OpenCV to the open source community as a beta version, followed by v1.0 in 2006. In 2008, Willow Garage took over support for OpenCV and immediately released v1.1.

Introduction To OpenCV Figure 1

Figure 1: OpenCV, an algorithm library (courtesy Willow Garage)

Willow Garage dates from 2006. The company has been in the news a lot lately, subsequent to the unveiling of its PR2 robot (Figure 2). Gary Bradski began working on OpenCV when he was at Intel; as a senior scientist at Willow Garage he aggressively continues his work on the library.

Introduction To OpenCV Figure 2

Figure 2: Willow Garage's PR2 robot

OpenCV v2.0, released in 2009, contained many improvements and upgrades. Initially, OpenCV was primarily a C library. The majority of algorithms were written in C, and the primary method of using the library was via a C API. OpenCV v2.0 migrated towards C++ and a C++ API. Subsequent versions of OpenCV added Python support, along with Windows, Linux, iOS and Android OS support, transforming OpenCV (currently at v2.3) into a cross-platform tool. OpenCV v2.3 contains more than 2500 algorithms; the original OpenCV only had 500. And to assure quality, many of the algorithms provide their own unit tests.

So, what can you do with OpenCV v2.3? Think of OpenCV as a box of 2,500 different food items. The chef's job is to combine the food items into a meal. OpenCV in itself is not the full meal; it contains the pieces required to make a meal. But here's the good news: OpenCV includes a bunch of recipes to provide examples of what it can do.

Experimenting with OpenCV, no programming experience necessary

BDTI has created the OpenCV Executable Demo Package, an easy-to-use tool that allows anyone with a Windows computer and a web camera to experiment with some of the algorithms in OpenCV v2.3.  You can...

Challenges to Embedding Computer Vision



See a sample of this page's content below:

By J. Scott Gardner 
April 8, 2011


To read this article as a pdf file, click here.

For many of us, the idea of computer vision was first imagined as the unblinking red lens through which a computer named HAL spied on the world around itself in 2001: A Space Odyssey (Arthur C. Clarke and Stanley Kubrick, 1968). The computers and robots in science fiction are endowed with vision and computing capabilities that often exceed those of mere humans. Researchers in computer vision understand the truth: computer vision is very difficult, even when implemented on high-performance computer systems. Arthur C. Clarke underestimated these challenges when he wrote that the HAL 9000 became operational on January 12, 1997. As just one example, HAL was able to read the lips of the astronauts, something not yet possible with modern, high-performance computers. While the HAL 9000 comprised hundreds of circuit boards, this article takes the computer vision challenges to the world of embedded systems. The fictional Dr. Chandra would have been severely constrained if trying to design HAL into a modern embedded system. Yet, embedded vision systems are being built today, and these real-world successors of HAL meet the unique design constraints that are critical for creating successful embedded products.

Why is Computer Vision so Difficult?

Figure 2. Rendered 3D scene with occluded objects (source: Wikimedia Commons)

Computer vision has been described as “inverse 3D graphics”, but vision is orders of magnitude more difficult than rendering a single 2D representation of a 3D scene database viewed through a virtual camera. In 3D graphics, the computer already knows about the objects in the environment and implements a “feed-forward” computation to render a camera view. In computer vision, an observed scene must be analyzed to build the equivalent of a scene database that correctly identifies the characteristics of the objects in the scene. An embedded computer needs to enhance the image and then make inferences to identify the objects and correctly interpret the scene.

However, vision systems don’t have perfect information about a scene, since the camera field of view will often include occluded objects. It would be very difficult to computationally recreate a 3D scene...

September's Initial Embedded Vision Alliance Member Summit: A Resounding Success

By Brian Dipert
Embedded Vision Alliance
Senior Analyst

How Does Camera Performance Affect Analytics?


By Michael Tusch
Founder and CEO
Apical Limited

Camera designers have decades of experience in creating image processing pipelines which produce attractive and/or visually accurate images, but what kind of image processing produces video which is well-purposed for subsequent analytics? It seems reasonable to begin by considering a conventional ISP (image signal processor). After all, the human eye-brain system produces what we consider aesthetically pleasing imagery for a purpose: to maximize our decision-making abilities. But which elements of such an ISP are most important to get right for good analytics, and how do they impact the performance of the algorithms which run on them?

In this introductory article, we’ll survey the main components of an ISP, highlight those components whose performance we believe to be particularly important for good analytics, and discuss what their effect is likely to be. In subsequent articles, we'll look at specific algorithm co-optimizations between analytics and these main components, focusing on applications in object tracking and face recognition.

Figure 1 shows a simplified block schematic of a conventional ISP. The input is sensor data in a raw format (one color per pixel), and the output is interpolated RGB or YCbCr data (three colors per pixel).

Figure 1. This block diagram provides a simplified view inside a conventional ISP (image signal processor)

Table 1 briefly summarizes the function of each block. The list is not intended to be exhaustive: an ISP design team will frequently also implement other modules.

Module Function
Raw data correction Set black point, remove defective pixels.
Lens correction Correct for geometric and luminance/color distortions.
Noise reduction Apply temporal and/or spatial averaging to increase SNR (signal to noise ratio).
Dynamic range compression Reduce dynamic range from sensor to standard output without loss of information.
Demosaic Reconstruct three colors per pixel via interpolation with pixel neighbors.
3A Calculate correct exposure, white balance and focal position.
Color correction Obtain correct colors in different lighting conditions.
Gamma Encode video for standard output.
Sharpen Edge enhancement.
Digital image stabilization Remove global motion due to camera shake/vibration.
Color space conversion RGB to YCbCr.

Table 1. Functions of main ISP modules
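As a concrete illustration of one Table 1 module, the demosaic step can be sketched as bilinear interpolation over a Bayer mosaic. The C++ below is a minimal sketch, not production ISP code: the RGGB layout and function names are assumptions for illustration. Each pixel's missing color channels are filled by averaging the nearest neighbors that carry them; real ISPs use edge-aware variants.

```cpp
#include <cstdint>
#include <vector>

// Interpolated output pixel: three colors per pixel, per Table 1.
struct RGB { float r, g, b; };

// Average all pixels of the requested Bayer color within the 3x3
// neighborhood of (x, y). Assumes an RGGB mosaic: red at (even, even),
// blue at (odd, odd), green elsewhere.
static float avgNeighbors(const std::vector<uint16_t>& raw, int w, int h,
                          int x, int y, bool wantRed, bool wantBlue) {
    float sum = 0.0f; int n = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int xx = x + dx, yy = y + dy;
            if (xx < 0 || yy < 0 || xx >= w || yy >= h) continue;
            bool red   = (yy % 2 == 0) && (xx % 2 == 0);
            bool blue  = (yy % 2 == 1) && (xx % 2 == 1);
            bool green = !red && !blue;
            if ((wantRed && red) || (wantBlue && blue) ||
                (!wantRed && !wantBlue && green)) {
                sum += raw[yy * w + xx]; ++n;
            }
        }
    return n ? sum / n : 0.0f;
}

// Reconstruct three colors per pixel via interpolation with pixel neighbors.
std::vector<RGB> demosaicRGGB(const std::vector<uint16_t>& raw, int w, int h) {
    std::vector<RGB> out(w * h);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            RGB& p = out[y * w + x];
            p.r = avgNeighbors(raw, w, h, x, y, true,  false);
            p.g = avgNeighbors(raw, w, h, x, y, false, false);
            p.b = avgNeighbors(raw, w, h, x, y, false, true);
        }
    return out;
}
```

A useful sanity check: a uniform raw input should demosaic to a uniform three-channel output.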

Analytics algorithms may operate directly on the raw data, on the output data, or on data which has subsequently passed through a video compression codec. The data at these three stages often has very different characteristics and quality, which are relevant to the performance of analytics algorithms.

Let us now review the stages of the ISP in order of decreasing importance to analytics, which happens also to be approximately the top-to-bottom order shown in Figure 1. We start with the sensor and optical system. Obviously, the better the sensor and optics, the better the quality of data on which to base decisions. But "better" is not a matter simply of resolution, frame rate or SNR (signal-to-noise ratio). Dynamic range is also a key characteristic. Dynamic range is essentially the relative difference in brightness between the brightest and darkest details that the sensor can record within a single scene, normally expressed in dB.
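Since dynamic range is the ratio between the brightest and darkest recordable signal levels, it can be expressed in dB with the usual 20·log10 convention. A minimal sketch, with illustrative function and parameter names:

```cpp
#include <cmath>

// Dynamic range in dB: the ratio of the brightest to the darkest signal
// the sensor can record within a single scene, on the 20*log10 convention.
double dynamicRangeDb(double brightest, double darkest) {
    return 20.0 * std::log10(brightest / darkest);
}
```

Under this convention, a 1000:1 brightest-to-darkest ratio corresponds to 60 dB, in line with the sensor figures quoted below.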

Common CMOS and CCD sensors have a dynamic range of between 60 and 70 dB, which is sufficient to capture all details in scenes which are fairly uniformly illuminated. Special sensors are required, on the other hand, to capture the full range of illumination in high-contrast environments. Around 90 dB of dynamic range is needed to simultaneously record information in deep shadows and bright highlights on a sunny day; this requirement rises further if extreme lighting conditions occur (the human eye has a dynamic range of around 120 dB). If the sensor can't capture such a range, objects which move across the scene will disappear into blown-out highlights, or into deep shadows below the sensor black level. High (i.e., wide) dynamic range sensors are certainly helpful in improving analytics in uncontrolled lighting environments. Efficient processing of such sensor data is not trivial, however, as discussed below.

The next most important element is noise reduction, which matters for a number of reasons. In low light, noise reduction is frequently necessary to raise objects above the noise background, subsequently aiding in accurate segmentation. Also, high levels of temporal noise can easily confuse tracking algorithms based on pixel motion, even though such noise is largely uncorrelated both spatially and temporally. If the video goes through a lossy compression algorithm prior to analytics post-processing, you should also consider the effect of noise reduction on compression efficiency. The bandwidth required to compress noisy sources is much higher than with "clean" sources. If transmission or storage is bandwidth-limited, the presence of noise reduces the overall compression quality and may lead to increased amplitude of quantization blocks, which easily confuses analytics algorithms.

Effective noise reduction can readily increase compression efficiency by 70% or more in moderate noise environments, even when the increase in SNR is visually not very noticeable. However, noise reduction algorithms may themselves introduce artifacts. Temporal processing works well because it increases the SNR by averaging over multiple frames. Both global and local motion compensation may be necessary to eliminate false motion trails in environments with fast movement. Spatial noise reduction aims to blur noise while retaining texture and edges, and risks suppressing important details. You must therefore strike a careful balance between SNR increase and image quality degradation.
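The temporal averaging mentioned above can be sketched as a recursive, exponentially weighted running average. This is a minimal illustration (the class name and blending scheme are assumptions), and it deliberately omits the motion compensation the text notes is necessary to avoid motion trails:

```cpp
#include <cstddef>
#include <vector>

// Recursive temporal noise reduction: each output frame is an exponentially
// weighted running average of the inputs, raising SNR on static scenes.
// alpha near 1.0 trusts the newest frame (little filtering); alpha near 0.0
// averages over many frames (stronger filtering, but motion trails without
// motion compensation).
class TemporalFilter {
public:
    explicit TemporalFilter(double alpha) : alpha_(alpha) {}

    // Blend the new frame into the running average and return the result.
    std::vector<double> apply(const std::vector<double>& frame) {
        if (acc_.empty()) {
            acc_ = frame;  // the first frame seeds the accumulator
        } else {
            for (std::size_t i = 0; i < frame.size(); ++i)
                acc_[i] = alpha_ * frame[i] + (1.0 - alpha_) * acc_[i];
        }
        return acc_;
    }

private:
    double alpha_;
    std::vector<double> acc_;
};
```

With alpha = 0.5, uncorrelated per-pixel noise is roughly halved after one blended frame, which is the SNR gain the text describes.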

The correction of lens geometric distortions, chromatic aberrations and lens shading (vignetting) is of inconsistent significance, depending on the optics and application. For conventional cameras, uncorrected data may be perfectly suitable for post-processing. In digital PTZ (pan, tilt and zoom) cameras, on the other hand, correction is a fundamental component of the system.

A set of "3A" algorithms control camera exposure, color and focus, based on statistical analysis of the sensor data. Their function and impact on analytics is shown in Table 2 below.

Algorithm Function Impact
Auto exposure Adjust exposure to maximize the amount of scene captured. Avoid flicker in artificial lighting. A poor algorithm may blow out highlights or clip dark areas, losing information. Temporal instabilities may confuse motion-based analytics.
Auto white balance Obtain correct colors in all lighting conditions. If color information is used by analytics, it needs to be accurate. It is challenging to achieve accurate colors in all lighting conditions.
Auto focus Focus the camera. Which regions of the image should receive focus attention? How should the algorithm balance temporal stability versus rapid refocusing in a scene change?

Table 2. The impact of "3A" algorithms

Finally, we turn to DRC (dynamic range compression). DRC is a method of non-linear image adjustment which reduces dynamic range, i.e. global contrast. It has two primary functions: detail preservation and luminance normalization.

I mentioned above that the better the dynamic range of the sensor and optics, the more data will typically be available for analytics to work on. But in what form do the algorithms receive this data? For some embedded vision applications, it may be no problem to work directly with the high-bit-depth raw sensor data. But if the analytics is run in-camera on RGB or YCbCr data, or as post-processing based on already lossy-compressed data, the dynamic range of such data is typically limited by the 8-bit standard format, which corresponds to 60 dB. This means that, unless dynamic range compression occurs in some way prior to encoding, the additional scene information will be lost. While techniques for DRC are well established (gamma correction is one form, for example), many of these techniques decrease image quality in the process, by degrading local contrast and color information, or by introducing spatial artifacts.
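As a concrete sketch of gamma correction used as a simple, global form of DRC, the function below compresses 12-bit linear sensor data into the 8-bit output range with a power-law curve. The bit depths and gamma value are illustrative, not prescribed by the article:

```cpp
#include <cmath>
#include <cstdint>

// Gamma encoding as a global tone curve: compress high-bit-depth linear
// sensor data to 8 bits while allocating more output codes to shadows.
uint8_t compressTo8Bit(uint16_t linear12, double gamma = 1.0 / 2.2) {
    double normalized = linear12 / 4095.0;         // 12-bit full scale -> [0, 1]
    double encoded = std::pow(normalized, gamma);  // power-law lifts shadow detail
    return static_cast<uint8_t>(encoded * 255.0 + 0.5);
}
```

Note how mid-scale linear values map well above mid-scale output: the curve allocates a disproportionate share of the 8-bit codes to shadows, which is the detail-preservation role described above, at the cost of the local-contrast degradation also noted.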

Another application of DRC is in image normalization. Advanced analytics algorithms, such as those employed in facial recognition, are susceptible to changing and non-uniform lighting environments. For example, an algorithm may recognize the same face differently depending on whether the face is uniformly illuminated or lit by a point source to one side, in the latter case casting a shadow on the other side. Good DRC processing can be effective in normalizing imagery from highly variable illumination conditions to simulated constant, uniform lighting, as shown in Figure 2.

Figure 2. DRC (dynamic range compression) can normalize a source image with non-uniform illumination

In general, we find that the requirements of an ISP to produce natural, visually-accurate imagery and to produce well-purposed imagery for analytics are closely matched. However, as is well-known, it is challenging to maintain accuracy in uncontrolled environments. In future articles, we will focus on individual blocks of the ISP and consider the effect of their specific performance on the behavior of example analytics algorithms.

Design Guidelines for Embedded Real-Time Face Detection Applications


By Eldad Melamed
Project Manager, Video Algorithms


Much like the human visual system, embedded computer vision systems analyze and extract information from video, and they do so in a wide variety of products.

In portable embedded devices such as smartphones, digital cameras, and camcorders, this elevated performance has to be delivered within tight size, cost, and power constraints. Emerging high-volume embedded vision markets include automotive safety, surveillance, and gaming. Computer vision algorithms identify objects in a scene, and consequently one region of an image frequently has more importance than the others. For example, object and face detection can be used to enhance the video conferencing experience, manage public security files, enable content-based retrieval, and serve many other applications.

Cropping and resizing can be done to properly center the image on a face. In this paper we present an application that detects faces in a digital image, crops the selected main face, and resizes it to a fixed-size output image (see Figure 1).

The application can be used on a single image or on a video stream, and it is designed to run in real time. Achieving real-time throughput for face detection on mobile devices requires appropriate implementation steps.

This paper presents such steps for real-time deployment of a face detection application on a programmable vector processor. The steps taken are general purpose in the sense that they can be used to implement similar computer vision algorithms on any mobile device.

Figure 1. CEVA face detection application


While still image processing consumes a small amount of bandwidth and allocated memory, video can be considerably demanding on today's memory systems. Memory system design for computer vision algorithms is more challenging still, because of the extra processing steps required to detect and classify objects. Consider a thumbnail with 19x19...

Lens Distortion Correction


by Shehrzad Qureshi
Senior Engineer, BDTI
May 14, 2011

A typical processing pipeline for computer vision is given in Figure 1 below:

Figure 1. A typical computer vision processing pipeline

The focus of this article is on the lens correction block. In less-than-ideal optical systems, like those found in cheaper smartphones and tablets, incoming frames tend to be distorted along their edges. The most common types of lens distortion are barrel distortion, pincushion distortion, or some combination of the two[1]. Figure 2 illustrates the types of distortion encountered in vision, and in this article we will discuss strategies and implementations for correcting this type of lens distortion.

Figure 2. Examples of lens distortion: (a) an undistorted checkerboard test pattern; (b, c) distorted versions

These types of lens aberrations can cause problems for vision algorithms because machine vision usually prefers straight edges (for example, lane finding in automotive applications, or various inspection systems). The general effect of both barrel and pincushion distortion is to project what should be straight lines as curves. Correcting for these distortions is computationally expensive because it is a per-pixel operation. However, the correction process is also highly regular and "embarrassingly data parallel," which makes it amenable to FPGA or GPU acceleration. The FPGA solution can be particularly attractive, as there are now cameras on the market with FPGAs in the camera itself that can be programmed to perform this type of processing[2].

Calibration Procedure

The rectilinear correction procedure can be summarized as warping a distorted image (see Figures 2b and 2c) to remove the lens distortion, thus taking the frame back to its undistorted projection (Figure 2a). In other words, we must first estimate the lens distortion function, and then invert it so as to compensate the incoming image frame. The compensated image will be referred to as the undistorted image.

Both types of lens aberrations discussed so far are radial distortions that increase in magnitude as we move farther away from the image center. In order to correct for this distortion at runtime, we first must estimate the coefficients of a parameterized form of the distortion function during a calibration procedure that is specific to a given optics train. The detailed mathematics behind this parameterization is beyond the scope of this article, and is covered thoroughly elsewhere[3]. Suffice it to say that if we have:


  • (x_d, y_d) = original distorted point coordinates
  • (x_c, y_c) = image center
  • (x_u, y_u) = undistorted (corrected) point coordinates

then the goal is to measure the radial distortion model L(r) where:

  x_u = x_c + (x_d − x_c)·L(r)
  y_u = y_c + (y_d − y_c)·L(r)

with L(r) = 1 + k_1·r^2 + k_2·r^4 + … and r^2 = (x_d − x_c)^2 + (y_d − y_c)^2.
The purpose of the calibration procedure is to estimate the radial distortion coefficients, which can be accomplished using a gradient descent optimizer[3]. Typically one images a test pattern with co-linear features that are easily extracted autonomously with sub-pixel accuracy. The checkerboard in Figure 2a is one such test pattern that the author has used several times in the past. Figure 3 summarizes the calibration process:

Figure 3. The lens distortion calibration process

The basic idea is to image a distorted test pattern, extract the coordinates of lines which are known to be straight, feed these coordinates into an optimizer such as the one described in [3], which emits the lens undistort warp coefficients. These coefficients are used at run-time to correct for the measured lens distortion.

In the provided example we can use a Harris corner detector to automatically find the corners of the checkerboard pattern. The OpenCV library has a robust corner detection function that can be used for this purpose[4]. The following snippet of OpenCV code can be used to extract the interior corners of the calibration image of Figure 2a:


const int nSquaresAcross = 9;
const int nSquaresDown = 7;
const int nCorners = (nSquaresAcross-1)*(nSquaresDown-1);
CvSize szcorners = cvSize(nSquaresAcross-1, nSquaresDown-1);
std::vector<CvPoint2D32f> vCornerList(nCorners);
/* find corners to pixel accuracy */
int cornerCount = 0;
const int N = cvFindChessboardCorners(pImg,  /* IplImage* of the calibration image */
                                      szcorners,
                                      &vCornerList[0],
                                      &cornerCount);
/* should check that N != 0 and cornerCount == nCorners */
/* sub-pixel refinement */
cvFindCornerSubPix(pImg, &vCornerList[0], cornerCount,
                   cvSize(5, 5), cvSize(-1, -1),
                   cvTermCriteria(CV_TERMCRIT_EPS | CV_TERMCRIT_ITER, 30, 0.1));

Segments corresponding to what should be straight lines are then constructed from the point coordinates stored in the STL vCornerList container. For the best results, multiple lines should be used and it is necessary to have both vertical and horizontal line segments for the optimization procedure to converge to a viable solution (see the blue lines in Figure 3).

Finally, we are ready to determine the radial distortion coefficients. There are numerous camera calibration packages (including one in OpenCV), but a particularly good open-source ANSI C library can be located here[5]. Essentially the line segment coordinates are fed into an optimizer which determines the undistort coefficients by minimizing the error between the radial distortion model and the training data. These coefficients can then be stored in a lookup table for run-time image correction.

Lens Distortion Correction (Warping)

The referenced calibration library[5] also includes a very informative online demo that illustrates the procedure briefly described here. Application of the radial distortion coefficients to correct for the lens aberrations basically boils down to an image warp operation. That is, for each pixel in the (undistorted) frame, we compute the distance from the image center and evaluate a polynomial that gives us the pixel coordinates from which to fill in the corrected pixel intensity. Because the polynomial evaluation will more than likely fall in between integer pixel coordinates, some form of interpolation must be used. The simplest and cheapest interpolant is the so-called "nearest neighbor," which, as its name implies, simply picks the nearest pixel, but this technique results in poor image quality. At a bare minimum bilinear interpolation should be employed, and oftentimes higher-order bicubic interpolants are called for.
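The warp just described can be sketched in plain C++. This is a minimal illustration, not the referenced library's implementation: the coefficients k1 and k2 stand in for values produced by calibration, and the simple grayscale image type is assumed for brevity. For each output pixel, it evaluates the radial polynomial to find the source location in the distorted frame, then bilinearly interpolates:

```cpp
#include <cmath>
#include <vector>

// Simple grayscale image, row-major storage.
struct Image {
    int w, h;
    std::vector<float> px;
    float at(int x, int y) const { return px[y * w + x]; }
};

// Bilinear interpolation at a fractional source coordinate; out-of-bounds
// samples return a fill value of zero.
float bilinear(const Image& img, float x, float y) {
    int x0 = (int)std::floor(x), y0 = (int)std::floor(y);
    if (x0 < 0 || y0 < 0 || x0 + 1 >= img.w || y0 + 1 >= img.h) return 0.0f;
    float fx = x - x0, fy = y - y0;
    return (1-fx)*(1-fy)*img.at(x0, y0)   + fx*(1-fy)*img.at(x0+1, y0)
         + (1-fx)*fy    *img.at(x0, y0+1) + fx*fy    *img.at(x0+1, y0+1);
}

// Per-pixel undistort warp: each output pixel is independent, which is the
// parallelism that makes this loop FPGA/GPU-friendly.
Image undistort(const Image& src, double k1, double k2) {
    Image dst{src.w, src.h, std::vector<float>(src.px.size())};
    double cx = src.w / 2.0, cy = src.h / 2.0;
    for (int y = 0; y < src.h; ++y)
        for (int x = 0; x < src.w; ++x) {
            double dx = x - cx, dy = y - cy, r2 = dx*dx + dy*dy;
            double scale = 1.0 + k1*r2 + k2*r2*r2;  // radial model L(r)
            dst.px[y*src.w + x] = bilinear(src, (float)(cx + dx*scale),
                                                (float)(cy + dy*scale));
        }
    return dst;
}
```

With k1 = k2 = 0 the warp reduces to the identity on interior pixels, which provides a simple sanity check for an implementation.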

The amount of computations per frame can become quite large, particularly if we are dealing with color frames. The saving grace of this operation is its inherent parallelism (each pixel is completely independent of its neighbors and hence can be corrected in parallel). This parallelism and the highly regular nature of the computations lends itself readily to accelerators, either via FPGAs[6,7] or GPUs[8,9]. The source code provided in [5] includes a lens correction function with full ANSI C source.

A faster software implementation than [5] can be realized using the OpenCV cvRemap() function[4]. The input arguments into this function are the source and destination images, the source to destination pixel mapping (expressed as two floating point arrays), interpolation options, and a default fill value (if there are any holes in the rectilinear corrected image). At calibration time, we evaluate the distortion model polynomials just once and then store the pixel mapping to disk or memory. At run-time the software simply calls cvRemap()—which is optimized and can accommodate color frames—to correct the lens distortion.
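The split that cvRemap() enables, computing the pixel mapping once at calibration time and then applying it cheaply per frame, can be sketched in plain C++ as follows. This mimics, but does not use, the OpenCV API; the names are illustrative, and nearest-neighbor lookup is used for brevity where bilinear interpolation would normally be preferred:

```cpp
#include <vector>

// Precomputed source-coordinate maps, analogous to cvRemap()'s mapx/mapy.
struct RemapTables {
    std::vector<float> mapx, mapy;
    int w, h;
};

// One-time step (at calibration): evaluate the radial distortion polynomial
// for every destination pixel and store the source coordinates.
RemapTables buildRadialMaps(int w, int h, double k1, double k2) {
    RemapTables t{std::vector<float>(w*h), std::vector<float>(w*h), w, h};
    double cx = w / 2.0, cy = h / 2.0;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            double dx = x - cx, dy = y - cy, r2 = dx*dx + dy*dy;
            double s = 1.0 + k1*r2 + k2*r2*r2;
            t.mapx[y*w + x] = (float)(cx + dx*s);
            t.mapy[y*w + x] = (float)(cy + dy*s);
        }
    return t;
}

// Per-frame step: a cheap table lookup per pixel, with a default fill value
// for holes in the rectilinear-corrected image.
std::vector<float> remap(const std::vector<float>& src,
                         const RemapTables& t, float fill) {
    std::vector<float> dst(src.size(), fill);
    for (int i = 0; i < t.w * t.h; ++i) {
        int sx = (int)(t.mapx[i] + 0.5f);  // nearest neighbor, for brevity
        int sy = (int)(t.mapy[i] + 0.5f);
        if (sx >= 0 && sy >= 0 && sx < t.w && sy < t.h)
            dst[i] = src[sy * t.w + sx];
    }
    return dst;
}
```

The design point mirrors the text: the expensive polynomial evaluation happens once, and the run-time cost per frame is reduced to indexed loads.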


[1] Wikipedia page:  http://en.wikipedia.org/wiki/Distortion_(optics)

[2] http://www.hunteng.co.uk/info/fpgaimaging.htm

[3] L. Alvarez, L. Gomez, R. Sendra. An Algebraic Approach to Lens Distortion by Line Rectification, Journal of Mathematical Imaging and Vision, Vol. 39 (1), July 2009, pp. 36-50.

[4] Bradski, Gary and Kaebler, Adrian. Learning OpenCV (Computer Vision with the OpenCV Library), O’Reilly, 2008.

[5] http://www.ipol.im/pub/algo/ags_algebraic_lens_distortion_estimation/

[6] Daloukas, K.; Antonopoulos, C.D.; Bellas, N.; Chai, S.M. "Fisheye lens distortion correction on multicore and hardware accelerator platforms," Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on , vol., no., pp.1-10, 19-23 April 2010

[7] J. Jiang , S. Schmidt , W. Luk and D. Rueckert "Parameterizing reconfigurable designs for image warping", Proc. SPIE, vol. 4867, pp. 86 2002.

[8] http://visionexperts.blogspot.com/2010/07/image-warping-using-texture-fetches.html

[9] Rosner,J., Fassold,H., Bailer,W., Schallauer,P.: “Fast GPU-based Image Warping and Inpainting for Frame Interpolation”, Proceedings of Computer Graphics, Computer Vision and Mathematics (GraVisMa) Workshop, Plzen, CZ, 2010