Embedded Vision Alliance: Technical Articles

Build A FaceBot


By Eric Gregori
Senior Software Engineer and Embedded Vision Specialist
BDTI

Face detection and tracking is an exciting field in embedded vision. With a simple web camera, some free open-source software, and a fun animatronic head kit from Robodyssey Systems, FaceBot will introduce you to face detection and tracking using the easy-to-learn software from EMG Robotics.

EMG Robotics is Eric Gregori, Senior Software Engineer and Embedded Vision Specialist at BDTI. Click here to read the remainder of this informative article, which appeared in the May/June 2011 edition of Robot Magazine and is reproduced with the generous permission of the publisher.

Introduction To Computer Vision Using OpenCV (Article)



By Eric Gregori
Senior Software Engineer and Embedded Vision Specialist
BDTI

The name OpenCV has become synonymous with computer vision, but what is OpenCV? OpenCV is a collection of software algorithms put together in a library to be used by industry and academia for computer vision applications and research (Figure 1). OpenCV started at Intel in the mid-1990s as a way to demonstrate how to accelerate certain algorithms in hardware. In 2000, Intel released OpenCV to the open source community as a beta version, followed by v1.0 in 2006. In 2008, Willow Garage took over support for OpenCV and immediately released v1.1.

Introduction To OpenCV Figure 1

Figure 1: OpenCV, an algorithm library (courtesy Willow Garage)

Willow Garage dates from 2006. The company has been in the news a lot lately, following the unveiling of its PR2 robot (Figure 2). Gary Bradski began working on OpenCV while at Intel; as a senior scientist at Willow Garage, he continues to actively develop the library.

Introduction To OpenCV Figure 2

Figure 2: Willow Garage's PR2 robot

OpenCV v2.0, released in 2009, contained many improvements and upgrades. Initially, OpenCV was primarily a C library. The majority of algorithms were written in C, and the primary method of using the library was via a C API. OpenCV v2.0 migrated towards C++ and a C++ API. Subsequent versions of OpenCV added Python support, along with Windows, Linux, iOS and Android OS support, transforming OpenCV (currently at v2.3) into a cross-platform tool. OpenCV v2.3 contains more than 2500 algorithms; the original OpenCV only had 500. And to assure quality, many of the algorithms provide their own unit tests.

So, what can you do with OpenCV v2.3? Think of OpenCV as a box of 2500 different food items. The chef's job is to combine the food items into a meal. OpenCV in itself is not the full meal; it contains the pieces required to make a meal. But here's the good news: OpenCV includes a collection of "recipes", sample programs that demonstrate what it can do.
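To give a flavor of how little code a basic "recipe" requires with the 2.x C++ API, here is a minimal sketch (not one of the bundled samples; the input file name and Canny thresholds are placeholders chosen purely for illustration) that loads an image, runs edge detection, and displays the result.

// Minimal OpenCV 2.x C++ example: load an image, run Canny edge detection, display it.
#include <opencv2/opencv.hpp>

int main()
{
    cv::Mat img = cv::imread("input.jpg", 0);   // 0 = load as grayscale; placeholder file name
    if (img.empty())
        return -1;

    cv::Mat edges;
    cv::Canny(img, edges, 50, 150);             // thresholds chosen arbitrarily for illustration

    cv::imshow("edges", edges);
    cv::waitKey(0);
    return 0;
}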

Experimenting with OpenCV, no programming experience necessary

BDTI has created the OpenCV Executable Demo Package, an easy-to-use tool that allows anyone with a Windows computer and a web camera to experiment with some of the algorithms in OpenCV v2.3.  You can...

Challenges to Embedding Computer Vision



By J. Scott Gardner 
April 8, 2011

 

To read this article as a pdf file, click here.

For many of us, the idea of computer vision was first imagined as the unblinking red lens through which a computer named HAL spied on the world around itself in 2001: A Space Odyssey (Arthur C. Clarke and Stanley Kubrick, 1968). The computers and robots in science fiction are endowed with vision and computing capabilities that often exceed those of mere humans. Researchers in computer vision understand the truth: computer vision is very difficult, even when implemented on high-performance computer systems. Arthur C. Clarke underestimated these challenges when he wrote that the HAL 9000 became operational on January 12, 1997. As just one example, HAL was able to read the lips of the astronauts, something not yet possible with modern, high-performance computers. While the HAL 9000 was composed of hundreds of circuit boards, this article takes the computer vision challenges to the world of embedded systems. The fictional Dr. Chandra would have been severely constrained if trying to design HAL into a modern embedded system. Yet embedded vision systems are being built today, and these real-world successors of HAL meet the unique design constraints that are critical for creating successful embedded products.

Why is Computer Vision so Difficult?

Figure 2. Rendered 3D scene with occluded objects. Source: Wiki Commons

Computer vision has been described as "inverse 3D graphics", but vision is orders of magnitude more difficult than rendering a single 2D representation of a 3D scene database viewed through a virtual camera. In 3D graphics, the computer already knows about the objects in the environment and implements a "feed-forward" computation to render a camera view. In computer vision, an observed scene must be analyzed to build the equivalent of a scene database that correctly identifies the characteristics of the objects in the scene. An embedded computer needs to enhance the image and then make inferences to identify the objects and correctly interpret the scene.

However, vision systems don’t have perfect information about a scene, since the camera field of view will often include occluded objects. It would be very difficult to computationally recreate a 3D scene...

September's Initial Embedded Vision Alliance Member Summit: A Resounding Success

By Brian Dipert
Editor-In-Chief
Embedded Vision Alliance
Senior Analyst
BDTI

How Does Camera Performance Affect Analytics?


By Michael Tusch
Founder and CEO
Apical Limited

Camera designers have decades of experience in creating image processing pipelines that produce attractive and/or visually accurate images, but what kind of image processing produces video that is well-purposed for subsequent analytics? It seems reasonable to begin by considering a conventional ISP (image signal processor). After all, the human eye-brain system produces what we consider aesthetically pleasing imagery for a purpose: to maximize our decision-making abilities. But which elements of such an ISP are most important to get right for good analytics, and how do they affect the performance of the analytics algorithms that consume the ISP's output?

In this introductory article, we’ll survey the main components of an ISP, highlight those components whose performance we believe to be particularly important for good analytics, and discuss what their effect is likely to be. In subsequent articles, we'll look at specific algorithm co-optimizations between analytics and these main components, focusing on applications in object tracking and face recognition.

Figure 1 shows a simplified block schematic of a conventional ISP. The input is sensor data in a raw format (one color per pixel), and the output is interpolated RGB or YCbCr data (three colors per pixel).

Figure 1. This block diagram provides a simplified view inside a conventional ISP (image signal processor)

Table 1 briefly summarizes the function of each block. The list is not intended to be exhaustive: an ISP design team will frequently also implement other modules.

Raw data correction: Set black point, remove defective pixels.
Lens correction: Correct for geometric and luminance/color distortions.
Noise reduction: Apply temporal and/or spatial averaging to increase SNR (signal-to-noise ratio).
Dynamic range compression: Reduce dynamic range from sensor to standard output without loss of information.
Demosaic: Reconstruct three colors per pixel via interpolation with pixel neighbors.
3A: Calculate correct exposure, white balance and focal position.
Color correction: Obtain correct colors in different lighting conditions.
Gamma: Encode video for standard output.
Sharpen: Edge enhancement.
Digital image stabilization: Remove global motion due to camera shake/vibration.
Color space conversion: RGB to YCbCr.

Table 1. Functions of main ISP modules

Analytics algorithms may operate directly on the raw data, on the output data, or on data which has subsequently passed through a video compression codec. The data at these three stages often has very different characteristics and quality, which are relevant to the performance of analytics algorithms.

Let us now review the stages of the ISP in order of decreasing importance to analytics, which happens also to be approximately the top-to-bottom order shown in Figure 1. We start with the sensor and optical system. Obviously, the better the sensor and optics, the better the quality of data on which to base decisions. But "better" is not a matter simply of resolution, frame rate or SNR (signal-to-noise ratio). Dynamic range is also a key characteristic. Dynamic range is essentially the relative difference in brightness between the brightest and darkest details that the sensor can record within a single scene, normally expressed in dB.

Common CMOS and CCD sensors have a dynamic range of between 60 and 70 dB, which is sufficient to capture all of the detail in scenes that are fairly uniformly illuminated. Special sensors are required, on the other hand, to capture the full range of illumination in high-contrast environments. Around 90 dB of dynamic range is needed to simultaneously record information in deep shadows and bright highlights on a sunny day, and this requirement rises further if extreme lighting conditions occur (the human eye has a dynamic range of around 120 dB). If the sensor can't capture such a range, objects that move across the scene will disappear into blown-out highlights, or into deep shadows below the sensor black level. High (i.e., wide) dynamic range sensors are certainly helpful in improving analytics in uncontrolled lighting environments. Efficient processing of such sensor data is not trivial, however, as discussed below.
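To connect the dB figures above to contrast ratios, dynamic range in dB is 20 * log10 of the ratio between the brightest and darkest recordable levels. The tiny sketch below is illustrative only; it simply prints the ratios corresponding to a few of the values quoted above.

#include <cmath>
#include <cstdio>

// Dynamic range in dB for a given brightest:darkest intensity ratio.
static double dynamic_range_db(double ratio)
{
    return 20.0 * std::log10(ratio);
}

int main()
{
    std::printf("1,000:1 scene     -> %.0f dB\n", dynamic_range_db(1.0e3));  // 60 dB
    std::printf("30,000:1 scene    -> %.0f dB\n", dynamic_range_db(3.0e4));  // ~90 dB
    std::printf("1,000,000:1 scene -> %.0f dB\n", dynamic_range_db(1.0e6));  // 120 dB
    return 0;
}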

The next most important element is noise reduction, which matters for a number of reasons. In low light, noise reduction is frequently necessary to raise objects above the noise floor, subsequently aiding accurate segmentation. Also, high levels of temporal noise can easily confuse tracking algorithms based on pixel motion, even though such noise is largely uncorrelated both spatially and temporally. If the video goes through a lossy compression algorithm prior to analytics post-processing, you should also consider the effect of noise reduction on compression efficiency. The bandwidth required to compress noisy sources is much higher than with "clean" sources. If transmission or storage is bandwidth-limited, the presence of noise reduces the overall compression quality and may lead to increased amplitude of quantization (blocking) artifacts, which easily confuse analytics algorithms.

Effective noise reduction can readily increase compression efficiency by 70% or more in moderately noisy environments, even when the increase in SNR is not visually very noticeable. However, noise reduction algorithms may themselves introduce artifacts. Temporal processing works well because it increases the SNR by averaging over multiple frames. Both global and local motion compensation may be necessary to eliminate false motion trails in environments with fast movement. Spatial noise reduction aims to blur noise while retaining texture and edges, but it risks suppressing important details. You must therefore strike a careful balance between SNR improvement and image quality degradation.
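As a rough illustration of the temporal-averaging idea, the sketch below keeps an exponential running average of incoming frames using OpenCV's accumulateWeighted. It is a minimal sketch only: averaging N frames of uncorrelated noise improves SNR by roughly sqrt(N), but with no motion compensation, fast-moving objects will leave the ghosting trails mentioned above. The smoothing factor alpha is an arbitrary illustrative choice.

#include <opencv2/opencv.hpp>

// Naive temporal noise reduction: exponential running average of successive frames.
// accum holds the running average in floating point; convert it back to 8-bit
// (e.g. accum.convertTo(display, CV_8U)) for display or compression.
void temporal_denoise(const cv::Mat& frame, cv::Mat& accum, double alpha = 0.25)
{
    if (accum.empty())
        frame.convertTo(accum, CV_32F);               // initialize with the first frame
    else
        cv::accumulateWeighted(frame, accum, alpha);  // accum = (1 - alpha)*accum + alpha*frame
}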

The correction of lens geometric distortions, chromatic aberrations and lens shading (vignetting) varies in significance, depending on the optics and the application. For conventional cameras, uncorrected data may be perfectly suitable for post-processing. In digital PTZ (pan, tilt and zoom) cameras, on the other hand, correction is a fundamental component of the system.

A set of "3A" algorithms controls camera exposure, color and focus, based on statistical analysis of the sensor data. Their function and impact on analytics is shown in Table 2 below.

Auto exposure. Function: adjust exposure to maximize the amount of scene captured, and avoid flicker in artificial lighting. Impact: a poor algorithm may blow out highlights or clip dark areas, losing information; temporal instabilities may confuse motion-based analytics.

Auto white balance. Function: obtain correct colors in all lighting conditions. Impact: if color information is used by the analytics, it needs to be accurate, and it is challenging to achieve accurate colors in all lighting conditions.

Auto focus. Function: focus the camera. Impact: which regions of the image should receive focus attention, and how should the algorithm balance temporal stability against rapid refocusing when the scene changes?

Table 2. The impact of "3A" algorithms

Finally, we turn to DRC (dynamic range compression). DRC is a method of non-linear image adjustment which reduces dynamic range, i.e. global contrast. It has two primary functions: detail preservation and luminance normalization.

I mentioned above that the better the dynamic range of the sensor and optics, the more data will typically be available for analytics to work on. But in what form do the algorithms receive this data? For some embedded vision applications, it may be no problem to work directly with the high-bit-depth raw sensor data. But if the analytics runs in-camera on RGB or YCbCr data, or as post-processing on already lossy-compressed data, the dynamic range of that data is typically limited by the 8-bit standard format, which corresponds to 60 dB. This means that, unless dynamic range compression occurs in some way prior to encoding, the additional scene information will be lost. While techniques for DRC are well established (gamma correction is one form, for example), many of these techniques decrease image quality in the process, by degrading local contrast and color information or by introducing spatial artifacts.
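As a deliberately crude illustration of gamma-style DRC, the sketch below compresses high-bit-depth linear data into 8 bits with a single global power curve. The 12-bit input range and the 2.2 exponent are assumptions for illustration; real ISP DRC blocks use locally adaptive tone mapping precisely to avoid the loss of local contrast that a global curve like this causes.

#include <opencv2/opencv.hpp>

// Crude global DRC: compress 12-bit linear data (0..4095) into 8 bits with a gamma curve.
// out = 255 * (in / 4095)^(1/2.2). Compared with a linear rescale, the power curve keeps
// more shadow detail visible, but it flattens local contrast.
cv::Mat gamma_compress_12to8(const cv::Mat& raw12)
{
    cv::Mat f, out;
    raw12.convertTo(f, CV_32F, 1.0 / 4095.0);   // normalize to 0..1
    cv::pow(f, 1.0 / 2.2, f);                   // apply gamma 2.2
    f.convertTo(out, CV_8U, 255.0);             // scale to 8-bit output
    return out;
}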

Another application of DRC is in image normalization. Advanced analytics algorithms, such as those employed in facial recognition, are susceptible to changing and non-uniform lighting environments. For example, an algorithm may recognize the same face differently depending on whether the face is uniformly illuminated or lit by a point source to one side, in the latter case casting a shadow on the other side. Good DRC processing can be effective in normalizing imagery from highly variable illumination conditions to simulated constant, uniform lighting, as shown in Figure 2.

Figure 2. DRC (dynamic range compression) can normalize a source image with non-uniform illumination

In general, we find that the requirements of an ISP to produce natural, visually-accurate imagery and to produce well-purposed imagery for analytics are closely matched. However, as is well-known, it is challenging to maintain accuracy in uncontrolled environments. In future articles, we will focus on individual blocks of the ISP and consider the effect of their specific performance on the behavior of example analytics algorithms.

Design Guidelines for Embedded Real-Time Face Detection Applications



By Eldad Melamed
Project Manager, Video Algorithms
CEVA

To read this article as a pdf file, click here.

Much like the human visual system, embedded computer vision systems analyze and extract information from video, and they do so in a wide variety of products.

In embedded portable devices such as smartphones, digital cameras, and camcorders, this elevated performance has to be delivered within tight size, cost, and power constraints. Emerging high-volume embedded vision markets include automotive safety, surveillance, and gaming. Computer vision algorithms identify objects in a scene, and can consequently single out one region of an image as more important than the others. For example, object and face detection can be used to enhance the video conferencing experience, public security file management, content-based retrieval, and many other applications.

Cropping and resizing can then be performed to properly center the image on a face. In this paper we present an application that detects faces in a digital image, crops the selected main face, and resizes it to a fixed-size output image (see Figure 1).

The application can be used on a single image or on a video stream, and it is designed to run in real time. For face detection on mobile devices, appropriate implementation steps must be taken in order to achieve real-time throughput.

This paper presents such steps for real-time deployment of a face detection application on a programmable vector processor. The steps taken are general purpose in the sense that they can be used to implement similar computer vision algorithms on any mobile device.
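The CEVA implementation targets a programmable vector processor, but the overall flow (detect faces, pick a main face, crop, and resize to a fixed output size) can be sketched on a PC with OpenCV. The sketch below is illustrative only: choosing the largest detection as the "main" face and the 128x128 output size are assumptions, and the cascade must first be loaded from one of the stock OpenCV frontal-face XML files.

#include <opencv2/opencv.hpp>
#include <vector>

// Detect faces, crop the largest one, and resize it to a fixed-size output image.
// Before calling: cascade.load("haarcascade_frontalface_alt.xml") or similar.
cv::Mat crop_main_face(const cv::Mat& img, cv::CascadeClassifier& cascade,
                       cv::Size outSize = cv::Size(128, 128))
{
    std::vector<cv::Rect> faces;
    cv::Mat gray;
    cv::cvtColor(img, gray, CV_BGR2GRAY);
    cascade.detectMultiScale(gray, faces, 1.2, 3);   // scale factor and neighbors are typical defaults

    if (faces.empty())
        return cv::Mat();                             // no face found

    // Assume the largest detection is the main face.
    cv::Rect best = faces[0];
    for (size_t i = 1; i < faces.size(); ++i)
        if (faces[i].area() > best.area())
            best = faces[i];

    cv::Mat out;
    cv::resize(img(best), out, outSize);              // crop the face ROI and resize to fixed output
    return out;
}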
     

Figure 1. CEVA face detection application

 

While still-image processing consumes a small amount of bandwidth and memory, video can be considerably more demanding on today's memory systems, and memory system design for computer vision algorithms can be extremely challenging because of the additional processing steps required to detect and classify objects. Consider a thumbnail with 19x19...

Lens Distortion Correction


by Shehrzad Qureshi
Senior Engineer, BDTI
May 14, 2011

A typical processing pipeline for computer vision is given in Figure 1 below:

Figure 1. A typical computer vision processing pipeline

The focus of this article is on the lens correction block. In less-than-ideal optical systems, such as those found in cheaper smartphones and tablets, incoming frames tend to be distorted toward their edges. The most common types of lens distortion are barrel distortion, pincushion distortion, or some combination of the two[1]. Figure 2 illustrates the types of distortion encountered in vision systems, and in this article we will discuss strategies and implementations for correcting this type of lens distortion.

Figure 2. Types of lens distortion

These types of lens aberrations can cause problems for vision algorithms because machine vision usually prefers straight edges (for example, lane finding in automotive applications, or various inspection systems). The general effect of both barrel and pincushion distortion is to project what should be straight lines as curves. Correcting for these distortions is computationally expensive because it is a per-pixel operation. However, the correction process is also highly regular and "embarrassingly data parallel", which makes it amenable to FPGA or GPU acceleration. The FPGA solution can be particularly attractive, as there are now cameras on the market with FPGAs in the camera itself that can be programmed to perform this type of processing[2].

Calibration Procedure

The rectilinear correction procedure can be summarized as warping a distorted image (see Figures 2b and 2c) to remove the lens distortion, thus taking the frame back to its undistorted projection (Figure 2a). In other words, we must first estimate the lens distortion function, and then invert it so as to compensate the incoming image frame. The compensated image will be referred to as the undistorted image.

Both types of lens aberrations discussed so far are radial distortions that increase in magnitude as we move farther away from the image center. In order to correct for this distortion at runtime, we first must estimate the coefficients of a parameterized form of the distortion function during a calibration procedure that is specific to a given optics train. The detailed mathematics behind this parameterization is beyond the scope of this article, and is covered thoroughly elsewhere[3]. Suffice it to say that if we have:

  • (x_d, y_d) = original distorted point coordinates
  • (x_c, y_c) = image center
  • (x_u, y_u) = undistorted (corrected) point coordinates

then the goal is to measure the radial distortion model

  x_u = x_c + L(r) (x_d - x_c)
  y_u = y_c + L(r) (y_d - y_c)

where r is the distance of the distorted point from the image center:

  r = sqrt((x_d - x_c)^2 + (y_d - y_c)^2)

and L(r) is a polynomial whose coefficients k_0 ... k_n describe the distortion:

  L(r) = k_0 + k_1 r + k_2 r^2 + ... + k_n r^n
The purpose of the calibration procedure is to estimate the radial distortion coefficients, which can be done using a gradient descent optimizer[3]. Typically one images a test pattern with collinear features that are easily extracted autonomously with sub-pixel accuracy. The checkerboard in Figure 2a is one such test pattern that the author has used several times in the past. Figure 3 summarizes the calibration process:

Figure 3. The lens distortion calibration process

The basic idea is to image a distorted test pattern, extract the coordinates of lines which are known to be straight, feed these coordinates into an optimizer such as the one described in [3], which emits the lens undistort warp coefficients. These coefficients are used at run-time to correct for the measured lens distortion.

In the provided example we can use a Harris corner detector to automatically find the corners of the checkerboard pattern. The OpenCV library has a robust corner detection function that can be used for this purpose[4]. The following snippet of OpenCV code can be used to extract the interior corners of the calibration image of Figure 2a:

 

const int nSquaresAcross = 9;
const int nSquaresDown = 7;
const int nCorners = (nSquaresAcross-1)*(nSquaresDown-1);
CvSize szcorners = cvSize(nSquaresAcross-1, nSquaresDown-1);
std::vector<CvPoint2D32f> vCornerList(nCorners);

/* find corners to pixel accuracy */
int cornerCount = 0;
const int N = cvFindChessboardCorners(pImg,            /* IplImage* of the calibration image */
                                      szcorners,
                                      &vCornerList[0],
                                      &cornerCount);
/* should check that N != 0 and cornerCount == nCorners */

/* sub-pixel refinement */
cvFindCornerSubPix(pImg,
                   &vCornerList[0],
                   cornerCount,
                   cvSize(5,5),                        /* search window half-size */
                   cvSize(-1,-1),                      /* no dead zone */
                   cvTermCriteria(CV_TERMCRIT_EPS, 0, .001));

 

Segments corresponding to what should be straight lines are then constructed from the point coordinates stored in the STL vCornerList container. For best results, multiple lines should be used, and both vertical and horizontal line segments are necessary for the optimization procedure to converge to a viable solution (see the blue lines in Figure 3).

Finally, we are ready to determine the radial distortion coefficients. There are numerous camera calibration packages (including one in OpenCV), but a particularly good open-source ANSI C library can be located here[5]. Essentially the line segment coordinates are fed into an optimizer which determines the undistort coefficients by minimizing the error between the radial distortion model and the training data. These coefficients can then be stored in a lookup table for run-time image correction.

Lens Distortion Correction (Warping)

The referenced calibration library[5] also includes a very informative online demo that illustrates the procedure briefly described here. Application of the radial distortion coefficients to correct for the lens aberrations basically boils down to an image warp operation. That is, for each pixel in the (undistorted) output frame, we compute the distance from the image center and evaluate a polynomial that gives us the pixel coordinates from which to fetch the corrected pixel intensity. Because the polynomial evaluation will more than likely fall in between integer pixel coordinates, some form of interpolation must be used. The simplest and cheapest interpolant is so-called "nearest neighbor", which, as its name implies, simply picks the nearest pixel, but this technique results in poor image quality. At a bare minimum, bilinear interpolation should be employed, and oftentimes higher-order bicubic interpolants are called for.
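The inner loop of that warp can be sketched as follows. This is a scalar, grayscale-only sketch for illustration: it assumes the coefficient vector k produced by calibration parameterizes the radial mapping from each output (undistorted) pixel position to its source position in the distorted frame, which is the direction needed for warping, and it uses bilinear interpolation.

#include <opencv2/opencv.hpp>
#include <cmath>
#include <vector>

// For each undistorted output pixel, evaluate the radial polynomial to find the source
// coordinate in the distorted image, then sample it with bilinear interpolation.
cv::Mat undistort_gray(const cv::Mat& src, const std::vector<double>& k,
                       double xc, double yc)
{
    cv::Mat dst(src.size(), CV_8UC1, cv::Scalar(0));
    for (int y = 0; y < dst.rows; ++y) {
        for (int x = 0; x < dst.cols; ++x) {
            double dx = x - xc, dy = y - yc;
            double r = std::sqrt(dx*dx + dy*dy);

            // L(r) = k0 + k1*r + k2*r^2 + ...
            double L = 0.0, rp = 1.0;
            for (size_t i = 0; i < k.size(); ++i) { L += k[i] * rp; rp *= r; }

            double sx = xc + dx * L;                    // source coordinate in distorted image
            double sy = yc + dy * L;
            int x0 = (int)std::floor(sx), y0 = (int)std::floor(sy);
            if (x0 < 0 || y0 < 0 || x0 + 1 >= src.cols || y0 + 1 >= src.rows)
                continue;                               // leave out-of-range pixels at the fill value

            double fx = sx - x0, fy = sy - y0;          // bilinear weights
            double v = (1 - fx) * (1 - fy) * src.at<uchar>(y0,     x0)
                     + fx       * (1 - fy) * src.at<uchar>(y0,     x0 + 1)
                     + (1 - fx) * fy       * src.at<uchar>(y0 + 1, x0)
                     + fx       * fy       * src.at<uchar>(y0 + 1, x0 + 1);
            dst.at<uchar>(y, x) = cv::saturate_cast<uchar>(v);
        }
    }
    return dst;
}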

The amount of computation per frame can become quite large, particularly if we are dealing with color frames. The saving grace of this operation is its inherent parallelism: each pixel is completely independent of its neighbors, and hence can be corrected in parallel. This parallelism and the highly regular nature of the computations lend themselves readily to accelerators, either via FPGAs[6,7] or GPUs[8,9]. The source code provided in [5] includes a lens correction function with full ANSI C source.

A faster software implementation than [5] can be realized using the OpenCV cvRemap() function[4]. The input arguments to this function are the source and destination images, the source-to-destination pixel mapping (expressed as two floating point arrays), interpolation options, and a default fill value (used if there are any holes in the rectilinear-corrected image). At calibration time, we evaluate the distortion model polynomial just once and then store the pixel mapping to disk or memory. At run time the software simply calls cvRemap() (which is optimized and can accommodate color frames) to correct the lens distortion.
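A sketch of that approach appears below. cvRemap() and CV_MAT_ELEM() are standard OpenCV C API facilities; the polynomial coefficients, image-center variables, and mapping direction follow the same assumptions as in the previous sketch.

#include <opencv2/opencv.hpp>
#include <cmath>
#include <vector>

// Build the undistort maps once at calibration time, then reuse cvRemap() on every frame.
void build_undistort_maps(CvMat* mapx, CvMat* mapy,            /* CV_32FC1, image-sized */
                          const std::vector<double>& k,         /* radial coefficients  */
                          double xc, double yc)
{
    for (int y = 0; y < mapx->rows; ++y) {
        for (int x = 0; x < mapx->cols; ++x) {
            double dx = x - xc, dy = y - yc;
            double r = std::sqrt(dx*dx + dy*dy);
            double L = 0.0, rp = 1.0;                            /* L(r) = k0 + k1*r + ... */
            for (size_t i = 0; i < k.size(); ++i) { L += k[i] * rp; rp *= r; }
            CV_MAT_ELEM(*mapx, float, y, x) = (float)(xc + dx * L);
            CV_MAT_ELEM(*mapy, float, y, x) = (float)(yc + dy * L);
        }
    }
}

/* Per frame:
   cvRemap(pSrc, pDst, mapx, mapy,
           CV_INTER_LINEAR + CV_WARP_FILL_OUTLIERS, cvScalarAll(0));  */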

References

[1] Wikipedia page:  http://en.wikipedia.org/wiki/Distortion_(optics)

[2] http://www.hunteng.co.uk/info/fpgaimaging.htm

[3] L. Alvarez, L. Gomez, R. Sendra. An Algebraic Approach to Lens Distortion by Line Rectification, Journal of Mathematical Imaging and Vision, Vol. 39 (1), July 2009, pp. 36-50.

[4] Bradski, Gary and Kaehler, Adrian. Learning OpenCV (Computer Vision with the OpenCV Library), O’Reilly, 2008.

[5] http://www.ipol.im/pub/algo/ags_algebraic_lens_distortion_estimation/

[6] Daloukas, K.; Antonopoulos, C.D.; Bellas, N.; Chai, S.M. "Fisheye lens distortion correction on multicore and hardware accelerator platforms," 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), pp. 1-10, 19-23 April 2010.

[7] J. Jiang, S. Schmidt, W. Luk and D. Rueckert, "Parameterizing reconfigurable designs for image warping," Proc. SPIE, vol. 4867, p. 86, 2002.

[8] http://visionexperts.blogspot.com/2010/07/image-warping-using-texture-fetches.html

[9] Rosner, J., Fassold, H., Bailer, W., Schallauer, P., "Fast GPU-based Image Warping and Inpainting for Frame Interpolation," Proceedings of the Computer Graphics, Computer Vision and Mathematics (GraVisMa) Workshop, Plzen, CZ, 2010.