Embedded Vision Alliance: Technical Articles

Computer Vision Metrics: Chapter Three (Part D)



For Part C of Chapter Three, please click here.

Bibliography references are set off with brackets, i.e. "[XXX]". For the corresponding bibliography entries, please click here.


Statistical Region Metrics

Describing texture in terms of statistical metrics of the pixels is a common and intuitive approach. Often a simple histogram of a region will describe the texture well enough for many applications, and the histogram has many variations that lend themselves to a wide range of texture analysis methods. So this is a good point at which to examine histogram methods. Since statistical mathematics is a vast field, we can only introduce the topic here, dividing the discussion into image moment features and point metric features.

Image Moment Features

Image moments [518,4] are scalar quantities, analogous to the familiar statistical measures such as mean, variance, skew, and kurtosis. Moments are well suited to describe polygon shape features and general feature metric information such as gradient distributions. Image moments can be based on either scalar point values or basis functions such as Fourier or Zernike methods discussed later in the section on basis space.

Moments can describe the projection of a function onto a basis space—for example, the Fourier transform projects a function onto a basis of harmonic functions. Note that there is a conceptual relationship between 1D and 2D moments in the context of shape description. For example, the 1D mean corresponds to the 2D centroid, and the 1D minimum and maximum correspond to the 2D major and minor axes. The 1D minimum and maximum also correspond to the 2D bounding box around the 2D polygon shape (also see Figure 6-29).
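To make this correspondence concrete, here is a minimal NumPy sketch (not the book's code; the normalization of the central moments by area is a choice made here for readability) that computes a binary region's area, centroid, and second-order central moments:

    import numpy as np

    def region_moments(mask):
        """Area, centroid, and 2nd-order central moments of a binary region."""
        ys, xs = np.nonzero(mask)          # pixel coordinates of the region
        area = float(len(xs))              # 0th-order raw moment (pixel count)
        cx, cy = xs.mean(), ys.mean()      # centroid from the 1st-order raw moments
        mu20 = ((xs - cx) ** 2).mean()     # 2nd-order central moments describe
        mu02 = ((ys - cy) ** 2).mean()     # the elliptical spread of the shape
        mu11 = ((xs - cx) * (ys - cy)).mean()
        return area, (cx, cy), (mu20, mu02, mu11)

    # Example: a filled rectangle as the 2D "polygon shape"
    mask = np.zeros((32, 32), dtype=np.uint8)
    mask[8:24, 4:28] = 1
    print(region_moments(mask))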

In this work, we classify image moments under the term polygon shape descriptors in the taxonomy (see Chapter 5). Details on several image moments used for 2D shape description will be covered in Chapter 6, under “Object Shape Metrics for Blobs and Objects.”

Common properties of moments in the context of 1D distributions and 2D images include:

  • 0th order moment is the mean or 2D centroid.
  • Central moments describe variation around the mean or 2D centroid.
  • 1st order central moments contain information about 2D area, centroid, and size.
  • 2nd order central moments are related to variance and measure 2D elliptical shape.
  • 3rd order central moments provide symmetry information about the 2D shape, or skewness.
  • 4th order...

Computer Vision Metrics: Chapter Three (Part C)



For Part B of Chapter Three, please click here.

Bibliography references are set off with brackets, i.e. "[XXX]". For the corresponding bibliography entries, please click here.


Extended SDM Metrics

Extensions to the Haralick metrics have been developed by the author [26], primarily motivated by a visual study of SDM plots as shown in Figure 3-7. Applications for the extended SDM metrics include texture analysis, data visualization, and image recognition. The visual plots of the SDMs alone are valuable indicators of pixel intensity relationships, and are worth using along with histograms to get to know the data.

The extended SDM metrics include centroid, total coverage, low-frequency coverage, total power, relative power, locus length, locus mean density, bin mean density, containment, linearity, and linearity strength. The extended SDM metrics capture key information that is best observed by looking at the SDM plots. In many cases the extended SDM metrics are computed four times, once for each SDM direction of 0, 45, 90, and 135 degrees, as shown in Figure 3-5.

The SDMs are interesting and useful all by themselves when viewed as an image. Many of the texture metrics suggested are obvious after viewing and understanding the SDMs; others are neither obvious nor apparently useful until developing a basic familiarity with the visual interpretation of SDM image plots. Next, we survey the following:

  • Example SDMs showing four directional SDM maps: A complete set of SDMs would contain four different plots, one for each orientation. Interpreting the SDM plots visually reveals useful information. For example, an image with a smooth texture will yield a narrow diagonal band of co-occurrence values; an image with wide texture variation will yield a larger spread of values; and a noisy image will yield a co-occurrence matrix with outlier values at the extrema. In some cases, noise may be distributed along only one axis of the image, perhaps across rows (the x axis), which could indicate sensor readout noise as each line is read out of the sensor, suggesting a row- or line-oriented image preparation stage in the vision pipeline to compensate for the camera.
  • Extended SDM texture metrics: The addition of 12 other useful statistical measures to those proposed by Haralick.
  • Some code snippets: These illustrate the extended SDM computations; a brief sketch also follows this list, and full source code is shown in Appendix D.
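As a rough illustration only (the definitions of the two coverage measures below are assumptions for demonstration, not the exact formulas from Appendix D), the following sketch builds the 0-degree co-occurrence matrix and computes two coverage-style measures from it; the same calculation would be repeated for the 45-, 90-, and 135-degree directions:

    import numpy as np

    def sdm_0deg(img, levels=256):
        # Spatial dependency (co-occurrence) matrix for the 0-degree direction:
        # count pixel pairs (p[y, x], p[y, x + 1]).
        sdm = np.zeros((levels, levels), dtype=np.uint32)
        left = img[:, :-1].ravel().astype(np.intp)
        right = img[:, 1:].ravel().astype(np.intp)
        np.add.at(sdm, (left, right), 1)
        return sdm

    def coverage_measures(sdm):
        # "Total coverage" here means the fraction of occupied SDM bins;
        # "low-frequency coverage" the fraction of bins hit only once or twice.
        total_coverage = (sdm > 0).mean()
        lowfreq_coverage = ((sdm > 0) & (sdm <= 2)).mean()
        return total_coverage, lowfreq_coverage

    img = (np.random.rand(64, 64) * 255).astype(np.uint8)   # stand-in image region
    print(coverage_measures(sdm_0deg(img)))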

In Figure 3-7, several of the extended SDM metrics can be easily seen, including...

Computer Vision Metrics: Chapter Three (Part B)



For Part A of Chapter Three, please click here.

Bibliography references are set off with brackets, i.e. "[XXX]". For the corresponding bibliography entries, please click here.


Statistical Methods

The topic of statistical methods is vast, and we can only refer the reader to selected literature as we go along. One useful and comprehensive resource is the online NIST (National Institute of Standards and Technology) Engineering Statistics Handbook (http://www.itl.nist.gov/div898/handbook/), which includes examples and links to additional resources and tools.

Statistical methods may be drawn upon at any time to generate novel feature metrics. Any feature, such as pixel values or local region gradients, can be expressed statistically by any number of methods. Simple methods, such as the histogram shown in Figure 3-1, are invaluable. Basic statistics such as minimum, maximum, and average values can be seen easily in the histogram shown in Chapter 2 (Figure 2-22). We survey several applications of statistical methods to computer vision here.
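For example, a region histogram and its basic point statistics take only a few lines with NumPy and Matplotlib; plotting the same counts on linear and log scales, as in Figure 3-1, changes how the data reads. A minimal sketch, with a synthetic region standing in for real pixel data:

    import numpy as np
    import matplotlib.pyplot as plt

    region = (np.random.rand(128, 128) * 255).astype(np.uint8)   # stand-in pixel data

    # Basic point statistics of the region.
    print("min", region.min(), "max", region.max(), "mean", region.mean())

    # 256-bin histogram of the 8-bit pixel values.
    counts, edges = np.histogram(region, bins=256, range=(0, 256))

    fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(8, 3))
    ax_lin.bar(edges[:-1], counts, width=1, color="black")
    ax_lin.set_title("Linear scale")
    ax_log.bar(edges[:-1], counts, width=1, color="gray")
    ax_log.set_yscale("log")           # same counts, log-scaled axis
    ax_log.set_title("Log scale")
    plt.show()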


Figure 3-1. Histogram with linear scale values (black) and log scale values (gray), illustrating how the same data is interpreted differently based on the chart scale

Texture Region Metrics

Now we look in detail at the specific metrics for feature description based on texture. Texture is one of the most-studied classes of metrics. It can be thought of in terms of the surface—for example, a burlap bag compared to silk fabric. There are many possible textural relationships and signatures that can be devised in a range of domains, with new ones being developed all the time. In this section we survey some of the most common methods for calculating texture metrics:

  • Edge metrics
  • Cross-correlation
  • Fourier spectrum signatures
  • Co-occurrence matrix, Haralick features, extended SDM features
  • Laws texture metrics
  • Tessellation
  • Local binary patterns (LBP)
  • Dynamic textures

Within an image, each image region has a texture signature, where texture is defined as a common structure and pattern within that region. Texture signatures may be a function of position and intensity relationships, as in the spatial domain, or be based...

Computer Vision Metrics: Chapter Three



Bibliography references are set off with brackets, i.e. "[XXX]". For the corresponding bibliography entries, please click here.


Global and Regional Features

Measure twice, cut once.
—Carpenter’s saying

This chapter covers the metrics of general feature description, often used for whole images and image regions, including textural, statistical, model-based, and basis space methods. Texture, a key metric, is a well-known topic within image processing, and it is commonly divided into structural and statistical methods. Structural methods look for features such as edges and shapes, while statistical methods are concerned with pixel value relationships and statistical moments. Methods for modeling image texture also exist, primarily useful for image synthesis rather than for description. Basis spaces, such as the Fourier space, are also used for feature description.

It is difficult to develop clean partitions between the related topics in image processing and computer vision that pertain to global vs. regional vs. local feature metrics; there is considerable overlap in the applications of most metrics. However, for this chapter, we divide these topics along reasonable boundaries, though those borders may appear to be arbitrary. Similarly, there is some overlap between discussions here on global and regional features and topics that were covered in Chapter 2 on image processing and that will be discussed in Chapter 6 on local features. In short, many methods are used for local, regional, and global feature description, as well as image processing, such as the Fourier transform and the LBP.

But we begin with a brief survey of some key ideas in the field of texture analysis and general vision metrics.

Historical Survey of Features

To compare and contrast global, regional, and local feature metrics, it is useful to survey and trace the development of the key ideas, approaches, and methods used to describe features for machine vision. This survey includes image processing (textures and statistics) and machine vision (local, regional, and global features). Historically, the choice of feature metrics was limited to those that were computable at the time, given the limitations in compute performance, memory, and sensor technology. As time has passed and technology has developed, the metrics have become more complex to compute and consume larger memory footprints. Images are becoming multi-modal, combining intensity, color, multiple spectra, depth sensor information, multiple exposure settings, high dynamic range imagery, faster frame rates, and more precision and accuracy in x, y, and Z (depth). Increases in memory bandwidth and compute performance, therefore,...

Vision-Based Artificial Intelligence Brings Awareness to Surveillance


A version of this article was originally published at EE Times' Embedded.com Design Line. It is reprinted here with the permission of EE Times.

Moving beyond the research lab, embedded vision is rapidly augmenting traditional law enforcement techniques in real world surveillance settings. Technological gaps are rapidly being surmounted as automated surveillance systems' various hardware and software components become richer in their feature sets, higher in performance, more power efficient, and lower priced.

By Brian Dipert
Editor-in-Chief
Embedded Vision Alliance

Jacob Jose
IP Camera Product Marketing Manager
Texas Instruments

and Darnell Moore
Senior Member of the Technical Staff, Embedded Vision Team
Texas Instruments

Recent events showcase both the tantalizing potential and the current underwhelming reality of automated surveillance technology. Consider, for example, the terrorist bombing at the finish line of the April 15, 2013 Boston Marathon. Boylston Street was full of cameras, both those permanently installed by law enforcement organizations and businesses, and those carried by race spectators and participants. But none of them was able to detect the impending threat represented by the intentionally abandoned backpacks, each containing a pressure cooker bomb, with sufficient advance notice to prevent the tragedy. And the resultant flood of video footage was predominantly analyzed by the eyes of police department and FBI representatives attempting to identify and locate the perpetrators, due to both the slow speed and low accuracy of the alternative computer-based image analysis algorithms.

Consider, too, the ongoing military presence in Afghanistan and elsewhere, as well as the ongoing threat to U.S. embassies and other facilities around the world. Only a limited number of human surveillance personnel are available to look out for terrorist activities such as the installation of IEDs (improvised explosive devices) and other ordnance, the congregation and movement of enemy forces, and the like. And these human surveillance assets are further hampered by fundamental human shortcomings such as distraction and fatigue.

Computers, on the other hand, don't get sidetracked, and they don't need sleep. More generally, an abundance of ongoing case studies, domestic and international alike, provide ideal opportunities to harness the tireless analysis assistance that computer vision processing can deliver. Automated analytics algorithms are conceptually able, for example, to sift through an abundance of security camera footage in order to pinpoint an object left at a scene and containing an explosive device, cash, contraband or other contents of interest to investigators. And after capturing facial features and other details of the person(s) who left the object, analytics algorithms can conceptually also index image databases both public (Facebook, Google Image Search, etc.) and private (CIA, FBI, etc.) in order to rapidly identify the suspect(s).

Unfortunately, left-object, facial recognition and other related technologies haven't historically been sufficiently mature to be relied upon with high confidence, especially in non-ideal usage settings, such as when individuals aren't looking directly at the lens or are obscured by shadows or other challenging lighting conditions. As a result, human eyes and brains were traditionally relied upon for video analysis instead of computer code, thereby delaying suspect identification and pursuit, as well as increasing the possibility of error (false positives, missed detections, etc.). Such automated surveillance technology shortcomings are rapidly being surmounted, however, as cameras (and the image sensors contained within them) become more feature-rich, as the processors analyzing the video outputs similarly increase in performance, and as the associated software therefore becomes more robust.

As these and other key system building blocks such as memory devices also decrease in cost and power consumption, opportunities for surveillance applications are rapidly expanding beyond traditional law enforcement into new markets such as business analytics and consumer-tailored surveillance systems, as well as smart building and smart city initiatives. To facilitate these trends, an alliance of hardware and software component suppliers, product manufacturers, and system integrators has emerged to accelerate the availability and adoption of intelligent surveillance systems and other embedded vision processing opportunities.

How do artificial intelligence and embedded vision processing intersect? Answering this question begins with a few definitions. Computer vision is a broad, interdisciplinary field that attempts to extract useful information from visual inputs, by analyzing images and other raw sensor data. The term "embedded vision" refers to the use of computer vision in embedded systems, mobile devices, PCs and the cloud. Historically, image analysis techniques have typically only been implemented in complex and expensive, therefore niche, surveillance systems. However, the previously mentioned cost, performance and power consumption advances are now paving the way for the proliferation of embedded vision into diverse surveillance and other applications.

Automated Surveillance Capabilities

In recent years, digital equipment has rapidly entered the surveillance industry, which was previously dominated by analog cameras and tape recorders. Networked digital cameras, video recorders and servers have not only improved in quality and utility, but they have also become more affordable. Vision processing has added artificial intelligence to surveillance networks, enabling “aware” systems that help protect property, manage the flow of traffic, and even improve operational efficiency in retail stores. In fact, vision processing is helping to fundamentally change how the industry operates, allowing it to deploy people and other resources more intelligently while expanding and enhancing situational awareness. At the heart of these capabilities are vision algorithms and applications, commonly referred to as video analytics, which vary broadly in definition, sophistication, and implementation (Figure 1).


Figure 1. Video analytics is a broad application category referencing numerous image analysis functions, varying in definition, sophistication, and implementation.

Motion detection, as its name implies, allows surveillance equipment to automatically signal an alert when frame-to-frame video changes are noted. As one of the most useful automated surveillance capabilities, motion detection is widely available, even in entry-level digital cameras and video recorders. A historically popular technique for detecting motion relies on codecs' motion vectors, a byproduct of the motion estimation employed by video compression standards such as MPEG-2 and H.264. Because these standards are frequently hardware-accelerated, scene change detection using motion vectors can be efficiently implemented even on modest IP camera processors, requiring no additional computing power. However, this technique is susceptible to generating false alarms, because motion vector changes do not always coincide with motion from objects of interest. It can be difficult, if not impossible, using only the motion vector technique, to ignore unimportant changes such as trees moving in the wind or casting shifting shadows, or to adapt to changing lighting conditions.

These annoying "false positives" have unfortunately contributed to the perception that motion detection algorithms are unreliable. To wit, and to prevent vision systems from undermining their own utility, installers often insist on observing fewer than five false alarms per day. Nowadays, however, an increasing percentage of systems are adopting intelligent motion detection algorithms that apply adaptive background modeling along with other techniques to help identify objects with much higher accuracy levels, while ignoring meaningless motion artifacts. While there are no universal industry standards regulating accuracy, systems using these more sophisticated methods even with conventional 2-D cameras regularly achieve detection precision approaching 90 percent for typical surveillance scenes, i.e. those with those adequate lighting and limited background clutter. Even under more challenging environmental conditions, such as poor or wildly fluctuating lighting, precipitation-induced substantial image degradation, or heavy camera vibration, accuracy can still be near 70 percent. And the more advanced 3-D cameras discussed later in this article can boost accuracy higher still.

The capacity to accurately detect motion has spawned several related event-based applications, such as object counting and trip zone. As the names imply, counting tallies the number of moving objects crossing a user-defined imaginary line, while tripping flags an event each time an object moves from a defined zone to an adjacent zone. Other common applications include loitering, which identifies when objects linger too long, and object left-behind/removed, which searches for the appearance of unknown articles, or the disappearance of designated items.

Robust artificial intelligence often requires layers of advanced vision know-how, from low-level imaging processing to high-level behavioral or domain models. As an example, consider a demanding application such as traffic and parking lot monitoring, which maintains a record of vehicles passing through a scene. It is often necessary to first deploy image stabilization and other compensation techniques to retard the effects of extreme environmental conditions such as dynamic lighting and weather. Compute-intensive pixel-level processing is also required to perform background modeling and foreground segmentation.

To equip systems with scene understanding sufficient to identify vehicles as well as traffic lanes and direction, additional system competencies handle feature extraction, object detection, object classification (e.g., cars, trucks, pedestrians), and long-term tracking. LPR (license plate recognition) algorithms and other techniques locate license plates on vehicles and discern individual license plate characters. Some systems also collect metadata about vehicles, such as color, speed, direction, and size, which can then be streamed or archived in order to enhance subsequent forensic searches.

Algorithm Implementation Options

Traditionally, analytics systems were based on PC servers, with surveillance algorithms running on x86 CPUs. However, with the introduction of high-end vision processors, all image analysis steps (including those in the previously mentioned traffic systems) can now optionally be performed entirely in dedicated-function equipment. Embedded systems based on DSPs (digital signal processors), application SoCs (systems-on-chips), GPUs (graphics processors), FPGAs (field programmable gate arrays) and other processor types are now entering the mainstream, primarily driven by their ability to achieve vision processing performance comparable to that of x86-based systems, at lower cost and power consumption.

Standalone cameras and analytics DVRs (digital video recorders) and NVRs (networked video recorders) increasingly rely on embedded vision processing. Large remote monitoring systems, on the other hand, are still fundamentally based on one or more cloud servers that can aggregate and simultaneously analyze numerous video feeds. However, even emerging "cloud" infrastructure systems are beginning to adopt embedded solutions, in order to more easily address performance, power consumption, cost and other requirements. Embedded vision coprocessors can assist in building scalable systems, offering higher net performance in part by redistributing processing capabilities away from the central server core and toward cameras at the edge of the network.

Semiconductor vendors offer numerous devices for different segments of the embedded cloud analytics market. These ICs can be used on vision processing acceleration cards that go into the PCI Express slot of a desktop server, for example, or to build standalone embedded products (Figure 2). Many infrastructure systems receive compressed H.264 videos from IP cameras and decompress the image streams before analyzing them. Repeated "lossy" video compression and decompression results in information discard that may be sufficient to reduce the accuracy of certain video analytics algorithms. Networked cameras with local vision processing "intelligence," on the other hand, have direct access to raw video data and can analyze and respond to events with low latency (Figure 3).




Figure 2. Modern vision processing "engines" can implement standalone surveillance cameras (top) and embedded analysis systems (middle); alternatively, they can find use on processing acceleration add-in cards for conventional servers (bottom).


Figure 3. In distributed intelligence surveillance systems, networked cameras with local vision processing capabilities have direct access to raw video data and can rapidly analyze and respond to events.

Although the evolution to an architecture based on distributed intelligence is driving the proliferation of increasingly autonomous networked cameras, complex algorithms often still run on infrastructure servers. Networked cameras are commonly powered by Power over Ethernet (PoE) and therefore have a very limited power budget. Further, the lower the power consumption, the smaller and less conspicuous the camera can be. To quantify the capabilities of modern semiconductor devices, consider that an ARM Cortex-A9-based camera consumes only 1.8 W in its entirety while compressing H.264 video at 1080p30 (1920x1080 pixels per frame, 30 frames per second).

It's relatively easy to recompile PC-originated analytics software to run on an ARM processor, for example. However, as the clock frequency of the host CPU increases, the resultant camera power consumption also increases significantly compared with running some or all of the algorithm on a more efficient DSP, FPGA or GPU. Harnessing a dedicated vision coprocessor can reduce the power consumption even further. Further assisting software development, a variety of computer vision software libraries are available. Some, such as OpenCV (the Open Source Computer Vision Library), are cross-platform, while others, such as Texas Instruments' IMGLIB (the Image and Video Processing Library), VLIB (the Video Analytics and Vision Library) and VICP (the Video and Imaging Coprocessor Signal Processing Library), are vendor-proprietary. Leveraging pre-existing code speeds time to market, and to the extent that it exploits on-chip vision acceleration resources, it can also produce much higher performance than generic software (Figure 4).



Figure 4. Vision software libraries can speed a surveillance system's time to market (top) as well as notably boost its frame rate and other attributes (bottom).

Historical Trends and Future Forecasts

As previously mentioned, embedded vision processing is one of the key technologies responsible for evolving surveillance systems beyond their archaic CCTV (closed-circuit television) origins and into the modern realm of enhanced situational awareness and intelligent analytics. For most of the last century, surveillance required people, sometimes lots of them, to effectively patrol property and monitor screens and access controls. In the 1990s, DSPs and image processing ASICs (application-specific integrated circuits) helped the surveillance industry capture image content in digital form using frame grabbers and video cards. Coinciding with the emergence of high-speed networks for distributing and archiving data at scales that had previously been impossible, surveillance providers embraced computer vision technology as a means of helping manage and interpret the deluge of video content now being collected.

Initial vision applications such as motion detection sought to draw the attention of on-duty surveillance personnel, or to trigger recording for later forensic analysis. Early in-camera implementations were usually elementary, using simple DSP algorithms to detect gross changes in grayscale video, while those relying on PC servers for processing generally deployed more sophisticated detection and tracking algorithms. Over the years, however, embedded vision applications have substantially narrowed the performance gap with servers, benefiting from more capable function-tailored processors. Each processor generation has integrated more potent discrete components, including multiple powerful general computing cores as well as dedicated image and vision accelerators.

As a result of these innovations, the modern portfolio of embedded vision capabilities is constantly expanding. And these expanded capabilities are appearing in an ever-wider assortment of cameras, featuring multi-megapixel CMOS sensors with wide dynamic range and/or thermal imagers, and designed for every imaginable installation requirement, including dome, bullet, hidden/concealed, vandal-proof, night vision, pan-tilt-zoom, low light, and wirelessly networked devices. Installing vision-enabled cameras at the ‘edge’ has reduced the need for expensive centralized PCs and backend equipment, lowering the implementation cost sufficiently to place these systems within reach of broader market segments, including retail, small business, and residential.

The future is bright for embedded vision systems. Sensors capable of discerning and recovering 3-D depth data, such as stereo vision, TOF (time-of-flight), and structured light technologies, are increasingly appearing in surveillance applications, promising significantly more reliable and detailed analytics. 3-D techniques can be extremely useful when classifying or modeling detected objects while ignoring shadows and illumination artifacts, addressing a problem that has long plagued conventional 2-D vision systems. In fact, systems leveraging 3-D information can deliver detection accuracies above 90 percent, even for highly complex scenes, while maintaining a minimal false detection rate (Figure 5).


Figure 5. 3-D cameras are effective in optimizing detection accuracy, by enabling algorithms to filter out shadows and other traditional interference sources.

However, these 3-D technology advantages come with associated tradeoffs that also must be considered. For example, stereo vision, which uses geometric “triangulation” to estimate scene depth, is a passive, low-power approach to depth recovery which is generally less expensive than other techniques and can be used at longer camera-to-object distances, at the tradeoff of reduced accuracy (Figure 6). TOF, on the other hand, is an active, higher-power sensor that generally offers more detail, but at higher cost and with a shorter operating range. Both approaches, along with structured light and other candidates, can be used for detection. But the optimum technology for a particular application can only be fully understood after prototyping (Figure 7).
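The "triangulation" behind stereo depth recovery reduces to a simple relation: depth is inversely proportional to disparity, Z = f * B / d, where f is the focal length in pixels, B is the baseline between the two cameras, and d is the disparity in pixels. A brief sketch (the calibration values below are placeholders for illustration):

    import numpy as np

    def disparity_to_depth(disparity_px, focal_px, baseline_m):
        # Z = f * B / d; zero disparity means "no match", so leave those pixels at 0.
        depth = np.zeros_like(disparity_px, dtype=np.float32)
        valid = disparity_px > 0
        depth[valid] = (focal_px * baseline_m) / disparity_px[valid]
        return depth

    disparity = np.random.randint(0, 64, size=(480, 640)).astype(np.float32)  # stand-in map
    depth_m = disparity_to_depth(disparity, focal_px=700.0, baseline_m=0.12)
    print("median depth of matched pixels:", np.median(depth_m[depth_m > 0]), "m")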




Figure 6. The stereo vision technique uses a pair of cameras, reminiscent of a human's left- (top) and right-eye perspectives (middle), to estimate the depths of various objects in a scene (bottom).


Figure 7. Although the depth map generated by a TOF (time-of-flight) 3-D sensor is more dense than its stereo vision-created disparity map counterpart, with virtually no coverage "holes" and therefore greater accuracy in the TOF case, stereo vision systems tend to be lower power, lower cost and usable over longer distances.

As new video compression standards such as H.265 become established, embedded vision surveillance systems will need to process even larger video formats (4k x 2k and beyond), which will compel designers to harness hardware processor combinations that may include some or all of the following: CPUs, multi-core DSPs, FPGAs, GPUs, and dedicated accelerators. Addressing often-contending embedded system complexity, cost, power, and performance requirements will likely lead to more distributed vision processing, whereby rich object and feature metadata extracted at the edge can be further processed, modeled, and shared “in the cloud.” And the prospect of more advanced compute engines will enable state-of-the-art vision algorithms, including optical flow and machine learning.

The Embedded Vision Alliance and Embedded Vision Summit

Embedded vision technology has the potential to enable a wide range of electronic products, such as the surveillance systems discussed in this article, that are more intelligent and responsive than before, and thus more valuable to users. It can add helpful features to existing products. And it can provide significant new markets for hardware, software and semiconductor manufacturers. The Embedded Vision Alliance, a worldwide organization of technology developers and providers, is working to empower engineers to transform this potential into reality. Texas Instruments, the co-author of this article, is a member of the Embedded Vision Alliance. For more information about the Embedded Vision Alliance, please see this Embedded.com article.

On Tuesday, May 12, 2015, in Santa Clara, California, the Alliance will hold its next Embedded Vision Summit. Embedded Vision Summits are technical educational forums for hardware and software product creators interested in incorporating visual intelligence into electronic systems. They provide how-to presentations, inspiring keynote talks, demonstrations, and opportunities to interact with technical experts from Alliance member companies. These events are intended to:

  • Inspire product creators' imaginations about the potential applications for practical computer vision technology through exciting presentations and demonstrations.
  • Offer practical know-how to help them incorporate vision capabilities into their products, and
  • Provide opportunities for engineers to meet and talk with leading vision technology companies and learn about their offerings.

Please visit the event page for more information on the Embedded Vision Summit.

Biographies:

Brian Dipert is Editor-In-Chief of the Embedded Vision Alliance. He is also a Senior Analyst at BDTI (Berkeley Design Technology, Inc.), and Editor-In-Chief of InsideDSP, the company's online newsletter dedicated to digital signal processing technology. He has a B.S. degree in Electrical Engineering from Purdue University in West Lafayette, IN. His professional career began at Magnavox Electronics Systems in Fort Wayne, IN; Brian subsequently spent eight years at Intel Corporation in Folsom, CA. He then spent 14 years (and five months) at EDN Magazine.

Jacob Jose is a Product Marketing Manager with Texas Instruments' IP Camera business. He joined Texas Instruments in 2001, has engineering and business expertise in the imaging, video and analytics markets, and has worked at locations in China, Taiwan, South Korea, Japan, India and the USA. He has a Bachelor's degree in computer science and engineering from the National Institute of Technology at Calicut, India and is currently enrolled in the executive MBA program at the Kellogg School of Business, Chicago, Illinois.

Darnell Moore, Ph.D., is a Senior Member of the Technical Staff with Texas Instruments’ Embedded Processing Systems Lab. As an expert in vision, video, imaging, and optimization, his body of work includes Smart Analytics, a suite of vision applications that spawned TI’s DMVA processor family, as well as advanced vision prototypes, such as TI’s first stereo IP surveillance camera. He received a BSEE from Northwestern University and a Ph.D. from the Georgia Institute of Technology.

Computer Vision Metrics: Chapter Two (Part E)



For Part D of Chapter Two, please click here.

Bibliography references are set off with brackets, i.e. "[XXX]". For the corresponding bibliography entries, please click here.


Gradient-Ascent-Based Super-Pixel Methods

Gradient ascent methods iteratively refine the super-pixel clusters to optimize the segmentation until convergence criteria are reached. These methods use a tree graph structure to associate pixels according to some criteria, which in this case may be the RGB values or Cartesian coordinates of the pixels; a distance function or other measure is then applied to create regions. Since these are iterative methods, the performance can be slow. A brief code sketch using reference implementations of two of these methods follows the list below.

  • Mean-Shift [266]: works by registering off of the region centroid, using a kernel-based mean smoothing approach to create regions of like pixels.
  • Quick-Shift [267]: similar to the mean-shift method, but instead of a mean blur kernel it uses a distance function calculated from the graph structure, based on RGB values and x,y pixel coordinates.
  • Watershed [268]: starts from local pixel value minima to find pixel value-based contour lines defining watersheds, or basin contours, inside which similar pixel values can be substituted to create a homogeneous pixel value region.
  • Turbopixels [269]: places small circular seed points in a uniform grid across the image and collects super-pixels around them into assigned regions; the super-pixel boundaries are then gradually expanded into the unassigned regions using a geometric flow method with controlled boundary expansion criteria, so as to gather pixels into regions with fairly smooth and uniform geometric shape and size.
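As referenced above, here is a brief sketch using scikit-image's reference implementations of two of these methods; the parameter values are arbitrary starting points, not recommendations:

    from scipy import ndimage as ndi
    from skimage import color, data, filters, segmentation

    img = data.astronaut()                    # sample RGB image bundled with scikit-image
    gray = color.rgb2gray(img)

    # Quick-Shift: distance function over RGB values and x,y coordinates.
    qs_labels = segmentation.quickshift(img, kernel_size=3, max_dist=6, ratio=0.5)

    # Watershed: flood basins of the gradient image, starting from low-gradient markers.
    gradient = filters.sobel(gray)
    markers, _ = ndi.label(gradient < 0.01)
    ws_labels = segmentation.watershed(gradient, markers)

    print("Quick-Shift regions:", int(qs_labels.max()) + 1)
    print("Watershed regions:", int(ws_labels.max()))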

Depth Segmentation

Depth information, such as a depth map as shown in Figure 2-20, is ideal for segmenting objects based on distance. Depth maps can be computed from a wide variety of depth sensors and methods, including a single camera, as discussed in Chapter 1. Depth cameras, such as the Microsoft Kinect camera, are becoming more common. A depth map is a 2D image or array, where each pixel value is the distance or Z value.
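Segmenting by distance can then be as simple as thresholding the Z values in the depth map; a minimal sketch (the depth array and the near/far limits below are placeholders):

    import numpy as np

    def segment_by_depth(depth_map, near_m, far_m):
        # Binary mask of pixels whose depth falls in [near_m, far_m];
        # a value of 0 is treated as "no depth measurement".
        valid = depth_map > 0
        return valid & (depth_map >= near_m) & (depth_map <= far_m)

    depth = np.random.uniform(0.0, 5.0, size=(480, 640)).astype(np.float32)  # stand-in map
    mask = segment_by_depth(depth, near_m=0.5, far_m=2.0)   # keep objects 0.5-2.0 m away
    print("foreground pixels:", int(mask.sum()))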


Figure 2-20. Depth images from Middlebury Data set: (Left) Original image. (Right) Corresponding depth image. Data courtesy of Daniel Scharstein and used by permission

...

Computer Vision Metrics: Chapter Two (Part D)



For Part C of Chapter Two, please click here.

Bibliography references are set off with brackets, i.e. "[XXX]". For the corresponding bibliography entries, please click here.


Transform Filtering, Fourier, and Others

This section deals with basis spaces and image transforms in the context of image filtering, the most common and widely used being the Fourier transform. A more comprehensive treatment of basis spaces and transforms in the context of feature description is provided in Chapter 3. A good reference for transform filtering in the context of image processing is provided by Pratt [9].

Why use transforms to switch domains? To make image pre-processing easier or more effective, or to perform feature description and matching more efficiently. In some cases, there is no better way to enhance an image or describe a feature than by transforming it to another domain—for example, removing noise and other structural artifacts as outlier frequency components of a Fourier spectrum, or compactly describing and encoding image features using Haar basis features.

Fourier Transform Family

The Fourier transform is very well known and covered in the standard reference by Bracewell [227], and it forms the basis for a family of related transforms. Several methods for computing the fast Fourier transform (FFT) are common in image and signal processing libraries. Fourier analysis has touched nearly every area of world affairs, through science, finance, medicine, and industry, and has been hailed as "the most important numerical algorithm of our lifetime" [290]. Here, we discuss the fundamentals of Fourier analysis, and a few branches of the Fourier transform family with image pre-processing applications.

The Fourier transform can be computed using optics, at the speed of light [516]. However, we are interested in methods applicable to digital computers.

Fundamentals

The basic idea of Fourier analysis [227,4,9] is to decompose periodic functions into a series of sine and cosine waves (Figure 2-14). The Fourier transform is bi-directional, mapping between a periodic wave and a corresponding series of harmonic basis functions in the frequency domain, where each basis function is a sine or cosine function spaced at whole harmonic multiples of the base frequency. The result of the forward FFT is a set of complex numbers, each composed of magnitude and phase data for a sine and cosine component in the series, also referred to as real and imaginary data.
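A brief NumPy sketch of the forward transform's magnitude and phase output, followed by the kind of frequency-domain noise removal mentioned earlier (the cutoff radius is an arbitrary choice):

    import numpy as np

    img = np.random.rand(256, 256).astype(np.float32)     # stand-in for a real image

    # Forward 2D FFT; each frequency bin is a complex number carrying
    # magnitude and phase (real and imaginary parts).
    F = np.fft.fftshift(np.fft.fft2(img))
    magnitude = np.abs(F)
    phase = np.angle(F)

    # Simple low-pass filter: treat high-frequency outliers as noise,
    # zero them out, and invert the transform.
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    radius = np.hypot(yy - h / 2, xx - w / 2)
    F_filtered = np.where(radius < 40, F, 0)
    img_filtered = np.real(np.fft.ifft2(np.fft.ifftshift(F_filtered)))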

...

Computer Vision Metrics: Chapter Two (Part C)



For Part B of Chapter Two, please click here.

Bibliography references are set off with brackets, i.e. "[XXX]". For the corresponding bibliography entries, please click here.


Practical Considerations for Color Enhancements

For image pre-processing, the color intensity is usually the only color information that should be enhanced, since the color intensity alone carries a lot of information and is commonly used. In addition, color processing cannot be easily done in RGB space while preserving relative color. For example, enhancing the RGB channels independently with a sharpen filter will lead to Moiré fringe artifacts when the RGB channels are recombined into a single rendering. So to sharpen the image, first forward-convert RGB to a color space such as HSV or YIQ, then sharpen the V or Y component, and then inverse-convert back to RGB. For example, to correct illumination in color, standard image processing methods such as LUT remap or histogram equalization will work, provided they are performed in the intensity space.

As a practical matter, for quick color conversions to gray scale from RGB, here are a few methods. (1) The G color channel is a good proxy for gray scale information, since, as shown in the sensor discussion in Chapter 1, the R and B wavelengths in the spectrum overlap heavily into the G wavelengths. (2) A simple conversion from RGB into gray scale intensity I can be done by taking I = (R + G + B) / 3. (3) The YIQ color space, used in the NTSC television broadcast standard, provides a simple forward/backward method of color conversion between RGB and a gray scale component Y.
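A brief sketch of methods (2) and (3); for (3), only the Y (luma) component is shown, using the standard NTSC luma weights, and the full YIQ forward/backward matrix (including the I and Q chroma terms) is omitted here:

    import numpy as np

    def gray_average(rgb):
        # Method (2): simple average, I = (R + G + B) / 3.
        return rgb.mean(axis=-1)

    def gray_luma(rgb):
        # Method (3), Y component only: standard NTSC luma weights
        # used by the YIQ color space (I and Q chroma terms omitted).
        return rgb @ np.array([0.299, 0.587, 0.114])

    rgb = np.random.rand(64, 64, 3).astype(np.float32)    # stand-in RGB image in [0, 1]
    print(gray_average(rgb).shape, gray_luma(rgb).shape)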

Color Accuracy and Precision

If color accuracy is important, 8 bits per RGB color channel may not be enough. It is necessary to study the image sensor vendor’s data sheets to understand how good the sensor really is. At the time of this writing, common image sensors are producing 10 to 14 bits of color information per RGB channel. Each color channel may have a different spectral response, as discussed in Chapter 1.

Typically, green is a good and fairly accurate color channel on most devices; red is usually good as well and may also have near infrared sensitivity if the IR filter is removed from the sensor; and blue is always a challenge since the blue wavelength can be hardest to capture in smaller silicon wells, which are close to the size of the blue wavelength, so the sensor...

Computer Vision Metrics: Chapter Two (Part B)



For Part A of Chapter Two, please click here.

Bibliography references are set off with brackets, i.e. "[XXX]". For the corresponding bibliography entries, please click here.


Polygon Shape Family Pre-Processing

Polygon shapes are potentially the most demanding features when considering image pre-processing steps, since as shown in Table 2-1, the range of potential pre-processing methods is quite large and the choice of methods to employ is very data-dependent. Possibly because of the challenges and intended use-cases for polygon shape measurements, they are used only in various niche applications, such as cell biology.

One of the most common methods employed for image preparation prior to polygon shape measurements is to physically correct the lighting and select the subject background. For example, in automated microscopy applications, slides containing cells are prepared with fluorescent dye to highlight features in the cells; then the illumination angle and position are carefully adjusted under magnification to provide a uniform background under each cell feature to be measured, and the resulting images are much easier to segment.

As illustrated in Figures 2-4 and 2-5, if the pre-processing is wrong, the resulting shape feature descriptors are not very useful. Here are some of the more salient options for pre-processing prior to shape-based feature extraction; we'll survey a range of other methods later in this chapter.


Figure 2-4. Use of thresholding to solve problems during image pre-processing to prepare images for polygon shape measurement: (Left) Original image. (Center) Thresholded red channel image. (Right) Perimeter tracing above a threshold


Figure 2-5. Another sequence of morphological pre-processing steps preceding polygon shape measurement: (Left) Original image. (Center) Range thresholded and dilated red color channel. (Right) Morphological perimeter shapes taken above a threshold
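As a rough illustration of the kind of sequence shown in Figures 2-4 and 2-5 (threshold a single color channel, dilate, then trace perimeters for later measurement), here is a minimal OpenCV sketch; the file name and threshold value are placeholders:

    import cv2

    img = cv2.imread("cells.png")                  # hypothetical input image (BGR order)
    red = img[:, :, 2]                             # red channel

    # Threshold the red channel, then dilate to close small gaps.
    _, binary = cv2.threshold(red, 128, 255, cv2.THRESH_BINARY)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    binary = cv2.dilate(binary, kernel, iterations=1)

    # Trace the perimeters of the thresholded shapes.
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        print("area:", cv2.contourArea(c), "perimeter:", cv2.arcLength(c, True))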

  1. Illumination corrections. Typically critical for defining the shape and outline of binary features. For example, if perimeter tracking or boundary segmentation is based on edges or thresholds, uneven illumination will cause problems, since...

Computer Vision Metrics: Chapter Two



Bibliography references are set off with brackets, i.e. "[XXX]". For the corresponding bibliography entries, please click here.


Image Pre-Processing

“I entered, and found Captain Nemo deep in algebraical calculations of x and other quantities.”
—Jules Verne, 20,000 Leagues Under The Sea

This chapter describes the methods used to prepare images for further analysis, including interest point and feature extraction. Some of these methods are also useful for global and local feature description, particularly the metrics derived from transforms and basis spaces. The focus is on image pre-processing for computer vision, so we do not cover the entire range of image processing topics applied to areas such as computational photography and photo enhancements; instead, we refer the interested reader to various other standard resources in digital image processing and signal processing as we go along [4,9,325,326], and we also point out interesting research papers that will enhance understanding of the topics. Readers with a strong background in image processing may benefit from a light reading of this chapter.

Perspectives on Image Processing

Image processing is a vast field that cannot be covered in a single chapter. So why do we discuss image pre-processing in a book about computer vision? The reason is to advance the science of local and global feature description, as image pre-processing is typically ignored in discussions of feature description. Some general image processing topics are covered here in light of feature description, intended to illustrate rather than to prescribe, since applications and image data will guide the image pre-processing stage.

Some will argue that image pre-processing is not a good idea, since it distorts or changes the true nature of the raw data. However, intelligent use of image pre-processing can provide benefits and solve problems that ultimately lead to better local and global feature detection. We survey common methods for image enhancements and corrections that will affect feature analysis downstream in the vision pipeline in both favorable and unfavorable ways, depending on how the methods are employed.

Image pre-processing may have dramatic positive effects on the quality of feature extraction and the results of image analysis. Image pre-processing is analogous to the mathematical normalization of a data set, which is a common step in many feature descriptor methods. Or to make a musical analogy, think of image pre-processing as a sound system with a range of controls, such as raw sound with no volume controls; volume control with a simple tone knob; volume control plus treble, bass, and mid; or volume control plus...