Embedded Vision Alliance: Technical Articles

Visual Intelligence Gives Robotic Systems Spatial Sense

This article is an expanded version of one originally published at EE Times' Embedded.com Design Line. It is reprinted here with the permission of EE Times.

In order for robots to meaningfully interact with objects around them as well as move about their environments, they must be able to see and discern their surroundings. Cost-effective and capable vision processors, fed data by depth-discerning image sensors and running robust software algorithms, are transforming longstanding autonomous and adaptive robot aspirations into realities.

By Brian Dipert
Editor-in-Chief
Embedded Vision Alliance

and

Yves Legrand
Automation and Robotics Global Market Director, Vertical Solutions Marketing
Freescale Semiconductor

and

Bruce Tannenbaum
Principal Product Marketing Manager
MathWorks

Robots, as long portrayed both in science fiction and shipping-product documentation, promise to free human beings from dull, monotonous and otherwise undesirable tasks, as well as to improve the quality of those tasks' outcomes through high speed and high precision. Consider, for example, the initial wave of autonomous consumer robotics systems that tackle vacuuming, carpet scrubbing and even gutter-cleaning chores. Or consider the ever-increasing prevalence of robots in a diversity of manufacturing line environments (Figure 1).

Figure 1. Autonomous consumer-tailored products (top) and industrial manufacturing systems (bottom) are among the many classes of robots that can be functionally enhanced by the inclusion of vision processing capabilities.

First-generation autonomous consumer robots, however, employ relatively crude schemes for learning about and navigating their surroundings. These elementary techniques include human-erected barriers composed of infrared transmitters, which coordinate with infrared sensors built into the robot to prevent it from tumbling down a set of stairs or wandering into another room. A built-in shock sensor can inform the autonomous robot that it's collided with an immovable object and shouldn't attempt to continue forward or, in more advanced mapping-capable designs, even bother revisiting this particular location. And while manufacturing robots may work more tirelessly, faster, and more precisely than their human counterparts, their success is predicated on incoming parts arriving in fixed orientations and locations, which increases the complexity of the manufacturing process. Any deviation in part position and/or orientation will result in assembly failures.

Humans use their eyes (along with other senses) and brains to discern and navigate through the world around them. Conceptually, robotic systems should be able to do the same thing, leveraging camera assemblies, vision processors, and various software algorithms. Historically, such technology has typically only been found in a short list of complex, expensive systems. However, cost, performance and power consumption advances in digital integrated circuits are now paving the way for the proliferation of vision into diverse and high-volume applications, including an ever-expanding list of robot implementations. Challenges remain, but they're more easily, rapidly, and cost-effectively solved than has ever before been possible.

Software Techniques

Developing robotic systems capable of visually adapting to their environments requires the use of computer vision algorithms that can convert the data from image sensors into actionable information about the environment. Two common tasks for robots are identifying external objects and their orientations, and determining the robot’s location and orientation. Many robots are designed to interact with one or more specific objects. For situation-adaptive robots, it's necessary to be able to detect these objects when they are in unknown locations and orientations, as well as to comprehend that these objects might be moving.

Cameras produce millions of pixels of data per second, a payload that creates a heavy processing burden. One common way to resolve this challenge is to instead detect multi-pixel features, such as corners, blobs, edges, or lines, in each frame of video data (Figure 2). Such a pixel-to-feature transformation can lower the data processing requirement in this particular stage of the vision processing pipeline by a factor of a thousand or more; millions of pixels reduce to hundreds of features that a robot can then productively use to identify objects and determine their spatial characteristics (Figure 3).

Figure 2. Four primary stages are involved in fully processing the raw output of a 2D or 3D sensor for robotic vision, with each stage exhibiting unique characteristics and constraints in terms of its processing requirements.

Figure 3. Common feature detection algorithm candidates include the MSER (Maximally Stable Extremal Regions) method (top), the SURF (Speeded Up Robust Features) algorithm (middle), and the Shi-Tomasi technique for detecting corners (bottom) (courtesy MIT).
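
To make the pixel-to-feature reduction concrete, the following minimal Python sketch (using OpenCV, with "frame.png" as a placeholder input image) applies the Shi-Tomasi corner detector mentioned in Figure 3; the specific parameter values are illustrative assumptions rather than recommendations.

    # Minimal sketch: reduce a full frame of pixels to a few hundred corner features.
    # Requires OpenCV (cv2); "frame.png" is a placeholder image path.
    import cv2

    frame = cv2.imread("frame.png")                      # millions of pixels in
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Shi-Tomasi corner detection (one of the detectors illustrated in Figure 3)
    corners = cv2.goodFeaturesToTrack(gray,
                                      maxCorners=500,    # cap the feature count
                                      qualityLevel=0.01,
                                      minDistance=10)

    print(f"{gray.size} pixels reduced to {len(corners)} corner features")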

Detecting objects via features first involves gathering a large number of features, taken from numerous already-captured images of each specified object at various angles and orientations. This database of features can then be used to train a machine learning algorithm, also known as a classifier, to accurately detect and identify new objects. Sometimes this training occurs on the robot; other times, due to the high level of computation required, the training occurs off-line. This computational complexity, coupled with the large amount of training data needed, is a drawback of machine learning-based approaches. One of the best-known object detection algorithms is the Viola-Jones framework, which uses Haar-like features and a cascade of AdaBoost classifiers. This algorithm is particularly good at identifying faces, and can also be trained to identify other common objects.
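
As a hedged illustration of the Viola-Jones approach, the short Python/OpenCV sketch below runs the library's pre-trained frontal-face Haar cascade; "photo.png" is a placeholder file name, and the scale and neighbor parameters are typical starting points rather than tuned values.

    # Viola-Jones-style detection using OpenCV's bundled, pre-trained Haar cascade.
    # "photo.png" is a placeholder image path.
    import cv2

    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    img = cv2.imread("photo.png")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # The cascade slides over the image at multiple scales, rejecting non-faces early
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)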

To determine object orientation via features requires an algorithm such as the statistics-based RANSAC (Random Sample Consensus). This algorithm uses a subset of features to model a potential object orientation, and then determines how many other features fit this model. The model with the largest number of matching features corresponds to the correctly recognized object orientation. To detect moving objects, you can combine feature identification with tracking algorithms. Once a set of features has been used to correctly identify an object, algorithms such as KLT (Kanade-Lucas-Tomasi) or Kalman filtering can track the movement of these features between video frames. Such techniques are robust in spite of changes in orientation and occlusion, because they only need to track a subset of the original features in order to be successful.
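
The sketch below combines these two ideas in Python/OpenCV: KLT (pyramidal Lucas-Kanade) tracking of previously detected features between two frames, followed by a RANSAC model fit over the surviving matches. The file names, feature counts and thresholds are assumptions for illustration.

    # KLT tracking plus a RANSAC model fit; "frame0.png"/"frame1.png" are placeholders.
    import cv2

    prev = cv2.cvtColor(cv2.imread("frame0.png"), cv2.COLOR_BGR2GRAY)
    curr = cv2.cvtColor(cv2.imread("frame1.png"), cv2.COLOR_BGR2GRAY)

    prev_pts = cv2.goodFeaturesToTrack(prev, maxCorners=300,
                                       qualityLevel=0.01, minDistance=10)

    # KLT: pyramidal Lucas-Kanade optical flow follows each feature into the new frame
    next_pts, status, err = cv2.calcOpticalFlowPyrLK(prev, curr, prev_pts, None)
    mask = status.flatten() == 1
    prev_good, next_good = prev_pts[mask], next_pts[mask]

    # RANSAC fits a geometric model (here a homography) while rejecting outlier matches
    H, inliers = cv2.findHomography(prev_good, next_good, cv2.RANSAC, 3.0)
    print(f"tracked {len(next_good)} features, {int(inliers.sum())} RANSAC inliers")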

The algorithms already described may be sufficient for stationary robots. For robots on the move, however, you will also need to employ additional algorithms so that they can safely move within their surroundings. SLAM (Simultaneous Localization and Mapping) is one category of algorithms that enables a robot to build a map of its environment and keep track of its current location. Such algorithms require methods for mapping the environment in three dimensions. Many depth-sensing sensor options exist; one common approach is to use a pair of 2D cameras configured as a "stereo" camera, acting similarly to the human visual system.

Stereo cameras rely on epipolar geometry to derive a 3D location for each point in a scene, using projections from a pair of 2D images. As previously discussed from a 2D standpoint, features can also be used to detect useful locations within the 3D scene. For example, it is much easier for a robot to reliably detect the location of a corner of a table than the flat surface of a wall. At any given location and orientation, the robot can detect features that it can then compare to its internal map in order to locate itself and/or improve the map's quality. Given that objects can and often do move, a static map is of limited use to a robot attempting to adapt to its environment; the map must be continually updated.
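
As a minimal sketch of the stereo idea (assuming a calibrated, rectified camera pair and placeholder file names "left.png" and "right.png"), OpenCV's block matcher can convert the per-pixel disparity between the two views into depth; the focal length and baseline values below are illustrative assumptions.

    # Stereo block matching: disparity between rectified left/right views yields depth.
    import cv2

    left  = cv2.imread("left.png",  cv2.IMREAD_GRAYSCALE)
    right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

    stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
    disparity = stereo.compute(left, right).astype(float) / 16.0   # fixed-point to pixels

    # With focal length f (pixels) and baseline B (meters), depth Z = f * B / disparity.
    # f and B are assumed values; disparities <= 0 mark invalid matches.
    f, B = 700.0, 0.06
    depth = (f * B) / (disparity + 1e-6)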

Processing Options

As we begin to consider how to create an efficient implementation of robot vision, it can be useful to divide the required processing steps into stages. Specifically, the processing encompassed by the previously discussed algorithms can be divided into four stages, with each stage exhibiting unique characteristics and constraints in terms of its processing requirements (Reference 1). A wide variety of vision processor types exist, and different types may be better suited for each algorithm processing stage in terms of performance, power consumption, cost, function flexibility, and other factors. A vision processor chip may, in fact, integrate multiple different types of processor cores to address multiple processing stages' unique needs (Figure 4).

Figure 4. A vision processor may integrate multiple types of cores to address multiple processing stages' unique needs.

The first processing stage encompasses algorithms that handle various sensor data pre-processing functions, such as:

  • Resizing
  • Color space conversion
  • Image rotation and inversion
  • De-interlacing
  • Color adjustment and gamut mapping
  • Gamma correction, and
  • Contrast enhancement

Each pixel in each frame is processed in this stage, so the number of operations per second is tremendous. And in the case of stereo image processing, the two image planes must be simultaneously processed. One processing option for these kinds of operations is a dedicated hardware block, sometimes referred to as an IPU (Image Processing Unit). Recently introduced vision processors containing IPUs are able to handle two simultaneous image planes, each with 2048x1536 pixel (3+ million pixel) resolution, at robust frame rates.
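
A back-of-the-envelope calculation, using an assumed 30 frames per second and an assumed per-pixel operation count, illustrates the scale of this first stage:

    # Rough throughput estimate for dual 2048x1536 streams
    # (frame rate and ops/pixel are assumptions for illustration).
    width, height, planes, fps = 2048, 1536, 2, 30
    pixels_per_second = width * height * planes * fps        # ~189 million pixels/s

    ops_per_pixel = 10
    print(f"{pixels_per_second / 1e6:.0f} Mpixel/s, "
          f"{pixels_per_second * ops_per_pixel / 1e9:.1f} GOPS at {ops_per_pixel} ops/pixel")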

The second processing stage handles feature detection, where (as previously discussed) corners, edges and other significant image regions are extracted. This processing step still works on a pixel-by-pixel basis, so it is also well suited for highly parallel architectures, in this case ones capable of handling more complex mathematical functions such as first- and second-order derivatives. DSPs (digital signal processors), FPGAs (field programmable gate arrays), GPUs (graphics processing units), IPUs and APUs (array processor units) are all common processing options. DSPs and FPGAs are highly flexible, and therefore particularly appealing when applications (and the algorithms used to implement them) are immature and evolving. This flexibility, however, can come with power consumption, performance and cost tradeoffs versus alternative approaches.

On the other end of the flexibility-versus-focus spectrum is the dedicated-function IPU or APU, developed specifically for vision processing tasks. It can process several dozen billion operations per second but, by being application-optimized, it is not a candidate for more widespread function leverage. An intermediate step between the flexibility-versus-function optimization spectrum extremes is the GPU, historically found in computers but now also embedded within application processors used in smartphones, tablets and other high-volume applications. Floating-point calculations such as the least squares function in optical flow algorithms, descriptor calculations in SURF (the Speeded Up Robust Features algorithm used for fast significant point detection), and point cloud processing are well suited for highly parallel GPU architectures. Such algorithms can alternatively run on SIMD (single-instruction multiple-data) vector processing engines such as ARM's NEON or the AltiVec function block found within Power architecture CPUs.

In the third image processing stage, the system detects and classifies objects based on feature maps. In contrast to the pixel-based processing of previous stages, these object detection algorithms are highly non-linear in structure and in the ways they access data. However, strong processing "muscle" is still required in order to evaluate many different features with a rich classification database. Such requirements are ideal for single- and multi-core conventional processors, such as ARM- and Power Architecture-based RISC devices. And this selection criterion is equally applicable for the fourth image processing stage, which tracks detected objects across multiple frames, implements a model of the environment, and assesses whether various situations should trigger actions.

Development environments, frameworks and libraries such as OpenCL (the Open Computing Language), OpenCV (the Open Source Computer Vision Library) and MATLAB can simplify and speed software testing and development, enabling you to evaluate sections of your algorithms on different processing options, and potentially also including the ability to allocate portions of a task across multiple processing cores. Given the data-intensive nature of vision processing, when evaluating processors, you should appraise not only the number of cores and the per-core speed but also each processor's data handling capabilities, such as its external memory bus bandwidth.
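
As one small, hedged example of this kind of evaluation, OpenCV's "transparent API" lets the same operation be run on the CPU and, where a suitable device and driver exist, dispatched to OpenCL simply by wrapping the image in a cv2.UMat; "frame.png" is a placeholder input.

    # Comparing a CPU path against an OpenCL-dispatched path via OpenCV's transparent API.
    import cv2

    gray = cv2.cvtColor(cv2.imread("frame.png"), cv2.COLOR_BGR2GRAY)

    # CPU path
    edges_cpu = cv2.Canny(gray, 50, 150)

    # OpenCL path: cv2.UMat inputs let OpenCV dispatch to an OpenCL device if present,
    # silently falling back to the CPU otherwise
    cv2.ocl.setUseOpenCL(True)
    edges_ocl = cv2.Canny(cv2.UMat(gray), 50, 150).get()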

Industry Alliance Assistance

With the emergence of increasingly capable processors, image sensors, memories, and other semiconductor devices, along with robust algorithms, it's becoming practical to incorporate computer vision capabilities into a wide range of embedded systems. By "embedded system," we're referring to any microprocessor-based system that isn’t a general-purpose computer. Embedded vision, therefore, refers to the implementation of computer vision technology in embedded systems, mobile devices, special-purpose PCs, and the cloud.

Embedded vision technology has the potential to enable a wide range of electronic products (such as the robotic systems discussed in this article) that are more intelligent and responsive than before, and thus more valuable to users. It can add helpful features to existing products. And it can provide significant new markets for hardware, software and semiconductor manufacturers. The Embedded Vision Alliance, a worldwide organization of technology developers and providers, is working to empower engineers to transform this potential into reality.

Freescale and MathWorks, the co-authors of this article, are members of the Embedded Vision Alliance. First and foremost, the Alliance's mission is to provide engineers with practical education, information, and insights to help them incorporate embedded vision capabilities into new and existing products. To execute this mission, the Alliance has developed a website (www.Embedded-Vision.com) providing tutorial articles, videos, code downloads and a discussion forum staffed by a diversity of technology experts. Registered website users can also receive the Alliance’s twice-monthly email newsletter (www.embeddedvisioninsights.com), among other benefits.

Also consider attending the Alliance's upcoming Embedded Vision Summit, a free day-long technical educational forum to be held on October 2, 2013 in the Boston, Massachusetts area and intended for engineers interested in incorporating visual intelligence into electronic systems and software. The event agenda includes how-to presentations, seminars, demonstrations, and opportunities to interact with Alliance member companies. The keynote presenter will be Mario Munich, Vice President of Advanced Development at iRobot. Munich's previous company, Evolution Robotics (acquired by iRobot) developed the Mint, a second-generation consumer robot with vision processing capabilities. For more information on the Embedded Vision Summit, including an online registration application form, please visit www.embeddedvisionsummit.com.

Transforming a robotics vision processing idea into a shipping product entails careful discernment and compromise. The Embedded Vision Alliance catalyzes conversations in a forum where tradeoffs can be rapidly understood and resolved, and where the effort to productize advanced robotic systems can therefore be accelerated, enabling system developers to effectively harness various vision technologies. For more information on the Embedded Vision Alliance, including membership details, please visit www.Embedded-Vision.com, email info@Embedded-Vision.com or call 925-954-1411.

References:

  1. Michael Staudenmaier and Holger Gryska (Freescale Halbleiter GmbH), "Embedded Low Power Vision Computing Platform for Automotive," Embedded World Conference, Nuremberg, 2013.

Biographies

Brian Dipert is Editor-In-Chief of the Embedded Vision Alliance. He is also a Senior Analyst at BDTI (Berkeley Design Technology, Inc.), which provides analysis, advice, and engineering for embedded processing technology and applications, and Editor-In-Chief of InsideDSP, the company's online newsletter dedicated to digital signal processing technology. Brian has a B.S. degree in Electrical Engineering from Purdue University in West Lafayette, IN. His professional career began at Magnavox Electronics Systems in Fort Wayne, IN; Brian subsequently spent eight years at Intel Corporation in Folsom, CA. He then spent 14 years at EDN Magazine.

Yves Legrand is the global vertical marketing director for Industrial Automation and Robotics at Freescale Semiconductor. He is based in France and has spent his professional career between Toulouse and the USA where he worked for Motorola Semiconductor and Freescale in Phoenix and Chicago. His marketing expertise ranges from wireless and consumer semiconductor markets and applications to wireless charging and industrial automation systems. He has a Masters degree in Electrical Engineering from Grenoble INPG in France, as well as a Masters degree in Industrial and System Engineering from San Jose State University, CA.

Bruce Tannenbaum leads the technical marketing team at MathWorks for image processing and computer vision applications. Earlier in his career, he was a product manager at imaging-related semiconductor companies such as SoundVision and Pixel Magic, and developed computer vision and wavelet-based image compression algorithms as an engineer at Sarnoff Corporation (SRI). He holds a BSEE degree from Penn State University and an MSEE degree from the University of Michigan.

Next Generation: Robots That See

By Carlton Heard
Vision Hardware and Software Product Manager
National Instruments

This article was originally published at Control Engineering. It is reprinted here with the permission of CFE Media.

Visual servo control: vision used for robot or machine guidance can also be used for in-line part inspection, enhancing product quality alongside traditional feedback systems.

The Embedded Vision Summit East: Westford, MA -- Wed, Oct 2, 2013

On Wednesday, October 2, the Embedded Vision Alliance will return to the Boston, Massachusetts region for the third iteration (and second annual East Coast edition) of the Embedded Vision Summit. This free day-long technical educational forum will take place at the Regency Inn and Conference Center in Westford, Massachusetts.

Embedded Vision on Mobile Devices: Opportunities and Challenges

by Tom Wilson
CogniVue

Brian Dipert
Embedded Vision Alliance

This article was originally published at Electronic Engineering Journal. It is reprinted here with the permission of TechFocus Media.

Courtesy of service provider subsidies coupled with high shipment volumes, relatively inexpensive smartphones and tablets supply formidable processing capabilities: multi-core GHz-plus CPUs and graphics processors, on-chip DSPs and imaging coprocessors, and multiple gigabytes of memory. Plus, they integrate front- and rear-viewing cameras capable of capturing high-resolution still images and HD video clips. Harnessing this hardware potential, developers are leveraging these same cameras, primarily intended for still and video photography and videoconferencing purposes, to also create diverse embedded vision applications. Implementation issues must be sufficiently comprehended, however, for this potential to translate into compelling reality.

Introduction

Wikipedia defines the term "computer vision" as (Reference 1):

A field that includes methods for acquiring, processing, analyzing, and understanding images...from the real world in order to produce numerical or symbolic information...A theme in the development of this field has been to duplicate the abilities of human vision by electronically perceiving and understanding an image.

As the name implies, this image perception, understanding and decision-making process has historically been achievable only using large, heavy, expensive, and power-draining computers, restricting its usage to a short list of applications such as factory automation and military systems. Beyond these few success stories, computer vision has mainly been a field of academic research over the past several decades.

Today, however, a major transformation is underway. With the emergence of increasingly capable (i.e., powerful, low-cost, and energy-efficient) processors, image sensors, memories, and other semiconductor devices, along with robust algorithms, it's becoming practical to incorporate computer vision capabilities into a wide range of embedded systems. By "embedded system," we're referring to any microprocessor-based system that isn’t a general-purpose computer. Embedded vision, therefore, refers to the implementation of computer vision technology in embedded systems, mobile devices, special-purpose PCs, and the cloud.

Similar to the way that wireless communication technology has become pervasive over the past 10 years, embedded vision technology is poised to be widely deployed in the next 10 years. High-speed wireless connectivity began as a costly niche technology; advances in digital integrated circuits were critical in enabling it to evolve from exotic to mainstream. When chips got fast enough, inexpensive enough, and energy efficient enough, high-speed wireless became a mass-market technology.

Embedded Vision Goes Mobile

Advances in digital chips are now likewise paving the way for the proliferation of embedded vision into high-volume applications. Odds are high, for example, that the cellular handset in your pocket and the tablet computer in your satchel contain at least one rear-mounted image sensor for photography (perhaps two for 3D image capture capabilities) and/or a front-mounted camera for video chat (Figure 1). Embedded vision opportunities in mobile electronics devices include functions such as gesture recognition, face detection and recognition, video tagging, and natural feature tracking for augmented reality. These and other capabilities can be grouped under the broad term "mobile vision."

Figure 1. Google's Nexus 4 smartphone and Nexus 10 tablet, along with the Apple-developed iPhone 5 and iPad 4 counterparts, are mainstream examples of the robust hardware potential that exists in modern mobile electronics devices (a and b courtesy of Google; c and d courtesy of Apple).

ABI Research forecasts, for example, that approximately 600 million smartphones will implement vision-based gesture recognition by 2017 (Reference 2). This estimate encompasses roughly 40 percent of the 1.5 billion smartphones that some researchers expect will ship that year (Figure 2). And the gesture recognition opportunity extends beyond smartphones to portable media players, portable game players and in particular, media tablets. According to ABI Research, “It is projected that a higher percentage of media tablets will have the technology than smartphones,” and IDC Research estimates that 350 million tablets will ship in 2017 (Reference 3). These forecasts combine to create a compelling market opportunity for mobile vision, even if only gesture recognition is considered.


Figure 2. Gesture interfaces can notably enhance the utility of mobile technology...and not just when the phone rings while you're busy in the kitchen (courtesy of eyeSight).

Face recognition also promises to be a commercially important vision processing function for smartphones and tablets (Figure 3). Some applications are fairly obvious, such as device security (acting as an adjunct or replacement for traditional unlock codes). But plenty of other more subtle, but equally or more lucrative, uses for face recognition also exist, such as discerning emotional responses to advertising. Consider, for example, the $12 million in funding that Affectiva, a developer of facial response software, recently received (Reference 4). This investment comes on the heels of a deal between Affectiva and Ebuzzing, the leading global platform for social video advertising (Reference 5). A consumer who encounters an Affectiva-enabled ad is given the opportunity to activate the device's webcams to measure his or her response to the ad and assess how it compares against the reactions of others. Face response analysis currently takes place predominantly on a "cloud" server, but in the future vision processing may increasingly occur directly on the mobile platform, as advertising becomes more pervasive on smartphones and tablets. Such an approach will also enable more rapid responses, along with offline usage.


Figure 3. Robust face recognition algorithms ensure that you (and only you) can operate your device, in spite of your child's attempts to mimic your mustache with his finger (courtesy of Google).

Mobile advertising spending is potentially a big "driver" for building embedded vision applications into mobile platforms. Such spending is expected to soar from about $9.6 billion in 2012 to more than $24 billion in 2016, according to Gartner Research (Reference 6). Augmented reality will likely play a role in delivering location-based advertising, along with enabling other applications (Figure 4). Juniper Research has recently found that brands and retailers are already deploying augmented reality applications and marketing materials, which are expected to generate close to $300 million in global revenue in 2013 (Reference 7). The high degree of retailer enthusiasm for augmented reality suggested to Juniper Research that advertising spending had increased significantly in 2012 and was positioned for continued growth in the future. Juniper found that augmented reality is an important method for increasing consumer engagement because, among other reasons, it provides additional product information. However, the Juniper report also highlighted significant technical challenges for robust augmented reality implementations, most notably problems linked to the vision-processing performance of the mobile platform.


Figure 4. Augmented reality can tangibly enhance the likelihood of purchasing a product, the experience of using the product, the impact of an advertisement, or the general information-richness of the world around us (courtesy of Qualcomm).

Embedded Vision is Compute-Intensive (and Increasingly So)

Embedded vision processing is notably demanding, for a variety of reasons. First, simply put, there are massive amounts of data to consider. Frame resolutions higher than 1 Mpixel need to be streamed uncompressed at 30 to 60 frames per second, and every pixel of each frame requires processing attention. Typical vision processing algorithms for classification or tracking, for example, also combine a mix of vector and scalar processing steps, which demand intensive external memory access patterns with conventional processing architectures. Shuttling intermediate processing results back and forth between the CPU and external memory drives up power consumption and increases overall processing latency.
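
A rough estimate of the external memory traffic involved (all values are illustrative assumptions) shows why this shuttling matters for both power consumption and latency:

    # Illustrative DRAM traffic for intermediate buffers in a multi-stage pipeline.
    width, height, fps = 1280, 720, 30
    bytes_per_pixel = 1          # assumed 8-bit grayscale intermediate buffers
    stages = 4                   # assumed pipeline stages, each reading and writing a frame

    traffic = width * height * bytes_per_pixel * fps * stages * 2   # read + write per stage
    print(f"~{traffic / 1e6:.0f} MB/s of external memory traffic for one 720p30 stream")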

In addition, a robust vision-processing algorithm for object classification, for example, must achieve high levels of performance and reliability. A careful algorithm design balance must be achieved between sensitivity (reducing the miss rate) and reliability (reducing "false positives").  In a classification algorithm, this requirement necessitates adding more classification stages to reduce errors. In addition, it may compel more image frame pre-processing to reduce noise, or perhaps a normalization step to remove irregularities in lighting or object orientation. As these vision-based functions gain in acceptance and adoption, user tolerance for unreliable implementations will decrease. Gesture recognition, face recognition and other vision-based functions will also need to become more sophisticated and robust to operate reliably in a broad range of environments. These demands translate into more pixels, more complex processing, and therefore an even tougher challenge for conventional processing architectures.

On the processing side, it might seem at first glance that plenty of processing capacity exists in modern mobile application processor SoCs, since they contain dual- or quad-core CPUs running at 1 GHz and above. But CPUs are not optimized for data-intensive, parallelizable tasks, and—critically for mobile devices—CPUs are not the most energy-efficient processors for executing vision tasks. Today’s mobile application processors incorporate a range of specialized coprocessors for tasks such as video, 3D graphics, audio, and image enhancement. These coprocessors enable smartphones and tablets to deliver impressive multimedia capabilities with long battery life. While some of these same coprocessors can be pressed into service for vision tasks, none of them have been explicitly designed for vision tasks.  As the demand for vision processing increases on mobile processors, specialized vision coprocessors will most likely be added to the ranks of the existing coprocessors, enabling high performance with improved energy efficiency.

Embedded Vision will be Multi-Function

Functions such as gesture recognition, face recognition and augmented reality are just some of the new vision-based methods of interfacing mobile users with their devices, digital worlds and real-time environments. As previously discussed, individual vision processing tasks can challenge today’s mobile application processors. Imagine, therefore, the incremental burden of multiple vision functions running concurrently. For example, the gesture interface employed for simple smartphone control functions may be entirely different than the gestures used for a mobile game. Next, consider that this mobile game might also employ augmented reality. And finally, it's not too much of a "stretch" to imagine that such a game might also use face recognition to distinguish between your "friends" and "enemies." You can see that the processing burden builds as vision functions are combined and used concurrently in gaming, natural feature tracking, user response tracking, and other application scenarios.

For these reasons, expanding on points made earlier, it's even more likely that future application processors may supplement conventional CPU, GPU and DSP cores with specialized cores specifically intended for vision processing. Several notable examples of this trend already exist: CogniVue's APEX ICP (Image Cognition Processor); CEVA's MM3101 imaging and vision core; Tensilica's IVP (Imaging and Video Processing) core; and the PVP (Pipeline Vision Processor) built into several of Analog Devices' latest Blackfin SoCs. Just as today's application processors include CPU cores for applications, GPU cores for graphics, and DSP cores for baseband and general multimedia processing, you should expect the integration of "image cognition processing" cores in the future for mobile vision functions.

Embedded Vision Must Consume Scant Power

The available battery power in mobile devices has not improved significantly in recent years, as evolution of the materials used to construct batteries has stalled. Image sensors consume significant current if running at high frame rates for vision applications, versus their originally intended uses with still images and short videos. Add to this the computationally intense processing of vision algorithms, and batteries may drain quickly. For this reason, many vision applications are currently intended for use over short time durations rather than as always-on features. To enable extended application operation, the mobile electronics industry will need to rely on continued hardware and software evolution, potentially aided by more fundamental architectural revolutions.

An Industry Alliance Accelerates Mobile Vision Understanding, Implementation, Adoption, and Evolution

Embedded vision technology has the potential to enable a wide range of electronic products (such as the mobile devices discussed in this article) that are more intelligent and responsive than before, and thus more valuable to users. It can add helpful features to existing products. And it can provide significant new markets for hardware, software and semiconductor manufacturers. The Embedded Vision Alliance, a worldwide organization of technology developers and providers, is working to empower engineers to transform this potential into reality.

CogniVue, the co-author of this article, is a member of the Embedded Vision Alliance, as are Analog Devices, BDTI, CEVA, eyeSight, Qualcomm and Tensilica, also mentioned in the article. First and foremost, the Alliance's mission is to provide engineers with practical education, information, and insights to help them incorporate embedded vision capabilities into new and existing products. To execute this mission, the Alliance has developed a website (www.Embedded-Vision.com) providing tutorial articles, videos, code downloads and a discussion forum staffed by a diversity of technology experts. Registered website users can also receive the Alliance’s twice-monthly email newsletter (www.embeddedvisioninsights.com), among other benefits.

Transforming a mobile vision experience into a product ready for shipping entails compromises touched on in this article—in cost, performance, and accuracy, to name a few. The Embedded Vision Alliance catalyzes conversations on these issues in a forum where such tradeoffs can be rapidly understood and resolved, and where the effort to productize mobile vision can therefore be accelerated, enabling system developers to effectively harness various mobile vision technologies.

For more information on the Embedded Vision Alliance, including membership details, please visit www.Embedded-Vision.com, email info@Embedded-Vision.com or call 925-954-1411. Please also consider attending the Alliance's upcoming Embedded Vision Summit, a free day-long technical educational forum to be held on October 2nd in the Boston, Massachusetts area and intended for engineers interested in incorporating visual intelligence into electronic systems and software. The event agenda includes how-to presentations, seminars, demonstrations, and opportunities to interact with Alliance member companies. For more information on the Embedded Vision Summit, including an online registration application form, please visit www.embeddedvisionsummit.com.

References:

  1. http://en.wikipedia.org/wiki/Computer_vision
  2. Flood, Joshua, "Gesture Recognition Enabled Mobile Devices," ABI Research, 4 December 2012
  3. IDC Worldwide Quarterly Tablet Tracker, March 2013
  4. http://www.affectiva.com/news-article/affectiva-raises-12-million-to-extend-emotion-insight-technology-to-online-video-and-consumer-devices/
  5. http://www.techcrunch.com/2013/02/18/affectiva-inks-deal-with-ebuzzing-social-to-integrate-face-tracking-and-emotional-response-into-online-video-ad-analytics
  6. http://www.gartner.com/newsroom/id/2306215
  7. https://www.juniperresearch.com/viewpressrelease.php?pr=348

About the Authors:

Tom Wilson is Vice President of Business Development at CogniVue Corporation, with more than 20 years of experience in markets such as consumer, automotive, and telecommunications. He has held leadership roles in engineering, sales and product management, and holds a Bachelor of Science degree and a PhD from Carleton University, Ottawa, Canada.

Brian Dipert is Editor-In-Chief of the Embedded Vision Alliance. He is also a Senior Analyst at BDTI (Berkeley Design Technology, Inc.), and Editor-In-Chief of InsideDSP, the company's online newsletter dedicated to digital signal processing technology. He has a B.S. degree in Electrical Engineering from Purdue University in West Lafayette, IN. His professional career began at Magnavox Electronics Systems in Fort Wayne, IN; Brian subsequently spent eight years at Intel Corporation in Folsom, CA. He then spent 14 years (and five months) at EDN Magazine.

Developing a Quality Inspection Method for Selective Laser Melting of Metals with NI Hardware and Software

By building our system with the NI FlexRIO FPGA adapter module incorporated in an NI PXI system with LabVIEW, we can more effectively monitor and control the quality of the SLM process.
- Tom Craeghs, Catholic University of Leuven

The Challenge

Controlling and monitoring the selective laser melting (SLM) process in real time to more accurately detect errors and maintain quality control.

The Solution

Developing a 3D Optical Surface Profilometer Using LabVIEW and NI Vision Development Module

The device performs to a high standard with a total build cost of less than €5,000 and a development time of less than three months. This is largely due to the ease of hardware integration provided by LabVIEW and also the advanced and high-speed data processing capabilities that LabVIEW and NI vision provide.
- David Moore, Dublin City University

The Challenge

Designing and building a 3D optical surface profilometer capable of surface visualisation and roughness analysis to use in laser-processed surface characterization.

Using LabVIEW and the NI Vision Development Module to Ensure High-Quality HMI Products

We designed automated vision interpretation (AVI) software using LabVIEW and the NI Vision Development Module, which presented the most efficient and powerful capabilities for developing our automated vision interpretation concept and shortening our time to market.
- Dan Olsson, Infotiv AB

The Challenge

Designing an automated test system to interpret text, symbols, and indicators to ensure high-quality human machine interface (HMI) products.

The Solution

Designing an Integrated Vision and Robotics Cell for Terminal Block Assembly with NI Vision Hardware and Software

With the NI standard, we can implement fast integration of any product in the projects, shortening development time and costs.
- Cristiano Buttinoni, Certified LabVIEW Developer (CLD), ImagingLab

The Challenge

Designing a system that tightly integrates robotics and vision for assembling electric components with short batches and a variety of products.

The Solution

Building an Integrated Vision and Robotics Packaging Line for Cosmetics Using NI LabVIEW and Vision Hardware

By adopting the LabVIEW platform, we successfully programmed, prototyped, and tested new robotics applications very quickly.
- Ignazio Piacentini, ImagingLab

The Challenge

Identifying the position and orientation and performing quality control of face powder brushes, and programming a robotic system to pick up the brushes and place them in a powder case on an eight-slot shuttle.

The Solution

3-D Sensors Bring Depth Discernment to Embedded Vision Designs

By Michael Brading
Automotive and Industrial Business Unit Chief Technology Officer, Aptina Imaging

Kenneth Salsman
Director of New Technology, Aptina Imaging

Manjunath Somayaji
Staff Imaging Scientist, Aptina Imaging

Brian Dipert
Editor-in-Chief, Embedded Vision Alliance
Senior Analyst, BDTI

Tim Droz
Vice President, US Operations, SoftKinetic

Daniël Van Nieuwenhove
Chief Technical Officer, SoftKinetic

Pedro Gelabert
Senior Member of the Technical Staff and Systems Engineer, Texas Instruments

This article is an expanded version of one originally published at EE Times' Embedded.com Design Line. It is reprinted here with the permission of EE Times.

The ability to sense objects in three dimensions can deliver both significant new and significantly enhanced capabilities to vision system designs. Several depth sensor technology alternatives exist to implement this potential, however, each with its own strengths, shortcomings and common use cases.

The term "embedded vision" refers to the use of computer vision in embedded systems, mobile devices, PCs and the cloud. Stated another way, "embedded vision" refers to systems that extract meaning from visual inputs. Historically, such image analysis technology has typically only been found in complex, expensive systems, such as military equipment, industrial robots and quality-control inspection systems for manufacturing. However, cost, performance and power consumption advances in digital integrated circuits such as processors, memory devices and image sensors are now paving the way for the proliferation of embedded vision into high-volume applications.

With a few notable exceptions, such as Microsoft's Kinect peripheral for game consoles and computers, the bulk of today's embedded vision system designs employ 2-D image sensors. 2-D sensors enable a tremendous breadth and depth of vision capabilities. However, their inability to discern an object's distance from the sensor can make it difficult or impossible to implement some vision functions. And clever workarounds, such as supplementing 2-D sensed representations with already known 3-D models of identified objects (human hands, bodies or faces, for example), can be too constraining in some cases.

In what kinds of applications would full 3-D sensing be of notable value versus the more limited 2-D alternative? Consider, for example, a gesture interface implementation. The ability to discern motion not only up-and-down and side-to-side but also front-to-back greatly expands the variety, richness and precision of the suite of gestures that a system can decode. Or consider a biometrics application: face recognition. Depth sensing is valuable in determining that the object being sensed is an actual person's face, versus a photograph of that person's face; alternative means of accomplishing this objective, such as requiring the biometric subject to blink during the sensing cycle, are inelegant in comparison.

ADAS (automotive advanced driver assistance system) applications that benefit from 3-D sensors are abundant. You can easily imagine, for example, the added value of being able to determine not only that another vehicle or object is in the roadway ahead of or behind you, but also to accurately discern its distance from you. Precisely determining the distance between your vehicle and a speed-limit-change sign is equally valuable in ascertaining how much time you have to slow down in order to avoid getting a ticket. The need for accurate three-dimensional no-contact scanning of real-life objects, whether for a medical instrument, in conjunction with increasingly popular "3-D printers", or for some other purpose, is also obvious. And plenty of other compelling applications exist:  3-D videoconferencing, manufacturing line "binning" and defect screening, etc.

Stereoscopic Vision

Stereoscopic vision, combining two 2-D image sensors, is currently the most common 3-D sensor approach. Passive (i.e. relying solely on ambient light) range determination via stereoscopic vision utilizes the disparity in viewpoints between a pair of near-identical cameras to measure the distance to a subject of interest. In this approach, the centers of perspective of the two cameras are separated by a baseline or IPD (inter-pupillary distance) to generate the parallax necessary for depth measurement (Figure 1). Typically, the cameras’ optical axes are parallel to each other and orthogonal to the plane containing their centers of perspective.

Figure 1. Relative parallax shift as a function of distance. Subject A (nearby) induces a greater parallax than subject B (farther out), against a common background.

For a given subject distance, the IPD determines the angular separation θ of the subject as seen by the camera pair and thus plays an important role in parallax detection. It dictates the operating range within which effective depth discrimination is possible, and it also influences depth resolution limits at various subject distances. A relatively small baseline (i.e. several millimeters) is generally sufficient for very close operation such as gesture recognition using a mobile phone. Conversely, tracking a person’s hand from across a room requires the cameras to be spaced further apart. Generally, it is quite feasible to achieve depth accuracies of less than an inch at distances of up to 10 feet.
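
A short worked example (all values are assumptions) makes the baseline/distance relationship concrete: for a rectified pair, depth Z = f * B / d, where f is the focal length in pixels, B the baseline and d the disparity, so the depth change per disparity step grows roughly as Z^2 / (f * B).

    # Worked example of depth and depth sensitivity versus baseline (assumed values).
    f_pixels = 1400.0     # focal length, in pixels
    baseline = 0.10       # 10 cm between the two centers of perspective
    Z = 3.0               # subject roughly 10 feet away

    disparity = f_pixels * baseline / Z            # ~46.7 pixels
    depth_per_px = Z * Z / (f_pixels * baseline)   # ~6.4 cm change per pixel of disparity
    # Sub-pixel disparity estimation (e.g., 1/4 pixel) brings the uncertainty near an inch.
    print(f"disparity ~ {disparity:.1f} px, depth sensitivity ~ {depth_per_px * 100:.1f} cm/px")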

Implementation issues that must be considered in stereoscopic vision-based designs include the fact that when the subject is in motion, accurate parallax information requires precise camera synchronization, often at fast frame rates (e.g., 120 fps). The cameras must be, at minimum, synchronized at the commencement of each frame capture sequence. An even better approach involves using a mode called "gen-lock," where the line-scan timings of the two imagers are synchronized. Camera providers have developed a variety of sync-mode (using a master-slave configuration) and gen-lock-mode sensors for numerous applications, including forward-looking cameras in automobiles.

Alignment is another critical factor in stereoscopic vision. The lens systems must be as close to identical as possible, including magnification factors and pitch-roll-yaw orientations; otherwise, inaccurate parallax measurements will result. Likewise, misalignment of individual lens elements within a camera module can cause varying aberrations, particularly distortions, resulting in false registration along all spatial dimensions. Occlusion, which occurs when an object or portion of an object is visible to one sensor but not to the other, is another area of concern, especially at closer ranges, but this challenge is common to most depth-sensing techniques.

Structured Light

Microsoft's Kinect is today's best-known structured light-based 3-D sensor. The structured light approach, like the time-of-flight technique to be discussed next, is one example of an active non-contact scanner; non-contact, because scanning does not involve the sensor physically touching an object’s surface, and active, because it generates its own electromagnetic radiation and analyzes the reflection of this radiation from the object. Typically, active non-contact scanners use lasers, LEDs, or lamps in the visible or infrared radiation range. Since these systems illuminate the object, they do not require separate controlled illumination of the object for accurate measurements. An optical sensor captures the reflected radiation.

Structured light is an optical 3-D scanning method that projects a set of patterns onto an object, capturing the resulting image with an image sensor. The image sensor is offset from the projected patterns. Structured light replaces the previously discussed stereoscopic vision approach's second image sensor with a projection component. Similar to stereoscopic vision techniques, this approach takes advantage of the known camera-to-projector separation to locate a specific point between them and compute the depth with triangulation algorithms. Thus, image processing and triangulation algorithms convert the distortion of the projected patterns, caused by the object's surface geometry, into 3-D information (Figure 2).

Figure 2. An example structured light implementation using a DLP-based modulator.

Three main types of scanners are used to implement structured light techniques: laser scanners, fixed-pattern scanners, and programmable-pattern scanners. Laser scanners typically utilize a laser in conjunction with a gyrating mirror to project a line on an object. This line is scanned at discrete steps across the object’s surface. An optical sensor, offset from the laser, captures each line scan on the surface of the object.

Fixed-pattern scanners utilize a laser or LED with a diffractive optical element to project a fixed pattern on the surface of the object. An optical sensor, offset from the laser, captures the projected pattern on the surface of the object. In contrast to a laser scanner, the optical sensor of a fixed-pattern scanner captures all of the projected patterns at once. Fixed-pattern scanners typically use pseudorandom binary patterns, such as those based on De Bruijn sequences or M-arrays. These pseudorandom patterns divide the acquired image into a set of sub-patterns that are easily identifiable, since each sub-pattern appears at most once in the image. Thus, this technique uses a spatial neighborhood codification approach.

Programmable-pattern scanners utilize laser, LED, or lamp illumination along with a digital spatial light modulator to project a series of patterns on the surface of the object. An optical sensor, offset from the projector, captures the projected pattern on the surface of the object. Similar to a fixed-pattern scanner, the optical sensor of the programmable-pattern scanner captures the entire projected pattern at once. The primary advantages of programmable-pattern structured light scanners versus fixed-pattern alternatives involve the ability to obtain greater depth accuracy via the use of multiple patterns, as well as to adapt the patterns in response to factors such as ambient light, the object’s surface, and the object’s optical reflection.
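
The sketch below illustrates one way a programmable binary (Gray-code) pattern sequence can be decoded, under simplifying assumptions: each captured frame is thresholded to yield one bit per camera pixel, and the assembled code word identifies the projector column to be triangulated against the camera position. The function and variable names are hypothetical.

    # Hypothetical Gray-code decoding for a programmable-pattern structured light scanner.
    import numpy as np

    def decode_gray_sequence(captures, threshold=128):
        """captures: list of grayscale frames (2D uint8 arrays), most significant bit first."""
        bits = [(frame > threshold).astype(np.uint32) for frame in captures]

        # Convert the per-pixel Gray code to a binary projector-column index:
        # binary[i] = binary[i+1] XOR gray[i], working from the MSB down
        code = bits[0]
        for b in bits[1:]:
            code = (code << 1) | (b ^ (code & 1))
        return code     # per-pixel projector column, input to triangulation

    # Example with synthetic stand-ins for three captured frames
    frames = [np.random.randint(0, 256, (480, 640), dtype=np.uint8) for _ in range(3)]
    columns = decode_gray_sequence(frames)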

Since programmable-pattern structured light requires the projection of multiple patterns, a spatial light modulator provides a cost effective solution. Several spatial light modulation technologies exist in the market, including LCD (liquid crystal display), LCoS (liquid crystal on silicon), and DLP (digital light processing). DLP-based spatial light modulators' capabilities include fast and programmable pattern rates up to 20,000 frames per second, with 1-bit to 8-bit grey scale support, high contrast patterns, consistent and reliable performance over time and temperature, no motors or other fragile moving components, and available solutions with optical efficiency from 365 to 2500 nm wavelengths.

Structured light-based 3-D sensor designs must optimize, and in some cases balance trade-offs between, multiple implementation factors. Sufficient illumination wavelength and power are needed to provide adequate dynamic range, based on ambient illumination and the scanned object's distance and reflectivity. Algorithms must be optimized for a particular application, taking into account the object's motion, topology, desired accuracy, and scanning speed. Adaptive object analysis decreases scanning speed, for example, but provides for a significant increase in accuracy. The resolution of the spatial light modulator and imaging sensor must be tailored to extract the desired accuracy from the system. This selection process primarily affects both cost and the amount of computation required.

Scanning speed is predominantly limited by image sensor performance; high-speed sensors can greatly increase system cost. Object occlusion can present problems, since the pattern projection might shadow a feature in the topology and thereby hide it from the captured image. Rotation of the scanned object, along with multiple analysis and stitching algorithms, provides a good solution for occlusion issues. Finally, system calibration must be comprehended in the design. It's possible to characterize and compensate for projection and imaging lens distortions, for example, since the measured data is based on code words, not on an image's disparity.

Time-of-Flight

An indirect ToF (time-of-flight) system obtains travel-time information by measuring the delay or phase-shift of a modulated optical signal for all pixels in the scene. Generally, this optical signal is situated in the near-infrared portion of the spectrum so as not to disturb human vision. The ToF sensor in the system consists of an array of pixels, where each pixel is capable of determining the distance to the scene.

Each pixel measures the delay of the received optical signal with respect to the sent signal (Figure 3). A correlation function is performed in each pixel, followed by averaging or integration. The resulting correlation value then represents the travel time or delay. Since all pixels obtain this value simultaneously, "snap-shot" 3-D imaging is possible.

Figure 3. Varying sent-to-received delays correlate to varying distances between a time-of-flight sensor and portions of an object or scene.
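
A short worked example (with assumed values) shows how the measured phase shift maps to distance: d = c * phi / (4 * pi * f_mod), with an unambiguous range of c / (2 * f_mod).

    # Indirect time-of-flight: converting a measured phase shift to distance (assumed values).
    import math

    c = 3.0e8              # speed of light, m/s
    f_mod = 20e6           # assumed 20 MHz modulation frequency
    phi = math.pi / 2      # assumed measured phase shift of the returned signal

    distance = c * phi / (4 * math.pi * f_mod)    # ~1.88 m
    ambiguity = c / (2 * f_mod)                   # ~7.5 m unambiguous range
    print(f"distance ~ {distance:.2f} m, unambiguous range ~ {ambiguity:.1f} m")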

As with the other 3-D sensor technologies discussed in this article, a number of challenges need to be addressed in implementing a practical ToF-based system. First, the depth resolution (or noise uncertainty) of the ToF system is linked directly to the modulation frequency, the efficiency of correlation, and the SNR (signal to noise ratio). These specifications are primarily determined by the quality of the pixels in the ToF sensor. Dynamic range must be maximized in order to accurately measure the depth of both close and far objects, particularly those with differing reflectivities.

Another technical challenge involves the suppression of any background ambient light present in the scene, in order to prevent sensor saturation, and enable robust operation in both indoor and outdoor environments. Since more than one ToF system can be present, inter-camera crosstalk must also be eliminated. And all of these challenges must be addressed while keeping the pixel size small enough to obtain the required lateral resolution without compromising pixel accuracy.

Technology Comparisons

No single 3-D sensor technology can meet the needs of every application (Table A). Stereoscopic vision technology demands high software complexity in order to process and analyze highly precise 3-D depth data in real time, thereby typically necessitating DSPs (digital signal processors) or multi-core processors. Stereoscopic vision sensors themselves, however, can be cost-effective and fit in small form factors, making them a good choice for consumer electronics devices such as smartphones and tablets. But they typically cannot deliver the high accuracy and fast response time possible with other 3-D sensor technologies, so they may not be the optimum choice for manufacturing quality assurance systems, for example.

Table A. 3-D vision sensor technology comparisons.

Structured light technology is an ideal solution for 3-D object scanning, including integration with 3-D CAD (computer-aided design) systems. And structured light systems are often superior at delivering high levels of accuracy with less depth noise in indoor environments. The highly complex algorithms associated with structured light sensors can be handled by hard-wired logic, such as ASICs and FPGAs, but these approaches often involve expensive development and device costs (NRE and/or per-component). The high computation complexity can also result in slower response times.

ToF systems are tailored for device control in application areas that need fast response times, such as manufacturing and consumer electronics devices. ToF systems also typically have low software complexity. However, they integrate expensive illumination parts, such as LEDs and laser diodes, as well as costly high-speed interface-related parts, such as fast ADCs, fast serial/parallel interfaces and fast PWM (pulse width modulation) drivers, all of which increase bill-of-materials costs.

Industry Alliance Assistance

Determining the optimum 3-D sensor technology for your next embedded vision design is not a straightforward undertaking. The ability to tap into the collective knowledge and experiences of a community of your engineering peers can therefore be quite helpful, along with the ability to harness the knowledge of various potential technology suppliers. These are among the many resources offered by the Embedded Vision Alliance, a worldwide organization of semiconductor, software and services developers and providers, poised to assist you in rapidly and robustly transforming your next-generation ideas into shipping-product reality.

The Alliance’s mission is to provide engineers with practical education, information, and insights to help them incorporate embedded vision capabilities into products. To execute this mission, the Alliance has developed a Web site (www.Embedded-Vision.com) with tutorial articles, videos, code downloads, and a discussion forum staffed by a diversity of technology experts. For more information on the Embedded Vision Alliance, please email info@Embedded-Vision.com or call +1 (925) 954-1411.

Michael Brading is Chief Technical Officer of the Automotive Industrial and Medical business unit at Aptina Imaging. Prior to that, Mike was Vice President of Engineering at InVisage Technologies. Mike has more than 20 years of integrated circuit design experience, working with design teams all over the world. Michael was also previously the director of design and applications at Micron Technology, and the director of engineering for emerging markets. And before joining Micron Technology, he also held engineering management positions with LSI Logic. Michael has a B.S. in communication engineering from the University of Plymouth.

Kenneth Salsman is the Director of New Technology at Aptina Imaging. Kenneth has been a researcher and research manager for more than 30 years at companies such as Bell Laboratories and the Sarnoff Research Center. He was also Director of Technology Strategy for Compaq Research, and a Lead Scientist at both Compaq and Intel. Kenneth has a Masters degree in Nuclear Engineering, along with an extensive background in optical physics. He was also Chief Science Officer at Innurvation, where he developed a pill sized HD optical scanning system for imaging the gastrointestinal tract. He holds more than 48 patents.

Manjunath Somayaji is the Imaging Systems Group manager at Aptina Imaging, where he leads algorithm development efforts on novel multi-aperture/array-camera platforms. For the past ten years, he has worked on numerous computational imaging technologies such as multi-aperture cameras and extended depth of field systems. He received his M.S. degree and Ph.D. from Southern Methodist University (SMU) and his B.E. from the University of Mysore, all in Electrical Engineering. He was formerly a Research Assistant Professor in SMU's Electrical Engineering department. Prior to SMU, he worked at OmniVision-CDM Optics as a Senior Systems Engineer.

Brian Dipert is Editor-In-Chief of the Embedded Vision Alliance. He is also a Senior Analyst at Berkeley Design Technology, Inc., which provides analysis, advice, and engineering for embedded processing technology and applications, and Editor-In-Chief of InsideDSP, the company's online newsletter dedicated to digital signal processing technology. Brian has a B.S. degree in Electrical Engineering from Purdue University in West Lafayette, IN. His professional career began at Magnavox Electronics Systems in Fort Wayne, IN; Brian subsequently spent eight years at Intel Corporation in Folsom, CA. He then spent 14 years at EDN Magazine.

Tim Droz is the Vice President of US Operations at SoftKinetic. He joined SoftKinetic in 2011 after 10 years at Canesta, where he was Vice President of Platform Engineering and head of the Entertainment Solutions Business Unit. Before then, Tim was Senior Director of Engineering at Cylink. Tim also earlier led hardware development efforts in embedded and web-based signature capture payment terminals at pos.com, along with holding engineering positions at EDJ Enterprises and IBM. Tim earned a BSEE from the University of Virginia and a M.S. degree in Electrical and Computer Engineering from North Carolina State University.

Daniël Van Nieuwenhove is the Chief Technical Officer at SoftKinetic. He co-founded Optrima in 2009, and acted as the company's Chief Technical Officer and Vice President of Technology and Products. Optrima subsequently merged with SoftKinetic in 2010. He received an engineering degree in electronics with great distinction at the VUB (Free University of Brussels) in 2002. Daniël holds multiple patents and is the author of several scientific papers. In 2009, he obtained a Ph.D. degree on CMOS circuits and devices for 3-D time-of-flight imagers. As co-founder of Optrima, he brought its proprietary 3-D CMOS time-of-flight sensors and imagers to market.

Pedro Gelabert is a Senior Member of the Technical Staff and Systems Engineer at Texas Instruments. He has more than 20 years of experience in DSP algorithm development and implementation, parallel processing, ultra-low power DSP systems and architectures, DLP applications, and optical processing, along with architecting digital and mixed signal devices. Pedro received his B.S. degree and Ph.D. in electrical engineering from the Georgia Institute of Technology. He is a member of the Institute of Electrical and Electronics Engineers, holds four patents and has published more than 40 papers, articles, user guides, and application notes.