Embedded Vision Alliance: Technical Articles

Gesture Recognition: Enabling Natural Interactions With Electronics

By Dong-Ik Ko (Lead Engineer, Gesture Recognition and Depth-Sensing) and Gaurav Agarwal (Manager, Gesture Recognition and Depth-Sensing)
Texas Instruments
This is a reprint of a Texas Instruments-published white paper, which is also available here (2.6 MB PDF).

Introduction

Over the past few years, gesture recognition has made its debut in the entertainment and gaming markets. Now, gesture recognition is becoming a commonplace technology, enabling humans and machines to interface more easily in the home, in the automobile and at work. Imagine a person sitting on a couch, controlling the lights and TV with a wave of his hand. This and other capabilities are being realized as gesture recognition technologies enable natural interactions with the electronics that surround us. Gesture recognition has long been researched with 2D vision, but with the advent of 3D sensor technology, its applications are now more diverse, spanning a variety of markets.

Limitations of (x, y) coordinate-based 2D vision

Computers are limited when it comes to understanding scenes, as they lack the ability to analyze the world around them. Key problems that computers face in understanding scenes include segmentation, object representation, machine learning and recognition. Because computers are limited by their 2D representation of scenes, a gesture recognition system has to apply various cues to acquire more accurate results and more valuable information. While the possibilities include whole-body tracking and other techniques that combine multiple cues, it is difficult to sense scenes using only 2D representations that do not include known 3D models of the objects being identified, such as human hands, bodies or faces.

“z” (depth) innovation

Depth information, or “z,” enables capabilities well beyond gesture recognition. The challenge in incorporating 3D vision and gesture recognition into technology has been obtaining this third “z” coordinate. The human eye naturally registers x, y and z coordinates for everything it sees, and the brain then interprets those coordinates into a 3D image. In the past, lack of image analysis technology prevented electronics from seeing in 3D. Today, there are three common technologies that can acquire 3D images, each with its own unique strengths and common use cases: stereoscopic vision, structured light pattern and time of flight (TOF). With the analysis of the 3D image output from these technologies, gesture-recognition technology becomes a reality.

Stereoscopic vision

The most common 3D acquisition system is the stereoscopic vision system, which uses two cameras to obtain left and right stereo images. The two cameras are slightly offset, on roughly the same order as the human eyes are. As the computer compares the two images, it develops a disparity image that relates the displacement of objects between the images. Commonly used in 3D movies, stereoscopic vision systems enable exciting and low-cost entertainment. The approach is well suited to 3D movies and mobile devices, including smartphones and tablets.
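For readers who want to see the disparity computation in practice, the minimal sketch below (C++ with OpenCV, shown in the OpenCV 3.x-style API) computes a block-matching disparity map and converts it to depth via triangulation. The focal length, baseline and file names are illustrative assumptions rather than values from this paper.

```cpp
#include <opencv2/opencv.hpp>

int main()
{
    // Rectified left/right images from the stereo pair (placeholder file names).
    cv::Mat left  = cv::imread("left.png",  cv::IMREAD_GRAYSCALE);
    cv::Mat right = cv::imread("right.png", cv::IMREAD_GRAYSCALE);
    if (left.empty() || right.empty()) return 1;

    // Block-matching stereo correspondence: 64 disparity levels, 15x15 window.
    cv::Ptr<cv::StereoBM> bm = cv::StereoBM::create(64, 15);
    cv::Mat disp16;                          // fixed-point disparity, scaled by 16
    bm->compute(left, right, disp16);

    cv::Mat disparity;
    disp16.convertTo(disparity, CV_32F, 1.0 / 16.0);

    // Triangulation: depth z = f * B / d, with focal length f in pixels and
    // baseline B in metres (both assumed here). Valid only where disparity > 0.
    const double f = 700.0, B = 0.06;
    cv::Mat depth = f * B / disparity;       // per-pixel depth in metres

    cv::imwrite("disparity.png", disp16);
    return 0;
}
```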

Structured light pattern

Structured light systems project known light patterns onto a scene to measure or scan 3D objects. The light patterns are created using either a projection of laser or LED light interference or a series of projected images. By replacing one of the sensors of a stereoscopic vision system with a light source, structured-light-based technology exploits essentially the same triangulation as a stereoscopic system does to acquire the 3D coordinates of the object. A single 2D camera with an IR- or RGB-based sensor can be used to measure the displacement of any single stripe of visible or IR light, and the coordinates can then be obtained through software analysis. These coordinates can then be used to create a digital 3D image of the shape.

Time of flight (TOF)

Relatively new among depth information systems, time of flight (TOF) sensors are a type of light detection and ranging (LIDAR) system that transmits a light pulse from an emitter to an object. A receiver determines the distance to the measured object, on a per-pixel basis, by calculating the round-trip travel time of the light pulse from the emitter to the object and back to the receiver.

TOF systems are not scanners, as they do not measure point to point. Instead, TOF systems perceive the entire scene simultaneously to determine the 3D range image. With the measured coordinates of an object, a 3D image can be generated and used in systems such as device control in areas like manufacturing, robotics, medical technologies and digital photography. TOF systems require a significant amount of processing, and embedded systems have only recently provided the processing performance and bandwidth needed by these systems.
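To give a feel for the per-pixel computation behind a continuous-wave TOF sensor, the sketch below applies the widely used four-phase demodulation scheme: the phase shift of the modulated light is recovered from four correlation samples and scaled into a distance. The sample ordering and sign conventions vary by sensor, so treat this as an illustrative assumption rather than a description of any particular device.

```cpp
#include <cmath>

// Correlation samples taken at modulation phase offsets of 0, 90, 180 and 270 degrees.
struct TofPixel { float c0, c1, c2, c3; };

// Returns the distance in metres for one pixel, given the sensor's modulation frequency.
float tofDistanceMetres(const TofPixel& p, float modulationHz)
{
    const float c  = 299792458.0f;                       // speed of light, m/s
    const float pi = 3.14159265f;

    float phase = std::atan2(p.c3 - p.c1, p.c0 - p.c2);  // phase shift of the returned light
    if (phase < 0.0f) phase += 2.0f * pi;                // wrap into [0, 2*pi)

    // Distance is proportional to phase; the unambiguous range is c / (2 * f_mod).
    return (c * phase) / (4.0f * pi * modulationHz);
}
```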

Comparing 3D vision technology

No single 3D vision technology can currently meet the needs for every market or application. Figure 1 shows a comparison of the different 3D vision technologies’ response time, software complexity, cost and accuracy.

Figure 1. 3D vision sensor technology comparison

Stereoscopic vision technology carries a large amount of software complexity: extracting highly precise 3D depth data requires correspondence processing that is typically performed in real time by digital signal processors (DSPs) or multi-core scalar processors. Stereoscopic vision systems can be more cost effective and fit in a small form factor, making them a good choice for devices like smartphones, tablets and other consumer devices. However, stereoscopic vision systems cannot deliver the high accuracy and fast response time that other technologies can, so they are not the best choice for systems requiring high accuracy, such as manufacturing quality assurance systems.

Structured light technology is an ideal solution for 3D scanning of objects, including 3D computer-aided design (CAD) systems. The highly complex software associated with these systems can be addressed by hardwired logic, such as ASICs and FPGAs, which carry high development and material costs. The computational complexity also results in a slower response time. At the macro level, structured light systems are better than other 3D vision technologies at delivering high levels of accuracy with less depth noise in an indoor environment.

Due to their balance of cost and performance, TOF systems are optimal for device control in areas like manufacturing and for consumer electronics devices needing a fast response time. TOF systems typically have low software complexity. However, these systems integrate expensive illumination parts, such as LEDs and laser diodes, as well as costly high-speed interface-related parts, such as fast ADCs, fast serial/parallel interfaces and fast PWM drivers, all of which increase material costs.

How “z” (depth) impacts human-machine interfaces

The addition of the “z” coordinate allows displays and images to look more natural and familiar. Displays more closely reflect what people see with their own eyes, and this third coordinate changes the types of displays and applications available to users.

Stereoscopic display

While using stereoscopic displays, users typically wear 3D glasses. The display emits different images for the left and right eye, tricking the brain into interpreting a 3D image based on the two different images the eyes receive. Stereoscopic displays are used in many 3D televisions and 3D movie theaters today. Additionally, we’re starting to see glasses-free stereoscopic-3D capabilities in the smartphone space. Users now have the ability to not only view 3D content from the palm of their hands, but also capture on-the-go memories in 3D and upload them instantly to the Web.

Multi-view display

Rather than requiring the use of special glasses, multi-view displays simultaneously project multiple images, each one slightly offset and angled so that a viewer sees a different projection of the same object from each viewing angle. These displays create a hologram-like effect that you can expect to see in the near future.

Detection and applications

The ability to process and display the “z” coordinate is enabling new applications far beyond entertainment and gaming, including manufacturing control, security, interactive digital signage, remote medical care, automotive safety and robotic vision. Figure 2 depicts some applications enabled by body skeleton and depth map sensing.

Figure 2. 3D vision is enabling new applications in a variety of markets

Human gesture recognition for consumer applications

Human gesture recognition is a popular new way to input information in gaming, consumer and mobile devices, including smartphones and tablets. Users can naturally and intuitively interact with the device, leading to greater acceptance and approval of the products. These human-gesture-recognition products process 3D data at various resolutions, from 160 × 120 pixels to 640 × 480 pixels, at 30–60 fps. Software modules such as raw-to-depth conversion, two-hand tracking and full-body tracking require parallel processing for efficient and fast analysis of the 3D data to deliver gaming and tracking in real time.

Industrial

A majority of industrial applications for 3D vision, including industrial and manufacturing sensors, integrate an imaging system from as few as 1 pixel to several million pixels. The 3D images can be manipulated and analyzed using DSP + general-purpose processor (GPP) system-on-chip (SoC) processors to accurately detect manufacturing flaws or choose the correct parts from a factory bin.

Interactive digital signage as a pinpoint marketing tool

Advertisements already bombard us on a daily basis, but with interactive digital signage, companies will be increasingly able to use pinpoint marketing tools to deliver the most applicable content to each consumer. For example, as someone walks past a digital sign, an extra message may appear on the sign to acknowledge the customer. If the customer stops to read the message, the sign can interpret that as interest in their product and deliver a more targeted message. Microphones allow the billboard to recognize significant phrases to further strategically pinpoint the delivered message.

Interactive digital signage systems integrate a 3D sensor for full body tracking, a 2D sensor for facial recognition and microphones for speech recognition. The systems require functionality like MPEG-4 video decoding. High-end DSPs and GPPs are necessary to run the complex analytics software for these systems.

Fault-free virtual or remote medical care

The medical field also benefits from the new and unprecedented applications that 3D vision offers. This technology will help ensure that the best medical care is available to everyone, no matter where they are located in the world. Doctors can remotely and virtually treat patients by utilizing medical robotic vision enabled by the high accuracy of 3D sensors.

Automotive safety

Recently, 2D sensors have enabled extensive improvements in automotive technology, specifically in traffic signal, lane and obstacle detection. With the proliferation of 3D sensing technology, “z” data from 3D sensors can significantly improve the reliability of scene analysis and prevent more accidents on the road. Using a 3D sensor, a vehicle can reliably detect and interpret the world around it to determine if objects are a threat to the safety of the vehicle and the passengers inside, ultimately preventing collisions. These systems will require the right hardware and sophisticated software to interpret the 3D images in a very timely manner.

Video conferencing

Gone are the years of videoconferences with grainy, disjointed images. Today’s video conferencing systems offer high-definition images, and newer systems leverage 3D sensors to deliver an even more realistic and interactive experience. With integrated 2D and 3D sensors as well as a microphone array, this enhanced video conferencing system can connect with other enhanced systems to enable high-quality video processing, facial recognition, 3D imaging, noise cancellation and content players, including Flash. Given the need for intensive video and audio processing in this application, a DSP + GPP SoC processor will offer the optimum solution with the best mix of performance and peripherals to deliver the required analytical functionality.

Technology processing steps

Many applications will require both a 2D and a 3D camera system to properly enable 3D imaging technology. Figure 3 shows the basic data path of these systems. Moving the data from the sensors into the vision analytics is more complex than the data path suggests. Specifically, TOF sensors require up to 16 times the bandwidth of 2D sensors, causing a shortage of input/output (I/O) bandwidth. Another bottleneck occurs when processing the raw 3D data into a 3D point cloud. Identifying the right combination of hardware and software to mitigate these issues is critical for successful gesture recognition and 3D applications. Today, this data path is realized in DSP/GPP combination processors along with discrete analog components and software libraries.

Figure 3. Data path of 2D and 3D camera systems
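The raw-depth-to-point-cloud step mentioned above is conceptually simple but touches every pixel, which is why it becomes a bottleneck. The sketch below back-projects a 16-bit depth map through a pinhole camera model; the intrinsic parameters are placeholder values that would normally come from sensor calibration.

```cpp
#include <cstdint>
#include <vector>

struct Point3 { float x, y, z; };

// Converts a depth map (millimetres, 0 = invalid) into a 3D point cloud in metres.
std::vector<Point3> depthToPointCloud(const uint16_t* depthMm, int width, int height)
{
    const float fx = 525.0f, fy = 525.0f;              // focal lengths in pixels (assumed)
    const float cx = width * 0.5f, cy = height * 0.5f; // principal point (assumed centred)

    std::vector<Point3> cloud;
    cloud.reserve(static_cast<size_t>(width) * height);

    for (int v = 0; v < height; ++v) {
        for (int u = 0; u < width; ++u) {
            const uint16_t d = depthMm[v * width + u];
            if (d == 0) continue;                      // skip invalid measurements
            const float z = d * 0.001f;                // millimetres to metres
            cloud.push_back({ (u - cx) * z / fx,       // back-project along x
                              (v - cy) * z / fy,       // back-project along y
                              z });
        }
    }
    return cloud;
}
```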

Challenges for 3D vision embedded systems

Input challenges

As discussed, input bandwidth constraints are a challenge, specifically for TOF-based 3D vision embedded systems. Due to the lack of standardization of the input interface, designers can choose to work with different input options, including serial and parallel interfaces for 2D sensors as well as general-purpose external-memory interfaces. Until a standard input interface with optimum bandwidth is developed, designers will have to work with the unstandardized options available today.

Two different processor architectures

In Figure 3, 3D depth map processing can be divided into two categories: 1) vision-specific, data-centric processing [low-level processing] and 2) application upper-level processing [mid- to high-level processing]. Vision-specific, data-centric processing requires a processor architecture that can perform single instruction, multiple data (SIMD) operations, fast floating-point multiplication and addition, and fast search algorithms. A DSP (SIMD+VLIW) or a SIMD-based accelerator is an ideal candidate for quickly and reliably performing this type of processing. High-level operating systems (O/Ss) and stacks can provide the necessary features for the upper layer of any application.

Based on the requirements for vision-specific, data-centric processing as well as application upper-level processing, an SoC that provides a GPP+DSP+SIMD processor with high-data-rate I/O is well suited for 3D vision processing.
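As a concrete (and deliberately generic) example of the low-level, data-centric processing described above, the kernel below computes a sum of absolute differences between two rows of depth samples. Its branch-free, dependency-free inner loop is exactly the kind of code a SIMD/VLIW DSP compiler or hand-written intrinsics can pipeline to process several pixels per cycle; it is not taken from any TI library.

```cpp
#include <cstdint>
#include <cstdlib>

// Sum of absolute differences over one row of 16-bit samples.
uint32_t rowSad(const uint16_t* a, const uint16_t* b, int n)
{
    uint32_t sad = 0;
    for (int i = 0; i < n; ++i) {
        const int diff = static_cast<int>(a[i]) - static_cast<int>(b[i]);
        sad += static_cast<uint32_t>(std::abs(diff));   // accumulate |a[i] - b[i]|
    }
    return sad;
}
```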

Lack of standard middleware

The world of middleware for 3D vision processing encompasses many different pieces from multiple sources, including open source (e.g., OpenCV) as well as proprietary commercial sources. Several commercial libraries are targeted toward body-tracking applications. However, no company has yet developed a middleware interface that is standardized across all the different 3D vision applications. When standardized middleware is available, development will become much faster and easier, and we can expect to see a huge proliferation of 3D vision and gesture recognition technologies across a variety of markets.

Opportunities abound with the proliferation of 3D vision and gesture recognition technologies, and Texas Instruments Incorporated (TI) and its partners are leading the charge in bringing 3D capabilities to new markets and in providing the hardware and middleware our customers need to innovate groundbreaking and exciting analytical applications.

3D vision software architecture

In this section, we will explore some of the more specific TI technologies used to implement the 3D vision architecture necessary to power these new applications. The following information is based on TI’s DaVinci™ video processor and OMAP™ application processor technology.

As stated earlier, TI’s integrated system allows low-level, mid-level and high-level processing to be distributed across multiple processing devices. This enables optimal performance with the most fitting processors. One possible distribution of 3D vision application processing loads can be seen in Figure 4. Here, low-level processing covers about 40 percent of the processing load for extracting the depth map and filtering, while more than 55 percent of the load is dedicated to mid- and high-level processing for motion flow, object segmentation and labeling, and tracking.

Figure 4. 3D vision application processing loads (case 1)

Figure 5 shows another case of 3D vision application processing loads where low-level processing for the calculation of segmentation, background and human body covers 20 percent of the total loads, and mid- to high-level processing takes 80 percent.

Figure 5. 3D vision application processing loads (case 2)

The detailed methods and algorithms applied can differ for similar processing cells; thus, the same or similar processing cells in Figures 4 and 5 are categorized into different processing levels. The approach in case 1 analyzes the entire scene block by block and then extracts foreground objects. Thus, the object segmentation cell is heavier because it touches every pixel. The advantage of this approach is that it can label all objects in the background (such as furniture or non-moving people) and foreground (usually moving objects like people), whether objects in the scene are static or dynamic. This approach can be used not only in gesture applications, but also in surveillance applications. The approach in case 2 analyzes only moving pixels and connects those pixels by analyzing contours and labeling them. The advantage is that fewer computation cycles are required compared to case 1. However, a person must move at least slightly to be detected.
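A minimal sketch of the case 2 approach, written with OpenCV for illustration, appears below: moving pixels are isolated by frame differencing and then connected into labeled objects via their contours. The threshold and kernel size are illustrative assumptions.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Returns one contour per detected moving object, given two consecutive grayscale frames.
std::vector<std::vector<cv::Point>> movingObjects(const cv::Mat& prevGray,
                                                  const cv::Mat& currGray)
{
    cv::Mat diff, mask;
    cv::absdiff(currGray, prevGray, diff);                  // per-pixel motion energy
    cv::threshold(diff, mask, 25, 255, cv::THRESH_BINARY);  // keep only moving pixels

    // Remove isolated noise pixels before connecting regions.
    cv::morphologyEx(mask, mask, cv::MORPH_OPEN,
                     cv::getStructuringElement(cv::MORPH_RECT, cv::Size(3, 3)));

    // Each external contour corresponds to one moving (foreground) object.
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(mask, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    return contours;
}
```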

It is important to understand what type of processing is needed for each step in the application in order to allocate it to the correct processing unit. As demonstrated by Figures 4 and 5, a 3D vision architecture should provide optimized and balanced hardware IPs for low- to high-level processing of the 3D vision software stack.

3D vision hardware architectures

There are various hardware architecture options for 3D vision applications, each with pros and cons.

ISP (Image signal processing)

Due to differences in the requirements of 3D vision’s depth acquisition methods (stereoscopic vision, structured light, TOF), it is difficult to define a universal interface for all three technologies. Standard 2D sensor interface formats, such as CSI2 and parallel ports, can seamlessly support stereoscopic vision and structured light technology up to 1080p resolution. However, TOF technology requires much higher (up to 16 times) data interface bandwidth, has a unique data width per pixel and carries additional metadata compared to the other two 3D vision depth technologies. Currently, no standard data interface for TOF has been defined. Thus, multiple CSI2 interfaces, CSI3 or a parallel interface are considered, depending on the TOF sensor’s resolution. These interfaces, originally designed around the characteristics of 2D sensors, do not fully exploit the specifics of 3D sensors (structured light and TOF).

Besides the data interface issue for 3D sensor technology, 3D-vision-specific digital signal processing (lens correction, noise filtering, resizing, compression and 3A: auto exposure, auto focus, auto white balance) can be defined in the ISP along with 2D-sensor-specific image processing. Some functional logic for 2D sensors can be reused to improve 3D depth quality, but certain functions require 3D-specific processing. For example, a TOF sensor’s noise pattern differs from a 2D sensor’s. In structured light technology, noise-handling methods can differ depending on the required depth accuracy and the depth range of interest.
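As a simple illustration of the kind of 3D-specific clean-up discussed above, the sketch below applies a small median filter to suppress the speckle-like noise typical of TOF depth maps and masks out readings beyond the depth range of interest. The parameters are illustrative; they do not describe any particular ISP implementation.

```cpp
#include <opencv2/opencv.hpp>

// Denoises a CV_16U depth map (values in millimetres); invalid pixels become 0.
cv::Mat denoiseDepth(const cv::Mat& depth16u)
{
    cv::Mat filtered;
    cv::medianBlur(depth16u, filtered, 5);          // 5x5 median removes single-pixel outliers

    // Keep only readings inside the assumed range of interest (0.3 m to 4.0 m).
    cv::Mat mask = (filtered > 300) & (filtered < 4000);
    cv::Mat result;
    filtered.copyTo(result, mask);                  // pixels outside the mask are zeroed
    return result;
}
```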

SIMD (single instruction multiple data) and VLIW (very long instruction word)

  • Pros
    • Flexible programming environment
    • Handles algorithms with dependencies on neighboring elements
    • Data parallelism and throughput
  • Cons
    • Heavy dependence on the compiler’s performance and efficiency for fast prototyping
    • Manual, algorithm-dependent optimization effort needed to achieve good performance gains
    • Poor utilization of parallelism for high-level vision processing algorithms

Graphics processing unit (GPU)

  • Pros
    • Easy programming environment
    • Fast prototyping and optimization
    • Data and task parallelism and throughput
    • MIMD (multiple instruction and multiple data) friendly
  • Cons
    • Inefficient hardware architecture for exploiting vision algorithms’ operational features
    • High power consumption and large area
    • Inefficient memory access and inter-processor communication (IPC)
    • Limited algorithm complexity per kernel

GPP (single- and multi-core)

  • Pros
    • Most flexible programming environment
    • Fast prototyping and quick optimization
    • High portability
  • Cons
    • Poor utilization of parallelism
    • Low throughput
    • High power consumption due to high clock rates
    • Large area

Hardware accelerator

  • Pros
    • Highly optimized area and power consumption
    • Data parallelism and throughput
    • Block processing friendly architecture (low- and mid-level processing friendly)
    • Deterministic optimization; in other words, “what you program is what you get.” The actual cycles required at runtime are known at the programming stage because there is no dependence on compiler tuning, memory-hierarchy delays, or data and control hazards.
  • Cons
    • Poor programming environment
    • Much more software engineering effort required to achieve good optimization
    • Inefficient context switching between algorithms
    • Poor portability

TI’s 3D vision hardware IPs

Below is a brief overview of the hardware IPs available from TI and where they traditionally lie in the low-, mid- and high-level processing usage.

Vision accelerator

  • Architecture
    • Hardware accelerator
  • Vision application mapping
    • Low- to mid-level processing

DSP

  • Architecture
    • SIMD & VLIW
  • Vision application mapping
    • Mid- to high-level processing

ARM®

  • Architecture
    • GPP
  • Vision application mapping
    • Low-, mid- and high-level processing

TI’s 3D vision hardware and software stack

TI’s 3D vision hardware and software stack shows how TI’s hardware IPs are leveraged for optimizing 3D vision systems. TI’s 3D processor options include vision accelerators and DSPs. These IPs are integrated into TI’s embedded chips depending on the targeted application’s requirements. These optimized 3D processors boost system performance when processing 3D vision middleware and the 3D Vision Low-Level Library (3DV Low Lib). 3D application software, which utilizes accelerated 3D vision middleware, is ported to the GPP (ARM). The 3D application software layer offers an innovative human-machine user experience on top of the underlying vision analytics algorithms. It also communicates with the 3D sensor device to send 3D depth data and 2D RGB data to the vision analytics algorithms associated with the system software layers. The Linux™ platform provides system-associated functions on the GPP (ARM). TI’s 3D vision hardware and software stack is illustrated in Figure 6.

Figure 6. TI’s 3D vision hardware and software stack

Additionally, TI provides a range of libraries, including DSP Lib, Image Lib, Vision Lib and OpenCV kernels. These libraries, shown in Figure 7, enable customers to reduce their time to market and enhance performance on TI devices. Customers can also tap into TI’s network of third-party designers for additional software and hardware tools to aid development of 3D vision and gesture recognition applications.

Figure 7. TI vision software solutions

Figure 8 shows TI’s broad processor portfolio for 3D vision, including the DaVinci™ DM36x video processors for hand-tracking-based solutions; the DM385 processors for hand-tracking or upper-body-tracking solutions with Skype™ support; the OMAP 3 processors and the DM3730 processor for body-tracking solutions; and the OMAP 4, OMAP 5 and DM8148 processors for integrated voice and Skype together with gesture and body tracking, face detection, and augmented reality. In addition, the Sitara™ AM335x and AM37x ARM® microprocessors (MPUs) can perform 2D vision for hand tracking and face detection.

Figure 8. TI’s 3D vision processor roadmap

Optimized hardware architecture can demand more engineering effort on the software side. Thus, delivering optimized architectures should always go hand in hand with offering capable development tools. TI provides graphical-user-interface (GUI)- and integrated-development-environment (IDE)-based tools for its 3D processors (DSPs and accelerators), which can shorten customers’ development cycles and help them accelerate time to market. TI’s Code Composer Studio™ IDE and Eclipse-compliant plug-ins to the IDE for accelerators are popular options.

Conclusion

Despite several challenges, 3D vision and gesture tracking are getting attention and gaining traction in the market. TI offers a complete solution, including a range of processors, tools and comprehensive software, to enable customers to quickly and easily bring leading 3D vision products to market. TI continues to innovate to meet market challenges, leading to even greater adoption of gesture and 3D applications beyond the consumer electronics market. When it comes to applications that make interactions between humans and their devices more natural and intuitive, the sky is the limit.

Processing Options For Implementing Vision Capabilities in Embedded Systems

By Jeff Bier
Founder
Embedded Vision Alliance
Co-Founder and President
BDTI

This article was originally published on Altera's Technology Center. It is reprinted here with the permission of Altera.

With the emergence of increasingly capable processors, image sensors, memories and other semiconductor devices, along with associated algorithms, it's becoming practical to incorporate computer vision capabilities into a wide range of embedded systems, enabling them to analyze their environments via video inputs. Products like Microsoft's Kinect game controller and Mobileye's driver assistance systems are raising awareness of the incredible potential of embedded vision technology. As a result, many embedded system designers are beginning to think about implementing embedded vision capabilities. This article explores the opportunity for embedded vision, compares various processor options for implementing it, and introduces an industry alliance created to help engineers incorporate vision capabilities into their designs.

The term “embedded vision” refers to the use of computer vision technology in embedded systems. Stated another way, “embedded vision” refers to embedded systems that extract meaning from visual inputs. Similar to the way that wireless communication has become pervasive over the past 10 years, embedded vision technology is poised to be widely deployed in the next 10 years.

It’s clear that embedded vision technology can bring huge value to a vast range of applications (Figure 1). Two examples are Mobileye’s vision-based driver assistance systems, intended to help prevent motor vehicle accidents, and MG International’s swimming pool safety system, which helps prevent swimmers from drowning. And for sheer geek appeal, it’s hard to beat Intellectual Ventures’ laser mosquito zapper, designed to prevent people from contracting malaria.

Figure 1. Embedded vision got its start as computer vision in applications such as assembly-line inspection, optical character recognition, robotics, surveillance, and military systems. In recent years, however, decreasing costs and increasing capabilities have broadened and accelerated its penetration into numerous other markets.

Just as high-speed wireless connectivity began as an exotic, costly technology, so far embedded vision technology typically has been found in complex, expensive systems, such as a surgical robot for hair transplantation and quality-control inspection systems for manufacturing.

Advances in digital integrated circuits were critical in enabling high-speed wireless technology to evolve from exotic to mainstream. When chips got fast enough, inexpensive enough, and energy efficient enough, high-speed wireless became a mass-market technology. Today one can buy a broadband wireless modem for under $100.

Similarly, advances in digital chips are now paving the way for the proliferation of embedded vision into high-volume applications (Figure 2). Like wireless communication, embedded vision requires lots of processing power—particularly as applications increasingly adopt high-resolution cameras and make use of multiple cameras. Providing that processing power at a cost low enough to enable mass adoption is a big challenge. This challenge is multiplied by the fact that embedded vision applications require a high degree of programmability. In contrast to wireless applications where standards mean that, for example, baseband algorithms don’t vary dramatically from one cell phone handset to another, in embedded vision applications there are great opportunities to get better results—and enable valuable features—through unique algorithms.

Figure 2. The embedded vision ecosystem spans hardware, semiconductor, and software component suppliers, subsystem developers, systems integrators, and end users, along with the fundamental research that provides ongoing breakthroughs. This article focuses on the embedded vision algorithm processing options shown in the center of the figure.

With embedded vision, the industry is entering a “virtuous circle” of the sort that has characterized many other digital signal processing (DSP) application domains. Although there are few chips dedicated to embedded vision applications today, these applications are increasingly adopting high-performance, cost-effective processing chips developed for other applications, including digital signal processors, CPUs, FPGAs, and GPUs. As these chips continue to deliver more programmable performance per dollar and per watt, they will enable the creation of more high-volume embedded vision products. Those high-volume applications, in turn, will attract more attention from silicon providers, who will deliver even better performance, efficiency, and programmability.

Processing Candidates

As previously mentioned, vision algorithms typically require high compute performance. And, of course, embedded systems of all kinds are usually required to fit into tight cost and power consumption envelopes. In other DSP application domains, such as digital wireless communications, chip designers achieve this challenging combination of high performance, low cost, and low power by using specialized coprocessors and accelerators to implement the most demanding processing tasks in the application. These coprocessors and accelerators are typically not programmable by the chip user, however.

This tradeoff is often acceptable in wireless applications, where standards mean that there is strong commonality among algorithms used by different equipment designers. In vision applications, however, there are no standards constraining the choice of algorithms. On the contrary, there are often many approaches to choose from to solve a particular vision problem. Therefore, vision algorithms are very diverse, and tend to change fairly rapidly over time. As a result, the use of non-programmable accelerators and coprocessors is less attractive for vision applications compared to applications like digital wireless and compression-centric consumer video equipment.

Achieving the combination of high performance, low cost, low power, and programmability is challenging. Special-purpose hardware typically achieves high performance at low cost, but with little programmability. General-purpose CPUs provide programmability, but with weak performance, poor cost-effectiveness, and/or low energy-efficiency. Demanding embedded vision applications most often use a combination of processing elements, which might include, for example:

  • A general-purpose CPU for heuristics, complex decision-making, network access, user interface, storage management, and overall control
  • A high-performance digital signal processor for real-time, moderate-rate processing with moderately complex algorithms
  • One or more highly parallel engines for pixel-rate processing with simple algorithms

While any processor can in theory be used for embedded vision, the most promising types today are the:

  • High-performance embedded CPU
  • Application-specific standard product (ASSP) in combination with a CPU
  • Graphics processing unit (GPU) with a CPU
  • Digital signal processor with accelerator(s) and a CPU
  • Mobile “application processor”
  • Field programmable gate array (FPGA) with a CPU

Subsequent sections of this article will briefly introduce each of these processor types, along with some of their key strengths and weaknesses for embedded vision applications.

High-Performance Embedded CPU

In many cases, embedded CPUs cannot provide enough performance—or cannot do so at acceptable price or power consumption levels—to implement demanding vision algorithms. Often, memory bandwidth is a key performance bottleneck, since vision algorithms typically use large amounts of data, and don’t tend to repeatedly access the same data. The memory systems of embedded CPUs are not designed for these kinds of data flows. However, like most types of processors, embedded CPUs become more powerful over time, and in some cases can provide adequate performance.

Compelling reasons exist to run vision algorithms on a CPU when possible. First, most embedded systems need a CPU for a variety of functions. If the required vision functionality can be implemented using that CPU, then the complexity of the system is reduced relative to a multiprocessor solution. In addition, most vision algorithms are initially developed on PCs using general-purpose CPUs and their associated software development tools. Similarities between PC CPUs and embedded CPUs (and their associated tools) mean that it is typically easier to create embedded implementations of vision algorithms on embedded CPUs compared to other kinds of embedded vision processors. Finally, embedded CPUs typically are the easiest to use compared to other kinds of embedded vision processors, due to their relatively straightforward architectures, sophisticated tools, and other application development infrastructure, such as operating systems.
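The portability argument above is easy to demonstrate: a vision prototype like the following OpenCV edge-detection snippet compiles unchanged for a PC and for an embedded Linux/ARM target (given an OpenCV build for that target); only the toolchain differs. File names are placeholders.

```cpp
#include <opencv2/opencv.hpp>

int main()
{
    cv::Mat frame = cv::imread("input.png", cv::IMREAD_GRAYSCALE);
    if (frame.empty()) return 1;

    cv::Mat blurred, edges;
    cv::GaussianBlur(frame, blurred, cv::Size(5, 5), 1.5);  // suppress sensor noise
    cv::Canny(blurred, edges, 50, 150);                     // gradient-based edge map

    cv::imwrite("edges.png", edges);
    return 0;
}
```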

ASSP in Combination with a CPU

ASSPs are specialized, highly integrated chips tailored for specific applications or application sets. ASSPs may incorporate a CPU, or use a separate CPU chip. By virtue of specialization, ASSPs typically deliver superior cost- and energy-efficiency compared with other types of processing solutions. Among other techniques, ASSPs deliver this efficiency through the use of specialized coprocessors and accelerators. And, because ASSPs are by definition focused on a specific application, they are usually delivered with extensive application software.

The specialization that enables ASSPs to achieve strong efficiency, however, also leads to their key limitation: lack of flexibility. An ASSP designed for one application is typically not suitable for another application, even one that is related to the target application. ASSPs use unique architectures, and this can make programming them more difficult than with other kinds of processors. Indeed, some ASSPs are not user-programmable. Another consideration is risk. ASSPs often are delivered by small suppliers, and this may increase the risk that there will be difficulty in supplying the chip, or in delivering successor products that enable system designers to upgrade their designs without having to start from scratch.

GPU with a CPU

GPUs, intended mainly for 3D graphics, are increasingly capable of being used for other functions, including vision applications. The GPUs used in personal computers today are explicitly intended to be programmable to perform functions other than 3D graphics. Such GPUs are termed “general-purpose GPUs” or “GPGPUs.” GPUs have massive parallel processing horsepower. They are ubiquitous in personal computers. GPU software development tools are readily and freely available, and getting started with GPGPU programming is not terribly complex. For these reasons, GPUs are often the parallel processing engines of first resort for computer vision algorithm developers who develop their algorithms on PCs, and then may need to accelerate execution of their algorithms for simulation or prototyping purposes.

GPUs are tightly integrated with general-purpose CPUs, sometimes on the same chip. However, one of the limitations of GPU chips is the limited variety of CPUs with which they are currently integrated, and the limited number of CPU operating systems that support that integration. Today, low-cost, low-power GPUs exist, designed for products like smartphones and tablets. However, these GPUs are often not GPGPUs, and therefore using them for applications other than 3D graphics is very challenging.

Digital Signal Processor with Accelerator(s) and a CPU

Digital signal processors are microprocessors specialized for signal processing algorithms and applications. This specialization typically makes digital signal processors more efficient than general-purpose CPUs for the kinds of signal processing tasks that are at the heart of vision applications. In addition, digital signal processors are relatively mature and easy to use compared to other kinds of parallel processors.

Unfortunately, while digital signal processors do deliver higher performance and efficiency than general-purpose CPUs on vision algorithms, they often fail to deliver sufficient performance for demanding algorithms. For this reason, DSPs are often supplemented with one or more coprocessors. A typical DSP chip for vision applications therefore comprises a CPU, a digital signal processor, and multiple coprocessors. This heterogeneous combination can yield excellent performance and efficiency, but can also be difficult to program. Indeed, DSP vendors typically do not enable users to program the coprocessors; rather, the coprocessors run software function libraries developed by the chip supplier.

Mobile “Application Processor”

A mobile “application processor” is a highly integrated system-on-chip, typically designed primarily for smart phones but used for other applications. Application processors typically comprise a high-performance CPU core and a constellation of specialized coprocessors, which may include a digital signal processor, a GPU, a video processing unit (VPU), a 2D graphics processor, an image acquisition processor, etc.

These chips are specifically designed for battery-powered applications, and therefore place a premium on energy efficiency. In addition, because of the growing importance of and activity surrounding smartphone and tablet applications, mobile application processors often have strong software development infrastructure, including low-cost development boards, Linux and Android ports, etc. However, as with the digital signal processors discussed in the previous section, the specialized coprocessors found in application processors are usually not user-programmable, which limits their utility for vision applications.

FPGA with a CPU

FPGAs are flexible logic chips that can be reconfigured at the gate and block levels. This flexibility enables the user to craft computation structures that are tailored to the application at hand. It also allows selection of I/O interfaces and on-chip peripherals matched to the application requirements. The ability to customize compute structures, coupled with the massive amount of resources available in modern FPGAs, yields high performance coupled with good cost- and energy-efficiency.

Using FPGAs, however, is essentially a hardware design function, rather than a software development activity. FPGA design is typically performed using hardware description languages (Verilog or VHDL) at the register transfer level (RTL)—a very low level of abstraction. This makes FPGA design time-consuming and expensive, compared to using the other types of processors discussed in this article.

With that said, using FPGAs is getting easier, due to several factors. First, so-called “IP block” libraries—libraries of reusable FPGA design components—are becoming increasingly capable. In some cases, these libraries directly address vision algorithms. In other cases, they enable supporting functionality, such as video I/O ports or line buffers. Also, FPGA suppliers and their partners increasingly offer reference designs—reusable system designs incorporating FPGAs and targeting specific applications. Finally, high-level synthesis tools, which enable designers to implement vision and other algorithms in FPGAs using high-level languages, are increasingly effective. Users can implement relatively low-performance CPUs in the FPGA. And, in a few cases, FPGA manufacturers are integrating high-performance CPUs within their devices.

In Conclusion

With embedded vision, the industry is entering a “virtuous circle” of the sort that has characterized many other DSP application domains. Although there are few chips dedicated to embedded vision applications today, these applications are increasingly adopting high-performance, cost-effective processing chips developed for other applications, including digital signal processors, CPUs, FPGAs, and GPUs. As these chips continue to deliver more programmable performance per dollar and per watt, they will enable the creation of more high-volume embedded vision products. Those high-volume applications, in turn, will attract more attention from silicon providers, who will deliver even better performance, efficiency, and programmability. And the Embedded Vision Alliance will help empower engineers to harness these chips to create a wide variety of amazing new products.

For more information, please visit www.Embedded-Vision.com. Contact the Embedded Vision Alliance at info@Embedded-Vision.com or +1-510-451-1800.

Sidebar: The Embedded Vision Alliance

As discussed earlier, embedded vision technology has the potential to enable a wide range of electronic products that are more intelligent and responsive than before, and thus more valuable to users. It can add helpful features to existing products. And it can provide significant new markets for hardware, software and semiconductor manufacturers. The Embedded Vision Alliance, a worldwide organization of technology developers and providers, is working to empower engineers to transform this potential into reality in a rapid and efficient manner.

More specifically, the mission of the Alliance is to provide engineers with practical education, information, and insights to help them incorporate embedded vision capabilities into products. To execute this mission, the Alliance has developed a full-featured website, freely accessible to all and including (among other things) articles, videos, a daily news portal and a discussion forum staffed by a diversity of technology experts. Registered website users can receive the Embedded Vision Alliance's e-mail newsletter; they also gain access to the Embedded Vision Academy, containing numerous training videos, technical papers and file downloads, intended to enable those new to the embedded vision application space to rapidly ramp up their expertise.

A few examples of compelling content on the Embedded Vision Alliance website include:

  • "Introduction To Computer Vision Using OpenCV", the combination of a descriptive article, tutorial video, and executable software demo.
  • A three-part video interview with Jitendra Malik, Arthur J. Chick Professor of EECS at the University of California at Berkeley and a computer vision academic pioneer: Part 1, Part 2, Part 3.
  • Information on the definition and development of Cernium's Archerfish, a consumer-targeted and embedded vision-based surveillance system, in both article and video interview forms.

Vision-Based Gesture Recognition: An Ideal Human Interface for Industrial Control Applications

By Brian Dipert
Editor-In-Chief
Embedded Vision Alliance
Senior Analyst
BDTI

This article was originally published in Digi-Key's Microcontroller TechZone. An excerpt of it is reprinted here with the permission of Digi-Key.

Embedded vision, the evolution and extrapolation of computer-based vision systems that process and interpret meaning from still and video images, is poised to be the next big technology success story. Consider, for example, the image sensors and processors now commonly found in cellular phones, tablets, laptop computers and dedicated computer displays. Originally intended for video conferencing and photography, they are now being harnessed for additional applications, such as augmented reality.

Similarly, consider the burgeoning popularity of consumer surveillance systems, driven by steady improvements in cameras and their subsystems, as well as the increasingly user-friendly associated surveillance software and services. Also, as anyone who has recently shopped for an automobile already knows, image sensors are increasingly found in numerous locations around a vehicle, leveraged for parking assistance, rear-view safety, impending-collision alert, lane-departure warning, and other functions.

The same robust-featured and cost-effective image sensors, processors, memory devices, I/O transceivers, and other ICs used in the earlier-mentioned systems are equally available to developers of vision-inclusive industrial automation applications. Gesture-based human interfaces are ideal in many respects, and therefore increasingly common, in such environments. For one thing, they are immediately intuitive; why click on a mouse, or a button, or even slide your finger across a touch screen to flip pages or move within a menu page, when you can instead just sweep your hand through the air?

A gesture-based UI also dispenses with the environmental restrictions that often hamper a touch-based interface; water and other fluids, non-conductive gloves, dirt and germs, etc. However, a first-generation motion implementation such as that utilized by the Nintendo® Wii™ game console system has limitations of its own. An easy-to-lose, breakable, in-hand controller is required to implement the scheme. Additionally, the interface between the controller and the system, usually implemented via Bluetooth®, ZigBee® or some other RF wireless technology, is (like a touchscreen interface) vulnerable to functional degradation due to environmental EMI.

Instead, consider an image sensor-inclusive design. Vision-based gesture interfaces use the human body as the controller versus a dedicated piece of extra hardware, interpreting hand, arm, and other body movements. They are comparatively EMI-immune; all that you need to ensure is sufficient operator-to-equipment distance along with adequate ambient lighting. In addition to gesture-based control, and as with the earlier mentioned computers and cell phones, you can use facial recognition technology to not only "unlock" the system in response to the presence of a valid operator's visage but also custom-configure the system on the fly for any particular operator, logging into a specific user account, for example. They can also offer a more extensive suite of user control options than does a coarser-grained accelerometer- or gyroscope-based motion interface.
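As one hedged example of the facial-analysis front end such a system might use, the sketch below detects whether an operator's face is in view using OpenCV's stock Haar cascade; actual recognition and per-user configuration would build on top of it. The cascade file path and camera index are placeholders.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

int main()
{
    cv::CascadeClassifier faceDetector;
    if (!faceDetector.load("haarcascade_frontalface_default.xml")) return 1;

    cv::VideoCapture camera(0);                      // default capture device
    cv::Mat frame, gray;
    while (camera.read(frame)) {
        cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
        cv::equalizeHist(gray, gray);                // reduce sensitivity to ambient lighting

        std::vector<cv::Rect> faces;
        faceDetector.detectMultiScale(gray, faces, 1.1, 3);
        if (!faces.empty()) {
            // A face is in view; recognition / unlock / per-user setup would go here.
        }
        if (cv::waitKey(30) == 27) break;            // Esc exits
    }
    return 0;
}
```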

A Kinect case study

If your system employs a dual-image-sensor (i.e. stereo or 3-D) arrangement, your range of available gestures becomes even richer, encompassing not only horizontal and vertical movements but also depth discernment. Stereo sensor setups also enable facial recognition software to more accurately discern between a real-life human being and a photograph of a person. Microsoft® took a different approach, called structured light, to discern depth with the Kinect peripheral for the Xbox® 360 (see Figure 1).

Figure 1: Microsoft's Kinect peripheral for the Xbox 360 game console, a well-known embedded vision success story (a), combines both monochrome and Bayer-patterned full color image sensors, along with an infrared transmitter for structured light depth discernment (b). Further dissection by iFixit revealed additional component details (c). (Courtesy Microsoft and iFixit, respectively).

Kinect is one of the best-known embedded vision examples, selling eight million units in its first 60 days on the market beginning in early November 2010. It is not currently an industrial automation device, at least officially, although hackers' efforts have notably broadened its usefulness beyond its game console origins. Microsoft plans to unveil an official SDK for the Windows® 7 operating system this year, along with a PC-optimized product variant. Regardless, the design trade-offs and decisions made by Microsoft are instructive to others developing vision-based user interface hardware and software.

For the remainder of this article, please visit Digi-Key's website.

OpenCV on TI’s DSP+ARM® Platforms: Mitigating the Challenges of Porting OpenCV to Embedded Platforms

By Joseph Coombs and Rahul Prabhu
Texas Instruments
This is a reprint of a Texas Instruments-published white paper, which is also available here (365 KB PDF).

Abstract

In today’s advancing market, the growing performance and decreasing price of embedded processors are opening many doors for developers to design highly sophisticated solutions for different end applications. The complexities of these systems can create bottlenecks for developers in the form of longer development times, more complicated development environments and issues with application stability and quality. Developers can address these problems using sophisticated software packages such as OpenCV, but migrating this software to embedded platforms poses its own set of challenges.

This paper will review how to mitigate some of these issues, including C++ implementation, memory constraints, floating-point support and opportunities to maximize performance using vendor-optimized libraries and integrated accelerators or co-processors. Finally, we will introduce a new effort by Texas Instruments (TI) to optimize vision systems by running OpenCV on the C6000™ digital signal processor (DSP) architecture. Benchmarks will show the advantage of using the DSP by comparing the performance of a DSP+ARM® system-on-chip (SoC) processor against an ARM-only device.

Introduction

OpenCV is a free and open-source computer vision library that offers a broad range of functionality under the permissive Berkeley Software Distribution (BSD) license. The library itself is written in C++ and is also usable through C or Python language applications. Thousands of developers use OpenCV to power their own specialized applications, making it the most widely used library of its kind. The OpenCV project is under active development, with regular updates to eliminate bugs and add new functionality. The mainline development effort targets the x86 architecture and supports acceleration via Intel’s proprietary Integrated Performance Primitives (IPP) library. A recent release also added support for graphics processing unit (GPU) acceleration using NVIDIA’s Compute Unified Device Architecture (CUDA) standard.

OpenCV’s greatest asset is the sheer breadth of algorithms included in its standard distribution. Figure 1 shows an incomplete list of some of the key function categories included in OpenCV. These range from low-level image filtering and transformation to sophisticated feature analysis and machine learning functionality. A complete listing of every function and use case is beyond the scope of this article, but we will consider the unique requirements of developers in the embedded vision space. For these developers, OpenCV represents an attractively comprehensive toolbox of useful, well-tested algorithms that can serve as building blocks for their own specialized applications. The question then becomes whether or not OpenCV can be used directly in their embedded systems.

Figure 1. Partial overview of the OpenCV library

Despite having been developed originally for PC workstations, OpenCV can also be a useful tool for embedded development. There are vendor-specific libraries that offer OpenCV-like capabilities on various embedded systems, but few can match OpenCV’s ubiquity in the computer vision field or the sheer breadth of its included algorithms. It should come as no surprise that OpenCV has already been ported to the ARM® architecture, a popular CPU choice for embedded processors. It is certainly possible to cross-compile the OpenCV source code as-is and use the result with embedded devices, but memory constraints and other architectural considerations may pose a problem. This white paper will examine some of the specific obstacles that must be overcome for OpenCV to achieve acceptable performance on an embedded platform. Finally, the paper will describe a new effort by Texas Instruments (TI) to bring OpenCV to its C6000™ digital signal processor (DSP) architecture. Performance benchmarks will compare TI’s DSP+ARM® system-on-chip (SoC) processor against the standard ARM-only approach.

Changing Requirements of Embedded Vision Applications

The continued growth of embedded vision applications places contradictory demands on embedded developers. Increasingly sophisticated vision algorithms require more memory and processing power, but price and deployment constraints require embedded devices that cost less money and consume less power. Embedded hardware and software expand in complexity while development cycles accelerate and contract. The following applications are representative of the current state and future direction of the overall embedded vision space.

Let’s start with industrial vision applications. One common industrial vision task is assembly line inspection, which detects, classifies and sorts objects to maximize manufacturing speed and quality. These vision algorithms are often run on costly computer workstations; migrating to an embedded DSP is one obvious way to save on price and power consumption. Even applications that are already implemented with embedded systems can be improved by condensing discrete logic into the DSP. For example, many industrial vision systems share the basic structure illustrated by Figure 2. The image signal processor (ISP) is a field programmable gate array (FPGA) that performs time-critical pre-processing on incoming data before it reaches the DSP. This FPGA becomes more expensive and consumes more power in proportion to its workload. One way to maximize the efficiency of the overall embedded system is to integrate as much pre-processing as possible into the DSP. The challenge then becomes keeping up with rapid improvements in the physical system. Next-generation systems must process more data in less time to accommodate improved camera resolution and frame rate as well as faster assembly line speeds.

Figure 2. Typical embedded vision system, including camera, pre-processing FPGA and DSP

Video surveillance applications provide another perspective on the evolution of embedded vision. Traditional surveillance systems are less concerned with vision analytics than they are with simply encoding and recording video data. However, as vision algorithms improve, video surveillance will incorporate more automated monitoring and analysis of this recorded data. Examples range from motion and camera tamper detection to people counting and license plate reading. These algorithms enable so-called metadata streaming, or creating automated logs of detected activity to accompany streamed and recorded video data. As vision algorithms become more capable and reliable, video surveillance systems will become more automated and sophisticated. This presents a particular challenge to embedded video surveillance systems, since cutting-edge algorithms that are developed on PCs may require considerable rework and optimization to run efficiently on an embedded device. Consequently, many embedded video surveillance applications are limited to the simpler encode-and-record paradigm.

One last example application from the broad category of embedded vision is automotive vision. Unlike the previously discussed application spaces, automotive vision is almost exclusively the domain of embedded processors. Many automotive vision systems can be reduced to a block diagram similar to Figure 2, essentially consisting of a camera, a pre-processing FPGA and a DSP to apply intensive vision algorithms. Reliability is the key concern in applications such as lane departure warning, steering assistance and proximity detection. The vision algorithms used in automotive vision are under constant, active development using high-level PC software, but running the final application on a PC is simply not an option. The transition from PC to DSP is a critical step in the development of automotive vision applications. Writing and rewriting algorithms to achieve acceptable real-time performance is a major development focus. This only gets more difficult as embedded systems become more sophisticated, incorporating multiple camera inputs and multiple processing cores.

Efficient DSP software plays a critical role in all embedded vision applications. The prospect of using high-level software like OpenCV to facilitate rapid algorithm development is appealing, but optimizing that software for a new platform is a critical sticking point. Conversely, achieving acceptable performance with un-optimized DSP software is simply unrealistic. In the next section of this article, we consider the key challenges associated with porting and optimizing sophisticated PC software — particularly the OpenCV library — to run on an embedded device.

Challenges of Porting OpenCV to Embedded Devices

Since OpenCV is open source and written entirely in C/C++, the library has been cross compiled and ported as-is to a variety of platforms. However, simply rebuilding the library for an embedded platform may not yield the real-time performance demanded in that space. At the same time, rewriting and manually optimizing the entire OpenCV library for a new architecture represents an enormous amount of work. Device-appropriate optimizing compilers are critical to navigate between these opposing challenges. The ubiquitous GNU Compiler Collection (GCC) has been used to successfully port OpenCV to ARM platforms, but GCC is not available on more specialized DSP architectures. These devices typically rely on proprietary compilers that are not as full-featured or standards-compliant as GCC. These compilers may have a strong focus on the C language and be less capable at optimizing C++ code.

The current version of OpenCV relies heavily on C++ Standard Template Library (STL) containers as well as GCC and C99 extensions, which are not well supported on certain embedded compilers. For these reasons, it may be necessary to revert to OpenCV version 1.1 or earlier — which are written almost entirely in C — when targeting a specialized embedded platform. The OpenCV source code includes many low-level optimizations for x86 processors that are not applicable to ARM® or DSP platforms. These optimizations can be replaced with vendor-provided support libraries or intrinsic functions that make explicit use of architecture-specific single instruction, multiple data (SIMD) commands to speed up code execution. OpenCV application programming interfaces (APIs) often allow data to be provided in multiple formats, which can complicate the task of optimizing these functions for a new target device. Limiting these functions to a single data type or splitting them into single-type variants can allow the compiler to generate simpler, more efficient code. Similarly, in-lining small, frequently used internal functions can provide a performance lift to high-level vision functions.
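To make the single-type idea concrete, the hedged sketch below contrasts a generic routine, which resolves its data type at run time, with a fixed-type variant that a DSP compiler can more easily unroll and vectorize. The function names and the simple scaling operation are illustrative only and are not taken from the OpenCV sources.

    /* Hypothetical illustration (not from the OpenCV sources): splitting a
     * multi-type routine into a fixed-type variant that an embedded C compiler
     * can optimize more aggressively. */

    #include <stddef.h>
    #include <stdint.h>

    /* Generic form: the element type is resolved at run time, which blocks
     * many compiler optimizations such as SIMD code generation. */
    void scale_generic(const void *src, void *dst, size_t n, int is_float, float gain)
    {
        size_t i;
        if (is_float) {
            const float *s = (const float *)src;
            float *d = (float *)dst;
            for (i = 0; i < n; i++)
                d[i] = s[i] * gain;
        } else {
            const uint8_t *s = (const uint8_t *)src;
            uint8_t *d = (uint8_t *)dst;
            for (i = 0; i < n; i++)
                d[i] = (uint8_t)(s[i] * gain);
        }
    }

    /* Single-type variant: the fixed 8-bit element type and the restrict
     * qualifiers give the optimizer enough information to unroll the loop and,
     * on DSPs with SIMD support, pack several pixels per instruction. */
    void scale_8u(const uint8_t *restrict src, uint8_t *restrict dst, size_t n, float gain)
    {
        size_t i;
        for (i = 0; i < n; i++)
            dst[i] = (uint8_t)(src[i] * gain);
    }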

The word “optimization” for embedded platforms often means endlessly poring over low-level architectural minutiae to write and tweak device-specific assembly language code. Fortunately, as embedded processors have grown in complexity, embedded development tools have become more powerful and user-friendly. Most vendors in the embedded industry provide optimized libraries that have been hand-tuned to provide the best performance on the device for low-level math, image and vision functionality. Coupling the OpenCV library with these libraries can accelerate high-level OpenCV APIs. TI is one of the few companies that provide vision and imaging libraries that can replace a portion of the code for an OpenCV function or, in some cases, the entire function itself. Similarly, optimized math and signal processing libraries can also provide a significant boost to maximize the potential of OpenCV functions on embedded devices. Using these optimized libraries underneath the OpenCV APIs can maximize performance by utilizing architecture-specific capabilities while maintaining the standard interface of the high-level software. In other words, these low-level libraries can accelerate OpenCV functions without breaking pre-existing application code that is written to use standard OpenCV APIs.
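As a sketch of how an optimized library can sit underneath a standard interface, the fragment below keeps an OpenCV-compatible entry point and dispatches to a vendor kernel only when the data layout matches. Here vendor_dilate_8u is a placeholder name, not a function from VLIB, IMGLIB or any other real library.

    /* Illustrative pattern only: accelerating one operation with a vendor
     * kernel while preserving the standard OpenCV interface. vendor_dilate_8u
     * is a placeholder for whatever optimized routine a silicon vendor ships. */

    #include <opencv/cv.h>

    extern void vendor_dilate_8u(const unsigned char *src, unsigned char *dst,
                                 int width, int height, int stride);

    void my_dilate(const IplImage *src, IplImage *dst)
    {
        /* Use the optimized path only for the layout the vendor kernel expects;
         * otherwise fall back to stock OpenCV so existing application code
         * keeps working unchanged. */
        if (src->depth == IPL_DEPTH_8U && src->nChannels == 1 &&
            src->widthStep == dst->widthStep) {
            vendor_dilate_8u((const unsigned char *)src->imageData,
                             (unsigned char *)dst->imageData,
                             src->width, src->height, src->widthStep);
        } else {
            cvDilate(src, dst, NULL, 1);  /* default 3x3 structuring element */
        }
    }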

Another challenge often faced when using OpenCV functions in an embedded processor environment deals with the lack of native support for floating-point math. This poses a significant problem for OpenCV since it includes a number of specialized image processing functions that rely heavily on floating-point computation. OpenCV supports a wide range of image data types, including fixed- and floating-point representations. Many OpenCV image-processing functions never use floating-point math, or use it only when the image data consists of floating-point values. However, some specialized functions that work with eigenvalues, feature spaces, image transformation and image statistics always use floating-point math regardless of the original image data type. These intensive algorithms require native floating-point support to achieve real-time performance in an embedded application. Figure 3 compares the performance of several OpenCV functions that rely on floating-point processing across multiple embedded targets. The ARM9™ processor used lacks native floating-point support, while the ARM Cortex™-A8 processor includes NEON support for floating-point math and delivers a twofold increase in performance. Also included is TI’s floating-point C674x DSP, which is highly optimized for intensive computation and delivers an even greater boost to performance. These benchmarks emphasize the need for native floating-point support when running certain OpenCV algorithms.

Function Name              | ARM9™ (ms) | ARM Cortex-A8 (ms) | C674x DSP (ms)
cvCornerEigenValsandVecs   | 4746       | 2655               | 402
cvGoodFeaturesToTrack      | 2040       | 1234               | 268
cvWarpAffine               | 82         | 37                 | 17
cvOpticalFlowPyrLK         | 9560       | 5340               | 344
cvMulSpectrum              | 104        | 69                 | 11
cvHaarDetectObject         | 17500      | 8217               | 1180

Figure 3. Performance benchmark for OpenCV functions with floating-point math. Image size 320×240; all cores operated at 300 MHz; ARM9 and C674x DSP cores tested using TI’s OMAP-L138 C6-Integra™ DSP+ARM processor; ARM Cortex-A8 core tested using TI’s DM3730 DaVinci™ digital media processor.

Porting and running OpenCV on embedded systems also presents a more general set of design challenges. In addition to the processor architecture, there may also be memory restrictions and special requirements for deterministic, real-time operation. Multicore devices are also becoming more common in the embedded space, and utilizing these cores efficiently to maximize performance brings its own challenges. Embedded multicore devices may consist of homogeneous cores, such as dual-ARM devices, or they may integrate an ARM with a heterogeneous core such as a DSP or GPU. SoC devices also integrate peripherals and accelerators to reduce overall system complexity by simplifying board design and layout considerations. Many OpenCV functions can benefit greatly from utilizing these specialized processing cores and vector or floating-point accelerators. An algorithm that is highly parallelizable may be a good fit for an integrated GPU. Vision and image-processing algorithms that are not easily parallelized but still require intensive floating-point computation may be better suited for a DSP core. Low-level preprocessing functions like color space conversions, noise reduction and statistical computation tend to be well suited to single-purpose hardware like an FPGA or application-specific integrated circuit (ASIC). Embedded devices that allow developers to effectively split their application, including OpenCV, among the best-suited heterogeneous components can deliver superior performance.

Effectively using and sharing device memory is one of the primary challenges in embedded development. When both random-access memory (RAM) and read-only memory (ROM) are in short supply, applications must make judicious use of these resources. Many modern-day applications require a full operating system (OS) with its own sizeable footprint, which makes managing device memory even more critical. An embedded vision application using OpenCV needs a reasonably large amount of memory, with sufficient bandwidth and access time, to accommodate work buffers and program data for several interrelated tasks: data acquisition, processing, and storage or output of results. Moreover, OpenCV functions that operate on multi-dimensional data such as a feature space rather than the standard two- or three-dimensional image or video spaces can consume even larger blocks of memory. OpenCV developers on embedded devices must consider suitable tradeoffs between memory utilization and the full feature set of OpenCV. For example, some OpenCV APIs operate on a “memory storage” unit that is initially allocated with a fixed size and later expanded as necessary to prevent overflow as its contents grow.

Developers can avoid unnecessary allocation calls and memory fragmentation by creating the initial memory storage with enough space to handle the worst-case scenario. Other tradeoffs can be made that impose limits on OpenCV APIs in order to achieve better performance without compromising computational accuracy. For example, nested image regions in OpenCV are represented as sets of components known as contours and holes. Each contour may be contained within a hole and may itself contain one or more holes, and the reverse is true for each hole. Figure 4 illustrates this relationship. OpenCV supports multiple formats to store and traverse these regions, including branched representations that require developers to write complicated routines to plot or process the overall image. Developers can achieve better performance by creating a single-branch structure that can be traversed using a simple loop. Finally, OpenCV applications may suffer from memory leaks caused by sloppy handling of large data buffers. These leaks could waste hundreds of megabytes of highly valuable RAM and could eventually crash the entire application. Memory leaks commonly arise when allocating memory and then changing the pointer itself (thereby precluding the use of “free” APIs), forgetting to free storage space after processing is complete, or carelessly changing or translating pointers inside complex data structures. Memory leaks are problematic in any system, but the consequences are particularly dire in the embedded space.

Figure 4. Test image with contour/hole regions and tree structures supported by OpenCV
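The sketch below, written against the OpenCV 1.x C API, shows both practices described above: a memory storage pre-sized for a worst case, and contours retrieved with CV_RETR_LIST so that a single loop can traverse them. The 1 MB block size is an arbitrary example, not a recommendation.

    /* Sketch using the OpenCV 1.x C API; assumes binary_image is an 8-bit,
     * single-channel image (cvFindContours modifies it in place). */

    #include <opencv/cv.h>

    int count_contours(IplImage *binary_image)
    {
        /* Create the storage with a generous block size up front so the library
         * does not have to grow it (and fragment the heap) while contours are
         * being extracted. 1 MB is an arbitrary worst-case example. */
        CvMemStorage *storage = cvCreateMemStorage(1 << 20);
        CvSeq *contours = NULL;
        CvSeq *c;
        int count = 0;

        /* CV_RETR_LIST returns every contour at the same level, so one h_next
         * walk visits them all; no tree-traversal code is required. */
        cvFindContours(binary_image, storage, &contours, sizeof(CvContour),
                       CV_RETR_LIST, CV_CHAIN_APPROX_SIMPLE, cvPoint(0, 0));

        for (c = contours; c != NULL; c = c->h_next)
            count++;

        /* Releasing the storage frees every sequence allocated from it, which
         * avoids the per-buffer leaks described above. */
        cvReleaseMemStorage(&storage);
        return count;
    }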

Multicore embedded processors provide increased performance by increasing the raw processing power available to applications, but significant challenges face embedded developers who want to use that power to accelerate OpenCV. The primary challenge when migrating to the multicore paradigm is properly partitioning the overall program and coordinating its pieces as they run independently. The simplest case is a system that consists of two separate processing units, such as two ARM cores or an ARM and a DSP. In this case, the problem is often approached as writing a normal, single-core application and then “offloading” parts of that application to the other core. An important criterion for offloading a task from one core to the other is the inter-processor communication (IPC) overhead: offloading is appropriate only if the time spent sending and receiving IPC messages does not exceed the time saved by splitting the processing load. Applications must also be multi-threaded to utilize multiple processor cores, and multi-threaded applications need careful coordination of their tasks to run efficiently. However, the performance increase offered by parallelization is limited in most vision algorithms because much of the application must be executed serially. Cache coherency, address translation and endianness translation between processors are further issues a developer may encounter when designing a multicore application.
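The offload criterion can be captured in a one-line check. The sketch below is a toy model; the timing values are whatever the developer has measured for a particular task on a particular system.

    #include <stdbool.h>

    /* Offloading pays off only when remote execution plus the IPC round trip
     * beats running the task locally. */
    bool worth_offloading(double local_ms, double remote_ms, double ipc_roundtrip_ms)
    {
        return (remote_ms + ipc_roundtrip_ms) < local_ms;
    }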

Certain data types in OpenCV pose a significant challenge to heterogeneous multicore systems. OpenCV defines several data types for its input/output (I/O) and processing operations that typically utilize a header/data format. Figure 5 shows a dynamic structure used by OpenCV that stores data as a simple linked list. Each list node consists of some data and pointers, or links, to neighboring list nodes. These links can be problematic when sharing lists between separate processing cores that do not share the same memory management unit (MMU). In order to share this data between the cores, pointers used by one core must be translated so that they can be understood by the other core. This address translation must then be reversed when data returns from the second core to the first. Cache coherence between the two cores is also an issue when data is passed back and forth. Additionally, internal OpenCV allocation APIs may need to be modified to ensure that data is placed in sections of memory that are equally accessible by both cores.

Figure 5. Memory storage organization in OpenCV

In addition, OpenCV pre-allocates a memory storage in which the dynamic data structure is formed, and further allocates memory if the linked list outgrows the pre-allocated memory. Delegation of such a task from a master core to a slave core creates the added complication of feeding the newly allocated memory information back to the memory space of the master core. Compiler-based parallelism offered by OpenMP and API-based task offloading offered by OpenCL are currently being evaluated for OpenCV implementation on multiple cores.

Multicore SoCs often feature heterogeneous processors that access shared external memory simultaneously. For this reason, developers using OpenCV in SoC applications must consider memory bandwidth in addition to memory capacity. Application performance depends on how quickly and efficiently memory is accessed. Simply adding more memory to a system won’t always help. Direct Memory Access (DMA) adds additional channels through which the processing cores can access external memory, which allows designers to increase bandwidth and reduce contention between cores. Through the use of enhanced DMA units, the processor does not have to directly control repetitive, low-level memory access. Figure 6 shows the performance improvement gained by using DMA to accelerate external memory access in three common image-processing algorithms. The test image is divided into slices and moved from external memory to internal RAM by DMA, processed and then copied out again by DMA. The performance using this method is much improved over processing the same image in-place in external memory.

Function              | Slice-based processing with DMA (ms) | In-place processing with cache (ms)
Gaussian filtering    | 6.1                                  | 7.7
Luma extraction       | 3.0                                  | 10.9
Canny edge detection  | 41.4                                 | 79.1

Figure 6. Performance benchmarks for three image-processing algorithms with and without DMA on TI’s OMAP3530 DaVinci™ digital media processor at 720 MHz
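The slice-based pattern benchmarked in Figure 6 can be sketched as follows. dma_copy_async() and dma_wait() stand in for whatever DMA driver calls a given platform provides; they are placeholders rather than real functions, and a production version would double-buffer the transfers so that the copy of one slice overlaps the processing of another.

    #include <stddef.h>
    #include <stdint.h>

    #define SLICE_LINES 16

    /* Placeholder DMA helpers: start an asynchronous copy, then wait for it. */
    void dma_copy_async(void *dst, const void *src, size_t bytes);
    void dma_wait(void);

    /* Process one slice held in fast internal RAM (defined elsewhere). */
    void process_slice(uint8_t *slice, int width, int lines);

    void process_image_by_slices(uint8_t *ext_image, int width, int height,
                                 uint8_t *internal_buf /* on-chip RAM */)
    {
        int y;
        for (y = 0; y < height; y += SLICE_LINES) {
            int lines = (height - y < SLICE_LINES) ? (height - y) : SLICE_LINES;
            size_t bytes = (size_t)width * (size_t)lines;

            /* Bring the slice into fast internal memory... */
            dma_copy_async(internal_buf, ext_image + (size_t)y * width, bytes);
            dma_wait();

            /* ...operate on it at internal-RAM speed... */
            process_slice(internal_buf, width, lines);

            /* ...and write the result back out to external memory. */
            dma_copy_async(ext_image + (size_t)y * width, internal_buf, bytes);
            dma_wait();
        }
    }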

Given the challenges inherent in bringing OpenCV to embedded devices, it is worth investigating other computer vision offerings that already exist in the embedded space. The next section of this article examines TI-provided alternatives to OpenCV. These packages are smaller than OpenCV, but they show the performance that is possible on embedded devices with highly optimized software and a deep understanding of the underlying architecture.

TI’s Other Vision Offerings In the Embedded Space

Separate from OpenCV, TI provides optimized libraries to help developers achieve real-time performance with vision and image-processing applications on TI’s embedded devices. The proprietary Vision Library (VLIB) and open source Image Library (IMGLIB) are separate collections of algorithms that are optimized for TI’s C64x+™ DSPs. IMGLIB is distributed with full source code, a combination of optimized C and assembly that can be modified and rebuilt for newer DSP architectures, including C674x and C66x, to take advantage of all available architectural resources. TI also provides example application code to set up dual-buffered DMA transfers, which can speed up the image and vision kernels by 4 to 10 times compared to operating on data in external memory. These libraries are designed to convert most floating-point processing into fixed-point approximations in order to utilize SIMD extensions available in the C64x+ instruction set.

Despite the availability of these proprietary vision software offerings, OpenCV has the benefit of broad industry familiarity. Additionally, OpenCV boasts a development community actively contributing fixes and enhancements to the library, which continually improves and expands its capabilities and feature set. OpenCV has already been ported to several general-purpose processors (GPPs), including ARM®, but obtaining real-time performance often requires additional assistance from dedicated accelerators or co-processors on embedded devices. In the embedded space, DSP+ARM SoC processors and other multicore devices with high-performance, floating-point DSPs or hardware accelerators are excellent platforms to accelerate OpenCV processing. Vision developers can utilize each core as appropriate to maximize the overall performance of their embedded system. Properly balancing processing and I/O tasks between cores can allow embedded developers to obtain real-time vision performance using OpenCV. The next section describes one effort to port and optimize OpenCV for TI’s DSP+ARM® SoC processors.

DSP acceleration of OpenCV on TI’s C6-Integra™ DSP+ARM processors

TI’s C6-Integra DSP+ARM processors are an attractive target for porting OpenCV to the embedded space due to their processing capabilities, high levels of integration and low power requirements. These processors allow application developers to exploit the strengths of two embedded processor cores. The ARM runs Linux and acts as a GPP, managing I/O transactions such as video input and output and a USB-based user interface. Meanwhile, the floating-point DSP acts as a processing engine, enabling real-time performance for OpenCV functions. Properly utilizing the power of the DSP core presents two major challenges: coordinating basic communication between heterogeneous processing cores, and passing large data buffers from one memory space to the other. TI provides software solutions for both of these problems.

Figure 7. High-level view of a C6-Integra DSP+ARM application using the C6EZAccel framework

C6EZAccel is a software development tool from TI that provides ARM-side APIs that call into optimized DSP libraries. This abstracts the low-level complexities of heterogeneous multi-core development, including IPC. The DSP side of C6EZAccel consists of an algorithm server that waits to receive messages from the ARM. Each message specifies one or more functions to be executed and provides the data buffers and configuration parameters to be used. C6EZAccel allows the ARM application to specify data using the standard OpenCV data types. Figure 7 gives a high-level view of C6EZAccel used by a C6-Integra DSP+ARM processor. The C6EZAccel tool also supports asynchronous calls to OpenCV APIs so that DSP processing can occur in parallel to other work on the ARM side. When used in asynchronous mode, C6EZAccel APIs save context information before starting DSP processing. The ARM application can then poll to check for DSP completion and use its saved context to restore data structures and pointers returning from the DSP. Figure 8 illustrates how asynchronous processing on the DSP can greatly accelerate the overall application. The DSP side algorithm links with a static OpenCV library that is built from the mainline OpenCV source code with minimal modifications using TI’s optimizing C compiler. There is a lot of room to further optimize the DSP side OpenCV library by rewriting OpenCV functions with the DSP architecture in mind, but the compiler-optimized library provides a useful starting point that developers can start exploring today.

Figure 8. Asynchronous DSP processing accelerates an embedded application

In order to easily call OpenCV APIs on the DSP, the ARM application also uses its own version of the OpenCV library. This library is used to load and prepare data for processing, as well as to call simple APIs that do not necessitate using the DSP. C6EZAccel also includes a custom version of OpenCV’s cvAlloc function that is statically linked into the ARM application to override the default behavior and allocate contiguous data buffers using a Linux module called CMEM. This design allows the ARM application to freely share OpenCV-allocated data buffers with the DSP without modifying and rebuilding the entire ARM side OpenCV library.
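One way to approximate this arrangement without TI’s packaged override is sketched below: OpenCV 1.x exposes a cvSetMemoryManager() hook, and the allocation callbacks can be pointed at CMEM. The CMEM call signatures and header path shown here are assumptions based on TI’s published CMEM module, and a real application would fill in the CMEM_AllocParams with the pool or heap settings appropriate for its system.

    /* Alternative sketch: route OpenCV's heap allocations into physically
     * contiguous CMEM buffers via the OpenCV 1.x memory-manager hook. The
     * CMEM signatures and header path are assumed, not verified. */

    #include <opencv/cxcore.h>
    #include <ti/cmem.h>              /* assumed header for the CMEM user API */

    static CMEM_AllocParams params;   /* real code would initialize this (for
                                         example from CMEM_DEFAULTPARAMS) */

    static void *cmem_alloc_hook(size_t size, void *userdata)
    {
        (void)userdata;
        return CMEM_alloc(size, &params);   /* physically contiguous block */
    }

    static int cmem_free_hook(void *ptr, void *userdata)
    {
        (void)userdata;
        CMEM_free(ptr, &params);
        return 0;
    }

    void use_contiguous_buffers(void)
    {
        /* From here on, cvAlloc/cvCreateImage/etc. draw from CMEM, so the
         * resulting buffers can be handed to the DSP without copying. */
        cvSetMemoryManager(cmem_alloc_hook, cmem_free_hook, NULL);
    }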

Sharing OpenCV structures and data buffers between the ARM and the DSP requires two additional steps: address translation and cache management. Address translation involves converting virtual memory pointers on the ARM side to physical addresses that the DSP can interpret, then restoring the virtual address after DSP processing so that the data can be read and reused later in the ARM application. Cache management maintains data coherence between the independent ARM and DSP applications by writing back and invalidating cached memory that has been or will be modified by the other core. C6EZAccel ensures cache coherence in the ARM application by invalidating output buffers and writing back and invalidating input buffers prior to invoking the DSP side OpenCV APIs. Some OpenCV data structures require additional massaging before they can be passed on to the IPC framework; C6EZAccel takes care of this work as well. All of these tasks are handled transparently by C6EZAccel, so the ARM application looks very similar to an “ordinary” OpenCV application outside the embedded space.
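The sketch below illustrates the two steps for a single image buffer. virt_to_phys(), cache_wb_inv() and cache_inv() are placeholders for the platform’s address-translation and cache-maintenance services; C6EZAccel performs the equivalent work internally, so application code normally never writes this by hand.

    #include <opencv/cxcore.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Placeholder platform services. */
    uint32_t virt_to_phys(void *virt_addr);
    void cache_wb_inv(void *addr, size_t bytes);   /* write back + invalidate */
    void cache_inv(void *addr, size_t bytes);      /* invalidate only */

    /* Input buffer the DSP will read: flush the ARM cache so external memory
     * holds the latest pixels, then hand the DSP a physical address. */
    uint32_t prepare_dsp_input(IplImage *img)
    {
        size_t bytes = (size_t)img->widthStep * (size_t)img->height;
        cache_wb_inv(img->imageData, bytes);
        return virt_to_phys(img->imageData);
    }

    /* Output buffer the DSP will write: drop the ARM's stale cached copy so
     * the ARM re-reads the buffer from memory after the DSP finishes. */
    uint32_t prepare_dsp_output(IplImage *img)
    {
        size_t bytes = (size_t)img->widthStep * (size_t)img->height;
        cache_inv(img->imageData, bytes);
        return virt_to_phys(img->imageData);
    }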

The current performance of OpenCV on an ARM Cortex-A8 versus a DSP is summarized in Figure 9. Note that the DSP side OpenCV library is largely unoptimized, so there is a lot of room for future improvement. Even so, early results are promising; the DSP yields a significant improvement over the ARM-only OpenCV library.

OpenCV Function            | ARM Cortex™-A8 with NEON (ms) | ARM Cortex-A8 with C674x DSP (ms) | Performance Improvement (cycle reduction) | Performance Improvement (x-factor)
cvWarpAffine               | 52.2     | 41.24    | 21.0%  | 1.25
cvAdaptiveThreshold        | 85.029   | 33.433   | 60.68% | 2.54
cvDilate                   | 3.354    | 1.340    | 60.47% | 2.50
cvErode                    | 3.283    | 2.211    | 32.65% | 1.48
cvNormalize                | 52.258   | 14.216   | 72.84% | 3.68
cvFilter2D                 | 36.21    | 11.838   | 67.3%  | 3.05
cvDFT                      | 594.532  | 95.539   | 83.93% | 6.22
cvCvtColor                 | 16.537   | 14.09    | 14.79% | 1.17
cvMulSpectrum              | 89.425   | 15.509   | 78.18% | 5.76
cvIntegral                 | 8.325    | 5.789    | 30.46% | 1.44
cvSmooth                   | 122.57   | 57.435   | 53.14% | 2.14
cvHoughLines2D             | 2405.844 | 684.367  | 71.55% | 3.52
cvCornerHarris             | 666.928  | 168.57   | 74.72% | 3.91
cvCornerEigenValsandVecs   | 3400.336 | 1418.108 | 58.29% | 2.40
cvGoodFeaturesToTrack      | 19.378   | 4.945    | 74.48% | 4.29
cvMatchTemplate            | 1571.531 | 212.745  | 86.46% | 7.43
cvMatchshapes              | 7.549    | 3.136    | 58.45% | 2.48

Figure 9. Performance benchmark for OpenCV functions on ARM Cortex-A8 (with NEON) versus C674x DSP. Image resolution: 640×480; ARM compiler: CS2009 (with –o3, -mfpu=neon); DSP compiler: TI CGT 7.2 (with –o3); both cores tested using TI TMSC6A816x C6-Integra™ DSP+ARM processor (ARM: 1 GHz, DSP: 800 MHz)

Conclusion

OpenCV is among the largest and most widely used tools in computer vision applications, and it has already started to migrate from servers and desktop PCs to the increasingly capable world of embedded devices. This paper has examined some of the key challenges faced by OpenCV in that transition, including tighter system constraints and difficulty in effectively utilizing custom embedded architectures. It has also shown the performance advantage developers can achieve by running OpenCV on a DSP compared to an ARM-only approach. Texas Instruments is currently accelerating OpenCV on its DSP and DSP+ARM platforms, offering vision developers an embedded hardware solution with high performance, high integration and low power consumption as well as a user-friendly framework with which developers can implement OpenCV. TI’s support of OpenCV for its DSP and DSP+ARM platforms provides a great opportunity for embedded developers to address their performance, power and integration challenges and create a unique niche in the world of embedded vision.

Embedded Vision: FPGAs' Next Notable Technology Opportunity

By Brian Dipert
Editor-In-Chief
Embedded Vision Alliance
Senior Analyst
BDTI

This article was originally published in the First Quarter 2012 issue (PDF) of the Xilinx Xcell Journal. It is reprinted here with the permission of Xilinx.

A jointly developed reference design validates the potential of Xilinx’s Zynq device in a burgeoning application category.

By Brian Dipert
Editor-In-Chief
Embedded Vision Alliance
dipert@embedded-vision.com

José Alvarez
Engineering Director, Video Technology
Xilinx, Inc.
jose.alvarez@xilinx.com

Mihran Touriguian
Senior DSP Engineer
BDTI (Berkeley Design Technology, Inc.)
touriguian@bdti.com

What up-and-coming innovation can help you design a system that alerts users to a child struggling in a swimming pool, or to an intruder attempting to enter a residence or business? It’s the same technology that can warn drivers of impending hazards on the roadway, and even prevent them from executing lane-change, acceleration and other maneuvers that would be hazardous to themselves and others. It can equip a military drone or other robot with electronic “eyes” that enable limited-to-full autonomous operation. It can assist a human physician in diagnosing a patient’s illness. It can uniquely identify a face, subsequently initiating a variety of actions (automatically logging into a user account, for example, or pulling up relevant news and other information), interpreting gestures and even discerning a person’s emotional state. And in conjunction with GPS, compass, accelerometer, gyroscope and other features, it can deliver a data-augmented presentation of a scene.

The technology common to all of these application examples is embedded vision, which is poised to enable the next generation of electronic-system success stories. Embedded vision got its start in traditional computer vision applications such as assembly line inspection, optical character recognition, robotics, surveillance and military systems. In recent years, however, the decreasing costs and increasing capabilities of key technology building blocks have broadened and accelerated vision’s penetration into key high-volume markets.

Driven by expanding and evolving application demands, for example, image sensors are making notable improvements in key attributes such as resolution, low-light performance, frame rate, size, power consumption and cost. Similarly, embedded vision applications require processors with high performance, low prices, low power consumption and flexible programmability, all ideal attributes that are increasingly becoming a reality in numerous product implementation forms. Similar benefits are being accrued by latest-generation optics systems, lighting modules, volatile and nonvolatile memories, and I/O standards. And algorithms are up to the challenge, leveraging these hardware improvements to deliver more robust and reliable analysis results.

Embedded vision refers to machines that understand their environment through visual means. By “embedded,” we’re referring to any image-sensor-inclusive system that isn’t a general-purpose computer. Embedded might mean, for example, a cellular phone or tablet computer, a surveillance system, an earth-bound or flight-capable robot, a vehicle containing a 360° suite of cameras or a medical diagnostic device. Or it could be a wired or wireless user interface peripheral; Microsoft’s Kinect for the Xbox 360 game console, perhaps the best-known example of this latter category, sold 8 million units in its first two months on the market.

THE FPGA OPPORTUNITY: A CASE STUDY

A diversity of robust embedded vision processing product options exist: microprocessors and embedded controllers, application-tailored SoCs, DSPs, graphics processors, ASICs and FPGAs. An FPGA is an intriguing silicon platform for realizing embedded vision, because it approximates the combination of the hardware attributes of an ASIC—high performance and low power consumption—with the flexibility and time-to-market advantages of the software algorithm alternative running on a CPU, GPU or DSP. Flexibility is a particularly important factor at this nascent stage in embedded vision’s market development, where both rapid bug fixes and feature set improvements are the norm rather than the exception, as is the desire to support a diversity of algorithm options. An FPGA’s hardware configurability also enables straightforward design adaptation to image sensors supporting various serial and parallel (and analog and digital) interfaces.

The Embedded Vision Alliance is a unified worldwide alliance of technology developers and providers chartered with transforming embedded vision’s potential into reality in a rich, rapid and efficient manner (see sidebar). Two of its founding members, BDTI (Berkeley Design Technology, Inc.) and Xilinx, partnered to co-develop a reference design that exemplifies not only embedded vision’s compelling promise but also the role that FPGAs might play in actualizing it. The goal of the project was to explore the typical architectural decisions a system designer would make when creating highly complex intelligent vision platforms containing elements requiring intensive hardware processing and complex software and algorithmic control.

BDTI and Xilinx partitioned the design so that the FPGA fabric would handle digital signal-processing-intensive operations, with a CPU performing complex control and prediction algorithms. The exploratory implementation described here connected the CPU board to the FPGA board via an Ethernet interface. The FPGA performed high-bandwidth processing, with only metadata interchanged through the network tether. This project also explored the simultaneous development of hardware and software, which required the use of accurate simulation models well ahead of the final FPGA hardware implementation.

PHASE 1: ROAD SIGN DETECTION

This portion of the project, along with the next phase, leveraged two specific PC-based functions: a simulation model of under-development Xilinx video IP blocks, and a BDTI-developed processing application (Figure 1). The input data consisted of a 720p HD resolution, 60-frame/second (fps) YUV-encoded video stream representing the images that a vehicle’s front-facing camera might capture. And the goal was to identify (albeit not “read” using optical character recognition, although such an added capability would be a natural extension) four types of objects in the video frames as a driver-assistance scheme:

  • Green directional signs
  • Yellow and orange hazard signs
  • Blue informational signs, and
  • Orange traffic barrels

Figure 1 - The first two phases of BDTI and Xilinx’s video-analytics proof-of-concept reference design development project ran completely on a PC.

The Xilinx-provided IP block simulation models output metadata that identified the locations and sizes of various-colored groups of pixels in each frame, the very same metadata generated by the final hardware IP blocks. The accuracy of many embedded vision systems is affected by external factors such as noise from imaging sensors, unexpected changes in illumination and unpredictable external motion. The mandate for this project was to allow the FPGA hardware to process the images and create metadata in the presence of external disturbances with parsimonious use of hardware resources, augmented by predictive software that would allow for such disturbances without decreasing detection accuracy.

BDTI optimized the IP blocks’ extensive set of configuration parameters for the particular application in question, and BDTI’s postprocessing algorithms provided further refinement and prediction capabilities. In some cases, for example, the hardware was only partially able to identify the objects in one frame, but the application-layer software continued to predict the location of the object using tracking algorithms. This approach worked very well, since in many cases the physical detection may not be consistent across time. Therefore, the intelligent software layer is the key to providing consistent prediction.

As another example, black or white letters contained within a green highway sign might confuse the IP blocks’ generic image-analysis functions, thereby incorrectly subdividing the sign into multiple-pixel subgroups (Figure 2). The IP blocks might also incorrectly interpret other vehicles’ rear driving or brake lights as cones or signs by confusing red with orange, depending on the quality and setup of the imaging sensor used for the application.

Figure 2 - Second-level, application-tailored algorithms refined the metadata coming from the FPGA’s video-analysis hardware circuits.

The BDTI-developed algorithms therefore served to further process the Xilinx-supplied metadata in an application-tailored manner. They knew, for example, what signs were supposed to look like (size, shape, color, pattern, location within the frame and so on), and therefore were able to combine relevant pixel clusters into larger groups. Similarly, the algorithms determined when it was appropriate to discard seemingly close-in-color pixel clusters that weren’t signs, such as the aforementioned vehicle brake lights.

PHASE 2: PEDESTRIAN DETECTION AND TRACKING

In the first phase of this project, the camera was in motion but the objects (that is, signs) being recognized were stationary. In the second phase targeting security, on the other hand, the camera was stationary but objects (people, in this case) were not. Also, this time the video-analytics algorithms were unable to rely on predetermined colors, patterns or other object characteristics; people can wear a diversity of clothing, for example, and come in various shapes, skin tones and hair colors and styles (not to mention might wear head-obscuring hats, sunglasses and the like). And the software was additionally challenged with not only identifying and tracking people but also generating an alert  when an individual traversed a digital “trip wire” and was consequently located in a particular region within the video frame (Figure 3).

Figure 3 - Pedestrian detection and tracking capabilities included a “trip wire” alarm that reported when an individual moved within a bordered portion of the video frame.

The phase 2 hardware configuration was identical to that of the earlier phase 1, although the software varied; a video stream fed simulation models of the video-analytics IP cores, with the generated metadata passing to a secondary algorithm suite for additional processing. Challenges this time around included:

  • Resolving the fundamental trade-off between unwanted noise and proper object segmentation
  • Varying object morphology (form and structure)
  • Varying object motion, both person-to-person and over time with a particular person
  • Vanishing metadata, when a person stops moving, for example, is blocked by an intermediary object or blends into the background pattern
  • Other objects in the scene, both stationary and in motion
  • Varying distance between each person and the camera, and
  • Individuals vs. groups, and dominant vs. contrasting motion vectors within a group

With respect to the “trip wire” implementation, four distinct video streams were particularly effective in debugging and optimizing the video-analytics algorithms:

  • “Near” pedestrians walking and reversing directions
  • “Near” pedestrians walking in two different directions
  • A “far” pedestrian with a moving truck that appeared, through a trick of perspective, to be of a comparable size, and
  • “Far” pedestrians with an approaching truck that appeared larger than they were

PHASE 3: HARDWARE CONVERSIONS AND FUTURE EVOLUTIONS

The final portion of the project employed Xilinx’s actual video-analytics IP blocks (in place of the earlier simulation models), running on the Spartan®-3A 3400 Video Starter Kit. A MicroBlaze™ soft processor core embedded within the Spartan-3A FPGA, augmented by additional dedicated-function blocks, implemented the network protocol stack. That stack handled the high-bit-rate and Ethernet-packetized metadata transfer to the BDTI-developed secondary processing algorithms, now comprehending both road sign detection and pedestrian detection and tracking. And whereas these algorithms previously executed on an x86-based PC, BDTI successfully ported them to an ARM® Cortex™-A8-derived hardware platform called the BeagleBoard (Figure 4).

Figure 4 - The final phase of the project migrated from Xilinx’s simulation models to actual FPGA IP blocks. BDTI also ported the second-level algorithms from an x86 CPU to an ARM-based SoC, thereby paving the path for the single-chip Zynq Extensible Processing Platform successor.

Embedded vision is poised to become the next notable technology success story for both systems developers and their semiconductor and software suppliers. As the case study described in this article suggests, FPGAs and FPGA-plus-CPU SoCs can be compelling silicon platforms for implementing embedded vision processing algorithms.


SIDEBAR: EMBEDDED VISION ALLIANCE SEES SUCCESS

Embedded vision technology has the potential to enable a wide range of electronic products that are more intelligent and responsive than before, and thus more valuable to users. It can add helpful features to existing products. And it can provide significant new markets for hardware, software and semiconductor manufacturers. The Embedded Vision Alliance, a unified worldwide organization of technology developers and providers, will transform this potential into reality in a rich, rapid and efficient manner.

The alliance has developed a full-featured website, freely accessible to all and including (among other things) articles, videos, a daily news portal and a multi-subject discussion forum staffed by a diversity of technology experts. Registered website users can receive the alliance’s monthly e-mail newsletter; they also gain access to the Embedded Vision Academy, containing numerous tutorial presentations, technical papers and file downloads, intended to enable new players in the embedded vision application space to rapidly ramp up their expertise.

Other envisioned future aspects of the alliance’s charter may include:

  • The incorporation, and commercialization, of technology breakthroughs originating in universities and research laboratories around the world,
  • The codification of hardware, semiconductor and software standards that will accelerate new technology adoption by eliminating the confusion and inefficiency of numerous redundant implementation alternatives,
  • Development of robust benchmarks to enable clear and comprehensive evaluation and selection of various embedded vision system building blocks, such as processors and software algorithms, and
  • The proliferation of hardware and software reference designs, emulators and other development aids that will enable component suppliers, systems implementers and end customers to develop and select products that optimally meet unique application needs.

For more information, please visit www.embedded-vision.com. Contact the Embedded Vision Alliance at info@embedded-vision.com and (510) 451-1800.

Gesture Recognition--First Step Toward 3D UIs?

by Dong-Ik Ko and Gaurav Agarwal
Texas Instruments

This article was originally published in the December 2011 issue of Embedded Systems Programming.

Gesture recognition is the first step to fully 3D interaction with computing devices. The authors outline the challenges and techniques to overcome them in embedded systems.

As touchscreen technologies become more pervasive, users are becoming more expert at interacting with machines. Gesture recognition takes human interaction with machines even further. It’s long been researched with 2D vision, but the advent of 3D sensor technology means gesture recognition will be used more widely and in more diverse applications. Soon a person sitting on the couch will be able to control the lights and TV with a wave of the hand, and a car will automatically detect if a pedestrian is close by. Development of 3D gesture recognition is not without its difficulties, however.

Limitations of (x,y) coordinate-based 2D vision
Designers of computer vision technology have struggled to give computers a human-like intelligence in understanding scenes. If computers don’t have the ability to interpret the world around them, humans cannot interact with them in a natural way. Key problems in designing computers that can "understand" scenes include segmentation, object representation, machine learning, and recognition.

Because of the intrinsic limitations of a 2D representation of scenes, a gesture recognition system has to apply various cues in order to acquire better results containing more useful information. While the possibilities include whole-body tracking, even when multiple cues are combined it is difficult to get much beyond hand-gesture recognition using only a 2D representation.

"z"(depth) innovation
The challenge in moving to 3D vision and gesture recognition has been obtaining the third coordinate, "z". One of the challenges preventing machines from seeing in 3D has been the state of image analysis technology. Today, there are three popular solutions to the problem of 3D acquisition, each with its own unique abilities and specific uses: stereo vision, structured light pattern, and time of flight (TOF). With the 3D image output from these technologies, gesture recognition technology becomes a reality.

Stereo vision: Probably the best-known 3D acquisition system is a stereo vision system. This system uses two cameras to obtain a left and right stereo image, slightly offset (on the same order as the human eyes are). By comparing the two images, a computer is able to develop a disparity image that relates the displacement of objects in the images. This disparity image, or map, can be either color-coded or gray scale, depending on the needs of the particular system.
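As a concrete illustration, a hedged sketch using the OpenCV C API is shown below; it assumes the left and right views are already rectified 8-bit grayscale images, and the block-matching parameters are example values only.

    #include <opencv/cv.h>

    /* Compute a disparity map from a rectified stereo pair using OpenCV's
     * block-matching stereo correspondence. Larger disparities correspond to
     * closer objects. */
    IplImage *compute_disparity(IplImage *left, IplImage *right)
    {
        /* 16-bit signed output: the block matcher reports fixed-point disparities. */
        IplImage *disparity = cvCreateImage(cvGetSize(left), IPL_DEPTH_16S, 1);

        CvStereoBMState *bm = cvCreateStereoBMState(CV_STEREO_BM_BASIC, 64);
        bm->SADWindowSize = 15;   /* matching-window size; an example value */

        /* For each pixel, find the horizontal shift that best aligns the views. */
        cvFindStereoCorrespondenceBM(left, right, disparity, bm);

        cvReleaseStereoBMState(&bm);
        return disparity;
    }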

Structured light pattern: Structured light patterns can be used for measuring or scanning 3D objects. In this type of system, a structured light pattern is illuminated across an object. This light pattern can be created using a projection of laser light interference or through the use of projected images. Using cameras similar to a stereo vision system allows a structured light pattern system to obtain the 3D coordinates of the object. Single 2D camera systems can also be used to measure the displacement of any single stripe and then the coordinates can be obtained through software analysis. Whichever system is used, these coordinates can then be used to create a digital 3D image of the shape.

Time of flight: Time of flight (TOF) sensors are a relatively new depth information system. TOF systems are a type of light detection and ranging (LIDAR) system and, as such, transmit a light pulse from an emitter to an object. A receiver determines the distance of the measured object, on a per-pixel basis, by calculating the travel time of the light pulse from the emitter to the object and back to the receiver.

TOF systems are not scanners in that they do not measure point to point. The TOF system takes in the entire scene at once to determine the 3D range image. With the measured coordinates of an object, a 3D image can be created and used in systems such as device control in areas like robotics, manufacturing, medical technologies, and digital photography.

Until recently, the semiconductor devices needed to implement a TOF system were not available. But today’s devices enable the processing power, speed, and bandwidth needed to make TOF systems a reality.
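In its simplest form, the TOF calculation reduces to halving the round-trip time and multiplying by the speed of light; real sensors typically infer that round trip from the phase shift of a modulated signal rather than timing a single pulse. A back-of-the-envelope version:

    #include <stdio.h>

    int main(void)
    {
        const double c = 299792458.0;     /* speed of light, m/s */
        double round_trip_s = 13.34e-9;   /* example round-trip time: ~13.34 ns */

        /* The pulse covers the emitter-to-object distance twice. */
        double distance_m = c * round_trip_s / 2.0;
        printf("distance: %.2f m\n", distance_m);   /* ~2.00 m for this example */
        return 0;
    }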

3D vision technologies
No single 3D vision technology is right for every application or market. Table 1 compares the different 3D vision technologies and their relative strengths and weaknesses regarding response time, software complexity, cost, and accuracy.

Stereo vision technology demands considerable software complexity to extract high-precision 3D depth data, processing that can be handled by digital signal processors (DSPs) or multicore scalar processors. Stereo vision systems can be low cost and fit in a small form factor, making them a good choice for devices like mobile phones and other consumer devices. However, stereo vision systems cannot deliver the accuracy and response time that other technologies can, so they’re not ideal for systems requiring high accuracy such as manufacturing quality-assurance systems.

Structured light technology is a good solution for 3D scanning of objects, including 3D computer aided design (CAD) systems. The software complexity associated with these systems can be addressed with hard-wired logic (such as an ASIC or FPGA), which entails expensive development and materials costs. The computational complexity also results in a slower response time. Structured light systems are better than other 3D vision technologies at delivering high levels of accuracy at the micro level.

TOF systems deliver a balance of cost and performance that is optimal for device control in areas like manufacturing and consumer electronics devices needing a fast response time. TOF systems typically have low software complexity, but expensive illumination parts (LEDs, laser diodes) and high-speed interface-related parts (fast ADCs, fast serial/parallel interfaces, fast PWM drivers) increase materials cost.

z & human/machine interface
With the addition of the "z" coordinate, displays and images become more natural and familiar to humans. What people see with their eyes on the display is similar to what their eyes see around them. Adding this third coordinate changes the types of displays and applications that can be used.

Displays:
Stereoscopic displays typically require the user to wear 3D glasses. The display provides a different image for the left and right eye, tricking the brain into interpreting a 3D image based on the two different images the eyes receive. This type of display is used in many 3D televisions and 3D movie theaters today.

Multiview displays do not require the use of special glasses. These displays project multiple images at the same time, each one slightly offset and angled so that a user sees a different projection of the same object from each viewing angle. These displays create a hologram-like effect and will deliver new 3D experiences in the near future.

Detection and applications:
The ability to process and display the "z" coordinate is enabling new applications, including gaming, manufacturing control, security, interactive digital signage, remote medical care, automotive, and robotic vision. Figure 1 depicts some application spaces enabled by body-skeleton and depth-map sensing.

Human gesture recognition (consumer): Human gesture recognition is a new, popular way to provide input to gaming, consumer, and mobile products. Users are able to interact with the device in a natural and intuitive way, leading to greater acceptance of the products. These human gesture recognition products use various resolutions of 3D data, from 160 x 120 pixels to 640 x 480 pixels at 30-60 fps. Software modules such as raw-to-depth conversion, two-hand tracking, and full body tracking require a digital signal processor (DSP) for efficient and fast processing of the 3D data to deliver real-time gaming and tracking.

Industrial: Most industrial applications for 3D vision, such as industrial and manufacturing sensors, use imaging systems ranging from a single pixel to several hundred thousand pixels. The 3D images can be manipulated and analyzed using DSP technology to determine manufacturing flaws or to choose the correct parts from a bin.

Interactive digital signage (pinpoint marketing tool): With interactive digital signage, companies will be able to use pinpoint marketing tools to deliver the content that is right for each customer. For example, as someone walks past a digital sign, an extra message may pop up on the sign to acknowledge the customer. If the customer stops to read the message, the sign could interpret that motion as interest in their product and deliver a more targeted message. Microphones would allow the billboard to detect and recognize key phrases to further pinpoint the delivered message.

These interactive digital signage systems will require a 3D sensor for full body tracking, a 2D sensor for facial recognition and microphones for speech recognition. Software for these systems will be run on higher-end DSPs and general-purpose processors (GPPs), delivering applications such as face recognition, full body tracking, and Flash media players as well as functionality like MPEG4 video decoding.

Medical (fault-free virtual/remote care): 3D vision will bring new and unprecedented applications to the medical field. A doctor will no longer be required to be in the same room as the patient. Using medical robotic vision enabled by high-accuracy 3D sensors, remote and virtual care will ensure that the best medical care is available to everyone, no matter where in the world they are located.

Automotive (safety): Recently, automotive technology has come a long way with 2D sensor technology in traffic signal, lane, and obstacle detection. With the advent of 3D sensing technology, "z" data from 3D sensors can significantly improve the reliability of scene analysis. With the inclusion of 3D vision systems, vehicles have new ways of preventing accidents, both day and night. Using a 3D sensor, a vehicle can reliably detect an object and determine if it is a threat to the safety of the vehicle and the passengers inside. These systems will require the hardware and software to support a 3D vision system as well as intensive DSP and GPP processing to interpret the 3D images in a timely manner to prevent accidents.

Video conferencing: Enhanced video conferencing of tomorrow will take advantage of 3D sensors to deliver a more realistic and interactive video conferencing experience. With an integrated 2D sensor as well as a 3D sensor and a microphone array, this enhanced video conferencing system will be able to connect with other enhanced systems to enable high-quality video processing, facial recognition, 3D imaging, noise cancellation and content players (Flash, etc.). With such intensive video and audio processing, DSPs with the right mix of performance and peripherals are needed to deliver the functionality required.

Technology processing steps
For many applications, both a 2D and 3D camera system will be needed to properly enable the technology. Figure 2 shows the basic data path of these systems. Getting the data from the sensors and into the vision analytics is not as simple as it seems from the data path. Specifically, TOF sensors require up to 16 times the bandwidth of 2D sensors, creating a significant input/output (I/O) bottleneck. Another bottleneck occurs in converting the raw 3D data into a 3D point cloud. Having the right combination of software and hardware to address these issues is critical for gesture recognition and 3D success. Today, this data path is realized in DSP/GPP combination processors along with discrete analog components and software libraries.

Challenges for 3D-vision embedded systems
Input challenges: As discussed, input bandwidth constraints are a considerable challenge for 3D-vision embedded systems. Additionally, there is no standardization for the input interface. Designers can choose to work with different options, including serial and parallel 2D sensor interfaces and general-purpose external memory interfaces. Until a standard input interface is developed with the best possible bandwidth, designers will have to work with what is available.

Two different processor architectures: The 3D depth map processing of Figure 2 can be divided into two categories: vision-specific, data-centric processing and upper-level application processing. Vision-specific, data-centric processing requires a processor architecture that can perform single instruction, multiple data (SIMD) operations, fast floating-point multiplication and addition, and fast search algorithms. A DSP is a perfect candidate for quickly and reliably performing this type of processing. For upper-level application processing, high-level operating systems (OSes) and stacks can provide the necessary feature set that the upper layer of any application needs.

Based on the requirements for both processor architectures, a system-on-chip (SoC) that provides a GPP + DSP + SIMD processor combination with high-data-rate I/O is a good fit for 3D vision processing, providing the necessary data-centric and upper-level application processing.

Lack of standard middleware: The world of middleware for 3D vision processing is a combination of many different pieces pulled together from multiple sources, including open source (for example, OpenCV) as well as proprietary commercial sources. Commercial libraries are targeted for body tracking applications, which is a specific application of 3D vision. No standardized middleware interface has been developed yet for all the different 3D vision applications.

Anything cool after "z"?
While no one questions the "cool" factor of 3D vision, researchers are already looking into new ways to see beyond, through, and inside people and objects. Using multi-path light analysis, researchers around the world are looking for ways to see around corners or objects. Transparency research will yield systems that are able to see through objects and materials. And with emotion detection systems, applications will be able to see inside the human mind to detect whether the person is lying.

The possibilities are endless when it comes to 3D vision and gesture recognition technologies. But the research will be for nothing if the hardware and middleware needed to support these exciting new technologies are not there. Moving forward, SoCs that provide a GPP + DSP + SIMD architecture will be able to deliver the right mix of processing performance with peripheral support and the necessary bandwidth to enable this exciting technology and its applications.

Dong-Ik Ko is a technical lead in the 3D vision business unit at Texas Instruments. He has more than 18 years of experience in industry and academic research on embedded system design and optimization methods. He has a master of science in electrical engineering from Korea University and a Ph.D. in computer engineering from the University of Maryland, College Park.

Gaurav Agarwal is a business development manager at Texas Instruments, where he identifies growth areas for the DaVinci digital media processor business. He holds a bachelor of technology in electrical engineering from the Indian Institute of Technology, Kanpur and a master of science in electronic engineering from the University of Maryland.

Dynamic Range And Edge Detection: An Example Of Embedded Vision Algorithms' Dependence On In-Camera Image Processing


By Michael Tusch
Founder and CEO
Apical Limited

This article expands on the theme initiated in the first article of this series: how the pixel processing performed inside cameras can either enhance or hinder the performance of embedded vision algorithms. Producing natural or otherwise aesthetically pleasing camera images is normally considered a task distinct from the various tasks encompassed by embedded vision. But the human visual system has likely evolved to produce the images we perceive not for beauty per se, but rather to optimize the brain's decision-making processes based on these inputs.

It may well be, therefore, that we in the embedded vision industry can learn something by considering the image creation and image analysis tasks in combination. While current architectures consider these tasks as sequential and largely independent stages, we also know that the human eye-brain system exhibits a high degree of feedback whereby the brain’s model of the environment informs the image capture process. In Apical's opinion, it is inevitable that future machine vision systems will be designed along similar principles.

In case all this sounds rather high-minded and vague, let’s focus on a specific example of how an imaging algorithm interacts with a simple vision algorithm, a case study that has immediate and practical consequences for those designing embedded vision systems. We’ll look at how the method of DRC (dynamic range compression), a key processing stage in all cameras, affects the performance of a simple threshold-based edge detection algorithm. This technique also happens to have a clear parallel in human vision, since both processes are performed with almost unparalleled efficiency within the eye, based on surprisingly simple and analog neural architectures.

DRC, also known as tone mapping, is the process by which an image with high dynamic range is mapped into an image with lower dynamic range. The term "dynamic range" refers to the ratio between the intensity of the brightest details and that of the darkest details which can be resolved within a single image. Most real-world scenes exhibit a dynamic range of up to ~100 dB, although there are conditions where the dynamic range is much higher, and the eye can resolve around 150 dB. Sensors exist which can capture this range or more, which equates to around 17-18 bits per color per pixel. Standard sensors capture around 70 dB, which corresponds to around 12 bits per color per pixel.
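
For reference, these bit-depth figures follow directly from the usual definition of dynamic range, 20 * log10(Imax / Imin), so each additional bit contributes roughly 6 dB. A short Python sketch (purely illustrative, not drawn from any particular camera pipeline) reproduces the numbers quoted above:

    import math

    def db_to_bits(db):
        # dynamic range (dB) = 20 * log10(Imax / Imin); one bit doubles the range (~6.02 dB)
        return db / (20 * math.log10(2))

    print(round(db_to_bits(100), 1))  # ~16.6 -> roughly 17 bits for a ~100 dB scene
    print(round(db_to_bits(70), 1))   # ~11.6 -> roughly 12 bits for a ~70 dB sensor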

We would like to use as much information about the scene as possible for embedded vision, implying that we should use as many bits as possible as input data. One possibility is just to take this raw sensor data as-is, in linear form. But often this data is not available, as the camera module may be separate from the computer vision processor, with transmission subsequently limited to 8 bits per color per pixel. Also, some kinds of vision algorithms work better if they are presented with a narrower image dynamic range: less variation in the scene illumination consequently needs to be taken into account.

Often, therefore, some kind of DRC will be performed between the raw sensor data (which may in some cases be very high dynamic range) and the standard RGB or YUV output provided by the camera module. This DRC will necessarily be non-linear; the most familiar example is gamma correction. In fact, gamma correction is more correctly described as dynamic range preservation, because the intention is to match the gamma applied in-camera with an inverse function applied at the display, in order to recover a linear image at the output (as in, for example, the sRGB or rec709 standards for mapping 10-bit linear data into 8-bit transmission formats).
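
As a rough illustration (and only an approximation: the real sRGB and rec709 curves are piecewise, with a linear segment near black), a simple power-law gamma and its inverse can be sketched in Python as follows:

    import numpy as np

    def gamma_encode(linear, gamma=2.2):
        # linear: image data normalized to [0, 1]; a pure power law stands in for the sRGB/rec709 curves
        return np.power(np.clip(linear, 0.0, 1.0), 1.0 / gamma)

    def gamma_decode(encoded, gamma=2.2):
        # applying the inverse at the display recovers a (nearly) linear image,
        # which is why gamma correction is better described as dynamic range preservation
        return np.power(np.clip(encoded, 0.0, 1.0), gamma)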

However, the same kind of correction can be and is also frequently used to compress dynamic range. For the purposes of our vision algorithm example, it would be best to work in a linear domain. In principle, it would be straightforward to apply an inverse gamma and recover the linear image as input. But unfortunately, the gamma used in the camera does not always follow a known standard. There’s a good reason for this: the higher the dynamic range of the sensor, the larger the amount of non-linear correction that needs to be applied to those images which actually exhibit high dynamic range. Conversely, for images that don’t fit the criteria, i.e. images of scenes that are fairly uniform in illumination, little or no correction should be applied.

As a result, the non-linear correction needs to be adaptive, meaning that the algorithm's function depends on the image itself, as derived from an analysis of the intensity histogram of the component pixels. And it may need also to be spatially varying, meaning that different transforms are applied in different image regions. The overall intent is to try to preserve as much information as accurately as possible, without clipping or distortion of the content, while mapping the high input dynamic range down to the relatively low output dynamic range. Figure 1 gives an example of what DRC can achieve: the left-hand image retains the contrast of the original linear sensor image, while the right-hand post-DRC image appears much closer to what the eye would observe.

Figure 1: Original linear image (left) and image after dynamic range compression (right)

Let us assume that we need to work with this non-linear image data as input to our vision algorithm, which is the normal case in many real-world applications. How will our edge detection algorithm perform on this type of data, as compared to the original linear data? Let us consider a very simple edge detection algorithm, based on the ratio of intensities of neighbouring pixels, such that an edge is detected if the ratio is above a pre-defined threshold. And let us also consider the simplest form of DRC, which is gamma-like, and might be associated with a fixed exponent or one derived from histogram analysis (i.e. “adaptive gamma”). What effect will this gamma function have on the edge intensity ratios?
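
For concreteness, here is a minimal Python sketch of such a ratio-threshold edge detector; the function name, neighbourhood (horizontal neighbours only) and threshold value are illustrative choices, not taken from any particular implementation:

    import numpy as np

    def detect_edges_ratio(img, threshold=1.5):
        # img: 2D array of linear pixel intensities; an edge is flagged wherever the ratio
        # of horizontally neighbouring pixel intensities exceeds the threshold
        eps = 1e-6                       # guard against division by zero in dark regions
        left = img[:, :-1] + eps
        right = img[:, 1:] + eps
        ratio = np.maximum(left, right) / np.minimum(left, right)
        return ratio > threshold         # boolean edge map (one column narrower than img)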

The gamma function is shown in Figure 2, at top right. Here, the horizontal axis is the pixel intensity in the original image, while the vertical axis is the pixel intensity after gamma correction. This function increases the intensity of pixels in a variable manner, such that darker pixels have their intensities increased more than brighter pixels. Figure 2 (top left) shows an edge, with its pixel intensity profile along the adjacent horizontal line. This image is linear; it has been obtained from the original raw data without any non-linear intensity correction. Clearly, the edge corresponds to a dip in the intensity profile; let us assume that this dip exceeds our edge detection threshold by a small amount.

Figure 2: Effect of DRC on an edge. The original edge is shown in the top-left corner, with the gamma-corrected result immediately below it. The intensity profile along the blue horizontal line is shown in the middle column. The result of gamma correction is shown in the middle row, with the outcome of applying the local contrast preservation correction shown in the bottom row.

Now consider the same image after the gamma correction (top right), as shown in Figure 2, middle row. The intensity profile has been smoothed out, with the amplitude of the dip greatly reduced. The image itself is brighter, but the contrast ratio is lower. The reason should be obvious: the pixels at the bottom of the dip are darker than the rest, and their intensities are therefore relatively increased more by the gamma curve than the rest, thereby closing the gap. The difference between the original and new edge profiles is shown in the right column. The dip is now well below our original edge detection threshold.
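
The effect is easy to quantify for a pure power-law gamma, since the ratio of two intensities a and b becomes (a/b)^gamma after correction. With the illustrative values below, a dip whose linear contrast ratio is 4 falls to roughly 1.9 after a 1/2.2 gamma, dropping below a threshold that the linear edge comfortably cleared:

    # Illustrative values only: intensities normalized to [0, 1]
    dark, bright = 0.05, 0.20
    gamma = 1.0 / 2.2
    ratio_linear = bright / dark              # 4.0 in the linear image
    ratio_encoded = ratio_linear ** gamma     # ~1.88 after gamma correction
    print(ratio_linear, round(ratio_encoded, 2))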

This outcome is problematic for edge detection, since the strengths of the edges present in the original raw image are reduced in the corrected image, and reduced in a way that is hard to predict, because it depends on where the edge sits within the intensity distribution. Making the transform image-adaptive and spatially variant further increases the unpredictability of how much edges will be smeared out by the transform. There is simply no way to relate the strength of edges in the output to those in the original linear sensor data. On the one hand, then, DRC is necessary to pack the information recorded by the sensor into a form that the camera can output; on the other hand, this very process degrades important local pixel information needed for reliable edge detection.

In an automotive vision system, this degradation could manifest as instability in a white-line detection algorithm, for example when entering or exiting a tunnel, where the dynamic range of the scene changes rapidly and dramatically. Fortunately, a remedy exists. The fix arises from the observation that an ideal DRC algorithm should be highly non-linear on large length scales, on the order of the image dimensions, but strictly linear on very short scales of a few pixels. This behaviour is also desirable for other reasons. Let us see how it can be accomplished, and what effect it has on the edge problem.

The technique involves deriving a pixel-dependent image gain via the formula A_ij = O_ij / D(I)_ij, where (i, j) are the pixel coordinates, O denotes the output image, I the input image, and D is a filter applied to the input image which acts to increase the width of edges. The so-called amplification map, A, is post-processed by a blurring filter which alters the gain for a particular pixel based on an average over its nearest neighbours. This modified gain map is multiplied with the original image to produce the new output image. The result is that the ratio in intensities between neighbouring pixels is precisely preserved, independent of the overall shape of the non-linear transform applied to the whole image.
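
A minimal Python sketch of this idea follows, assuming a Gaussian blur for the edge-widening filter D and a fixed power-law gamma as the global tone curve; the actual filters and curves used in production pipelines are not specified in this article, so these are placeholder choices:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def local_contrast_preserving_drc(img, gamma=1.0 / 2.2, sigma_d=2.0, sigma_a=8.0):
        # img: linear image normalized to [0, 1]
        eps = 1e-6
        filtered = gaussian_filter(img, sigma_d)                    # D(I): edge-widening filter (assumed Gaussian here)
        tone_mapped = np.power(np.clip(filtered, eps, 1.0), gamma)  # O: global non-linear curve applied to D(I)
        gain = tone_mapped / np.maximum(filtered, eps)              # amplification map A = O / D(I)
        gain = gaussian_filter(gain, sigma_a)                       # post-process A with a blurring filter
        return np.clip(gain * img, 0.0, 1.0)                        # multiply the gain map with the original image

Because the blurred gain map varies only slowly over a few pixels, neighbouring pixels receive essentially the same gain, so their intensity ratio in the output matches that in the linear input, which is exactly the property the edge detector relies on.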

This result is shown in the bottom row of Figure 2. Although the line is brighter, its contrast with respect to its neighbours is preserved. We can see this more clearly in the image portion shown in Figure 3. Here, several edges are present within the text. Standard gamma correction reduces local contrast, "flattening" the image, while the local contrast preservation algorithm locks the ratio of edge intensities, such that the dips in the intensity profile representing the dark lines within the two letters are identical in the top and bottom images.

Figure 3: Effect of DRC on a portion of an image. The original linear image is in the top-left corner, with the gamma-corrected result immediately below it. The effect of local contrast preservation is shown in the bottom-left corner.

In summary, while non-linear image contrast correction is essential for forming images that are viewable and transmissible, such transforms should retain linearity on the small scales important for edge analysis. Note that defining the amplification map above as a pixel-position-dependent quantity implies that such transforms must be local rather than global (i.e., position-independent). It is worth noting that the vast majority of cameras on the market employ entirely global processing and therefore have no means of controlling the relationship between edges in the original linear sensor data and the camera output.

In conclusion, we have reviewed an example of where the nature of the image processing applied in-camera has a significant effect on the performance of a basic embedded vision algorithm. The data represented by a standard camera output is very different from the raw data recorded by the camera sensor. While in-camera processing is crucial in delivering images which accurately reflect the original scene, it may have an unintended negative impact on the effectiveness of vision algorithms. Therefore, it is a good idea to understand how the pixels input to the algorithms have been pre-processed, since this pre-processing may have both positive and negative impacts on the performance of those algorithms in real-life situations.

We will look at other examples of this interaction in future articles.