Embedded Vision Alliance: Technical Articles


Fundamentals of coolSRAM-1T Memory IP


Embedded vision applications deal with a lot of data; a single 1080p60 (1920x1080 pixel per frame, 60 frames per second) 24-bit color video stream requires nearly a half GByte of storage per second, along with nearly 3 Gbps of bandwidth, and 8-bit alpha (transparency) or 3-D depth data further amplifies the payload by 33% in each case. Reliably and cost-effectively storing this data, as well as rapidly transferring it between memory and other system nodes, is critical to robust system operation. As such, advanced memory technologies such as Mentor Graphics' coolSRAM-1T are valuable in embedded vision designs. This is a reprint of a Mentor Graphics-published white paper, which is also available here (1.2 MB PDF).
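The arithmetic behind those figures is easy to verify; the short sketch below reproduces them from the frame parameters quoted above (all values taken from the text, none assumed beyond it):

```python
# Back-of-the-envelope math for a 1080p60, 24-bit color video stream.
WIDTH, HEIGHT, FPS = 1920, 1080, 60
BYTES_PER_PIXEL = 3  # 24-bit color

bytes_per_second = WIDTH * HEIGHT * BYTES_PER_PIXEL * FPS
print(f"{bytes_per_second / 2**30:.2f} GiB/s of storage")         # ~0.35, "nearly a half GByte"
print(f"{bytes_per_second * 8 / 1e9:.2f} Gbps of raw bandwidth")  # ~2.99, "nearly 3 Gbps"

# Adding an 8-bit alpha (or depth) plane grows the payload by 8/24 = 33%.
with_alpha = bytes_per_second * 4 / 3
print(f"+33% -> {with_alpha * 8 / 1e9:.2f} Gbps")
```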

For more information on Mentor Graphics' Memory IP products, please visit https://www.mentor.com/products/ip/memory-ip/.


Memory content in modern silicon chips (SoC, ASIC, etc.) is dramatically increasing as more complex functionality and software are required to run on a single monolithic chip. More complex types of IPs (intellectual properties) are also being integrated into a single chip (such as analog, CPU, DSP, OTP, SRAM, ROM, CAM, and high-performance I/Os). Chip design managers face the daunting task of making a choice for each of the IPs on a chip, which involves acquiring and digesting copious amounts of information.

This paper explores the design tradeoffs between 6-transistor static random access memory (6T-SRAM) and Mentor Graphics’ coolSRAM-1T dynamic random access memory. Background information on the 6T-SRAM and coolSRAM-1T technologies is covered first. The coolSRAM-1T technology is then detailed in subsequent sections, followed by a sample test report.

6-Transistor Static Random Access Memory (6T-SRAM)

6T-SRAM is the most widely used memory type in silicon designs today. New process nodes are typically qualified by using optimized 6T-SRAM arrays. A schematic for a 6T-SRAM static cell is shown in Figure 1.

Figure 1. 6T static random access memory (6T-SRAM) is composed of a storage latch and two access transistors that allow read and write access into the cell.

The 6T-SRAM cell consists of a storage latch (two back-to-back inverters) along with two NMOS access transistors that allow read and write access into the cell. The value stored in the cell is determined by the polarity of the Q/Qx nodes, which always hold opposite values. Note that the storage nodes (Q/Qx) are actively driven at all times. The cell is read by first precharging the bitlines (bL and bLx) to the supply voltage (VDD) and then floating them. When the wordline (WL) is driven high to VDD, the bitline connected to the side of the latch storing a zero is pulled low, while the complementary bitline stays at VDD. A sense amplifier detects this differential signal and converts it to a full-swing digital value. To write into the cell, the desired data value is driven onto the bitlines and the wordline is driven to VDD, forcing the data into the cell. Read/write operations on a 6T-SRAM cell are summarized in Table 1.

Table 1. Summary of 6T-SRAM operation.

The 6T-SRAM cell area is minimized by fabrication houses by extensively utilizing “pushed” design rules in the cell. Since the 6T-SRAM cells are used only in regular two-dimensional arrays, the process is optimized so that high yield is achieved with the pushed design rules. However, despite the use of pushed design rules, the 6T-SRAM cell contains six transistors and is quite a bit larger than single transistor/capacitor dynamic memory. Since 6T-SRAM also uses active drivers to maintain data, leakage power becomes a major concern in advanced process nodes. On the other hand, coolSRAM-1T uses passive storage structures that are optimized for low leakage and tend to have a lower leakage current in advanced technology nodes.

coolSRAM-1T Embedded Dynamic Random Access Memory

Dynamic random access memory (DRAM) can be a cost-effective alternative to 6T-SRAM since it occupies a smaller area. However, conventional DRAM bitcell implementations require additional process steps that add to the wafer cost and diminish or completely erase any area savings that DRAM provides.

Mentor Graphics’ coolSRAM-1T uses only the standard base CMOS process, providing design flexibility and cost savings. The interface to the memory blocks looks very much like a 6T-SRAM, providing random access every cycle (i.e., RAS/CAS page access patterns found in stand-alone DRAM architectures are not used).

The coolSRAM-1T dynamic memory cell consists of a storage capacitor and an access transistor. Because the capacitor provides a passive storage medium (there is no active drive on the storage node Vcell), minimizing the leakage currents in the cell is extremely critical to a successful dynamic memory implementation. The coolSRAM-1T is shown in Figure 2.

Figure 2. coolSRAM-1T dynamic memory cell with a storage capacitor and an access transistor. The passive nature of the cell requires extremely low leakage currents.

Typically, the cell is read by biasing the bitline (bL) at approximately half supply voltage and then floating it. The wordline (WL) is then driven to high, and the charge stored inside the cell is dumped onto the bitline. The small voltage change on the bitline is sensed by an amplifier. Since the read is destructive (the act of reading destroys the data in the cell), after being sensed, the data is written back into the cell. Writing into the dynamic memory cell is accomplished by driving the bitline to the desired data and then driving the wordline high. The dynamic memory operation is summarized in Table 2.
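The destructive-read-then-write-back sequence can be sketched behaviorally. The `DynamicCell` class below is a hypothetical toy model written for illustration only, not Mentor Graphics' implementation:

```python
class DynamicCell:
    """Toy behavioral model of a 1T dynamic cell: a passive charge store."""

    def __init__(self):
        self.charge = None  # None models an uninitialized (or leaked-away) cell

    def write(self, value):
        # Bitline driven to the data value, wordline raised: charge forced in.
        self.charge = value

    def read(self):
        # Charge dumps onto the half-VDD-biased bitline; the read is destructive.
        sensed = self.charge
        self.charge = None   # cell contents are destroyed by the read...
        self.write(sensed)   # ...so the sensed value must be written back
        return sensed

cell = DynamicCell()
cell.write(1)
assert cell.read() == 1
assert cell.read() == 1  # data survives only because of the write-back
```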

Table 2. Summary of dynamic memory operation.

The charge inside the cell is passively stored, so any source of leakage will slowly decrease the amount of charge originally written into the cell. A periodic refresh is therefore needed to maintain data indefinitely in a dynamic memory cell. A simple refresh can be accomplished by the system accessing a memory location. Typically, dynamic memory blocks have a refresh interface that allows a memory location to be refreshed without affecting the instance output and provides the ability to refresh more than one entry at a time.
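As a rough illustration of the refresh budget implied above, the sketch below shows the arithmetic. The retention time, row count, and entries-per-refresh figures are assumptions for the example, not product specifications:

```python
# All numbers below are illustrative assumptions, not coolSRAM-1T specs.
retention_time_s = 2e-3   # assumed worst-case cell retention time
rows = 1024               # assumed number of rows needing refresh in the block
rows_per_refresh = 4      # a refresh interface may cover several entries at once

refresh_ops = rows / rows_per_refresh
interval_s = retention_time_s / refresh_ops  # spacing so every row refreshes in time
print(f"one refresh every {interval_s * 1e6:.2f} us "
      f"({refresh_ops / retention_time_s:.0f} refresh ops per second)")
```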

One of the major challenges of dynamic memory design is the loss of stored charge due to leakage. Four major sources of leakage in a dynamic memory cell are illustrated in Figure 3. The access transistor exhibits subthreshold leakage during the idle state (depending on the bias conditions of the bitline), and especially when a “0” is written to another row in the same block. Subthreshold leakage is exponentially dependent on the device threshold voltage and the gate-source overdrive voltage. As the gate oxide thickness is scaled below approximately 30 angstroms, gate leakage starts to become a significant source of charge loss. Depending on the physical implementation details of the capacitor, there may also be appreciable leakage through the storage capacitor itself. Finally, the junction of the access transistor leaks charge into the bulk; junction leakage can be minimized by optimizing the cell layout and the fabrication process.

Figure 3. Four major leakage sources in a dynamic memory cell.

To minimize the cell leakage and provide a large storage capacitance in a small area, the DRAM process has diverged from the standard baseline digital CMOS process. In recent years, due to increasing requirements for large amounts of embedded memory in SOC applications, DRAM process modules have been designed to be plugged into the standard CMOS process flow. However, such process modules increase the mask and wafer costs, so there is a minimum cost barrier to utilizing DRAM on chips.

Mentor Graphics’ innovative coolSRAM-1T technology uses the standard CMOS process to implement the dynamic memory cell and can reduce the memory area by half with no additional mask or wafer costs. Mentor Graphics’ coolSRAM-1T design is also foundry or factory-independent and can be ported quickly.

The basic requirements of a robust DRAM are that it have adequate storage capacitance in a small area, very small cell leakage, and robust peripheral sense circuits. Mentor Graphics’ coolSRAM-1T utilizes process options available in the standard CMOS process to build the cell and circuits that satisfy the above requirements.

coolSRAM-1T Cell In Standard CMOS

To minimize subthreshold and junction leakage, the Mentor Graphics coolSRAM-1T dynamic memory cell utilizes the thick-oxide or input/output (I/O) transistor option available in all advanced process nodes. Thick-oxide devices have a larger oxide thickness, a higher threshold voltage, and deeper junctions. In addition to much lower leakage levels, thick-oxide devices can also withstand higher gate and drain voltages, enabling more charge storage in the same cell area. An important physical parameter for the cell is the amount of charge stored: Q = C × V, where Q is the amount of charge stored, C is the cell capacitance, and V is the voltage in the cell. Note that cell capacitance and cell voltage can be traded off against each other for a given amount of stored charge.
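Q = C·V makes the benefit of the higher I/O voltage easy to quantify. The capacitance and supply values below are illustrative assumptions only, chosen to match common core/I/O voltage pairings mentioned in this paper:

```python
# Illustrative assumption: a 20 fF storage capacitor.
C = 20e-15               # farads
V_core, V_io = 1.2, 2.5  # example core vs. thick-oxide I/O supply voltages

q_core = C * V_core      # charge stored if the cell ran at core voltage
q_io = C * V_io          # charge stored at the I/O voltage
print(f"Q at core voltage: {q_core * 1e15:.1f} fC")
print(f"Q at I/O voltage:  {q_io * 1e15:.1f} fC  ({q_io / q_core:.2f}x more charge)")
```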

The storage capacitor can be implemented as the gate oxide capacitance of a thick oxide device. In advanced technology nodes (e.g., 65nm and below), the use of metal-to-metal capacitance as the storage medium becomes more area efficient than planar gate oxide capacitance. The interconnect conductor spacing becomes smaller with each technology generation, resulting in larger capacitance compared to the unit capacitance that can be obtained from a thick oxide gate. Figure 4 illustrates the coolSRAM-1T cell implementation.

Figure 4. Mentor Graphics’ coolSRAM-1T cell implementation with thick oxide devices. The basic device characteristics for a given oxide thickness remain the same at different technology nodes. The cell capacitance is implemented with metal fringe capacitors in the most advanced process nodes.

The basic device characteristics for thick-oxide devices remain the same from one technology node to the next (e.g., 50 Å, 2.5V I/O devices have the same oxide thickness and threshold voltage at the 130nm and 90nm technology nodes). Because the device characteristics are the same, it is easier to port and characterize the coolSRAM-1T technology when the same oxide-thickness option is used. Within the same technology node, different implementations of the coolSRAM-1T technology are required for different gate-oxide process options: a 130nm 1.2V/2.5V process will have a different compiler than a 130nm 1.2V/3.3V process.

Peripheral Circuit Implementation

The most important parts of the peripheral circuits are (1) the sense amplifier and (2) the write-back circuit that restores the charge into the cell after a destructive read. The sense amplifier is a sophisticated analog circuit that accurately senses the small signal injected from the cell onto the bitline. After the sensing phase is complete, the write-back circuit drives the bitlines full-swing to restore the charge into the cell. The sense amplifier must have very high sensitivity as well as high tolerance to noise and process variations. Appropriate steps are taken during design to guarantee excellent margins on silicon, as verified on coolSRAM-1T designs incorporated into customers’ products. To boost the signal available from a given cell-capacitor area, the cell array is operated at the I/O voltage level: a larger voltage stored in the cell results in a larger signal for the sense amplifier and improves overall performance. The interface to the system is at the VDD (core) voltage level. A large number of patents have been filed on various aspects of this innovative technology, from the cell and peripheral circuits to the top-level architecture.

The signals must be level-shifted from one voltage domain to another as they travel from the memory interface to the cell array. Tradeoffs for each design style must be carefully considered. Table 3 lists tradeoffs for three different approaches to implementing the coolSRAM-1T voltage domains: (1) memory array and the sense amplifier running at VDD, (2) memory array at VIO and the sense amplifier at VDD, and (3) memory array at VIO and sense amplifier at VIO. These three different approaches are represented in Figure 5.

Table 3. Comparison of different voltage-domain partitioning in Mentor Graphics’ coolSRAM-1T.

Independent of the size of a coolSRAM-1T instance, the cycle time is limited by the amount of time it takes a local bank (memory cells, sense circuits, and x-decoders) to correctly sense the signal from the cell, amplify, write back, and precharge to get ready for the next access into the block. The system clock frequency need not be limited by this local bank cycle time requirement if there is any known access pattern into the memory. For example, if a local bank can be guaranteed not to be accessed in two consecutive cycles, the system can run at twice the local bank frequency without having to introduce pipelining inside the bank itself.

Figure 5. Architectural representation of different voltage-domain partitioning options. The represented memory is partitioned into two independent local banks. Left to right - VDD sense amp/VDD cell, VDD sense amp/VIO cell, VIO sense amp/VIO cell.

This concept of interleaving accesses into the memory is demonstrated for the case of 2-way interleaving, shown in Figure 6. The concept extends to n-way interleaving if a bank is guaranteed to be accessed only once per n cycles (and there are at least n independent local banks inside the memory instance), with the system clock running n times faster than the local banks.

Figure 6. Example of interleaved access to two local banks (b0 and b1). Each bank is accessed in alternating cycles, and the system clock runs at twice the frequency of a local bank.
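The interleaving rule can be sketched as a round-robin schedule. `interleaved_schedule` below is a hypothetical helper, written only to show why no bank is hit on consecutive system cycles in the 2-way case:

```python
def interleaved_schedule(n_banks, system_cycles):
    """Assign each system cycle to a bank round-robin; each bank then sees
    only every n-th cycle, so it may run n times slower than the system clock."""
    return [cycle % n_banks for cycle in range(system_cycles)]

schedule = interleaved_schedule(2, 8)
print(schedule)  # [0, 1, 0, 1, 0, 1, 0, 1]

# No bank is accessed on two consecutive system cycles:
assert all(schedule[i] != schedule[i + 1] for i in range(len(schedule) - 1))
```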

coolSRAM-1T Compilers

The Mentor Graphics coolSRAM-1T solution is integrated into the MemQuest Memory Compiler environment, where instances can be compiled and verified (including full SPICE simulation characterization). Compiled instances in the 160nm 1.8V/3.3V, 130nm 1.5V/3.3V, and 110nm 1.2V/3.3V technology nodes have been incorporated into customer products and are in volume production. The memory architecture can support large instance sizes (e.g., 4Mbits or larger) as well as extreme aspect ratios. Extreme aspect ratios (e.g., a height-to-width ratio of 1:20) are useful in applications such as LCD drivers, where the chip height is typically limited to less than 1mm.

The compiled coolSRAM-1T memories maintain area savings over the 6T-SRAM in each technology node down to 65nm. Table 4 summarizes area savings for a typical 1Mbit instance in each technology node from 180nm to 65nm.

Table 4. Area savings of Mentor Graphics’ coolSRAM-1T over Mentor Graphics’ coolSRAM-6T™ in various standard process nodes.

If the coolSRAM-1T architecture drives the bitlines to VIO, the active power for the coolSRAM-1T will be larger than that of a similar 6T-SRAM implementation. However, in advanced technology nodes, the coolSRAM-1T instance has lower leakage, especially at high temperatures (since dynamic memory cells are optimized for low leakage with the use of thick oxide devices). Overall system power performance would have to be carefully evaluated based on the time periods that the memory is in active usage or standby. In general, coolSRAM-1T provides a better leakage performance in advanced, especially generic, process nodes.

In some designs, depending on size, the customer should consider bitcell redundancy to improve yield or retention time. Mentor Graphics’ coolSRAM-1T comes with an optional, built-in column redundancy that is fully compatible with Mentor Graphics’ Tessent® product suite. Another consideration for memory-intensive SOC designs is the use of some form of error-correcting code (ECC). ECC should not be considered a direct replacement for column redundancy; column redundancy is recommended for large instances even when ECC is used.


Testing Overview

Testing procedures for Mentor Graphics’ coolSRAM-1T memory are summarized below; more information is provided to Mentor Graphics’ licensees in datasheets and test documentation. The goals of testing are (1) to identify defective parts and (2) to improve quality in the field, i.e., to identify parts that might fail a short time after being produced. The coolSRAM-1T test flow accomplishes these goals in three major steps.

The first step in testing is to internally stress the instance at an elevated voltage, outside of the normal operating conditions. This is the voltage over-stress condition where the cell storage capacitors, bitlines, wordlines, and other signals are biased in parallel at a dc voltage for a set duration at elevated voltage and temperature levels. The memory interface provides a mode pin to automatically invoke this test condition. The intent is to aggravate internal weak defects that might become full-blown failures a short time after deployment in products.

The second step is to run the SRAM-style Built-In-Self-Test (BIST) algorithm. This algorithm checks for any defects or failing peripheral circuits by running a succession of patterns through the memory (e.g., March C+ algorithm). This testing is carried out in a tightened margin environment (high-temperature, low-voltage conditions). Additionally, proprietary interface pins allow users to further squeeze internal voltage and timing margins of the instance during BIST tests. When the instance passes the BIST tests with reduced internal margins and aggressive test conditions, it will be more likely to withstand the stress of operation and aging in the field.
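March C+ is a well-known published march test. The sketch below runs the pattern in software over a simulated memory to show how a stuck-at fault is caught; it illustrates the algorithm only, not Mentor Graphics' hardware BIST engine:

```python
def march_c_plus(mem):
    """Run the March C+ pattern over a bit-per-address memory (a mutable
    sequence). Returns True if every read matches its expected value."""
    n = len(mem)
    up, down = range(n), range(n - 1, -1, -1)

    def element(order, ops):
        for addr in order:
            for op, val in ops:
                if op == "r":
                    if mem[addr] != val:
                        return False  # read mismatch: fault detected
                else:
                    mem[addr] = val
        return True

    # March C+: (w0); up(r0,w1,r1); up(r1,w0,r0); down(r0,w1,r1); down(r1,w0,r0); (r0)
    elements = [
        (up,   [("w", 0)]),
        (up,   [("r", 0), ("w", 1), ("r", 1)]),
        (up,   [("r", 1), ("w", 0), ("r", 0)]),
        (down, [("r", 0), ("w", 1), ("r", 1)]),
        (down, [("r", 1), ("w", 0), ("r", 0)]),
        (up,   [("r", 0)]),
    ]
    return all(element(order, ops) for order, ops in elements)

class StuckAtZero(list):
    """Memory model with cell 3 stuck at 0, to demonstrate fault detection."""
    def __setitem__(self, i, v):
        super().__setitem__(i, 0 if i == 3 else v)

print(march_c_plus([1, 0, 1, 1, 0, 0, 1, 0]))   # True: fault-free memory passes
print(march_c_plus(StuckAtZero([0] * 8)))       # False: stuck-at fault caught
```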

Finally, cell retention is verified under reduced-margin conditions similar to those mentioned above (both voltage/temperature and internal circuit settings). Retention is verified over a longer time than what is guaranteed at a given temperature. The retention margin is sometimes implemented by margining the temperature (testing for retention at a higher temperature than the maximum operating temperature specified for the part).

Testchip Shmoo Results

Example shmoo plots from Mentor Graphics’ 110nm test report are shown in Figure 7. Contact Mentor Graphics for the full test report, or for the availability of test reports for other technologies. Chips are chosen from five process corners (TT, SS, FF, SF, FS) and must pass the BIST algorithms to qualify as a pass. The results below are shown at 25°C, but plots are also generated at hot and cold temperatures. Because there are multiple supply voltages, a total of six plots is generated for each memory and temperature. For each plot, one power source is swept while the other is held constant at nominal, nominal+10%, or nominal-10%. These shmoo plots are healthy and show robust operation of Mentor Graphics’ coolSRAM-1T memory.

Figure 7. Example shmoo plots at 25°C from 110nm test report. The top 3 plots sweep VIO voltage with a constant VDD supply. The bottom 3 plots sweep VDD voltage with a constant VIO supply.

Testchip Retention Results

To determine the minimum refresh frequency required for the coolSRAM-1T, retention time is measured instead of the standard 6T-SRAM retention voltage. Acquiring the data retention time is a five-step process: (1) using direct access, write a portion of the memory with a known data pattern of 1’s and 0’s; (2) disable the memory by asserting chip-enable low; (3) wait for a set duration; (4) assert chip-enable high and read the data to confirm that there is no data loss; and (5) increase the wait duration and repeat the read-and-confirm step until a failure occurs. Figure 8 shows the retention time dependency on temperature. Testing is performed on FF parts at nominal-10% VIO, which is the worst-case retention corner due to the higher leakage and lower storage voltage level. The retention time results were in line with Mentor Graphics’ simulation results.
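The five-step procedure above can be sketched as a loop. The `LeakyMemory` device model below is entirely hypothetical, including its exponential temperature dependence; it exists only to make the measurement flow runnable:

```python
import math

class LeakyMemory:
    """Hypothetical device whose data is lost after a retention time that
    shortens exponentially with temperature (illustrative model only)."""
    def __init__(self, temp_c):
        self.retention_s = 4.0 * math.exp(-(temp_c - 25) / 30.0)  # made-up curve
        self.data, self.elapsed = None, 0.0

    def write(self, pattern):
        self.data, self.elapsed = pattern, 0.0

    def set_chip_enable(self, on):
        pass  # modeled as a no-op in this sketch

    def wait(self, seconds):
        self.elapsed += seconds

    def read(self):
        return self.data if self.elapsed < self.retention_s else None

def measure_retention(memory, pattern, wait_times_s):
    memory.write(pattern)                  # 1. write a known 1/0 pattern
    for wait in sorted(wait_times_s):
        memory.set_chip_enable(False)      # 2. disable via chip-enable
        memory.wait(wait)                  # 3. hold for the trial duration
        memory.set_chip_enable(True)       # 4. re-enable and read back
        if memory.read() != pattern:
            return wait                    # 5. first failing wait bounds retention
        memory.write(pattern)              # restore, then try a longer wait
    return None                            # no failure seen in the tested range

waits = [0.25 * k for k in range(1, 40)]
hot = measure_retention(LeakyMemory(85), 0b1010, waits)
cold = measure_retention(LeakyMemory(25), 0b1010, waits)
print(hot, cold)  # the hot part fails at a much shorter wait than the cold part
```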

Figure 8. Example retention time vs. temperature from 110nm test report.

Cost Analysis

In a bulk CMOS design, whenever the power and speed performance targets are satisfied and over 1Mbit of memory is needed, it is more cost-effective to use Mentor Graphics’ coolSRAM-1T in place of 6T-SRAM, since no extra mask or wafer costs are required. In the most advanced technology nodes, there has been a trend to integrate metal-insulator-metal (MIM) capacitor structures to implement the dynamic memory storage capacitor. However, such additional process steps increase both mask and wafer costs. In that scenario, the following cost analysis applies when choosing between coolSRAM-1T and MIM-based DRAM.

Although Mentor Graphics’ coolSRAM-1T does not require any extra masks or process steps, the resulting memory density is lower than that of a memory designed with a MIM-based process. Therefore, there is a break-even point between using the Mentor Graphics solution and a MIM-based solution. The break-even point is given by

X ≤ (1-1/n)/(1-1/m)

where X is the fraction of the chip occupied by coolSRAM-1T (e.g., X = 0.3 if 30% of the chip is coolSRAM-1T), n is the wafer-cost increase factor when the MIM module is added (e.g., n = 1.2 for a 20% increase in wafer cost), and m is the area increase factor of the Mentor Graphics solution relative to the MIM-based solution (e.g., m = 1.3 when the Mentor Graphics solution is 30% larger than the corresponding MIM-based solution).
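The inequality is straightforward to evaluate; the helper below is a sketch using the example factors quoted with the formula (n = 1.2, m = 1.3), which differ from the typical 65nm parameters behind the roughly 40% break-even figure cited later:

```python
def breakeven_fraction(n, m):
    """Maximum coolSRAM-1T chip fraction X at which the no-extra-mask
    coolSRAM-1T solution stays cheaper than MIM-based DRAM:
    X <= (1 - 1/n) / (1 - 1/m)."""
    return (1 - 1 / n) / (1 - 1 / m)

# Example factors from the text: 20% MIM wafer-cost adder, 30% larger coolSRAM-1T.
x_max = breakeven_fraction(n=1.2, m=1.3)
print(f"coolSRAM-1T remains cost-effective up to X = {x_max:.0%} memory content")
```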

Note that this model does not include the increase in mask-set costs, the process development costs incurred by the factory, or the restriction of factory choice when using MIM technology. In some cases, the DRAM MIM process option may be incompatible with other process options (such as an RF module). An example cost-analysis graph with typical area and cost-factor parameters in a 65nm technology node is included in Figure 9. The Mentor Graphics solution is better than or comparable in cost to MIM up to about 40% memory content on the chip, not counting MIM's increase in mask cost and its limitation on factory choice.

Figure 9. Example cost analysis of coolSRAM-1T versus a MIM-based solution, with typical area and cost-factor parameters in a 65nm technology node.

Vision Processing Opportunities in Virtual Reality


VR (virtual reality) systems are beginning to incorporate practical computer vision techniques, dramatically improving the user experience as well as reducing system cost. This article provides an overview of embedded vision opportunities in virtual reality systems, such as environmental mapping, gesture interface, and eye tracking, along with implementation details. It also introduces an industry alliance available to help product creators incorporate robust vision capabilities into their VR designs.

VR (virtual reality) dates from the late 1960s, at least from an initial-product standpoint; the concept of VR has been discussed for much longer in academia, industry and popular literature alike. Yet only in the last several years has VR entered the public consciousness, largely thanks to the popularity of the Oculus Rift HMD (head-mounted display), more recently joined by products such as the similarly PC-based HTC Vive, the game console-based Sony PlayStation VR, and the smartphone-based Samsung Gear VR along with Google's Cardboard (and upcoming Daydream) platforms (Figure 1). The high degree of virtual-world immersion realism delivered by these first-generation mainstream systems, coupled with affordable price tags, has generated significant consumer interest. And Oculus' success in attracting well-known engineers such as John Carmack (formerly of id Software) and Michael Abrash, coupled with the company's March 2014 $2B acquisition by Facebook, hasn't hampered the technology's popularity, either.

Figure 1. The PC-based HTC Vive (top), game-console-based Sony PlayStation VR (middle) and smartphone-based Samsung Gear VR (bottom) are examples of the varying platform (and vision processing) implementations found in first-generation VR systems (top courtesy Maurizio Pesce, middle courtesy Marco Verch, bottom courtesy Nan Palmero).

Much the same can be said about computer vision, a broad, interdisciplinary field that uses processing technology to extract useful information from visual inputs by analyzing images and other raw sensor data. Computer vision has mainly been a field of academic research over the past several decades, implemented primarily in complex and expensive systems. Recent technology advances have rapidly moved computer vision applications into the mainstream, as cameras (and the image sensors contained within them) become more feature-rich, as the processors analyzing the video outputs similarly increase in performance, and as the associated software becomes more robust. As these and other key system building blocks such as memory devices also decrease in cost and power consumption, the advances are now paving the way for the proliferation of practical computer vision into diverse applications. Computer vision that is embedded in stand-alone products or applications is often called "embedded vision."

VR is a key potential growth market for embedded vision, as initial implementations of a subset of the technology's full potential in first-generation platforms exemplify. Cameras mounted directly on the HMD exterior, and/or in the room in which the VR system is being used, find use in discerning the locations and motions of the user's head, hands, other body parts, and overall body, thereby enabling the user (in part or in entirety) to be "inserted" into the virtual world. These same cameras can also ensure that a VR user in motion doesn't collide with walls, furniture, or other objects in the room, including other VR users.

Meanwhile, cameras mounted inside the HMD can implement gaze tracking for user interface control, for example, as well as supporting advanced techniques such as foveated rendering and simulated dynamic depth of field that enhance the perceived virtual world realism while simultaneously reducing processing and memory requirements. And by minimizing the overall processing and storage requirements for data coming from both internal and external sensors, as well as by efficiently partitioning the overall load among multiple heterogeneous resources within the overall system architecture, designers are able to optimize numerous key parameters: performance, capacity, bandwidth, power consumption, cost, size, and weight, for example.

User and Environment Mapping Using a Conventional Camera

In order to deliver fully immersive VR, the 3D graphics displayed by a HMD need to be rendered precisely from the user’s viewpoint, and with imperceptible latency. Otherwise, if location inaccuracy or delay is excessive, the user may experience nausea; more generally, any desirable "suspension of disbelief" aspect of the VR experience will be practically unattainable. To measure the location, orientation and motion of the headset, today's first-generation VR systems at minimum use IMUs (inertial measurement units) such as gyroscopes and accelerometers. Such sensors operate just like the ones found in mobile phones – and with smartphone-based VR, in fact, they're one and the same. Unfortunately, IMUs tend to suffer from location drift, i.e., an ever-increasing discrepancy between where the system thinks the HMD is located versus its actual location. And even under the best of circumstances, they're not able to deliver the accuracy required for a robust immersive experience.

Higher-end VR systems such as HTC’s Vive, the Oculus Rift, and Sony’s PlayStation VR leverage additional external light transmission and reception devices, enabling more accurate detection and tracking of the HMD and its behavior. Both Oculus and Sony, for example, utilize an array of "tracker" infrared LEDs mounted to the HMD, in conjunction with a room-located camera. With HTC's Vive the photosensors are installed on the HMD itself, where they detect the horizontal and vertical beams emitted by multiple "Lighthouse" laser base stations. Such light-emitting active systems, typically based on infrared or otherwise non-visible spectrum light such that the human eye doesn’t notice it, have two fundamental drawbacks, however: potential interference from other similar-spectrum illumination sources, and limited range.

An alternative or supplement to IMUs or external active systems for tracking involves mounting one or several small outward-facing cameras on the HMD exterior. HTC’s Vive, in fact, implements such an arrangement, via a single camera located on the front of the HMD. To date, the camera only finds use in the system's "Chaperone" mode, which optionally provides a live video feed to the HMD in order to safely guide the user away from obstacles. But external cameras (which are also already present in smartphones, therefore readily available to smartphone-based VR setups), in conjunction with vision processors running relevant algorithms, are conceptually capable of much more:

  • Tracking of the location and orientation of the headset, in all six degrees of freedom (three directions each of translation and rotation)
  • Analyzing the user's surroundings to prevent collisions with walls, furniture, other VR users, or other obstacles, and
  • Capturing and deciphering the position, orientation and motion of the user’s hands and other body parts, along with those of other users in the same environment

Algorithms such as SLAM (simultaneous localization and mapping) and SfM (structure from motion) can accurately determine the headset’s location and other characteristics using only a standard 2D camera. Algorithms such as these detect and track feature points of environment objects in the camera’s view, from one frame to another. Based on the dynamic 2D motion of these feature points, 3D position can be calculated. This determination is conceptually similar to the one made by our brains when we move our heads; objects closer to us move more substantially across our field of view in a given amount of time than do objects in the distance.

The amount and direction of feature points' motion from frame to frame enable the algorithm to calculate the location and orientation of the camera capturing the images. By knowing the location of the camera, we can then know the location of the HMD, and therefore the user’s location. This data can combine with IMU measurements, using sensor fusion techniques, to obtain an even more accurate VR viewpoint for rendering purposes. Since standard cameras are being used, the setup can account for objects at distances that would be infeasible with infrared-based setups mentioned elsewhere; the HMD can even be reliably used outdoors.

The 3D point cloud of the environment generated by a SLAM, SfM or comparable algorithm also enables the VR system to deduce the 2D distance and even full 3D location of items in the surroundings. This data can find use, for example, in warning the user when he or she gets too close to other objects, including other users, in order to avoid collisions. Keep in mind, however, that rapid head movements can temporarily disrupt the accuracy of the environment scanning process delivered by an HMD-located camera, versus with an alternative room-located camera setup.

If the user's hands (potentially including individual fingers) and/or other body parts are also capable of being accurately captured by the HMD-based camera, their positions and movements can also be inserted into the virtual world, as well as being used for gesture interface control and other purposes. Whether or not this added capability is feasible depends on the camera's FoV (field of view), orientation, DoF (depth of field), and other parameters, along with the technology on which it is based.

User and Environment Mapping Using a 3D Sensor

Conventional 2D camera sensors enable a wide range of vision capabilities across a breadth of applications, and are also cost-effective and steadily advancing in various attributes. However, as previously discussed, in order to use them to determine the 3D location of the camera and/or objects in the viewing environment, it's necessary for one or both to be in motion at the time. In fully static situations, obtaining an accurate determination of the distance from the camera to any particular object can range from difficult to impossible, even if an object is recognized and a model of its linear dimensions versus distance (such as common sizes of humans' faces) is incorporated in the calculations.

Alternatively, the HMD developer can choose to incorporate one (or more) of several available 3D sensor technologies in the design, each with a corresponding set of strengths and shortcomings for VR and other vision applications. Stereoscopic vision, for example, combines two side-by-side 2D image sensors (mimicking the spacing between a human being's set of eyes), determining the distance to an object via triangulation, by using the disparity in viewpoints between them when jointly imaging the subject of interest.
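The triangulation arithmetic behind stereoscopic depth reduces to a one-line relationship between disparity and depth; the focal length, baseline and disparity values below are illustrative rather than taken from any particular HMD:

```python
def stereo_depth_m(focal_px, baseline_m, disparity_px):
    """Depth from stereo disparity: Z = f * B / d, where f is the focal
    length in pixels, B the baseline (camera separation) in metres, and
    d the disparity in pixels between the two views."""
    if disparity_px <= 0:
        raise ValueError("zero disparity: object at infinity or bad match")
    return focal_px * baseline_m / disparity_px

# A 700 px focal length, a 64 mm eye-like baseline, and a feature
# matched with 32 px of disparity places the object at 1.4 m:
z = stereo_depth_m(700, 0.064, 32)
```

Note how depth resolution degrades as disparity shrinks: distant objects produce only fractions of a pixel of disparity, which is why stereo accuracy falls off with range.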

Another 3D sensor approach, structured light, is perhaps best known as the technology employed in Microsoft's first-generation Kinect peripheral for PCs and the Xbox 360 game console. Structured light is an optical 3D scanning method that projects a set of patterns onto an object, capturing the resulting image with an image sensor. The fixed separation offset between the projector and sensor is leveraged in computing the depth to specific points in the scene, again using triangulation algorithms, this time by translating distortion of the projected patterns (caused by surface roughness) into 3D information.

A third approach, the ToF (time-of-flight) sensor, is increasingly common in depth sensing designs, found in the second-generation Kinect peripheral as well as Google's Project Tango platforms and several trendsetting VR HMD prototypes. A ToF system obtains travel-time information by measuring the delay or phase shift of a modulated optical signal for all pixels in the scene. Generally, this optical signal is situated in the near-infrared portion of the spectrum so as not to disturb human vision.

ToF sensors consist of arrays of pixels, where each pixel is capable of determining the distance to the scene. Each pixel measures the delay of the received optical signal with respect to the sent signal. A correlation function is performed in each pixel, followed by averaging or integration. The resulting correlation value then represents the travel time or delay. Since all pixels obtain this value simultaneously, "snap-shot" 3D imaging is possible.
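The per-pixel computation can be sketched as follows, assuming the common four-phase demodulation scheme in which each pixel collects correlation samples at 0, 90, 180 and 270 degree offsets (real sensors add calibration, noise filtering and range-ambiguity handling on top of this):

```python
import math

C_LIGHT = 299_792_458.0  # speed of light, m/s

def tof_distance_m(a0, a1, a2, a3, f_mod_hz):
    """Distance for one ToF pixel from four correlation samples taken at
    0/90/180/270-degree demodulation offsets. The recovered phase shift
    of the modulated signal is proportional to round-trip travel time."""
    phase = math.atan2(a3 - a1, a0 - a2) % (2 * math.pi)
    return C_LIGHT * phase / (4 * math.pi * f_mod_hz)

# Samples encoding a 90-degree phase shift at 30 MHz modulation land at
# one quarter of the ~5 m unambiguous range, i.e. roughly 1.25 m:
d = tof_distance_m(0.0, -1.0, 0.0, 1.0, 30e6)
```

Because every pixel performs this same computation in parallel, a full depth frame is captured in one exposure, which is the "snap-shot" 3D imaging property the text describes.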

Modern ToF systems are capable of simultaneously supporting long-range (e.g., environment mapping) and short-range (e.g., hand tracking) functions, with average power consumption below 300 mW for the combination of the active illumination source and the 3D imager (Figure 2). The imager is capable of operating at multiple concurrent frame rates and other settings, in order to capture multiple independent data streams for different tasks: 45 fps to realize close-range hand tracking, for example, while in parallel scanning the environment at 5 fps with increased exposure time.

Figure 2. ToF-based 3D cameras, being monocular in nature, enable compact and lightweight system designs as well as delivering cost-effective module calibration and manufacturing capabilities (courtesy Infineon Technologies).

ToF cameras ideally keep the FoV, i.e., the opening angle of the receiving lens, as narrow as possible, typically around 60 degrees horizontal, for optimum accuracy. However, a wider FoV of 100 degrees or more can be required to deliver seamless and natural movement recognition in the user's peripheral vision, a VR-unique requirement. Increasing the FoV translates into a decrease in pixel resolution, creating challenges for hand tracking algorithms; it also results in more complex illumination challenges and tradeoffs in the camera lens' optical performance. Increasing the pixel resolution, on the other hand, translates into a higher vision processing computational load, larger and heavier imagers and lenses, and greater system cost. Some application-dependent compromise between these two extremes will likely be necessary.
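The tradeoff can be made concrete as angular pixel density. For a hypothetical 224-pixel-wide depth array (the width is illustrative, not a figure from any specific sensor), widening the FoV from 60 to 100 degrees thins out the pixels available to a hand tracking algorithm:

```python
def pixels_per_degree(h_res_px, h_fov_deg):
    """Horizontal angular pixel density: at a fixed sensor resolution,
    widening the field of view leaves fewer pixels per degree of scene,
    so small targets such as fingers span fewer pixels."""
    return h_res_px / h_fov_deg

narrow = pixels_per_degree(224, 60)    # ~3.7 px/degree at 60-degree FoV
wide = pixels_per_degree(224, 100)     # 2.24 px/degree at 100-degree FoV
```

A finger that covers a comfortable handful of pixels at 60 degrees may shrink below the detection threshold at 100 degrees, which is exactly the compromise the paragraph above describes.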

Eye Tracking Capabilities

While outward-facing cameras are important in gaining an understanding of the environment surrounding a user, one or multiple cameras mounted within the HMD and facing the user can be used to understand what the user is looking at and to provide a user interface to the system. Eye tracking enables user gaze to find use as an interaction method, for example, automatically moving the cursor and/or other interface elements to wherever the user is looking at the time. Thanks to humans' tendency to rapidly shift sequential attention to different parts of a scene, gaze-based interfaces may become extremely popular in VR applications.

Eye tracking can also find use in a process known as "foveated rendering," a technique that can significantly reduce system computational and memory requirements. Within the anatomy of the human retina are found two types of light receptor cells: rods and cones. Cones, being well suited to color discernment and high visual acuity, are located most densely in the fovea, the center of the retina. This allocation translates in human perception terms to the center of our field of vision, where images appear crisp and clear.

The second light receptor type, the rod, is not color-sensitive, nor does it have high acuity. However, rods are comparatively more responsive to movement, as well as being more sensitive in low-light conditions. Rods are concentrated in the outer areas of the retina, creating what humans perceive as peripheral vision. Foveated rendering techniques take advantage of this anatomical arrangement by rendering high detail only in areas of the scene on which the eyes are currently concentrated, conversely rendering only partial detail in the periphery. This technique significantly reduces graphical rendering overhead, but it requires extremely rapid eye tracking to ensure that the user is never directly viewing a low-detail region.
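A toy shading-rate policy captures the idea: full detail within a foveal radius around the tracked gaze point, falling off toward the periphery. All radii and rates below are illustrative and not taken from any shipping renderer:

```python
import math

def render_detail(px, py, gaze_x, gaze_y, fovea_r=200, falloff_r=600):
    """Toy foveated-rendering policy: return the relative shading rate
    for a screen pixel given the tracked gaze point. Full detail inside
    the foveal radius, coarse detail beyond the falloff radius, and a
    linear blend in between (radii in screen pixels, illustrative)."""
    r = math.hypot(px - gaze_x, py - gaze_y)
    if r <= fovea_r:
        return 1.0                      # full shading rate at the fovea
    if r >= falloff_r:
        return 0.25                     # coarse peripheral shading
    t = (r - fovea_r) / (falloff_r - fovea_r)
    return 1.0 - 0.75 * t

center = render_detail(960, 540, 960, 540)   # at the gaze point: 1.0
edge = render_detail(0, 540, 960, 540)       # far periphery: 0.25
```

In a real renderer the returned rate would select a variable-rate-shading tier or mipmap bias; the critical dependency on low-latency eye tracking is that the gaze point must be updated before the eye can land on a coarsely shaded region.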

The sense of immersion can also be improved through eye tracking technology. Current stereoscopic VR systems, for example, do not take into account that the eye's flexible optics enable users to dynamically and rapidly change focal depth in a scene. When facing a static screen, the user is forced to focus on a static 2D plane, in spite of the fact that 3D content is being displayed. The result is a world that appears 3D, yet the eyes have no ability to focus near or far within it. Focal point detection is therefore necessary in order to realistically re-render a scene relative to the depth of where the user is currently focusing. As with previously discussed foveated rendering, this technique can reduce system computational and memory requirements; in this case, certain areas of the scene are dynamically downscaled based on what particular depth the user is focused on at the time.

Finally, eye tracking can find use in fine-tuning a VR system for any particular user's vision-related traits. Even minor person-to-person variations in interpupillary distance (eye spacing) can degrade a particular user's image perception sufficiently to cause nausea and/or dizziness (Table 1). VR HMDs typically assume an eye separation of 64 mm, but users' interpupillary distances can deviate from this average value by up to several millimeters.



Table 1. Interpupillary distance variations (mm), taken from a 1988 U.S. Army survey (courtesy Wikipedia).

The accommodation amplitude (the capability of the eye to focus over distance) also varies, not only from person to person but also as a person ages (Figure 3). In addition, the human eye does not naturally remain stationary but is constantly moving (in saccades); any lag or other asynchronous effects in the image being viewed can also contribute to an unnatural and otherwise irritating user experience.

Figure 3. Accommodation amplitude, the eye's ability to focus over distance, varies not only from person to person but also with age (courtesy Wikipedia).

Implementation Details

One possible eye tracking solution leverages near-infrared (i.e., 800-1000 nm) LED light sources, often multiple of them to improve accuracy, mounted within the VR headset and illuminating one of the user's eyes. Since an HMD is an enclosed system absent any interfering ambient light, dark-pupil tracking (where the light source can directly illuminate the pupil) is feasible. This technique is more accurate than the alternative bright-pupil solutions used, for example, in ADAS (advanced driver assistance systems) to ascertain a driver's alertness and attention to the road ahead.

The LEDs in VR HMDs are pulsed in order to both achieve power savings and improve measurement accuracy. In such a design, a monochrome CMOS image sensor that is tuned for near-IR peak performance can be used. This sensor is capture-synchronized to the pulses of the LED light source(s).

In order to achieve the necessary accuracy and capture performance for this particular application, a global shutter-style image sensor (which captures a full frame of image data at one instant) will likely be required. The alternative and more common rolling shutter-based sensor, wherein portions of the image are sequentially read out of the pixel array across the total exposure time, can result in distortion when images of fast-moving pupils or other objects are captured (Figure 4). Given that the human eye makes hundreds of miniature motions every second, extremely high-speed tracking is required, easily exceeding 100 Hz.
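A quick back-of-the-envelope sketch shows why rolling shutter is problematic here: the image-plane skew grows with object speed and total row readout time. The speed and sensor figures below are illustrative, chosen only to show the order of magnitude involved:

```python
def rolling_shutter_skew_px(obj_speed_px_per_s, num_rows, row_readout_s):
    """Worst-case skew (in pixels) of a moving object captured by a
    rolling-shutter sensor: the last row is read num_rows * row_readout_s
    later than the first, during which the object keeps moving."""
    return obj_speed_px_per_s * num_rows * row_readout_s

# A pupil sweeping 2000 px/s across a 480-row sensor with 30 us row
# readout smears by roughly 29 pixels from top row to bottom row:
skew = rolling_shutter_skew_px(2000, 480, 30e-6)
```

A global shutter sensor exposes all rows at the same instant, reducing this skew term to zero, which is why it is the preferred choice for tracking saccading eyes.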

Figure 4. A global shutter-based image sensor (top), unlike the rolling shutter alternative (bottom), delivers the performance and undistorted image quality necessary for eye tracking designs (courtesy ON Semiconductor).

Tracking performance needs also guide system processor selection. While eye tracking algorithms alone can run on general-purpose application processors, dedicated vision processors or other coprocessors may also be required if operations such as head tracking and gesture recognition are also to be supported. Due to the large number of sensors (vision and otherwise) likely present in a VR system, the aggregate bandwidth of their common connection to the host also bears careful consideration.

More generally, eye tracking and other vision functions are key factors in determining overall system requirements, since they (along with foundation graphics rendering) must all operate with extremely low latencies in order to minimize motion sickness effects. In an un-tethered HMD scenario, this requirement means that processing and other resources across numerous vision tasks must be carefully distributed, while retaining sufficient spare GPU headroom for realistic and high-speed rendering. In a tethered setting, one must also consider the total bandwidth requirements of data from incoming HMD-based sensors, through system DRAM, and out to the display via a PC or game console GPU.


VR is one of the hottest products in technology today, and its future is bright both in the current consumer-dominated market and a host of burgeoning commercial applications. System capabilities delivered by vision processing-enabled functions such as environment mapping, user body tracking, and gaze tracking are key features that will transform today's robust VR market forecasts into tomorrow's reality. And more generally, vision technology is enabling a wide range of products that are more intelligent and responsive than before, and thus more valuable to users. Vision processing can add valuable capabilities to existing products. And it can provide significant new markets for hardware, software and semiconductor suppliers (see sidebar "Additional Developer Assistance").

Sidebar: Additional Developer Assistance

The Embedded Vision Alliance, a worldwide organization of technology developers and providers, is working to empower product creators to transform the potential of vision processing into reality. Infineon Technologies, Movidius, ON Semiconductor and videantis, the co-authors of this article, are members of the Embedded Vision Alliance. The Embedded Vision Alliance's mission is to provide product creators with practical education, information and insights to help them incorporate vision capabilities into new and existing products. To execute this mission, the Embedded Vision Alliance maintains a website providing tutorial articles, videos, code downloads and a discussion forum staffed by technology experts. Registered website users can also receive the Embedded Vision Alliance’s twice-monthly email newsletter, Embedded Vision Insights, among other benefits.

The Embedded Vision Alliance also offers a free online training facility for vision-based product creators: the Embedded Vision Academy. This area of the Embedded Vision Alliance website provides in-depth technical training and other resources to help product creators integrate visual intelligence into next-generation software and systems. Course material in the Embedded Vision Academy spans a wide range of vision-related subjects, from basic vision algorithms to image pre-processing, image sensor interfaces, and software development techniques and tools such as OpenCL, OpenVX and OpenCV, along with Caffe, TensorFlow and other deep learning frameworks. Access is free to all through a simple registration process.

The Embedded Vision Alliance also holds Embedded Vision Summit conferences. Embedded Vision Summits are technical educational forums for product creators interested in incorporating visual intelligence into electronic systems and software. They provide how-to presentations, inspiring keynote talks, demonstrations, and opportunities to interact with technical experts from Embedded Vision Alliance member companies. These events are intended to inspire attendees' imaginations about the potential applications for practical computer vision technology through exciting presentations and demonstrations, to offer practical know-how for attendees to help them incorporate vision capabilities into their hardware and software products, and to provide opportunities for attendees to meet and talk with leading vision technology companies and learn about their offerings.

The most recent Embedded Vision Summit was held in May 2016, and a comprehensive archive of keynote, technical tutorial and product demonstration videos, along with presentation slide sets, is available on the Embedded Vision Alliance website and YouTube channel. The next Embedded Vision Summit, along with accompanying workshops, is currently scheduled to take place on May 1-3, 2017 in Santa Clara, California. Please reserve a spot on your calendar and plan to attend.

By Brian Dipert
Editor-in-Chief, Embedded Vision Alliance

Martin Lass
Product Marketing Manager, Infineon Technologies

Jack Dashwood
Marketing Communications Director, Movidius

Guy Nicholson
Marketing Director, Mobile & Consumer Division, ON Semiconductor

Marco Jacobs
Vice President of Marketing, videantis

Vision Processing Opportunities in Drones

Bookmark and Share

Vision Processing Opportunities in Drones

UAVs (unmanned aerial vehicles), commonly known as drones, are a rapidly growing market and increasingly leverage embedded vision technology for digital video stabilization, autonomous navigation, and terrain analysis, among other functions. This article reviews drone market sizes and trends, and then discusses embedded vision technology applications in drones, such as image quality optimization, autonomous navigation, collision avoidance, terrain analysis, and subject tracking. It also introduces an industry alliance available to help product creators incorporate robust vision capabilities into their drone designs.

UAVs (unmanned aerial vehicles), which the remainder of this article will refer to by their more common "drone" name, are a key potential growth market for embedded vision. A drone's price point, and therefore its cost, is always a key consideration, so the addition of vision processing capabilities must not incur a significant bill-of-materials impact. Maximizing flight time is also a key consideration for drones, so low incremental weight and power consumption are also essential for vision processing hardware and software. Fortunately, the high performance, cost effectiveness, low power consumption, and compact form factor of various vision processing technologies have now made it possible to incorporate practical computer vision capabilities into drones, along with many other kinds of systems, and a rapid proliferation of the technology is therefore already well underway.

Note, for example, that vision processing can make efficient use of available battery charge capacity by autonomously selecting the most efficient flight route. More generally, vision processing, when effectively implemented, will incur a notably positive return on your integration investment, as measured both by customer brand and model preference and the incremental price they're willing to pay for a suitably equipped drone design.

Market Status and Forecasts

The market growth opportunity for vision processing in consumer drones, both to expand the total number of drone owners and to encourage existing owners to upgrade their hardware, is notable. Worldwide sales of consumer drones reached $1.9 billion in 2015, according to market analysis firm Tractica, and the market will continue to grow rapidly over the next few years, reaching a value of $5 billion by 2021. Tractica also forecasts that worldwide consumer drone unit shipments will increase from 6.4 million units in 2015 to 67.7 million units annually by 2021 (Figure 1).

Figure 1. Consumer drone market growth is expected to be dramatic in the coming years (top); drone shipments into commercial applications will also significantly increase (bottom).

The demand for consumer drones, according to Tractica, is driven by the global trends of increasing enthusiasm for high definition imaging for personal use, recreational activities, and aerial games. The possibilities of combining augmented and virtual reality, along with the expanded capabilities of drones and smart devices, is also creating lots of new opportunities in the market. Consumer drones are seeing improved design, quality, and features, while becoming increasingly affordable due to the falling costs of various components, particularly image sensors and the cameras containing them, along with associated imaging and vision processors.

Equally notable, albeit perhaps less widely discussed, is the opportunity for drones to generate value in commercial markets such as film and media, agriculture, oil and gas, and insurance. While the military has been using drones for some time, drones were virtually nonexistent in the commercial market until recently. Decreasing technology costs, leading to decreasing prices, coupled with the emergence of new applications with strong potential return on investment, have created new markets for commercial drones. Although the fleet of commercial drones is currently limited in size, Tractica forecasts that worldwide commercial drone shipments will grow at a rapid pace in the coming decade, rising from approximately 80,000 units in 2015 to 2.7 million units annually by 2025.

Commercial applications for drones fall into two major categories: aerial imagery and data analysis. Imaging applications involve the utilization of a drone-mounted camera for a multitude of purposes, ranging from the ability to capture aerial footage to the creation of digital elevation maps by means of geo-referencing capabilities. Users have the ability to capture an abundance of images, on their own time schedule and at affordable pricing.

For data analysis applications, one key value of flying a commercial drone happens post-flight. Data collection and image processing capabilities and techniques deliver the ability to produce fine-grained data; anything from crop quantity to water quality can be assessed in a fraction of the time and cost it would take with a low-flying airplane. The reports produced post-flight can offer end users an easy-to-read product that adds value to their operations.

The current commercial usage of drones is centered on niche use cases, although Tractica expects that usage will broaden significantly over the next several years, in the process generating a sizable market not just for drone hardware but also for drone-enabled services. Specifically, Tractica forecasts that global drone-enabled services revenue will increase from $170 million in 2015 to $8.7 billion by 2025 (Figure 2). Most drone-enabled services will rely on onboard imaging capabilities; the largest applications will include filming and entertainment, mapping, prospecting, and aerial assessments.

Figure 2. Commercial drone-enabled services revenue will increase in lockstep with hardware sales growth.

In the near term, the four main industries that will lead this market are film, agriculture, media, and oil and gas. The aerial imagery and data analytics functions mentioned previously are the primary drivers for their use in these industries. The capacity to collect, analyze, and deliver information in near real time will continue to be a reason for industries to adopt this technology in their supply chains.

Application Opportunities and Function Needs

Computer vision is a key enabling technology for drone-based applications. Most of these applications today are cloud-based, with imaging data transmitted from the drone via a wireless connection to a backend server, where the relevant data is extracted and analyzed. This approach works well if there's enough bandwidth to send the images over the air with the required quality, and if the overall delay between image capture and analysis is acceptable. Newer computer vision applications that run nearly or completely on the drones themselves show promise in opening new markets for drone manufacturers as well as their hardware and software suppliers. Such applications further expand the capabilities of drones to include real-time availability of results, enabling faster decision making by users.

One computer vision-enabled function that's key to these emerging real-time applications for drones is self-navigation. Currently, most drones are flown manually, and battery capacity limits flight time to around 30 to 40 minutes. Vision-based navigation, already offered in trendsetting consumer drones, conversely enables them to chart their own course from point of origin to destination. Such a drone could avoid obstacles, such as buildings and trees, as well as more generally be capable of calculating the most efficient route. Self-navigation will not only obviate the need for an operator in some cases (as well as more generally assist the operator in cases of loss of manual control, when the drone is out of sight, etc.), but will also enable extended battery life, thus broadening the potential applications for drones.

Object tracking is another function where onboard computer vision plays an important role. If a drone is following the movement of a car or a person, it must know what that object looks like and how to track it. Currently, object tracking is largely a manual process; an operator controls the drone via a drone-sourced video feed. In the near future (for commercial applications) and already available (again, in trendsetting consumer drones), conversely, a user can tell the drone to track an object of interest and the drone has sufficient built-in intelligence to navigate itself while keeping the object in sight. Such a function also has the potential to be used in sports, for example, where drones can track the movements of individual players.

Real-time processing is already being used in asset tracking for the construction and mining industries. In such an application, the drone flies over a work site, performs image analysis, identifies movable assets (such as trucks), and notifies the user of their status. Similar technology can also be used in the retail industry to assess inventory levels.

Image Quality Optimization

As previously mentioned, high quality still and video source images are critical to the robust implementation of any subsequent vision processing capabilities. Until recently, correction for the effects of motion, vibration and poor lighting required expensive mechanical components, such as gimbals and customized, multi-element optical lenses. However, new electronic image and video stabilization approaches can eliminate the need for such bulky and complex mechanical components. These new solutions leverage multiple state-of-the-art, real-time image analysis techniques, along with detailed knowledge of drone location, orientation and motion characteristics via accelerometers, gyroscopes, magnetometers and other similar "fusion" sensors, to deliver robust image stabilization performance (Figure 3).

Figure 3. Electronic image stabilization employs multiple data inputs and consists of several function blocks.

Synchronization ensures that the video data and sensor fusion data are time-stamped by the same clock, so that the exact location, orientation and motion of the drone at the time of capture of each video frame are known. This critical requirement needs to be taken into consideration when designing camera platforms.
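As a sketch of what such clock-aligned time-stamping enables, the helper below linearly interpolates an IMU channel (a gyro rate, say) to a video frame's capture timestamp. The function name and sample values are illustrative, and a production pipeline would interpolate full orientation quaternions rather than a single scalar:

```python
import bisect

def imu_at(frame_ts, imu_ts, imu_vals):
    """Linearly interpolate an IMU channel to a video frame's capture
    timestamp, assuming both streams are stamped by the same clock.
    imu_ts must be sorted ascending; values outside the range clamp."""
    i = bisect.bisect_left(imu_ts, frame_ts)
    if i == 0:
        return imu_vals[0]
    if i == len(imu_ts):
        return imu_vals[-1]
    t0, t1 = imu_ts[i - 1], imu_ts[i]
    w = (frame_ts - t0) / (t1 - t0)
    return imu_vals[i - 1] * (1 - w) + imu_vals[i] * w

# A frame stamped midway between two 200 Hz gyro samples of 2.0 and 4.0
# rad/s yields an interpolated rate of about 3.0:
rate = imu_at(0.0125, [0.010, 0.015], [2.0, 4.0])
```

Without a shared clock, the interpolation weights are meaningless, which is why the synchronization requirement above must be designed into the camera platform rather than patched in afterward.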

A dedicated DCE (distortion correction engine) is one example of a hardware approach to electronic image stabilization. The DCE is a dedicated hardware block (either a chip or silicon IP) that provides flexible 2D image warping capabilities. It supports a wide variety of image deformations, including lens distortion correction, video stabilization (including simultaneous lens geometry correction), perspective correction, and fisheye rectification. Advanced resampling algorithms enable the DCE to deliver high image quality with high-resolution video sources at high frame rates (up to 8K @ 60 FPS) while consuming 18 mW of power.
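As an illustration of the kind of 2D warping such an engine performs, the sketch below builds a per-pixel sampling grid for a simple one-parameter radial distortion model. The model and coefficient are illustrative; production correction grids also fold in stabilization and rolling-shutter terms, as described below:

```python
import numpy as np

def correction_grid(w, h, k1):
    """Build a per-pixel sampling grid for a simple radial lens
    distortion model, x' = x * (1 + k1 * r^2) in normalized coordinates.
    grid_x[y, x] / grid_y[y, x] give the source coordinate to sample
    when producing output pixel (x, y); a hardware warper would consume
    a grid of exactly this shape."""
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    # Normalize to [-1, 1] around the image center.
    nx = (xs - w / 2) / (w / 2)
    ny = (ys - h / 2) / (h / 2)
    r2 = nx * nx + ny * ny
    gx = nx * (1 + k1 * r2) * (w / 2) + w / 2
    gy = ny * (1 + k1 * r2) * (h / 2) + h / 2
    return gx, gy

# Mild barrel-distortion correction: the image center maps to itself
# while the corners pull inward toward the center.
gx, gy = correction_grid(640, 480, k1=-0.1)
```

Resampling the source image through this grid (with bilinear or higher-order filtering, as the DCE's "advanced resampling algorithms" imply) produces the corrected output frame.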

In the absence of a dedicated DCE or equivalent hardware, electronic image stabilization solutions can alternatively leverage the platform's GPU and/or other available heterogeneous computing resources to implement various functions. Such an approach is cost-effective, because it harnesses already-existing processors rather than adding resources to the SoC and/or system design. However, comparative image quality may be sub-optimal, due to capability constraints, and power consumption may also be correspondingly higher.

An adaptive video stabilization filter computes image correction grids, thereby achieving natural-looking video across varied recording conditions (Figure 4). It analyzes camera motion and dynamically modifies stabilization characteristics to provide steady shots, as well as reacting quickly when it determines that camera movements are intentional rather than inadvertent. Such a filter can leverage motion sensor inputs to address situations in which image analysis alone cannot provide definitive camera motion estimates, thereby further improving stabilization reliability. It also combines lens distortion correction, video stabilization and rolling shutter correction into a unified correction function, in order to minimize the number of image data transfers to and from main memory. The integrated approach also improves image quality without the need for multi-sampling.

Figure 4. High frequency rolling shutter artifact removal employs a correction grid (top) to eliminate distortions (bottom).

Autonomous Navigation and Collision Avoidance

Once high quality source images are in hand, embedded vision processing can further be employed to implement numerous other desirable drone capabilities, such as enabling them to fly without earth-bound pilot intervention. A simplified form of autonomous navigation for drones has been available for some time, leveraging GPS localization technologies. Developing an algorithm that tells a drone to travel to a specific set of destination GPS coordinates is fairly simple, and the drone would execute the algorithm in a straightforward and predictable manner...as long as there are no other drones contending for the same airspace, or other impeding objects en route.
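The GPS-only portion of this navigation really is simple. For example, the initial great-circle bearing from the drone's current fix to a destination waypoint is a few lines of spherical trigonometry (a sketch with illustrative coordinates, and with no awareness of the obstacles discussed next):

```python
import math

def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing, in degrees clockwise from north,
    from the current GPS fix (lat1, lon1) to a destination waypoint
    (lat2, lon2). This is the core of naive GPS-only navigation, which
    knows nothing about buildings, trees, or other drones en route."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    x = math.sin(dlon) * math.cos(p2)
    y = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dlon)
    return math.degrees(math.atan2(x, y)) % 360

# A waypoint due east of the origin lies at a bearing of 90 degrees:
b = bearing_deg(0.0, 0.0, 0.0, 1.0)
```

Flying this bearing blindly is precisely the failure mode described next; vision processing supplies the obstacle awareness that GPS coordinates alone cannot.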

Unfortunately, many of today's drone owners have discovered the hard way the inherent limitations of GPS-only navigation, after their expensive UAV smashes into a building, bridge, tree, pole, etc. Note, too, that it would be practically impossible to create, let alone maintain, sufficiently detailed 3D maps of all possible usage environments, which drones could theoretically use for navigation purposes as do fully autonomous vehicles on a limited-location basis. Fortunately, vision technologies can effectively tackle the navigation need, enabling drones to dynamically react to their surroundings so they can route from any possible origin to destination, while avoiding obstacles along the way.

Collision avoidance is not only relevant for fully autonomous navigation, but also for "copilot" assistance when the drone is primarily controlled by a human being, analogous to today's ADAS (advanced driver assistance systems) features in vehicles. The human pilot might misjudge obstacles, for example, or simply be flying the drone with limited-to-no visibility when it's travelling laterally, backward, or even taking off or landing. In such cases, the pilot might not have a clear view of where the drone is headed; the drone's own image capture and vision processing intelligence subsystems could offer welcome assistance.

The technologies that implement collision avoidance functions are based on the use of one or more cameras coupled to processors that perform the image analysis, extracting the locations, distances and sizes of obstacles and then passing this information to the drone’s autonomous navigation system. The challenges to integrating such technologies include the need to design an embedded vision system that performs quickly and reliably enough to implement foolproof collision avoidance in line with the speed of the drone, in 3 dimensions, and not only when the drone travels along a linear path but also in combination with various rotations.

The vision subsystem's cost, performance, power consumption, size and weight all need to be well matched to the drone's dimensions, target flight time, and capabilities. Different embedded vision camera, processor and other technologies will deliver different tradeoffs in all of these areas. Keep in mind, too, that beyond the positional evaluation of static obstacles now becoming available in trendsetting drones, the "holy grail" of collision avoidance also encompasses the detection of other objects in motion, such as birds and other drones. Such capabilities will require even more compute resources than are necessary today, since reaction times will need to improve significantly and much smaller objects will need to be detected. And as the speeds of both the drone and potential obstacles grow, the need for precision detection and response further increases, since the drone will need to discern objects that are even smaller and farther away than before.

Terrain Analysis and Subject Tracking

For both drone flight autonomy and environment analysis purposes, at least some degree of terrain understanding is necessary. At the most basic level, this information may comprise altitude and/or longitude and latitude, derived from an altimeter (barometer) and/or a GPS receiver. Embedded vision, however, can deliver much more advanced terrain analysis insights.

Downward-facing image sensors, for example, can capture and extract information that can provide a drone with awareness of its motion relative to the terrain below it. By using approaches such as optical flow, where apparent motion is tracked from frame to frame, a downward-facing camera and associated processor can even retrace a prior motion path. This awareness of position relative to the ground is equally useful, for example, in circumstances where a GPS signal is unavailable, or where a drone needs to be able to hover in one place without drifting.

As previously mentioned, terrain is increasingly being mapped for commercial purposes, where not only traditional 2D, low-resolution satellite data is needed, but high-resolution and even 3D data is increasingly in demand. A number of techniques exist to capture and re-create terrain in detail. Traditional image sensors, for example, can find use in conjunction with a technique known as photogrammetry in order to "stitch" multiple 2D still images into a 3D map. Photogrammetry not only involves capturing a large amount of raw image data, but also requires a significant amount of compute horsepower. Today, cloud computing is predominantly utilized to generate such 3D models but, as on-drone memory and compute resources become more robust in the future, real-time drone-resident photogrammetry processing will become increasingly feasible.

Alternatives to photogrammetry involve approaches such as LIDAR (laser light-based radar), which can provide extremely high-resolution 3D representations of a space at the expense of the sensor's substantial size, weight, and cost. Mono or stereo pairs of RGB cameras can also be utilized for detecting structure from motion (further discussed in the next section of this article), generating 3D point clouds that are then used to create mesh models of terrain. Such an approach leverages inexpensive conventional image sensors, at the tradeoff of increased necessary computational requirements. Regardless of the image capture technology chosen, another notable consideration for terrain mapping involves available light conditions. Whether or not reliable ambient light exists will determine if it's possible to use a passive sensor array, or if a lighting source (visible, infrared, etc) must be supplied.

A related function to terrain mapping is subject tracking, such as is seen with the “follow me” functions in consumer and prosumer drones. Autonomous vision-based approaches tend to outperform remote-controlled, human-guided approaches here, due to both comparative tracking accuracy and the ability to function without the need for a transmitter mounted on the subject. Subject tracking can be achieved through computer vision algorithms that extract “features” from video frames. Such features depend on the exact approach but are often salient points such as corners, areas of high contrast, and edges. These features can be assigned importance either manually, e.g. by the user drawing a bounding box around the object that the drone should track, or via pre-tracking object classification.

3D Image Capture and Data Extraction

One of the key requirements for a drone to implement any or all of the previously discussed functions is that it have an accurate and complete understanding of its surroundings. The drone should be aware of where other objects are in full 3D space, as well as its own location, direction and velocity. Such insights enable the drone to calculate critical metrics such as distance to the ground and other objects, and therefore time to impact with them. The drone can then plan its course in advance, as well as take appropriate corrective action en route.

It’s common to believe that humans exclusively use their two eyes to sense depth information. Analogously, various techniques exist to discern distance in the computer vision realm, using special cameras. Stereo camera arrays, for example, leverage correspondence between two perspective views of the same scene to calculate depth information. LIDAR measures the distance to an object by illuminating a target with a laser and analyzing the reflected light. Time-of-flight cameras measure the delay of a light signal between the camera and the subject for each point of the image. And the structured light approach projects onto the scene a pattern that is subsequently captured, with distortions extracted and interpreted to determine depth information.
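The stereo camera approach just described can be illustrated with a toy pinhole-camera calculation (a sketch only; the focal length, baseline and disparity values below are hypothetical): depth follows directly from the disparity, i.e. the pixel shift of the same scene point between the two rectified views.

```python
def stereo_depth(focal_px, baseline_m, disparity_px):
    """Depth (meters) via triangulation: Z = f * B / d, assuming
    rectified cameras with focal length f (pixels) and baseline B (meters)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# Hypothetical values: 700-pixel focal length, 12 cm baseline,
# 20-pixel measured disparity
depth = stereo_depth(700.0, 0.12, 20.0)
print(round(depth, 2))  # 4.2 (meters)
```

Note the inverse relationship: nearby obstacles produce large disparities and are measured most accurately, while depth resolution degrades for distant objects.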

Keep in mind, however, that humans are also easily able to catch a ball with one eye closed. In fact, research has shown that humans primarily use monocular vision to sense depth, via a cue called motion parallax. As we move around, objects that are closer to us move farther across our field of view than do more distant objects. This same cue is leveraged by the structure from motion algorithm to sense depth in scenes captured by conventional mono cameras (see sidebar "Structure from Motion Implementation Details").

With the structure from motion approach, active illumination of the scene (which can both limit usable range and preclude outdoor use) is likely not required. Since a conventional camera (likely already present in the drone for traditional image capture and streaming purposes) suffices, versus a more specialized depth-sensing camera, cost is significantly reduced. Leveraging existing compact and lightweight cameras also minimizes the required size and payload of the implementation, thereby maximizing flight time for a given battery capacity.

In comparing 3D sensor-based versus monocular approaches, keep in mind that the former can often provide distance information without any knowledge of the objects in the scene. With a stereo camera approach, for example, you only need to know the relative "pose" between the two cameras. Conversely, monocular techniques need to know at least one distance measurement in the scene, i.e. between the scene and the camera, in order to resolve any particular scene object's distance and speed. Returning to the earlier example, a human can catch a ball one-eyed because he or she can estimate the ball's size from past experience. Some embedded vision systems will therefore employ both monocular and 3D sensor approaches, since stereo vision processing (for example) can be costly in terms of processing resources but may be more robust in the results it delivers.

Deep Learning for Drones

Traditional post-capture and -extraction image analysis approaches are now being augmented by various machine learning techniques that can deliver notable improvements in some cases. Tracking, for example, is more robust across the dynamic lighting, weather and other environmental conditions commonly experienced by drones, vehicles and other applications, along with more reliably accounting for changes in the subject being tracked (such as a person who changes posture or otherwise moves). Deep learning, a neural network-based approach to machine learning, is revolutionizing the ways that we think about autonomous capabilities and, more generally, is solving problems in a variety of disciplines that previously couldn't be addressed in such a robust manner.

Deep learning-based approaches tend to work well in unfamiliar situations, as well as being robust in the face of noisy and otherwise incomplete inputs. Such characteristics make them a good choice for drones and other situations where it is not possible to control the environment, or to otherwise completely describe the problem to be solved in advance across all possible scenarios. The most-studied application of deep learning to date is image classification, i.e. processing an image and recognizing what objects it contains. Deep learning has been shown to perform notably better than traditional computer vision approaches for this particular task, in some cases even better than humans.

As an example of the power of deep learning in image classification, consider the ImageNet Challenge. In this yearly competition, researchers enter systems that can classify objects in images. In 2012, the first-ever deep learning-based system included in the ImageNet Challenge lowered the error rate versus traditional computer vision-based approaches by nearly 40%, going from a 26% error rate to a 16% error rate. Since then, deep learning-based approaches have dominated the competition, delivering a super-human 3.6% error rate in 2015 (human beings, in comparison, score an approximate 4.9% error rate on the same image data set).

Common real-world drone applications that benefit from deep learning techniques include:

  • Image classification: Detecting and classifying infrastructure faults during routine inspections
  • Security: Identifying and tracking people of interest, locating objects, and flagging unusual situations
  • Search and rescue: Locating people who are lost
  • Farm animal and wildlife management: Animal tracking

Deep learning is also a valuable capability in many other applications, such as power line detection, crop yield analysis and improvement and other agriculture scenarios, and stereo matching and segmentation for navigation.

Deep learning-based workflows are notably different (and in many ways simpler) than those encountered in traditional computer vision. Conventional approaches require a software engineer to develop vision algorithms and pipelines that are capable of detecting the relevant features for the problem at hand. This requires expertise in vision algorithms, along with a significant time investment to iteratively fine-tune performance and throughput toward the desired result.

With deep learning, conversely, a data scientist designs the topology of the neural network, subsequently exposing it to a large dataset (consisting of, for the purposes of this article's topics, a collection of images), in an activity known as training. During training, the neural network automatically learns the important features for the dataset, without human intervention. Algorithm advances in conjunction with the availability of cost-effective high-capacity storage and highly parallel processing architectures mean that a network that previously would have taken weeks or months to train can now be developed in mere hours or days.

The resulting neural network model is then deployed on the drone or other target system, which is then exposed to new data, from which it autonomously draws conclusions; classifying objects in images captured from an onboard camera, for example. This post-training activity, known as inferencing, is not as computationally intensive as training but still requires significant processing capabilities. As with training, inference typically benefits greatly from parallel processing architectures; it can also optionally take place on the drone itself, in the "cloud" (if latency is not a concern), or partitioned across both. In general, increased local processing capabilities tend to translate into both decreased latency and increased overall throughput, assuming memory subsystem bandwidth and other system parameters are equally up to the task.


Drones are one of the hottest products in technology today, and their future is bright both in the current consumer-dominated market and a host of burgeoning commercial applications. Vision processing-enabled capabilities such as collision avoidance, broader autonomous navigation, terrain analysis and subject tracking are key features that will transform today's robust drone market forecasts into tomorrow's reality. And more generally, vision technology is enabling a wide range of products that are more intelligent and responsive than before, and thus more valuable to users. Vision processing can add valuable capabilities to existing products. And it can provide significant new markets for hardware, software and semiconductor suppliers (see sidebar "Additional Developer Assistance").

Sidebar: Structure from Motion Implementation Details

The structure from motion algorithm consists of three main steps:

  1. Detection of feature points in the view
  2. Tracking feature points from one frame to another
  3. Estimation of the 3D position of these points, based on their motion

The first step is to identify candidate points in a particular image that can be robustly tracked from one frame to the next (Figure A). Features in texture-less regions such as blank walls and blue skies are difficult to localize. Conversely, many objects in a typical scene contain enough texture and geometry data to enable robust tracking from one frame to another. Locations in the captured image where you find gradients in two significantly different orientations, for example, are good feature point candidates. Such features show up in the image as corners or other places where two lines come together. Various feature detection algorithms exist, many of which have been widely researched in the computer vision community. The Harris feature detector, for example, works well in this particular application.

Figure A. The first step in the structure from motion algorithm involves identifying reliable frame-to-frame tracking feature point candidates (courtesy VISCODA).
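The "gradients in two significantly different orientations" criterion can be sketched with a minimal Harris-style corner response (an illustrative NumPy toy, not VISCODA's or any production implementation; the 3x3 box smoothing and k value are simplifying assumptions):

```python
import numpy as np

def harris_response(img, k=0.04):
    """Per-pixel Harris response R = det(M) - k * trace(M)^2, where M is
    the structure tensor built from image gradients."""
    Iy, Ix = np.gradient(img.astype(float))
    Ixx, Iyy, Ixy = Ix * Ix, Iy * Iy, Ix * Iy

    def box3(a):
        # 3x3 box-filter sum (a crude stand-in for Gaussian smoothing)
        p = np.pad(a, 1)
        return sum(p[i:i + a.shape[0], j:j + a.shape[1]]
                   for i in range(3) for j in range(3))

    Sxx, Syy, Sxy = box3(Ixx), box3(Iyy), box3(Ixy)
    return Sxx * Syy - Sxy * Sxy - k * (Sxx + Syy) ** 2

# Synthetic 15x15 image with one bright square: corners exhibit gradients
# in two orientations and score high; edge midpoints (one orientation) do not
img = np.zeros((15, 15))
img[4:11, 4:11] = 1.0
R = harris_response(img)
print(R[4, 4] > R[7, 4])  # True: the corner outranks the edge midpoint
```

Texture-less regions (such as the image border here) produce a zero response, matching the observation above that blank areas are difficult to localize.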

Next, the structure from motion algorithm needs to track these feature points from frame to frame, in order to find out how much they've moved in the image. The Lucas-Kanade optical flow algorithm is typically used for this task. The Lucas-Kanade algorithm first builds a multi-scale image pyramid, where each level is a smaller, scaled-down version of the originally captured frame. The algorithm then searches in the vicinity of the previous frame’s feature point location for a match in the current image frame. Once a match is found, this position becomes the initial estimate for the feature's location in the next-larger image in the pyramid; the algorithm travels through the pyramid until it reaches the original image resolution version. This way, it's also possible to track larger feature displacements.
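A single-level Lucas-Kanade step (omitting the multi-scale pyramid described above) can be sketched as follows: within a window around the feature, spatial and temporal image gradients define a 2x2 linear system whose solution is the feature's frame-to-frame displacement. The synthetic Gaussian "blob" and all parameter values below are purely illustrative.

```python
import numpy as np

def lk_flow(frame0, frame1, y, x, win=7):
    """Estimate (vx, vy) displacement of the window centered at (y, x)."""
    Iy, Ix = np.gradient(frame0.astype(float))
    It = frame1.astype(float) - frame0.astype(float)
    h = win // 2
    sl = (slice(y - h, y + h + 1), slice(x - h, x + h + 1))
    ix, iy, it = Ix[sl].ravel(), Iy[sl].ravel(), It[sl].ravel()
    # Least-squares solution of  [ix iy] @ [vx, vy]^T = -it  over the window
    A = np.array([[ix @ ix, ix @ iy],
                  [ix @ iy, iy @ iy]])
    b = -np.array([ix @ it, iy @ it])
    return np.linalg.solve(A, b)

# A smooth blob that moves one pixel to the right between frames
yy, xx = np.mgrid[0:21, 0:21].astype(float)
def blob(cx):
    return np.exp(-((xx - cx) ** 2 + (yy - 10.0) ** 2) / (2 * 2.0 ** 2))

vx, vy = lk_flow(blob(10.0), blob(11.0), 10, 10)
print(vx, vy)  # vx close to 1, vy close to 0
```

For displacements much larger than a pixel, this single-level estimate degrades, which is precisely why the pyramid described above is used in practice.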

The result consists of two lists of corresponding feature points, one for the previous image and one for the current image. From these point pairs, the structure from motion algorithm can define and solve a linear system of equations that determines the camera motion, and consequently the distance of each point from the camera (Figure B). The result is a sparse 3D point cloud in "real world" coordinates that covers the camera’s viewpoint. Subsequent consecutive image frames typically add additional feature points to this 3D point cloud, combining it into a point database that samples the scene more densely. Multiple cameras, each capturing its own point cloud, are necessary in order to capture the drone's entire surroundings. These individual-perspective point clouds can then be merged into a unified data set, capable of functioning as a robust input to a subsequent path planning or collision avoidance algorithm.

Figure B. After identifying corresponding feature points in consecutive image frames, solving a system of linear equations can determine motion characteristics of the camera and/or subject, as well as the distance between them (courtesy VISCODA).

Sidebar: Additional Developer Assistance

The Embedded Vision Alliance, a worldwide organization of technology developers and providers, is working to empower product creators to transform the potential of vision processing into reality. FotoNation, Movidius, NVIDIA, NXP and videantis, co-authors of this article, are members of the Embedded Vision Alliance. The Embedded Vision Alliance's mission is to provide product creators with practical education, information and insights to help them incorporate vision capabilities into new and existing products. To execute this mission, the Embedded Vision Alliance maintains a website providing tutorial articles, videos, code downloads and a discussion forum staffed by technology experts. Registered website users can also receive the Embedded Vision Alliance’s twice-monthly email newsletter, Embedded Vision Insights, among other benefits.

The Embedded Vision Alliance also offers a free online training facility for vision-based product creators: the Embedded Vision Academy. This area of the Embedded Vision Alliance website provides in-depth technical training and other resources to help product creators integrate visual intelligence into next-generation software and systems. Course material in the Embedded Vision Academy spans a wide range of vision-related subjects, from basic vision algorithms to image pre-processing, image sensor interfaces, and software development techniques and tools such as OpenCL, OpenVX and OpenCV, along with Caffe, TensorFlow and other deep learning frameworks. Access is free to all through a simple registration process.

The Embedded Vision Alliance also holds Embedded Vision Summit conferences. Embedded Vision Summits are technical educational forums for product creators interested in incorporating visual intelligence into electronic systems and software. They provide how-to presentations, inspiring keynote talks, demonstrations, and opportunities to interact with technical experts from Embedded Vision Alliance member companies. These events are intended to inspire attendees' imaginations about the potential applications for practical computer vision technology through exciting presentations and demonstrations, to offer practical know-how for attendees to help them incorporate vision capabilities into their hardware and software products, and to provide opportunities for attendees to meet and talk with leading vision technology companies and learn about their offerings.

The most recent Embedded Vision Summit was held in May 2016, and a comprehensive archive of keynote, technical tutorial and product demonstration videos, along with presentation slide sets, is available on the Embedded Vision Alliance website and YouTube channel. The next Embedded Vision Summit, along with accompanying workshops, is currently scheduled to take place on May 1-3, 2017 in Santa Clara, California. Please reserve a spot on your calendar and plan to attend.

By Brian Dipert
Editor-in-Chief, Embedded Vision Alliance

Kevin Chen
Senior Director, Technical Marketing, FotoNation

Jack Dashwood
Marketing Communications Director, Movidius

Jesse Clayton
Product Manager of Autonomous Machines, NVIDIA

Stéphane François
Software Program Manager, NXP Semiconductors

Anand Joshi
Senior Analyst, Tractica

Manoj Sahi
Research Analyst, Tractica

Marco Jacobs
Vice President of Marketing, videantis

Embedded Vision Application: A Design Approach for Real Time Classifiers

This article was originally published at PathPartner Technology's website. It is reprinted here with the permission of PathPartner Technology.

FPGAs for Deep Learning-based Vision Processing

FPGAs have proven to be a compelling solution for solving deep learning problems, particularly when applied to image recognition. The advantage of using FPGAs for deep learning is primarily derived from several factors: their massively parallel architectures, efficient DSP resources, and large amounts of on-chip memory and bandwidth. An illustration of a typical FPGA architecture is shown in Figure 1.

Deep Learning for Object Recognition: DSP and Specialized Processor Optimizations

Neural networks enable the identification of objects in still and video images with impressive speed and accuracy after an initial training phase. This so-called "deep learning" has been enabled by several converging factors: the evolution of traditional neural network techniques (one latest-incarnation example being the CNN, or convolutional neural network), the steadily increasing processing "muscle" of CPUs aided by algorithm acceleration via various co-processors, the steadily decreasing cost of system memory and storage, and the wide availability of large and detailed data sets. In this article, we provide an overview of CNNs and then dive into optimization techniques for object recognition and other computer vision applications accelerated by DSPs and vision and CNN processors, along with introducing an industry alliance intended to help product creators incorporate vision capabilities into their designs.

Classical computer vision algorithms typically attempt to identify objects by first detecting small features, then finding collections of these small features to identify larger features, and then reasoning about these larger features to deduce the presence and location of an object of interest, such as a face. These approaches can work well when the objects of interest are fairly uniform and the imaging conditions are favorable, but they often struggle when conditions are more challenging. An alternative approach, deep learning, has been showing impressive results on these more challenging problems where there's a need to extract insights based on ambiguous data, and is therefore rapidly gaining prominence.

The algorithms embodied in deep learning approaches such as CNNs are fairly simple, comprised of operations like convolution and decimation. CNNs gain their powers of discrimination through a combination of exhaustive training on sample images and massive scale – often amounting to millions of compute nodes (or "neurons"), requiring billions of compute operations per image. This high processing load creates challenges when using CNNs for real-time or high-throughput applications, especially in performance-, power consumption- and cost-constrained embedded systems.
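The convolution operation at the heart of this workload can be sketched naively as follows (an illustrative example only; the image and kernel values are arbitrary), with a counter making the multiply-accumulate load explicit:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2D convolution that also tallies the MACs performed."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    macs = 0
    for i in range(oh):
        for j in range(ow):
            # One output value = kh * kw multiply-accumulates
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
            macs += kh * kw
    return out, macs

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.ones((3, 3)) / 9.0            # simple averaging kernel
out, macs = conv2d_valid(image, kernel)
print(out.shape, macs)  # (4, 4) 144
```

Even this tiny 6x6 example requires 144 MACs; scaling the same arithmetic to megapixel inputs across millions of neurons is what produces the billions of operations per image cited above.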

Due to the massive computation and memory bandwidth requirements of sophisticated CNNs, implementations often use highly parallel processors. General-purpose GPUs (graphics processing units) are popular, as are FPGAs (field programmable gate arrays), especially for initial network training. But since the structure of CNN algorithms is very regular and the types of computation operations used, such as repetitions of MACs (multiply-accumulates), are very uniform, they're well matched to a DSP or another processor more specifically tailored for CNNs, particularly for subsequent object recognition tasks.

Neural Network Overview

A neural network is a system of interconnected artificial neurons that respond to inputs and generate output signals that flow to other neurons, typically constructed in multiple layers, representing recognition or triggering on more and more complex patterns. The connections among neurons have weights that are tuned during the initial network training process, so that a properly trained network will respond correctly when presented with an image or pattern to recognize. Each layer of the network comprises many neurons, each either responding to different inputs or responding in different ways to the same inputs. The layers build up so that the first layer detects a set of primitive patterns in the input, the second layer detects patterns of these primitive patterns, the third layer detects patterns of these patterns, and so on (Figure 1).

Figure 1. An artificial neural network is comprised of multiple layers, each containing multiple neurons.
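The layered structure described above can be sketched in a few lines (an illustrative toy with arbitrary random weights, not any specific trained network): each layer computes weighted sums of the previous layer's outputs and applies a nonlinearity, so that later layers respond to patterns of earlier-layer responses.

```python
import numpy as np

def relu(x):
    # A common neuron response function: pass positive sums, zero otherwise
    return np.maximum(x, 0.0)

def forward(x, layers):
    """layers is a list of (weights, biases) tuples, one per layer."""
    for W, b in layers:
        x = relu(W @ x + b)
    return x

rng = np.random.default_rng(0)
# Hypothetical topology: 4 inputs -> 8 hidden neurons -> 3 outputs
layers = [(rng.standard_normal((8, 4)), np.zeros(8)),
          (rng.standard_normal((3, 8)), np.zeros(3))]
y = forward(np.array([1.0, 0.5, -0.2, 0.3]), layers)
print(y.shape)  # (3,)
```

Training, described below, is what turns these random weights into ones that respond correctly to the patterns of interest.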

CNNs are a particularly important category of neural networks, in which the same weights are used uniformly on different sections of the inputs, and the response function is based on a sum-of-products of inputs and weights in the form of a dot-product or convolution operation. Typical CNNs use multiple distinct layers of pattern recognition; sometimes as few as two or three, other times more than a hundred. CNNs follow a common pattern in the relationships between layers. Initial layers typically have the largest-size inputs, as they often represent full-size images (for example), although the number of unique input channels is small (red, green and blue for RGB images, for example). These initial layers often encompass a narrow set of weights, as the same small convolution kernel is applied at every input location.

Initial layers also often create smaller outputs in X-Y dimensions, by effectively sub-sampling the convolution results, but create a larger number of output channels. As a result, the neuron data gets "shorter" but "fatter" as it traverses the network (Figure 2). As the computation progresses through the network layers, the total number of neurons may also decrease, but the number of unique weights often grows. The final network layers are often fully connected, wherein each weight is used exactly once. In such layers the size of the intermediate data is much smaller than the size of the weights. The relative sizes of data and weights, along with the ratio of computation to data usage, ultimately become critical factors in selecting and optimizing the architecture's implementation.

Figure 2. As computation progresses through the network layers, the total number of neurons may decrease, but the number of unique weights often grows; the final network layers are often fully connected (top). An AlexNet-specific implementation (middle) provides another perspective on the network layers' data-versus-coefficient transformations (bottom).
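The "shorter but fatter" progression can be illustrated numerically with hypothetical layer parameters (not those of any real network), halving the spatial extent while doubling the channel count at each stage:

```python
def layer_shapes(h, w, c, stages):
    """Hypothetical per-stage activation shapes: halve spatial size,
    double channel count at each stage."""
    shapes = [(h, w, c)]
    for _ in range(stages):
        h, w, c = h // 2, w // 2, c * 2
        shapes.append((h, w, c))
    return shapes

for h, w, c in layer_shapes(224, 224, 3, 4):
    print(f"{h:>4} x {w:>4} spatial, {c:>3} channels, {h * w * c:>7} activations")
```

Note that even as the channel count grows 16x in this toy progression, the total activation volume still shrinks, consistent with the data-size trend described above.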

Network Training

Training is performed using a labeled dataset: a wide assortment of representative input patterns, each tagged with its intended output response. It uses general-purpose methods to iteratively determine the weights for intermediate and final-feature neurons (Figure 3). The training process typically begins with random initialization of the weights; labeled inputs are then sequentially applied, with the computed output value noted for each input. This phase of the training process is known as forward inferencing, or forward propagation. Each output is compared to the expected value, based on the previously set label for the corresponding input. The difference, i.e. the error, is fed back through the network; a portion of the total error is allocated to each layer and to each weight used in that particular output computation. This phase of the training process is called back propagation. The allocated error at each neuron is used to update the weights, with the goal of minimizing the expected error for that input.

Figure 3. Neural network training iteratively determines the weights for intermediate and final-feature neurons
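The sequence just described (random weight initialization, forward propagation, error computation, and back-propagated weight updates) can be sketched on a toy problem. The single-layer logistic model and synthetic dataset below are purely illustrative; real CNN training involves many layers and far larger datasets.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)    # toy labeled dataset

w = rng.standard_normal(2)                   # random weight initialization
b = 0.0
lr = 0.5                                     # learning rate
for epoch in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # forward propagation
    err = p - y                              # error vs. expected label
    w -= lr * (X.T @ err) / len(X)           # back-propagated weight update
    b -= lr * err.mean()

accuracy = ((p > 0.5) == (y > 0.5)).mean()
print(accuracy)  # converges toward near-perfect separation
```

As in full-scale training, repeated iterations over the labeled inputs cause the weights to converge toward values that minimize the output error.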

After the application of a large number of independent inputs, along with a large number of iterations covering all of the inputs, the weights at each layer will eventually converge, resulting in an effective recognition system. The pattern recognizer's characteristics are determined by two factors: the structure of the network (i.e. which neurons feed into which other neurons on successive layers), and the weights. In most neural network applications, the network's architects manually design its structure, while automated training determines its set of weights.

A number of software environments have emerged for conducting this automated training. Each framework typically allows for a description of the network structure, along with setting parameters that tune the iterative coefficient computations, and also enables labeled training data to be used in controlling the training. The training data set is often quite large – tens of thousands to millions of images or other patterns – and effective training often also requires a large number of iterations using the training data. Training can be extremely computationally demanding; large training exercises may require weeks of computation across thousands of processors.

Some of the popular training environments include:

  • Caffe, from the University of California at Berkeley
    Caffe is perhaps the most popular open-source CNN framework among deep learning developers today. It is generally viewed as being easy to use, though it's perhaps not as extensible as some alternatives in its support of new layers and advanced network structures. Its widespread popularity has been driven in part by NVIDIA’s selection of Caffe as the foundation for its deep learning tool, DIGITS. Caffe also tends to undergo frequent new-version releases, along with exhibiting a relatively high rate of both format and API changes.
  • Theano, from the University of Montréal in Canada
    Theano is a Python-based library with good support for GPU-acceleration training. It is highly extensible, as a result being particularly attractive to advanced researchers exploring new algorithms and structures, but is generally viewed as being more complex to initially adopt than is Caffe.
  • MatConvNet from Oxford University
    MatConvNet is a MATLAB toolbox for CNNs. It is simple and efficient, and features good support for GPU training acceleration. It is also popular among deep learning researchers because it is flexible, and adding new architectures is straightforward.
  • TensorFlow from Google
    TensorFlow is a relatively new training environment. It extends the notion of convolutions to a wider class of N-dimensional matrix multiplies, which Google refers to as "tensors", and promises applicability to a wider class of structures and algorithms. It is likely to be fairly successful if for no other reason than because of Google’s promotion of it, but its acceleration capabilities (particularly on GPUs) are not yet as mature as those of some of the other frameworks.
  • Torch from the Dalle Molle Institute for Perceptual Artificial Intelligence
    Torch is one of the most venerable machine learning toolkits, dating back to 2002. Like TensorFlow, it uses tensors as a basic building block for a wide range of computation and data manipulation primitives.
  • The Distributed Machine Learning Toolkit (DMLT) from Microsoft
    DMLT is a general-purpose machine learning framework, less focused on deep neural networks in particular. It includes a general framework for server-based parallelization, data structures for storage, model scheduling and automated pipelining for training.

Most of these training environments, which run on standard infrastructure CPUs, also include optimizations supporting various coprocessors.

Object Recognition

Once the network is trained, its structure and weights can subsequently be utilized for recognition tasks, which involve only forward inferences on unlabeled images and other data structures. In this particular case, the purpose of the inference is not to compute errors (as with the preceding training) but to identify the most likely label or labels for the input data. Training is intended to be useful across a wide assortment of inputs; one network training session might be used for subsequent recognition or inferences on millions or billions of distinct data inputs.

Depending on the nature of the recognition task, forward inference may end up being run on the same system that was used for the training. This scenario may occur, for example, with infrastructure-based "big data analytics" applications. In many other cases, however, the recognition function is part of a distributed or embedded system, where the inference runs on a mobile phone, an IoT node, a security camera or a car. In such cases, trained model downloads into each embedded device occur only when the model changes – typically when new classes of recognition dictate retraining and update.

These latter scenarios translate into a significant workload asymmetry between initial training and subsequent inference. Training is important to the recognition rate of the system, but it's done only infrequently. Training computational efficiency is therefore fundamentally dictated by what level of computational resources can fit into an infrastructure environment, and tolerable training times may extend to many weeks' durations. In contrast, inference may be running on hundreds or thousands of images per second, as well as across millions or potentially billions of devices; the aggregate inference computation rate for a single pre-trained model could reach 10^14-10^18 convolution operations per second. Needless to say, with this kind of workload impact, neural network inference efficiency becomes crucial.

Neural network inferences typically make both arithmetic and memory capacity demands on the underlying computing architecture (Figure 4). Inferences in CNNs are largely dominated by MAC (multiply-accumulate) operations that implement the networks' essential 3-D convolution operations. In fact, MACs can in some cases represent more than 95% of the total arithmetic operations included in the CNN task. Much of this computation is associated with the initial network layers; conversely, much of the weight memory usage – sometimes as much as 95% – is associated with the later layers. The popular AlexNet CNN model, for example, requires approximately 80 million multiplies per input image, along with approximately 60 million weights. By default, training produces 32-bit floating-point weights, so the aggregate model size is roughly 240 MBytes, which may need to be loaded and used in its entirety once per input image.
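As a quick sanity check on these figures, the model footprint follows directly from the weight count and the per-weight precision. A minimal sketch (the helper function is illustrative, not part of any framework):

```python
def model_size_bytes(num_weights, bits_per_weight):
    """Storage footprint of a weight set at a given precision."""
    return num_weights * bits_per_weight // 8

# AlexNet-scale model: ~60 million weights, 32-bit floating point
alexnet_weights = 60_000_000
fp32_size = model_size_bytes(alexnet_weights, 32)
print(fp32_size / 1e6)  # 240.0 MBytes, matching the figure in the text

# The same weights quantized to 8-bit fixed point
int8_size = model_size_bytes(alexnet_weights, 8)
print(int8_size / 1e6)  # 60.0 MBytes
```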

Figure 4. Neural network inferences involve extensive use of both arithmetic calculations and memory accesses.

These arithmetic and memory demands drive the architecture of CNN inference hardware, especially in embedded systems where compute rates and memory bandwidth have a directly quantifiable impact on throughput, power dissipation and silicon cost. To a first order, therefore, the following hardware capabilities are likely to dominate CNN inference hardware choices:

  • MAC throughput and efficiency, the latter measured in metrics such as MACs per second, MACs per watt, MACs per mm² and MACs per dollar.
  • Memory capacity, especially for on-chip memory storage, to hold weights, input data, and intermediate neuron results data.
  • Memory bus bandwidth to transfer data, especially coefficients, on-chip.

Embedded Optimizations

With embedded systems using large CNN models, it may be difficult to hold the full set of weights on-chip, especially if they're in floating point form. Off-chip weight fetch bandwidth may be as high as the frame rate multiplied by the model size, a product that can quickly reach tens of GBytes per second. Providing (and powering) that off-chip memory bandwidth can dominate the design and completely overshadow even the energy consumed by the large number of multiplies. Managing the model size becomes a key concern in embedded implementations.
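To see how quickly that bandwidth mounts, multiply the model size by the frame rate. An illustrative back-of-the-envelope calculation (the function name is a stand-in):

```python
def weight_fetch_bandwidth(model_bytes, frames_per_second):
    """Worst-case off-chip bandwidth if the full weight set is
    reloaded once per input frame."""
    return model_bytes * frames_per_second

# 240-MByte fp32 model (AlexNet-scale) at 60 frames per second
bw = weight_fetch_bandwidth(240_000_000, 60)
print(bw / 1e9)  # 14.4 GBytes/s - already in the tens-of-GBytes range
```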

Fortunately, a number of available techniques, some just emerging from the research community, can dramatically reduce the model load bandwidth problem. They include:

  • Quantizing the weights to a smaller number of bits, by migrating to 16-bit floating point, 16-bit fixed point, 8-bit fixed point or even lower-precision data sizes. Some researchers have even reported acceptable-accuracy success with single-bit weights. Often, the trained floating-point weights can simply be direct-converted to a suitably scaled fixed-point representation without significant loss of recognition accuracy. Building this quantization step into the training process can deliver even higher accuracy. In addition, some research results recommend storing the weights in compressed (encoded) form, re-expanding them to their full representation upon use in the convolution.
  • In some networks, both weights and intermediate result data may contain a large number of zero values, i.e. they are sparse. In this case, a range of simple and effective lossless compression methods are available to represent the data or weights within a smaller memory footprint, additionally translating to a reduction in required memory bus bandwidth. Some networks, especially if training has been biased to maximize sparseness, can comprise 60-90% zero coefficients, with corresponding potential benefits on required capacity, bandwidth and sometimes even compute demands.
  • Large model sizes, along with the prevalence of sparseness, together suggest that typical neural network models may contain a large amount of redundancy. One of the most important emerging optimization sets for neural networks therefore involves the systematic analysis and reduction in the number of layers, the number of feature maps or intermediate channels, and the number of connections between layers, all without negatively impacting the overall recognition rate. These optimizations have the potential to reduce the necessary MAC rate, the weight storage and the memory bandwidth dramatically, above and beyond the benefits realized by quantization alone.
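The first of these techniques, direct conversion of trained floating-point weights to a scaled fixed-point representation, can be sketched in a few lines. This is a simple symmetric, per-tensor scheme chosen purely for illustration; production quantizers use more sophisticated variants:

```python
def quantize_int8(weights):
    """Direct conversion of floating-point weights to scaled 8-bit
    integers: symmetric, per-tensor scaling (one of many schemes)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs else 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Re-expand quantized weights for use in the convolution."""
    return [v * scale for v in q]

weights = [0.8, -0.25, 0.0, 0.31, -0.9]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
# Each reconstructed weight lands within one quantization step
# (i.e. within `scale`) of the original value.
```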

Together, these methods can significantly reduce the storage requirements when moving CNNs into embedded systems. These same techniques can also deliver significant benefits in reducing the total computation load, specifically in replacing expensive 32-bit floating-point operations with much cheaper and lower energy 16-bit, 8-bit or smaller integer arithmetic operations.

Coprocessor Characteristics

Optimizing the processing hardware employed with CNNs also provides a significant opportunity to maximize computational efficiency, by taking full advantage of the high percentage of convolution operations present. Convolution can be considered to be a variant of the matrix multiplication operation; hardware optimized for matrix multiplies therefore generally does well on convolutions. Matrix multiplies extensively reuse the matrices' rows and columns, potentially reducing the memory bandwidth and even register bandwidth needed to support a given MAC rate. Lower bandwidth translates into reduced latency, cost and power consumption.
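The equivalence between convolution and matrix multiplication can be made concrete with the classic im2col transformation, sketched below in plain Python for a single-channel 2-D case (stride 1, no padding). Each unrolled patch row shares pixels with its neighbors, which is exactly the data reuse described above. (As is conventional in CNN practice, "convolution" here is implemented as cross-correlation; the kernel is not flipped.)

```python
def im2col(image, k):
    """Unroll each k x k patch of a 2-D image into a row, so that
    convolution becomes one matrix-vector multiply. Overlapping
    patches reuse the same input pixels."""
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            rows.append([image[i + di][j + dj]
                         for di in range(k) for dj in range(k)])
    return rows

def conv2d_as_matmul(image, kernel):
    """2-D convolution (stride 1, no padding) expressed as the product
    of the unrolled-patch matrix and the flattened kernel."""
    k = len(kernel)
    flat_kernel = [v for row in kernel for v in row]
    return [sum(p * w for p, w in zip(patch, flat_kernel))
            for patch in im2col(image, k)]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]  # sums each pixel with its lower-right neighbor
print(conv2d_as_matmul(image, kernel))  # [6, 8, 12, 14]
```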

Google’s TensorFlow environment, in fact, leverages 3D matrix multiplies (i.e. tensor operations) as its foundation computational building block. For most convolutions, further optimizations of the hardware are possible, for example in doing direct execution of wide convolutions that take explicit advantage of the extensive data or weight reuse in 1-D, 2-D and 3-D convolutions, beyond what’s possible with just matrix multiplies.

Hardware for CNNs is evolving rapidly. Much of the recent explosion in CNN usage has leveraged off-the-shelf chips such as GPUs and FPGAs. These are reasonable pragmatic choices, especially for initial network training purposes, since both types of platform feature good computational parallelism, high memory bandwidth, and some degree of flexibility in data representation. However, neither product category was explicitly built for CNNs, and neither category is particularly efficient for CNNs from cost and/or energy consumption standpoints.

DSP cores are increasingly being used for SoC (system-on-chip) CNN processing implementations, particularly with imaging- and vision-oriented architectures. They typically feature very high MAC rates for lower-precision (especially 16-bit and 8-bit) data, robust memory hierarchies and data movement capabilities to hide the latency of off-chip data and weight fetches, and very high local memory bandwidth to stream data and coefficients through the arithmetic units. Note that the memory access patterns for neural networks tend to be quite regular, thereby enabling improved efficiency by using DMA (direct memory access) controllers or other accelerators, versus generic caches.

The latest DSP variants now even include deep learning-tailored features, such as convolution-optimized MAC arrays that further boost throughput, along with capabilities for on-the-fly compression and decompression of data to reduce the required memory footprint and bandwidth. DSPs, like GPUs, allow for full programmability; everything about the structure, resolution, and model of the network is represented in program code and data. As a result, standard hardware can run any network without any hardware changes or even hardware-level reprogramming.

Finally, the research community is beginning to produce a wide range of even more specialized engines, completely dedicated to CNN execution, with minimal energy consumption, and capable of cost-effectively scaling to performance levels measured in multiple teraMACs per second. These new optimized engines support varying levels of programmability and adaptability to non-standard network structures and/or mixes of neural network and non-neural network recognition tasks. Given the high rate of evolution in neural network technology, significant levels of programmability are likely to be needed for many embedded tasks, for a significant time to come.

Efficient hardware, of course, is not by itself sufficient. Effective standardized training environments aid developers in creating models, but these same models often need to migrate into embedded systems quite unlike the platforms on which they were initially developed. One or more of the following software capabilities therefore often complement the emerging CNN implementation architectures, to ease this migration:

  • Software development tools that support rapid design, coding, debug and characterization of neural networks on embedded platforms, often by leveraging a combination of virtual prototype models and example hardware targets
  • Libraries and examples of baseline neural networks and common layer types, for easy construction of standard and novel CNN inference implementations, and
  • CNN "mappers" that input the network structures as entered, along with the weights as generated by the training framework, outputting tailored neural networks for the target embedded implementation, often working in combination with the previously mentioned development tools and libraries.

SoC and System Integration

Regardless of whether the neural network-based vision processing subsystem takes the form of a standalone chip (or set of chips) or is integrated in the form of one or multiple cores within a broader-function SoC, it's not fundamentally different than any other kind of vision processing function block (Figure 5). The system implementation requires one or more capable input sensors, high-bandwidth access to the on- and/or off-chip frame buffer memory, the vision processing element itself, the hardware interface between it and the rest of the chip and/or system, and a software framework to tie the various hardware pieces together. Consideration should also be given to how overall system processing is partitioned across the various cores, chips and other resources, since this partitioning will impact how the hardware and software are structured.

Figure 5. Neural network processing can involve either/both standalone ICs or, as shown here, a core within a broader-function SoC.

The hardware suite comprises four main functional elements: sensor input, sensor processing, data processing (along with potential multi-sensor data fusion), and system output. Sensor input can take one or more of many possible forms (one or several cameras, visible- or invisible-light spectrum-based, radar, laser, etc.) and will depend on system operating requirements. In most cases, some post-capture processing of the sensor data will be required, such as Bayer-to-RGB interpolation for a visible light image sensor. This post-processing may be supported within the sensor assembly itself, or may need to be done elsewhere as an image pre-processing step prior to primary processing by the deep learning algorithms. This potential need is important to ascertain, because if the processing is done on the deep learning processor, sufficient incremental performance will need to be available to support it.

In either case, the data output from the sensor will be stored in frame buffer memory, either in part or in its entirety, in preparation for processing. As previously discussed, the I/O data bandwidth of this memory is often at least as important as, if not more important than, its density, due to the high volume of data coming from the sensor in combination with an often-demanding frame rate. A few years ago, expensive video RAM would have been necessary, but today's standard DDR2/DDR3 SDRAM can often support the required bandwidth while yielding significant savings in overall system cost.

The data next has to be moved from the frame buffer memory to the deep learning processing element. In most cases, this transfer will take place in the background using DMA. Once the data is in the deep learning processing element, any required pre-processing will take place, followed by image analysis, which will occur on a frame-by-frame basis in real time. This characteristic is important to keep in mind while designing the system, because a 1080p60 feed, for example, will require larger memory and much higher processing speeds than a 720p30 feed. Once analysis is complete, the generated information will pass to the system output stage for post-processing and to generate the appropriate response (brake, steer, recognize a particular person, etc.). As with the earlier sensor output, analysis post-processing can take place either in the deep learning processing element or in another subsystem. Either way, the final output information will in most cases not require high bandwidth or large storage spaces; it can instead transfer over standard intra- or inter-chip busses and via modest-size and -performance buffer memory.
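The resolution and frame-rate sensitivity of these memory and processing demands is easy to quantify. A simple illustrative calculation of raw frame-buffer bandwidth for uncompressed 24-bit streams:

```python
def stream_bandwidth(width, height, bytes_per_pixel, fps):
    """Raw frame-buffer write bandwidth for an uncompressed stream."""
    return width * height * bytes_per_pixel * fps

hd_1080p60 = stream_bandwidth(1920, 1080, 3, 60)  # 24-bit color
hd_720p30 = stream_bandwidth(1280, 720, 3, 30)
print(hd_1080p60 / 1e6)  # 373.248 MBytes/s
print(hd_1080p60 / hd_720p30)  # 4.5x the data rate of 720p30
```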

Software Partitioning

Designing the software to effectively leverage the available resources in a deep learning vision system can be challenging, because of the diversity of coprocessors that are often available. Dynamically selecting the best processor for any particular task is essential to efficient resource utilization across the system. Fortunately, a number of open source standards can find use in programming and managing deep learning and other vision systems. For example, the Khronos Group's OpenVX framework enables the developer to create a connected graph of vision nodes, used to manage the vision pipeline. OpenVX supports a broad range of platforms, is portable across processors, and doesn’t require a high-performance CPU. OpenVX nodes can be implemented in C/C++, OpenCL, intrinsics, dedicated hardware, etc, and then connected into a graph for execution. OpenVX offers a high level of abstraction that makes system optimization easier by allowing the algorithms to take advantage of all available and appropriate resources in the most efficient manner.
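The graph-of-nodes idea behind OpenVX can be illustrated with a toy dataflow executor in plain Python. To be clear, the `Node` class, node names and `run_graph` function below are hypothetical stand-ins for illustration only; the real OpenVX API is a C API with its own object model:

```python
class Node:
    """One processing stage in a vision pipeline graph (illustrative;
    real OpenVX nodes are created and connected through its C API)."""
    def __init__(self, name, func, inputs=()):
        self.name, self.func, self.inputs = name, func, inputs

def run_graph(nodes, source):
    """Execute nodes in dependency order, feeding each node the
    outputs of its named inputs. Assumes the node list is already
    topologically sorted."""
    results = {"source": source}
    for node in nodes:
        args = [results[i] for i in node.inputs]
        results[node.name] = node.func(*args)
    return results

# Hypothetical two-stage pipeline: scale pixel values, then threshold
pipeline = [
    Node("scale", lambda img: [p // 2 for p in img], inputs=("source",)),
    Node("threshold", lambda img: [1 if p > 50 else 0 for p in img],
         inputs=("scale",)),
]
out = run_graph(pipeline, source=[30, 120, 200])
print(out["threshold"])  # [0, 1, 1]
```

An implementation under a real framework would let the runtime, rather than the application, decide which processor executes each node; that scheduling freedom is the point of the graph abstraction.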

The individual nodes are usually programmed using libraries such as OpenCV (the Open Source Computer Vision Library), which enable designers to deploy vision algorithms without requiring specialized in-advance knowledge of image processing theory. Elements in the vision system that include wide parallel structures, however, may be more easily programmed with Khronos' OpenCL, a set of programming languages and APIs for heterogeneous parallel programming that can be used for any application that is parallelizable. Originally targeted specifically to systems containing CPUs and GPUs, OpenCL is now not architecture-specific and can therefore find use on a variety of parallel processing architectures such as FPGAs, DSPs and dedicated vision processors. OpenCV and OpenVX can both use OpenCL to accelerate vision functions.

Many systems that employ a vision subsystem will also include a host processor. While some level of interaction between the vision subsystem and the host processor will be inevitable, the degree of interaction will depend on the specific design and target application. In some cases, the host will closely control the vision subsystem, while in others the vision subsystem will operate near-autonomously. This design consideration will be determined by factors such as the available resources on the host, the capability of the vision subsystem, and the vision processing to be done. System partitioning can have a significant effect on performance and power optimization, and therefore demands serious scrutiny when developing the design.


The popularity and feasibility of real-world neural network deployments are growing rapidly. This accelerated transformation of the computer vision field affects a tremendous range of platform types and end applications, driving profound changes with respect to their cognitive capabilities, the development process, and the underlying silicon architectures for both infrastructure and embedded systems. Optimizing an embedded-intended neural network for the memory and processing resources (and their characteristics) found in such systems can deliver tangible benefits in cost, performance, and power consumption parameters.

Sidebar: Additional Developer Assistance

The Embedded Vision Alliance, a worldwide organization of technology developers and providers, is working to empower product creators to transform the potential of vision processing into reality. BDTI, Cadence, Movidius, NXP and Synopsys, the co-authors of this article, are members of the Embedded Vision Alliance. First and foremost, the Embedded Vision Alliance's mission is to provide product creators with practical education, information and insights to help them incorporate vision capabilities into new and existing products. To execute this mission, the Embedded Vision Alliance maintains a website providing tutorial articles, videos, code downloads and a discussion forum staffed by technology experts. Registered website users can also receive the Embedded Vision Alliance’s twice-monthly email newsletter, Embedded Vision Insights, among other benefits.

The Embedded Vision Alliance also offers a free online training facility for vision-based product creators: the Embedded Vision Academy. This area of the Embedded Vision Alliance website provides in-depth technical training and other resources to help product creators integrate visual intelligence into next-generation software and systems. Course material in the Embedded Vision Academy spans a wide range of vision-related subjects, from basic vision algorithms to image pre-processing, image sensor interfaces, and software development techniques and tools such as OpenCL, OpenVX and OpenCV, along with Caffe, TensorFlow and other deep learning frameworks. Access is free to all through a simple registration process.

The Embedded Vision Alliance also holds Embedded Vision Summit conferences. Embedded Vision Summits are technical educational forums for product creators interested in incorporating visual intelligence into electronic systems and software. They provide how-to presentations, inspiring keynote talks, demonstrations, and opportunities to interact with technical experts from Embedded Vision Alliance member companies. These events are intended to inspire attendees' imaginations about the potential applications for practical computer vision technology through exciting presentations and demonstrations, to offer practical know-how for attendees to help them incorporate vision capabilities into their hardware and software products, and to provide opportunities for attendees to meet and talk with leading vision technology companies and learn about their offerings.

The most recent Embedded Vision Summit was held in May 2016, and a comprehensive archive of keynote, technical tutorial and product demonstration videos, along with presentation slide sets, is available on the Embedded Vision Alliance website and YouTube channel. The next Embedded Vision Summit, along with accompanying workshops, is currently scheduled to take place on May 1-3, 2017 in Santa Clara, California. Please reserve a spot on your calendar and plan to attend.

By Brian Dipert
Editor-in-Chief, Embedded Vision Alliance

Jeff Bier
Founder, Embedded Vision Alliance
Co-founder and President, BDTI

Chris Rowen
Chief Technology Officer, Cadence

Jack Dashwood
Marketing Communications Director, Movidius

Daniel Laroche
Systems Architect, NXP Semiconductors

Ali Osman Ors
Senior R&D Manager, NXP Semiconductors

Mike Thompson
Senior Product Marketing Manager, Synopsys