Stereo Vision: Facing the Challenges and Seeing the Opportunities for ADAS Applications


This technical article was originally published on Texas Instruments' website (PDF). It is reprinted here with the permission of Texas Instruments.


Cameras are the most precise mechanisms used to capture accurate data at high resolution. Like human eyes, cameras capture the resolution, minutiae and vividness of a scene with a level of detail that no other sensor, such as radar, ultrasonic or laser, can match. The prehistoric cave paintings discovered across the world, dated to tens of thousands of years ago, are testament that pictures and paintings, coupled with the visual sense, have long been the preferred method of conveying accurate information[1].

The next engineering frontier, which some might argue will be the most challenging for the technology community, is real-time machine vision and intelligence. The applications include, but are not limited to, real-time medical analytics (surgical robots), industrial machines and cars that drive themselves autonomously. In this paper, we will focus on autonomous Advanced Driver Assistance Systems (ADAS) applications, and on how cameras, and stereo vision in particular, are the keystone for safe, autonomous cars that can "see and drive" themselves.

The key applications that require cameras for ADAS are shown below in Figure 1. Some of the applications shown can be implemented using just a vision system, such as forward-, rear- and side-mounted cameras for pedestrian detection, traffic sign recognition, blind-spot monitoring and lane detection. Others, such as intelligent adaptive cruise control, are implemented more robustly as a fusion of radar data with the camera sensors, especially for complex scenarios such as city traffic, curving roads or higher speeds.

Figure 1: Applications of camera sensors for ADAS in a modern vehicle: (a) Forward facing camera for – lane detect, pedestrian detect, traffic sign recognition and emergency braking. (b) Side- and rear-facing cameras for parking assistance, blind spot detection and cross traffic alerts.

What kind of camera is needed?

All the real-world scenes that a camera encounters are three dimensional. Objects that are at different depths in the real world may appear adjacent to each other in the two-dimensional mapped world of the camera sensor. Figure 2 shows a picture from the Middlebury image dataset[2]. Clearly the motorbike in the foreground of the picture is about two meters closer to the camera than the storage shelf in the background. Please pay attention to points 1 and 2 annotated in the figure. The red box (point 1) that is in the background appears adjacent to the forks (2) of the bike in the captured image, even though it is at least two meters farther away from the camera. The human brain has the power of perspective, which allows us to make decisions about depth from a 2-D scene. For a forward-mounted camera in a car, the ability to analyze perspective does not come as easily.

Figure 2: Image from 2014 Middlebury database. The motor bike in the foreground is much closer to the camera than the storage shelf, though all objects appear adjacent in a 2-D mapped view.

If we have a single camera sensor mounted and capturing video that needs to be processed and analyzed, that system is called a monocular (single-eyed) system, whereas a system with two cameras separated from each other is called a stereo vision system. Before we go any further, please have a look at Table 1, which compares the basic attributes of a monocular-camera ADAS with a stereo-camera system.

Table 1: High-level comparison of system attributes for a mono- vs. stereo-camera ADAS system.

The monocular-camera-based video system can do many things reasonably well. The system and the analytics behind it can identify lanes, pedestrians, many traffic signs and other vehicles in the path of the car with good accuracy. Where the monocular system is not as robust and reliable is in calculating the 3-D view of the world from the planar 2-D frame that it receives from the single camera sensor. That's not surprising if we consider the natural fact that humans (and most advanced animals) are born with two eyes. Before analyzing this problem in further detail, please take a look at Figure 3. This figure describes at a high level the process and algorithms used to analyze the video (image) frame received from a camera sensor.

Figure 3: High-level algorithm flow and processes for analyzing an image in ADAS system.

The first stage in Figure 3 is the image pre-processing step, where various filters are run on the image (typically on every pixel) to remove sensor noise and other unneeded information. This stage also converts the Bayer-format data received from the camera sensor into a YUV or RGB format that can be analyzed by the subsequent steps. On the basis of the preliminary feature extraction (edges, Haar features, Gabor filters, histograms of oriented gradients, etc.) done in this first stage, the second and third stages further analyze the images to identify regions of interest by running algorithms such as segmentation, optical flow, block matching and pattern recognition. The final stage uses the region information and feature data generated by the prior stages to make intelligent decisions about the class of object in each region of interest. This brief explanation does not quite do justice to the field of ADAS image-processing algorithms; however, since the primary objective of this article is to highlight the additional challenges and the robustness that a stereo vision system provides, this block-level algorithmic background is sufficient for us to delve deeper into the topic.

How does a monocular camera measure distance to an object from 2-D data?

There are two distinct possibilities through which distance measurement is performed by a monocular camera. The first is based on the simple premise that objects closer to the camera appear bigger, and therefore take up a larger pixel area in the frame. If an object is identified as a car, then its size can be approximated by the maximum covering rectangle drawn around it. The bigger this rectangle, the closer the object is to the camera (i.e., the car). The emergency braking algorithm assesses whether the distance to each object identified in the frame is closer than a safe predefined value, then initiates collision avoidance or driver warning actions as necessary. See Figure 4 for a simple illustration of this idea.
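As a concrete sketch of this idea, the pinhole-camera relation gives distance directly from the bounding-rectangle height and an assumed real-world object size. The focal length in pixels, the 1.5 m car height and the 75-pixel box below are illustrative values, not figures from the article:

```python
# Hypothetical monocular range estimate from an identified object's
# bounding-box height, assuming a pinhole camera and a known (guessed)
# real-world object height. All numbers here are illustrative.
def distance_from_bbox(focal_px, real_height_m, bbox_height_px):
    # Pinhole projection: h_px = f_px * H / Z  =>  Z = f_px * H / h_px
    return focal_px * real_height_m / bbox_height_px

# A 1.5 m tall car spanning 75 pixels, with a 1500-pixel focal length:
print(distance_from_bbox(1500, 1.5, 75))  # 30.0 (meters)
```

Note how the estimate hinges on the assumed real-world height: if the same 75-pixel box actually belongs to a 1.0 m tall child, the true distance is 20 m, not 30 m, which is exactly the ambiguity Figure 5 illustrates.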

Figure 4: A picture showing various identified objects and their estimated distances from a monocular camera. It is clear that the farther an identified object is from the car, the smaller its maximum covering rectangle[3], [17].

Simplicity and elegance are both benefits of this method; however, there are some drawbacks to this approach. The distance to an identified object cannot be assessed until the object has first been identified "correctly". Consider the scenario shown in Figure 5. There are three graphical pedestrians shown in this figure. Pedestrian 1 is a tall person, while pedestrian 2 is a shorter boy. The distance of both these individuals to the camera is the same. The third pedestrian (3) shown in the picture is farther away from the camera, and is again a tall person. Here the object detection algorithm will identify and draw rectangles around the three pedestrians.

Figure 5: A virtual info-graphic showing three pedestrians in the path of a moving vehicle with a camera. The pixel sizes of individuals 3 and 2 are exactly the same; however, individual 2 is much closer to the vehicle than individual 3.[4]

Unfortunately, the rectangle drawn around the short boy (individual 2), who is much closer to the camera, will be the same size as the rectangle drawn around the tall person (individual 3), who is farther away. Therefore, the pixel size of an identified object in the captured 2-D frame is not a perfectly reliable indicator of its distance from the camera. The other issue to consider is that if an object remains unidentified in a scene, its distance cannot be ascertained, since the algorithm does not know the object's size (in pixels). An object can remain unidentified for a multitude of reasons, such as occlusion, lighting and other image artifacts.

The second method which can be utilized to calculate the distance of an object using a monocular camera is called "structure from motion" (SFM). Since the camera is moving, consecutive frames captured over time can, in theory, be compared against each other for key features. Epipolar geometry defines constrained parameters for where a given point in 3-D space can move between two consecutive frames captured by a moving (translated and possibly rotated) camera. SFM is an involved topic by itself, so in this article we will draw attention to the challenges of distance computation using SFM rather than the mechanics and mathematics of how it is done. For readers who are deeply interested in how SFM works, reference [5] is a good summary. It is sufficient here to understand the high-level flow of an SFM algorithm (Figure 6).

Figure 6: A high-level data flow for SFM-based distance calculation. The sparse optical flow (OF) may be replaced with the dense flow calculation (for every pixel) as well. The above flow assumes 30 fps.

Given the data flow, it is easy to understand the challenges the SFM-based distance computation faces for a monocular camera system. Please see Table 2 for the list of these issues.

Table 2: Challenges for SFM-based distance computation.

How does stereo vision calculate distance of objects from 2-D planar data?

Before radar was invented, ships already used stereo reflection mechanisms coupled with a clockwork dial to calculate the distance to enemy or pirate ships. (This information would then be used to aim the cannons.) There were two, sometimes more, mirrors (stereo) mounted on each side of the ship's hull. A system of carefully arranged reflection mirrors relayed the images from the primary stereo mirrors to a control station. An operator at the control station would adjust the clockwork mechanism to superimpose and align the two received images over each other. The reading on the pre-calibrated dial attached to the clockwork would indicate the distance of the enemy ship. The fundamental stereo algorithm has not changed for centuries. The methods are therefore stable and reliable, and the regularity and stability of the algorithms create the opportunity to design an optimized hardware machine to perform the stereo vision calculations.

Figure 7 shows the stereo geometry equations. If the two cameras are calibrated, then the problem of finding the distance to an object can be reformulated as finding the disparity between the simultaneous images captured by the left and right cameras for that point. For pre-calibrated stereo cameras, the images can be rectified such that the epipolar geometrical searches become simple horizontal searches (along the same row) for every point between the two images. The disparity is then defined as the number of pixels a particular point has moved in the right camera image compared to the left camera image. This concept is crucial to remember, since it allows regular computation patterns that are amenable to hardware implementation. Before we delve deeper into this topic, the concept of disparity needs to be clarified further.

Figure 7: Stereo geometry equations. The depth of a point in 3-D space is inversely proportional to the disparity of that point between the left and right cameras.[6]

Stereo disparity calculation and accuracy of calculated distance

Figure 8 shows three different graphs that demonstrate the relationship between disparity and distance-to-object. The first thing to notice is that the measured disparity is inversely proportional to the distance of an object: the closer an object is to the stereo cameras, the greater the disparity, and vice versa. Theoretically, a point with zero disparity is infinitely far from the cameras. Concretely, the calculation shows that for the chosen physical parameters of the system (see Figure 8-a), a disparity of 1 pixel implies a distance of ~700 meters, while a calculated disparity of 2 pixels implies ~350 meters. That is a very coarse distance resolution, and if the disparity calculation is off by one pixel, the estimated distance will be wrong by a large amount (for longer distances, > 100 meters). For shorter distances (the lower part of the curves in Figure 8, < 50 meters), the resolution of the distance calculation is much better, as is evident from the distance calculation points in the graphs, which crowd together. In this range, if the disparity calculation is off by one pixel (or less), the calculated distance is wrong by only approximately two to three meters.
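The disparity-to-distance relationship is easy to reproduce numerically. A minimal sketch, using the physical parameters quoted for Figure 8-a (30 cm baseline, 10 mm focal length, 4.2-micron pixels):

```python
# Depth from stereo disparity: Z = f * T / (d * pixel_size)
T = 0.30        # baseline between the two cameras, meters (Figure 8)
f = 10e-3       # focal length, meters (Figure 8)
pixel = 4.2e-6  # pixel pitch, meters (Figure 8)

def depth_m(disparity_px):
    return (f * T) / (disparity_px * pixel)

for d in (1, 2, 4, 8, 16):
    print(f"disparity {d:2d} px -> {depth_m(d):7.1f} m")
```

This reproduces the article's figures: one pixel of disparity corresponds to ~714 m and two pixels to ~357 m, so a one-pixel error at long range shifts the estimate by hundreds of meters, while at 16 pixels (~45 m) the same error costs only a few meters.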

Figure 8: Distance vs. disparity graphs for different accuracies of calculation. The distance accuracy improves with increased sub-pixel accuracy of the disparity calculation. Calculations made for (a) 30-cm distance between the two cameras, (b) focal length of 10 mm, (c) pixel size of 4.2 microns.

There are methods to improve the accuracy of the system further. As shown by Figures 8(b) and 8(c), if the disparity calculation is performed at half- or quarter-pixel levels, then the resolution of the distance calculation improves proportionally. In those scenarios, for distances larger than 100 meters (but less than 300 meters), the resolution of the calculated distance for each consecutive disparity increment is ~30–40 meters. For distances smaller than 100 meters, the accuracy can be better than 50 cm. It is important to reiterate that the accuracy needs to be maximized (preferably to the < 0.1 meter range) for a collision avoidance system operating at close distances. At the same time, the operating range of the stereo camera needs to be improved, even at the cost of a slight loss of accuracy if needed.
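One common way to reach half- or quarter-pixel disparity (the article does not prescribe a specific method) is to fit a parabola through the matching cost at the best integer disparity and its two neighbors, and take the parabola's minimum:

```python
# Sub-pixel disparity refinement by parabola fit over the matching
# costs at d-1, d and d+1. A common post-processing step; the exact
# interpolation used in any given product may differ.
def subpixel_disparity(d, c_prev, c_best, c_next):
    denom = c_prev - 2.0 * c_best + c_next
    if denom == 0:                 # flat cost curve: keep the integer value
        return float(d)
    return d + 0.5 * (c_prev - c_next) / denom

# Costs 10, 4, 8 around d = 17: the true minimum sits slightly toward d+1.
print(subpixel_disparity(17, 10.0, 4.0, 8.0))  # 17.1
```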

The range of the stereo camera ADAS system

If you look at the basic stereo equation (Figure 7) once again, it is evident that to improve the maximum range of the system, the distance computation needs to be reasonably accurate at low disparities. That can be achieved by any of the following four methods. Each of these methods has associated trade-offs in mechanical or electronic design, and eventually in system cost.

  (a) Use a smaller pixel size: if the pixel size is halved, and everything else stays the same, the range improves by about 50 percent (for the same accuracy).
  (b) Increase the distance between the two cameras: if "T" is doubled, and everything else stays the same, the range improves by about 50 percent (for the same accuracy).
  (c) Increase the focal length: if "f" is doubled, and everything else stays the same, the range improves by about 50 percent (for the same accuracy), but the field of view narrows.
  (d) Use a computation system that calculates stereo disparity with sub-pixel accuracy.

Although mathematically feasible, options (b) and (c) have a direct bearing on the physical attributes of the system. When a stereo system needs to be mounted in a car, it typically has fixed dimensions, or a requirement to be as small as possible. This packaging constraint argues against increasing the distance between the cameras (T) or the focal length (f). Therefore, the most practical options for an accurate stereo distance calculation system with high range and accuracy revolve around options (a) and (d) above.

The process

Figure 9 shows the high-level block diagram of the data flow and compute chains used to calculate stereo disparity. Please note the absence of the camera calibration step that was present in the SFM block diagram in Figure 6, and that there is no need to search for features in the dense stereo disparity algorithm either. Identifying features and objects is required for SFM-based distance calculation and for methods that compute distance based on the size of an object.

Figure 9: A high-level data and algorithm flow for stereo disparity-based distance computation.

The image rank transformation is most often the first or second step in the stereo image processing pipeline. The purpose of this step is to ensure that the subsequent block comparisons between the two images are robust to real-world noise conditions such as illumination or brightness changes between the left and right images[7]. These changes can be caused by many factors, including the different illumination seen from the two cameras' varied points of view, and slight differences in shutter speed or other jitter artifacts that may cause the left and right images to be captured at slightly different points in time.

Researchers have suggested various approaches for different rank transform options for images, and for how they impact the robustness of disparity calculations[8]. The image rectification step in Figure 9 ensures that the subsequent disparity calculation can be performed along horizontal epipolar search lines. The next steps in the process are the actual calculation of disparity, the confidence levels of the computation, and post-processing. The dense disparity calculation is mostly performed in the spatial domain, although some approaches have been suggested to calculate disparity in the frequency domain[9].

These approaches attempt to take advantage of the fact that large FFTs can be computed comparatively quickly, yet other complications involved in the FFT approach do not tilt the balance in its favor. Without going deeper into that discussion here, it is fair to claim that most (if not all) productized stereo disparity algorithms are implemented in the spatial domain. At the most basic level, this analysis requires that for every pixel in the left (transformed) image, we pick a small block of pixels surrounding it.

Next, we search the right (transformed) image along the epipolar (horizontal) line until we find where the same block is located. This computation is performed for every possible value of disparity (from one to the maximum: 64, 128 or any other value). The difference (or cross-correlation) between the left and right blocks will approach a minimum (maximum) close to the actual disparity for the pixel. This "moving-window" block comparison and matching calculates how far the block has moved, and the result is used to calculate the distance of that particular pixel in 3-D space. This process is shown in Figure 10. One example of disparity calculation using a rank transform followed by sum-of-absolute-differences (SAD) based cost-function minimization is given in [8].
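The moving-window search just described can be sketched in a few lines. This is a deliberately naive per-pixel version (window size and disparity range are illustrative; production systems vectorize this heavily):

```python
import numpy as np

# Naive SAD block matching for one pixel of a rectified stereo pair.
# For disparity d, the block around (row, col) in the left image is
# compared with the block around (row, col - d) in the right image.
def sad_disparity(left, right, row, col, block=5, max_disp=64):
    half = block // 2
    ref = left[row - half:row + half + 1, col - half:col + half + 1]
    best_d, best_cost = 0, np.inf
    for d in range(max_disp):
        c0 = col - d
        if c0 - half < 0:          # candidate block would leave the image
            break
        cand = right[row - half:row + half + 1, c0 - half:c0 + half + 1]
        cost = np.abs(ref.astype(np.int32) - cand.astype(np.int32)).sum()
        if cost < best_cost:       # keep the disparity with minimum SAD
            best_cost, best_d = cost, d
    return best_d
```

On a synthetic pair where the right image is the left image shifted by seven columns, this search recovers a disparity of 7 at interior pixels.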

Figure 10: Simple SAD-based block-comparison algorithm for finding disparity.

The SAD-based approach for finding disparity is elegant but sometimes too simplistic. The basic premise of this approach is that the disparities within a given block of pixels are equal; however, this is almost never true at the edges of objects. If you review Figure 2 again and pay attention to the annotations made for the forks of the motorcycle and the red box, you will quickly realize that there will be many adjacent pixels where the disparity differs. This is expected, since the "red box on the shelf" is about two meters farther from the camera than the "forks". The disparity for a small block of pixels may change drastically at object boundaries and marginally over slanted or curved surfaces. The "cones and faces" image from the Middlebury dataset[10] highlights this fact perfectly (Figure 11). Adjacent pixels on one cone (a slightly slanted surface) have minor disparity changes, while object boundaries show large disparity differences. Using a simple SAD-based algorithm along with a rank transform will leave large disparity holes both at occlusions (artifacts that are visible in only one camera) and at object boundaries.

Figure 11: Cones and faces from the Middlebury dataset. The disparity calculation is performed using simple SAD. The disparity changes marginally on the curved surfaces, while it changes drastically at the object boundaries. See the disparity holes in the fence at the top-right, at other object discontinuities, and at the occlusions on the left border.

To resolve such inaccuracies with a deterministic run time, an elegant approach called "semi-global matching" (SGM) was suggested in [11]. In this approach, a smoothness cost function is calculated for every pixel in more than one direction (4, 8 or 16). The cost function calculation is shown in Figure 12. The objective is to optimize the cost function S(p,d) in multiple directions for every pixel and thereby ensure a smooth disparity map. The original SGM paper suggested 16 directions for optimization of the cost function, though practical implementations have also been attempted with 2, 4 and 8 directions.

Figure 12: Optimization cost function equations for SGM.

A concrete implementation of SGM cost functions and optimization algorithm is shown in Figure 13. With this pseudo-code segment, it is easy to assess the memory, computation and eventually hardware complexity requirements to enable SGM-based computation in real time.
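To make the cost functions concrete, here is a minimal single-direction (left-to-right, along one scanline) sketch of the Lr(p,d) recurrence. The penalties P1 and P2 are the usual small and large smoothness penalties; their values here are illustrative, and a full implementation would repeat this along r directions and sum the results into S(p,d):

```python
import numpy as np

# One-direction SGM cost aggregation over a single scanline.
# C has shape (width, num_disparities): the per-pixel matching cost.
def aggregate_scanline(C, P1=10, P2=120):
    W, D = C.shape
    L = np.zeros_like(C, dtype=np.float64)
    L[0] = C[0]                    # no predecessor at the first pixel
    for x in range(1, W):
        prev = L[x - 1]
        min_prev = prev.min()
        for d in range(D):
            same = prev[d]                                    # same disparity
            up   = prev[d + 1] + P1 if d + 1 < D else np.inf  # +/- 1 disparity
            down = prev[d - 1] + P1 if d >= 1 else np.inf
            jump = min_prev + P2                              # larger change
            L[x, d] = C[x, d] + min(same, up, down, jump) - min_prev
    return L
```

Subtracting min_prev at each step keeps the aggregated cost bounded, which is what makes fixed-width hardware accumulators practical for this recurrence.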

Figure 13: Pseudo code for implementation of SGM.

Computation and memory requirements for the disparity calculation

As you can well imagine, this calculation is compute heavy for ADAS applications. A typical front-facing stereo camera rig is a pair of 1-Mpixel cameras operating at 30 frames per second. The first step in the disparity calculation process is the rank transform (Figure 14). A typical rank transform is the census transform or a slightly modified version of it. The inputs are the two stereo images; the outputs are the census-transformed image pair. At 2 × 1 Mpixel × 30 fps, the system must perform 60 million N×N census transforms per second. Every census transform over an N×N block requires N² comparison operations, and some other rank transforms need an N²-point sort for every pixel. It is therefore safe to assume that practical systems deployed on real vehicles over the next few years will need to run a minimum of 60 million × N² comparison operations per second for rank transformation.
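For reference, the census transform of a single pixel can be sketched as follows, using one comparison per neighbor in the N×N window (the window size is illustrative):

```python
import numpy as np

# Census transform of one pixel: each pixel in the N x N neighborhood
# contributes one bit, set when that pixel is darker than the center.
# This costs on the order of N^2 comparisons per pixel, as counted above.
def census(img, row, col, n=5):
    half = n // 2
    center = img[row, col]
    bits = 0
    for r in range(row - half, row + half + 1):
        for c in range(col - half, col + half + 1):
            if r == row and c == col:
                continue           # the center pixel carries no bit
            bits = (bits << 1) | int(img[r, c] < center)
    return bits
```

Matching then compares these bit strings with a Hamming distance instead of raw intensities, which is what makes the pipeline robust to brightness offsets between the two cameras.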

Figure 14: Rank transform examples for images. The left part of the image is simple census transform. The right half is called “complete rank transform”[7].

The second step in the process is image rectification, which ensures that the epipolar disparity searches are needed only along horizontal lines. The third step is more interesting, since it involves the calculation of C(p,d), Lr(p,d) and S(p,d) for every pixel-and-disparity combination (see Figure 13). If C(p,d) is a block-wise SAD operation over an N×N block, the required system range is ~200 meters, and the required distance accuracy calls for half-pixel disparity calculation, then the system must calculate C(p,d) for 64–128 disparity possibilities. The total compute requirement for C(p,d) with these parameters is 60 million × N² × 128 SAD operations every second.

The calculation of Lr(p,d) needs to be done for every pixel in "r" possible directions; hence this term (see Figure 13) must be calculated 60 million × 128 × r times per second. The calculation for one pixel requires five additions (if you count subtraction as a special form of addition) and one minimum-finding operation over four terms. Putting it together, the calculation of Lr(p,d) requires 60 million × 128 × r × 5 additions and 60 million × 128 × r minimum computations per second.

The calculation of S(p,d) needs to be done for every possible pixel and disparity value; every computation of S(p,d) requires "r" additions and one comparison. The total operations needed per second are therefore 60 million × 128 × r additions and 60 million × 128 comparisons.

Putting all three together, an accurate SGM-based disparity calculation engine, running on 1-Mpixel, 30-fps cameras and intending to calculate 128 disparity possibilities, will need to perform approximately 1 Tera operations (additions, subtractions, minimum finding) every second. To put this number in perspective, advanced general-purpose processors in the embedded domain issue seven to ten instructions per cycle. Some of these instructions are SIMD-type, i.e., they can process 8–16 pieces of data in parallel. Considering the best IPC that a general-purpose processor has to offer, a quad-core processor running at 2 GHz offers about 320 Giga 64-bit operations per second. Even if we assume that most of the stereo pipeline is 16-bit and that the data can be packed into 64-bit words with 100 percent efficiency, a quad-core general-purpose processor is hardly enough to meet the demands of a modern-day ADAS stereo vision system. The objective of a general-purpose processor is to afford high-level programmability of all kinds; designing a real-time ADAS stereo vision system therefore requires specialized hardware.
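The arithmetic above can be checked with a short back-of-envelope script. The 5×5 SAD window (N = 5) and r = 8 aggregation directions are illustrative choices, not values fixed by the article:

```python
# Rough operation count for the SGM pipeline sized in the text:
# 1-Mpixel stereo pair at 30 fps, 128 disparities.
PIX = 60_000_000          # pixels/second (2 cameras x 1 Mpixel x 30 fps)
N, D, r = 5, 128, 8       # SAD window, disparity count, SGM directions

rank_ops = PIX * N**2                     # census comparisons
c_ops    = PIX * N**2 * D                 # C(p,d): SAD at every disparity
lr_ops   = PIX * D * r * 5 + PIX * D * r  # Lr(p,d): additions + minima
s_ops    = PIX * D * r + PIX * D          # S(p,d): additions + comparisons

total = rank_ops + c_ops + lr_ops + s_ops
print(f"{total / 1e12:.2f} Tera-ops/second")
```

With these parameters the count lands around 0.6 Tera-ops/second, and modestly larger windows or more aggregation directions push it toward the ~1 Tera-ops figure quoted above.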

Robustness of calculations

The purpose of ADAS vision systems is to avoid, or at least minimize the severity of, road accidents. More than 1.2 million people are killed every year in road accidents, making them the leading cause of death for people aged 15–29 years. Pedestrians are the most vulnerable road users, with more than 270,000 pedestrians succumbing to injuries each year[12]. The major cause of road accidents is driver error, whether due to inattention or fatigue. Therefore, the most important purpose of an ADAS vision system for emergency braking is to reduce the severity and frequency of accidents. That is a double-edged requirement, since the vision system not only has to estimate distances correctly and with high robustness in every video frame, but must also minimize false-positive scenarios. To ensure the ADAS system is specified and designed for the right level of robustness, ISO 26262 was created as an international standard for the specification, design and development of electronic systems for automotive safety applications.

A little calculation here will bring out the estimated errors in computing distance for a stereo vision system. Please see Figure 15. If the error tolerances for the focal length (f) and the distance between the cameras (T) are 1 percent, and the accuracy of the disparity calculation algorithm is 5 percent, then the calculated distance (Z) will still be about 2.5 percent inaccurate. Improving the accuracy of the disparity calculation algorithm to a sub-pixel (quarter- or half-pixel) level is therefore important. This has two implications. The first is increased post-processing interpolation compute requirements in the algorithm and the hardware. The second requirement is more sophisticated and is related to ISO 26262.

Figure 15: Statistical error estimation for calculated distance by a stereo vision system[13].

The architecture and design needs to ensure that both transient and permanent errors in the electronic components are detected and flagged within the fault tolerant time interval (FTTI) of the system. The calculation of FTTI and the other related metrics is beyond the scope of this article, yet it should suffice to point out that the electronic components used to build the system need to enable achieving the required ASIL levels for the ADAS vision system.

System hardware options and summary

In this article, we reviewed the effectiveness of various algorithm options in general, and stereo-vision algorithms in particular, for calculating distance in an automotive ADAS safety emergency braking system. Texas Instruments is driving deep innovation in the field of ADAS processing in general, and efficient and robust stereo vision processing in particular.

There can be many different electronic system options to achieve the system design and performance objectives for an ADAS safety vision system. Heterogeneous chip architectures by Texas Instruments (TDA family) are suitable to meet the performance, power, size and ASIL functional safety targets for this particular application. A possible system block diagram for stereo and other ADAS systems using TI TDA2x and TDA3x devices and demonstrations of the technology are available at www.ti.com/ADAS.


  1. Cave paintings: http://en.wikipedia.org/wiki/Cave_painting#Southeast_Asia
  2. D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nesic, X. Wang, and P. Westling. “High-resolution stereo datasets with subpixel-accurate ground truth”. In German Conference on Pattern Recognition (GCPR 2014), Münster, Germany, September 2014
  3. Figure credit: “Vision-based Object Detection and Tracking”, Hyunggic!, http://users.ece.cmu.edu/~hyunggic/vision_detection_tracking.html
  4. Image credit: pixabay.com
  5. 3D Structure from 2D Motion, http://www.cs.columbia.edu/~jebara/papers/sfm.pdf
  6. Chapter 7 “Stereopsis” of the textbook of E. Trucco and A. Verri, Introductory Techniques for 3-D Computer Vision, Prentice Hall, NJ, 1998 and lecture notes from https://www.cs.auckland.ac.nz/courses/compsci773s1t/lectures/773-GG/topCS773.htm
  7. “The Complete Rank Transform: A Tool for Accurate and Morphologically Invariant Matching of Structures”, Mathematical Image Analysis Group, Saarland University, http://www.mia.uni-saarland.de/Publications/demetz-bmvc13.pdf
  8. Wenming Zhang, Kai Hao, Qiang Zhang and Haibin Li, "A Novel Stereo Matching Method based on Rank Transformation", http://ijcsi.org/papers/IJCSI-10-2-1-39-44.pdf
  9. “FFT-based stereo disparity estimation for stereo image coding”, Ahlvers, Zoelzer and Rechmeier
  10. “Semi-Global Matching”, http://lunokhod.org/?p=1356
  11. “Accurate and Efficient Stereo Processing by Semi-Global Matching and Mutual Information”, http://www.dlr.de/rm/en/PortalData/3/Resources/papers/modeler/cvpr05hh.pdf
  12. "More than 270,000 pedestrians killed on roads each year", http://www.who.int/mediacentre/news/notes/2013/make_walking_safe_20130502/en/
  13. “Propagation of Error”, http://chemwiki.ucdavis.edu/Analytical_Chemistry/Quantifying_Nature/Significant_Digits/
  14. Optical flow reference: “Structure from Motion and 3D reconstruction on the easy in OpenCV 2.3+ [w/ code]” http://www.morethantechnical.com/2012/02/07/structure-from-motion-and-3d-reconstruction-on-the-easy-in-opencv-2-3-w-code/
  15. Mutual information reference: “Mutual Information as a Stereo Correspondence Measure”, http://repository.upenn.edu/cgi/viewcontent.cgi?article=1115&context=cis_reports
  16. Image entropy analysis using Matlab®
  17. Image credit: “Engines idling in New York despite law”, CNN News, http://www.cnn.com/2012/02/06/health/engines-new-york-law/

By Aish Dubey
ADAS, Texas Instruments


Beyond-visible Light Applications in Computer Vision


Computer vision systems aren't necessarily restricted to solely analyzing the portion of the electromagnetic spectrum that is visually perceivable by humans. Expanding the analysis range to encompass the infrared and/or ultraviolet spectrum, either broadly or selectively and either solely or in conjunction with visible spectrum analysis, can be of great benefit in a range of visual intelligence applications. Successfully implementing these capabilities in traditional PC-centric hardware and software configurations is challenging enough; doing so in an embedded vision system which has stringent size, weight, cost, power consumption and other constraints is even more difficult. Fortunately, an industry alliance is available to help product creators optimally implement such vision processing in their resource-constrained hardware and software designs.

All objects emit radiation, with the type of radiation emitted primarily dependent on a particular object's temperature. Colder objects emit very low frequency waves (such as radio, microwaves, and infrared radiation), while warmer objects emit visible light or higher frequencies (ultraviolet, for example, or x-rays or gamma radiation). This spectral expanse is called the electromagnetic spectrum (EMS), and consists of all wavelengths from radio waves at the low-frequency end of the range to gamma rays at the high-frequency endpoint (Figure 1).
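The temperature-to-wavelength relationship described above is quantified by Wien's displacement law, which states that an object's peak emission wavelength is inversely proportional to its absolute temperature. A short sketch (not from the original article) illustrates why warm bodies image well in the LWIR band while the sun peaks in the visible band:

```python
# Wien's displacement law: lambda_peak = b / T, where b is
# Wien's displacement constant and T is absolute temperature.
WIEN_B = 2.898e-3  # Wien's displacement constant, in meter-kelvins

def peak_wavelength_nm(temperature_k: float) -> float:
    """Peak blackbody emission wavelength, in nanometers."""
    return WIEN_B / temperature_k * 1e9

# A human body (~310 K) peaks deep in the thermal infrared,
# while the sun (~5800 K) peaks in the visible band.
print(round(peak_wavelength_nm(310)))   # ~9348 nm (LWIR)
print(round(peak_wavelength_nm(5800)))  # ~500 nm (visible green)
```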

Figure 1. The electromagnetic spectrum encompasses wavelengths from radio waves to gamma rays, both selectively emitted and absorbed by various objects (courtesy Khan Academy).

Just as different objects emit radiation at different frequencies, different materials also absorb energy at different wavelengths across the EMS. A radio antenna, for example, is designed to capture radio waves, while humans' eyes have evolved to capture visible light. Technology has also evolved to take advantage of both the emission and absorption of EM waves. X-rays, for example, are an efficient means of imaging tissue and bone, because these particular materials exhibit high absorption at these particular wavelengths. Various applications in biology, meteorology and agriculture leverage these same transmission and absorption phenomena, albeit at different wavelengths and with different materials.

Within the EMS is a region known as the solar spectrum. Sunlight (i.e. EM radiation emitted by the sun) encompasses the infrared, visible, and ultraviolet light wavelength portions of the EMS. The solar spectrum is the particular focus of this article, which describes the technology, imaging components, and applications intended for use in these spectra. Figure 2 shows the ultraviolet, visible light and infrared band subsets of the solar spectrum in greater detail.

Figure 2. The solar spectrum is a subset of the EMS extending from ultraviolet (short wavelength) to infrared (long wavelength), and includes the intermediate human-visible range (courtesy Allied Vision).

Each of these three bands is further sub-categorized. Ultraviolet (UV), for example, is commonly segmented into three distinct regions: near UV (UVA, 315–400 nm), middle UV (UVB, 280–315 nm), and far UV (UVC, 180–280 nm). Visible light is similarly subdivided into separate color bands, commonly abbreviated as ROYGBIV (red/orange/yellow/green/blue/indigo/violet, ordered by decreasing wavelength, i.e. increasing frequency). And infrared (IR) begins just beyond the red end of the visible light spectrum, extending from there across up to five distinct segments: near IR (NIR or IR-A, 750–1,400 nm), short wavelength IR (SWIR or IR-B, 1,400–3,000 nm), mid wavelength IR, also called intermediate IR (MWIR or IIR, 3,000–8,000 nm), long wavelength IR (LWIR or IR-C, 8,000–15,000 nm) and far IR (FIR, 15,000–1,000,000 nm).
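The sub-band boundaries above lend themselves to a simple lookup, useful for example when tagging sensor or illumination specifications by band. A minimal sketch, using the wavelength figures quoted in the text:

```python
# Classify a wavelength (in nm) into the solar-spectrum sub-bands
# described above. Boundaries follow the figures quoted in the text.
BANDS = [
    (180, 280, "UVC"),       # far UV
    (280, 315, "UVB"),       # middle UV
    (315, 400, "UVA"),       # near UV
    (400, 750, "visible"),
    (750, 1400, "NIR"),      # IR-A
    (1400, 3000, "SWIR"),    # IR-B
    (3000, 8000, "MWIR"),    # intermediate IR
    (8000, 15000, "LWIR"),   # IR-C / thermal
    (15000, 1000000, "FIR"),
]

def classify(wavelength_nm: float) -> str:
    for low, high, name in BANDS:
        if low <= wavelength_nm < high:
            return name
    return "out of range"

print(classify(365))    # UVA (a common "black light" wavelength)
print(classify(1550))   # SWIR
print(classify(10000))  # LWIR (thermal imaging)
```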

Infrared Imaging

Infrared imaging encompasses many different technologies, products and applications. A variety of detection devices find use depending upon the specific wavelength(s) being sensed within the infrared band. Since silicon exhibits some absorption (typically 10%-25% quantum efficiency, QE, i.e. spectral response) in the NIR range, and sunlight comprises nearly 50% NIR wavelengths, standard silicon-based sensors can often also be used in NIR applications. Therefore, many conventional visible-light imaging applications can be easily modified to operate in the NIR. Several methods are available to accomplish this objective. One approach, for example, leverages a standard CCD or CMOS imager, blocking visible light from reaching the sensor by adding low-pass or band-pass filters to the optics and/or imager.

Other enhancement methods involve special sensor designs, encompassing altered pixel geometry and spectrum-tailored glass/microlens arrangements, for example. For night vision implementations, or when sunlight is otherwise not available, the incorporation of NIR illumination is another proven method, used extensively in security and surveillance applications. More generally, the applications for NIR imaging are numerous and growing. Biometrics has adopted NIR technology for fingerprint, vein and iris imaging, for example. Medical and forensics applications also benefit from NIR imaging, as do sports analytics and machine vision. Because, as previously mentioned, many commercial sensors have NIR sensitivity, numerous NIR-capable cameras and camera modules are available on the market for embedded applications.

Moving from NIR to SWIR requires a transition in detector technology. Sensors used in SWIR cameras operate conceptually similarly to silicon-based CCD or CMOS sensors; they convert photons into electrons, thereby acting as quantum detectors. However, in order to detect light beyond the visible spectrum, these particular detectors are made of materials such as indium gallium arsenide (InGaAs) or mercury cadmium telluride (MCT, HgCdTe). Although infrared radiation in the SWIR region is not visible to the human eye, it interacts with objects in a similar manner as visible wavelengths. Images from an InGaAs sensor, for example, are comparable to visible light images in resolution and detail, although SWIR images are monochrome-only.

InGaAs sensors also respond to differences in thermal energy. This makes them less sensitive to changes in light conditions, thereby rendering them effective in challenging environmental conditions such as low light, dust and haze. One major benefit of imaging in the SWIR band is the ability to image through glass. Special lenses are usually unnecessary with SWIR cameras. SWIR cameras are also used in high-temperature environments; they are ideally suited for imaging objects ranging from 250°C to 800°C. With respect to embedded applications, SWIR cameras do not typically meet the size, power and cost targets of most systems. Many SWIR cameras require active cooling, for example. However, ongoing design and production improvements are helping SWIR technology deliver reasonable return on investment (ROI), especially for the inspection of high-value products, for example, or in emerging medical OEM applications.

MWIR, like SWIR, requires a dedicated detector material. Mercury cadmium telluride (MCT) and indium antimonide (InSb) are most commonly used, and an integrated Dewar cooler assembly (IDCA) is also usually included in the design. MWIR camera assemblies can be complicated, but their size, weight, power and cost are steadily decreasing. They most often find use in target signature identification, surveillance, and non-destructive testing applications. They are capable of very long-range imaging, as well as detecting minute changes in temperature.

LWIR, also referred to as thermal imaging, is an approach wherein the detector responds to thermal energy rather than photons (Figure 3). Such thermal detectors operate differently than quantum (photon) detectors: they absorb energy, causing a localized increase in temperature, which in turn creates an electrical charge. These detectors are typically microbolometers, most commonly constructed of either amorphous silicon (a-Si) or vanadium oxide (VOx). Applications for LWIR imaging are numerous, ranging from preventive equipment maintenance and home energy audits to agricultural monitoring, border control and other surveillance scenarios. The cost and complexity of LWIR cameras and camera cores has decreased significantly in recent years; consumer-priced LWIR components are even available.

Figure 3. While a gyrocopter-mounted conventional digital machine vision camera is capable of capturing only a monochrome view of a scene (left), a LWIR camera provides additional thermal information for analysis purposes (right) (courtesy Application Center for Multimodal and Airborne Sensor Technology).

Ultraviolet Imaging

UV imaging finds use in a wide variety of applications, from industrial to medical. Two fundamental motivations for imaging with UV are the need to detect or measure very small features in an object, and the ability to leverage the UV-sensitive properties of an object. Two main techniques for UV imaging exist: reflected imaging and fluorescence imaging. Reflected imaging combines a UV light source and a UV-sensitive detector. Since UV spans a wide range of wavelengths, wide-range UV detectors are correspondingly necessary for reflective imaging. For near UV, standard CCD/CMOS sensors retain some sensitivity. However, both the sensor and optics must not include any UV filtering or coating. Sunlight may provide enough UV light for imaging in the near UV band; otherwise, a secondary light source is required. For middle UV and far UV, conversely, special detectors, light sources and optics are required. These bands are relevant in some commercial imaging applications. However, as with MWIR and beyond, increased cost and design effort are required to leverage them.

Fluorescence imaging also uses a UV light source, in this case accompanied by a detector with sensitivity to another wavelength (typically somewhere in the visible band). The UV light source shines on the material(s) to be imaged, which absorbs it. The object subsequently emits photons, with the material outputting light at a longer-than-UV wavelength; blue is commonly encountered in industrial applications, for example. To effectively image the fluorescence requires the use of bandpass filters tuned for the emitted wavelength (Figure 4). Such bandpass filters block the light coming from the UV light source, which can otherwise interfere during imaging. Applications for fluorescence imaging include microscopy, and 2D data matrix recognition and inspection.

Figure 4. Normally, an object illuminated with UV light provides at best a muted fluorescence response (left).  However, by filtering the UV light with a bandpass filter, the object’s photon response becomes more obvious (right) (courtesy Allied Vision).

Infrared Imaging Case Study

Use of the infrared spectrum can be beneficial for numerous real world applications. Consider agricultural applications, for example, where infrared cameras can monitor plant health in a nondestructive manner (Figure 5). LemnaTec GmbH is one company developing solutions for this particular application. LemnaTec's Scanalyzer3D system utilizes various infrared wavelengths to monitor plant health. Agriculturalists use the information provided by Scanalyzer3D to determine reactive and proactive measures necessary to ensure robust crop growth and longevity.

Figure 5. LemnaTec’s Scanalyzer3D IR-based imaging and analysis system enables non-destructive plant phenotyping (top). Thermal analysis, for example, generates false coloring that depicts temperature variance (bottom) (courtesy LemnaTec GmbH).

Water absorbs infrared light in the 1400 to 1500 nm wavelength range. In an infrared image, water therefore appears opaque. As a result, the areas of a plant that are darker in an infrared image reflect the presence (and relative abundance) of water in these regions. More generally, Scanalyzer3D uses multiple infrared cameras to implement the following functions:

  • An LWIR camera measures the plant’s temperature, which indicates whether the plant is within its ideal growing temperature range
  • A NIR camera, in conjunction with a SWIR camera, detects moisture absorption in each plant, thereby determining its root system’s efficiency
  • The SWIR camera also finds use in viewing each plant's water distribution, determining if it is sufficiently hydrated
  • LemnaGrid, LemnaTec’s proprietary image processing software, handles all image processing operations. It enables users to both design imaging processes and to implement various image enhancements, such as false color rendering.
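Since water absorbs strongly in the 1400-1500 nm range, darker regions in a SWIR image correspond to water-rich plant tissue. A minimal numpy sketch of estimating a plant's hydrated fraction by thresholding darkness follows; the threshold value and function name are illustrative assumptions, not LemnaTec parameters:

```python
import numpy as np

def hydrated_fraction(swir_image: np.ndarray, threshold: float = 0.3) -> float:
    """Fraction of pixels darker than `threshold` in a normalized
    (0..1) SWIR image; dark pixels indicate water absorption."""
    norm = swir_image.astype(float) / swir_image.max()
    return float(np.mean(norm < threshold))

# Synthetic example: a "leaf" whose left half absorbs strongly (wet).
img = np.ones((10, 10)) * 200
img[:, :5] = 20  # dark pixels, i.e. water-rich tissue
print(hydrated_fraction(img))  # 0.5
```

A real system would first segment the plant from the background and calibrate reflectance, but the thresholding principle is the same.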


Expanding the computer vision analysis range beyond the visual spectrum to encompass the infrared and/or ultraviolet electromagnetic spectrum, either broadly or selectively and either solely or in conjunction with visible spectrum analysis, is of great benefit in a range of visual intelligence applications (see sidebar "Multispectral and Hyperspectral Imaging"). Successfully implementing such capabilities in an embedded vision system with stringent size, weight, cost, power consumption and other constraints can be challenging. Fortunately, an industry alliance is available to help product creators optimally implement such vision processing in their resource-constrained hardware and software designs (see sidebar "Additional Developer Assistance").

By Brian Dipert
Editor-in-Chief, Embedded Vision Alliance

Francis Obidimalor
Marketing Manager Americas, Allied Vision

Sidebar: Multispectral and Hyperspectral Imaging

Monochrome cameras, as their name implies, accept all incoming light, with no spectral differentiation. The most common cameras that selectively distinguish between different wavelengths are visible-light color cameras, in which filter arrangements such as the popular Bayer pattern are located above the surface of the image sensor(s). Only the red, green, or blue portion of the visible light spectrum passes through each filter to its associated sensor pixel. A color image is generated by these directly captured spectrum components, along with inter-pixel interpolation to create an approximation of the remainder of the visible spectrum.
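The Bayer subsampling just described can be sketched in a few lines. Assuming an RGGB cell layout (one of several common Bayer orderings), each 2x2 cell carries one red, two green and one blue sample; the simple split below averages the two green samples rather than performing the full inter-pixel interpolation a production demosaic would apply:

```python
import numpy as np

def split_bayer_rggb(raw: np.ndarray):
    """Split an RGGB Bayer mosaic into quarter-resolution R, G, B planes
    (the two green samples per 2x2 cell are averaged)."""
    r = raw[0::2, 0::2]
    g = (raw[0::2, 1::2].astype(float) + raw[1::2, 0::2]) / 2
    b = raw[1::2, 1::2]
    return r, g, b

# A single 2x2 cell: R=100, G=60 and G=40, B=20
raw = np.array([[100, 60],
                [40, 20]])
r, g, b = split_bayer_rggb(raw)
print(r[0, 0], g[0, 0], b[0, 0])  # 100 50.0 20
```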

In order to more precisely visualize both visible and beyond-visible light spectral signatures, a considerably larger number of selective-spectrum channels is required. Equipment that captures and analyzes fewer than 100 spectral bands is commonly known as a multispectral camera; hyperspectral cameras, conversely, capture more than 100 channels' worth of selective-spectrum data. Conceptually common to all such spectral cameras is the splitting of incident light into many individual components, which are individually detected on different areas of the sensor surface. In contrast to conventional monochrome or color cameras, much more inbound light is needed to obtain easily evaluated images with spectral cameras.

Various technologies find use in accomplishing this fine-grained spectral splitting. For simpler situations involving comparatively few spectral bands, different band-pass filters are sequentially positioned in front of the sensor during the capture of a long-exposure (or alternatively, multi-exposure) image. Overall spectral signature recording is accomplished by combining the data generated by this sequential individual-band recording. One perhaps obvious disadvantage of this technique is that the object being analyzed must remain completely still during the lengthy single- or multiple-exposure interval.

Today's market is dominated by so-called "pushbroom" cameras (Figure A). This approach passes light from a narrow slit through a prism or diffraction grating and subsequently projects it onto an area image sensor. Since pushbroom cameras "see" only one slit's worth of the scene at a time, either the camera or the object being analyzed must move perpendicular to the slit (via a drone, for example, or a conveyor belt) in order to capture area data. Pushbroom cameras are capable of achieving high spatial and spectral resolutions, up to several hundred spectral bands depending on the sensor type employed.
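Conceptually, each pushbroom scan position yields one frame of shape (spatial pixels, spectral bands), and the full hypercube is simply the stack of those frames along the scan axis. A sketch, with made-up dimensions:

```python
import numpy as np

def assemble_pushbroom(frames):
    """Stack per-position pushbroom frames, each of shape
    (spatial_pixels, spectral_bands), into a hypercube of shape
    (scan_positions, spatial_pixels, spectral_bands)."""
    return np.stack(frames, axis=0)

# Example: 100 scan positions, a 640-pixel slit, 224 spectral bands
frames = [np.zeros((640, 224)) for _ in range(100)]
cube = assemble_pushbroom(frames)
print(cube.shape)  # (100, 640, 224)
```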

Figure A. "Pushbroom" cameras are capable of high spatial and spectral resolutions but require sequential movement of either the camera or object being visualized, with the other held completely still through the lengthy scanning interval (courtesy XIMEA).

In addition to these established approaches, emerging alternative technologies have also appeared for generating spectral differentiation at the image sensor. One approach involves the creation of fine-grained bandpass filters, conceptually similar to the previously mentioned coarse-grain Bayer color pattern, for both visible and beyond-visible light bands. Sensors developed by IMEC, a research institute in Belgium, have also recently become viable candidates for industrial use. This particular approach involves a small Fabry-Perot interference filter applied to each sensor pixel, resulting in a narrow pass-through spectral band.

Both "fan-out" filter arrangements, resembling the spectral line-scan behavior of a pushbroom camera, and "snapshot" arrangements are available. In the latter case, 4x4 or 5x5 filter patterns are constructed and replicated across the sensor surface. Advantages of the "snapshot" approach include the fact that each image captured contains a complete set of spectral information, and that the object being analyzed can be in motion. However, accompanying these advantages is a reduction in spatial resolution capability versus other approaches.
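The spatial-resolution trade-off of the "snapshot" approach is easy to see in code: a replicated 4x4 filter pattern separates into 16 spectral planes via strided slicing, each plane having one quarter of the sensor's resolution in each axis. A numpy sketch:

```python
import numpy as np

def demosaic_snapshot(raw: np.ndarray, n: int = 4):
    """Separate a snapshot mosaic with an n-by-n replicated filter
    pattern into n*n reduced-resolution spectral planes."""
    return np.stack([raw[i::n, j::n] for i in range(n) for j in range(n)],
                    axis=-1)

raw = np.arange(8 * 8).reshape(8, 8)  # toy 8x8 sensor, 4x4 pattern
cube = demosaic_snapshot(raw)
print(cube.shape)  # (2, 2, 16): 16 bands, each at 1/16 the pixel count
```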

By Jürgen Hillmann

Sidebar: Additional Developer Assistance

The Embedded Vision Alliance, a worldwide organization of technology developers and providers, is working to empower product creators to transform the potential of vision processing into reality. Allied Vision and XIMEA, the co-authors of this article, are members of the Embedded Vision Alliance. The Embedded Vision Alliance's mission is to provide product creators with practical education, information and insights to help them incorporate vision capabilities into new and existing products. To execute this mission, the Embedded Vision Alliance maintains a website providing tutorial articles, videos, code downloads and a discussion forum staffed by technology experts. Registered website users can also receive the Embedded Vision Alliance’s twice-monthly email newsletter, Embedded Vision Insights, among other benefits.

The Embedded Vision Alliance also offers a free online training facility for vision-based product creators: the Embedded Vision Academy. This area of the Embedded Vision Alliance website provides in-depth technical training and other resources to help product creators integrate visual intelligence into next-generation software and systems. Course material in the Embedded Vision Academy spans a wide range of vision-related subjects, from basic vision algorithms to image pre-processing, image sensor interfaces, and software development techniques and tools such as OpenCL, OpenVX and OpenCV, along with Caffe, TensorFlow and other deep learning frameworks. Access is free to all through a simple registration process.

The Embedded Vision Alliance and its member companies periodically deliver webinars on a variety of technical topics. Access to on-demand archive webinars, along with information about upcoming live webinars, is available on the Alliance website. Also, the Embedded Vision Alliance has begun offering "Deep Learning for Computer Vision with TensorFlow," a full-day technical training class planned for a variety of both U.S. and international locations. See the Alliance website for additional information and online registration.

The Embedded Vision Alliance’s annual technical conference and trade show, the Embedded Vision Summit, is intended for product creators interested in incorporating visual intelligence into electronic systems and software. The Embedded Vision Summit provides how-to presentations, inspiring keynote talks, demonstrations, and opportunities to interact with technical experts from Embedded Vision Alliance member companies. The Embedded Vision Summit is intended to inspire attendees' imaginations about the potential applications for practical computer vision technology through exciting presentations and demonstrations, to offer practical know-how for attendees to help them incorporate vision capabilities into their hardware and software products, and to provide opportunities for attendees to meet and talk with leading vision technology companies and learn about their offerings.

The most recent Embedded Vision Summit took place in Santa Clara, California on May 1-3, 2017; a slide set along with both demonstration and presentation videos from the event are now in the process of being published on the Alliance website. The next Embedded Vision Summit is scheduled for May 22-24, 2018, again in Santa Clara, California; mark your calendars and plan to attend.

Use a Camera Model to Accelerate Camera System Design

This blog post was originally published by Twisthink. It is reprinted here with the permission of Twisthink.

Visual Intelligence Opportunities in Industry 4.0


In order for industrial automation systems to meaningfully interact with the objects they're identifying, inspecting and assembling, they must be able to see and understand their surroundings. Cost-effective and capable vision processors, fed by depth-discerning image sensors and running robust software algorithms, continue to transform longstanding industrial automation aspirations into reality. And, with the emergence of the Industry 4.0 "smart factory," this visual intelligence will further evolve and mature, as well as expand into new applications, as a result becoming an increasingly critical aspect of various manufacturing processes.

Computer vision-based products have already established themselves in a number of industrial applications, with the most prominent one being factory automation, where the application is also commonly referred to as machine vision. Machine vision was one of the first, and today is one of the most mature, high volume computer vision opportunities. And as manufacturing processes become increasingly autonomous and otherwise more intelligent, the associated opportunities for computer vision leverage similarly expand in both scope and robustness.

The term "Industry 4.0" is shorthand for the now-underway fourth stage of industrial evolution, with the four stages characterized as:

  1. Mechanization, water power, and steam power
  2. Mass production, assembly lines, and electricity
  3. Computers and automation
  4. Cyber-physical systems

Wikipedia introduces its entry for the term Industry 4.0 via the following summary:

Industry 4.0 is the current trend of automation and data exchange in manufacturing technologies. It includes cyber-physical systems, the Internet of things and cloud computing. Industry 4.0 creates what has been called a "smart factory". Within the modular structured smart factories, cyber-physical systems monitor physical processes, create a virtual copy of the physical world and make decentralized decisions. Over the Internet of Things, cyber-physical systems communicate and cooperate with each other and with humans in real time, and via the Internet of Services, both internal and cross-organizational services are offered and used by participants of the value chain.

The following series of essays expands on previously published information about industrial automation, which covered robotics systems and other aspects of computer vision-enabled autonomy at specific steps in the manufacturing process. The capabilities possible at each of these steps have notably improved in recent times, courtesy of deep learning-based algorithms and other technology advancements. And, given the current focus on the "smart factory," it's critical to implement robust interoperability and data interchange between the various manufacturing steps (as well as between the hardware and software present at each of these stages), and with the centralized "cloud" server resources that link the steps together, enable data archiving, and notably contribute to the overall data processing chain.

This article provides both background overview and implementation-specific information on the visual intelligence-enabled capabilities that are key to a robust and evolvable Industry 4.0 smart factory infrastructure, and how to support these capabilities at the chip, software, camera and overall system levels. It focuses on processing both at each "edge" stage in the manufacturing process, and within the "cloud" server infrastructure that interconnects, oversees and assists them. It particularly covers three specific vision-enabled capabilities that are critical to a meaningful Industry 4.0 implementation:

  • Identification of piece parts, and of assembled subsystems and systems
  • Inspection and quality assurance
  • Location and orientation during assembly

The contributors, sharing their insights and perspectives on various Industry 4.0 subjects, are all computer vision industry leaders and members of the Embedded Vision Alliance, an organization created to help product creators incorporate vision capabilities into their hardware systems and software applications (see sidebar "Additional Developer Assistance"):

  • Industrial camera manufacturer Basler
  • Machine vision software developer MVTec
  • Vision processor supplier Xilinx

A case study of an autonomous robot system developed by MVTec's partner, Bosch, provides implementation examples that demonstrate concepts discussed elsewhere in the article.

Basler's Perspectives on Visual Intelligence Opportunities in Industry 4.0

Image processing systems built around industrial cameras are already an essential component in automated production. Throughout all steps of production, from the inspection of raw materials and production monitoring (i.e. flaw detection) to final inspections and quality assurance, they are an indispensable aspect of achieving high efficiency and quality standards.

The term Industry 4.0 refers to new process forms and organization of industrial production. The core elements are networking and extensive data communication. The goal is self-organized, more strongly customized and efficient production based on comprehensive data collection and effective exchange of information.

Image processing plays a decisive role in Industry 4.0. Importantly, cameras are becoming smaller and more affordable, even as their performance improves. Where complex systems were once required, today's small, efficient systems can produce the same or better results. This technological progress, together with the possibilities of ever-expanding networking, opens up the potential for new Industry 4.0 applications.

Identification of piece parts, and of assembled subsystems and systems

Embedded vision systems, each comprised of a camera module and processing unit, can conceivably find use in every step of a manufacturing process. This applicability includes in-machine and otherwise difficult-to-access locations, thanks to the vision systems' small size, light weight and low heat dissipation (Figure 1). Such flexibility makes them useful in identifying both piece parts and complete products, a particularly useful capability with goods that cannot be tracked via conventional barcodes, for example.

Figure 1. Modern industrial cameras' compact form factors and low power consumption enable their use in a wide range of applications and settings (courtesy Basler).

Visual identification is also relevant for individually customized or otherwise uniquely manufactured products. In these and other usage scenarios, a cost-effective embedded vision system can be a beneficial data capture and delivery component, reducing the normally complex logistics of a bespoke manufacturing process.

Cameras utilized in embedded vision systems, whether implemented in small-box or bare-board form factors, are capable of delivering comparable image capture speed to classical machine vision cameras. This is a key capability, given that such performance is often an important parameter in parts identification applications. USB 3, LVDS, and MIPI CSI-2-based interface options correlate to transfer bandwidths of 250-500 MBytes/second, translating into 125-250 frame-per-second capture rates for typical 2 megapixel (i.e. "Full HD") resolution images.
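The quoted frame rates follow directly from the interface bandwidth: a 2-megapixel monochrome image at 8 bits per pixel occupies roughly 2 MB, so 250-500 MBytes/second yields 125-250 frames per second. A sketch of the arithmetic, ignoring protocol overhead:

```python
def max_fps(bandwidth_mb_s: float, megapixels: float,
            bytes_per_pixel: int = 1) -> float:
    """Upper-bound frame rate for a given interface bandwidth,
    ignoring protocol overhead and blanking intervals."""
    frame_mb = megapixels * bytes_per_pixel  # MB per frame
    return bandwidth_mb_s / frame_mb

# USB 3 at ~250-500 MB/s with a 2 MP, 8-bit sensor
print(max_fps(250, 2))  # 125.0
print(max_fps(500, 2))  # 250.0
```

Higher bit depths or color formats increase bytes per pixel and reduce the achievable rate proportionally.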

Surface inspection and other quality assurance tests

Enhanced inspection and quality assurance can involve not only the visual inspection of the final product but also of individual components prior to their assembly. Eliminating defective piece parts early likely improves final product yields, saving money. Historically, however, these additional cost savings did not always justify the expense of an additional vision inspection system.

Nowadays, though, as Industry 4.0 manufacturing setups become commonplace, vision systems are increasingly affordable. From a camera standpoint, this financial improvement is primarily the result of ongoing improvements in CMOS sensor technology. Today, even the most cost-effective image sensors frequently deliver sufficient image quality to meet industrial application demands. As a result, board-level cameras can now increasingly be implemented for piece part prequalification, thereby delivering a sophisticated inspection system leveraged in multiple stages of the manufacturing process.

Determination of part/subsystem location and orientation during assembly

Piece part identification, along with location and orientation determination (and adjustment, if necessary) are common industrial vision operations, frequently employed by robotic assembly systems. Industry 4.0 "smart factories" provide the opportunity to further expand these kinds of functions to include various human-machine interaction scenarios. Consider, for example, a head-mounted mobile vision system, perhaps combined with an augmented reality information display, used when overseeing both human worker and autonomous robot operations.

Such a system, capable of decreasing failure rates and otherwise optimizing various manufacturing processes, would need to be small in size (ideally, around the size of a postage stamp for the camera module), light in weight, and low in power consumption. Fortunately, in the modern day embedded vision era, such a combination is increasingly possible, as well as affordable.

By Thomas Rademacher
Product Market Manager, Basler

MVTec's Perspectives on Visual Intelligence Opportunities in Industry 4.0

Industry 4.0, also known as the Industrial Internet of Things (IIoT), is one of the most significant trends in the history of industrial production. As a result of this digital transformation, processes along the entire value chain are becoming consistently networked and highly automated. All aspects of production, including all participants and technologies involved—people, machines, sensors, transfer systems, smart devices, and software solutions—seamlessly work together via internal company networks and/or the Internet. Another manifestation of this trend is the digital factory, also known as the smart factory. In this environment, automated production components interact independently and in close cooperation with humans via the IIoT.

Machine vision, a technology that has paved the way for the IIoT, plays a key role (Figure 2). State-of-the-art image acquisition devices such as high-resolution cameras and sensors, in combination with high-performance machine vision software, process the recorded digital image information. A wide range of objects in the production environment is therefore detected and analyzed automatically based on their visual features - safely, precisely, and at high speeds. The interaction and communication between different system environments, such as machine vision and programmable logic controllers (PLCs), are continuously improved in automated production scenarios. And the systematic development of standards such as the Open Platform Communications Unified Architecture (OPC UA) is key to ongoing development in this area.

Figure 2. The "eye of production" monitors all Industry 4.0 processes (courtesy MVTec).

Object identification based on a wide range of features

As the "eye of production", machine vision continuously sees and monitors all workflows relevant to production, using this data to optimize numerous application scenarios within the IIoT. For example, machine vision can unambiguously identify work pieces based on surface features, such as shape, color, and texture. Bar codes and data codes also allow for unmistakable identification. Machine vision software can reliably read bar codes with overly narrow or wide bars, as well as identify products by imprinted character combinations via optical character recognition (OCR) with excellent results (Figure 3). The technology can even recognize blurry, distorted, or otherwise hard-to-read letters and numbers.

Figure 3. Robust machine vision software can even reliably read defect-ridden bar codes (courtesy MVTec).

For challenging identification tasks, deep learning methods such as convolutional neural networks (CNNs) are increasingly finding use. One key advantage of these technologies is that they are self-learning. This means that their algorithms analyze and evaluate large amounts of training data, independently recognize certain patterns in it, and subsequently use these trained patterns for identification (inference). This ability enables the software to significantly improve reading rates and accuracy. In networked IIoT processes, both codes and characters contain essential production information that finds use in automatically controlling all other process steps.

Reliable inspection of work pieces for defects

Quality assurance is another important machine vision application. Software fed by hardware-captured images reliably detects and identifies product damage and various other production faults. After being trained with only one or a few sample images, the software can immediately identify erroneous deviations by comparison. This robustness enables users to reliably locate many different defects using only a few parameters. The technology can also reveal tiny scratches, dents, hairline cracks, and other flaws that are not visible to the naked eye. Rejects can thus be removed automatically and in a timely fashion, before they proceed further in the process chain.
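
At its core, the comparison-based inspection described above amounts to differencing a captured image against a trained, defect-free reference and thresholding the deviation. The following is a deliberately simplified pure-Python sketch; a production system would also align the images first and use an optimized vision library:

```python
def find_defects(reference, captured, threshold=30):
    """Compare a captured grayscale image against a defect-free
    reference ("golden") image; return the (row, col) pixels whose
    intensity deviates by more than `threshold` grey levels.

    Images are lists of rows of 0-255 intensity values and are
    assumed to be pre-aligned and identically sized.
    """
    defects = []
    for r, (ref_row, cap_row) in enumerate(zip(reference, captured)):
        for c, (ref_px, cap_px) in enumerate(zip(ref_row, cap_row)):
            if abs(ref_px - cap_px) > threshold:
                defects.append((r, c))
    return defects

golden = [[200, 200, 200],
          [200, 200, 200]]
part   = [[200, 120, 200],   # dark scratch at (0, 1)
          [200, 200, 195]]   # small deviation, below threshold
print(find_defects(golden, part))  # -> [(0, 1)]
```

The `threshold` parameter is the "few parameters" idea in miniature: a single tolerance separates acceptable process variation from a reportable defect.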

Highly developed 3D-based machine vision methods also make it possible to detect material defects that extend below the surface of objects, such as embedded air bubbles and dust particles. Even the smallest unevenness can be detected using images captured from different perspectives, with corresponding varying shadow formation. This enhancement significantly improves the process of surface inspection. An especially important element in the context of the IIoT is that the number and type of defective parts is immediately reported to production planning and control systems, as well as job planning, thereby allowing for automatic triggering of reorders.

The increasingly widespread use of mobile devices in the industrial setting is also a notable characteristic of the IIoT. It is therefore important that a broad range, if not the entirety, of machine vision capabilities be accessible by devices such as smartphones, tablets, smart cameras, and smart sensors. Specifically, software support for Arm processors, common in such products, is critical. When this requirement is met, machine vision can become increasingly independent of legacy PCs. After all, the majority of these mobile devices also have very powerful components, such as high performance processors, an abundance of RAM and flash memory, and high-resolution image sensors.

Determining the location and orientation of objects during assembly

The use of robots, which are becoming more and more compact, mobile, flexible, and human-like over time, is typical in IIoT scenarios. So-called collaborative robots (cobots) often work closely with their human colleagues; they may, for example, transfer raw materials or work pieces to each other during assembly. Machine vision solutions ensure a high standard of safety and efficiency in such scenarios. Depth-sensing functionality integrated into the software can safely determine the position, direction of movement, and speed of objects in three-dimensional space, thus preventing collisions and other calamities.

The handling of parts can likewise be consistently automated by the ability to accurately determine the parts’ locations. Robots can thus precisely position, reliably grip, process, place, and palletize a wide range of components. It is also possible to use components' sample images and CAD data to determine the exact 3D position, in order to detect curvature and edges.

Optimizing maintenance processes

Machine vision also performs valuable services in maintenance processes. Via its assistance, smartphones and tablets can find use for machine maintenance tasks. For example, if a defective component requires replacement, the technician can simply point a camera-equipped mobile device toward the control cabinet and take a photo. Machine vision software integrated into the smartphone then determines the appropriate replacement part, processing the image data and making it available to the remainder of the maintenance process.

Machine vision can implement predictive maintenance, too. Cameras installed in various locations are capable of, for example, monitoring the machine park. If an irregularity is noticed - for example, if a thermal imaging camera detects an overheating machine - machine vision software can immediately sound an alarm. Technicians can then intervene before the machine experiences complete failure.

Best practice: Machine vision and robots work together in assembly

A case study application scenario demonstrates how automated handling is already robustly implemented. Bosch's APAS (Automatic Production Assistants) are capable of performing a wide range of automated assembly and inspection tasks. Individual process modules can be flexibly combined to implement various functions. Both the APAS assistant collaborative robot and the APAS inspector for visual inspections leverage MVTec's HALCON machine vision software (Figure 4). The photometric stereo 3D vision technology integrated into the APAS inspector, for example, analyzes the "blanks" handled by the APAS assistant for scratches and other damage.

Figure 4. Multiple Bosch APAS process modules leverage HALCON machine vision software (courtesy Bosch).

Machine vision software also supports the precise positioning of individual process modules that are not physically connected. By means of optical markers and with the help of machine vision software, the cameras integrated into the APAS assistant's gripper arms precisely determine module coordinates. The robot thus always knows the exact position of individual modules in three-dimensional space and can coordinate them with its gripper arm. This approach results in optimum interaction between the robot and the other modules, even if the modules’ arrangement changes.

By Johannes Hiltner
Product Manager HALCON, MVTec

Xilinx's Perspectives on Visual Intelligence Opportunities in Industry 4.0

Embedded vision-based systems are increasingly being deployed in so-called "Industry 4.0" applications. Here embedded vision enables both automation, using vision for positioning and guidance, and data collection and decision-making, by means of vision systems operating in the visible or wider electromagnetic spectrum. And both the IIoT and the "cloud" serve to interconnect these technologies.

Within Industry 4.0 applications, common use cases for embedded vision include those involving the identification and inspection of components, piece parts, sub-systems and systems. In many cases, the embedded vision system may also be responsible for feeding data back into the manufacturing process in order to adjust part positioning, for example, or to remove defective or otherwise incorrect parts from the manufacturing flow. These diverse functions create challenges for the processing system, which connects cameras together with other subsystems (robotic, positioning, actuation, etc.):

  • Multi-camera support: The processing system must be able to interface with multiple cameras, co-aligned for depth perception to provide a more complete view of parts in the manufacturing flow.
  • Connectivity: It must be able to connect to both the operational and information networks, along with other standard industrial interfaces such as illumination systems, and actuation and positioning systems.
  • Highly deterministic: It must be able to perform the required analysis and decision-making tasks without delaying the production line.
  • Leverage machine learning: It must be able to implement machine learning techniques at the "edge" to achieve better quality of results and therefore optimize yield.
  • SWAP-C: Due to the number of embedded vision systems deployed, developers must consider the total size, weight, power consumption and cost of the proposed solution.

The processing system must also address the evolving demands brought by ever-increasing frame rates, resolution, and bits per pixel.

Support for multiple cameras and their associated connectivity requires a solution that enables any-to-any connectivity by being PHY-configurable for any necessary interface. Many Industry 4.0 applications utilize flexible programmable logic, which inherently supports any-to-any interfacing capabilities, to address these interface challenges. And image processing also commonly leverages an application processor combining a CPU core with one or multiple heterogeneous co-processors (GPU, DSP, programmable logic, etc.) thereby enabling increased determinism, reduced latency, higher performance, lower power consumption and otherwise more efficient parallel processing within a single SoC.

To simplify the development and implementation of these processing systems, developers implementing Industry 4.0 applications can leverage open-source, high-level languages and frameworks such as OpenCV and OpenVX for image processing and Caffe for machine intelligence. These frameworks enable algorithm development time to be greatly reduced and also allow developers to focus on their value-added activities, thereby differentiating themselves from their competitors. Xilinx’s reVISION stack, which enables OpenCV and Caffe functions to be accelerated in programmable logic, is an example of a vendor-optimized development toolset.

Even if Industry 4.0 applications are processed at the edge, information regarding these processing decisions is commonly also communicated to the cloud for archiving, further analysis, etc. One example application involves the centralized collation and analysis of yield information from several manufacturing lines located across the world, in order to determine and "flag" quality issues. While this collective analysis does not necessarily need to be performed in real time, it still needs to occur quickly; should a quality issue arise, a swift response is necessary in order to minimize yield losses and other impacts. And archiving the data is valuable for assessing long-term trends.

Such situations can potentially employ deep machine inference within the "cloud" server, guided by prior training. FPGA-based inference accelerators are beneficial here, to minimize analysis and response latencies. Xilinx's Reconfigurable Acceleration Stack (RAS) is one example of a vendor-optimized toolset that simplifies "cloud" application development, leveraging industry-standard frameworks and libraries such as Caffe and SQL.

By Giles Peckham
Regional Marketing Director, Xilinx

and Adam Taylor
Embedded Systems Consultant, Xilinx


The rapidly expanding use of vision technology in industrial automation is part of a much larger trend. From consumer electronics to automotive safety systems, vision technology is enabling a wide range of products that are more intelligent and responsive than before, and thus more valuable to users. The term “embedded vision” refers to this growing practical use of visual intelligence in embedded systems, mobile devices, special-purpose PCs, and the cloud, with industrial automation being one showcase application.

Embedded vision can add valuable capabilities to existing products, such as the vision-enhanced industrial automation systems discussed in this article. It can provide significant new markets for hardware, software and semiconductor manufacturers. And a worldwide industry alliance is available to help product creators optimally implement vision processing in their hardware and software designs.

By Brian Dipert
Editor-in-Chief, Embedded Vision Alliance

Sidebar: Additional Developer Assistance

The Embedded Vision Alliance, a worldwide organization of technology developers and providers, is working to empower product creators to transform the potential of vision processing into reality. Basler, MVTec, and Xilinx, the co-authors of this article, are members of the Embedded Vision Alliance. The Embedded Vision Alliance's mission is to provide product creators with practical education, information and insights to help them incorporate vision capabilities into new and existing products. To execute this mission, the Embedded Vision Alliance maintains a website providing tutorial articles, videos, code downloads and a discussion forum staffed by technology experts. Registered website users can also receive the Embedded Vision Alliance’s twice-monthly email newsletter, Embedded Vision Insights, among other benefits.

The Embedded Vision Alliance also offers a free online training facility for vision-based product creators: the Embedded Vision Academy. This area of the Embedded Vision Alliance website provides in-depth technical training and other resources to help product creators integrate visual intelligence into next-generation software and systems. Course material in the Embedded Vision Academy spans a wide range of vision-related subjects, from basic vision algorithms to image pre-processing, image sensor interfaces, and software development techniques and tools such as OpenCL, OpenVX and OpenCV, along with Caffe, TensorFlow and other deep learning frameworks. Access is free to all through a simple registration process.

The Embedded Vision Alliance and its member companies periodically deliver webinars on a variety of technical topics. Access to on-demand archive webinars, along with information about upcoming live webinars, is available on the Alliance website. Also, the Embedded Vision Alliance has begun offering "Deep Learning for Computer Vision with TensorFlow," a full-day technical training class planned for a variety of both U.S. and international locations. See the Alliance website for additional information and online registration.

The Embedded Vision Alliance’s annual technical conference and trade show, the Embedded Vision Summit, is intended for product creators interested in incorporating visual intelligence into electronic systems and software. The Embedded Vision Summit provides how-to presentations, inspiring keynote talks, demonstrations, and opportunities to interact with technical experts from Embedded Vision Alliance member companies. The Embedded Vision Summit is intended to inspire attendees' imaginations about the potential applications for practical computer vision technology through exciting presentations and demonstrations, to offer practical know-how for attendees to help them incorporate vision capabilities into their hardware and software products, and to provide opportunities for attendees to meet and talk with leading vision technology companies and learn about their offerings.

The most recent Embedded Vision Summit took place in Santa Clara, California on May 1-3, 2017; a slide set along with both demonstration and presentation videos from the event are now in the process of being published on the Alliance website. The next Embedded Vision Summit is scheduled for May 22-24, 2018, again in Santa Clara, California; mark your calendars and plan to attend.


Software Frameworks and Toolsets for Deep Learning-based Vision Processing


This article provides both background and implementation details on software frameworks and toolsets for deep learning-based vision processing, an increasingly popular and robust alternative to classical computer vision algorithms. It covers the leading available software framework options, the root reasons for their abundance, and guidelines for selecting an optimal approach among the candidates for a particular implementation. It also covers "middleware" utilities that optimize a generic framework for use in a particular embedded implementation, comprehending factors such as applicable data types and bit widths, as well as available heterogeneous computing resources.

For developers in specific markets and applications, toolsets that incorporate deep learning techniques can provide an attractive alternative to an intermediary software framework-based development approach. And the article also introduces an industry alliance available to help product creators optimally implement deep learning-based vision processing in their hardware and software designs.

Traditionally, computer vision applications have relied on special-purpose algorithms that are painstakingly designed to recognize specific types of objects. Recently, however, CNNs (convolutional neural networks) and other deep learning approaches have been shown to be superior to traditional algorithms on a variety of image understanding tasks. In contrast to traditional algorithms, deep learning approaches are generalized learning algorithms trained through examples to recognize specific classes of objects, for example, or to estimate optical flow. Since deep learning is a comparatively new approach, however, developer-community expertise with it is comparatively immature compared with that for traditional algorithms such as those included in the OpenCV open-source computer vision library.

General-purpose deep learning software frameworks can significantly assist both in getting developers up to speed and in getting deep learning-based designs completed in a timely and robust manner, as can deep learning-based toolsets focused on specific applications. However, when using them, it's important to keep in mind that the abundance of resources that may be assumed in a framework originally intended for PC-based software development, for example, aren't likely also available in an embedded implementation. Embedded designs are also increasingly heterogeneous in nature, containing multiple computing nodes (not only a CPU but also GPU, FPGA, DSP and/or specialized co-processors); the ability to efficiently harness these parallel processing resources is beneficial from cost, performance and power consumption standpoints.

Deep Learning Framework Alternatives and Selection Criteria

The term "software framework" can mean different things to different people. In a big-picture sense, you can think of it as a software package that includes all elements necessary for the development of a particular application. Whereas an alternative software library implements specific core functionality, such as a set of algorithms, a framework provides additional infrastructure (drivers, a scheduler, user interfaces, a configuration parser, etc.) to make practical use of this core functionality. Beyond this high-level overview definition, any more specific classification of the term "software framework", while potentially more concrete, intuitive and meaningful to some users, would potentially also exclude other subsets of developers' characterizations and/or applications' uses.

When applied to deep learning-based vision processing, software frameworks contain different sets of elements, depending on their particular application intentions. Frameworks for designing and training DNNs (deep neural networks) provide core algorithm implementations such as convolutional layers, max pooling, loss layers, etc. In this initial respect, they're essentially a library. However, they also provide all of the necessary infrastructure to implement functions such as reading a network description file, linking core functions into a network, reading data from training and validation databases, running the network forward to generate output, computing loss, running the network backward to adapt the weights, and repeating this process as many times as is necessary to adequately train the network.
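
Stripped of all framework infrastructure, the training cycle just described (run the network forward, compute loss, run backward, adapt the weights, repeat) reduces to a loop like the following toy sketch, which fits a single linear "neuron" by gradient descent. Real frameworks apply this same cycle to multi-layer networks over mini-batches:

```python
# Toy training loop: fit y = w*x + b to sample data by gradient descent.
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # samples of y = 2x + 1
w, b, lr = 0.0, 0.0, 0.05                    # weights and learning rate

for epoch in range(2000):
    for x, y in data:
        y_hat = w * x + b          # forward pass: generate output
        loss = (y_hat - y) ** 2    # compute loss (squared error)
        grad = 2.0 * (y_hat - y)   # backward pass: dLoss/dy_hat
        w -= lr * grad * x         # adapt the weight...
        b -= lr * grad             # ...and the bias

print(round(w, 2), round(b, 2))    # converges near 2.0 and 1.0
```

Everything a training framework adds, such as network description parsing, database readers, and schedulers, is infrastructure wrapped around this elementary loop.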

It’s possible to also use training-tailored frameworks for inference, in conjunction with a pre-trained network, since such "running the network forward" operations are part of the training process (Figure 1). As such, it may be reasonable to use training-intended frameworks to also deploy the trained network. However, although tools for efficiently deploying DNNs in applications are often also thought of as frameworks, they only support the forward pass. For example, OpenVX with its neural network extension supports efficient deployment of DNNs but does not support training. Such frameworks provide only the forward pass components of core algorithm implementations (convolution layers, max pooling, etc.). They also provide the necessary infrastructure to link together these layers and run them in the forward direction in order to infer meaning from input images, based on previous training.

Figure 1. In deep learning inference, also known as deployment (right), a neural network analyzes new data it’s presented with, based on its previous training (left) (courtesy Synopsys).
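
The forward-only path that deployment-oriented frameworks implement can be pictured with a tiny pure-Python sketch of one convolution, ReLU and max-pooling stage. This is an illustrative single-channel toy; actual inference engines process many channels with heavily optimized kernels:

```python
def conv2d(image, kernel):
    """'Valid' 2-D convolution (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

def relu(fmap):
    """Zero out negative activations."""
    return [[max(0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling."""
    return [[max(fmap[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]

# One conv -> ReLU -> pool stage run in the forward direction.
image = [[0, 0, 1, 1, 0]] * 5        # a bright vertical stripe
edge_kernel = [[1, -1],              # responds to vertical edges
               [1, -1]]
features = max_pool(relu(conv2d(image, edge_kernel)))
print(features)  # -> [[0, 2], [0, 2]]
```

An inference framework chains many such layers, with the weights fixed by prior training; no loss computation or backward pass is ever needed.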

Current examples of frameworks intended for training DNNs include Caffe, Facebook's Caffe2, Microsoft's Cognitive Toolkit, Darknet, MXNet, Google's TensorFlow, Theano, and Torch (Intel's Deep Learning Training Tool and NVIDIA's DIGITS are a special case, as they both run Caffe "under the hood"). Inference-specific frameworks include the OpenCV DNN module, Khronos' OpenVX Neural Network Extension, and various silicon vendor-specific tools. Additionally, several chip suppliers provide proprietary tools for quantizing and otherwise optimizing networks for resource-constrained embedded applications, which will be further discussed in subsequent sections of this article. Such tools are sometimes integrated into a standalone framework; other times they require (or alternatively include a custom version of) another existing framework.

Why do so many framework options exist? Focusing first on those intended for training, reasons for this diversity include the following:

  • Various alternatives were designed more or less simultaneously by different developers, ahead of the potential emergence of a single coherent and comprehensive solution.
  • Different offerings reflect different developer preferences and perspectives regarding DNNs. Caffe, for example, is in some sense closest to an application, in that a text file is commonly used to describe a network, with the framework subsequently invoked via the command line for training, testing and deployment. TensorFlow, in contrast, is closer to a language, specifically akin to Matlab with a dataflow paradigm. Meanwhile, Theano and Torch are reminiscent of a Python library.
  • Differences in capabilities also exist, such as different layer types supported by default, as well as support (or not) for integer and half-float numerical formats.

Regarding frameworks intended for efficient DNN deployment, differences between frameworks commonly reflect different design goals. OpenVX, for example, is primarily designed for portability while retaining reasonable levels of performance and power consumption. The OpenCV DNN module, in contrast, is designed first and foremost for ease of use. And of course, the various available vendor-specific tools are designed to solely support particular hardware platforms.

Finally, how can a developer select among the available software framework candidates to identify one that's optimum for a particular situation? In terms of training, for example, the decision often comes down to familiarity and personal preference. Substantial differences also exist in capabilities between the offerings, however, and these differences evolve over time; at some point, the advancements in an alternative framework may outweigh a developer's accumulated familiarity with an otherwise preferred one.

Unfortunately there’s no simple answer to the "which one's best" question. What does the developer care about most? Is it speed of training, efficiency of inference, the need to use a pre-trained network, ease of implementing custom capabilities in the framework, etc? For each of these criteria, differences among frameworks exist both in capabilities offered and in personal preference (choice of language, etc). With that all said, a good rule of thumb is to travel on well-worn paths. Find out what frameworks other people are already using in applications as similar as possible to yours. Then, when you inevitably run into problems, your odds of finding a documented solution are much better.

Addressing Additional Challenges

After evaluating the tradeoffs of various training frameworks and selecting one for your project, several other key design decisions must also be made in order to implement an efficient embedded deep learning solution. These include:

  • Finding or developing an appropriate training dataset
  • Selecting a suitable vision processor (or heterogeneous multi-processor) for the system
  • Designing an effective network model topology appropriate for the available compute resources
  • Implementing a run-time that optimizes any available hardware acceleration in the SoC

Of course, engineering teams must also overcome these development challenges within the constraints of time, cost, and available skill sets.

One common starting point for new developers involves the use of an example project associated with one of the training frameworks. In these tutorials, developers are typically guided through a series of DIY exercises to train a preconfigured CNN for one of the common image classification problems such as MNIST, CIFAR-10 or ImageNet. The result is a well-behaved neural net that operates predictably on a computer. Unfortunately, at this point the usefulness of the tutorials usually begins to diminish, since it’s then left as an "exercise for the reader" to figure out how to adapt and optimize these example datasets, network topologies and PC-class inference models to solve other vision challenges and ultimately deploy a working solution on an embedded device.

The deep learning aspect of such a project will typically comprise six distinct stages (Figure 2). The first four take place on a computer (for development), with the latter two located on the target (for deployment):

  1. Dataset creation, curation and augmentation
  2. Network design
  3. Network training
  4. Model validation and optimization
  5. Runtime inference accuracy and performance tuning
  6. Provisioning for manufacturing and field updates

Figure 2. The first five (of six total) stages of a typical deep learning project are frequently iterated multiple times in striving for an optimal implementation (courtesy Au-Zone Technologies).

Development teams may find themselves iterating steps 1-5 many times in searching for an optimal balance between network size, model accuracy and runtime inference performance on the processor(s) of choice. For developers considering deployment of deep learning vision solutions on standard SoCs, development tools such as Au-Zone Technologies' DeepView ML Toolkit and Run-Time Inference Engine are helpful in addressing the various challenges faced at each of these developmental stages (see sidebar "Leveraging GPU Acceleration for Deep Learning Development and Deployment") (Figure 3).

Figure 3. The DeepView Machine Learning Toolkit provides various facilities useful in addressing challenges faced in both deep learning development and deployment (courtesy Au-Zone Technologies).

Framework Optimizations for DSP Acceleration

In comparison to the abundant compute and memory resources available in a PC, an embedded vision system must offer performance sufficient for target applications, but at greatly reduced power consumption and die area. Embedded vision applications therefore greatly benefit from the availability of highly optimized heterogeneous SoCs containing multiple parallel processing units, each optimized for specific tasks. Synopsys' DesignWare EV6x family, for example, integrates a scalar unit for control, a vector unit for pixel processing, and an optional dedicated CNN engine for executing deep learning networks (Figure 4).

Figure 4. Modern SoCs, as well as the cores within them, contain multiple heterogeneous processing elements suitable for accelerating various aspects of deep learning algorithms (courtesy Synopsys).

Embedded vision system designers have much to consider when leveraging a software framework for training a CNN graph. They must pay attention to the bit resolution of the CNN calculations, consider all possible hardware optimizations during training, and evaluate how best to take advantage of available coefficient and feature map pruning and compression techniques. If silicon area (translating to SoC cost) isn’t a concern, an embedded vision processor might directly use the native 32-bit floating-point outputs of PC-tailored software frameworks. However, such complex data types demand large MACs (multiply-accumulator units), sizeable memory for storage, and high transfer bandwidth. All of these factors adversely affect the SoC and system power consumption and area budgets. The ideal goal, therefore, is to use the smallest possible bit resolution without adversely degrading the accuracy of the original trained CNN graph.

Based on careful analysis of popular graphs, Synopsys has determined that CNN calculations on common classification graphs currently deliver acceptable accuracy down to 10-bit integer precision in many cases (Figure 5). The EV6x vision processor's CNN engine therefore supports highly optimized 12-bit multiplication operations. Caffe framework-sourced graphs utilizing 32-bit floating-point outputs can, by using vendor-supplied conversion utilities, be mapped to the EV6x 12-bit CNN architecture without need for retraining and with little to no loss in accuracy. Such mapping tools convert the coefficients and graphs output by the software framework during initial training into formats recognized by the embedded vision system for deployment purposes. Automated capabilities like these are important when already-trained graphs are available and retraining is undesirable.

Figure 5. An analysis of CNNs on common classification graphs suggests that they retain high accuracy down to at least 10-bit calculation precision (courtesy Synopsys).
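
The float-to-integer mapping performed by such conversion utilities can be sketched as symmetric linear quantization. The following pure-Python illustration is a generic example, not Synopsys' actual tool; it maps floating-point coefficients onto signed 12-bit integers plus a scale factor and confirms that the reconstruction error stays below one quantization step:

```python
def quantize(weights, bits):
    """Symmetric linear quantization of float weights to signed
    `bits`-bit integers; returns (ints, scale) such that
    original value ~= integer value * scale."""
    qmax = 2 ** (bits - 1) - 1                    # 2047 for 12 bits
    scale = max(abs(w) for w in weights) / qmax   # one quantization step
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map the integers back to approximate float values."""
    return [v * scale for v in q]

weights = [0.731, -0.052, 0.004, -0.998, 0.25]    # sample coefficients
q12, step = quantize(weights, 12)
restored = dequantize(q12, step)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(worst < step)  # worst-case error is below one step -> True
```

With 12 bits the per-coefficient error is bounded by half a quantization step, which is why well-conditioned graphs can be mapped this way with little to no accuracy loss; at 8 bits and below the step grows large enough that retraining for the target precision becomes more important.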

Encouragingly, software framework developers are beginning to pay closer attention to the needs of not only PCs but also embedded systems. In the future, therefore, it will likely be possible to directly train (and retrain) graphs for specific integer bit resolutions; 8-bit and even lower-resolution multiplications will further save cost, power consumption and bandwidth.

Framework Optimizations for FPGA Acceleration

Heterogeneous SoCs that combine high performance processors and programmable logic are also finding increasing use in embedded vision systems (Figure 6). Such devices leverage programmable logic's highly parallel architecture in implementing high-performance image processing pipelines, with the processor subsystem managing high-level functions such as system monitoring, user interfaces and communications. The CPU-plus-FPGA combination delivers a flexible and responsive system solution.

Figure 6. GPUs (left) and FPGA fabric (right) are two common methods of accelerating portions of deep learning functions otherwise handled by a CPU (courtesy Xilinx).

To gain maximum benefit from such a heterogeneous SoC, the user needs to be able to leverage industry-standard frameworks such as Caffe for machine learning, as well as OpenVX and OpenCV for image processing. Effective development therefore requires a tool chain that not only supports these industry standards but also enables straightforward allocation (and dynamic reallocation) of functionality between the programmable logic and the processor subsystems. Such a system-optimizing compiler uses high-level synthesis (HLS) to create the logic implementation, along with a connectivity framework to integrate it with the processor. The compiler also supports development with high-level languages such as C, C++ and OpenCL.

Initial development involves implementing the algorithm solely targeting the processor. Once algorithm functionality is deemed acceptable, the next stage in the process is to identify performance bottlenecks via application profiling. Leveraging the system-optimizing compiler to migrate functions into the programmable logic is a means of relieving these bottlenecks, an approach which can also reduce power consumption.
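Profiling to locate such bottlenecks can be sketched with Python's built-in cProfile. The pipeline stages below are hypothetical stand-ins for real image-processing functions; in practice you would profile the actual application running on the processor subsystem.

```python
import cProfile
import io
import pstats

# Hypothetical pipeline stages standing in for real image-processing
# functions; the names and workloads are illustrative only.
def preprocess(frames):
    return [sum(f) for f in frames]

def detect(features):
    # Deliberately heavier stage: a candidate for programmable-logic offload.
    return [sum(x * x for x in range(2000)) + f for f in features]

def pipeline(frames):
    return detect(preprocess(frames))

profiler = cProfile.Profile()
profiler.enable()
pipeline([[1, 2, 3]] * 50)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)  # 'detect' should dominate cumulative time
```

A report like this identifies `detect` as the hot spot, making it the obvious first candidate for migration into the programmable logic.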

In order to effectively accomplish this migration, the system-optimizing compiler requires the availability of predefined implementation libraries suitable for HLS, image processing, and machine learning. Some toolchains refer to such libraries as middleware. In the case of machine learning within embedded vision applications, both predefined implementations supporting machine learning inference and the ability to accelerate OpenCV functions are required. Xilinx's reVISION stack, for example, provides developers with both Caffe integration capabilities and a range of acceleration-capable OpenCV functions (including the OpenVX core functions).

reVISION’s integration with Caffe for implementing machine learning inference engines is as straightforward as providing a prototxt file and the trained weights; Xilinx's toolset handles the rest of the process (Figure 7). The prototxt file is used to configure the C/C++ scheduler running on the SoC's processor subsystem, in combination with hardware-optimized libraries within the programmable logic that accelerate the neural network inference. Specifically, the programmable logic implements functions such as Conv, ReLU and Pooling. reVISION's integration with industry-standard embedded vision and machine learning frameworks and libraries gives development teams the benefits of programmable logic without the need to delve deep into the logic design.

Figure 7. Accelerating Caffe-based neural network inference in programmable logic is a straightforward process, thanks to reVISION stack toolset capabilities (courtesy Xilinx).

Deep Learning-based Application Software

Industry 4.0, an "umbrella" term for the highly connected, highly automated processes that increasingly define modern production enterprises, is one example of a mainstream computer vision application domain that has attracted widespread developer attention. Deep learning-based technologies, which enable autonomous and self-adaptive production systems, are becoming increasingly influential in Industry 4.0. While it's certainly possible to develop Industry 4.0 applications using foundation software frameworks, as discussed elsewhere in this article, a mature, high-volume market such as this one is also served by deep learning-based application software whose "out of box" attributes reduce complexity.

MVTec's machine vision software products such as HALCON are one example. The company's software solutions run both on standard PC-based hardware platforms and on ARM-based processor platforms, such as Android and iOS smartphones and tablets and industry-standard smart cameras. In general, they do not require customer-specific modifications or complex customization. Customers can therefore take advantage of deep learning without having any specialized expertise in underlying technologies, and the entire rapid-prototyping development, testing and evaluation process runs in the company's interactive HDevelop programming environment.

Optical character recognition (OCR) is one specific application of deep learning. In a basic office environment, OCR is used to recognize text in scanned paper documents, extracting and digitally reusing the content. However, industrial use scenarios impose much stricter demands on OCR applications. Such systems must be able to read letter and/or number combinations printed or stamped onto objects, for example. The corresponding piece parts and end products can then be reliably identified, classified and tracked. HALCON employs advanced functions and classification techniques that enable a wide range of characters to be accurately recognized even in challenging conditions, thus addressing the stringent requirements that a robust solution must meet in industrial environments.

In environments such as these, text not only needs to be identified without errors under varied lighting conditions and across a wide range of fonts, it must also be accurately recognized even when distorted by tilting or smudged by print defects. Furthermore, the text to be recognized may be blurry, printed onto or etched into reflective surfaces, or set against highly textured color backgrounds. With the help of deep learning technologies, OCR accuracy can be improved significantly. And by utilizing a standard software solution such as MVTec's HALCON, users are unburdened from the complex and expensive training process. After all, huge amounts of data are generated during training, and hundreds of thousands of images are required for each class, all of which have to be labeled.


Vision technology is enabling a wide range of products that are more intelligent and responsive than before, and thus more valuable to users. Vision processing can add valuable capabilities to existing products, and can provide significant new markets for hardware, software and semiconductor suppliers (see sidebar "Additional Developer Assistance"). Deep learning-based vision processing is an increasingly popular and robust alternative to classical computer vision algorithms; conversion, partitioning, evaluation and optimization toolsets enable efficient retargeting of originally PC-tailored deep learning software frameworks for embedded vision implementations. These frameworks will steadily become more inherently embedded-friendly in the future, and applications that incorporate deep learning techniques will continue to be an attractive alternative approach for vision developers in specific markets.

By Brian Dipert
Editor-in-Chief, Embedded Vision Alliance

Brad Scott
President, Au-Zone Technologies

Amit Shoham
Distinguished Engineer, BDTI

Johannes Hiltner
Product Manager for HALCON Embedded, MVTec Software GmbH

Gordon Cooper
Product Marketing Manager for Embedded Vision Processors, Synopsys

Giles Peckham
Regional Marketing Director, Xilinx

Sidebar: Leveraging GPU Acceleration for Deep Learning Development and Deployment

The design example that follows leverages the GPU core in a SoC to accelerate portions of deep learning algorithms which otherwise run on the CPU core. The approach discussed is an increasingly common one for heterogeneous computing in embedded vision, given the prevalence of robust graphics subsystems in modern application processors. These same techniques and methods also apply to the custom hardware acceleration blocks available in many modern SoCs.

Dataset Creation, Curation and Augmentation

A fundamental requirement for deep learning implementations is to source or generate two independent datasets: one suitable for network training, and the other to evaluate the effectiveness of the training. To ensure that the trained model is accurate, efficient and robust, the training dataset must be of significant size; it's often on the order of hundreds of thousands of labeled and grouped samples. One widely known public dataset, ImageNet, encompasses 1.5 million images across 1,000 discrete categories or classes, for example.

Creating large datasets is a time-consuming and error-prone exercise. Avoid these common pitfalls in order to ensure efficient and accurate training:

  1. Avoid incorrect annotation labels. This goal is much harder to achieve than might seem to be the case at first glance, due to inevitable human interaction with the high volume of data. It's unfortunately quite common to find errors in public datasets. Using advanced visualization and automated inspection tools greatly helps in improving dataset quality (Figure A).
  2. Make sure that the dataset represents the true diversity of expected inputs. For example, imagine training a neural network to classify images of electronic components on a circuit board. If you’ve trained it only with images of components on green circuit boards, it may fail when presented with an image of a component on a brown circuit board. Similarly, if all images of diodes in the training set happen to also have a capacitor partly visible at the edge of the image, the network may inadvertently learn to associate the capacitor with diodes, and fail to classify a diode when a capacitor is not also visible.
  3. In many cases, it makes sense to generate image samples from video as a means of quickly populating datasets. However, in doing so you must take great care to avoid reusing annotations from a common video sequence for both the training and testing databases. Such a mistake could lead to high training scores that can't be replicated by real-life implementations.
  4. Dataset creation should be an iterative process. You can greatly improve the trained model by inspecting the error distribution and optimizing the training dataset if you find that certain classes are underrepresented or misclassified. Keeping dataset creation in the development loop allows for a better overall solution.
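Pitfall 3 above, frames from one video sequence leaking into both the training and testing sets, can be avoided by splitting at the sequence level rather than the frame level. A minimal sketch, assuming a simplified `(sequence_id, frame)` sample format:

```python
import random

def split_by_sequence(samples, test_fraction=0.2, seed=0):
    """Split frame samples into train/test sets, keeping all frames
    from any one video sequence on the same side of the split.

    `samples` is a list of (sequence_id, frame) tuples -- a hypothetical
    format; real datasets would carry richer annotations.
    """
    sequences = sorted({seq for seq, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(sequences)
    n_test = max(1, int(len(sequences) * test_fraction))
    test_ids = set(sequences[:n_test])
    train = [s for s in samples if s[0] not in test_ids]
    test = [s for s in samples if s[0] in test_ids]
    return train, test

# 10 hypothetical video sequences, 30 frames each
samples = [(f"video{v}", f"frame{i}") for v in range(10) for i in range(30)]
train, test = split_by_sequence(samples)
print(len(train), len(test))  # no sequence appears in both sets
```

Splitting on sequence identifiers guarantees that near-duplicate frames from the same clip never inflate the test score.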

Figure A. DeepView's Dataset Curator Workspace enables visual inspection to ensure robustness without redundancy (courtesy Au-Zone Technologies).

For image classification implementations, in addition to supplying a sufficient number of samples, you should ensure that the dataset accurately represents information as captured by the end hardware platform in the field. As such, you need to comprehend the inevitable noise and other sources of variability and error that will be introduced into the data stream when devices are deployed into the real world. Randomly introducing image augmentation into the training sample set is one possible technique for increasing training data volume while improving the network's robustness, i.e. ensuring that the network is trained effectively and efficiently (Figure B).

Figure B. Random image augmentation can enhance not only the training sample set size but also its effectiveness (courtesy Au-Zone Technologies).

The types of augmentation techniques employed, along with the range of parameters used, both require adaptation for each application. Operations that make sense for solving some problems may degrade results in others. One simple example of this divergence involves horizontally flipping images; doing so might improve training for vehicle classification, but it wouldn’t make sense for traffic sign classification where numbers would then be incorrectly reversed in training.

Datasets are often created with images that tend to be very uniformly cropped, and with the objects of interest neatly centered. Images in real-world applications, on the other hand, may not be captured in such an ideal way, resulting in much greater variance in the position of objects. Adding randomized cropping augmentation can help the neural network generalize to the varied real-world conditions that it will encounter in the deployed application.
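A minimal NumPy sketch of the random-crop and optional-flip augmentations described above; the crop size and flip probability are illustrative choices:

```python
import numpy as np

def augment(image, rng, crop=24, allow_flip=True):
    """Randomly crop (and optionally horizontally flip) an HxWxC image.

    Horizontal flipping is appropriate for, e.g., vehicle classification
    but should be disabled for text or traffic-sign data, as noted above.
    """
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    out = image[top:top + crop, left:left + crop]
    if allow_flip and rng.random() < 0.5:
        out = out[:, ::-1]
    return out

rng = np.random.default_rng(42)
img = np.arange(32 * 32 * 3, dtype=np.uint8).reshape(32, 32, 3)
patch = augment(img, rng)
print(patch.shape)  # (24, 24, 3)
```

Applying a fresh random crop on every epoch effectively multiplies the dataset size while teaching the network tolerance to object position.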

Network Design

Decades of research in the field of artificial neural networks have resulted in many different network classes, each with a variety of implementations (and variations of each) optimized for a diverse range of applications and performance objectives. Most of these neural networks have been developed with the particular objective of improving inference accuracy, and they have typically made the assumption that run time inference will be performed on a server- or desktop-class computer. Depending on the classification and/or other problem(s) you need to solve for your embedded vision project, exploring and selecting among potential network topologies (or alternatively designing your own) can therefore be a time consuming and otherwise challenging exercise.

Understanding which of these networks provides the "best" result within the constraints of the compute, dataflow and memory footprint resources available on your target adds a whole new dimension to the problem. Determining how to "zero in" on an appropriate class of network, followed by a specific topology within that class, can rapidly become a time-consuming endeavor, especially if you're attempting to do anything other than solve a "conventional" deep learning image classification problem. Even when using a standard network for conventional image classification, many system considerations bear attention:

  • The image resolution of the input layer can significantly impact the network design
  • Will your input be single-channel (monochromatic) or multi-channel (RGB, YUV)? This is a particularly important consideration if you’re going to attempt transfer learning (to be further discussed shortly), since you’ll start with a network that was either pre-trained with color or monochrome data, and there’s no simple way to convert that pre-trained network from one format to another. On the other hand, if you’re going to train entirely from scratch, it’s relatively easy to modify a network topology to use a different number of input channels, so you can just take your network of choice and apply it to your application’s image format.
  • Ensure that the dataset format matches what you’ll be using on your target
  • Are pre-trained models compatible with your dataset, and is transfer learning an option?

When developing application-specific CNNs intended for deployment on embedded hardware platforms, it’s often once again very challenging to know where to begin. Leveraging popular network topologies such as ResNet and Inception will often lead to very accurate results in training and validation, but will often also require the compute resources of a small server to obtain reasonable inference times. As with any design optimization problem, knowing roughly where to begin, obtaining direct feedback on key performance indicators during the design process, and profiling on target hardware to enable rapid design iterations are all key factors to quickly converging on a deployable solution.

When designing a network to suit your specific product requirements, some of the key network design parameters that you will need to evaluate include:

  • Overall accuracy when validated both with test data and live data
  • Model size: number of layers, weights, bits/weight, MACs/image, total memory footprint/image, etc.
  • The distribution of inference compute time across network layers (scene graph nodes)
  • On-target inference time
  • Various other optimization opportunities
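Several of these parameters can be estimated before training begins. The sketch below is a back-of-envelope model for the weight and MAC counts of a single convolutional layer; it ignores biases, stride and padding details:

```python
def conv_layer_cost(in_ch, out_ch, kernel, out_h, out_w):
    """Rough weight and MAC counts for one conv layer (no bias terms).

    A back-of-envelope model for comparing candidate topologies; real
    costs also depend on padding, stride and data layout.
    """
    weights = in_ch * out_ch * kernel * kernel
    macs = weights * out_h * out_w
    return weights, macs

# e.g. a 3-channel input mapped to 64 channels with a 3x3 kernel,
# producing a 224x224 output feature map
w, m = conv_layer_cost(3, 64, 3, 224, 224)
print(w, m)  # 1728 weights, ~86.7 million MACs per image
```

Summing such estimates over all layers gives an early indication of whether a candidate topology fits the target's compute and memory budget.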

The Network Designer in the DeepView ML Toolkit allows users to select and adapt from preexisting templates for common network topologies, as well as to quickly create and explore new network design topologies (Figure C). With more than 15 node types supported, the tool enables quick and easy creation and configuration of scene graph representations of deep neural networks for training.

Figure C. The DeepView Network Design Workspace supports both customization of predefined network topologies and the creation of new scene graphs (courtesy Au-Zone Technologies).

Network Training

Training a network can be a tedious and repetitive process, with iteration necessary each time the network architecture is modified or the dataset is altered. The time required to train a model is directly impacted by the complexity of the network and the dataset size, and typically ranges from a few minutes to multiple days. Monitoring the loss value and a graph of the accuracy at each epoch helps developers to visualize the training session's efficiency.

Obtaining a training trend early in the process allows developers to save time by aborting sessions that are not training properly (Figure D). Archiving training graphs from different sessions is also a great way to analyze the impact of dataset alterations, network modifications and training parameter adjustments.

Figure D. Visually monitoring accuracy helps developers easily assess any particular training session's effectiveness (courtesy Au-Zone Technologies).
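The early-abort heuristic described above can be sketched as a simple check on the recent loss history; the patience and threshold values are illustrative:

```python
def should_abort(loss_history, patience=3, min_delta=1e-3):
    """Return True if the loss hasn't improved by at least `min_delta`
    over the last `patience` epochs -- a simple early-abort heuristic
    of the kind described above (thresholds are illustrative)."""
    if len(loss_history) <= patience:
        return False
    best_before = min(loss_history[:-patience])
    recent_best = min(loss_history[-patience:])
    return recent_best > best_before - min_delta

print(should_abort([1.0, 0.8, 0.6, 0.5]))            # still improving: False
print(should_abort([1.0, 0.9, 0.9, 0.9, 0.9, 0.9]))  # stalled: True
```

A check like this, run at the end of each epoch, frees compute resources for the next design iteration instead of letting a stalled session run to completion.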

Transfer learning is a powerful method for optimizing network training. It's conceptually similar to the problem a developer would normally have with a dataset that's too small to properly train a rich set of parameters. By using transfer learning, you're leveraging an existing network trained on a similar problem to solve a new problem. For example, you can leverage a network trained on the very general (and very large) ImageNet dataset to specifically classify types of furniture with much less training time and effort than would otherwise be needed.

By importing a model already trained on a large dataset and freezing its earlier layers, a developer can then re-train the later network layers against a significantly smaller dataset, targeting the specific problem to be solved. Note, however, that such early-layer freezing isn't always the optimum approach; in some applications you might obtain better results by allowing the earlier network to learn the features of the new application.

And while dataset size reduction is one key advantage of transfer learning, another critical advantage is the potential reduction in training time. When training a network "from scratch," it can take a very long time to converge on a set of weights that delivers high accuracy. Via transfer learning, in summary, you can (depending on the application) use a smaller dataset, train for fewer iterations, and/or reduce training time by training only the last few layers of the network.
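The freeze-and-retrain idea can be illustrated with a toy NumPy model: a frozen random projection stands in for the pretrained early layers, and only the final logistic layer is trained. The data and labels below are synthesized purely for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Pretrained" feature extractor: a frozen random projection standing in
# for the early layers of a network trained on a large dataset.
W_frozen = rng.standard_normal((8, 4))

def features(x):
    f = np.maximum(x @ W_frozen, 0.0)            # frozen ReLU layer
    return np.hstack([f, np.ones((len(f), 1))])  # plus a bias input

# Small task-specific dataset (labels synthesized for the sketch).
X = rng.standard_normal((64, 8))
w_true = rng.standard_normal(5)
y = (features(X) @ w_true > 0).astype(float)

# Re-train only the last layer; W_frozen is never updated.
w = np.zeros(5)
for _ in range(500):
    f = features(X)
    p = 1.0 / (1.0 + np.exp(-(f @ w)))
    w -= 0.5 * f.T @ (p - y) / len(y)   # gradient step on logistic loss

pred = (1.0 / (1.0 + np.exp(-(features(X) @ w))) > 0.5).astype(float)
print(np.mean(pred == y))  # training accuracy approaches 1.0
```

Because only the five last-layer weights are updated, the small dataset suffices and each training iteration is cheap, mirroring the advantages described above.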

Model Validation and Optimization

Obtaining sufficiently high accuracy on the testing dataset is a leading indicator of the network performance for a given application. However, limiting the result analysis to global score monitoring isn’t sufficient. In-depth analysis of the results is essential to understand how the network currently behaves and how to make it perform better.

Building a confusion matrix is a solid starting point for visualizing the error distribution among classes (Figure E). Filtering validation results is also an effective way to investigate the dataset entries that perform poorly, as well as to understand error validity and identify pathways to resolution.

Figure E. Graphically analyzing the error distribution among classes, along with filtering validation results, enables evaluation of dataset entries that perform poorly (courtesy Au-Zone Technologies).

Many applications can also benefit from hierarchically ordering the classification labels used for analyzing the groups' accuracy. A distracted driving application containing 1 safe class and 9 unsafe classes, for example, could have mediocre overall classification accuracy but still be considered sufficient if the "safe" versus "unsafe" differentiation performs well.
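Both analyses above can be sketched in a few lines: a confusion matrix over the fine-grained classes, plus an accuracy computed after mapping those classes onto coarse groups. The distracted-driving labels below are hypothetical.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Count matrix with true classes as rows, predicted as columns."""
    m = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        m[t, p] += 1
    return m

def grouped_accuracy(y_true, y_pred, group):
    """Accuracy after mapping fine-grained labels onto coarse groups,
    e.g. class 0 = 'safe', classes 1..9 = 'unsafe'."""
    g = np.asarray(group)
    return np.mean(g[np.asarray(y_true)] == g[np.asarray(y_pred)])

# Hypothetical distracted-driving results: 10 classes, class 0 is 'safe'.
y_true = [0, 0, 1, 2, 3, 4, 5, 9]
y_pred = [0, 0, 2, 1, 3, 5, 4, 9]        # several unsafe<->unsafe mix-ups
group = [0] + [1] * 9                     # 0 = safe, 1 = unsafe

cm = confusion_matrix(y_true, y_pred, 10)
print(np.trace(cm) / cm.sum())                   # fine-grained accuracy: 0.5
print(grouped_accuracy(y_true, y_pred, group))   # safe/unsafe accuracy: 1.0
```

Here the per-class accuracy looks mediocre, yet the safe-versus-unsafe distinction is perfect, exactly the situation the hierarchical grouping is meant to reveal.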

Runtime Inference Accuracy and Performance Tuning

As the design and training activities begin to converge to acceptable levels in the development environment, target runtime inference optimization next requires consideration. The deep learning training frameworks discussed in the main article provide a key aspect of the overall solution, but leave the problem of implementing and optimizing the runtime to the developer. While general-purpose runtime implementations exist, they frequently do a subpar job of addressing several important aspects of deployment:

  1. Independence between the network model and runtime inference engine
    Separation of these two items enables independent optimization of the engine for each supported processor architecture option. Compute elements within each unique SoC, such as the GPU, vision processor, memory interfaces and other proprietary IP, can be fully exploited without concern for the model that will be deployed on them.
  2. The ability to accommodate NNEF-based models
    Such a capability allows for models created with frameworks not directly supported by tools such as DeepView to be alternatively imported using an industry-standard exchange format.
  3. Support for multiple, preloaded instantiations
    Enabling multiple networks on a single device via fast context switching is desirable when one device must perform multiple deep learning tasks but lacks the capacity to perform them concurrently.
  4. Portability between devices
    Support for any OpenCL 1.2-capable device enables the DeepView Inference Engine (for example) to be highly portable, easily integrated into both existing and new runtime environments with minimal effort. Such flexibility enables straightforward device benchmarking and comparison during the hardware-vetting process.
  5. Development tool integration
    The ability to quickly and easily profile runtime performance, validate accuracy, visualize results and return to network design for refinement becomes extremely helpful when iterating on final design details.

In applications where speed and/or power consumption are critical, optimization considerations for these parameters should be comprehended in the network design and training early in the process. Once you have a dataset and an initial network design that trains with reasonable accuracy, you can then explore tradeoffs in accuracy vs. # of MACs, weights, types of activation layers used, etc., tuning these parameters for the target architecture.

Provisioning For Manufacturing and Field Updates

When deploying working firmware to the field, numerous steps require consideration in order to ensure integrity and security at the end devices. Neural network model updates present additional challenges to both the developer and system OEM. Depending on the topology of the network required for your application, for example, trained models range from hundreds of thousands to many millions of parameters. When represented as half-floats, these models typically range from tens to hundreds of MBytes in size. And if the device needs to support multiple networks for different use cases or modes, the required model footprint further expands.

For all but the most trivial network examples, therefore, managing over-the-air updates quickly becomes unwieldy, time-consuming and costly, especially in the absence of a compression strategy. Standard techniques for managing embedded system firmware and binary image updates also don’t work well with network models, for three primary reasons:

  1. When models are updated, it’s "all or nothing". No package update equivalent currently exists to enable replacement of only a single layer or node in the network.
  2. All but the most trivial model re-training results in an incremental differential file that is equivalent in size to the original file.
  3. Lossless compression provides very little benefit for typical neural network models, given the highly random nature of the source data.

Fortunately, neural networks are relatively tolerant of noise, so lossy compression techniques can provide significant advantages. Figure F demonstrates the impact that lossy compression has on inference accuracy for four different CNNs implemented using DeepView. Compression ratios greater than 80% are easily achievable for most models, with minimal degradation in accuracy. And with further adjustment to the network topology and parameter representations, compression ratios exceeding 90% are realistically achievable for practical, real-world network models.


Figure F. Deep learning models can retain high accuracy even at high degrees of lossy compression (courtesy Au-Zone Technologies).
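The effect described above is easy to demonstrate: lossless compression barely helps random-looking float32 weights, while even a crude lossy scheme (quantizing to 256 shared levels before entropy coding) yields large savings. The stand-in weight values below are random; real trained models would behave somewhat differently.

```python
import zlib
import numpy as np

rng = np.random.default_rng(1)
weights = rng.standard_normal(100_000).astype(np.float32)  # stand-in model
raw = weights.tobytes()

# Lossless compression of the raw float bytes buys very little...
lossless_ratio = 1 - len(zlib.compress(raw)) / len(raw)

# ...but quantizing to 256 shared levels (a simple lossy scheme) makes
# the byte stream far smaller and highly compressible.
lo, hi = weights.min(), weights.max()
codes = np.round((weights - lo) / (hi - lo) * 255).astype(np.uint8)
lossy_ratio = 1 - len(zlib.compress(codes.tobytes())) / len(raw)

print(f"lossless: {lossless_ratio:.0%}, lossy + entropy-coded: {lossy_ratio:.0%}")
```

The lossy path easily clears the 80% mark mentioned above, at the cost of a bounded per-weight reconstruction error that the network's noise tolerance typically absorbs.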

A trained network model requires a significant investment in engineering time, encompassing the effort invested in assembling a dataset, designing the neural network, and training and validating it. When developing a model for a commercial product, protecting the model on the target and ensuring its authenticity are critical requirements. DeepView, for example, has addressed these concerns by providing a fully integrated certificate management system. The toolkit provides both graphical and command line interface options, along with both C- and Python-based APIs, for integration with 3rd-party infrastructure. Such a system ensures model authenticity as well as security from IP theft attempts.

Sidebar: Additional Developer Assistance

The Embedded Vision Alliance, a worldwide organization of technology developers and providers, is working to empower product creators to transform the potential of vision processing into reality. Au-Zone Technologies, BDTI, MVTec, Synopsys and Xilinx, the co-authors of this article, are members of the Embedded Vision Alliance. The Embedded Vision Alliance's mission is to provide product creators with practical education, information and insights to help them incorporate vision capabilities into new and existing products. To execute this mission, the Embedded Vision Alliance maintains a website providing tutorial articles, videos, code downloads and a discussion forum staffed by technology experts. Registered website users can also receive the Embedded Vision Alliance’s twice-monthly email newsletter, Embedded Vision Insights, among other benefits.

The Embedded Vision Alliance also offers a free online training facility for vision-based product creators: the Embedded Vision Academy. This area of the Embedded Vision Alliance website provides in-depth technical training and other resources to help product creators integrate visual intelligence into next-generation software and systems. Course material in the Embedded Vision Academy spans a wide range of vision-related subjects, from basic vision algorithms to image pre-processing, image sensor interfaces, and software development techniques and tools such as OpenCL, OpenVX and OpenCV, along with Caffe, TensorFlow and other deep learning frameworks. Access is free to all through a simple registration process.

The Embedded Vision Alliance and its member companies periodically deliver webinars on a variety of technical topics. Access to on-demand archive webinars, along with information about upcoming live webinars, is available on the Alliance website. Also, the Embedded Vision Alliance has begun offering "Deep Learning for Computer Vision with TensorFlow," a full-day technical training class planned for a variety of both U.S. and international locations. See the Alliance website for additional information and online registration.

The Embedded Vision Alliance’s annual technical conference and trade show, the Embedded Vision Summit, is intended for product creators interested in incorporating visual intelligence into electronic systems and software. The Embedded Vision Summit provides how-to presentations, inspiring keynote talks, demonstrations, and opportunities to interact with technical experts from Embedded Vision Alliance member companies. The Embedded Vision Summit is intended to inspire attendees' imaginations about the potential applications for practical computer vision technology through exciting presentations and demonstrations, to offer practical know-how for attendees to help them incorporate vision capabilities into their hardware and software products, and to provide opportunities for attendees to meet and talk with leading vision technology companies and learn about their offerings.

The most recent Embedded Vision Summit took place in Santa Clara, California on May 1-3, 2017; a slide set, along with both demonstration and presentation videos from the event, is now being published on the Alliance website. The next Embedded Vision Summit is scheduled for May 22-24, 2018, again in Santa Clara, California; mark your calendars and plan to attend.