Embedded Vision Alliance: Technical Articles


Harnessing the Power of AI: An Easy Start with Lattice’s sensAI


This article was originally published at Lattice Semiconductor's website. It is reprinted here with the permission of Lattice Semiconductor.

Artificial intelligence, or AI, is everywhere. It’s a revolutionary technology that is slowly pervading more industries than you can imagine. It seems that every company, no matter what their business, needs to have some kind of AI story. In particular, you see AI seriously pursued for applications like self-driving automobiles, the Internet of Things (IoT), network security, and medicine. Company visionaries are expected to have a good understanding of how AI can be applied to their businesses, and success by early adopters will force holdouts into the fray.

Not all AI is the same, however, and different application categories require different AI approaches. The application class that appears to have gotten the most traction so far is embedded vision. AI for this category makes use of so-called convolutional neural networks, or CNNs, which attempt to mimic the way that the biological eye is believed to operate. We will focus on vision in this AI whitepaper, even though many of the concepts will apply to other applications as well.

AI Edge Requirements

AI involves the creation of a trained model of how something works. That model is then used to make inferences about the real world when deployed in an application. This gives an AI application two major life phases: training and inference.

Training is done during development, typically in the cloud. Inference, on the other hand, is required by deployed devices as an ongoing activity. Because inference can also be a computationally difficult problem, much of it is currently done in the cloud. But there is often little time to make decisions. Sending data to the cloud and then waiting until a decision arrives back can take time – and by then, it may be too late. Making that decision locally can save precious seconds.

This need for real-time control applies to many application areas where decisions are needed quickly, from devices that detect human presence to other always-on applications.

Because of this need for quick decisions, there is a strong move underway to take inference out of the cloud and implement it at the “edge” – that is, in the devices that gather data and then take action based on the AI decisions. This takes the delays inherent in the cloud out of the picture.

There are two other benefits to local inference. The first is privacy. Data enroute to and from the cloud, and data stored up in the cloud, is subject to hacking and theft. If the data never leaves the equipment, then there is far less opportunity for mischief.

The other benefit relates to the bandwidth available in the internet. Sending video up to the cloud for real-time interpretation chews up an enormous amount of bandwidth. Making the decisions locally frees that bandwidth up for all of the other demanding uses.

In addition:

  • Many such devices are powered by a battery – or, if they are mains-supplied, have heat constraints that limit how much power is sustainable. In the cloud, it’s the facility’s responsibility to manage power and cooling.
  • AI models are evolving rapidly. Between the beginning and end of training, the size of the model may change dramatically, and the size of the required computing platform may not be well understood until well into the development process. In addition, small changes to the training can have a significant impact on the model, adding yet more variability. All of this makes it a challenge to size the hardware in the edge device appropriately.
  • There will always be tradeoffs during the process of optimizing the models for your specific device. That means that a model might operate differently in different pieces of equipment.
  • Finally, edge devices are often very small. This limits the size of any devices used for AI inference.

All of this leads to the following important requirements for inference at the edge. Engines performing AI inference at the edge must:

  • Consume very little power
  • Be very flexible
  • Be very scalable
  • Have a small physical footprint

Lattice’s sensAI offering lets you develop engines with precisely these four characteristics. It does so by including a hardware platform, soft IP, a neural-net compiler, development modules, and resources that will help get the design right quickly.

Inference Engine Options

There are two aspects to building an inference engine into an edge device: developing the hardware platform that will host the execution of the model, and developing the model itself.

Execution of a model can, in theory, take place on many different architectures. But execution at the edge, taking into account the power, flexibility, and scalability requirements above, limits the choices – particularly for always-on applications.

MCUs

The most common way of handling AI models is by using a processor. That may be a GPU or a DSP, or it may be a microcontroller. But the processors in edge devices may not be up to the challenge of executing even simple models; such a device may have only a low-end microcontroller (MCU) available. Using a larger processor may violate the power and cost requirements of the device, so it might seem like AI would be out of reach for such devices.

This is where low-power FPGAs can play an important role. Rather than beefing up a processor to handle the algorithms, a Lattice ECP5 or UltraPlus FPGA can act as a co-processor to the MCU, providing the heavy lifting that the MCU can’t handle while keeping power within the required range. Because these Lattice FPGAs can implement DSPs, they provide computing power not available in a low-end MCU.


Figure 1. FPGA as a Co-Processor to MCU

ASICs and ASSPs

For AI models that are more mature and will sell in high volumes, ASICs or application-specific standard products (ASSPs) may be appropriate. But, because of their activity load, they will consume too much power for an always-on application.

Here Lattice FPGAs can act as activity gates, handling wake-up activities involving wake words or recognition of some broad class of video image (like identifying something that looks like it might be a person) before waking up the ASIC or ASSP to complete the task of identifying more speech or confirming with high confidence that an artifact in a video is indeed a person (or even a specific person).

The FPGA handles the always-on part, where power is most critical. While not all FPGAs can handle this role, since many of them still consume too much power, Lattice’s ECP5 and UltraPlus FPGAs have the power characteristics necessary for this role.


Figure 2. FPGA as activity gate to ASIC/ASSP

Stand-Alone FPGA AI Engines

Finally, low-power FPGAs can act as stand-alone, integrated AI engines. The DSPs available in the FPGAs take the starring role here. Even if an edge device has no other computing resources, AI capabilities can be added without breaking the power, cost, or board-area budgets. And they have the flexibility and scalability necessary for rapidly evolving algorithms.


Figure 3. Stand-alone, Integrated FPGA Solution

Building an Inference Engine in a Lattice FPGA

Designing hardware that will execute an AI inference model is an exercise in balancing the number of resources needed against performance and power requirements. Lattice’s ECP5 and UltraPlus families provide this balance.

The ECP5 family has three members of differing sizes that can host from one to eight inference engines. They contain anywhere from 1 Mb to 3.7 Mb of local memory, consume up to 1 W of power, and have a 100 mm² footprint.

The UltraPlus family, by contrast, has power levels as low as one thousandth that of the ECP5 family, at 1 mW. Consuming a mere 5.5 mm² of board area, it contains up to eight multipliers and up to 1 Mb of local memory.

Lattice also provides CNN IP designed to operate efficiently on these devices. For the ECP5 family, Lattice has a CNN Accelerator.


Figure 4. CNN Accelerator for the ECP5 family

For the UltraPlus family, Lattice provides a CNN Compact Accelerator.


Figure 5. Compact CNN Accelerator for the UltraPlus family

We won’t dive into the details here; the main point is that you don’t have to design your own engine from scratch. Much more information is available from Lattice regarding these pieces of IP.

Finally, you can run examples like this and test them out on development modules, with one for each device family. The Himax HM01B0 UPduino shield uses an UltraPlus device and occupies 22 x 50 mm of board space. The Embedded Vision Development Kit uses an ECP5 device and occupies 80 x 80 mm.


Figure 6. Development modules for evaluation of AI application

Given an FPGA, soft IP, and all of the other hardware details needed to move data around, the platform can be compiled using Lattice’s Diamond design tools in order to generate the bitstream that will configure the FPGAs at each power-up in the targeted equipment.

Building the Inference Model in a Lattice FPGA

Creating an inference model is very different from creating the underlying execution platform. It’s more abstract and mathematical, involving no RTL design. There are two main steps: creating the abstract model and then optimizing the model implementation for your chosen platform.

Model training takes place on any of several frameworks designed specifically for this process. The two best-known frameworks are Caffe and TensorFlow, but there are others as well.

A CNN consists of a series of layers – convolution layers, along with possible pooling and fully connected layers – each of which has nodes that are fed by the result of the prior layer. Each of those results is weighted at each node, and it is the training process that decides what the weights should be.
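
As a minimal illustration of what a single convolution layer does, the NumPy sketch below computes each output value as a weighted sum of a small neighborhood of the input, with the learned kernel supplying the weights. This is a generic teaching example, not Lattice's IP.

```python
import numpy as np

def conv2d(image, kernel):
    """Minimal 2D convolution: each output value is a weighted sum of its
    input neighborhood, with the weights ("kernel") learned during training."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1), dtype=np.float32)
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return np.maximum(out, 0.0)   # ReLU activation feeding the next layer

# A 3x3 edge-like kernel applied to a toy 8x8 "image".
image = np.random.default_rng(1).random((8, 8)).astype(np.float32)
kernel = np.array([[-1, -1, -1], [0, 0, 0], [1, 1, 1]], dtype=np.float32)
print(conv2d(image, kernel).shape)   # (6, 6) feature map
```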

The weights output by the training frameworks are typically floating-point numbers. This is the most precise embodiment of the weights – and yet most edge devices aren’t equipped with floating-point capabilities. This is where we need to take this abstract model and optimize it for a specific platform – a job handled by Lattice’s Neural Network Compiler.

The Compiler allows you to load and review the original model as downloaded from one of the CNN frameworks. You can run performance analysis, which is important for what is likely the most critical aspect of model optimization: quantization.

Because we can’t deal with floating-point numbers, we have to convert them to integers. That means that we will lose some accuracy simply by virtue of rounding off floating-point numbers. The question is, what integer precision is needed to achieve the accuracy you want? 16 bits is usually the highest precision used, but weights – and inputs – may be expressed as smaller integers. Lattice currently supports 16-, 8-, and 1-bit implementations; 1-bit designs are actually trained in the single-bit integer domain to maintain accuracy. Clearly, smaller data units mean higher performance, smaller hardware, and, critically, lower power. But make the precision too low, and you won’t have the accuracy required to faithfully infer the objects in a field of view.
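
A minimal sketch of the idea behind quantization follows: generic post-training quantization with a single symmetric scale per tensor. Lattice's Neural Network Compiler uses its own, more sophisticated flow; this only illustrates the precision-versus-error tradeoff.

```python
import numpy as np

def quantize_symmetric(weights, num_bits=8):
    """Illustrative post-training quantization with one symmetric scale per
    tensor. (1-bit designs are handled differently: as the text notes, they
    are trained directly in the single-bit domain.)"""
    qmax = 2 ** (num_bits - 1) - 1              # e.g. 127 for 8-bit
    scale = np.max(np.abs(weights)) / qmax      # map the largest weight to qmax
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Compare the rounding error introduced at different precisions.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=1000).astype(np.float32)
for bits in (16, 8, 4):
    q, s = quantize_symmetric(w, bits)
    err = np.mean(np.abs(w - dequantize(q, s)))
    print(f"{bits:2d}-bit quantization, mean absolute error: {err:.6f}")
```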


Figure 7. A single model can be optimized differently for different equipment

So the neural-network compiler lets you create an instruction stream that represents the model, and those instructions can then be simulated or outright tested to judge whether the right balance has been struck between performance, power, and accuracy. This is usually measured by the percentage of images that were correctly processed out of a set of test images (different from the training images).

Improved operation can often be obtained by optimizing a model, including pruning of some nodes to reduce resource consumption, and then retraining the model in the abstract again. This is a design loop that allows you to fine-tune the accuracy while operating within constrained resources.

Two Detection Examples

We can see how the tradeoffs play out with two different vision examples. The first is a face-detection application; the second is a human-presence-detection application. We can view how the differences in the resources available in the different FPGAs affects the performance and power of the resulting implementations.

Both of these examples take their inputs from a camera, and they both execute on the same underlying engine architecture. For the UltraPlus implementation, the camera image is downsized and then processed through eight multipliers, leveraging internal storage and using LEDs as indicators.


Figure 8. UltraPlus platform for face-detection and human-presence applications

The ECP5 family has more resources, and so it provides a platform with more computing power. Here the camera image is pre-processed in an image signal processor (ISP) before being sent into the CNN. The results are combined with the original image in an overlay engine that allows text or annotations to be overlaid on the original image.


Figure 9. ECP5 platform for face-detection and human-presence applications

We can use a series of charts to measure the performance, power, and area of each implementation of the applications. We also do two implementations of each application: one with fewer inputs and one with more inputs.

For the face-detection application, we can see the results in Figure 10. Here the two implementations use 32 x 32 inputs for the simple version and 90 x 90 inputs for the more complex one.


Figure 10. Performance, power, and area results for simple and complex implementations of the face-detection application in UltraPlus and ECP5 FPGAs

The left-hand axis shows the number of cycles required to process an image and how those cycles are spent. The right-hand axis shows the resulting frames-per-second (fps) performance for each implementation (the green line). Finally, each implementation shows the power and area.

The orange bars in the 32 x 32 example on the left represent the cycles spent on convolution. The UltraPlus has the fewest multipliers of the four examples; the other three are ECP5 devices with successively more multipliers. As the number of multipliers increases, the number of cycles required for convolution decreases.

The 90 x 90 example is on the right, and the results are quite different. There is a significant new blue contribution to the cycles on the bottom of each stack. This is the result of the more complex design using more memory than is available internally in the devices. As a result, they have to go out to DRAM, which hurts performance. Note also that this version cannot be implemented in the smaller UltraPlus device.

A similar situation holds for the human-presence application. Here the simple version uses 64 x 64 inputs, while the complex version works with 128 x 128 inputs.


Figure 11. Performance, power, and area results for simple and complex implementations of the human-presence application in UltraPlus and ECP5 FPGAs

Again, more multipliers reduce the convolution burden, and relying on DRAM hurts performance.

The performance for all versions is summarized in Table 1. This includes a measure of the smallest identifiable object or feature in an image, expressed as a percent of the full field of view. Using more inputs helps here, providing additional resolution for smaller objects.


Table 1. Performance summary of the two example applications in four different FPGAs

Summary

In summary, then, edge-inference AI designs that demand low power, flexibility, and scalability can be readily implemented in Lattice FPGAs using the resources provided by the Lattice sensAI offering. It makes available the critical elements needed for successful deployment of AI algorithms:

  • Neural network compiler
  • Neural engine soft IP
  • Diamond design tools
  • Development boards
  • Reference designs

Much more information is available from Lattice; go to www.latticesemi.com to start using the power of AI in your designs.


Combining an ISP and Vision Processor to Implement Computer Vision


An ISP (image signal processor) working in combination with one or more vision processors can deliver more robust computer vision capabilities than a vision processor can provide on its own. However, an ISP operating in a computer vision-optimized configuration may differ from one functioning under the historical assumption that its outputs would be intended for human viewing. How can such a functional shift be accomplished, and how can applications in which both computer vision and human viewing require support be handled? This article discusses the implementation options involved in heterogeneously leveraging an ISP and one or more vision processors to efficiently and effectively execute both traditional and deep learning-based computer vision algorithms.

ISPs, whether in the form of a standalone IC or as an IP core integrated into a SoC or image sensor, are common in modern camera-inclusive designs (see sidebar "ISP Fundamentals"). And vision processors, whether to handle traditional- or deep learning-based algorithms, or a combination of the two, are increasingly common as well, as computer vision adoption becomes widespread. Sub-dividing the overall processing of computer vision functions among the collaborative combination of an ISP and vision processor(s) is conceptually appealing from the standpoint of making cost-effective and otherwise efficient use of all available computing resources.

However, ISPs are historically "tuned" to process images intended for subsequent human viewing; as such, some ISP capabilities are unnecessary in a computer vision application, others are redundant with their vision processor-based counterparts, and still others may actually be detrimental to computer vision accuracy and other end results. The situation is complicated even further in applications where an ISP's outputs are used for both human viewing and computer vision processing (see sidebar "Assessing ISP Necessity").

This article discusses the implementation options involved in combining an ISP and one or more vision processors to efficiently and effectively execute traditional and/or deep learning-based computer vision algorithms. It also discusses how to implement a design that handles both computer vision and human viewing functional requirements. It provides both general concept recommendations and detailed specific explanations, the latter in the form of case study examples. And it also introduces readers to an industry alliance created to help product creators incorporate vision-enabled capabilities into their SoCs, systems and software applications, along with outlining the technical resources that this alliance provides (see sidebar "Additional Developer Assistance").

Dynamic Range Compression Effects on Edge Detection

The following section from Apical Limited (now owned by Arm, who is subsequently implementing Apical's ISP technologies in its own IP cores), explores the sometimes-undesirable interaction between an ISP and traditional computer vision functions, as well as providing suggestions on how to resolve issues encountered.

In general, the requirements of an ISP to produce natural, visually accurate imagery and the desire to produce well-purposed imagery for computer vision are closely matched. However, it can be challenging to maintain accuracy in uncontrolled environments. Also, in some cases the nature of the image processing applied in-camera can have an unintended and significant detrimental effect on the effectiveness of embedded vision algorithms, since the data represented by a standard camera output is very different from the raw data recorded by the camera sensor. Therefore, it's important to understand how the pixels input to the vision algorithms have already been pre-processed by the ISP, since this pre-processing may impact the performance of those algorithms in real-life situations.

The following case study will illustrate a specific instance of these impacts. Specifically, we’ll look at how DRC (dynamic range compression), a key processing stage in all cameras, affects the performance of a simple threshold-based edge detection algorithm. DRC, as explained in more detail in the sidebar "ISP Fundamentals", is the process by which an image with high dynamic range is mapped into an image with lower dynamic range. Most real-world situations exhibit scenes with a dynamic range of up to ~100 dB, and the eye can resolve details up to ~150 dB. Sensors exist which can capture this range (or more), which equates to around 17-18 bits per color per pixel. Standard sensors capture around 70 dB, which corresponds to around 12 bits per color per pixel.

We would like to use as much information about the scene as possible for embedded vision analysis purposes, which implies that we should use as many bits as possible as input data. One possibility is just to take this raw sensor data in as-is in linear form. Often, however, the raw data isn't available, since the camera module may be separate from the computer vision processor, with the communication channel between them subsequently limited to 8 bits per color per pixel. Some kinds of vision algorithms also work better if presented with images exhibiting a narrower dynamic range since, as a result, less variation in scene illumination needs to be taken into account.

Often, therefore, some kind of DRC will be performed between the raw sensor data (which may in some cases be very high dynamic range) and the standard RGB or YUV output provided by the camera module. This DRC will necessarily be non-linear; one familiar example of the technique is gamma correction, which is more correctly described as dynamic range preservation. The intention is to match the gamma applied in-camera with an inverse function applied at the display, in order to recover a linear image at the output (as in, for example, the sRGB or rec709 standards for mapping 10-bit linear data into 8-bit transmission formats).
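
A simple power-law model of this encode/decode pair is sketched below. Real sRGB and rec.709 curves are piecewise, with a linear segment near black, so this is an approximation intended only to show that linear data can be recovered when the encoding curve is known.

```python
import numpy as np

def gamma_encode(linear, gamma=2.2):
    """Simple power-law encoding (an approximation; real sRGB/rec.709 use a
    piecewise curve with a linear toe)."""
    return np.power(np.clip(linear, 0.0, 1.0), 1.0 / gamma)

def gamma_decode(encoded, gamma=2.2):
    """Inverse transform: recovers linear intensities when the encoding gamma
    is known, which is not guaranteed for adaptive, camera-specific DRC."""
    return np.power(np.clip(encoded, 0.0, 1.0), gamma)

linear = np.linspace(0.0, 1.0, 5)
roundtrip = gamma_decode(gamma_encode(linear))
print(np.allclose(linear, roundtrip))   # True, but only for a known, fixed gamma
```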

The same kind of correction is also frequently used for DRC. For the purposes of this vision algorithm example, it's optimal to work in a linear domain, and in principle, it would be straightforward to apply an inverse gamma and recover the linear image as input. Unfortunately, though, the gamma used in the camera does not always follow a known standard. This is for good reason; the higher the dynamic range potential of a sensor, the larger the amount of non-linear correction that needs to be applied to those images captured by the sensor that exhibit high dynamic range. Conversely, for images that don't fit these criteria, i.e., images of scenes that are fairly uniform in illumination, little or no correction needs to be applied.

As a result, the non-linear correction needs to be adaptive. This means that the algorithm's function depends on the image itself, as derived from an analysis of the intensity histogram of the component pixels, and it may need also to be spatially varying, meaning that different transforms are applied in different image regions. The overall intent is to try to preserve as much information as accurately as possible, without clipping or distortion of the content, and while mapping the high input dynamic range down to the relatively low output dynamic range. Figure 1 gives an example of what DRC can achieve; the left-hand image retains the contrast of the original linear sensor image, while the right-hand post-DRC image is much closer to what the eye and brain would perceive.

Figure 1. An original linear image (left) transforms, after dynamic range compression, into a visual representation much closer to what the eye and brain would perceive (right) (courtesy Apical).

Now let's envision working with this same sort of non-linear image data as input to a vision algorithm, a normal situation in many real-world applications. How will the edge detection algorithm perform on this type of data, as compared to the original linear data? Consider a very simple edge detection algorithm, based on the ratio of intensities of neighboring pixels, such that an edge is detected if the ratio is above a pre-defined threshold. Also consider the simplest form of DRC, which is gamma-like, and might be associated with a fixed exponent or one derived from histogram analysis (i.e., "adaptive gamma"). What effect will this gamma function have on the edge intensity ratios?
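
The effect is easy to see numerically. In the hypothetical example below, two neighboring pixels whose linear-domain intensity ratio exceeds a 1.5 detection threshold fall below that threshold once a 1/2.2 gamma is applied; all of the specific values and the threshold are illustrative assumptions.

```python
import numpy as np

# Two neighboring pixels forming an edge in linear sensor data.
bright, dark = 0.30, 0.18            # linear intensities
ratio_linear = bright / dark         # ~1.67: above a hypothetical 1.5 threshold

# After a gamma-like DRC (1/2.2 power law), darker pixels are boosted more,
# compressing the ratio between neighbors.
gamma = 1.0 / 2.2
ratio_after = (bright ** gamma) / (dark ** gamma)
print(ratio_linear, ratio_after)     # ~1.67 vs ~1.26: the edge drops below threshold
```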

The gamma function is shown in Figure 2 at top right. Here, the horizontal axis is the pixel intensity in the original image, while the vertical axis is the pixel intensity after gamma correction. This function increases the intensity of pixels in a variable manner, such that darker pixels have their intensities increased more than do brighter pixels.  Figure 2 at top left shows an edge, with its pixel intensity profile along the adjacent horizontal line. This image is linear; it has been obtained from the original raw data without any non-linear intensity correction. The edge corresponds to a dip in the intensity profile; assume that this dip exceeds the edge detection threshold by a small amount.

 


Figure 2. DRC has variable effects on an edge depending on the specifics of its implementation. The original edge is shown in the top-left corner, with the gamma-corrected result immediately below it. The intensity profile along the blue horizontal line is shown in the middle column. The result of gamma correction is shown in the middle row, with the subsequent outcome of applied local contrast preservation correction shown in the bottom row (courtesy Apical).

Now consider the same image after gamma correction (top right) as shown in Figure 2, middle row. The intensity profile has been smoothed out, with the amplitude of the dip greatly reduced. The image itself is brighter, but the contrast ratio is lower. The reason? The pixels at the bottom of the dip are darker than the rest, and their intensities are therefore relatively increased more by the gamma curve than the rest, thereby closing the gap. The difference between the original and new edge profiles is shown in the right column. The dip is now well below our original edge detection threshold.

This outcome is problematic for edge detection, since the strengths of the edges present in the original raw image are reduced in the corrected image. Even worse, they're reduced in a totally unpredictable way based on where the edge occurs in the intensity distribution. Making the transform image-adaptive and spatially variant further increases the unpredictability of how much edges will be smeared out by the transform. There is simply no way to relate the strength of edges in the output to those in the original linear sensor data. On the one hand, DRC is necessary to pack information recorded by the sensor into a form that the camera can output. However, this very same process also degrades important local pixel information needed for reliable edge detection. Such degradation could, for example, manifest as instability in the white line detection algorithm for an automotive vision system, particularly when entering or exiting a tunnel, or in other cases where the scene's dynamic range changes rapidly and dramatically.

Fortunately, a remedy exists. The fix arises from an observation that an ideal DRC algorithm should be highly non-linear on large length scales on the order of the image dimensions, but should be strictly linear on very short scales of a few pixels. In fact this outcome is also desirable for other reasons. Let's see how it can be accomplished, and what effect it would have on the edge problem. The technique involves deriving a pixel-dependent image gain via the formula Aij = Oij/D(Iij), where i,j are the pixel coordinates, O denotes the output image and I the input image, and D is a filter applied to the input image which acts to increase the width of edges. The so-called amplification map, A, is post-processed by a blurring filter which alters the gain for a particular pixel based on an average over its nearest neighbors. This modified gain map is multiplied with the original image to produce the new output image. The result is that the ratio in intensities between neighboring pixels is precisely preserved, independent of the overall shape of the non-linear transform applied to the whole image.
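
The NumPy sketch below shows the structure of such a gain-map approach, assuming a plain box blur stands in for the edge-widening filter D and a global gamma curve stands in for the target output O. It illustrates the data flow only and is not Apical's production algorithm.

```python
import numpy as np

def box_blur(img, k=5):
    """Simple separable box filter, used here both as the edge-widening filter D
    and as the smoothing applied to the amplification map."""
    kernel = np.ones(k) / k
    tmp = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, tmp)

def drc_local_contrast(linear, gamma=1.0 / 2.2, eps=1e-6):
    """Gain-map DRC sketch: derive a per-pixel gain A = O / D(I), smooth it,
    then multiply it back onto the original image so that local pixel ratios
    are preserved while global contrast is compressed."""
    desired = np.power(np.clip(linear, 0.0, 1.0), gamma)   # O: global tone curve
    gain = desired / (box_blur(linear) + eps)              # A = O / D(I)
    gain = box_blur(gain)                                   # blur the gain map
    return np.clip(linear * gain, 0.0, 1.0)

img = np.random.default_rng(2).random((64, 64)).astype(np.float32) * 0.2
out = drc_local_contrast(img)
print(out.shape, float(out.max()))
```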

The result is shown in the bottom row of Figure 2. Although the line is brighter, its contrast with respect to its neighbors is preserved. We can see this more clearly in the image portion example of Figure 3. Here, several edges are present within the text. The result of standard gamma correction is to reduce local contrast, thereby "flattening" the image, while the effect of the local contrast preservation algorithm is to lock the ratio of edge intensities, such that the dips in the intensity profile representing the dark lines within the two letters in the bottom image and the top image are identical.


Figure 3. Showing a portion of an image makes the DRC effects even more evident. The original linear image is in the top-left corner, with the gamma-corrected result immediately below it. The effect of local contrast preservation is shown in the bottom-left corner (courtesy Apical).

In summary, while non-linear image contrast correction is essential for forming images that are viewable and transmissible, such transforms should retain linearity on small scales important for edge analysis. Note that our earlier definition of the amplification map as a pixel position-dependent quantity implies that such transforms must be local rather than global (i.e., position-independent). It is worth noting that unfortunately, the vast majority of cameras in the market employ only global processing and therefore have no means of controlling the relationship between edges in the original linear sensor data and the camera output.

Michael Tusch
Founder and CEO
Apical Limited

Software Integration of ISP Functions into a Vision Processor

An ISP's pre-processing impacts the effectiveness of not only traditional but also emerging deep learning-based computer vision algorithms. And, as vision processors become increasingly powerful and otherwise capable, it's increasingly feasible to integrate image signal processing functions into them versus continuing to rely on a standalone image signal processor. The following section from Synopsys covers both of these topics, in the process showcasing the capabilities of the OpenVX open standard for developing high-performance computer vision applications portable to a wide variety of computing platforms.

CNNs (convolutional neural networks) are of late receiving a lot of justified attention as a state-of-the-art technique for implementing computer vision tasks such as object detection and recognition. With applications such as mobile phones, autonomous vehicles and augmented reality devices, for example, requiring vision processing, a dedicated vision processor with a CNN accelerator can maximize performance while minimizing power consumption. However, the quality of the images fed into the CNN engine or accelerator can heavily influence the accuracy of object detection and recognition.

To ensure highest quality of results, therefore, designers must make certain that the images coming from the camera are optimal. Images captured at dusk, for example, might normally suffer from a lack of differentiation between objects and their backgrounds. A possible way to improve such images is by using a normalization pre-processing step such as one of the techniques previously described in this article. More generally, in an example vision pipeline, light passes through the camera lens and strikes a CMOS sensor (Figure 4). The output of the CMOS sensor routes to an ISP to rectify lens distortions, make color corrections, etc. The pre-processed image then passes on to the vision processor for further analysis.


Figure 4. A typical vision system, from camera to CNN output, showcases the essential capabilities of an ISP (courtesy Synopsys).

Color image demosaicing is one example of the many important tasks handled by the ISP (Figure 5). Most digital cameras obtain their inputs from a single image sensor that's overlaid with a color filter array. The sensor output when using the most common color filter array, the Bayer pattern, has a green appearance to the naked eye. That's because the Bayer filter pattern is 50% green, 25% red and 25% blue; the extra green helps mimic the physiology of the human eye which is more sensitive to green light. Demosaicing makes the images look more natural by extrapolating the missing color values from nearby pixels. This pixel-by-pixel processing of a two-dimensional image exemplifies the various types of operations that an ISP must perform.
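
As a sketch of that interpolation step, the snippet below builds a synthetic RGGB mosaic and reconstructs a three-channel image with OpenCV. Note that OpenCV names Bayer conversions by the second row of the pattern, so an RGGB sensor maps to the "BayerBG" constant; the test color is arbitrary.

```python
import numpy as np
import cv2

# Build a synthetic RGGB Bayer mosaic from a flat test color.
rgb = np.full((8, 8, 3), (40, 160, 90), dtype=np.uint8)   # R, G, B test values
bayer = np.zeros((8, 8), dtype=np.uint8)
bayer[0::2, 0::2] = rgb[0::2, 0::2, 0]   # R on even rows, even columns
bayer[0::2, 1::2] = rgb[0::2, 1::2, 1]   # G
bayer[1::2, 0::2] = rgb[1::2, 0::2, 1]   # G
bayer[1::2, 1::2] = rgb[1::2, 1::2, 2]   # B on odd rows, odd columns

# Demosaic: interpolate the two missing color values at every pixel.
demosaiced = cv2.cvtColor(bayer, cv2.COLOR_BayerBG2RGB)
print(demosaiced.shape)                   # (8, 8, 3): three colors per pixel again
print(demosaiced[4, 4])                   # approximately the original test color
```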


Figure 5. Demosaicing a Bayer pattern image to a normal RGB image requires two-dimensional pixel processing (courtesy Synopsys).

Some camera manufacturers embed ISP capabilities in their modules. In other cases, the SoC or system developer will include a hardwired ISP in the design, connected to the camera module's output. To execute computer vision algorithms such as object detection or facial recognition on the images output by the ISP, a separate vision processor (either on- or off-chip) is also required; if these algorithms are deep learning-based, a CNN "engine" is needed.

Modern vision processors include both vector DSP capabilities and a neural network accelerator (Figure 6). The vision processor’s vector DSP can be used to replace a standalone hardwired ISP, since its capabilities are well suited to alternatively executing ISP functions. It can, for example, perform simultaneous multiply-accumulates on different streams of data; a vector DSP with a 512 bit wide word is capable of performing up to 32 parallel 8-bit multiplies or 16 parallel 16-bit multiplies. In combination with a power- and area-optimized architecture, a vector DSP's inherent parallelism delivers a highly efficient 2D image processing solution for embedded vision applications.
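
NumPy's element-wise arithmetic is a convenient way to picture that lane-level parallelism. The sketch below only models the concept; on the actual vector DSP these are single-cycle wide-word operations.

```python
import numpy as np

# One 512-bit "vector" treated as 16 lanes of 16-bit pixels and coefficients,
# mirroring the article's example of 16 parallel 16-bit multiplies.
pixels       = np.arange(16, dtype=np.int16)
coefficients = np.full(16, 3, dtype=np.int16)
accumulator  = np.zeros(16, dtype=np.int32)

accumulator += pixels.astype(np.int32) * coefficients   # 16 multiply-accumulates at once
print(int(accumulator.sum()))                            # lane reduction: 3 * (0+1+...+15) = 360
```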


Figure 6. Synopsys’ DesignWare ARC EV62 Vision Processor includes two vector DSP cores and an optional, tightly integrated neural network engine (courtesy Synopsys).

A programmable vision processor requires a robust software tool chain and relevant library functions. Synopsys' EV62, for example, is supported by the company's DesignWare ARC® MetaWare EV Development Toolkit, which includes software development tools based on the OpenVX™, OpenCV, and OpenCL C embedded vision standards. Synopsys’ OpenVX implementation extends the standard OpenVX kernel library to include additional kernels that offer OpenCV-like functionality within the optimized, pipelined OpenVX execution environment. For vision processing, OpenVX provides both a framework and optimized vision algorithms—image functions implemented as kernels, which are combined to form an image processing application expressed as a graph. Both the standard and extended OpenVX kernels have been ported and optimized for the EV6x so that designers can take efficient advantage of the parallelism of the vector DSP.

Figure 7 shows an example of an OpenVX graph that uses a combination of standard and extended OpenVX kernels. In this example, cropping of the image is done during the distortion correction (i.e., remap) step. The output of demosaicing then passes through distortion correction, image scaling, and image normalization functions, the latter step adjusting the range of pixel intensity values to correct for poor contrast due to low light or glare, for example.
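
The sketch below models that graph in Python purely to show the data flow from demosaic through remap, scaling, and normalization. In a real design these stages are OpenVX kernel nodes (standard or Synopsys-extended) compiled for the vision processor; the placeholder functions here are only stand-ins.

```python
import numpy as np

# Conceptual model of the OpenVX graph in Figure 7: each stage is a node whose
# output feeds the next. The placeholder bodies only illustrate the data flow.
def demosaic(bayer):        return np.repeat(bayer[..., None], 3, axis=2)   # placeholder
def remap(img):             return img[4:-4, 4:-4]        # distortion correction + crop
def scale(img, shape):      return img[:shape[0], :shape[1]]                # placeholder
def normalize(img):
    img = img.astype(np.float32)
    return (img - img.min()) / (img.max() - img.min() + 1e-6)  # stretch pixel range

graph = [demosaic, remap, lambda x: scale(x, (224, 224)), normalize]
frame = np.random.default_rng(3).integers(0, 255, (512, 512), dtype=np.uint8)
for node in graph:
    frame = node(frame)
print(frame.shape, float(frame.min()), float(frame.max()))
```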


Figure 7. An OpenVX graph for implementing an ISP on a vision processor leverages both standard and extended kernels (courtesy Synopsys).

Because the EV62 has two vision processor CPU cores, it can do "double duty"; one vision processor can execute the ISP algorithms while the other handles other computer vision algorithms in parallel. An EV64, with four vision processor CPU cores, delivers even more parallel processing capabilities.

Gordon Cooper
Product Marketing Manager for DesignWare ARC Embedded Vision Processors
Synopsys

Implementing ISP Functions Using Deep Learning

In a deep learning-based computer vision design, the neural network accelerator is, according to Imagination Technologies, also a compelling candidate for additionally executing image signal processing functions that were historically handled by a standalone ISP. The company discusses its proposed integration approach in the following section.

Today, many designs combine both an ISP and a traditional and/or deep learning-based vision processor, and notable efficiencies can be gained by implementing computer vision applications heterogeneously between them. The challenge lies in the fact that (as previously discussed in this article) ISPs are tuned to process images for human viewing purposes, which can be at odds with the requirements of computer vision applications; in some cases, certain ISP capabilities could be redundant or even negatively impact overall accuracy. In addition, some applications will require outputs that are used for both human viewing and computer vision processing purposes, further complicating the implementation.

Computer vision implementations using CNNs can combine them with a camera and an ISP for deployment in a wide variety of systems that offer vision and AI capabilities. Imagination Technologies believes that modern CNN accelerators are so capable, and the compute requirements of the ISP are so modest in comparison to those of a CNN, that ISP and CNN compute functions can merge into a unified NNA (Neural Network Accelerator). Such an approach offers numerous benefits, particularly with respect to silicon area-measured implementation costs.

Modern CNNs have rapidly overtaken traditional computer vision algorithms in many applications, particularly with respect to accuracy on tasks such as object detection and recognition. CNNs are adaptable, enable rapid development and are inherently simple, consisting primarily of the multiplication and addition operations that make up convolutions. Their high accuracy, however, comes at a steep computational cost.

A CNN is organized in layers with many convolutions per layer. A "deep" network will have multiple convolution layers in sequence. Fortunately, the compute structure of convolutions is highly regular and can be efficiently implemented in dedicated hardware accelerators. However, the sheer scale of the computation required to process the data in a video stream, with adequate frame rate, color depth and resolution and in real-time, can be daunting.

NNAs are dedicated-function processors that perform these core arithmetic functions required by CNNs at performance rates not alternatively possible on a CPU or a GPU for the same area and power budget. An NNA will typically provide more than one TOp/s (trillion operations per second) of peak performance potential. And an NNA can accomplish this feat while requiring a silicon area of only between 1 mm² and 2 mm² when fabricated using modern semiconductor processes, and within a power budget of less than one watt.

An example will put this level of compute resource into tangible context. Consider an ISP that takes in streaming images from an image sensor in Bayer format, i.e., RGGB data, and outputs either RGB or YUV processed images at 30-120 fps (frames per second) rates. Specifically, assume 1920 by 1080 pixel HD resolution video at 60 fps, translating into 2 megapixels per frame or 120 megapixels per second. Each pixel includes at least one byte of data for each red, green and blue color channel, generating a data rate of at least 360 MBytes per second. At a 120 megapixel per second rate, an NNA is capable of executing approximately 10,000 operations (multiplies and adds) per pixel.
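
A quick back-of-envelope calculation confirms these figures; rounding explains the slightly tidier numbers quoted above.

```python
# Back-of-envelope check of the figures in the text.
width, height, fps = 1920, 1080, 60
pixels_per_frame = width * height                 # ~2.07 megapixels
pixels_per_second = pixels_per_frame * fps        # ~124 megapixels/s
bytes_per_second = pixels_per_second * 3          # >= 1 byte per R, G, B channel

nna_ops_per_second = 1e12                         # ~1 TOp/s peak for an NNA
ops_per_pixel = nna_ops_per_second / pixels_per_second

print(f"{pixels_per_frame / 1e6:.1f} MP/frame, {pixels_per_second / 1e6:.0f} MP/s")
print(f"{bytes_per_second / 1e6:.0f} MB/s, ~{ops_per_pixel:,.0f} ops/pixel available")
```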

A typical CNN for object detection, such as the SSD (Single Shot Detector) network used to identify and place boxes around the vehicles in Figure 8, will require something on the order of 100,000 operations/pixel. The front-end ISP functions, dominated by denoising and demosaicing operations, necessitate an additional ~1,000 operations/pixel. Remarkably, these ISP operations account for only around 1% of the total compute when the ISP and CNN functions share a single NNA.


Figure 8. SSD is an example of a CNN used for object detection (courtesy Imagination Technologies).

A conventional architecture (part (a) of Figure 9) uses an ISP to generate images for human viewing purposes, along with an NNA fed by the ISP outputs to handle computer vision. A simpler approach (part (b)) would be to use only one or more NNAs, performing both the ISP and deep learning functions. The NNA would first produce an RGB output for human viewing, followed by further processing (on the same or a different NNA) for computer vision.


Figure 9. Possible combinations of an NNA and an ISP for computer vision and (optional) human viewing (courtesy Imagination Technologies)

The ISP functions can optionally be programmed directly into the NNA in the same way that a standalone ISP might be developed. In such a case, however, some care will be needed in selecting ISP algorithms so that the necessary compute capabilities are fully supported by the operation primitives that an NNA is capable of executing.

Alternatively, the NNA may be running a CNN that is trained to perform the various functions of an ISP. Such an approach is capable of covering a wide range of image conditions such that auto-exposure, auto-white balance and even auto-focus can be supported. Training in this scenario may not be as difficult as it might appear at first glance. If an ISP and NNA version of the final system are available at training time, the combination can be used to train a single CNN through a process known as distillation. Using a single NNA in this way may reduce power consumption and will certainly reduce cost (i.e., silicon area). It also enables the use of off-the-shelf CNNs and is applicable wherever computer vision is automating some tasks but humans still require visual situational awareness.
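
A sketch of the distillation objective is shown below, assuming the reference ISP-plus-CNN pipeline acts as the teacher and the single raw-input CNN on the NNA is the student; the temperature and weighting are illustrative hyperparameters, not Imagination's settings.

```python
import numpy as np

def softmax(x, t=1.0):
    z = np.exp((x - x.max(axis=-1, keepdims=True)) / t)
    return z / z.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    """Sketch of a distillation objective: the student (a single CNN fed raw
    sensor data) is trained to match both the ground-truth labels and the
    softened outputs of the teacher (the reference ISP + CNN pipeline)."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    kd = -np.sum(soft_teacher * np.log(soft_student + 1e-9), axis=-1).mean()
    hard = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-9).mean()
    return alpha * kd + (1.0 - alpha) * hard

rng = np.random.default_rng(4)
print(distillation_loss(rng.normal(size=(8, 10)), rng.normal(size=(8, 10)),
                        rng.integers(0, 10, size=8)))
```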

The final scenario is shown in part (c) of Figure 9, where human viewing is not involved and the CNN running on the NNA has therefore been trained to operate directly on Bayer image data in order to perform both the ISP and computer vision functions. Training in this final case could also leverage the previously mentioned distillation. Applications could potentially include very low-cost IoT devices and extend to forward-looking collision prevention for autonomous vehicles.

Imagination Technologies believes that the traditional combination of a standalone ISP and a computer vision processor (in this case an NNA for CNNs) should be re-evaluated. The ISP functions could alternatively be either directly implemented on an NNA or trained into a CNN executing on an NNA. Regardless of whether or not a human visual output is required, the cost of such a re-architected computer vision system could be significantly lower than is the case with the legacy approach.

Tim Atherton
Senior Research Manager for Vision and AI, PowerVR
Imagination Technologies

Conclusion

An ISP in combination with one or more vision processors can deliver more robust computer vision capabilities than a vision processor can provide on its own. However, an ISP operating in a computer vision-optimized configuration may differ from one functioning under the historical assumption that its outputs would be intended for human viewing. In general, the requirements of an ISP to produce natural, visually accurate imagery and the desire to produce well-purposed imagery for computer vision are closely matched. However, in some cases the nature of the image processing applied in-camera can have a detrimental effect on the effectiveness of embedded vision algorithms. Therefore, it's important to understand and, if necessary, compensate for how the pixels input to the vision algorithms have already been pre-processed by the ISP.

ISP Fundamentals

In order to understand how an ISP can enhance and/or hamper the effectiveness of a computer vision algorithm, it's important to first understand what an ISP is and how it operates in its historical primary function: optimizing images for subsequent human viewing purposes. The following section from Apical Limited explores these topics.

Camera designers have decades of experience in creating image-processing pipelines that produce attractive and/or visually accurate images, but what kind of image processing produces video that is optimized for subsequent computer vision analysis? It seems reasonable to begin by considering a conventional ISP. After all, the human eye-brain system produces what we consider aesthetically pleasing imagery for a purpose: to maximize our decision-making abilities. But which elements of such an ISP are most important to get right for good computer vision, and how do they impact the performance of the algorithms that run on them?

Figure A shows a simplified block schematic of a conventional ISP. The input is sensor data in a raw format (one color per pixel), and the output is interpolated RGB or YCbCr data (three colors per pixel).


Figure A. A simplified view inside a conventional ISP shows its commonly supported functions (courtesy Apical).

Table A briefly summarizes the function of each block. The list is not intended to be exhaustive; an ISP design team will frequently also implement other modules.

  • Raw data correction: Set black point, remove defective pixels.
  • Lens correction: Correct for geometric and luminance/color distortions.
  • Noise reduction: Apply temporal and/or spatial averaging to increase SNR (signal-to-noise ratio).
  • Dynamic range compression: Reduce dynamic range from sensor to standard output without loss of information.
  • Demosaic: Reconstruct three colors per pixel via interpolation with pixel neighbors.
  • 3A: Calculate correct exposure, white balance and focal position.
  • Color correction: Obtain correct colors in different lighting conditions.
  • Gamma: Encode video for standard output.
  • Sharpen: Edge enhancement.
  • Digital image stabilization: Remove global motion due to camera shake/vibration.
  • Color space conversion: RGB to YCbCr.

Table A. Functions of main ISP modules

Computer vision algorithms may operate directly on the raw data, on the output data, or on data that has subsequently passed through a video compression codec. The data at each of these three stages often has very different quality and other characteristics; these issues are relevant to the performance of computer vision.

Next, let's review the stages of the ISP in order of decreasing importance to computer vision, an order which also happens to be approximately the top-to-bottom order shown in Figure A. We start with the sensor and optical system. Obviously, the better the sensor and optics, the better the quality of data on which to base decisions. But "better" is not a matter simply of resolution, frame rate or SNR (signal-to-noise ratio). Dynamic range, for example, is also a key characteristic. Dynamic range is the relative difference in brightness between the brightest and darkest details that the sensor can record within a single scene, with a value normally expressed in dB.

CMOS and CCD sensors commonly have a dynamic range of between 60 and 70 dB, sufficient to capture all details in scenes that are fairly uniformly illuminated. Special sensors are required to capture the full range of illumination in high-contrast environments. Around 90 dB of dynamic range is needed to simultaneously record information in deep shadows and bright highlights on a sunny day; this requirement rises further if extreme lighting conditions occur (the human eye has a dynamic range of around 120 dB). If the sensor can’t capture such a wide range, objects that move across the scene will disappear into blown-out highlights and/or deep shadows below the sensor black level. High (i.e., wide) dynamic range sensors are helpful in improving computer vision performance in uncontrolled lighting environments. Efficient processing of such sensor data, however, is not trivial.

The next most important ISP stage, for a number of reasons, is noise reduction. In low light settings, noise reduction is frequently necessary to raise objects above the noise background, subsequently aiding in accurate segmentation. High levels of temporal noise can also easily confuse tracking algorithms based on pixel motion, even though such noise is largely uncorrelated both spatially and temporally. If the video passes through a lossy compression algorithm prior to post-processing, you should also consider the effect of noise reduction on compression efficiency. The bandwidth required to compress noisy sources is much higher than with "clean" sources. If transmission or storage is bandwidth-limited, the presence of noise reduces the overall compression quality and may lead to increased amplitude of quantization blocks, which confuse computer vision algorithms.

Effective noise reduction can readily increase compression efficiency by 70% or more in moderate noise environments, even when the increase in SNR is visually unnoticeable. However, noise reduction algorithms may themselves introduce artifacts. Temporal processing works well because it increases the SNR by averaging the processing over multiple frames. Both global and local motion compensation may be necessary to eliminate false motion trails in environments with fast movement. Spatial noise reduction aims to blur noise while retaining texture and edges and risks suppressing important details. It's important, therefore, to strike a careful balance between SNR increase and image quality degradation.
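
An illustrative NumPy sketch of the temporal-averaging idea follows: a running average with a crude per-pixel motion check. Production noise reducers use motion compensation and combined spatio-temporal filtering; the threshold and blend factor here are arbitrary.

```python
import numpy as np

def temporal_denoise(frames, alpha=0.2, motion_threshold=25):
    """Running-average temporal noise reduction (illustrative only). Pixels that
    change by more than motion_threshold are treated as real motion and taken
    from the current frame to avoid motion trails."""
    avg = frames[0].astype(np.float32)
    out = [avg.copy()]
    for frame in frames[1:]:
        frame = frame.astype(np.float32)
        moving = np.abs(frame - avg) > motion_threshold
        avg = np.where(moving, frame, (1 - alpha) * avg + alpha * frame)
        out.append(avg.copy())
    return [f.astype(np.uint8) for f in out]

# Static scene + Gaussian noise: noise (standard deviation) drops over successive frames.
rng = np.random.default_rng(5)
scene = np.full((64, 64), 128, dtype=np.float32)
frames = [np.clip(scene + rng.normal(0, 10, scene.shape), 0, 255).astype(np.uint8)
          for _ in range(10)]
denoised = temporal_denoise(frames)
print(np.std(frames[-1].astype(np.float32)), np.std(denoised[-1].astype(np.float32)))
```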

The correction of lens geometric distortions, chromatic aberrations and lens shading (i.e., vignetting) is of inconsistent significance, depending on the optics and application. For conventional cameras, uncorrected data may be perfectly suitable for post-processing. In digital PTZ (pan, tilt and zoom) cameras, on the other hand, correction is a fundamental component of the system. A set of "3A" algorithms control camera exposure, color and focus, based on statistical analysis of the sensor data. Their function and impact on computer vision is shown in Table B.

  • Auto exposure – Function: Adjust exposure to maximize the amount of scene captured; avoid flicker in artificial lighting. Impact: A poor algorithm may blow out highlights or clip dark areas, losing information; temporal instabilities may confuse motion-based analysis.
  • Auto white balance – Function: Obtain correct colors in all lighting conditions. Impact: If color information is used by computer vision, it needs to be accurate; it is challenging to achieve accurate colors in all lighting conditions.
  • Auto focus – Function: Focus the camera. Impact: Which regions of the image should receive focus attention? How should the algorithm balance temporal stability versus rapid refocusing in a scene change?

Table B. The impact of "3A" algorithms

Finally, we turn to DRC (dynamic range compression). DRC is a method of non-linear image adjustment that reduces dynamic range, i.e., global contrast. It has two primary functions: detail preservation and luminance normalization.

As mentioned earlier, the better the dynamic range of the sensor and optics, the more data will typically be available for computer vision to work on. But in what form do the algorithms receive this data? For some embedded vision applications, it may be no problem to work directly with the high bit depth raw sensor data. But if the analysis is run in-camera on RGB or YCbCr data, or as post-processing based on already lossy-compressed data, the dynamic range of such data is typically limited by the 8-bit standard format, which corresponds to 60 dB. This means that unless DRC occurs in some way prior to encoding, the additional scene information will be lost. While techniques for DRC are well established (gamma correction is one form, for example), many of them decrease image quality in the process, by degrading local contrast and color information, or by introducing spatial artifacts.

Another application of DRC is in image normalization. Advanced computer vision algorithms, such as those employed in facial recognition, are susceptible to changing and non-uniform lighting environments. For example, an algorithm may recognize the same face differently depending on whether the face is uniformly illuminated or lit by a point source to one side, in the latter case casting a shadow on the other side. Good DRC processing can be effective in normalizing imagery from highly variable illumination conditions to simulated constant, uniform lighting, as Figure B shows.

Figure B. DRC can normalize (left) a source image with non-uniform illumination (right) (courtesy Apical).

Michael Tusch
Founder and CEO
Apical Limited

Assessing ISP Necessity

Ongoing academic research regularly revisits the necessity of routing the raw data coming out of an image sensor through an ISP prior to presenting the resulting images to computer vision algorithms for further processing. The potential for skipping the costly, power-consuming and latency-inducing ISP step in the process is particularly relevant with deep learning-based visual analysis approaches, where the network model used for inference can potentially be pre-trained with "raw" images from a sensor instead of pre-processed images from an intermediary ISP. Recent encouraging results using uncorrected (i.e., pre-ISP) images suggest that the inclusion of an ISP as a pre-processing step for computer vision may eventually no longer be necessary.

Note, however, that this ISP-less approach will only be appropriate for scenarios where computer vision processing is the only destination for the image sensor's output data. Consider, for example, a vehicle backup camera that both outputs images to a driver-viewable display and supplies those images as inputs to a passive collision warning or active autonomous collision-avoidance system. Or consider a surveillance system, where the initial detection of a potential intruder might be handled autonomously by computer vision but confirmation of intrusion is made by a human being, and/or when human-viewable images are necessary for the intruder's subsequent prosecution in a court of law. In these and other cases, an ISP (whether standalone or function-integrated into another processor) will still be necessary to parallel-process the images for humans' eyes and brains.

Additional Developer Assistance

The Embedded Vision Alliance® is a global partnership that brings together technology providers with end product and systems developers who are enabling innovative, practical applications of computer vision and visual AI. Imagination Technologies and Synopsys, the co-authors of this article, are members of the Embedded Vision Alliance. The Embedded Vision Alliance's mission is to provide product creators with practical education, information and insights to help them incorporate vision capabilities into new and existing products. To execute this mission, the Embedded Vision Alliance maintains a website providing tutorial articles, videos, code downloads and a discussion forum staffed by technology experts. Registered website users can also receive the Embedded Vision Alliance’s twice-monthly email newsletter, Embedded Vision Insights, among other benefits.

The Embedded Vision Alliance’s annual conference and trade show, the Embedded Vision Summit®, is coming up May 20-23, 2019 in Santa Clara, California. Intended for product creators interested in incorporating visual intelligence into electronic systems and software, the Embedded Vision Summit provides how-to presentations, inspiring keynote talks, demonstrations, and opportunities to interact with technical experts from Embedded Vision Alliance member companies. The Embedded Vision Summit is intended to inspire attendees' imaginations about the potential applications for practical computer vision technology through exciting presentations and demonstrations, to offer practical know-how for attendees to help them incorporate vision capabilities into their hardware and software products, and to provide opportunities for attendees to meet and talk with leading vision technology companies and learn about their offerings. More information, along with online registration, is now available.

The Embedded Vision Alliance also offers a free online training facility for vision-based product creators: the Embedded Vision Academy. This area of the Embedded Vision Alliance website provides in-depth technical training and other resources to help product creators integrate visual intelligence into next-generation software and systems. Course material in the Embedded Vision Academy spans a wide range of vision-related subjects, from basic vision algorithms to image pre-processing, image sensor interfaces, and software development techniques and tools such as OpenCL, OpenVX and OpenCV, along with Caffe, TensorFlow and other machine learning frameworks. Access is free to all through a simple registration process. The Embedded Vision Alliance and its member companies also periodically deliver webinars on a variety of technical topics, including various machine learning subjects. Access to on-demand archive webinars, along with information about upcoming live webinars, is available on the Alliance website.

Multi-sensor Fusion for Robust Device Autonomy

While visible light image sensors may be the baseline "one sensor to rule them all" included in all autonomous system designs, they're not a panacea on their own. By combining them with other sensor technologies:

  • "Situational awareness" sensors; standard and high-resolution radar, LiDAR, infrared and UV, ultrasound and sonar, etc., and

  • "Positional awareness" sensors such as GPS (global positioning system) receivers and IMUs (inertial measurement units)

the resultant "sensor fusion" configuration can deliver a more robust implementation in applications such as semi- and fully-autonomous vehicles, industrial robots, drones, and other autonomous devices. This article discusses implementation options, along with respective strengths and shortcomings of those options, involved in combining multiple of these sensor technologies within an autonomous device design.

Most sensors are single-purpose: one type of sensor for temperature, another for magnetic field, another for ambient light, etc. Image sensors are unique in that, when coupled with the right algorithms and sufficient processing power, they can become "software-defined sensors," capable of measuring many different types of things.

For example, using video of a person's face and shoulders, it's possible to identify the person, estimate their emotional state, determine heart rate and respiration rate, detect intoxication and drowsiness, and determine where the person's gaze is directed. Similarly, in cars and trucks, a single image sensor (or a small cluster of them) can detect and identify other vehicles, brake lights, pedestrians, cyclists, lane markings, speed limit signs, roadway and environment conditions, and more.

However, as their name implies, the performance of visible light image sensors, the most common category in widespread use today, can be sub-optimal in dimly lit settings, as well as at night; rain, snow, fog and other challenging environments can also notably degrade their discernment capabilities. And the ability to ascertain the device's current location and orientation, along with route direction, rate and acceleration, is also indirect at best, derived by recognizing landmarks in the surroundings and approximating the device's relative position to them.

Infrared and ultraviolet image sensors have range, resolution and other shortcomings but can also provide useful spectral information absent from a visible-light-only perspective of a scene. Radar, sonar and ultrasound work well after dark, although while they're good at detecting physical objects, they can't readily identify those objects, nor can they discern visible messages such as street signs, road markings, brake lights, or the color of traffic lights. LiDAR improves on radar from a resolution standpoint, albeit with notable size, weight, power consumption, cost and other tradeoffs. And the GPS and IMU modules now pervasive in smartphones and the like can also find use in determining an autonomous device's location, orientation, speed, acceleration, and other positional characteristics.

This article discusses options for combining several of these sensor technologies within an autonomous device design, along with the respective strengths and shortcomings of those options, in order to provide the autonomous device both with redundancy and with enhanced insight into its surroundings and its relationship to them. Both traditional and deep learning-based algorithms are highlighted, along with both general concept recommendations and detailed specific explanations, the latter in the form of case study examples. This article also introduces readers to an industry alliance created to help product creators incorporate vision-enabled capabilities into their SoCs, systems and software applications, and outlines the technical resources that this alliance provides (see sidebar "Additional Developer Assistance").

Sensor Fusion-based System Modeling and Software Development

Developing a sensor fusion-based design from a hardware standpoint is only one part of the full implementation process. Software running one or multiple heterogeneous processors also needs to be developed in order to make sense of, and react appropriately to, the multiple streams of data sourced by the various sensors in the design. The following section, authored by MathWorks, explores this topic in depth.

Development of increasingly autonomous systems is growing rapidly. Such systems range from roadway vehicles that meet the various NHTSA (National Highway Traffic Safety Administration) levels of autonomy, to consumer quadcopters capable of both autonomous flight and remote piloting, package delivery drones, flying taxis, and robots for disaster relief and space exploration. Work underway on autonomous systems spans diverse industries and includes both academia and government.

In broad terms, the processing steps of every autonomous system include sensing, perception, decision-making, and action execution, i.e., control functions. Figure 1 illustrates how these functions interrelate.


Figure 1. Autonomous systems' processing steps involve the interrelationship of various functions (courtesy MathWorks).

Autonomous systems rely on sensor suites that provide data about the surrounding environment to feed the perception system. These suites include radar and vision cameras, which provide detections of objects in their fields of view; LiDAR, which provides point clouds of returns from obstacles in the environment; and (in some cases) ultrasound and sonar sensors. Signal processing that occurs in the sensor system includes detection, segmentation, labeling, and classification, often along with basic tracking to reduce false alarms.

MathWorks has identified and is addressing an immediate and growing need for a toolset dedicated to sensor fusion and tracking that is reusable, shareable, flexible and configurable. Sensor fusion and tracking are at the heart of the perception block in Figure 1. By providing these tools, much of the effort that is currently spent on "reinventing the wheel" with every new autonomous system development project can be greatly reduced, if not eliminated. MathWorks' goal is to enable researchers, engineers, and enthusiasts to develop their own autonomous systems with significantly reduced time, effort, and funding. In addition, the tools we're developing also enable sharing of best practices and best results, both within an organization and across organizations and disciplines.

The toolset integrates sensor fusion and tracking algorithms into complete trackers, and provides a comprehensive environment for simulating sensor data, swapping trackers, testing various sensor fusion architectures, and evaluating the overall tracking results against the simulated truth. These tools help researchers and developers who want to select the optimal perception solution to meet the requirements and sensor suites of their autonomous systems.

Figure 2 shows the toolset architecture. MATLAB and Simulink provide a framework to explore centralized and decentralized sensor fusion architectures. This framework also supports various models that extend to two- and three-dimensional environments.


Figure 2. A toolset architecture for sensor fusion system modeling and software development supports both centralized and decentralized architectures (courtesy MathWorks).

Sensor fusion is also required to perform localization. The fusion of GPS and inertial sensors enables self-awareness for an autonomous system. For navigation and tracking, this self-awareness information must be tightly coupled with the perception algorithms. Specifically, localization is used to determine the system's pose: its orientation and position. This coupling is enabled with sensor models and fusion algorithms for IMU and GPS data, as shown in the workflow in Figure 3. Positional information is required by multi-object trackers because the autonomous system needs to know where it is at all times in order to keep track of objects in its occupancy grid.


Figure 3. Self-awareness for autonomous systems requires tight coupling of various sensors and their associated algorithms (courtesy MathWorks).
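
As a concrete (and deliberately simplified) illustration of the coupling shown in Figure 3, the Python sketch below implements a generic one-dimensional constant-velocity Kalman filter that predicts position and velocity from high-rate IMU acceleration and corrects with slower GPS position fixes. It is not MathWorks' toolset; the noise levels, rates and 1-D state are assumptions chosen for brevity.

```python
import numpy as np

def fuse_gps_imu(accel, gps, dt=0.01, gps_every=100,
                 accel_noise=0.5, gps_noise=2.0):
    """Minimal 1-D Kalman filter: predict [position, velocity] from IMU
    acceleration at every step, correct with a GPS position fix every
    `gps_every` steps."""
    x = np.zeros(2)                        # state: [position, velocity]
    P = np.eye(2)                          # state covariance
    F = np.array([[1.0, dt], [0.0, 1.0]])  # constant-velocity transition
    B = np.array([0.5 * dt * dt, dt])      # acceleration input model
    Q = accel_noise * np.outer(B, B)       # process noise from accel noise
    H = np.array([[1.0, 0.0]])             # GPS measures position only
    R = np.array([[gps_noise ** 2]])       # GPS measurement noise
    history = []
    for k, a in enumerate(accel):
        x = F @ x + B * a                  # IMU-driven prediction
        P = F @ P @ F.T + Q
        if k % gps_every == 0 and k // gps_every < len(gps):
            z = gps[k // gps_every]
            y = z - H @ x                  # innovation
            S = H @ P @ H.T + R
            K = P @ H.T @ np.linalg.inv(S) # Kalman gain
            x = x + (K @ y).ravel()
            P = (np.eye(2) - K @ H) @ P
        history.append(x.copy())
    return np.array(history)
```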

In the case of GPS-denied environments such as an urban "canyon", the inertial measurements from an IMU can alternatively be fused with visual odometry in order to provide improved positional accuracy.


Figure 4. Visual odometry is a feasible sensor fusion alternative in environments where GPS capabilities are compromised (courtesy MathWorks).

Scenario definition tools can find use in defining the "ground truth" (the environmental information provided by direct observation, i.e., empirical evidence, as opposed to information inferred by implication). Platforms, which are simulated representations of ground truth objects, can be represented as either point targets or extended objects. Their positions in space over time, including their poses and orientations, are defined by attaching a trajectory to each target. Each platform can be assigned multiple signatures, with each signature defining how that object interacts with the sensor simulation.

Simulated sensors can also be assigned to the platform, allowing them to sense other platforms in the environment. Because sensors in autonomous systems can have very high resolution, system models must account for extended objects with multiple detections. Figure 5 shows how it's possible to track these detections as a group, versus clustering them to only individual detections.


Figure 5. System models must comprehend sensors with very high resolutions (courtesy MathWorks).
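
The clustering alternative mentioned just before Figure 5 can be illustrated with a very small sketch: nearby detections are merged into one object-level centroid using a simple distance gate. The 2-D points and the 1.5 m gate are illustrative assumptions, and real systems typically use far more sophisticated extended-object models.

```python
import numpy as np

def cluster_detections(points, max_gap=1.5):
    """Group 2-D detections whose mutual distance is below max_gap (meters)
    into clusters and return one centroid per cluster, a crude stand-in for
    extended-object handling."""
    points = np.asarray(points, dtype=float)
    unassigned = set(range(len(points)))
    centroids = []
    while unassigned:
        seed = unassigned.pop()
        members, frontier = [seed], [seed]
        while frontier:
            i = frontier.pop()
            near = [j for j in list(unassigned)
                    if np.linalg.norm(points[i] - points[j]) < max_gap]
            for j in near:
                unassigned.remove(j)
                members.append(j)
                frontier.append(j)
        centroids.append(points[members].mean(axis=0))
    return centroids
```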

With scene generation, it's possible to focus on corner cases that may be difficult to capture with recorded data. Complex scenes can be set up, and detections can be synthesized directly, to test algorithms and configurations prior to live testing.

It's also important to maintain an open interface to the trackers. That way, detections obtained from actual sensors can be provided to a range of trackers through a common API, which captures the information passed to the tracker: time of detection, measurement, uncertainty, classification, and details about the sensor that made the detection and its coordinate system.
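
A generic record along these lines might look like the following Python sketch; the field names and types are assumptions chosen for illustration and do not reproduce any particular vendor's API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Detection:
    """A sensor-agnostic detection handed to any tracker over a common interface."""
    time: float              # time of detection, in seconds
    measurement: np.ndarray  # e.g., [x, y, z] or [range, azimuth, range-rate]
    covariance: np.ndarray   # measurement uncertainty
    classification: str      # e.g., "car", "pedestrian", "unknown"
    sensor_id: int           # which sensor produced the detection
    frame: str               # coordinate system of the measurement

det = Detection(time=12.34,
                measurement=np.array([5.2, -1.1, 0.0]),
                covariance=np.diag([0.25, 0.25, 1.0]),
                classification="pedestrian",
                sensor_id=3,
                frame="vehicle")
```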

Finally, we realize that we can't envision the needs of all users, so MathWorks provides tracker components as building blocks that developers can reuse in constructing their own trackers. Examples of these building blocks include track confirmation and deletion logic based on history and score, as well as track management.
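
As a hedged example of what such confirmation and deletion logic can look like, the sketch below implements generic M-of-N confirmation with consecutive-miss deletion for a single track; the thresholds are arbitrary and not tuned for any particular sensor suite.

```python
from collections import deque

class HistoryLogic:
    """Confirm a track after at least m_confirm associations within the last
    n_window updates; flag it for deletion after max_misses consecutive misses."""
    def __init__(self, m_confirm=3, n_window=5, max_misses=5):
        self.history = deque(maxlen=n_window)
        self.m_confirm = m_confirm
        self.max_misses = max_misses
        self.consecutive_misses = 0
        self.confirmed = False

    def update(self, associated: bool) -> bool:
        self.history.append(associated)
        self.consecutive_misses = 0 if associated else self.consecutive_misses + 1
        if sum(self.history) >= self.m_confirm:
            self.confirmed = True
        return self.confirmed

    def should_delete(self) -> bool:
        return self.consecutive_misses >= self.max_misses
```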

After a system is modeled, C code can be generated from the system's algorithms. This C code can be used to accelerate simulations, be integrated into a larger simulation framework, or be deployed to a general-purpose processor and/or one or more specialized processors.

Avinash Nehemiah
Product Manager, Deep Learning, Computer Vision and Automated Driving
MathWorks

Case Study: Autonomous Vehicles

ADAS (advanced driver assistance systems)-equipped and fully self-driving vehicles aren't the only autonomous devices on the market and under development, but they may be the most famous examples of the product category. In the following section, Synopsys describes the opportunity for sensor fusion in this particular application, along with implementation details.

ADAS use multiple technologies to obtain the context of the external environment and implement object detection. Radar, LiDAR, imaging and ultrasonic sensors around the car each offer different capabilities in terms of distance, resolution and other characteristics, and together enable robust 360-degree capture. Figure 6 shows the different sensor technologies and their uses in an autonomous car design. Imaging technologies, for example, offer the ability to view images at short and medium distances and can be used for surround view and parking assist, as well as traffic signal and sign detection.


Figure 6. Numerous sensor technologies with varying capabilities can find use in autonomous vehicle designs (courtesy Synopsys).

Traditional automotive systems implement distinct sensor and computation paths up to the point where objects are detected. Point clouds or processed images are then created and passed to a sensor fusion processing core, where other locational information is also applied. Figure 7 shows a traditional system design, from the multiple sensor technologies through signal processing, detected points and sensor fusion to the applications processor.


Figure 7. A traditional system design implements distinct computation paths, up to a point, for each sensor technology employed (courtesy Synopsys).

In sub-optimal everyday-use conditions, such as in poor weather or at night, imaging may be less able to provide useful, reliable data points for object detection and tracking than in situations with more amenable environmental characteristics. LiDAR is alternatively well suited to night conditions but is negatively impacted by poor weather. Radar is robust across varying environmental conditions, but is less precise than other sensor technologies with respect to object definition.

Front-focused sensors’ critical functions involve the detection of objects such as pedestrians and other vehicles. Such objects can rapidly change direction and path, so an autonomous vehicle must be able to quickly apply path correction and collision avoidance techniques. Front-focused sensor arrays often employ radar for object detection, and can also use LiDAR for full environment mapping. Challenging weather and environmental conditions make it necessary to cross-reference data from the imaging, radar and LiDAR systems in order to identify objects reliably and efficiently. The fusion and cross-computation of data between these various systems provides the highest-reliability view of the surrounding environment.

The trend toward cross-computation is shifting the optimal location of computation fusion within the system. Radar and LiDAR processing and perception stages, for example, can be fused into a single processing unit that provides both the LiDAR point cloud data and the radar-detected object points. Cross-referencing and cross-computing these various data points results in more accurate, higher-reliability object detection and identification. In addition to computing the radar- and LiDAR-specific algorithms, the cross-computation support enables sensor fusion capabilities. Such algorithms tend to rely heavily on linear algebra from a mathematical computation standpoint.
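
One small, simplified piece of such cross-referencing is associating radar-detected object points with LiDAR cluster centroids once both are expressed in a common vehicle frame. The Python sketch below uses gated nearest-neighbor association; the 2-D inputs and the 2 m gate are illustrative assumptions, and production systems generally use more rigorous assignment algorithms.

```python
import numpy as np

def associate_radar_lidar(radar_points, lidar_centroids, gate=2.0):
    """Greedy nearest-neighbor association of radar object points with LiDAR
    cluster centroids (both N x 2 arrays of x, y in a common vehicle frame).
    Returns (radar_index, lidar_index) pairs whose separation is within `gate`."""
    radar_points = np.asarray(radar_points, dtype=float)
    lidar_centroids = np.asarray(lidar_centroids, dtype=float)
    pairs, used = [], set()
    for i, r in enumerate(radar_points):
        distances = np.linalg.norm(lidar_centroids - r, axis=1)
        j = int(np.argmin(distances))
        if distances[j] < gate and j not in used:
            pairs.append((i, j))
            used.add(j)
    return pairs
```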

Case Study: Industrial Automation

Robots are finding increasing adoption in environments ranging from consumer to military systems, along with numerous interim application points. One of the most common usage scenarios today is in the manufacturing world, from initial piece parts acquisition through assembly line production and all the way through to finished goods warehouse inventory management. Synopsys shares its perspectives on the industrial automation opportunity for sensor fusion, in the following section.

In addition to the earlier mentioned automotive applications, industrial automation applications can also benefit from the fusion of radar, LiDAR and imaging. With the emerging wave of industrial Internet of Things implementations, individual industrial machines can operate autonomously as well as communicate with each other and their human monitors. These industrial robotic vehicles operate in a more controlled environment than do automotive systems, and are likely to perform more repetitive and limited motions. With that said, the robot still requires visibility of its surroundings as well as of other robots and human workers. The safety constraints and considerations are still evolving and depend upon the use and range of the mobile robot devices.

Environmental conditions in industrial automation are also simpler and more controlled than those in automotive applications with respect to illumination and visibility; after all, it usually does not rain, snow, or get foggy or pitch black inside factory buildings! The image sensors, often combined with LiDAR, can therefore be of lower resolution and cost. As in automotive applications, the combination of image and LiDAR processing provides a superior representation of the environment, as well as detection of objects. Robotic vehicles are often battery-powered, so reducing the total sensor count and the combined power consumption are critical objectives.

The ability to fuse the computation and processing of all sensor data, as well as other sensor fusion inputs such as Wi-Fi-determined location, into a single processing unit provides a very low-cost and low-power solution for industrial automation. Such a capability also makes it possible to offer a common foundation product that can scale across a range of small to large autonomous mobile robots in the industrial automation space (Figure 8).


Figure 8. The ability to merge the computation of all sensor data within a single processing unit allows for straightforward design of a scalable range of autonomous products (courtesy Synopsys).

One possible ISA (instruction set architecture) that covers all of these computation requirements is a unified-core VLIW (very long instruction word) architecture for highly parallel operation execution, with wide SIMD (single instruction, multiple data) computation of vector data. Such a scheme provides the fundamental architecture needed to operate on data at the required performance throughput. The ISA and data type support, as well as the register banks and load/store definitions, need to be more complex than in legacy DSP cores, since support for integer, fixed-point and complex data types is necessary (Table 1). Floating-point computation capabilities are also critical for preserving accuracy and dynamic range.

 

LiDAR / Imaging — Algorithms/computation: image computation. Architecture: wide SIMD and VLIW. Data types: integer (8-bit), fixed point (16-bit), possibly floating point (half and single precision). Compiler: C compiler and OpenCL compiler.

Linear Algebra — Algorithms/computation: Kalman filters, Cholesky decomposition, vector operations, mathematical algebra. Architecture: wide SIMD and VLIW. Data types: complex (16-bit + 16-bit), floating point (half and single precision). Compiler: C compiler.

Radar — Algorithms/computation: complex FFT filters, clustering, filtering, Kalman filters. Architecture: wide SIMD and VLIW. Data types: fixed point (16-bit), complex (16-bit + 16-bit), possibly floating point (single precision). Compiler: C compiler.

Table 1. Algorithm, data type, programming language and other requirements of various sensor fusion subsystems (courtesy Synopsys).

One of the major differences between the various usages of the common single processing core involves programming and compiler support. Imaging algorithms have traditionally been programmed in OpenCL. The fixed-point algorithms used in radar, LiDAR and controller code, however, are commonly programmed in C, hence necessitating a C compiler. A combined processor/DSP core therefore needs to offer a unified programming model across both OpenCL and C. The unified compiler has to support the syntax of both programming languages, as well as efficiently map these different syntaxes onto the common architecture and ISA of the processor.

Graham Wilson
Senior Product Marketing Manager, ARC Processors
Synopsys

Case Study: Drones

Drones (quadcopters, for example) are a popular and increasingly widespread product category used by consumers as well as in a diversity of industrial, military and other applications. Historically fully under the control of human operators on the ground, they're becoming increasingly autonomous as the cameras built into them find use not only for capturing footage of the world around them, but also for understanding and responding to their surroundings. The following section from FRAMOS explains how combining imaging with other sensing modalities can further bolster the robustness of this autonomy.

When a drone flies, it needs to know where it is in three-dimensional space at all times, across all six degrees of freedom for translation and rotation. Such pose estimation is crucial for flying without crashes or other errors. Drone developers are heavily challenged when attempting to use a single IMU or vision sensor to measure both orientation and translation in space. A hybrid approach combining IMU and vision data, conversely, improves the precision of pose estimation for drones based on the paired strengths of both measuring methods.

An IMU measures acceleration and angular rate, with information about orientation derived from its raw output data. In theory, its acceleration measurements could also be used to derive translation. However, calculating such results requires integrating the acceleration twice, a process that rapidly accumulates error. The IMU alone is therefore not an accurate source of precise location information.
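
The error growth from double integration is easy to demonstrate numerically. The Python sketch below integrates a noisy, nominally zero acceleration signal from a stationary device; the noise level and rate are illustrative assumptions, yet the position estimate still drifts well away from zero.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
dt = 0.001                                # 1 kHz IMU samples
accel = rng.normal(0.0, 0.02, 10_000)     # noise around a true acceleration of 0 m/s^2
velocity = np.cumsum(accel) * dt          # first integration
position = np.cumsum(velocity) * dt       # second integration: drift accumulates
print(f"Apparent displacement after 10 s: {position[-1]:.3f} m (true value is 0)")
```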

In contrast, the vision sensor is quite good at measuring location; it's sub-optimal at determining orientation, however. Particularly with wide-view angles and long-distance observation, it’s quite complicated for the vision system alone to measure orientation with adequate precision. A hybrid system of paired IMU and vision data can provide a more precise measurement for the full six degrees of pose in space, providing better results than using either the IMU or the vision sensor individually.

The most challenging issues in such a sensor fusion configuration are determining a common coordinate frame of reference for both orientation and translation data, and minimizing the noise produced by the sensors. A common approach to creating the reference frame leverages linear Kalman filters, which can merge both IMU and vision data for hybrid pose estimation purposes. For a vision system mounted on or embedded in the drone, SLAM (simultaneous localization and mapping) provides spatial awareness by mapping the drone's environment to ensure that it does not collide with trees, buildings, other drones or other objects.
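
Before reaching for a full Kalman filter, the basic idea of pairing the two sources can be seen in an even simpler complementary filter, sketched below in Python for a single rotation axis: the gyro prediction dominates over short horizons, while the slower, drift-free vision estimate gradually pulls the result back. The blending factor is an illustrative assumption.

```python
def complementary_update(angle, gyro_rate, vision_angle, dt, alpha=0.98):
    """Blend fast gyro integration with a slower, drift-free vision estimate
    of the same rotation angle. An alpha close to 1 trusts the gyro short-term;
    the (1 - alpha) vision term removes the accumulated gyro drift."""
    predicted = angle + gyro_rate * dt
    return alpha * predicted + (1.0 - alpha) * vision_angle
```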

Factors to Consider When Building a Hybrid Sensor-based Drone System

Several key factors influence measurement quality. First off, the quality of an IMU's measurements depends heavily on the quality of the IMU selected for the design. Inexpensive IMUs tend to generate high noise levels, which can lead to various errors and other deviations. More generally, proper calibration is necessary to characterize the sensor's noise model for use in the filter. Individual sensors, even of the same model and from the same manufacturer, will have slightly different noise patterns that require consideration.

On the vision side, the implementation specifics fundamentally depend on whether a global or rolling shutter image sensor is being used. With a global shutter image sensor, every pixel is exposed at the same time, with no consequent read-out distortion caused by object motion. With a more economical rolling shutter image sensor, conversely, distortion can occur due to read-out time differences between pixel rows. IMU information can correct for rolling shutter artifacts, historically through various filtering methods; nowadays, IMU sensor noise can also be reduced with deep learning-based processing.

Combining IMU and Vision Data

One challenge with hybrid systems is that the captured vision data arrives at a comparatively low frame rate, usually well below 100 Hz, while the IMU data arrives at high frequency, sometimes well over 1 kHz. The root of the resultant implementation problem lies in finding a way to obtain information from both systems at the exact same time. SLAM techniques such as continuous trajectory estimation can approximate the drone's movement by assuming that the drone's motion is continuous.

Developers can integrate both the IMU and vision data into an environment with a common reference frame, allowing them to assign measurements to a specific part of the continuous trajectory. Between any two image acquisitions, multiple IMU measurements provide additional reference points along this trajectory. When in the air, the drone's state estimate is constantly updated with time-synchronized IMU data, and every time a camera image is received, it corrects the accumulated IMU drift.
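
Assuming both streams carry timestamps on a common, synchronized clock, one simple way to line up the fast IMU stream with the slow camera stream is to interpolate IMU samples to each image timestamp, as in the Python sketch below; the rates and synthetic signals are illustrative assumptions.

```python
import numpy as np

def imu_at_image_times(imu_t, imu_values, image_t):
    """Linearly interpolate an IMU channel (e.g., angular rate about one axis)
    to the capture time of each camera frame. Both timestamp arrays must be
    expressed on the same, synchronized clock."""
    return np.interp(image_t, imu_t, imu_values)

# Example: 1 kHz IMU and a 30 Hz camera over one second of synthetic data
imu_t = np.linspace(0.0, 1.0, 1000)
gyro_z = np.sin(2.0 * np.pi * imu_t)
image_t = np.linspace(0.0, 1.0, 30)
gyro_at_frames = imu_at_image_times(imu_t, gyro_z, image_t)
```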

Hardware Requirements, Implementation and Testing

Considering a drone's limited integration space and generally resource-constrained embedded nature, implementing a robust setup for synchronizing the IMU and vision data is not straightforward. Lightweight, lean components with capable processing units are necessary, within a limited memory footprint and power budget. Stabilized sensors and software filtering are also essential, because the extreme vibration caused by propeller movement affects the vision components of the design.

Both the vision sensor and the IMU have individual local reference systems for measuring their own pose, and these must be calibrated so that the hybrid SLAM system has a common reference frame. Each sensor must be calibrated individually first, and then co-calibrated with respect to the other, in order to obtain positions in the common reference frame. Multiple datasets are available for testing aerial vehicle pose estimation, including hybrid approaches, with the developed pipelines. The most commonly used is the EuRoC dataset, which provides raw vision and IMU data for testing algorithms and comparing them against other methods.

Ute Häußler
Corporate Editor, Content and PR
FRAMOS

Conclusion

While visible light image sensors are often included in autonomous system designs, they're not necessarily the sole required sensor technology in all implementation situations. By combining them with other sensors, however, the resultant "fusion" configuration can deliver a robust implementation in applications such as semi- and fully-autonomous vehicles, industrial robots, drones, and other autonomous devices.

Additional Developer Assistance

The Embedded Vision Alliance® is a global partnership that brings together technology providers with end product and systems developers who are enabling innovative, practical applications of computer vision and visual AI. FRAMOS, MathWorks and Synopsys, the co-authors of this article, are members of the Embedded Vision Alliance. The Embedded Vision Alliance's mission is to provide product creators with practical education, information and insights to help them incorporate vision capabilities into new and existing products. To execute this mission, the Embedded Vision Alliance maintains a website providing tutorial articles, videos, code downloads and a discussion forum staffed by technology experts. Registered website users can also receive the Embedded Vision Alliance’s twice-monthly email newsletter, Embedded Vision Insights, among other benefits.

The Embedded Vision Alliance’s annual conference and trade show, the Embedded Vision Summit®, is coming up May 20-23, 2019 in Santa Clara, California. Intended for product creators interested in incorporating visual intelligence into electronic systems and software, the Embedded Vision Summit provides how-to presentations, inspiring keynote talks, demonstrations, and opportunities to interact with technical experts from Embedded Vision Alliance member companies. The Embedded Vision Summit is intended to inspire attendees' imaginations about the potential applications for practical computer vision technology through exciting presentations and demonstrations, to offer practical know-how for attendees to help them incorporate vision capabilities into their hardware and software products, and to provide opportunities for attendees to meet and talk with leading vision technology companies and learn about their offerings. More information, along with online registration, is now available.

The Embedded Vision Alliance also offers a free online training facility for vision-based product creators: the Embedded Vision Academy. This area of the Embedded Vision Alliance website provides in-depth technical training and other resources to help product creators integrate visual intelligence into next-generation software and systems. Course material in the Embedded Vision Academy spans a wide range of vision-related subjects, from basic vision algorithms to image pre-processing, image sensor interfaces, and software development techniques and tools such as OpenCL, OpenVX and OpenCV, along with Caffe, TensorFlow and other machine learning frameworks. Access is free to all through a simple registration process. The Embedded Vision Alliance and its member companies also periodically deliver webinars on a variety of technical topics, including various machine learning subjects. Access to on-demand archive webinars, along with information about upcoming live webinars, is available on the Alliance website.

What's the Best Way to Compare Modern CMOS Cameras?

This article was originally published at Basler's website. It is reprinted here with the permission of Basler.

For nearly every sensor model, there is a considerable number of cameras from different manufacturers in which it is used.