Software Frameworks and Toolsets for Deep Learning-based Vision Processing

This article provides both background and implementation-detailed information on software frameworks and toolsets for deep learning-based vision processing, an increasingly popular and robust alternative to classical computer vision algorithms. It covers the leading available software framework options, the root reasons for their abundance, and guidelines for selecting an optimal approach among the candidates for a particular implementation. It also covers "middleware" utilities that optimize a generic framework for use in a particular embedded implementation, comprehending factors such as applicable data types and bit widths, as well as available heterogeneous computing resources.

For developers in specific markets and applications, toolsets that incorporate deep learning techniques can provide an attractive alternative to an intermediary software framework-based development approach. And the article also introduces an industry alliance available to help product creators optimally implement deep learning-based vision processing in their hardware and software designs.

Traditionally, computer vision applications have relied on special-purpose algorithms that are painstakingly designed to recognize specific types of objects. Recently, however, CNNs (convolutional neural networks) and other deep learning approaches have been shown to be superior to traditional algorithms on a variety of image understanding tasks. In contrast to traditional algorithms, deep learning approaches are generalized learning algorithms trained through examples to recognize specific classes of objects, for example, or to estimate optical flow. Since deep learning is a comparatively new approach, however, developer expertise with it is less mature than with traditional algorithms such as those included in the OpenCV open-source computer vision library.

General-purpose deep learning software frameworks can significantly assist both in getting developers up to speed and in getting deep learning-based designs completed in a timely and robust manner, as can deep learning-based toolsets focused on specific applications. However, when using them, it's important to keep in mind that the abundance of resources that may be assumed in a framework originally intended for PC-based software development, for example, aren't likely also available in an embedded implementation. Embedded designs are also increasingly heterogeneous in nature, containing multiple computing nodes (not only a CPU but also GPU, FPGA, DSP and/or specialized co-processors); the ability to efficiently harness these parallel processing resources is beneficial from cost, performance and power consumption standpoints.

Deep Learning Framework Alternatives and Selection Criteria

The term "software framework" can mean different things to different people. In a big-picture sense, you can think of it as a software package that includes all elements necessary for the development of a particular application. Whereas a software library implements specific core functionality, such as a set of algorithms, a framework provides the additional infrastructure (drivers, a scheduler, user interfaces, a configuration parser, etc.) needed to make practical use of this core functionality. Beyond this high-level definition, any more specific characterization of the term "software framework", while potentially more concrete, intuitive and meaningful to some users, would also exclude other developers' characterizations and applications' uses.

When applied to deep learning-based vision processing, software frameworks contain different sets of elements, depending on their particular application intentions. Frameworks for designing and training DNNs (deep neural networks) provide core algorithm implementations such as convolutional layers, max pooling, loss layers, etc. In this initial respect, they're essentially a library. However, they also provide all of the necessary infrastructure to implement functions such as reading a network description file, linking core functions into a network, reading data from training and validation databases, running the network forward to generate output, computing loss, running the network backward to adapt the weights, and repeating this process as many times as is necessary to adequately train the network.
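
To make this concrete, the hedged sketch below uses the Keras API as one representative example of what a training framework automates: core layer implementations linked into a network, labeled data fed through it, loss computed, and weights adapted over repeated iterations. The toy dataset and four-layer network are placeholders chosen purely for illustration, not anything prescribed by the frameworks discussed here.

```python
# A minimal sketch of a training framework's job: define layers, read labeled data,
# run forward, compute loss, run backward, and iterate. Dataset and topology are
# illustrative placeholders only.
import numpy as np
from tensorflow import keras

# Toy stand-in for a labeled training database: 32x32 RGB images, 10 classes
x_train = np.random.rand(256, 32, 32, 3).astype("float32")
y_train = np.random.randint(0, 10, size=(256,))

# Core algorithm implementations (convolution, max pooling, etc.) linked into a network
model = keras.Sequential([
    keras.layers.Conv2D(16, 3, activation="relu", input_shape=(32, 32, 3)),
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),
])

# Framework infrastructure: loss computation, backward pass, weight updates, iteration
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=2, batch_size=32)
```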

It’s possible to also use training-tailored frameworks for inference, in conjunction with a pre-trained network, since such "running the network forward" operations are part of the training process (Figure 1). As such, it may be reasonable to use training-intended frameworks to also deploy the trained network. However, although tools for efficiently deploying DNNs in applications are often also thought of as frameworks, they only support the forward pass. For example, OpenVX with its neural network extension supports efficient deployment of DNNs but does not support training. Such frameworks provide only the forward pass components of core algorithm implementations (convolution layers, max pooling, etc.). They also provide the necessary infrastructure to link together these layers and run them in the forward direction in order to infer meaning from input images, based on previous training.


Figure 1. In deep learning inference, also known as deployment (right), a neural network analyzes new data it’s presented with, based on its previous training (left) (courtesy Synopsys).

Current examples of frameworks intended for training DNNs include Caffe, Facebook's Caffe2, Microsoft's Cognitive Toolkit, Darknet, MXNet, Google's TensorFlow, Theano, and Torch (Intel's Deep Learning Training Tool and NVIDIA's DIGITS are a special case, as they both run Caffe "under the hood"). Inference-specific frameworks include the OpenCV DNN module, Khronos' OpenVX Neural Network Extension, and various silicon vendor-specific tools. Additionally, several chip suppliers provide proprietary tools for quantizing and otherwise optimizing networks for resource-constrained embedded applications, which will be further discussed in subsequent sections of this article. Such tools are sometimes integrated into a standalone framework; other times they require (or alternatively include a custom version of) another existing framework.
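
As an illustration of the inference-only case, the sketch below uses the OpenCV DNN module mentioned above to run the forward pass of a network trained elsewhere. The file names, input size and mean values are placeholders; they depend entirely on the particular pre-trained model being deployed.

```python
# An inference-only sketch using the OpenCV DNN module: the framework loads a network
# trained elsewhere (here, with Caffe) and runs only the forward pass.
import cv2
import numpy as np

# Placeholder model files; substitute the deploy description and weights of your network
net = cv2.dnn.readNetFromCaffe("deploy.prototxt", "weights.caffemodel")

image = cv2.imread("input.jpg")
# Preprocess into the 4D blob layout the network expects (values here are illustrative)
blob = cv2.dnn.blobFromImage(image, scalefactor=1.0, size=(224, 224),
                             mean=(104.0, 117.0, 123.0))
net.setInput(blob)
scores = net.forward()            # forward pass only: no loss, no backpropagation
print("Predicted class:", int(np.argmax(scores)))
```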

Why do so many framework options exist? Focusing first on those intended for training, reasons for this diversity include the following:

  • Various alternatives were designed more or less simultaneously by different developers, ahead of the potential emergence of a single coherent and comprehensive solution.
  • Different offerings reflect different developer preferences and perspectives regarding DNNs. Caffe, for example, is in some sense closest to an application, in that a text file is commonly used to describe a network, with the framework subsequently invoked via the command line for training, testing and deployment. TensorFlow, in contrast, is closer to a language, specifically akin to Matlab with a dataflow paradigm. Meanwhile, Theano and Torch are reminiscent of a Python library.
  • Differences in capabilities also exist, such as different layer types supported by default, as well as support (or not) for integer and half-float numerical formats.

Regarding frameworks intended for efficient DNN deployment, differences between frameworks commonly reflect different design goals. OpenVX, for example, is primarily designed for portability while retaining reasonable levels of performance and power consumption. The OpenCV DNN module, in contrast, is designed first and foremost for ease of use. And of course, the various available vendor-specific tools are designed to solely support particular hardware platforms.

Finally, how can a developer select among the available software framework candidates to identify one that's optimum for a particular situation? In terms of training, for example, the decision often comes down to familiarity and personal preference. Substantial differences also exist in capabilities between the offerings, however, and these differences evolve over time; at some point, the advancements in an alternative framework may override legacy history with an otherwise preferred one.

Unfortunately there's no simple answer to the "which one's best" question. What does the developer care about most? Is it speed of training, efficiency of inference, the need to use a pre-trained network, ease of implementing custom capabilities in the framework, etc.? For each of these criteria, differences among frameworks exist both in capabilities offered and in personal preference (choice of language, etc.). With all that said, a good rule of thumb is to travel on well-worn paths. Find out what frameworks other people are already using in applications as similar as possible to yours. Then, when you inevitably run into problems, your odds of finding a documented solution are much better.

Addressing Additional Challenges

After evaluating the tradeoffs of various training frameworks and selecting one for your project, several other key design decisions must also be made in order to implement an efficient embedded deep learning solution. These include:

  • Finding or developing an appropriate training dataset
  • Selecting a suitable vision processor (or heterogeneous multi-processor) for the system
  • Designing an effective network model topology appropriate for the available compute resources
  • Implementing a run-time that optimizes any available hardware acceleration in the SoC

Of course, engineering teams must also overcome these development challenges within the constraints of time, cost, and available skill sets.

One common starting point for new developers involves the use of an example project associated with one of the training frameworks. In these tutorials, developers are typically guided through a series of DIY exercises to train a preconfigured CNN for one of the common image classification problems such as MNIST, CIFAR-10 or ImageNet. The result is a well-behaved neural net that operates predictably on a computer. Unfortunately, at this point the usefulness of the tutorials usually begins to diminish, since it’s then left as an "exercise for the reader" to figure out how to adapt and optimize these example datasets, network topologies and PC-class inference models to solve other vision challenges and ultimately deploy a working solution on an embedded device.

The deep learning aspect of such a project will typically comprise six distinct stages (Figure 2). The first four take place on a computer (for development), with the latter two located on the target (for deployment):

  1. Dataset creation, curation and augmentation
  2. Network design
  3. Network training
  4. Model validation and optimization
  5. Runtime inference accuracy and performance tuning
  6. Provisioning for manufacturing and field updates


Figure 2. The first five (of six total) stages of a typical deep learning project are frequently iterated multiple times in striving for an optimal implementation (courtesy Au-Zone Technologies).

Development teams may find themselves iterating steps 1-5 many times in searching for an optimal balance between network size, model accuracy and runtime inference performance on the processor(s) of choice. For developers considering deployment of deep learning vision solutions on standard SoCs, development tools such as Au-Zone Technologies' DeepView ML Toolkit and Run-Time Inference Engine are helpful in addressing the various challenges faced at each of these developmental stages (see sidebar "Leveraging GPU Acceleration for Deep Learning Development and Deployment") (Figure 3).


Figure 3. The DeepView Machine Learning Toolkit provides various facilities useful in addressing challenges faced in both deep learning development and deployment (courtesy Au-Zone Technologies).

Framework Optimizations for DSP Acceleration

In comparison to the abundant compute and memory resources available in a PC, an embedded vision system must offer performance sufficient for target applications, but at greatly reduced power consumption and die area. Embedded vision applications therefore greatly benefit from the availability of highly optimized heterogeneous SoCs containing multiple parallel processing units, each optimized for specific tasks. Synopsys' DesignWare EV6x family, for example, integrates a scalar unit for control, a vector unit for pixel processing, and an optional dedicated CNN engine for executing deep learning networks (Figure 4).


Figure 4. Modern SoCs, as well as the cores within them, contain multiple heterogeneous processing elements suitable for accelerating various aspects of deep learning algorithms (courtesy Synopsys).

Embedded vision system designers have much to consider when leveraging a software framework for training a CNN graph. They must pay attention to the bit resolution of the CNN calculations, consider all possible hardware optimizations during training, and evaluate how best to take advantage of available coefficient and feature map pruning and compression techniques. If silicon area (translating to SoC cost) isn’t a concern, an embedded vision processor might directly use the native 32-bit floating-point outputs of PC-tailored software frameworks. However, such complex data types demand large MACs (multiply-accumulator units), sizeable memory for storage, and high transfer bandwidth. All of these factors adversely affect the SoC and system power consumption and area budgets. The ideal goal, therefore, is to use the smallest possible bit resolution without adversely degrading the accuracy of the original trained CNN graph.

Based on careful analysis of popular graphs, Synopsys has determined that CNN calculations on common classification graphs currently deliver acceptable accuracy down to 10-bit integer precision in many cases (Figure 5). The EV6x vision processor's CNN engine therefore supports highly optimized 12-bit multiplication operations. Caffe framework-sourced graphs utilizing 32-bit floating-point outputs can, by using vendor-supplied conversion utilities, be mapped to the EV6x 12-bit CNN architecture without need for retraining and with little to no loss in accuracy. Such mapping tools convert the coefficients and graphs output by the software framework during initial training into formats recognized by the embedded vision system for deployment purposes. Automated capabilities like these are important when already-trained graphs are available and retraining is undesirable.
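
The sketch below illustrates, in heavily simplified form, the kind of conversion such a mapping utility performs: scaling 32-bit floating-point coefficients into a reduced-precision signed-integer representation. It is not Synopsys' actual tool flow; real utilities add per-layer calibration, feature-map handling and other refinements beyond this basic idea.

```python
# A simplified post-training quantization sketch: map float32 weights to n-bit signed
# integers with a single per-tensor scale factor. Illustrative only.
import numpy as np

def quantize_symmetric(weights: np.ndarray, bits: int = 12):
    """Map float32 weights to signed integers of the given bit width."""
    qmax = 2 ** (bits - 1) - 1                  # e.g., 2047 for 12-bit signed
    scale = np.max(np.abs(weights)) / qmax      # one scale factor per tensor
    q = np.round(weights / scale).astype(np.int32)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(64, 3, 3, 3).astype(np.float32)   # illustrative conv-layer weights
q, scale = quantize_symmetric(w, bits=12)
error = np.mean(np.abs(w - dequantize(q, scale)))
print(f"mean absolute quantization error: {error:.6f}")
```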


Figure 5. An analysis of CNNs on common classification graphs suggests that they retain high accuracy down to at least 10-bit calculation precision (courtesy Synopsys).

Encouragingly, software framework developers are beginning to pay closer attention to the needs of not only PCs but also embedded systems. In the future, therefore, it will likely be possible to directly train (and retrain) graphs for specific integer bit resolutions; 8-bit and even lower-resolution multiplications will further save cost, power consumption and bandwidth.

Framework Optimizations for FPGA Acceleration

Heterogeneous SoCs that combine high performance processors and programmable logic are also finding increasing use in embedded vision systems (Figure 6). Such devices leverage programmable logic's highly parallel architecture in implementing high-performance image processing pipelines, with the processor subsystem managing high-level functions such as system monitoring, user interfaces and communications. The CPU-plus-FPGA combination delivers a flexible and responsive system solution.


Figure 6. GPUs (left) and FPGA fabric (right) are two common methods of accelerating portions of deep learning functions otherwise handled by a CPU (courtesy Xilinx).

To gain maximum benefit from such a heterogeneous SoC, the user needs to be able to leverage industry-standard frameworks such as Caffe for machine learning, as well as OpenVX and OpenCV for image processing. Effective development therefore requires a tool chain that not only supports these industry standards but also enables straightforward allocation (and dynamic reallocation) of functionality between the programmable logic and the processor subsystems. Such a system-optimizing compiler uses high-level synthesis (HLS) to create the logic implementation, along with a connectivity framework to integrate it with the processor. The compiler also supports development with high-level languages such as C, C++ and OpenCL.

Initial development involves implementing the algorithm solely targeting the processor. Once algorithm functionality is deemed acceptable, the next stage in the process is to identify performance bottlenecks via application profiling. Leveraging the system-optimizing compiler to migrate functions into the programmable logic is a means of relieving these bottlenecks, an approach which can also reduce power consumption.

In order to effectively accomplish this migration, the system-optimizing compiler requires the availability of predefined implementation libraries suitable for HLS, image processing, and machine learning. Some toolchains refer to such libraries as middleware. In the case of machine learning within embedded vision applications, both predefined implementations supporting machine learning inference and the ability to accelerate OpenCV functions are required. Xilinx's reVISION stack, for example, provides developers with both Caffe integration capabilities and a range of acceleration-capable OpenCV functions (including the OpenVX core functions).

reVISION's integration with Caffe for implementing machine learning inference engines is as straightforward as providing a prototxt file and the trained weights; Xilinx's toolset handles the rest of the process (Figure 7). The prototxt file is used to configure the C/C++ scheduler running on the SoC's processor subsystem, in combination with hardware-optimized libraries within the programmable logic that accelerate the neural network inference. Specifically, the programmable logic implements functions such as convolution, ReLU and pooling. reVISION's integration with industry-standard embedded vision and machine learning frameworks and libraries provides development teams with programmable logic's benefits without the need to delve deep into the logic design.


Figure 7. Accelerating Caffe-based neural network inference in programmable logic is a straightforward process, thanks to reVISION stack toolset capabilities (courtesy Xilinx).

Deep Learning-based Application Software

Industry 4.0, an "umbrella" term for the diversity of universally connected and highly automated processes that determine everyday routines in modern production enterprises, is one example of a mainstream computer vision application area that has attracted widespread developer attention. And deep learning-based technologies, resulting in autonomous and self-adaptive production systems, are becoming increasingly influential in Industry 4.0. While it's certainly possible to develop applications for Industry 4.0 using foundation software frameworks, as discussed elsewhere in this article, a mature, high-volume market such as this one is also served by deep learning-based application software whose "out of box" attributes reduce complexity.

MVTec's machine vision software products such as HALCON are one example. The company's software solutions run both on standard PC-based hardware platforms and on ARM-based processor platforms, such as Android and iOS smartphones and tablets and industry-standard smart cameras. In general, they do not require customer-specific modifications or complex customization. Customers can therefore take advantage of deep learning without having any specialized expertise in underlying technologies, and the entire rapid-prototyping development, testing and evaluation process runs in the company's interactive HDevelop programming environment.

Optical character recognition (OCR) is one specific application of deep learning. In a basic office environment, OCR is used to recognize text in scanned paper documents, extracting and digitally reusing the content. However, industrial use scenarios impose much stricter demands on OCR applications. Such systems must be able to read letter and/or number combinations printed or stamped onto objects, for example. The corresponding piece parts and end products can then be reliably identified, classified and tracked. HALCON employs advanced functions and classification techniques that enable a wide range of characters to be accurately recognized even in challenging conditions, thus addressing the stringent requirements that a robust solution must meet in industrial environments.

In environments such as these, text not only needs to be identified without errors under varied lighting conditions and across a wide range of fonts, it must also be accurately recognized even when distorted due to tilting or smudged due to print defects. Furthermore, text to be recognized may be blurry, printed onto or etched into reflective surfaces, or set against highly textured color backgrounds. With the help of deep learning technologies, OCR accuracy can be improved significantly. By utilizing a standard software solution like MVTec's HALCON, users are unburdened from the complex and expensive training process. After all, huge amounts of data are generated during training, and hundreds of thousands of images are required for each class, all of which have to be labeled.

Conclusion

Vision technology is enabling a wide range of products that are more intelligent and responsive than before, and thus more valuable to users. Vision processing can add valuable capabilities to existing products, and can provide significant new markets for hardware, software and semiconductor suppliers (see sidebar "Additional Developer Assistance"). Deep learning-based vision processing is an increasingly popular and robust alternative to classical computer vision algorithms; conversion, partitioning, evaluation and optimization toolsets enable efficient retargeting of originally PC-tailored deep learning software frameworks for embedded vision implementations. These frameworks will steadily become more inherently embedded-friendly in the future, and applications that incorporate deep learning techniques will continue to be an attractive alternative approach for vision developers in specific markets.

By Brian Dipert
Editor-in-Chief, Embedded Vision Alliance

Brad Scott
President, Au-Zone Technologies

Amit Shoham
Distinguished Engineer, BDTI

Johannes Hiltner
Product Manager for HALCON Embedded, MVTec Software GmbH

Gordon Cooper
Product Marketing Manager for Embedded Vision Processors, Synopsys

Giles Peckham
Regional Marketing Director, Xilinx

Sidebar: Leveraging GPU Acceleration for Deep Learning Development and Deployment

The design example that follows leverages the GPU core in a SoC to accelerate portions of deep learning algorithms which otherwise run on the CPU core. The approach discussed is an increasingly common one for heterogeneous computing in embedded vision, given the prevalence of robust graphics subsystems in modern application processors. These same techniques and methods also apply to the custom hardware acceleration blocks available in many modern SoCs.

Dataset Creation, Curation and Augmentation

A fundamental requirement for deep learning implementations is to source or generate two independent datasets: one suitable for network training, and the other to evaluate the effectiveness of the training. To ensure that the trained model is accurate, efficient and robust, the training dataset must be of significant size; it's often on the order of hundreds of thousands of labeled and grouped samples. One widely known public dataset, ImageNet, encompasses 1.5 million images across 1,000 discrete categories or classes, for example.

Creating large datasets is a time-consuming and error-prone exercise. Avoid these common pitfalls in order to ensure efficient and accurate training:

  1. Avoid incorrect annotation labels. This goal is much harder to achieve than might seem to be the case at first glance, due to inevitable human interaction with the high volume of data. It's unfortunately quite common to find errors in public datasets. Using advanced visualization and automated inspection tools greatly helps in improving dataset quality (Figure A).
  2. Make sure that the dataset represents the true diversity of expected inputs. For example, imagine training a neural network to classify images of electronic components on a circuit board. If you’ve trained it only with images of components on green circuit boards, it may fail when presented with an image of a component on a brown circuit board. Similarly, if all images of diodes in the training set happen to also have a capacitor partly visible at the edge of the image, the network may inadvertently learn to associate the capacitor with diodes, and fail to classify a diode when a capacitor is not also visible.
  3. In many cases, it makes sense to generate image samples from video as a means of quickly populating datasets. However, in doing so you must take great care to avoid reusing annotations from a common video sequence for both the training and testing databases (see the sketch following this list). Such a mistake could lead to high training scores that can't be replicated by real-life implementations.
  4. Dataset creation should be an iterative process. You can greatly improve the trained model if you inspect the error distribution and optimize the training dataset when you find that certain classes are underrepresented or misclassified. Keeping dataset creation in the development loop allows for a better overall solution.
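
The following sketch, in plain Python with hypothetical file names, illustrates the split-by-source-video precaution from item 3: frames are grouped by the video they came from, and entire videos (rather than individual frames) are assigned to either the training or the test set.

```python
# Split video-derived frames so no source video contributes to both training and test
# sets, avoiding the leakage pitfall described above. Records are hypothetical.
import random
from collections import defaultdict

# Hypothetical annotation records: (frame_path, label, source_video_id)
annotations = [
    ("vid01/frame_0001.png", "diode", "vid01"),
    ("vid01/frame_0002.png", "diode", "vid01"),
    ("vid02/frame_0001.png", "capacitor", "vid02"),
    ("vid03/frame_0001.png", "resistor", "vid03"),
]

# Group frames by their source video, then split at the video level, not the frame level
by_video = defaultdict(list)
for record in annotations:
    by_video[record[2]].append(record)

videos = list(by_video)
random.shuffle(videos)
split = int(0.8 * len(videos))
train = [r for v in videos[:split] for r in by_video[v]]
test = [r for v in videos[split:] for r in by_video[v]]
print(len(train), "training frames;", len(test), "test frames; no shared videos")
```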


Figure A. DeepView's Dataset Curator Workspace enables visual inspection to ensure robustness without redundancy (courtesy Au-Zone Technologies).

For image classification implementations, in addition to supplying a sufficient number of samples, you should ensure that the dataset accurately represents information as captured by the end hardware platform in the field. As such, you need to comprehend the inevitable noise and other sources of variability and error that will be introduced into the data stream when devices are deployed into the real world. Randomly introducing image augmentation into the training sample set is one possible technique for increasing training data volume while improving the network's robustness, i.e. ensuring that the network is trained effectively and efficiently (Figure B).


Figure B. Random image augmentation can enhance not only the training sample set size but also its effectiveness (courtesy Au-Zone Technologies).

The types of augmentation techniques employed, along with the range of parameters used, both require adaptation for each application. Operations that make sense for solving some problems may degrade results in others. One simple example of this divergence involves horizontally flipping images; doing so might improve training for vehicle classification, but it wouldn’t make sense for traffic sign classification where numbers would then be incorrectly reversed in training.

Datasets are often created with images that tend to be very uniformly cropped, and with the objects of interest neatly centered. Images in real-world applications, on the other hand, may not be captured in such an ideal way, resulting in much greater variance in the position of objects. Adding randomized cropping augmentation can help the neural network generalize to the varied real-world conditions that it will encounter in the deployed application.
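
The sketch below illustrates the two augmentation operations just discussed, randomized cropping and optional horizontal flipping, using OpenCV and NumPy. The parameter values and file name are placeholders that must be tuned per application, and flipping should be disabled for orientation-sensitive classes such as traffic signs.

```python
# Randomized crop plus optional horizontal flip, as described in the text.
# File name and parameter ranges are illustrative placeholders.
import cv2
import numpy as np

def augment(image: np.ndarray, allow_flip: bool = True, crop_fraction: float = 0.9):
    h, w = image.shape[:2]
    # Randomized crop: keep crop_fraction of each dimension at a random offset
    ch, cw = int(h * crop_fraction), int(w * crop_fraction)
    y0 = np.random.randint(0, h - ch + 1)
    x0 = np.random.randint(0, w - cw + 1)
    out = image[y0:y0 + ch, x0:x0 + cw]
    out = cv2.resize(out, (w, h))
    # Optional horizontal flip, skipped for orientation-sensitive classes
    if allow_flip and np.random.rand() < 0.5:
        out = cv2.flip(out, 1)
    return out

sample = cv2.imread("diode.png")                  # placeholder input image
augmented = [augment(sample) for _ in range(8)]   # expand the training sample set
```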

Network Design

Decades of research in the field of artificial neural networks have resulted in many different network classes, each with a variety of implementations (and variations of each) optimized for a diverse range of applications and performance objectives. Most of these neural networks have been developed with the particular objective of improving inference accuracy, and they have typically made the assumption that run time inference will be performed on a server- or desktop-class computer. Depending on the classification and/or other problem(s) you need to solve for your embedded vision project, exploring and selecting among potential network topologies (or alternatively designing your own) can therefore be a time consuming and otherwise challenging exercise.

Understanding which of these networks provides the "best" result within the constraints of the compute, dataflow and memory footprint resources available on your target adds a whole new dimension to the complexity of the problem. Determining how to "zero in" on an appropriate class of network, followed by a specific topology within that class, can rapidly become a time-consuming endeavor, especially if you're attempting to do anything other than solve a "conventional" deep learning image classification problem. Even when using a standard network for conventional image classification, many system considerations bear attention:

  • The image resolution of the input layer can significantly impact the network design
  • Will your input be single-channel (monochromatic) or multi-channel (RGB, YUV)? This is a particularly important consideration if you’re going to attempt transfer learning (to be further discussed shortly), since you’ll start with a network that was either pre-trained with color or monochrome data, and there’s no simple way to convert that pre-trained network from one format to another. On the other hand, if you’re going to train entirely from scratch, it’s relatively easy to modify a network topology to use a different number of input channels, so you can just take your network of choice and apply it to your application’s image format.
  • Ensure that the dataset format matches what you’ll be using on your target
  • Are pre-trained models compatible with your dataset, and is transfer learning an option?

When developing application-specific CNNs intended for deployment on embedded hardware platforms, it’s often once again very challenging to know where to begin. Leveraging popular network topologies such as ResNet and Inception will often lead to very accurate results in training and validation, but will often also require the compute resources of a small server to obtain reasonable inference times. As with any design optimization problem, knowing roughly where to begin, obtaining direct feedback on key performance indicators during the design process, and profiling on target hardware to enable rapid design iterations are all key factors to quickly converging on a deployable solution.

When designing a network to suit your specific product requirements, some of the key network design parameters that you will need to evaluate include:

  • Overall accuracy when validated both with test data and live data
  • Model size: number of layers, weights, bits/weight, MACs/image, total memory footprint/image, etc. (see the sketch following this list)
  • The distribution of inference compute time across network layers (scene graph nodes)
  • On-target inference time
  • Various other optimization opportunities
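
As a small example of estimating the model-size parameters listed above, the hedged sketch below counts the weights of a candidate Keras network and converts that count into an approximate memory footprint; MACs/image and per-layer timing require framework- or target-specific profiling tools instead.

```python
# Estimate weight count and approximate memory footprint for a candidate network.
# The topology here is an arbitrary illustrative example.
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Conv2D(16, 3, activation="relu", input_shape=(96, 96, 3)),
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),
])

weights = model.count_params()
print(f"{weights} weights, ~{weights * 4 / 1e6:.1f} MB as float32, "
      f"~{weights * 2 / 1e6:.1f} MB as half-float")
```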

The Network Designer in the DeepView ML Toolkit allows users to select and adapt from preexisting templates for common network topologies, as well as to quickly create and explore new network design topologies (Figure C). With more than 15 node types supported, the tool enables quick and easy creation and configuration of scene graph representations of deep neural networks for training.


Figure C. The DeepView Network Design Workspace supports both customization of predefined network topologies and the creation of new scene graphs (courtesy Au-Zone Technologies).

Network Training

Training a network can be a tedious and repetitive process, with iteration necessary each time the network architecture is modified or the dataset is altered. The time required to train a model is directly impacted by the complexity of the network and the dataset size, and typically ranges from a few minutes to multiple days. Monitoring the loss value and a graph of the accuracy at each epoch helps developers to visualize the training session's efficiency.

Obtaining a training trend early in the process allows developers to save time by aborting training sessions that are not training properly (Figure D). Archiving training graphs for different training sessions is also a great way of analyzing the impact of dataset alterations, network modifications and training parameter adjustments.
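
The sketch below shows one way to automate this monitoring, using the Keras callbacks API as a representative example (other frameworks provide equivalent hooks): training is aborted when validation accuracy stops improving, and per-epoch metrics are archived for later comparison between sessions. The tiny model and random data exist only to keep the example self-contained.

```python
# Abort unproductive training sessions early and archive per-epoch metrics.
# Model and data are trivial stand-ins for illustration.
import numpy as np
from tensorflow import keras

x = np.random.rand(512, 32).astype("float32")
y = np.random.randint(0, 2, size=(512,))

model = keras.Sequential([keras.layers.Dense(16, activation="relu", input_shape=(32,)),
                          keras.layers.Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

callbacks = [
    # Abort if validation accuracy stops improving, saving wasted training time
    keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=5,
                                  restore_best_weights=True),
    # Archive per-epoch loss/accuracy so separate training sessions can be compared later
    keras.callbacks.CSVLogger("training_run_01.csv"),
]
model.fit(x, y, validation_split=0.2, epochs=50, callbacks=callbacks, verbose=0)
```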


Figure D. Visually monitoring accuracy helps developers easily assess any particular training session's effectiveness (courtesy Au-Zone Technologies).

Transfer learning is a powerful method for optimizing network training. It addresses the problem a developer would otherwise face with a dataset that's too small to properly train a rich set of parameters. By using transfer learning, you're leveraging an existing network trained on a similar problem to solve a new problem. For example, you can leverage a network trained on the very general (and very large) ImageNet dataset to specifically classify types of furniture with much less training time and effort than would otherwise be needed.

By importing a model already trained on a large dataset and freezing its earlier layers, a developer can then re-train the later network layers against a significantly smaller dataset, targeting the specific problem to be solved. Note, however, that such early-layer freezing isn't always the optimum approach; in some applications you might obtain better results by allowing the earlier network to learn the features of the new application.

And while dataset size reduction is one key advantage of transfer learning, another critical advantage is the potential reduction in training time. When training a network "from scratch," it can take a very long time to converge on a set of weights that delivers high accuracy. Via transfer learning, in summary, you can (depending on the application) use a smaller dataset, train for fewer iterations, and/or reduce training time by training only the last few layers of the network.
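
A minimal transfer-learning sketch follows, using a Keras ImageNet-pre-trained model as one example: the early feature-extraction layers are frozen and only a new classification head is trained on the smaller, application-specific dataset. The ten-class furniture head is an illustrative assumption taken from the text above.

```python
# Transfer learning sketch: freeze an ImageNet-trained feature extractor, retrain only
# a new classification head on a smaller application-specific dataset.
from tensorflow import keras

base = keras.applications.MobileNetV2(weights="imagenet", include_top=False,
                                      input_shape=(224, 224, 3), pooling="avg")
base.trainable = False            # freeze the ImageNet-trained feature extractor

model = keras.Sequential([
    base,
    keras.layers.Dense(10, activation="softmax"),   # new head, e.g. 10 furniture classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(...) now needs far fewer labeled images and epochs than training from scratch
```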

Model Validation and Optimization

Obtaining sufficiently high accuracy on the testing dataset is a leading indicator of the network performance for a given application. However, limiting the result analysis to global score monitoring isn’t sufficient. In-depth analysis of the results is essential to understand how the network currently behaves and how to make it perform better.

Building a validation matrix is a solid starting point to visualize the error distribution among classes (Figure E). Filtering validation results is also an effective way to investigate the dataset entries that perform poorly, as well as to understand error validity and identify pathways to resolution.


Figure E. Graphically analyzing the error distribution among classes, along with filtering validation results, enables evaluation of dataset entries that perform poorly (courtesy Au-Zone Technologies).

Many applications can also benefit from hierarchically ordering the classification labels used for analyzing the groups' accuracy. A distracted driving application containing 1 safe class and 9 unsafe classes, for example, could have mediocre overall classification accuracy but still be considered sufficient if the "safe" versus "unsafe" differentiation performs well.
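
The sketch below illustrates both ideas with scikit-learn: building a confusion (validation) matrix to expose the per-class error distribution, and then collapsing the fine-grained labels into a coarser "safe" versus "unsafe" grouping, as in the distracted-driving example. The labels are randomly generated stand-ins for real validation results.

```python
# Build a confusion matrix, then group fine-grained classes into a binary decision.
# Labels are random placeholders for real validation output.
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

num_classes = 10                      # class 0 = safe driving, classes 1-9 = unsafe
y_true = np.random.randint(0, num_classes, size=1000)
y_pred = np.random.randint(0, num_classes, size=1000)

cm = confusion_matrix(y_true, y_pred)             # per-class error distribution
print("per-class accuracy:", np.diag(cm) / cm.sum(axis=1))

# Hierarchical grouping: collapse 10 classes into a binary safe/unsafe decision
safe_true = (y_true == 0).astype(int)
safe_pred = (y_pred == 0).astype(int)
print("safe-vs-unsafe accuracy:", accuracy_score(safe_true, safe_pred))
```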

Runtime Inference Accuracy and Performance Tuning

As the design and training activities begin to converge to acceptable levels in the development environment, target runtime inference optimization next requires consideration. The deep learning training frameworks discussed in the main article provide a key aspect of the overall solution, but leave the problem of implementing and optimizing the runtime to the developer. While general-purpose runtime implementations exist, they frequently do a subpar job of addressing several important aspects of deployment:

  1. Independence between the network model and runtime inference engine
    Separation of these two items enables independent optimization of the engine for each supported processor architecture option. Compute elements within each unique SoC, such as the GPU, vision processor, memory interfaces and other proprietary IP, can be fully exploited without concern for the model that will be deployed on them.
  2. The ability to accommodate NNEF-based models
    Such a capability allows for models created with frameworks not directly supported by tools such as DeepView to be alternatively imported using an industry-standard exchange format.
  3. Support for multiple, preloaded instantiations
    Enabling multiple networks on a single device via fast context switching is desirable when that device is required to perform multiple deep learning tasks but does not have the capacity to perform them concurrently.
  4. Portability between devices
    Support for any OpenCL 1.2-capable device enables the DeepView Inference Engine (for example) to be highly portable, easily integrated into both existing and new runtime environments with minimal effort. Such flexibility enables straightforward device benchmarking and comparison during the hardware-vetting process.
  5. Development tool integration
    The ability to quickly and easily profile runtime performance, validate accuracy, visualize results and return to network design for refinement becomes extremely helpful when iterating on final design details.

In applications where speed and/or power consumption are critical, optimization considerations for these parameters should be comprehended in the network design and training early in the process. Once you have a dataset and an initial network design that trains with reasonable accuracy, you can then explore tradeoffs in accuracy vs. # of MACs, weights, types of activation layers used, etc., tuning these parameters for the target architecture.

Provisioning For Manufacturing and Field Updates

When deploying working firmware to the field, numerous steps require consideration in order to ensure integrity and security at the end devices. Neural network model updates present additional challenges to both the developer and system OEM. Depending on the topology of the network required for your application, for example, trained models range from hundreds of thousands to many millions of parameters. When represented as half-floats, these models typically range from tens to hundreds of MBytes in size. And if the device needs to support multiple networks for different use cases or modes, the required model footprint further expands.

For all but the most trivial network examples, therefore, managing over-the-air updates quickly becomes unwieldy, time-consuming and costly, especially in the absence of a compression strategy. Standard techniques for managing embedded system firmware and binary image updates also don’t work well with network models, for three primary reasons:

  1. When models are updated, it’s "all or nothing". No package update equivalent currently exists to enable replacement of only a single layer or node in the network.
  2. All but the most trivial model re-training results in an incremental differential file that is equivalent in size to the original file.
  3. Lossless compression provides very little benefit for typical neural network models, given the highly random nature of the source data.

Fortunately, neural networks are relatively tolerant of noise, so lossy compression techniques can provide significant advantages. Figure F demonstrates the impact that lossy compression has on inference accuracy for four different CNNs implemented using DeepView. Compression ratios greater than 80% are easily achievable for most models, with minimal degradation in accuracy. And with further adjustment to the network topology and parameter representations, compression ratios exceeding 90% are realistically achievable for practical, real-world network models.
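
The simplified sketch below demonstrates why lossy techniques help, using synthetic random weights rather than a real trained model: lossless compression alone recovers little, while an 8-bit lossy quantization step (which trained CNNs often tolerate, per the discussion above) immediately yields roughly a 75 percent reduction before any further coding.

```python
# Compare lossless-only compression of float32 weights against lossy 8-bit quantization
# followed by lossless coding. Weights here are synthetic, for illustration only.
import zlib
import numpy as np

weights = np.random.randn(1_000_000).astype(np.float32)        # ~4 MB of parameters
raw = weights.tobytes()

lossless = zlib.compress(raw, 9)
print(f"lossless only: {100 * (1 - len(lossless) / len(raw)):.1f}% saved")

# Lossy: store an 8-bit code per weight plus one float32 scale factor
scale = np.max(np.abs(weights)) / 127.0
q = np.round(weights / scale).astype(np.int8)
lossy = zlib.compress(q.tobytes(), 9)
print(f"8-bit quantization + lossless: {100 * (1 - len(lossy) / len(raw)):.1f}% saved")
```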

 


Figure F. Deep learning models can retain high accuracy even at high degrees of lossy compression (courtesy Au-Zone Technologies).

A trained network model requires a significant investment in engineering time, encompassing the effort invested in assembling a dataset, designing the neural network, and training and validating it. When developing a model for a commercial product, protecting the model on the target and ensuring its authenticity are critical requirements. DeepView, for example, has addressed these concerns by providing a fully integrated certificate management system. The toolkit provides both graphical and command line interface options, along with both C- and Python-based APIs, for integration with 3rd-party infrastructure. Such a system ensures model authenticity as well as security from IP theft attempts.

Sidebar: Additional Developer Assistance

The Embedded Vision Alliance, a worldwide organization of technology developers and providers, is working to empower product creators to transform the potential of vision processing into reality. Au-Zone Technologies, BDTI, MVTec, Synopsys and Xilinx, the co-authors of this article, are members of the Embedded Vision Alliance. The Embedded Vision Alliance's mission is to provide product creators with practical education, information and insights to help them incorporate vision capabilities into new and existing products. To execute this mission, the Embedded Vision Alliance maintains a website providing tutorial articles, videos, code downloads and a discussion forum staffed by technology experts. Registered website users can also receive the Embedded Vision Alliance’s twice-monthly email newsletter, Embedded Vision Insights, among other benefits.

The Embedded Vision Alliance also offers a free online training facility for vision-based product creators: the Embedded Vision Academy. This area of the Embedded Vision Alliance website provides in-depth technical training and other resources to help product creators integrate visual intelligence into next-generation software and systems. Course material in the Embedded Vision Academy spans a wide range of vision-related subjects, from basic vision algorithms to image pre-processing, image sensor interfaces, and software development techniques and tools such as OpenCL, OpenVX and OpenCV, along with Caffe, TensorFlow and other deep learning frameworks. Access is free to all through a simple registration process.

The Embedded Vision Alliance and its member companies periodically deliver webinars on a variety of technical topics. Access to on-demand archive webinars, along with information about upcoming live webinars, is available on the Alliance website. Also, the Embedded Vision Alliance has begun offering "Deep Learning for Computer Vision with TensorFlow," a full-day technical training class planned for a variety of both U.S. and international locations. See the Alliance website for additional information and online registration.

The Embedded Vision Alliance’s annual technical conference and trade show, the Embedded Vision Summit, is intended for product creators interested in incorporating visual intelligence into electronic systems and software. The Embedded Vision Summit provides how-to presentations, inspiring keynote talks, demonstrations, and opportunities to interact with technical experts from Embedded Vision Alliance member companies. The Embedded Vision Summit is intended to inspire attendees' imaginations about the potential applications for practical computer vision technology through exciting presentations and demonstrations, to offer practical know-how for attendees to help them incorporate vision capabilities into their hardware and software products, and to provide opportunities for attendees to meet and talk with leading vision technology companies and learn about their offerings.

The most recent Embedded Vision Summit took place in Santa Clara, California on May 1-3, 2017; a slide set along with both demonstration and presentation videos from the event are now in the process of being published on the Alliance website. The next Embedded Vision Summit is scheduled for May 22-24, 2018, again in Santa Clara, California; mark your calendars and plan to attend.

Cloud-versus-Edge and Centralized-versus-Distributed: Evaluating Vision Processing Alternatives

Although incorporating visual intelligence in your next product is an increasingly beneficial (not to mention practically feasible) decision, how to best implement this intelligence is less obvious. Image processing can optionally take place completely within the edge device, in a network-connected cloud server, or subdivided among these locations. And at the edge, centralized and distributed processing architecture alternatives require additional consideration. This article reviews these different architectural approaches and evaluates their strengths, weaknesses, and requirements. It also introduces an industry alliance available to help product creators optimally implement vision processing in their hardware and software designs.

The overall benefits of incorporating vision processing in products are increasingly understood and adopted by the technology industry, with the plethora of vision-inclusive announcements at this year's Consumer Electronics Show and Embedded Vision Summit only the latest tangible examples of the trend. However, where the vision processing should best take place is less clear-cut. In some cases, for responsiveness or other reasons, it's optimally tackled at the device, a location that's the only available option for a standalone system. In others, where rapidity of results is less important than richness or where device cost minimization is paramount, such processing might best occur at a network-connected server.

In yet other design scenarios, a hybrid scheme, with portions of the overall processing load subdivided between client and cloud (along with, potentially, intermediate locations between these two end points) makes most sense. And in situations where vision processing takes place either in part or in full at the device, additional system architecture analysis is required; should a centralized, unified intelligence "nexus" be the favored approach, or is a distributed multi-node processing method preferable (see sidebar "Centralized vs Distributed Vision Processing")? Fortunately, an industry alliance, offering a variety of technical resources, is available to help product creators optimally implement vision processing in hardware and software designs.

Vision Processing at the Device

Many embedded vision systems find use in applications that require semi- to fully autonomous operation. Such applications demand a real-time, low-latency, always-available decision loop and are therefore unable to rely on cloud connections in order to perform the required processing. Instead, such an embedded vision system must be capable of fully implementing its vision (and broader mission) processing within the device. Such an approach is also often referred to as processing at the edge. Typical applications that require processing at the edge include ADAS (advanced driver assistance systems)-equipped cars and drones, both of which must be capable of navigation and collision avoidance in the process of performing a mission (Figure 1).


Figure 1. Autonomous drones intended for agriculture and other applications are an example of vision processing done completely at the device, i.e. the edge (courtesy Xilinx).

Although vision processing takes place fully at the edge in such applications, this doesn't necessarily mean that the device is completely independent of the cloud. Periodic communication with a server may still occur in order to provide the client with location and mission updates, for instance, or to transfer completed-mission data from the client to the cloud for subsequent analysis, distribution and archiving. One example of such a scenario would be an agricultural drone, which uses hyperspectral imaging to map and classify agricultural land. In this particular application, while the drone can autonomously perform the survey, it will still periodically transmit collected data back to the cloud.

More generally, many of these applications employ algorithms that leverage machine learning and neural networks, which are used to create (i.e., train) the classifiers used by the algorithms. Since real-time generation of these classifiers is not required, and since such training demands significant processing capabilities, it commonly takes place in advance using cloud-based hardware. Subsequent real-time inference, which involves leveraging these previously trained parameters to classify, recognize and process unknown inputs, takes place fully in the client, at the edge.

Processing at the edge requires not only the necessary horsepower to implement the required algorithms but also frequently demands a low power, highly integrated solution such as that provided by a SoC (system on chip). SoC-based solutions may also provide the improved security to protect information both contained within the system and transmitted and received. And the highly integrated nature of processing at the edge often requires that the image processing system be capable of communicating with (and controlling, in some cases) other subsystems. In drones, for example, SWaP (size, weight and power consumption) constraints often result in combining the image processing system with other functions such as motor control in order to create an optimal solution. And going forward, the ever-increasing opportunities for (and deployments of) autonomous products will continue to influence SoC product roadmaps, increasing processing capabilities, expanding integration and reducing power consumption.

Vision Processing in the Cloud

Not all embedded vision applications require the previously discussed real-time processing and decision-making capabilities, however. Other applications involve the implementation of extremely processing-intensive and otherwise complex algorithms, where accuracy is more important than real-time results and where high-bandwidth, reliable network connectivity is a given. Common examples here include medical and scientific imaging; in such cases, vision processing may be optimally implemented on a server, in the cloud (Figure 2). Cloud-based processing is also advantageous in situations where multiple consumers of the analyzed data exist, again such as with medical and scientific imaging.


Figure 2. In applications such as medical imaging, responsiveness of image analysis results is less important than accuracy of results, thereby favoring a cloud-based vision processing approach (courtesy Xilinx).

The network-connected client can act either as a producer of vision data to be analyzed at the server, a consumer of the server's analysis results, or (as is frequently the case) both a producer and a consumer. In some applications, basic pre-processing in the client is also desirable prior to passing data on to the server. Examples include client-side metadata tagging in order to ensure traceability, for applications where images need to be catalogued and recalled for later use. Similarly, clients that act as processed data consumers may subsequently perform more complex operations on the information received; doing 3D rendering and rotation in medical imaging applications, for example. In both of these scenarios, care must be taken in the system design to ensure that data does not become stale (i.e. out of date) either in the client or at the cloud.

Connectivity is key to successful cloud-based applications; both the cloud and client need to be able to gracefully handle unexpected and unknown-duration network disconnections that leave them unable to communicate with each other, along with reduced-bandwidth time periods. Should such an event occur, clients that act as producers should buffer data in order to seamlessly compensate for short-duration connectivity degradation. Redundant communication links, such as the combination of Wi-Fi, wired Ethernet, 4G/5G cellular data and/or ad hoc networking, can also find use in surmounting disconnections of any one of them.
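
As one possible shape for the producer-side buffering just described, the sketch below holds frames in a bounded queue while uploads fail and drains the backlog once the link recovers. The upload endpoint and queue depth are illustrative assumptions.

```python
# Producer-side buffering sketch: hold data in a bounded queue during connectivity loss,
# flush when the link recovers. Endpoint and queue depth are placeholders.
import collections
import requests

BUFFER = collections.deque(maxlen=300)   # cap memory use; oldest data dropped first

def upload(payload: bytes) -> bool:
    try:
        requests.post("https://example.com/ingest", data=payload, timeout=2)
        return True
    except requests.RequestException:
        return False                      # link down or degraded

def produce(frame_bytes: bytes) -> None:
    BUFFER.append(frame_bytes)
    # Drain as much of the backlog as the current connection allows
    while BUFFER and upload(BUFFER[0]):
        BUFFER.popleft()
```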

Vision processing within the cloud often leverages deep machine learning or other artificial intelligence techniques to implement the required algorithms. One example might involve facial recognition performed on images uploaded to social media accounts, in order to identify and tag individuals within those images. Such cloud-based processing involves not only the use of deep learning frameworks (such as Caffe and TensorFlow) and APIs and libraries (like OpenCV and OpenVX) but also the ability to process SQL queries and otherwise interact with databases. And server-based processing performance acceleration modules must also support cache coherent interconnect standards such as CCIX.

The processing capability required to implement these and other algorithms defines the performance requirements of the server architecture and infrastructure. Going forward, the cloud will experience ever-increasing demand for acceleration in expanding beyond the historical microprocessor-centric approach. One compelling option for delivering this acceleration involves leveraging the highly parallel processing attributes of FPGAs, along with their dynamic reprogrammability that enables them to singlehandedly handle multiple vision processing algorithms (and steps within each of them). Companion high level synthesis (HLS) development tools support commonly used frameworks and libraries such as those previously mentioned. And FPGAs' performance-per-watt metrics, translated into infrastructure costs, are also attractive in comparison to alternative processing and acceleration approaches.

Hybrid Vision Processing

For some computer vision applications, a hybrid processing topology can be used to maximize the benefits of both cloud and edge alternatives, while minimizing the drawbacks of each. The facial recognition feature found in Tend Insights' new Secure Lynx™ camera provides an example of this approach (Figure 3).


Figure 3. Leading-edge consumer surveillance cameras employ a hybrid vision processing topology, combining initial video analysis at the client and more in-depth follow-on image examination at the server (courtesy Tend Insights).

State-of-the-art face recognition can produce very accurate results, but this accuracy comes at a high price. These algorithms often rely on deep learning for feature extraction, which requires significant processing capacity. For example, the popular OpenFace open source face detection and recognition system takes more than 800ms to recognize a single face when the algorithm is run on a desktop PC running an 8-core 3.7GHz CPU. For a product such as the Tend Secure Lynx camera, adding enough edge-only processing power to achieve a reasonable face recognition frame rate would easily triple the BOM costs, pushing the final purchase price well above the threshold of target market consumers (not to mention also negatively impacting device size, weight, power consumption and other metrics).

A cloud-only processing solution presents even bigger problems. The bandwidth required to upload high-resolution video to the cloud for processing would have a detrimental impact on the consumer’s broadband Internet performance. Processing video in the cloud also creates a long-term ongoing cost for the provider, which must somehow be passed along to the consumer. While a hybrid approach will not work for all vision problems, face detection and recognition is a compelling partitioning candidate. Leveraging a hybrid approach, however, requires segmenting the total vision problem to be solved into multiple phases (Table 1).

 

|                      | Phase 1                                          | Phase 2           |
|----------------------|--------------------------------------------------|-------------------|
| Description          | Motion detection, face detection, face tracking  | Face recognition  |
| Compute requirements | Low                                              | High              |
| Filtering ability    | Very high                                        | Low               |
| Redundancy reduction | Medium                                           | None              |
| "Solved" status      | Reasonably solved                                | Rapid development |
| Best location        | Edge                                             | Cloud             |

Table 1. Segmenting the vision processing "problem" using a hybrid approach.

The first phase requires orders of magnitude less processing capacity than the second phase, allowing it to run on a low-power, low-cost edge device. The first phase also acts as a very powerful selective filter, reducing the amount of data that is passed on to the second phase.

With the Tend Secure Lynx camera, for example, the processing requirements for acceptable-quality face detection are much lower than those for recognition. Even a low-cost SoC can achieve processing speeds of up to 5 FPS with detection accuracy good enough for a consumer-class monitoring camera. After the camera detects and tracks a face, it pushes crops of individual video frames to a more powerful network-connected server for recognition.
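As a rough illustration of this split (not Tend's actual implementation), the following Python sketch runs a lightweight OpenCV Haar-cascade face detector on the edge device and forwards JPEG crops of detected faces to a hypothetical recognition endpoint; the URL and detector parameters are placeholder assumptions.

    # Phase 1 on the edge device: cheap face detection, with crops forwarded to a
    # server for Phase 2 (recognition). SERVER_URL is a hypothetical endpoint.
    import cv2
    import requests

    SERVER_URL = "https://example.com/recognize"  # placeholder, not a real service

    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    cap = cv2.VideoCapture(0)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=5)
        for (x, y, w, h) in faces:
            crop = frame[y:y + h, x:x + w]
            encoded, jpeg = cv2.imencode(".jpg", crop)
            if encoded:
                # Only small face crops leave the device, not full-resolution video.
                requests.post(SERVER_URL, files={"face": jpeg.tobytes()})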

By filtering both temporally and spatially in the first phase, the Tend Secure Lynx Indoor camera significantly lowers network usage. The first phase can also reduce data redundancy. Once the Tend Secure Lynx camera detects a face, it then tracks that face across subsequent video frames while using minimal computational power. Presuming that all instances of a spatially tracked face belong to the same person results in a notable reduction in the redundancy of recognition computation.

Delegating face recognition to the cloud also allows for other notable benefits. Although the problem of detecting faces is reasonably solved at the present time (meaning that well-known, low-cost techniques produce sufficiently accurate results), the same cannot be said about face recognition. The recent MegaFace Challenge, for example, exposed the limitations of even the best modern algorithms.

Academics and commercial entities alike are therefore continuously developing new and improved facial recognition methods. By implementing such rapidly advancing vision approaches in the cloud, the implementer can stay up to date with the latest breakthroughs, balancing accuracy and cost. In the cloud, the entire recognition implementation can be replaced seamlessly, including rapid transitions between custom-built implementations and third-party solutions.

However, some cloud drawbacks remain present in a hybrid approach. For example, although ongoing server and bandwidth costs are greatly reduced compared to a cloud-only solution, they remain non-zero and must be factored into the total cost of the product. Also, because the edge device cannot process the second phase on its own, any connectivity breakdown leaves the problem only half-solved: the camera simply accumulates face detection data until the network connection is restored, at which time it once again delegates recognition tasks to the server.

In the case of the Tend Secure Lynx camera, however, this latter limitation is not a significant drawback in actual practice, because a breakdown in connectivity also prevents the camera from notifying the user and prevents the user from viewing live video. As such, even successful recognition would not be useful in the absence of a network connection.

While the hybrid architecture may be the superior approach at the present time, this won't necessarily always be the case. As embedded hardware continues to advance and prices continue to drop in coming years, it is likely that we will shortly begin to see low-cost, low-power SoCs with sufficient compute capacity to singlehandedly run accurate facial recognition algorithms. Around that same time, it is likely that accuracy improvement in new algorithms will begin to plateau, opening the door to move the entire vision problem into a low-cost edge device.

Conclusion

Vision technology is enabling a wide range of products that are more intelligent and responsive than before, and thus more valuable to users. Vision processing can add valuable capabilities to existing products, and can provide significant new markets for hardware, software and semiconductor suppliers (see sidebar "Additional Developer Assistance"). Edge, cloud and hybrid vision processing approaches each offer both strengths and shortcomings; evaluating the capabilities of each candidate will enable selection of an optimum approach for any particular design situation. And as both the vision algorithms and the processors that run them continue to evolve, periodic revisits of assumptions will be beneficial in reconsidering alternative approaches in the future.

By Brian Dipert
Editor-in-Chief, Embedded Vision Alliance

Nathan Kopp
Principal Architect, Tend Insights

Darnell Moore
Perception and Analytics Lab Manager, Texas Instruments

Mark Jensen
Director of Strategic Marketing for Embedded Vision, Xilinx

Sidebar: Centralized versus Distributed Vision Processing

Even if vision processing operates fully at the device, such as with the semi- and fully autonomous products mentioned earlier, additional processing architecture options and their associated tradeoffs warrant further analysis. Specifically, distributed and centralized processing approaches each offer strengths and shortcomings that require in-depth consideration prior to selection and implementation. Previously, drones were the case study chosen to exemplify various vision processing concepts. Here, another compelling example will be showcased: today's ADAS for semi-autonomous vehicles, rapidly transitioning to the fully autonomous vehicles of the near future.

The quest to eliminate roadway accidents, injuries and deaths, as well as to shuttle the masses with "Jetsonian" convenience, has triggered an avalanche of activity in the embedded vision community around ADAS and vehicle automation. Vehicle manufacturers and researchers are rapidly developing and deploying capabilities that replace driver functions with automated systems in stages (Figure A). Computer vision is the essential enabler for sensing, modeling, and navigating the dynamic world around the vehicle, in order to fulfill the National Highway Traffic Safety Administration’s minimum behavioral competencies for highly automated vehicles (HAVs).


Figure A. The capabilities commonly found on vehicles equipped with ADAS and automated systems span parking, driving, and safety functions (courtesy Texas Instruments).

To faithfully execute driver tasks, highly automated vehicles rely on redundant, complementary perception systems to measure and model the environment. Emerging vehicles achieving SAE Level 3 and beyond will commonly be equipped with multiple vision cameras as well as RADAR, LIDAR and ultrasonic sensors to support safety, driving, and parking functions. When such multi-modal data is available, sensor fusion processing is employed for redundancy and robustness. Two approaches to sensor fusion generally exist. With object-level sensor fusion, each modality completes its own signal chain processing to determine an independent result, i.e., the raw input is processed through all algorithm steps, including detection and classification, in order to identify obstacles. The detection likelihoods for each sensing mode are then jointly considered, applying Bayesian voting to determine the final result. An alternative method fuses raw or partially processed results from the signal chain, i.e., feature-level or even raw-data sensor fusion, which can provide more robust detection.
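A highly simplified sketch of the object-level voting idea follows. It assumes each modality reports an independent detection probability for the same candidate obstacle and combines them in log-odds space (a naive-Bayes-style vote); the probabilities and the independence assumption are illustrative only, not the fusion scheme of any particular production system.

    # Simplified object-level fusion: combine per-sensor detection probabilities
    # for the same candidate track, assuming conditional independence.
    import math

    def fuse_detections(probs, prior=0.5):
        """Return a fused detection probability from independent per-sensor estimates."""
        log_odds = math.log(prior / (1.0 - prior))
        for p in probs:
            p = min(max(p, 1e-6), 1.0 - 1e-6)  # guard against exact 0 or 1
            log_odds += math.log(p / (1.0 - p))
        return 1.0 / (1.0 + math.exp(-log_odds))

    # Camera, RADAR and LIDAR each report a likelihood for the same obstacle.
    print(fuse_detections([0.7, 0.6, 0.8]))  # fused confidence (~0.93) exceeds any single sensor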

Most automotive SoCs feature architectures with dedicated hardware acceleration for low- and mid-level image processing, vision, and machine learning functions. A redundant compute domain with the processing capability needed to deploy safe-stop measures is also required to achieve the appropriate safety integrity level for HAVs, as recommended by ISO 26262, the international standard governing functional safety for road vehicles. While no mandated sensor suite exists, a consensus has formed around two predominant vision processing topologies: centralized and distributed.

A centralized compute topology offers several notable inherent advantages, chief among them access to all of the sensor data, which promotes low-latency processing and leads to a straightforward system architecture. Universal sensor access also facilitates flexible fusion options. Automated systems are expected to respond quickly to dynamic, critical scenarios, typically within 100 msec. A low-latency framework that processes all sensor data within 33 msec, i.e., at 30 fps, can make three observations within this response window.

Supporting all Level 3+ applications on a single SoC, however, requires a substantial compute budget along with high-bandwidth throughput, both of which generally come with high power consumption and cost liabilities. An example multi-modal sensor configuration will transfer 6.8 Gbps of raw input data to the central ECU (electronic control unit), mandating dense and expensive high-speed memory both at the system level and on-chip (Figure B). Moreover, scalability – especially with cost – can be difficult to achieve. A delicate balance is required with centralized compute arrangements; using two large SoCs, for example, relaxes the software requirements for supporting redundancy at the expense of doubling cost and power, while using a less capable fail-safe ECU may demand a larger software investment.


Figure B. An example sensor configuration leverages a centralized compute ECU for a highly automated vehicle. The sensor suite encompasses long-, medium-, and short-range RADAR, ultrasonic sonar, cameras with various optics including ultra-wide field-of-view, and LIDAR (courtesy Texas Instruments).
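To give a feel for how an aggregate raw-bandwidth figure of this magnitude arises, the following back-of-the-envelope calculation sums assumed data rates for a hypothetical sensor suite. The sensor counts, resolutions, bit depths and per-sensor rates are illustrative assumptions only, not the specific configuration behind the 6.8 Gbps figure cited above.

    # Rough estimate of raw sensor bandwidth arriving at a central ECU.
    # All numbers below are illustrative assumptions.
    cameras = 6 * 1920 * 1080 * 12 * 30   # 6 cameras, 1080p, 12-bit raw, 30 fps
    radars  = 6 * 50e6                    # 6 RADAR units, ~50 Mbps each
    lidar   = 1 * 100e6                   # one LIDAR, ~100 Mbps
    total_bps = cameras + radars + lidar
    print(f"{total_bps / 1e9:.1f} Gbps of raw input data")  # on the order of several Gbps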

The alternative distributed compute topology inherits an installed base of driver assistance and vehicle automation applications that have already been cost-optimized, augmenting these systems with functional modules to extend capabilities. This framework offers more design, cost, and power scalability than the centralized approach, at the expense of increased system complexity and latency. These challenges can be mitigated to a degree with careful hardware and software architecting. Moreover, at the cost of additional software complexity, a distributed strategy can more robustly negotiate various fail-safe scenarios by leveraging the inherent redundancy of distributed sensing islands.

Example heterogeneous vision SoC families find use in surround view, radar, front camera, and fusion ECUs, with a particular family member chosen depending on the required level of automation and the specific algorithm specifications (Figure C). While dual (for redundancy) SoCs used in alternative centralized processor configurations can approach a cumulative 250 watts of power consumption, thereby requiring exotic cooling options, multiple distributed processors can support similar capabilities at an order of magnitude less power consumption.


Figure C. The same sensor suite can alternatively be processed using a distributed compute topology, consisting of multiple ECUs, each tailored to handle specific functions (courtesy Texas Instruments).

Sidebar: Additional Developer Assistance

The Embedded Vision Alliance, a worldwide organization of technology developers and providers, is working to empower product creators to transform the potential of vision processing into reality. Tend Insights, Texas Instruments and Xilinx, the co-authors of this article, are members of the Embedded Vision Alliance. The Embedded Vision Alliance's mission is to provide product creators with practical education, information and insights to help them incorporate vision capabilities into new and existing products. To execute this mission, the Embedded Vision Alliance maintains a website providing tutorial articles, videos, code downloads and a discussion forum staffed by technology experts. Registered website users can also receive the Embedded Vision Alliance’s twice-monthly email newsletter, Embedded Vision Insights, among other benefits.

The Embedded Vision Alliance also offers a free online training facility for vision-based product creators: the Embedded Vision Academy. This area of the Embedded Vision Alliance website provides in-depth technical training and other resources to help product creators integrate visual intelligence into next-generation software and systems. Course material in the Embedded Vision Academy spans a wide range of vision-related subjects, from basic vision algorithms to image pre-processing, image sensor interfaces, and software development techniques and tools such as OpenCL, OpenVX and OpenCV, along with Caffe, TensorFlow and other deep learning frameworks. Access is free to all through a simple registration process.

The Embedded Vision Alliance and its member companies periodically deliver webinars on a variety of technical topics. Access to on-demand archive webinars, along with information about upcoming live webinars, is available on the Alliance website. Also, beginning on July 13, 2017 in Santa Clara, California and expanding to other U.S. and international locations in subsequent months, the Embedded Vision Alliance will offer "Deep Learning for Computer Vision with TensorFlow," a full-day technical training class. See the Alliance website for additional information and online registration.

The Embedded Vision Alliance’s annual technical conference and trade show, the Embedded Vision Summit, is intended for product creators interested in incorporating visual intelligence into electronic systems and software. The Embedded Vision Summit provides how-to presentations, inspiring keynote talks, demonstrations, and opportunities to interact with technical experts from Embedded Vision Alliance member companies. The Embedded Vision Summit is intended to inspire attendees' imaginations about the potential applications for practical computer vision technology through exciting presentations and demonstrations, to offer practical know-how for attendees to help them incorporate vision capabilities into their hardware and software products, and to provide opportunities for attendees to meet and talk with leading vision technology companies and learn about their offerings.

The most recent Embedded Vision Summit took place in Santa Clara, California on May 1-3, 2017; a slide set along with both demonstration and presentation videos from the event are now in the process of being published on the Alliance website. The next Embedded Vision Summit is scheduled for May 22-24, 2018, again in Santa Clara, California; mark your calendars and plan to attend.

Are Neural Networks the Future of Machine Vision?

This technical article was originally published at Basler's website. It is reprinted here with the permission of Basler.

The Internet of Things That See: Opportunities, Techniques and Challenges

This article was originally published at the 2017 Embedded World Conference.

With the emergence of increasingly capable processors, image sensors, and algorithms, it's becoming practical to incorporate computer vision capabilities into a wide range of systems, enabling them to analyze their environments via video inputs. This article explores the opportunity for embedded vision, compares various processor and algorithm options for implementing embedded vision, and introduces an industry alliance created to help engineers incorporate vision capabilities into their designs.

Introduction

Vision technology is now enabling a wide range of products that are more intelligent and responsive than before, and thus more valuable to users. Such image perception, understanding, and decision-making processes have historically been achievable only using large, expensive, and power-hungry computers and cameras. Thus, computer vision has long been restricted to academic research and low-volume applications.

However, thanks to the emergence of increasingly capable and cost-effective processors, image sensors, memories and other semiconductor devices, along with robust algorithms, it's now practical to incorporate computer vision into a wide range of systems. The Embedded Vision Alliance uses the term "embedded vision" to refer to this growing use of practical computer vision technology in embedded systems, mobile devices, PCs, and the cloud.

Similar to the way that wireless communication has now become pervasive, embedded vision technology is poised to be widely deployed in the coming years. Advances in digital integrated circuits were critical in enabling high-speed wireless technology to evolve from exotic to mainstream. When chips got fast enough, inexpensive enough, and energy efficient enough, high-speed wireless became a mass-market technology. Today one can buy a broadband wireless modem or a router for under $50.

Similarly, advances in digital chips are now paving the way for the proliferation of embedded vision into high-volume applications. Like wireless communication, embedded vision requires lots of processing power—particularly as applications increasingly adopt high-resolution cameras and make use of multiple cameras. Providing that processing power at a cost low enough to enable mass adoption is a big challenge.

This challenge is multiplied by the fact that embedded vision applications require a high degree of programmability. In contrast to wireless applications where standards mean that, for example, baseband algorithms don’t vary dramatically from one handset to another, in embedded vision applications there are great opportunities to get better results—and enable valuable features—through unique algorithms.

With embedded vision, the industry is entering a "virtuous circle" of the sort that has characterized many other digital signal processing application domains (Figure 1). Although there are few chips dedicated to embedded vision applications today, these applications are increasingly adopting high-performance, cost-effective processing chips developed for other applications. As these chips continue to deliver more programmable performance per dollar and per watt, they will enable the creation of more high-volume embedded vision end products. Those high-volume applications, in turn, will attract more investment from silicon providers, who will deliver even better performance, efficiency, and programmability – for example, by creating chips tailored for vision applications.


Figure 1. Embedded vision benefits from a "virtuous circle" positive feedback loop of investments both in the underlying technology and on applications.

Processing Options

As previously mentioned, vision algorithms typically require high compute performance. And, of course, embedded systems of all kinds are usually required to fit into tight cost and power consumption envelopes. In other application domains, such as digital wireless communications, chip designers achieve this challenging combination of high performance, low cost, and low power by using specialized accelerators to implement the most demanding processing tasks in the application. These coprocessors and accelerators are typically not programmable by the chip user, however.

This tradeoff is often acceptable in wireless applications, where standards mean that there is strong commonality among algorithms used by different equipment designers. In vision applications, however, there are no standards constraining the choice of algorithms. On the contrary, there are often many approaches to choose from to solve a particular vision problem. Therefore, vision algorithms are very diverse, and tend to change rapidly over time. As a result, the use of non-programmable accelerators and coprocessors is less attractive for vision applications compared to applications like digital wireless and compression-centric consumer video equipment.

Achieving the combination of high performance, low cost, low power, and programmability is challenging. Special-purpose hardware typically achieves high performance at low cost, but with limited programmability. General-purpose CPUs provide programmability, but with weak performance, poor cost-effectiveness, and/or low energy-efficiency. Demanding embedded vision applications most often use a combination of processing elements, which might include, for example:

  • A general-purpose CPU for heuristics, complex decision-making, network access, user interface, storage management, and overall control
  • A specialized, programmable co-processor for real-time, moderate-rate processing with moderately complex algorithms
  • One or more fixed-function engines for pixel-rate processing with simple algorithms

Convolutional neural networks (CNNs) and other deep learning approaches for computer vision, which the next section of this article will discuss, tend to be very computationally demanding. As a result, they have not historically been deployed in cost- and power-sensitive applications. However, it's increasingly common today to implement CNNs, for example, using graphics processor cores and discrete GPU chips. And several suppliers have also recently introduced processors targeting computer vision applications, with an emphasis on CNNs.

Deep Learning Techniques

Traditionally, computer vision applications have relied on special-purpose algorithms that are painstakingly designed to recognize specific types of objects. Recently, however, CNNs and other deep learning approaches have been shown to be superior to traditional algorithms on a variety of image understanding tasks. In contrast to traditional algorithms, deep learning approaches are generalized learning algorithms that are trained through examples to recognize specific classes of objects.

Object recognition, for example, is typically implemented in traditional computer vision approaches using a feature extractor module and a classifier unit. The feature extractor is a hand-designed module, such as a Histogram of Gradients (HoG) or a Scale-Invariant Feature Transform (SIFT) detector, which is adapted to a specific application. The main task of the feature extractor is to generate a feature vector—a mathematical description of local characteristics in the input image. The task of the classifier is to take this multi-dimensional feature vector and predict whether a given object type is present in the scene.
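As a concrete (if toy) sketch of this two-stage pipeline, the following Python code uses OpenCV's HOG descriptor as the hand-designed feature extractor and OpenCV's linear SVM as the separately trained classifier. The training patches and labels are random placeholders; a real detector requires a large labeled dataset.

    # Traditional pipeline: hand-designed features (HOG) + separately trained classifier (SVM).
    import cv2
    import numpy as np

    hog = cv2.HOGDescriptor()  # default 64x128 detection window

    def extract_features(patch_128x64):
        """Hand-designed feature extractor: fixed-length HOG vector for one window."""
        return hog.compute(patch_128x64).flatten()

    # Placeholder training data: 20 "positive" and 20 "negative" grayscale patches.
    patches = [np.random.randint(0, 256, (128, 64), dtype=np.uint8) for _ in range(40)]
    labels = np.array([1] * 20 + [0] * 20, dtype=np.int32)
    features = np.array([extract_features(p) for p in patches], dtype=np.float32)

    svm = cv2.ml.SVM_create()
    svm.setType(cv2.ml.SVM_C_SVC)
    svm.setKernel(cv2.ml.SVM_LINEAR)
    svm.train(features, cv2.ml.ROW_SAMPLE, labels)

    _, prediction = svm.predict(features[:1])
    print("predicted label:", int(prediction[0][0]))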

In contrast, with neural networks, the idea is to make the end-to-end object recognition system adaptive, with no distinction between the feature extractor and the classifier. Training the feature extractor, rather than hand-designing it, gives the system the ability to learn and recognize more complex and non-linear features in objects that would otherwise be hard to model in a program. The complete network is trained from the input pixel stage all the way to the output classifier layer that generates class labels. All the parameters in the network are learned using a large set of training data. As learning progresses, the parameters are trained to extract relevant features of the objects the system is tasked to recognize. By adding more layers to the network, complex features are learned hierarchically from simple ones.
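By contrast, a minimal end-to-end CNN, sketched here with TensorFlow's Keras API, learns its features and its classifier jointly from pixels. The layer sizes, input resolution and ten-class output are illustrative assumptions, not a recommended architecture.

    # End-to-end CNN: convolutional feature extraction and classification trained jointly.
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(64, 64, 3)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),  # deeper layers learn
        tf.keras.layers.MaxPooling2D(),                    # increasingly complex features
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),   # class-label output layer
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(train_images, train_labels, epochs=5)  # trained end to end from pixels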

More generally, many real-world systems are difficult to model mathematically or programmatically. Complex pattern recognition, 3-D object recognition in a cluttered scene, detection of fraudulent activities on a credit card, speech recognition and prediction of weather and stock prices are examples of non-linear systems that involve solving for thousands of variables, or following a large number of weak rules to get to a solution, or "chasing a moving target" for a system that changes its rules over time.

Often, such as in recognition problems, we don’t have robust conceptual frameworks to guide our solutions because we don’t know how the brain does the job! The motivation for exploring machine learning comes from our desire to imitate the brain (Figure 2). Simply put, we want to collect a large number of examples that give the correct output for a given input, and then instead of writing a program, give these examples to a learning algorithm, and let the machine produce a program that does the job. If trained properly, the machine will subsequently operate correctly on previously unseen examples, a process known as "inference."


Figure 2. Inspired by biology, artificial neural networks attempt to model the operation of neuron cells in the human brain.

Open Standards

In the early stages of new technology availability and implementation, development tools tend to be company-proprietary. Examples include NVIDIA's CUDA toolset for developing computer vision and other applications that leverage the GPU as a heterogeneous coprocessor, and the company's associated cuDNN software library for accelerated deep learning training and inference operations (AMD's more recent equivalents for its own GPUs are ROCm and MIOpen).

As a technology such as embedded vision matures, however, additional development tool sets tend to emerge, which are more open and generic and support multiple suppliers and products. Although these successors may not be as thoroughly optimized for any particular architecture as are the proprietary tools, they offer several advantages; for example, they allow developers to create software that runs efficiently on processors from different suppliers.

One significant example of an open standard for computer vision is OpenCV, the Open Source Computer Vision Library. This collection of more than 2500 software components, representing both classic and emerging machine learning-based computer vision functions, was initially developed in proprietary fashion by Intel Research in the mid-1990s. Intel released OpenCV to the open source community in 2000, and ongoing development and distribution is now overseen by the OpenCV Foundation.

Another example of a key enabling resource for the practical deployment of computer vision technology is OpenCL, managed by the Khronos Group. An industry-standard alternative to the proprietary and GPU-centric CUDA and ROCm mentioned previously, it is a maturing set of heterogeneous programming languages and APIs that enable software developers to efficiently harness the profusion of diverse processing resources in modern SoCs, in an abundance of applications including embedded vision. It's joined by the HSA Foundation's various specifications, which encompass the standardization of memory coherency and other attributes requiring the implementation of specific hardware features in each heterogeneous computing node.

Then there's OpenVX, a recently introduced open standard managed by the Khronos Group. It was developed for the cross-platform acceleration of computer vision applications, prompted by the need for high performance and low power with diverse processors. OpenVX is specifically targeted at low-power, real-time applications, particularly those running on mobile and embedded platforms. The specification provides a way for developers to tap into the performance of processor-specific optimized code, while still providing code portability via the API's standardized interface and higher-level abstractions.

Numerous open standards are also appearing for emerging deep learning-based applications. Open-source frameworks include the popular Caffe, maintained by the U.C. Berkeley Vision and Learning Center, along with Theano and Torch. More recently, they've been joined by frameworks initially launched (and still maintained) by a single company but now open-sourced, such as Google's TensorFlow and Microsoft's Cognitive Toolkit (formerly known as CNTK). And for deep learning model training and testing, large databases, such as the ImageNet Project, containing more than ten million images, are available.

Industry Alliance Assistance

The Embedded Vision Alliance, a worldwide organization of technology suppliers, is working to empower product creators to transform the potential of embedded vision into reality. The Alliance's mission is to provide product creators with practical education, information and insights to help them incorporate vision capabilities into new and existing products. To execute this mission, the Alliance maintains a website providing tutorial articles, videos, and a discussion forum staffed by technology experts. Registered website users can also receive the Alliance’s newsletter, Embedded Vision Insights, among other benefits.

The Embedded Vision Alliance also offers a free online training facility for vision-based product creators: the Embedded Vision Academy. This area of the Alliance website provides in-depth technical training and other resources to help product creators integrate visual intelligence into next-generation software and systems. Course material in the Academy spans a wide range of vision-related subjects, from basic vision algorithms to image pre-processing, image sensor interfaces, and software development techniques and tools. Access is free to all through a simple registration process.

The Embedded Vision Alliance’s annual technical conference and trade show, the Embedded Vision Summit, will be held May 1-3, 2017 at the Santa Clara, California Convention Center. Designed for product creators interested in incorporating visual intelligence into electronic systems and software, the Summit provides how-to presentations, inspiring keynote talks, demonstrations, and opportunities to interact with technical experts from Alliance member companies.

The Embedded Vision Summit is intended to inspire attendees' imaginations about the potential applications for practical computer vision technology through exciting presentations and demonstrations, to offer practical know-how for attendees to help them incorporate vision capabilities into their hardware and software products, and to provide opportunities for attendees to meet and talk with leading vision technology companies and learn about their offerings. Online registration and additional information on the 2017 Embedded Vision Summit are now available.

Conclusion

With embedded vision, the industry is entering a "virtuous circle" of the sort that has characterized many other digital processing application domains. Embedded vision applications are adopting high-performance, cost-effective processor chips originally developed for other applications; ICs and cores tailored for embedded vision applications are also now becoming available. Deep learning approaches have been shown to be superior to traditional vision processing algorithms on a variety of image understanding tasks, expanding the range of applications for embedded vision. Open standard algorithm libraries, APIs, data sets and other toolset elements are simplifying the development of efficient computer vision software. The Embedded Vision Alliance believes that in the coming years, embedded vision will become ubiquitous, as a powerful and practical way to bring intelligence and autonomy to many types of devices.

By Jeff Bier
Founder, Embedded Vision Alliance

ARM Guide to OpenCL Optimizing Pyramid: The Test Method

This chapter describes how to use the SDK and test the performance of optimizations.

How to test the optimization performance

The following performance test method produces the results that are shown in this guide:

  1. Use the difference between the OpenCL event timer values reported for CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END to measure how long the kernel takes on the GPU (see the sketch after this list).
  2. Measure the execution-time ratio between the optimized implementation and the implementation without the optimizations to evaluate the performance increase each optimization achieves. This enables you to see the benefits of each optimization as they are added.
  3. Run the kernel across various image resolutions to see how different optimizations affect different resolutions. Depending on the use case, the performance of one resolution might be more important than the others. For example, a real-time web-cam feed requires different performance compared to taking a high-resolution photo with a camera.
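The ARM guide's own SDK code is not reproduced here, but the following sketch shows the same measurement idea from step 1 using PyOpenCL (assumed to be installed, along with an OpenCL driver): the command queue is created with profiling enabled, and kernel time is the difference between the event's start and end counters.

    # Measure GPU kernel time via OpenCL profiling events (PyOpenCL sketch,
    # not the ARM SDK's own code).
    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(
        ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

    src = """
    __kernel void scale(__global const float *in, __global float *out) {
        int gid = get_global_id(0);
        out[gid] = in[gid] * 2.0f;
    }
    """
    prg = cl.Program(ctx, src).build()

    n = 1024 * 1024
    host_in = np.random.rand(n).astype(np.float32)
    mf = cl.mem_flags
    buf_in = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=host_in)
    buf_out = cl.Buffer(ctx, mf.WRITE_ONLY, host_in.nbytes)

    event = prg.scale(queue, (n,), None, buf_in, buf_out)
    event.wait()

    # Difference of the COMMAND_START / COMMAND_END counters, in nanoseconds.
    elapsed_ms = (event.profile.end - event.profile.start) * 1e-6
    print(f"kernel time: {elapsed_ms:.3f} ms")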

The resolutions that have been tested are:

  • 640 x 480.
  • 1024 x 576.
  • 2048 x 1536.
  • 4096 x 2304.

To obtain the results that this guide uses, the results from ten runs are averaged. This reduces the effects that individual runs have on the results from the SDK.

After measuring the performance of the code with a new optimization, you can then add more optimizations. With each new optimization added, repeat the test steps and compare the results with the results from the code before the implementation of the new optimization.

Mali Offline Compiler

The Mali™ Offline Compiler is a command-line tool that translates kernels written in OpenCL into binaries for execution on Mali GPUs.

You can use the Offline Compiler to produce a static analysis output that shows:

  • The number of work and uniform registers that the code uses when it runs.
  • The number of instruction words that are emitted for each pipeline.
  • The number of cycles for the shortest path for each pipeline.
  • The number of cycles for the longest path for each pipeline.
  • The source of the current bottleneck.

To start the Offline Compiler and produce the static analysis output, execute the command mali_clcc -v on the kernel.

To obtain the Offline Compiler, see http://malideveloper.arm.com.

ARM Guide to OpenCL Optimizing Pyramid: The Test Environment

This chapter describes the requirements to run the SDK and the example test platform which generates the results that this guide shows.

The SDK platform requirements

To run the SDK sample on your platform, the platform must meet the following requirements:

  • The platform must contain an ARM®Mali™ Midgard GPU running a Linux environment.
  • You must have an OpenCL driver for your GPU. See http://malideveloper.arm.com for available drivers.
  • You must have an internet connection to download and install the tools that enable you to build the samples.

Note: A graphics environment is not required; a serial console is sufficient.

Example pyramid test platform

The pyramid test platform that produces the results in this guide is built from the following components:

  • Platform
    Arndale 5250 board (Dual ARM®Cortex®‑A15 processor, with ARM®Mali™‑T604 GPU).
  • File system
    Linaro Ubuntu 14.04 Hard Float.
  • Kernel
    Linaro 3.11.0-arndale.
  • DDK
    ARM®Mali™ Midgard r4p0 DDK.

Note: This is an example of the hardware that can be used. Any hardware that meets the platform requirements can be used.

ARM Guide to OpenCL Optimizing Pyramid: Conclusion

This chapter describes the conclusions from the example optimization process.

Conclusion

This example shows one way to implement and optimize the creation of a Gaussian image pyramid using OpenCL and OpenCL buffers.

Small changes in the OpenCL code can produce significant performance improvements. For example, processing the RGB color planes separately and then recombining them afterward not only reduces the number of loads but also simplifies the handling of pixel borders and enables some kernels to be merged.

The following techniques are useful for optimizing pyramid image generation:

  • Separate the convolution stage.
  • Use padding to avoid expensive boundary checks.
  • Split the image into its individual color planes.
  • Change the storage method to improve the vectorization of the loads.
  • Merge kernels to reduce the time spent enqueuing kernels and reduce the execution time of the most expensive kernel.
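For reference, the pyramid being optimized can be expressed in a few lines of Python with OpenCV. This functional sketch is not the guide's optimized OpenCL implementation, and the input file name is a placeholder.

    # Functional reference for a Gaussian image pyramid (what the OpenCL kernels accelerate).
    import cv2

    def gaussian_pyramid(image, levels=4):
        """Return a list of progressively half-resolution, Gaussian-filtered images."""
        pyramid = [image]
        for _ in range(levels - 1):
            pyramid.append(cv2.pyrDown(pyramid[-1]))  # 5x5 Gaussian blur + 2x downsample
        return pyramid

    img = cv2.imread("input.png")  # placeholder file name
    for level, layer in enumerate(gaussian_pyramid(img)):
        print(level, layer.shape)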

Image Quality Analysis, Enhancement and Optimization Techniques for Computer Vision

This article explains the differences between images intended for human viewing and for computer analysis, and how these differences factor into the hardware and software design of a camera intended for computer vision applications versus traditional still and video image capture. It discusses various methods, both industry standard and proprietary, for assessing and optimizing computer vision-intended image quality with a given camera subsystem design approach. And it introduces an industry alliance available to help product creators incorporate robust image capture and analysis capabilities into their computer vision designs.

The bulk of historical industry attention on still and video image capture and processing has focused on hardware and software techniques that improve images' perceived quality as evaluated by the human visual system (eyes, optic nerves, and brain). However, while some of the methods implemented in the conventional image capture-and-processing arena may be equally beneficial for computer vision processing purposes, others may be ineffective, wasting processing resources, power, and other constrained system budgets.

Some conventional enhancement methods may, in fact, be counterproductive for computer vision. Consider, for example, an edge-smoothing or facial blemish-suppressing algorithm: while the human eye might prefer the result, it may hamper the efforts of a vision processor that's searching for objects in a scene or doing various facial analysis tasks. In contrast, various image optimization techniques for computer vision might generate outputs that the human eye and brain would judge as "ugly" but a vision processor would conversely perceive as "improved."

Historically, the computer vision market was niche, thereby forcing product developers to either employ non-optimal hardware and software originally intended for traditional image capture and processing (such as a smartphone camera module, tuned for human perception needs) or to create application-optimized products that by virtue of their low volumes had consequent high costs and prices. Now, however, the high performance, cost effectiveness, low power consumption, and compact form factor of various vision processing technologies are making it possible to incorporate practical computer vision capabilities into a diversity of products that aren't hampered by the shortcomings of historical image capture and processing optimizations.

Detailing the Differences

To paraphrase a well-known proverb, IQ (image quality) is in the eye of the beholder…whether that beholder is a person or a computer vision system. For human perception purposes, the fundamental objective of improving image quality is to process the captured image in such a way that it is most pleasing to the viewer. Sharpness, noise, dynamic range, color accuracy, contrast, distortion, and artifacts are just some of the characteristics that need to blend in a balanced way for an image to be deemed high quality (Figure 1).


Figure 1. Poor color tuning, shown in the ColorChecker chart example on the left, results in halos within colored squares along with color bleeding into black borders (see inset), both artifact types absent from the high-quality tuning example on the right (Courtesy Algolux, ColorChecker provided by Imatest).

Achieving the best possible IQ for human perception requires an optimal lens and sensor, and an ISP (image signal processor) painstakingly tuned by an expert. Complicating the system design are the realities of component cost and quality, image sensor and ISP capabilities, tuning time and effort, and available expertise. There are even regional biases that factor into perceived quality, such as a preference for cooler images versus warmer images between different cultures. Despite these subjective factors, testing and calibration environments, along with specific metrics, still make the human perception tuning process more objective (at least for products targeting a particular region of the world, for example).

For both traditional and neural network-based CV (computer vision) systems, on the other hand, maximizing image quality is primarily about preserving as much data as possible in the source image to maximize the accuracy of downstream processing algorithms. This objective is application-dependent in its details. For example, sharp edges allow the algorithm to achieve better feature extraction for doing segmentation or object detection, but they result in images that look harsh to a human viewer. Color accuracy is also critical in applications such as sign recognition and object classification, but may not be particularly important for face recognition or feature detection functions.

As the industry continues to explore deep learning computer vision approaches and begins to integrate them into shipping products, training these neural network models with ever-larger tagged image and video datasets to improve accuracy becomes increasingly important. Unfortunately, many of these training images, such as those in the ImageNet dataset for classification, are captured using typical consumer handheld and drone-based cameras, stored in lossy-compressed formats such as JPEG, and tuned for human vision purposes. The resulting loss of image information hampers the very accuracy improvements that computer vision algorithm developers are striving for. Significant industry effort is therefore now being applied to computer vision classification, in order to improve the accuracy (e.g., precision and recall) of identifying a scene as well as specific objects within an image.

Application Examples

To further explore the types of images needed for computer vision versus human perception, first look at a "classical" machine vision application that leverages a vision system for quality control purposes, specifically to ensure that a container is within allowable dimensions. Such an application gives absolutely no consideration to images that are "pleasing to the eye." Instead, system components such as lighting, lens, camera and software are selected solely for their ability to maximize defect detections, no matter how unattractive the images they create may look to a human observer.

On the other hand, ADAS (Advanced Driver Assistance Systems) is an example of an application that drives tradeoffs between images generated for processing and for viewing (Figure 2). In some cases, these systems are focused entirely on computer processing, since the autonomous reactions need to occur more rapidly than a human can respond, as with a collision avoidance system. Other ADAS implementations combine human viewing with computer vision, such as found in a back-up assistance system. In this case, the display outputs a reasonably high quality image, typically also including some navigational guidance.



Figure 2. ADAS and some other embedded vision applications may require generating images intended for both human viewing (top) and computer analysis (bottom) purposes; parallel processing paths are one means of accomplishing this objective (courtesy study.marearts.com).

More advanced back-up systems also include passive alerts (or, in some cases, active collision avoidance) for pedestrians or objects behind the vehicle. In this application, the camera-captured images are parallel-processed, with one path going to the in-car display where the image is optimized for driver and passenger viewing. The other processing path involves the collision warning-or-avoidance subsystem, which may analyze a monochrome image enhanced to improve the accuracy of object detection algorithms such as Canny edge and optical flow.
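A minimal sketch of such a second processing path, assuming OpenCV and a placeholder frame file, might enhance contrast and extract Canny edges for the downstream detector rather than for the driver's display.

    # Computer-vision-oriented processing path: enhance for the algorithm, not the viewer.
    import cv2

    frame = cv2.imread("rear_camera_frame.png")   # placeholder file name
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)                 # contrast stretch that may look harsh to a human
    edges = cv2.Canny(gray, threshold1=50, threshold2=150)
    cv2.imwrite("edges.png", edges)               # fed to object detection, not the in-car display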

Camera Design Tradeoffs

As background to a discussion of computer vision component selection, consider three types of vision systems: the human visual perception system, mechanical and/or electrical systems designed to mimic human vision, and computer vision systems (Figure 3). Fundamental elements common to all three of these systems are illustrated in Figure 3.


Figure 3. Cameras, whether conventional or computer vision-tailored, contain the same fundamental elements as found in the human visual system (courtesy Algolux).

For human vision, millions of years' worth of evolution has determined the system's current (and presumably optimal) structure, design, and elements. For designers of imaging systems, whether for human perception or computer vision, innovation and evolution have conversely so far produced an overwhelming number of options for each of these system elements. So how does a designer sort through the staggering number of lighting, lens, sensor, and processor alternatives to end up with the optimal system combination? And just as importantly, how does the designer model the system once the constituent elements are chosen, and how is the system characterized?

Another design challenge involves determining how the system's hardware components can impact the software selection, and vice versa. Similarly, how can the upstream system components complement any required downstream image processing? And how can software compensate for less-than-optimal hardware...and is the converse also true?

When selecting vision components, the designer typically begins with a review of the overall system boundary conditions: power consumption, cost, size, weight, performance, and schedule, for example. These big-picture parameters drive the overall design and, therefore, the selection of various components in the design. Traditional machine vision systems have historically been relatively unconcerned about size, weight and power consumption, and even (to a lesser degree) cost. Performance was dictated by available PC-based processors and mature vision libraries, for which many adequate solutions existed. For this reason, many traditional machine vision systems, although performance-optimized, historically utilized standard hardware platforms.

Conversely, the embedded vision designer is typically far more parameter-constrained, and thus is required to more fully optimize the entire system. Every gram, mm, mW, dollar and Gflop can influence the success of the product. Consider, for example, the components that constitute the image acquisition portion of the system design: lighting, the lens, the sensor, the pixel processor and the interconnect. The designer may, for example, consider a low-cost package, such as a smartphone-tailored or other compact camera module (CCM). These modules can deliver remarkable images, at least for human viewing purposes, and offer low SWaP (size, weight and power) and cost.

Downsides exist to such a highly integrated approach, however. One is the absence of "lifetime availability": these modules tend to have a lifespan of less than two years. Also, as previously discussed, depending on the application, the on-board processing may deliver an image tuned for human viewing rather than for additional vision processing. And these modules, along with the necessary support for them, may only be available to very high volume customers.

Component Selection

If the designer decides to choose and combine individual components, rather than using an integrated CCM, several selection factors vie for consideration. The first component to be considered is lighting (i.e. illumination), which is optimized to allow the camera to capture and generate the most favorable image outputs. Many embedded systems rely on ambient light, which can vary from bright sunlight to nearly dark conditions (in ADAS and drone applications, for example). Some scenes will also contain a combination of both bright and dark areas, thereby creating further image-processing challenges. Other applications, such as medical instruments, involve a more constrained-illumination environment and typically also implement specialized lighting to perform analyses.

Illumination options for the designer to consider include source type, geometry, wavelength, and pattern. Light source options include LED, mercury, laser and fluorescent; additional variables include the ability to vary intensity and to "strobe" the illumination pattern. The system can include single or multiple sources to, for example, increase contrast or eliminate shadows. Many applications also use lasers or patterned light to illuminate the target; depth-sensing applications such as laser profiling and structured light are examples. Whether relying on ambient light or creating a controlled illumination environment, designers must also consider the light's wavelength range, since image sensors operate in specific light bands.

Another key element in the vision application is the lens. As with lighting, many options exist, each with cost and performance tradeoffs. Some basic parameters include fixed and auto-focus, zoom range, field of view, depth-of-field and format. The optics industry is marked by continuous innovation; recent breakthroughs such as liquid lenses also bear consideration. The final component in the image capture chain, prior to processing, is the sensor. Many manufacturers, materials, and formats exist in the market. Non-visible light spectrum alternatives, such as UV and infrared, are even available.

The overall trend, at least with mainstream visible-light image sensors, is toward ever-smaller pixels (both to reduce cost at a given pixel count and to expand the cost-effective resolution range) that are compatible with standard CMOS manufacturing processes. The performance of CMOS sensors is approaching that of traditional CCD sensors, and CMOS alternatives also tend to be lower in both cost and power consumption. Many CMOS sensors, especially those designed for consumer applications, also now embed image-processing capabilities such as HDR (high dynamic range) and color conversion. Smaller pixels, however, require a trade-off between spatial resolution and light sensitivity. Additional considerations include color fidelity and the varying artifacts that can be induced by sensors' global versus rolling shutter modes.

In almost all cases today, the pixel data that comes out of the sensor is processed by an ISP and then compressed. The purpose of the ISP is to deliver the best image quality possible for the application; this objective is accomplished through a pipeline of algorithmic processing stages that typically begins with the de-mosaic (also known as de-Bayer) of the sensor’s raw output to reconstruct a full-color image from the sensor's CFA (color filter array) red, green, and blue pattern. A monochromatic sensor does not need this de-mosaic step, since each pixel is dedicated to capturing light intensity, providing higher native resolution and sensitivity at the expense of loss of color detail.
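A minimal de-mosaic sketch using OpenCV follows; the raw file name, resolution, bit depth and Bayer pattern are assumptions that would need to match the actual sensor.

    # De-mosaic (de-Bayer) sketch: reconstruct a full-color image from single-channel raw data.
    import cv2
    import numpy as np

    # Placeholder raw dump: assume 12-bit samples stored in 16-bit words, 1080p, BG Bayer pattern.
    raw = np.fromfile("frame.raw", dtype=np.uint16).reshape(1080, 1920)
    raw8 = (raw >> 4).astype(np.uint8)               # scale 12-bit data to 8 bits for this sketch
    rgb = cv2.cvtColor(raw8, cv2.COLOR_BayerBG2BGR)  # interpolate the missing color samples per pixel
    cv2.imwrite("demosaiced.png", rgb)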

Other key ISP stages include defective pixel correction, lens-shading correction, color correction, "3A" (auto focus, auto exposure, and auto white balance), HDR, sharpening, and noise reduction. ISPs can optionally be integrated as a core within a SoC, operate as standalone chips, or be implemented as custom hardware on an FPGA. Each ISP provider takes a unique approach to developing the stages' algorithms, as well as their order, along with leveraging hundreds or thousands of tuning parameters with the goal of delivering optimum perceived image quality.
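Toy versions of two of these later stages (a gray-world auto white balance and a global gamma/tone curve) are sketched below; real ISP implementations are far more elaborate and heavily tuned, and the input file name is a placeholder.

    # Toy white balance and tone-curve stages applied after de-mosaic.
    import cv2
    import numpy as np

    image = cv2.imread("demosaiced.png").astype(np.float32)  # placeholder input image

    def gray_world_white_balance(img):
        """Scale each channel so the scene's average color becomes neutral gray."""
        means = img.reshape(-1, 3).mean(axis=0)
        gains = means.mean() / means
        return np.clip(img * gains, 0, 255)

    def apply_gamma(img, gamma=2.2):
        """Simple global tone curve."""
        normalized = img / 255.0
        return np.clip((normalized ** (1.0 / gamma)) * 255.0, 0, 255).astype(np.uint8)

    output = apply_gamma(gray_world_white_balance(image))
    cv2.imwrite("isp_output.png", output)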

Assessing Component Quality

When selecting vision system components, designers are faced with a staggering number of technical parameters to comprehend, along with the challenge of determining how various components are interdependent and/or complementary. For instance, how does the image sensor resolution affect the choices in optics or lighting? One way that designers can begin to quantify the quality of various components, and the images generated by them, is through available standards. For example, the European Machine Vision Association has developed EMVA1288, a standard now commonly used in the machine vision industry. It defines a consistent measurement methodology to test key performance indicators of machine vision cameras.

Standards such as EMVA1288 assist in making the technical data provided by various camera and sensor vendors comparable. EMVA1288 is reliable in that it documents the measurement procedures as well as the subsequent data determination processes. The specification covers numerous parameters, such as spectral sensitivity, defect pixels, color, and various SNR (signal/noise ratio) specifications, such as maximal SNR, dynamic range, and dark noise.

These parameters are important because they define the system performance baseline. For example, is the color conversion feature integrated in the sensor adequate for the application? If not, it may be better to instead rely on the raw pixel data from the sensor, recreating a custom color reproduction downstream. Another example is defect pixels. Some imaging applications are not concerned about individual (or even small clusters of) pixels that are "outliers", because the information to be extracted from the image is either "global" or pattern-based; in both cases, minor defects are not a concern. In such cases, a lower-cost sensor may be adequate for the application.
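As a simplified illustration (not the standard's full measurement procedure), the following sketch estimates temporal dark noise, SNR and dynamic range from stacks of dark and bright frames; the file names and the 12-bit full-scale assumption are placeholders.

    # Rough estimates of a few sensor parameters from captured frame stacks.
    import numpy as np

    dark_frames = np.load("dark_stack.npy")      # placeholder: N dark exposures, shape (N, H, W)
    bright_frames = np.load("bright_stack.npy")  # placeholder: N bright, non-saturated exposures

    dark_noise = dark_frames.std(axis=0).mean()         # temporal noise per pixel, averaged
    signal = bright_frames.mean() - dark_frames.mean()  # mean signal above the dark level
    snr_db = 20 * np.log10(signal / dark_noise)
    dynamic_range_db = 20 * np.log10((4095 - dark_frames.mean()) / dark_noise)  # assumed 12-bit full scale

    print(f"SNR ~ {snr_db:.1f} dB, dynamic range ~ {dynamic_range_db:.1f} dB")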

Considering Image Quality

Moving beyond the performance of individual system components, the system designer must also consider the quality of the images generated by them, with respect to required processing operations. The definition and evolution of image quality metrics has long been an ongoing area of research and development, and volumes' worth of publications are available on the topic. Simply put, the goal of trying to objectively measure what is essentially a subjective domain arose from the need to assess and benchmark imaging and video systems, in order to streamline evaluation effort and complexity.

The first widely used metric for measuring image quality was quite subjective in nature. The MOS (mean opinion score) rates people's perceptions of a particular image or video sequence on a scale from 1 (unacceptable) to 5 (excellent). This approach requires a sufficiently large population sample and is expensive, time-consuming, and error-prone. Other, more objective metrics exist, such as PSNR (peak signal-to-noise ratio) and MSE (mean squared error), but they have flaws of their own, such as an imperfect ability to model subjective perceptions.
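Both objective metrics follow directly from their definitions; a minimal Python version for 8-bit images is shown below.

    # MSE and PSNR between a reference image and a processed/compressed version of it.
    import numpy as np

    def mse(reference, test):
        diff = reference.astype(np.float64) - test.astype(np.float64)
        return np.mean(diff ** 2)

    def psnr(reference, test, max_value=255.0):
        error = mse(reference, test)
        return float("inf") if error == 0 else 10 * np.log10((max_value ** 2) / error)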

More sophisticated metrics measure parameters such as image sharpness (e.g., MTF, or modulation transfer function), acutance, noise, color response, dynamic range, tone, distortion, flare, chromatic aberration, vignetting, sensor non-uniformity, and color moiré. These are the attributes the ISP tries to make as ideal as possible. If it cannot improve the image enough, the system designer might need to select a higher-quality lens or sensor. Alternative approaches might be to modify the ISP itself (not possible when it's integrated within a commercial SoC) or to perform additional post-processing on the application processor.

Several analysis tools and methodologies exist that harness these metrics, both individually and in combination, to evaluate the quality of a camera system's output images against idealized test charts. Specialized test environments with test charts, controlled lighting, and analysis software are required here to effectively evaluate or calibrate a complete camera. DxO, Imatest, Image Engineering, and X-rite are some of the well-known companies that provide these tools, in some cases also offering testing services.

While such metrics can be analyzed, what scores correlate to high-quality image results? Industry benchmarks, guidelines, and standards help answer this question. DxO's DxOMark, for example, is a well-known, long-established commercial rating service for camera sensors and lenses that aggregates numerous individual metrics into one summary score. Microsoft has also published image quality guidelines for both its Windows Hello face recognition sign-in feature and for Skype, the latter certifying video calls at both Premium and Standard levels.

The IEEE also supports two image quality standards efforts. The first, now ratified, is IEEE 1858-2016 CPIQ (camera phone image quality), intended to provide objective assessment of the quality of the camera(s) integrated within a smartphone. The second, recently initiated by the organization, is IEEE P2020, a standard for automotive system image quality. The latter effort addresses image quality not only for human perception of automotive camera output but also for various computer vision functions. As more cameras are integrated into cars for increasingly sophisticated ADAS and autonomous driving capabilities, establishing a consistent image quality target that enables the computer vision ecosystem to achieve the highest possible accuracy will accelerate the development and deployment of such systems.

Tuning and Optimization

With the incredible advancements in optics, sensors, and processing performance in recent years, one might think an embedded camera would deliver amazing images out of the box, for either human consumption or computer vision purposes. Incredibly complex lens designs exist, for example, for both unique high-end applications and cost- and space-constrained mainstream mobile phones, in both cases paired with state-of-the-art sensors and sophisticated image-processing pipelines. As such, at least in certain conditions, the latest camera phones from Apple, Samsung, and other suppliers are capable of delivering image quality that closely approximates that of premium DSLRs (digital single-lens reflex cameras) and high-end video cameras.

In reality, however, hundreds to thousands of parameters, often preset to default values that don't comprehend specific lens and sensor configurations, guide the image processing done by ISPs. Image quality experts at large companies use analysis tools and environments to evaluate a camera's ability to reproduce test charts accurately, and then further fine-tune the ISP to deliver optimum image quality. These expert engineering teams at leading core, chip, and system providers spend many months iterating through different parameter settings, lighting conditions, and test charts to arrive at an optimum set of parameters, a massive investment of time and expense. And in some cases, they will invest in the development of internal tools that both supplement industry analysis offerings and automate testing iterations and/or various data analysis tasks.

In the best-case scenario, the tuning process starts with a software model, typically in C or MATLAB, of the lens, sensor, and ISP. Initial test charts fed through the model produce output images that are subsequently analyzed by tools such as those from Imatest. The development team iteratively sets parameters, sends images through the model, analyzes the results via both automated tools and visual inspection, and repeats this process until an acceptable result is achieved (Figure 4). The team then moves to the prototype stage, incorporating the lens, sensor, and ISP (implemented in either an SoC or an FPGA) and leveraging a test lab with lighting and test chart setups targeted at evaluating real-world image quality.


Figure 4. The image tuning process typically begins with iteration using a software model of the lens, sensor and ISP combination (Courtesy Algolux and Imatest).

This stage in the tuning process uses the initial software model settings as a starting point, with image quality experts iterating through additional parameter combinations until they achieve the best-case result. Furthermore, the process must be repeated each time the camera system is cost-optimized or otherwise altered (with a less expensive lens or sensor, for example, or an algorithm optimized for lower power consumption), as well as when the product goes through subsequent initial-production and calibration stages.

The performance of the camera is a critical value proposition, which easily justifies the investment for larger companies. Internal expertise improves with each product release, incrementally enhancing the team's tuning efficiency. Teams at smaller companies that don't have access to internal expertise may instead outsource the task to third-party analysis and optimization service providers, although this remains a very costly and time-consuming process. Outsourced services companies that perform tuning build their expertise by tuning a wide variety of camera systems from different clients.

The smallest (but most numerous) companies, which don't even have the resources to support outsourcing, are often forced to resort to using out-of-the-box parameter settings. Keep in mind, too, that leading SoC suppliers offer documentation, tools, and hands-on support for sensor integration, ISP tuning, and the like only to their top customers. Even Raspberry Pi, an open source project, doesn't provide access to its SoCs' ISP parameter registers for tuning purposes. Scenarios like this represent a significant challenge for any camera-based system provider. Fortunately, innovative work is now being done to apply machine learning and other advanced techniques to automating IQ tuning for both human perception and computer vision accuracy. These approaches "solve" for the optimum parameter combinations against IQ metric goals, thereby striving to reduce tuning effort and expense for large and small companies alike.
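Conceptually, such automated tuning treats the ISP parameters as variables in an optimization problem. The sketch below illustrates the simplest possible version of the idea: score candidate parameter sets against an aggregate image quality metric and keep the best. The render_with_params() and score_image_quality() functions are hypothetical placeholders rather than any vendor's actual API, and a real system would use a far smarter search strategy (machine learning, for example) than exhaustive evaluation.

// Conceptual sketch of automated ISP tuning: evaluate candidate parameter sets
// against an aggregate image quality score and keep the best-scoring set.
// render_with_params() and score_image_quality() are hypothetical placeholders,
// not an actual vendor API.
#include <stdlib.h>

typedef struct { float denoise_strength; float sharpen_amount; } IspParams;

unsigned char *render_with_params(const unsigned char *raw, int width, int height,
                                  const IspParams *params);                     // hypothetical ISP model
double score_image_quality(const unsigned char *image, int width, int height);  // e.g. combined MTF, noise, color error

IspParams tune_isp(const IspParams *candidates, int num_candidates,
                   const unsigned char *test_chart_raw, int width, int height)
{
     IspParams best = candidates[0];
     double best_score = -1.0;

     for (int i = 0; i < num_candidates; i++)
     {
          unsigned char *output = render_with_params(test_chart_raw, width, height,
                                                     &candidates[i]);
          double score = score_image_quality(output, width, height);
          if (score > best_score)
          {
               best_score = score;
               best = candidates[i];
          }
          free(output);
     }
     return best;
}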

Conclusion

Vision technology is enabling a wide range of products that are more intelligent and responsive than before, and thus more valuable to users. Vision processing can add valuable capabilities to existing products, and can provide significant new markets for hardware, software and semiconductor suppliers (see sidebar "Additional Developer Assistance"). Delivering optimum image quality in a product otherwise constrained by boundary conditions such as power consumption, cost, size, weight, performance, and schedule is a critical attribute, regardless of whether the images will subsequently be viewed by humans and/or analyzed by computers. As various methods for assessing and optimizing image quality continue to evolve and mature, they'll bring the "holy grail" of still and video picture perfection ever closer to becoming a reality.

Sidebar: Additional Developer Assistance

The Embedded Vision Alliance, a worldwide organization of technology developers and providers, is working to empower product creators to transform the potential of vision processing into reality. Algolux and Allied Vision, the co-authors of this article, are members of the Embedded Vision Alliance. The Embedded Vision Alliance's mission is to provide product creators with practical education, information and insights to help them incorporate vision capabilities into new and existing products. To execute this mission, the Embedded Vision Alliance maintains a website providing tutorial articles, videos, code downloads and a discussion forum staffed by technology experts. Registered website users can also receive the Embedded Vision Alliance’s twice-monthly email newsletter, Embedded Vision Insights, among other benefits.

The Embedded Vision Alliance also offers a free online training facility for vision-based product creators: the Embedded Vision Academy. This area of the Embedded Vision Alliance website provides in-depth technical training and other resources to help product creators integrate visual intelligence into next-generation software and systems. Course material in the Embedded Vision Academy spans a wide range of vision-related subjects, from basic vision algorithms to image pre-processing, image sensor interfaces, and software development techniques and tools such as OpenCL, OpenVX and OpenCV, along with Caffe, TensorFlow and other deep learning frameworks. Access is free to all through a simple registration process.

The Embedded Vision Alliance’s annual technical conference and trade show, the Embedded Vision Summit, will be held May 1-3, 2017 at the Santa Clara, California Convention Center.  Designed for product creators interested in incorporating visual intelligence into electronic systems and software, the Summit provides how-to presentations, inspiring keynote talks, demonstrations, and opportunities to interact with technical experts from Embedded Vision Alliance member companies. The Summit is intended to inspire attendees' imaginations about the potential applications for practical computer vision technology through exciting presentations and demonstrations, to offer practical know-how for attendees to help them incorporate vision capabilities into their hardware and software products, and to provide opportunities for attendees to meet and talk with leading vision technology companies and learn about their offerings. Online registration and additional information on the 2017 Embedded Vision Summit are now available.

By Brian Dipert
Editor-in-Chief, Embedded Vision Alliance

Dave Tokic
VP Marketing & Strategic Partnerships, Algolux

Michael Melle
Sales Development Manager, Allied Vision

ARM Guide to OpenCL Optimizing Pyramid: Optimization Process



This chapter describes an example optimization process for creating image pyramids.

Convolution matrix separability

Convolution with an nxn convolution matrix requires n² multiplications and n²-1 additions per pixel. However, a property of linear algebra called separability enables this task to be completed more efficiently. When a matrix is separable, the 2D convolution can be performed as two 1D passes, requiring only 2n multiplications and 2(n-1) additions per pixel; for the 5x5 Gaussian, this reduces the per-pixel cost from 25 multiplications to 10.

If a matrix is separable, then it can be represented as the outer product of two vectors with dimension n.

The following figure shows the separated parts of the Gaussian 5x5 matrix.


Figure 4-1 Separating the Gaussian matrix

A matrix is separable if its rank is one. The rank of a matrix is the maximum number of linearly independent columns of the matrix or the maximum number of linearly independent rows of the matrix.

Two rows (or columns) are linearly dependent if one can be expressed as a scalar multiple of the other; that is, c0 = alpha x c1 for some scalar alpha, where c0 and c1 are two different rows or columns. If no such scalar exists, the rows or columns are linearly independent. In a rank-one matrix, every row is a scalar multiple of a single row, and likewise for the columns.

The Gaussian 5x5 matrix has a rank of one, therefore it is separable.
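As a concrete illustration, the sketch below builds a 5x5 Gaussian matrix as the outer product of a 1D kernel with itself, here assuming the common binomial approximation 1 4 6 4 1 (normalized by 16); the exact coefficients shown in Figure 4-1 may differ.

// Sketch: constructing a 5x5 Gaussian matrix as the outer product of a 1D kernel
// with itself. The binomial coefficients 1 4 6 4 1 (normalized by 16) are assumed
// here; the exact values in the guide's figure may differ.
#include <stdio.h>

int main(void)
{
     const float g[5] = { 1.0f / 16, 4.0f / 16, 6.0f / 16, 4.0f / 16, 1.0f / 16 };
     float kernel[5][5];

     for (int row = 0; row < 5; row++)
          for (int col = 0; col < 5; col++)
               kernel[row][col] = g[row] * g[col];    // outer product: every row is a multiple of g

     // Because every row is a scalar multiple of g, the matrix has rank one and is separable.
     for (int row = 0; row < 5; row++)
     {
          for (int col = 0; col < 5; col++)
               printf("%8.5f ", kernel[row][col]);
          printf("\n");
     }
     return 0;
}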

To use a separable convolution matrix efficiently, perform the total convolution by sequentially applying two convolutions using the separate parts. Apply one of the convolutions along the x direction of the image, and store the intermediate results in a temporary buffer. Then apply the second convolution along the y direction of the temporary buffer. The result from this method is identical to the result that the full unseparated matrix provides but requires fewer operations.

The following code shows how this task can be achieved.

// Pseudo code
// Convolution 1D along Y direction
for(int y = 0; y < height; y++)
{
     for(int x = 0; x < width; x++)
     {
          sum = 0.0;

          for(int i = -2; i <= 2; i++)
          {
               // Get value from SOURCE image
               pixel = get_pixel(src, x, y + i);
               sum = sum + coeffs[i + 2]*pixel;...
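For completeness, the following self-contained (and deliberately unoptimized) C sketch shows both passes of the separable 5x5 Gaussian convolution, again assuming the 1 4 6 4 1 / 16 coefficients, a single-channel 8-bit image, and clamped borders. It illustrates the technique rather than reproducing the guide's actual kernel code.

// Unoptimized C sketch of the separable 5x5 Gaussian: a horizontal pass into a
// temporary buffer, then a vertical pass over that buffer. Single-channel 8-bit
// image assumed; borders are handled by clamping coordinates.
#include <stdlib.h>

static int clamp(int v, int lo, int hi)
{
     return v < lo ? lo : (v > hi ? hi : v);
}

void gaussian5x5_separable(const unsigned char *src, unsigned char *dst,
                           int width, int height)
{
     static const float coeffs[5] = { 1.0f / 16, 4.0f / 16, 6.0f / 16, 4.0f / 16, 1.0f / 16 };
     float *tmp = malloc((size_t)width * height * sizeof(float));

     // Pass 1: convolve along the x direction, storing intermediate results.
     for (int y = 0; y < height; y++)
          for (int x = 0; x < width; x++)
          {
               float sum = 0.0f;
               for (int i = -2; i <= 2; i++)
                    sum += coeffs[i + 2] * src[y * width + clamp(x + i, 0, width - 1)];
               tmp[y * width + x] = sum;
          }

     // Pass 2: convolve the temporary buffer along the y direction.
     for (int y = 0; y < height; y++)
          for (int x = 0; x < width; x++)
          {
               float sum = 0.0f;
               for (int i = -2; i <= 2; i++)
                    sum += coeffs[i + 2] * tmp[clamp(y + i, 0, height - 1) * width + x];
               dst[y * width + x] = (unsigned char)(sum + 0.5f);
          }

     free(tmp);
}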

ARM Guide to OpenCL Optimizing Pyramid: Initial Implementation



This chapter describes an initial, unoptimized implementation of pyramid.

Initial code

The initial, unoptimized code uses the basic pyramid process: an unaltered Gaussian convolution is applied, then the result is subsampled. This process is then repeated on the result to create further levels of the pyramid.
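In outline, the level-by-level loop looks something like the sketch below, where gaussian_blur_5x5() stands in for the filter code shown in the next section; allocation checks and odd-dimension handling are omitted for brevity.

// Sketch of the basic pyramid process: blur each level with a 5x5 Gaussian,
// then subsample by two to produce the next level. gaussian_blur_5x5() is a
// placeholder for the filter implementation; error handling is omitted.
#include <stdlib.h>

void gaussian_blur_5x5(const unsigned char *src, unsigned char *dst,
                       int width, int height);   // placeholder

void build_pyramid(const unsigned char *src, unsigned char **levels,
                   int width, int height, int num_levels)
{
     int w = width, h = height;
     const unsigned char *current = src;

     for (int level = 0; level < num_levels; level++)
     {
          unsigned char *blurred = malloc((size_t)w * h);
          gaussian_blur_5x5(current, blurred, w, h);

          int nw = w / 2, nh = h / 2;
          levels[level] = malloc((size_t)nw * nh);

          // Subsample: keep every second pixel of the blurred image.
          for (int y = 0; y < nh; y++)
               for (int x = 0; x < nw; x++)
                    levels[level][y * nw + x] = blurred[(2 * y) * w + (2 * x)];

          free(blurred);
          current = levels[level];
          w = nw;
          h = nh;
     }
}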

The initial Gaussian 5x5 filter implementation

The code for the initial unoptimized Gaussian pyramid uses the same convolution code as the ARM® Guide to OpenCL Optimizing Convolution example.

The following code shows the Gaussian 5x5 filter code from the ARM® Guide to OpenCL Optimizing Convolution.

// Variables declaration

...

#define PARTIAL_CONVOLUTION( i, m0, m1, m2, m3, m4 )\
     do{\
          temp = vload16(0, src + addr + i * strideByte);\
          temp2 = vload4(0, src + addr + i * strideByte + 16);\
          l2Row = temp.s01234567;\
          l1Row = temp.s3456789A;\
          mRow = temp.s6789ABCD;\
          r1Row = (uchar8)(temp.s9ABC, temp.sDEF, temp2.s0);\
          r2Row = (uchar8)(temp.sCDEF, temp2.s0123);\
          l2Data = convert_ushort8(l2Row);\
          l1Data = convert_ushort8(l1Row);\
          mData = convert_ushort8(mRow);\
          r1Data = convert_ushort8(r1Row);\
          r2Data = convert_ushort8(r2Row);\
          pixels += l2Data * (ushort8)m0;\
          pixels += l1Data * (ushort8)m1;\
          pixels += mData * (ushort8)m2;\
          pixels += r1Data * (ushort8)m3;\
          pixels += r2Data * (ushort8)m4;\
     }while(0)

     PARTIAL_CONVOLUTION( -2, MAT0, MAT1, MAT2, MAT3, MAT4);
     PARTIAL_CONVOLUTION( -1, MAT5, MAT6, MAT7, MAT8, MAT9);
     PARTIAL_CONVOLUTION( 0, MAT10, MAT11,...