Embedded Vision Alliance: Technical Articles

Implementing Vision with Deep Learning in Resource-constrained Designs

DNNs (deep neural networks) have transformed the field of computer vision, delivering superior results on functions such as recognizing objects, localizing objects within a frame, and determining which pixels belong to which object. Even problems like optical flow and stereo correspondence, which had been solved quite well with conventional techniques, are now finding even better solutions using deep learning techniques. But deep learning is also resource-intensive, as measured by its compute requirements, memory and storage demands, network latency and bandwidth needs, and other metrics. These resource requirements are particularly challenging in embedded vision designs, which often have stringent size, weight, cost, power consumption and other constraints. In this article, we review deep learning implementation options, including heterogeneous processing, network quantization, and software optimization. Sidebar articles present case studies on deep learning for ADAS applications, and for object recognition.

Traditionally, computer vision applications have relied on special-purpose algorithms that are painstakingly designed to recognize specific types of objects. Recently, however, CNNs (convolutional neural networks) and other deep learning approaches have been shown to be superior to traditional algorithms on a variety of image understanding tasks. In contrast to traditional algorithms, deep learning approaches are generalized learning algorithms trained through examples to recognize specific classes of objects.

Since deep learning is a comparatively new approach, however, developer expertise with it is less mature than with traditional alternatives. And much of the existing expertise is focused on resource-rich PCs rather than on comparatively resource-constrained embedded and other designs, as measured by factors such as:

  • Image capture (along with, potentially, depth discernment) subsystem capabilities
  • CPU and other processors' compute capabilities
  • Local cache and chip/system memory capacities, latencies and bandwidths
  • Local mass storage capacity, latency and bandwidth, and
  • Network connectivity reliability, latency and bandwidth

This article provides information on techniques for developing robust deep learning-based vision processing SoCs, systems and software for resource-constrained applications. It showcases, for example, the opportunities for and benefits of leveraging available heterogeneous computing resources beyond the CPU, such as a GPU, DSP and/or specialized processor. It also discusses the tradeoffs of various cache and main memory technologies and architectures, implemented at both the chip and system levels. It highlights hardware and software design toolsets and methodologies that assist in the optimization process. And it also introduces readers to an industry alliance created to help product creators incorporate vision capabilities into their hardware and software, along with outlining the technical resources that this alliance provides (see sidebar "Additional Developer Assistance").

Defining the Problem

Why is developing a deep learning-based embedded vision design, along with adapting a deep learning model initially targeting a PC implementation, so challenging? In answering these questions, the following introductory essay from Au-Zone Technologies sets the stage for the sections that follow it.

At the core of the constrained-resources predicament is the reality that technology innovations typically originate in academic research. Both there and in the initial commercial opportunities that often follow, development and implementation typically employ a desktop- or server-class computer platform. Such systems contain a relative abundance of processing, memory, storage and connectivity resources, and they're also comparatively unhindered by the size, weight, cost, power consumption and other constraints that are common in embedded design implementations. Whereas a PC developer might not think twice about standardizing on high-precision floating-point data and calculations, for example, an embedded developer would need to rely on low-precision fixed-point alternatives in order to balance the more challenging resource "budget."

Some specific examples of constraints and other development challenges in embedded implementations include:

  • The tradeoffs between (more expensive but faster and more power efficient) SRAM and (cheaper but slower and higher power) DRAM, both at a given level in the system memory hierarchy and as subdivided (both in terms of type and capacity) among various levels.
  • The increased commonality of a unified system memory approach shared by various heterogeneous processors (versus, say, a dedicated frame buffer for a GPU in a PC), and the increased resultant likelihood of contention between multiple processors for the same memory bus and end storage location resources.
  • Vision processing pipelines, along with low-level drivers, that may be "tuned" for human perception, and therefore conversely may be sub-optimal for computer vision purposes.
  • Software APIs, frameworks, libraries and other standards that are immature and therefore in rapid evolution.
  • The frequent need for purpose-built development tools, custom inference engines, and the like, along with the resulting inability to easily migrate them to other implementations.

And how do these and other factors adversely affect embedded vision development? Whether defined by available compute resources, memory and mass storage capacity and bandwidth, network connectivity characteristics, battery capacities, the thermal envelope, size, weight, bill-of-materials costs, or the combination of multiple or all of these and other factors, these constraints fundamentally define what capabilities your product is able to support and how robustly it is able to deliver those capabilities. And equally if not more importantly, they affect how long your product will spend in development, not to mention the required project budget and manpower headcount, in order to hit necessary capability targets.

Brad Scott
President, Au-Zone Technologies

Heterogeneous Processing

Computer vision and machine learning present low-power mobile and embedded systems with significant challenges. It’s important, therefore, to leverage every bit of the computing potential present in your SoC and/or system. Designing from the outset with processing efficiency in mind is key. Dual-core CPUs first came to mainstream PCs in the mid-2000s; in recent years, multi-core processors have also become common in smartphones and tablets, and even in various embedded SoCs. GPUs, whether integrated in the same die/package as the rest of the application processor or in discrete form, are also increasingly common.

Modern embedded CPUs have steadily improved in their ability to tackle some parallel tasks, as vector-based SIMD (single instruction multiple data) architecture extensions, for example, become available and are leveraged by developers. Modern embedded GPUs have conversely approached the same problem from the opposite direction, becoming more adept at some serial tasks. Together, the CPU and GPU can handle much if not all of the machine learning workload.

A key message when it comes to processor choice, in considering the spectrum of possible vision and machine learning workloads, is that a one-size solution can't possibly fit all possible use cases. Thinking heterogeneously when designing your SoC or system, leveraging a selection of processors with strengths in different areas, can reap dividends when it comes to overall efficiency. For example, many modern SoCs already come equipped with both "big" and "little" CPUs (and clusters of them), along with a GPU. The “big” CPU cores are present for moments (often short bursts) when high performance is needed, while the “little” cores are intended for sustained processing efficiency, and the GPU delivers as-required massively parallel computation ability. When these resources are used intelligently, via the OpenCL API and other programming techniques, you end up with a great deal of efficient processing power available to you.

The challenge, of course, is efficiently spreading your machine learning and computer vision pipeline workloads across all available computing resources in such a heterogeneous fashion. A common technique for these kinds of pipelines, when implemented on desktop or server systems, is to make many copies of large data buffers. On mobile and embedded systems, conversely, such memory copies are highly inefficient, both in terms of the time taken to perform the copies, and (crucially) in the amount of energy these sorts of operation consume. Fortunately, however, embedded and mobile SoCs typically implement a unified, global memory array, making the “zero-copy” ideal at least somewhat feasible in reality.
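The difference between copying a buffer and sharing it can be illustrated in a few lines of Python. This is only an analogy for the "zero-copy" idea, not a model of how an SoC shares memory between a CPU and a GPU; the frame dimensions and function names are hypothetical.

```python
# Hypothetical 8-bit grayscale frame held in one shared buffer.
WIDTH, HEIGHT = 640, 480
frame = bytearray(WIDTH * HEIGHT)

def row_copy(buf: bytearray, row: int) -> bytes:
    """Return one image row as a new bytes object (allocates and copies)."""
    return bytes(buf[row * WIDTH:(row + 1) * WIDTH])

def row_view(buf: bytearray, row: int) -> memoryview:
    """Return one image row as a view onto the same memory (no copy)."""
    return memoryview(buf)[row * WIDTH:(row + 1) * WIDTH]

copied = row_copy(frame, 10)
viewed = row_view(frame, 10)

# A producer stage now writes a new pixel value into the shared buffer.
frame[10 * WIDTH] = 255

print("copy sees:", copied[0])   # the copy still holds the old value
print("view sees:", viewed[0])   # the view reflects the new value
```

The view costs no extra memory traffic, but every pipeline stage holding it sees later writes, which is why shared buffers need explicit hand-off or synchronization between stages.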

A remaining problem, at least historically: the CPU and GPU have separate caches, which means that transferring workloads between processors has typically required expensive (in terms of latency and/or power consumption) synchronization operations. Any potential performance gains made when moving workloads between processors can be negated in the process. These days, fortunately, designing your SoC using a modern, cache-coherent interconnect can make all the difference. Caches are kept up-to-date automatically, which means that you can more freely swap workloads between various processor types.

The inclusion of computer vision and deep learning accelerators alongside CPUs and GPUs in modern SoCs is becoming increasingly commonplace. The following essay from Synopsys discusses their potential to boost the performance, along with lowering the energy consumption, of deep learning-based vision algorithms, as well as providing various implementation suggestions.

Along with facing the usual embedded challenges of power and area consumption, a designer architecting an embedded SoC for computer vision and deep learning must also tackle some unique challenges and constraints, such as steadily increasing system complexity and the rapid pace of technology advancement. Power and area constraints therefore need to be balanced against performance and accuracy targets; the latter is very important for object detection and classification tasks, which are increasingly implemented using deep learning techniques. And memory bandwidth can also become a limiting factor that requires insight and understanding.

An embedded vision SoC, such as one based on Synopsys’ DesignWare EV6x processor family, combines both traditional computer vision processing units and newer deep learning engines (Figure 1). The EV6x includes a vision processor that combines both scalar and vector DSP resources, along with a programmable CNN engine. The scalar unit is programmed via a C/C++ compiler, while the vector unit is programmed using an OpenCL C compiler. A CNN graph, such as AlexNet, ResNet, or VGG16, is trained using Caffe, TensorFlow or another software framework, and is then mapped into the CNN engine’s hardware. The CNN engine is programmable in the sense that any graph can be mapped to its hardware.


Figure 1. The various components of an embedded vision SoC can together implement a robust ADAS or other processing solution (courtesy Synopsys).

These three heterogeneous processing units deliver the best performance for a given power and area, because they are each optimized for their specific tasks. Each dedicated processing unit must be programmed, so designers need a robust set of software tools that enable mapping of embedded vision solutions across the processing units. DesignWare EV6x processors, for example, are fully programmable and supported by the MetaWare EV development toolkit, which includes software development tools based on the OpenVX, OpenCV and OpenCL C embedded vision standards.

Performance, power and area are some of the constraints that need to be balanced against each other. Different projects will have different prioritization orders for these and other constraints. An automotive ADAS (advanced driver assistance system) design based on a high-resolution front camera might prefer to prioritize performance, but limitations in clock speed defined by process technology (and associated cost), power consumption (and associated heat dissipation) and other parameters will constrain the SoC's maximum performance capabilities (see sidebar "Optimization at the System Level, An ADAS Case Study"). Selecting a processing solution that allows for scaling of the CNN engine can be an effective means of trading off various constraints.

Performance

Evaluating the performance of different embedded vision systems is not a straightforward process, since no benchmark standard currently exists. Comparing the number of multiply-accumulators (MACs) in different CNN engines can provide a first-order assessment, since CNN implementations require a large number of MACs. TeraMAC/s has therefore become a popular metric for specifying CNN engines. The EV6x, for example, can scale from 880 to 3,520 MACs, which at a 1,280 MHz clock frequency (under typical operating conditions and when fabricated on a 16 nm process node), delivers a performance range of 1.1 to 4.5 TMAC/s. Obscured by this metric, unfortunately, is the MACs' precision. Two different CNN engines, for example, may look similar from a TMAC/s standpoint:

880 12-bit MACs x 1,280 MHz = 1.1 TMAC/s

1,024 8-bit MACs x 1,000 MHz = 1.0 TMAC/s

What the headline TMAC/s figures alone do not convey, however, is the bit resolution of the MACs in each CNN engine: 12 bits in the first case versus eight bits in the second.
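The arithmetic behind these figures is simple enough to reproduce. The following minimal Python sketch assumes that every MAC unit completes one multiply-accumulate per clock cycle, which is a peak-rate simplification; the function name is illustrative.

```python
def peak_tmacs(num_macs: int, clock_mhz: float) -> float:
    """Peak throughput in TMAC/s, assuming one MAC operation per unit per cycle."""
    return num_macs * clock_mhz * 1e6 / 1e12

# (label, MAC count, clock in MHz, MAC precision in bits), taken from the text above.
engines = [
    ("EV6x, smallest configuration", 880, 1280, 12),
    ("EV6x, largest configuration", 3520, 1280, 12),
    ("Comparison engine", 1024, 1000, 8),
]

for label, macs, mhz, bits in engines:
    print(f"{label}: {peak_tmacs(macs, mhz):.2f} TMAC/s using {bits}-bit MACs")
```

Printing the precision next to the throughput keeps the two numbers together, since a TMAC/s figure on its own says nothing about the accuracy of each operation.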

Accuracy

Bit resolution can significantly impact system accuracy, a critical metric for an application such as a front camera in an automobile. Most embedded vision CNNs exclusively use fixed-point calculations, versus floating-point calculations, since silicon cost (therefore the amount of silicon area consumed) is a key design parameter. The EV6x, for example, uses optimized 12-bit integer MACs that support 12- or eight-bit calculations. Most CNN graphs can be executed with eight-bit precision and no loss of accuracy; however, some graphs benefit from the additional resolution that a 12-bit MAC provides.

A deep graph, such as ResNet-152, for example, which has a large path distance from start node to end node, generally does not perform well in an eight-bit system. Graphs that have a large number of pixel processing operations, such as Denoiser, also tend to fare poorly when using eight-bit CNNs (Figure 2). Conversely, using even higher bit resolutions, such as 16 bits or 32 bits, doesn’t normally produce significantly improved accuracy but does adversely impact the silicon area required to implement the solution.


Figure 2. A 12-bit fixed-point implementation of the Denoiser algorithm delivers results practically indistinguishable from those of a floating-point implementation, while eight-bit fixed-point results exhibit more obvious lasting noise artifacts (courtesy Synopsys).
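To give a feel for why the extra four bits matter, the following Python sketch rounds synthetic values in the range [-1, 1) to signed fixed-point codes and reports the worst rounding error. It is an illustration under simple assumptions (symmetric scaling, round-to-nearest, saturation), not a description of how the EV6x or any particular CNN engine quantizes its data.

```python
import math

def quantize(x: float, bits: int) -> float:
    """Round x (expected in [-1, 1)) to a signed fixed-point value with `bits` bits."""
    scale = 1 << (bits - 1)               # 128 for 8 bits, 2048 for 12 bits
    q = round(x * scale)                  # nearest integer code
    q = max(-scale, min(scale - 1, q))    # saturate to the representable range
    return q / scale                      # convert back to a real value for comparison

def worst_case_error(bits: int, samples: int = 1000) -> float:
    """Largest rounding error over a set of synthetic test values."""
    worst = 0.0
    for i in range(samples):
        x = 0.99 * math.sin(i)            # synthetic values, not real feature maps
        worst = max(worst, abs(x - quantize(x, bits)))
    return worst

for bits in (8, 12):
    step = 1 / (1 << (bits - 1))
    print(f"{bits}-bit: step = {step:.6f}, worst error = {worst_case_error(bits):.6f}")
```

For in-range values the rounding error is bounded by half a step, so each 12-bit value is up to 16 times more precise than its eight-bit counterpart. In a deep graph these small per-layer errors can accumulate along the path from input to output.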

Area

Some SoCs prioritize silicon area ahead of either performance or power consumption. A designer of an application processor intended for consumer surveillance applications, for example, might have limited area available on the SoC, therefore striving to obtain the highest achievable performance at a given die "footprint". Selecting the most advanced, i.e., smallest, process node available is often the best way to minimize area. Moving from 28nm to 16nm, or from 16nm to 12nm or lower, can increase clock frequency capabilities, reduce power consumption at a given clock speed, and decrease die size. However, process node selection is often dictated by company initiatives that are beyond the SoC designer's influence.

Selecting an optimized embedded vision processor is the next best way to minimize area, since such a processor will maximize the return on the silicon area investment. Another way to minimize area is to reduce the required memory sizes. For an EV6x core containing both a vision processor and a CNN "engine", for example, multiple memory arrays are present: for the scalar and vector DSPs as well as the CNN accelerator, along with a closely coupled partition for sharing data between the processing units. While memory size recommendations exist, the final choice is left to the SoC designer. Keep in mind, however, that reducing memory capacities could negatively impact system performance, among other ways by increasing bandwidth on the AXI bus. Another way to address memory size on the EV6x is to choose a single- versus dual-MAC configuration. Again, smaller area has to be balanced against higher performance.

Power Consumption

Consumer and mobile designers often list power consumption as the most critical constraint. A designer of an SoC for a mobile device, for example, might strive for the best performance within a 200 mW power budget. Fortunately, embedded vision processors are designed with low power in mind. Fine-tuning clock frequencies is the most straightforward way to lower power consumption. A system that could perform at clock speeds as high as 1,280 MHz might instead be clocked at 500 MHz or lower, at least in some operating modes, to optimize battery life and reduce heat dissipation. Doing so, of course, can degrade performance. Power consumption is affected by both area (e.g., transistor count) and frequency. Generally, the smaller the area and the lower the transistor toggle rate, the less power your design will consume.
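The relationship between clock frequency and power can be approximated with the standard first-order model for dynamic CMOS power, P = a x C x V^2 x f. The sketch below applies it to the two clock speeds mentioned above. The activity factor, switched capacitance and supply voltage are hypothetical placeholders rather than EV6x data, and the model ignores leakage as well as any voltage reduction that a lower clock might permit.

```python
def dynamic_power_w(activity: float, capacitance_f: float,
                    voltage_v: float, clock_hz: float) -> float:
    """First-order dynamic power estimate: P = a * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * clock_hz

# Hypothetical values chosen only to make the arithmetic concrete.
ACTIVITY = 0.2          # fraction of capacitance switched per cycle
CAPACITANCE_F = 1e-9    # total switched capacitance, in farads
VOLTAGE_V = 0.8         # supply voltage, held constant here

full = dynamic_power_w(ACTIVITY, CAPACITANCE_F, VOLTAGE_V, 1280e6)
reduced = dynamic_power_w(ACTIVITY, CAPACITANCE_F, VOLTAGE_V, 500e6)

print(f"1,280 MHz: {full * 1e3:.1f} mW")
print(f"  500 MHz: {reduced * 1e3:.1f} mW ({reduced / full:.0%} of full-speed power)")
```

At constant voltage, both dynamic power and peak throughput scale linearly with the clock, so the lower frequency saves power per second rather than energy per inference. Larger savings generally depend on also lowering the supply voltage, which this sketch does not model.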

Bandwidth

Bandwidth on the AXI or equivalent interconnect bus is often a top concern, and it is closely related to power consumption: reducing the number of external bus transactions saves power. Increasing the amount of on-chip memory (at the cost of more silicon area, as previously discussed) is one way to decrease the number of required external bus accesses. However, deep learning research is also coming up with various pruning and compression techniques that reduce the number and type of computations, along with the amount of memory, needed to implement a given CNN graph.

Take a VGG16 graph, for example. Reducing the number of coefficients, through a pruning process that deletes coefficients close to zero and retrains the system to retain accuracy, can significantly reduce the number of coefficients that have to be stored in memory, therefore cutting down bandwidth. However, this process doesn’t necessarily lower the calculation load, therefore power consumption, unless (as with the EV6x) hardware support also exists to discard MAC operations with zero inputs. Using eight-bit coefficients when 12-bit support isn’t needed for accuracy will also lower both bus bandwidth and external memory requirements. The EV6x, for example, supports eight-bit as well as 12-bit coefficients and feature map values.
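The following Python sketch illustrates the three ideas in this paragraph: magnitude-based pruning, skipping MAC operations that have a zero input, and the storage effect of coefficient width. The weights, threshold and function names are hypothetical, the retraining step is omitted, and the figure of roughly 138 million coefficients for VGG16 is the commonly cited parameter count rather than a number from this article.

```python
def prune(weights: list[float], threshold: float) -> list[float]:
    """Zero out coefficients whose magnitude is below the threshold."""
    return [0.0 if abs(w) < threshold else w for w in weights]

def sparse_dot(weights: list[float], inputs: list[float]) -> tuple[float, int]:
    """Dot product that skips any MAC with a zero operand; returns (sum, MACs executed)."""
    total, macs = 0.0, 0
    for w, x in zip(weights, inputs):
        if w == 0.0 or x == 0.0:
            continue                      # no work needed for this pair
        total += w * x
        macs += 1
    return total, macs

def coefficient_megabytes(count: int, bits: int) -> float:
    """Storage needed for `count` densely packed coefficients of the given width."""
    return count * bits / 8 / 1e6

weights = [0.5, -0.01, 0.002, -0.7]       # hypothetical coefficients
pruned = prune(weights, threshold=0.05)
result, macs_used = sparse_dot(pruned, [1.0, 2.0, 3.0, 4.0])
print(f"pruned = {pruned}, MACs executed = {macs_used} of {len(weights)}")

VGG16_COEFFICIENTS = 138_000_000          # approximate, commonly cited count
for bits in (12, 8):
    print(f"{bits}-bit coefficients: {coefficient_megabytes(VGG16_COEFFICIENTS, bits):.0f} MB")
```

Note that pruned coefficients only save memory and bandwidth if they are stored in a compressed or sparse form; a dense array of zeros occupies as much space as the original.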

Gordon Cooper
Product Marketing Manager for Embedded Vision Processors, Synopsys

Software Optimization

While it's highly beneficial to fully leverage all available processing resources in a SoC and/or system, as discussed in the previous section of this article, it's equally important to optimize the deep learning-based vision software that's running on those processing resources. The following essay from BDTI explores this topic in detail, including implementation suggestions.

Optimization of a compute-intensive workload can be described in terms of one or multiple of the following levels of abstraction:

  • Algorithm optimization: modify the algorithm to do less computation, and/or better fit the target hardware.
  • Software architecture optimization: come up with a software architecture that maximizes system throughput, e.g., avoiding needless data copies, making efficient use of caches, enabling efficient use of parallel resources, etc.
  • Hot-spot optimization: individually optimize the most time-consuming functions.

The greatest gains are often found at the highest levels of abstraction, and therefore highest-level optimizations should be undertaken first. A fully customized implementation of an application, one that has been thoroughly optimized at each of these levels and for the specific use case, will likely yield the best power/performance result. But such a time consuming and expensive development path may not be practical. For DNN deployment in particular, a full-custom implementation of a CNN is a risky undertaking: network topologies are evolving rapidly, and if you spend several man-months optimizing for one topology, by the time you’re finished you might find out that you should be using some other topology instead. Because of this risk, it's rare to see DNNs optimized to the last clock cycle or the last mW.
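For the hot-spot level in particular, Amdahl's law offers a quick way to estimate how much an individual function optimization can contribute to the whole application. The law is a general result that this essay does not itself cite, and the runtime fractions below are hypothetical.

```python
def overall_speedup(hot_fraction: float, hot_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when a fraction of runtime is accelerated."""
    return 1.0 / ((1.0 - hot_fraction) + hot_fraction / hot_speedup)

# Hypothetical profile: the same 4x function-level speedup applied to two hot spots.
for fraction in (0.60, 0.05):
    gain = overall_speedup(fraction, hot_speedup=4.0)
    print(f"function using {fraction:.0%} of runtime, made 4x faster: "
          f"{gain:.2f}x overall")
```

A fourfold speedup of a function that accounts for only a small share of runtime barely moves the total, which is one reason to profile before optimizing and to look for algorithm- and architecture-level changes first.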

What's typically seen in practice, instead, are frameworks for deployment that don’t assume a specific topology. In terms of software architecture, these frameworks are designed to handle well-known published topologies such as ResNet50, GoogLeNet, and SqueezeNet. The underlying individual functions are also highly optimized, using these same well-known networks as benchmarks. Because these frameworks need to have a lot of flexibility, they can’t necessarily match the peak performance of a fully customized implementation.

Keep in mind, too, that if you're striving to deploy a network that’s vastly different from the “benchmark” networks for which the framework was designed, you might end up with sub-optimal performance. A character recognition network that operates on 40x40 pixel images, for example, probably has a much smaller memory footprint than commonly used image classification networks do, making the character recognition network amenable to optimizations that wouldn't be effective in a framework designed to deploy larger networks (see sidebar "Optimization at the System Level, An Object Recognition Case Study").
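A rough size comparison shows why such a small network invites different optimizations. The sketch below compares the input buffer of a 40x40 single-channel image against a 224x224 RGB input, a size commonly used by ImageNet-style classifiers. The single-channel assumption, the one-byte-per-value format and the 32 KB on-chip memory budget are all hypothetical.

```python
def buffer_bytes(width: int, height: int, channels: int, bytes_per_value: int = 1) -> int:
    """Size of one densely packed image or feature-map buffer."""
    return width * height * channels * bytes_per_value

ON_CHIP_BYTES = 32 * 1024   # hypothetical local memory budget

inputs = {
    "40x40 grayscale character image": (40, 40, 1),
    "224x224 RGB classification input": (224, 224, 3),
}

for name, (w, h, c) in inputs.items():
    size = buffer_bytes(w, h, c)
    fits = "fits in" if size <= ON_CHIP_BYTES else "exceeds"
    print(f"{name}: {size:,} bytes, {fits} the on-chip budget")
```

When whole inputs and intermediate buffers fit in local memory, an implementation can avoid external memory traffic entirely, a strategy that a framework tuned for much larger networks has little reason to pursue.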

Some examples of hardware vendor-provided frameworks for efficient deployment of DNNs include (not intended to be a comprehensive list):

These particular DNN deployment frameworks are based on underlying libraries that are thoroughly optimized by the hardware vendor. In leveraging them, developers often have two implementation options:

  1. Use the deployment framework at a high level (i.e., simply feed it the network description and weights, and let automated tools do all the work), or
  2. Make direct calls to the framework's underlying library API, where this is possible, thereby obtaining better results via some manual optimization of the implementation architecture.

Because the underlying library is so well optimized by the hardware vendor, both of these options can end up delivering better performance than what a developer could practically achieve otherwise, given limited budget and schedule, even though, as previously noted, a full-custom implementation would theoretically yield even better results.

One other note: since the deployment frameworks provided by hardware vendors need to be flexible enough to support various topologies, they often leave the first level of optimization listed at the beginning of this essay, algorithm optimization, to the developer. If you can design and train a smaller topology (measured by some combination of fewer weights, fewer activations, and/or fewer operations), you’ll likely end up with better performance. This may be the case even if your tailored implementation is seemingly less optimal because your resultant network now doesn't look like one of the well-known topologies for which the framework was initially designed.

Therefore, if you’ve selected a well-known topology that your research suggested would give you the smallest/fastest/etc. outcomes for your application, trained it, and ended up with acceptable results, ask yourself the following question: is the accuracy of your trained network limited by the topology design, or by the training methods (choice of optimizer, hyperparameters), or by limitations of the training data, or by fundamental limitations of the application? It’s possible that you could shrink feature maps, remove layers, or tweak the network topology in various other ways to make the network smaller without losing much (if any) accuracy.
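The payoff from such changes can be estimated before any retraining. The following sketch counts the weights and multiply-accumulate operations of a single standard convolution layer, ignoring biases and assuming every output position is computed; the layer dimensions are hypothetical.

```python
def conv_layer_cost(in_ch: int, out_ch: int, kernel: int,
                    out_h: int, out_w: int) -> tuple[int, int]:
    """Return (weights, MACs) for one standard convolution layer, biases ignored."""
    weights = kernel * kernel * in_ch * out_ch
    macs = weights * out_h * out_w        # every weight is used once per output position
    return weights, macs

# Hypothetical 3x3 layer and two ways of shrinking it.
variants = {
    "baseline: 64 -> 64 channels, 56x56 output": (64, 64, 3, 56, 56),
    "half the channels: 32 -> 32, 56x56 output": (32, 32, 3, 56, 56),
    "half the feature map: 64 -> 64, 28x28 output": (64, 64, 3, 28, 28),
}

for name, args in variants.items():
    weights, macs = conv_layer_cost(*args)
    print(f"{name}: {weights:,} weights, {macs:,} MACs")
```

Halving the channel counts cuts both weights and MACs by a factor of four, while halving the feature map dimensions cuts MACs by the same factor but leaves the weight count unchanged. Which lever to pull therefore depends on whether storage or computation is the tighter constraint.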

Conclusion

Vision technology is enabling a wide range of products that are more intelligent and responsive than before, and thus more valuable to users. Vision processing can add valuable capabilities to existing products, and can provide significant new markets for hardware, software and semiconductor suppliers. Deep learning-based vision processing is an increasingly popular and robust alternative to classical computer vision algorithms, but it tends to be comparatively resource-intensive, which is particularly problematic for resource-constrained embedded system designs. However, by effectively leveraging all available heterogeneous computing nodes, efficiently utilizing memory and interconnect bandwidth (both between various processors and their local and shared memory), and harnessing leading-edge software tools and techniques, it's possible to develop a deep learning-based embedded vision design that achieves necessary accuracy and inference speed levels while still meeting size, weight, cost, power consumption and other requirements.

By Brian Dipert
Editor-in-Chief, Embedded Vision Alliance
Senior Analyst, BDTI

Sidebar: Additional Developer Assistance

The Embedded Vision Alliance, a worldwide organization of technology developers and providers, is working to empower product creators to transform the potential of vision processing into reality. Au-Zone Technologies, BDTI and Synopsys, the co-authors of this article, are members of the Embedded Vision Alliance. The Embedded Vision Alliance's mission is to provide product creators with practical education, information and insights to help them incorporate vision capabilities into new and existing products. To execute this mission, the Embedded Vision Alliance maintains a website providing tutorial articles, videos, code downloads and a discussion forum staffed by technology experts. Registered website users can also receive the Embedded Vision Alliance’s twice-monthly email newsletter, Embedded Vision Insights, among other benefits.

The Embedded Vision Alliance’s annual technical conference and trade show, the Embedded Vision Summit, is coming up May 22-24, 2018 in Santa Clara, California. Intended for product creators interested in incorporating visual intelligence into electronic systems and software, the Embedded Vision Summit provides how-to presentations, inspiring keynote talks, demonstrations, and opportunities to interact with technical experts from Embedded Vision Alliance member companies. The Embedded Vision Summit is intended to inspire attendees' imaginations about the potential applications for practical computer vision technology through exciting presentations and demonstrations, to offer practical know-how for attendees to help them incorporate vision capabilities into their hardware and software products, and to provide opportunities for attendees to meet and talk with leading vision technology companies and learn about their offerings. More information, along with online registration, is now available.

The Embedded Vision Alliance also offers a free online training facility for vision-based product creators: the Embedded Vision Academy. This area of the Embedded Vision Alliance website provides in-depth technical training and other resources to help product creators integrate visual intelligence into next-generation software and systems. Course material in the Embedded Vision Academy spans a wide range of vision-related subjects, from basic vision algorithms to image pre-processing, image sensor interfaces, and software development techniques and tools such as OpenCL, OpenVX and OpenCV, along with Caffe, TensorFlow and other deep learning frameworks. Access is free to all through a simple registration process.

The Embedded Vision Alliance and its member companies periodically deliver webinars on a variety of technical topics, including various deep learning subjects. Access to on-demand archive webinars, along with information about upcoming live webinars, is available on the Alliance website. Also, the Embedded Vision Alliance is offering "Deep Learning for Computer Vision with TensorFlow," a series of one- and three-day technical training classes planned for a variety of U.S. and international locations. See the Alliance website for additional information and online registration.

Sidebar: Optimization at the System Level, An ADAS Case Study

The design project discussed in the following case study from Au-Zone Technologies, developed in partnership with fellow Alliance member company NXP Semiconductors, showcases real-life implementations of the heterogeneous computing and software optimization concepts covered in the main article, along with other system-level fine-tuning optimization techniques.

As with any computer vision design exercise, system-level optimization must be taken into consideration when creating deep learning-based vision processing solutions. Considering the raw volume of data created by a real-time video stream, the compute horsepower required for real-time CNN processing of the stream, and the limitations imposed by embedded SoCs, optimization at every opportunity during development and deployment is essential to creating commercially practical solutions.

To make these system-level design optimization concepts more tangible, this case study describes the development of a practical TSR (traffic sign recognition) solution for deployment on NXP Semiconductors' i.MX8 processors. In this example, we focus on the principal areas of system optimization within the constraints imposed by a typical embedded system:

  1. Deploying a deep learning-based vision processing pipeline on a heterogeneous platform
  2. Neural network design choices and tradeoffs
  3. Inference engine performance optimizations

Although this case study focuses on a specific image classification problem and processor architecture, the design methodology and optimization principles can be generalized to solve many different embedded vision processing and classification problems, on divergent hardware.

System Design Objectives and Optimization Approaches

The overall objective of this project was to design a practical object detection and classification solution that uses as little compute horsepower as possible and is characterized by the following boundary conditions:

  • Robust TSR detection and classification using only standard optics, sensors and processor(s)
  • Real-time processing of the video stream coming from a camera
  • Classification accuracy greater than 98% on test data, and greater than 90% on live video
  • Low-light performance suitable for automotive market applications
  • A cabling solution able to provide 2-3m separation between the camera head and compute engine
  • The use of automotive-grade silicon (sensor, processor and supporting components)
  • The use of best-in-class computer vision techniques to implement the vision pipeline
  • An overall system cost that is appropriate for the market application

Focusing on an overall system optimization forces the developer to consider many different, often conflicting, design and performance aspects to find a balance that is appropriate for a given application. The following diagram highlights some of the most common parameters that need to be considered in any development program (Figure A). Depending on the application, market and performance objectives, the relative priority of these parameters will vary.


Figure A. The relative priority of various common system design optimization parameters will vary depending on the particular application and other variables (courtesy Au-Zone Technologies).

Taking a holistic approach to maximize accuracy while minimizing compute cost (time and energy), production hardware cost (the engineering bill of materials, or eBOM) and engineering effort often becomes a challenging balancing act for designers. System design for a deep learning-based vision solution is an iterative process; you will need both a starting point and a general framework for approaching the problem in order to deliver a commercially viable solution within a resource-constrained design.

Although there is no single recipe to follow, the following sequence provides a generalized framework for your consideration. Assuming the objective for your project is to implement a real-time classification system with optimum accuracy on the lowest-cost hardware, you should consider optimizations in the following order of diminishing returns:

  1. Minimize data volume and processing requirements
  2. Distributed processing, pipelines, and heterogeneous computing
  3. Optimization of individual vision pipeline stages
    1. Image Processing
    2. Object Detection
    3. Object Classification

Minimize Data Volume and Processing Requirements

It may seem obvious, but an easy trap to fall into when designing an embedded imaging system involves using commonly available cameras and image sensors, which are designed to provide extremely high image quality. The development team architecting the hardware may identify sensors that are convenient from a hardware design perspective, for example, without having visibility into how this decision may make the downstream processing problem overly complex or even impossible to solve on a given platform. Often, such devices produce data streams with image quality and resolution far exceeding what is actually required to solve common recognition or classification problems, making all subsequent processing steps much more complicated than necessary.

Specific image sensor (and camera) parameters affecting the data rate include:

  • Color vs monochromatic capture
  • Pixel depth
  • Frame rate
  • Image resolution (i.e., what is the minimum viable resolution necessary to solve the problem?)
  • Scaling (nearest-neighbor or interpolation)
  • Binning (2x2, 4x4 grouping)
  • Data type/format/color space (e.g., RAW, Bayer, YUV, RGB, etc. (Figure B))

Pursuing all opportunities to minimize data input and reduce complexity at the front end will provide significant benefits downstream.


Figure B. The output format and other characteristics of the image sensor, such as in this Bayer filter pattern approach, can greatly affect the payload size of the data stream to be subsequently processed (courtesy Au-Zone Technologies).
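As a rough illustration of how the sensor parameters listed above compound, the following sketch estimates the raw data rate of a video stream. The function name and the example settings (a 1920 x 1080 sensor at 30 frames per second) are illustrative assumptions, not the configuration used in this project.

```python
def data_rate_bytes_per_sec(width, height, bits_per_pixel, fps, binning=1):
    """Estimate the raw sensor data rate in bytes per second.

    binning is the per-axis grouping factor (1 = none, 2 = 2x2, 4 = 4x4).
    """
    binned_width = width // binning
    binned_height = height // binning
    bits_per_frame = binned_width * binned_height * bits_per_pixel
    return bits_per_frame * fps / 8


# Hypothetical settings chosen only for illustration.
full_color = data_rate_bytes_per_sec(1920, 1080, bits_per_pixel=24, fps=30)
mono_binned = data_rate_bytes_per_sec(1920, 1080, bits_per_pixel=8, fps=30, binning=2)

print(f"24-bit color, full resolution: {full_color / 1e6:.1f} MB/s")
print(f"8-bit mono, 2x2 binning:       {mono_binned / 1e6:.1f} MB/s")
print(f"reduction factor:              {full_color / mono_binned:.0f}x")
```

Under these assumed settings, dropping color and applying 2x2 binning reduces the payload by a factor of 12 before any downstream processing begins.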

Distributed Processing, Pipelines, and Heterogeneous Computing

The Wikipedia definition for heterogeneous computing refers to "systems that use more than one kind of processor or cores. These systems gain performance or energy efficiency not just by adding the same type of processors, but by adding dissimilar coprocessors, usually incorporating specialized processing capabilities to handle particular tasks."

Given the overall computational load required by a deep learning-based vision solution running at a robust frame rate, the first significant design optimization opportunity is to map the vision pipeline to a heterogeneous compute platform with processing elements that are well suited to each stage in the pipeline. The following diagrams show the simplified mapping between our TSR pipeline and the corresponding hardware compute elements in the i.MX8-based system (Figure C).



Figure C. A simplified visualization of the vision processing pipeline used to implement this project (top), can be mapped onto various heterogeneous processing elements in the target SoC (bottom) (courtesy Au-Zone Technologies).

Optimization of Individual Vision Pipeline Stages

One consequence of this pipelined computing architecture is a derived system design optimization constraint for the entire pipeline: to ensure that the data type and format of the processed data from one stage maps well to the next. Unnecessary data movement or format conversions between any stages will result in needless compute time and energy consumption.

It’s worth noting that implementing the ‘fastest’ solution at each stage in the pipeline, and/or the fewest total number of stages to construct the pipeline, will still not always result in a fully optimized solution if translations are still required. For this reason, it becomes very important to evaluate overall end-to-end performance of the system before ‘committing’ to any specific element in the vision pipeline or supporting hardware. In this case study, the two highest-cost processing functions in the vision pipeline are detection (since the entire scene must be processed) and classification (due to the high compute requirements for the CNN). Focusing on first minimizing these two stages results, at least in this case, in the best end-to-end efficiency.

The image sensor selected for the case study is one concrete example of this kind of design optimization. In order to provide good low-light performance for automotive applications, we selected a high dynamic range image sensor which also includes a basic on-chip ISP (image signal processor) block, whose image quality optimization capabilities (de-Bayering, white balance, auto exposure and aperture/gamma corrections) relieve later pipeline stages of these tasks. However, this image sensor only has two output color format options, YUV and RAW, which imposes a color space constraint on the following stages.

In other cases, therefore, the designer may decide to use an image sensor with support for other color space output options, to perform ISP functions downstream instead, or to design a system with no ISP in order to optimize for bill-of-materials cost. Designing a deep learning system with no ISP suggests the need for a custom training dataset that has been generated using a similar imaging pipeline, thereby shifting cost from hardware to the front-end development effort. This tradeoff may be favorable if a custom dataset is required anyway.

Color Space Conversion

Since detection and classification are the two most expensive system functions in the pipeline, finding the most efficient method to implement each of them, along with providing each of them with an optimum input format, is crucial to developing an efficient solution. The principal objective of the color space conversion stage is to transform the input frames coming from the image sensor into formats well suited for each of these stages. The following diagram summarizes this process (Figure D). The ordering of the conversion sequence may seem odd at first glance, with YUV→RGB conversion performed first even though the RGB data is not consumed until the later classification stage. However, this ordering is beneficial: the overall YUV→RGB→GRY (greyscale) conversion implementation cost ends up being lower than would be the case using other approaches.


Figure D. The particular color space conversion sequence used in this project was selected in order to minimize the total implementation cost (courtesy Au-Zone Technologies).

Since the GTSRB (German Traffic Sign Recognition Benchmark) training images, for example, are RGB in format, as are the images operated on by most deep learning CNNs, this format is required for the classification stage. The lowest-cost method to convert from YUV to RGB makes use of a well-known technique described formulaically as:

R = 1.164(Y - 16) + 1.596(V - 128)
G = 1.164(Y - 16) - 0.813(V - 128) - 0.391(U - 128)
B = 1.164(Y - 16) + 2.018(U - 128)

The pipeline sequence buffers this image format for subsequent processing in the classification stage, after performing detection and identifying regions of interest.
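The YUV-to-RGB formulas above can be sketched for a single pixel as follows. This is a plain Python illustration only; the project itself runs the conversion in GPU shader kernels. The helper names and the rounding and clamping to the 0 to 255 range are assumptions added here so that the result fits an 8-bit channel.

```python
def clamp_to_byte(value):
    """Round and clamp a float to the 0..255 range of an 8-bit channel."""
    return max(0, min(255, int(round(value))))


def yuv_to_rgb(y, u, v):
    """Convert one 8-bit YUV pixel to 8-bit RGB using the formulas above."""
    luma = 1.164 * (y - 16)
    r = luma + 1.596 * (v - 128)
    g = luma - 0.813 * (v - 128) - 0.391 * (u - 128)
    b = luma + 2.018 * (u - 128)
    return clamp_to_byte(r), clamp_to_byte(g), clamp_to_byte(b)


print(yuv_to_rgb(16, 128, 128))   # darkest neutral input
print(yuv_to_rgb(235, 128, 128))  # brightest neutral input
```

With neutral chroma (U = V = 128) the three channels are equal, so the two sample pixels map to black and white respectively.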

Object Detection

Since the detection stage must process all pixels in the field of view in order to identify high-probability candidate opportunities to subsequently classify, the detection stage must be very efficient. Although implementing a CNN-based detector is possible, traditional computer vision-based detection techniques are often much more computationally efficient. These computer vision operations can also often be performed in greyscale format without loss in accuracy, thereby providing a 3:1 reduction in pixel processing at the cost of upfront conversion. The format conversion from RGB to greyscale is described formulaically as:

GRY = (R * 0.2126 + G * 0.7152 + B * 0.0722)

Due to the highly parallel nature of both of these color space conversions, they can be accomplished very efficiently, and at full resolution and frame rate, by shader kernels running on the GPU. The overall cost of these conversions is insignificant in terms of the overall end-to-end processing time, and they make the two critical stages very effective.
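For reference, the RGB-to-greyscale formula can be written per pixel as in the sketch below. The function names and the frame representation (rows of (r, g, b) tuples) are assumptions chosen for readability; a production pipeline would perform this step in a GPU shader as described above.

```python
def rgb_to_grey(r, g, b):
    """Convert one 8-bit RGB pixel to an 8-bit greyscale value."""
    return int(round(0.2126 * r + 0.7152 * g + 0.0722 * b))


def frame_to_grey(rgb_frame):
    """Convert a frame stored as rows of (r, g, b) tuples to rows of grey values."""
    return [[rgb_to_grey(*pixel) for pixel in row] for row in rgb_frame]


frame = [[(255, 255, 255), (255, 0, 0)],
         [(0, 255, 0), (0, 0, 255)]]
print(frame_to_grey(frame))
```

Each output pixel is a single value instead of three, which is the source of the 3:1 reduction in pixel processing for the detection stage.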

The candidate detection algorithm scans every frame to look for high probability regions of interest where traffic signs are likely to be present. It is a first-order search, only performing detection without any attempt to classify. Using traditional computer vision techniques, the algorithm detects the basic shapes of traffic signs (such as circles, triangles, and rectangles) in the overall scene and generates a corresponding ROI (region of interest). This is done by first extracting the edges in the image with a Sobel filter then processing those edges for contours.
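The edge extraction step can be illustrated with a minimal Sobel gradient-magnitude sketch. It uses the standard 3 x 3 Sobel kernels on a small synthetic image; the function name, the list-of-rows image representation and the choice to leave border pixels at zero are assumptions made here, and the subsequent contour processing is not shown.

```python
import math

SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_magnitude(grey):
    """Return the Sobel gradient magnitude of a 2D greyscale image.

    grey is a list of equal-length rows; border pixels are left at 0.
    """
    height, width = len(grey), len(grey[0])
    magnitude = [[0.0] * width for _ in range(height)]
    for row in range(1, height - 1):
        for col in range(1, width - 1):
            gx = gy = 0
            for dr in range(3):
                for dc in range(3):
                    pixel = grey[row + dr - 1][col + dc - 1]
                    gx += SOBEL_X[dr][dc] * pixel
                    gy += SOBEL_Y[dr][dc] * pixel
            magnitude[row][col] = math.hypot(gx, gy)
    return magnitude


# A synthetic 5 x 6 image with a vertical step edge between columns 2 and 3.
image = [[0, 0, 0, 100, 100, 100] for _ in range(5)]
edges = sobel_magnitude(image)
print(edges[2])
```

The response is non-zero only in the two columns adjacent to the step, which is the kind of edge map a contour-finding pass would then consume.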

Once regions of interest are determined by the detection stage, the corresponding ROI parameters are then passed to the CNN stage of the pipeline for classification. The overriding goal of the detection stage is high recall, i.e., false positives are acceptable as long as false negatives (missed detections) are minimized. This tradeoff ensures that all reasonable candidates will pass to the CNN stage for classification; the CNN will handle rejections as part of its subsequent processing.

Object Classification

With one or more high probability candidate regions identified in a scene, the specific pixel data from the RGB image for each ROI is passed to the CNN for classification. Any given scene may contain multiple candidates that require classification, which are therefore processed sequentially (Figure E). Since the distance to the object is variable, and since the input side of the neural network is fixed in resolution, the neural network candidate ROIs are first scaled to match the neural network input.


Figure E. Any given scene may contain multiple traffic sign recognition candidates requiring sequential classification (courtesy Au-Zone Technologies).
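Scaling a variable-size ROI to the fixed network input can be sketched with nearest-neighbor sampling, one of the scaling options mentioned earlier. The function name and the tiny 2 x 2 example are assumptions for illustration; the article does not state which interpolation method the project uses.

```python
def resize_nearest(roi, out_height, out_width):
    """Scale a 2D ROI (list of rows) to a fixed size using nearest-neighbor sampling."""
    in_height, in_width = len(roi), len(roi[0])
    return [
        [roi[r * in_height // out_height][c * in_width // out_width]
         for c in range(out_width)]
        for r in range(out_height)
    ]


# A hypothetical 2 x 2 ROI upscaled to 4 x 4; a real pipeline would scale
# each ROI to the network's input resolution instead.
small_roi = [[1, 2],
             [3, 4]]
for row in resize_nearest(small_roi, 4, 4):
    print(row)
```

The same function downscales when the ROI is larger than the target size, which is the more common case for signs that are close to the camera.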

Optimizing inference involves the consideration of several factors. First and foremost is the tradeoff between execution time and accuracy; larger, more complex networks can sometimes provide higher accuracy, but only within limits. And when optimizing execution time, you should focus on two general areas: network design optimizations and runtime inference optimizations.

Network Design Optimizations

The most important aspect to consider when designing a neural network to solve a classification problem is to make sure that it is of an appropriate size and depth. Example networks can be plotted on a graph with respect to their accuracy versus compute time at various model sizes, and then grouped into three general categories based on the complexity of the image classification problem they are appropriate for solving (Figure F). One network type or size does not solve all problems, and perhaps even more importantly, larger networks are not always better if real-time performance is important for your application.


Figure F. A comparison of various network model options organizes them with respect to both their accuracy versus compute time at various model sizes, and the complexity of the image classification problems they're appropriate to solve (courtesy Au-Zone Technologies).

In examining the example networks in the medium complexity category in more detail, for example, you'll discover the multiple tradeoffs to be made within this subgroup (Table A). The relative priority of accuracy, inference time and memory usage in your design situation will be reflected in a particular ranking for the various examples.

| Network Topology | Input Resolution | Layers / Weights | Training Accuracy † | Runtime Size (MB) | Inference (msec) |
| --- | --- | --- | --- | --- | --- |
| TSR Net | 24 x 24 | 8 / 500K | 97% | 2.0 | 1.4 |
| TSR Net | 56 x 56 | 8 / 2.5M | 97% | 10.0 | 7.2 |
| Fully Convolutional (FCN) | 24 x 24 | 5 / 275K | 95% | 0.5 | 1.0 |
| Multi-Layer Perceptron (MLP) | 24 x 24 | 2 / 1M | 78% | 1.5 | 0.8 |
| ResNet * | 24 x 24 | 18 / 260K | 97% | 1.2 | 8.5 |
| SqueezeNet * | 24 x 24 | 12+ / 5M+ | 98% | 5 | 5.0 |

Table A. A comparison of medium-complexity neural networks implemented on a desktop computer (notes: †=inference accuracy after 100 epochs of training, *=custom network designs with key features derived from named network) (courtesy Au-Zone Technologies)
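One way to explore the tradeoffs in Table A is to filter the candidates by hard constraints and then rank the survivors by inference time. In the sketch below the rows are transcribed from Table A, while the function name and the constraint values (at least 97% accuracy, at most 4 MB) are hypothetical examples rather than the project's actual requirements.

```python
# Rows transcribed from Table A (desktop measurements).
TABLE_A = [
    {"name": "TSR Net 24x24", "accuracy": 97, "runtime_mb": 2.0, "inference_ms": 1.4},
    {"name": "TSR Net 56x56", "accuracy": 97, "runtime_mb": 10.0, "inference_ms": 7.2},
    {"name": "FCN 24x24", "accuracy": 95, "runtime_mb": 0.5, "inference_ms": 1.0},
    {"name": "MLP 24x24", "accuracy": 78, "runtime_mb": 1.5, "inference_ms": 0.8},
    {"name": "ResNet* 24x24", "accuracy": 97, "runtime_mb": 1.2, "inference_ms": 8.5},
    {"name": "SqueezeNet* 24x24", "accuracy": 98, "runtime_mb": 5.0, "inference_ms": 5.0},
]


def shortlist(networks, min_accuracy, max_runtime_mb):
    """Keep networks meeting the constraints, fastest first."""
    candidates = [
        net for net in networks
        if net["accuracy"] >= min_accuracy and net["runtime_mb"] <= max_runtime_mb
    ]
    return sorted(candidates, key=lambda net: net["inference_ms"])


# Hypothetical project constraints: at least 97% accuracy, at most 4 MB.
for net in shortlist(TABLE_A, min_accuracy=97, max_runtime_mb=4.0):
    print(net["name"], net["inference_ms"], "ms")
```

Changing the thresholds changes the ranking, which reflects the point that the relative priority of accuracy, inference time and memory usage determines the preferred network.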

Key takeaways from this data include the fact that the use of higher resolution inputs for training, testing and/or live video does not demonstrably improve accuracy, but will have a significant negative impact on runtime size and inference time. To be specific, further testing based on desktop computer evaluation showed that the 24 x 24 input resolution setting was optimal. Also, slightly reducing the accuracy expectation (-2%) compared to a fully convolutional network enabled reducing inference time by more than 6 ms.

With the TSRnet topology, for example, calculating the distribution of compute time across the network layers enables a developer to quickly focus on layers that are consuming the most compute time, targeting them for further optimization (Figure G). If bottlenecks become apparent with a given network model, the developer can modify the network to simplify or eliminate layers, subsequently retraining and retesting quickly.




Figure G. With the TSRnet topology (top), calculating the distribution of compute time across the network layers rapidly identifies layers consuming the most compute time (middle), which are targets for further optimization (bottom) (courtesy Au-Zone Technologies)
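A per-layer timing breakdown of this kind can be gathered with a small profiling loop. The sketch below is a generic illustration, not the tooling used for Figure G: the function name is an assumption, and the three "layers" are hypothetical stand-in functions, since a real network would call into its inference engine.

```python
import time


def profile_layers(layers, sample):
    """Run sample through each (name, function) layer and time every step.

    Returns a list of (name, seconds, share_of_total) tuples.
    """
    timings = []
    data = sample
    for name, layer in layers:
        start = time.perf_counter()
        data = layer(data)
        timings.append((name, time.perf_counter() - start))
    total = sum(seconds for _, seconds in timings) or 1e-12
    return [(name, seconds, seconds / total) for name, seconds in timings]


# Hypothetical stand-in layers used only to exercise the profiler.
layers = [
    ("conv1", lambda x: [v * 0.5 for v in x for _ in range(4)]),
    ("pool1", lambda x: x[::4]),
    ("dense", lambda x: [sum(x)]),
]

report = profile_layers(layers, sample=[1.0] * 10_000)
for name, seconds, share in sorted(report, key=lambda item: item[1], reverse=True):
    print(f"{name:6s} {seconds * 1e3:8.3f} ms  {share:6.1%}")
```

Sorting by elapsed time puts the most expensive layers first, which are the natural candidates for simplification or removal.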

Runtime Inference Optimizations

With a suitable neural network design capable of solving the classification problem evaluated on a desktop computer to use as a starting point, the next obstacles to overcome are to implement that network on the resource-constrained embedded target, and then optimize the embedded implementation for efficiency. When implementing the runtime on the embedded target, the developer has several options, each with tradeoffs, to consider:

  1. Code the model directly using existing libraries such as ACL, BLAS, Eigen and processor-specific methods. This is a particularly good option if few network design iterations or processor-specific optimizations are required to reach an appropriate solution.
  2. Modify existing open-source training frameworks to perform inference on your embedded platform.
  3. Leverage TensorFlow Lite to retrain an existing model on your data. This is a particularly good option if the problem you’re working on fits well with existing network design and supported core operations. Also, this option straightforwardly targets Android and iOS platforms.
  4. Implement a dynamic inference engine capable of loading and evaluating models. This is the "long game" option, requiring the most upfront investment to achieve a fully optimized solution capable of loading and evaluating different network topologies. The benefit of this investment is an engine fully optimized for any particular processor architecture and fully independent of the network design.

For this case study, we pursued option 4, using the DeepViewRT Run Time Inference Engine to target execution of the neural network on the processor. Table B shows inference times evaluated on the i.MX8 processor, and using the DeepViewRT dynamic inference engine for the same networks previously described in Table A.

| Network Topology | Input Resolution | Layers / Weights | Training Accuracy † | Runtime Size (MB) | Inference (msec) |
| --- | --- | --- | --- | --- | --- |
| TSR Net | 24 x 24 | 8 / 500K | 97% | 2.0 | 2.8 |
| TSR Net | 56 x 56 | 8 / 2.5M | 97% | 10.0 | 14.0 |
| Fully Convolutional (FCN) | 24 x 24 | 5 / 275K | 95% | 0.5 | 9.5 |
| Multi-Layer Perceptron (MLP) | 24 x 24 | 2 / 1M | 78% | 1.5 | 2.5 |
| ResNet * | 24 x 24 | 18 / 260K | 95% | 1.2 | 3.9 |
| SqueezeNet * | 24 x 24 | 12+ / 5M+ | <80% | 5 | >1000 |

Table B. A comparison of medium-complexity neural networks implemented on the i.MX8 embedded target (notes: †=inference accuracy after 100 epochs of training, *=custom network designs with key features derived from named network) (courtesy Au-Zone Technologies)
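Comparing the inference columns of Tables A and B shows how unevenly the networks translate from the desktop to the embedded target. The sketch below simply divides the Table B time by the Table A time for each network; the dictionary names are assumptions, and SqueezeNet is left out because Table B gives only a lower bound for it.

```python
# Inference times in msec, transcribed from Table A (desktop) and Table B (i.MX8).
# SqueezeNet is omitted because Table B only reports ">1000".
DESKTOP_MS = {"TSR Net 24x24": 1.4, "TSR Net 56x56": 7.2, "FCN": 1.0,
              "MLP": 0.8, "ResNet*": 8.5}
EMBEDDED_MS = {"TSR Net 24x24": 2.8, "TSR Net 56x56": 14.0, "FCN": 9.5,
               "MLP": 2.5, "ResNet*": 3.9}

ratios = {name: EMBEDDED_MS[name] / DESKTOP_MS[name] for name in DESKTOP_MS}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:14s} embedded/desktop = {ratio:.2f}")
```

The ratios range from under 0.5 to 9.5, so rankings measured on the desktop do not carry over directly: the FCN is second fastest of these five on the desktop but fourth on the i.MX8.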

Regardless of the particular path you choose to implement run-time inference, the overall objective remains the same: to minimize the system compute load and time required to perform inference for a given network on a particular target processor. Optimizations for resource-constrained embedded processors cluster into two general categories: compute techniques, and processor architecture exploits.

Compute Techniques

Compute techniques are generally portable from one hardware platform to another, although the benefits seen on one processor are not always duplicated exactly on another. Some specific areas of optimization to consider are outlined in the following lists.

Neural network and linear algebra libraries:

Computational transforms:

Separable convolutions:

  • Computational data format ordering: NHWC, NCHW, etc.
    • N refers to the number of images in a batch.
    • H refers to the number of pixels in the vertical (height) dimension.
    • W refers to the number of pixels in the horizontal (width) dimension.
    • C refers to the channels. For example, 1 for black and white or grayscale and 3 for RGB.
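The practical difference between these orderings is where neighboring values land in memory. The following sketch computes the flat buffer offset of one element under the NCHW and NHWC layouts; the function names and the 24 x 24 RGB example are assumptions used for illustration.

```python
def offset_nchw(n, c, h, w, C, H, W):
    """Flat index of element (n, c, h, w) when each channel is stored as a full plane."""
    return ((n * C + c) * H + h) * W + w


def offset_nhwc(n, h, w, c, C, H, W):
    """Flat index of element (n, h, w, c) when channels are interleaved per pixel."""
    return ((n * H + h) * W + w) * C + c


# A hypothetical 24 x 24 RGB input: distance in memory between the R and G
# values of the same pixel under each layout.
C, H, W = 3, 24, 24
print(offset_nchw(0, 1, 0, 0, C, H, W) - offset_nchw(0, 0, 0, 0, C, H, W))
print(offset_nhwc(0, 0, 0, 1, C, H, W) - offset_nhwc(0, 0, 0, 0, C, H, W))
```

In NCHW the two values are a whole image plane apart (576 elements here), while in NHWC they are adjacent, which is why the preferred ordering depends on how a given kernel walks through the data.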

Data I/O optimizations:

  • Data reuse
  • Dimensional ordering
  • Image data reuse
  • Filter data reuse
  • Tiling: blocking vs linear
  • Kernel and layer fusing

Processor Architecture Exploits

Hardware-specific optimizations are typically tied to a particular processor architecture and are not assumed to be portable. Some specific examples are outlined in the following lists.

Processor core type and instruction set

  • CPU
  • GPU
  • Vision DSP
  • FPGA
  • Neural network processor

Memory use optimization

  • Registers
  • L1 and L2 cache
  • SRAM vs DRAM, and standard vs DDR (double data rate) memory interfaces
  • Cache line/non-strided accesses
  • Float, half, fixed data formats
  • Bit width tradeoffs
  • Non-linear quantization
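To make the bit width tradeoff concrete, the sketch below quantizes a few floating-point values to 8-bit integer codes and measures the round-trip error. It shows symmetric linear quantization only, which is the simplest scheme and not the non-linear variant listed above; the function names and sample weights are hypothetical.

```python
def quantize_int8(values):
    """Symmetric linear quantization of floats to the signed range -127..127.

    Returns the integer codes and the scale needed to reconstruct them.
    """
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / 127
    codes = [int(round(v / scale)) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [code * scale for code in codes]


weights = [-1.0, 0.0, 0.25, 1.0]  # hypothetical sample weights
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, f"worst-case error = {worst_error:.5f}")
```

Each value now occupies one byte instead of four, at the cost of an error bounded by half the quantization step.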

Brad Scott
President, Au-Zone Technologies

Sidebar: Optimization at the System Level, An Object Recognition Case Study

In developing an optimized deep learning-based embedded vision design, putting realistic constraints on the characteristics of images (and kinds of objects in those images) that the classifier will be tasked with handling can be quite impactful in terms of reducing the required implementation resources. However, it's equally important to expand those constraints as needed to comprehend all possible image data scenarios. The following case study from BDTI, based on a real-life research project for a client, covers both of these points.

The goal of this project was to train a classifier that achieved very high accuracy (above 99%) on 25 object categories, while also being practical to implement on an embedded platform. The project was very exploratory in nature; the target hardware had not yet been defined, so we had no specific constraints on compute load, storage, or bandwidth. However, we could not assume that "cloud" connectivity was available; inference had to happen entirely in the embedded system.

We knew that we could make some reasonable assumptions about the data: the object of interest was always in a known region of the image, for example. The distance from the camera to the object of interest would also be measured separately from the image classifier, so that the classifier could assume that images were appropriately scaled and cropped using this information. These assumptions meant that the inputs to the classifier would be fairly uniform in position and scale, and we would therefore be able to design a “lightweight” classifier for the job.

In designing a classifier CNN based on these assumptions, we selected kernel sizes, feature map sizes, etc. to limit the compute load and memory footprint (including the number of weights and the size of activation matrices). To limit the effort of implementing the CNN on an embedded CPU/DSP, we also deliberately avoided normalization layers and other features that would have increased coding effort.

We were initially given both a training dataset and a validation dataset. At the end of the project, we would also receive additional test sets, which would include input conditions not covered in the training and validation datasets. By measuring accuracy on these test sets, we would be able to determine how well the initial training had generalized to the additional situations in the test sets. However, we were not allowed to see the test sets in advance, and we did not know what kinds of conditions would be present in the test sets. In general, this is non-ideal practice: a good rule of thumb is that artificial neural networks only learn what they are shown, and they can’t be expected to generalize to conditions not covered in training. Unfortunately in this case we didn’t even obtain any up-front information about the range of input conditions in the test sets.

The training and validation datasets consisted of images captured under uniform and otherwise optimum conditions: good visibility of the objects of interest, good lighting, etc. We were therefore able to rapidly design a lightweight CNN classifier and train it to achieve over 99.9% accuracy on the validation dataset. But we knew that the training and validation sets' conditions were too close to ideal, and it was therefore very unlikely that this initial training would generalize well to more diverse conditions. So we next used simple image processing techniques to simulate more challenging conditions such as occlusions, shadows, and other types of image noise.

We created a new validation set with these simulated challenges. On it, the network we had originally trained scored only 66.7% accuracy, which was not at all surprising because it had not been trained to deal with the simulated challenges. We then added these same simulated challenge conditions to the training set. We also created a custom layer in Caffe to simulate random occlusions on the fly during training. We retrained the network, which then achieved 100% accuracy on the original validation set, and 99.2% accuracy on the simulated-challenge validation set. And on the three test sets that we subsequently received, our network accuracy scores were as high as 99.9%.
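The idea of simulating random occlusions during training can be sketched in a few lines of plain Python. This is not the custom Caffe layer used in the project, whose details are not given here; the function name, the single-rectangle occlusion model and the parameter defaults are all assumptions.

```python
import random


def random_occlusion(image, rng, max_fraction=0.5, fill=0):
    """Return a copy of image with one random rectangle overwritten by fill.

    image is a list of rows; max_fraction bounds the rectangle's height and
    width relative to the image.
    """
    height, width = len(image), len(image[0])
    box_h = rng.randint(1, max(1, int(height * max_fraction)))
    box_w = rng.randint(1, max(1, int(width * max_fraction)))
    top = rng.randint(0, height - box_h)
    left = rng.randint(0, width - box_w)
    occluded = [row[:] for row in image]
    for r in range(top, top + box_h):
        for c in range(left, left + box_w):
            occluded[r][c] = fill
    return occluded


rng = random.Random(0)  # fixed seed so the sketch is repeatable
clean = [[255] * 8 for _ in range(8)]
augmented = random_occlusion(clean, rng)
print(sum(row.count(0) for row in augmented), "pixels occluded")
```

Applying a fresh random rectangle each time an image is drawn means the network rarely sees the same occlusion twice, without enlarging the stored training set.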

We estimate that the CNN we designed could execute inference in a few hundred msec on a fully loaded 1 GHz NEON-capable ARM core, with no need for a GPU as a co-processor, making it a very lightweight and otherwise embedded-friendly design. Note that since this project focused only on initial network topology design and training, subsequent pruning or quantizing of the network was not explored, and memory "footprint" estimates are therefore not available. However, we believe that pruning and quantization (with retraining) would be very effective, with a negligible resultant reduction in accuracy.

Implementing High-performance Deep Learning Without Breaking Your Power Budget

This article was originally published at Synopsys' website. It is reprinted here with the permission of Synopsys.

Computer Vision in Surround View Applications

The ability to "stitch" together (offline or in real-time) multiple images taken simultaneously by multiple cameras and/or sequentially by a single camera, in both cases capturing varying viewpoints of a scene, is becoming an increasingly appealing (if not necessary) capability in an expanding variety of applications. High quality of results is a critical requirement, one that's a particular challenge in price-sensitive consumer and similar applications due to their cost-driven quality shortcomings in optics, image sensors, and other components. And quality and cost aren't the sole factors that bear consideration in a design; power consumption, size and weight, latency and other performance metrics, and other attributes are also critical.

Seamlessly combining multiple images capturing varying perspectives of a scene, whether taken simultaneously from multiple cameras or sequentially from a single camera, is a feature which first gained prominence with the so-called "panorama" mode supported in image sensor-equipped smartphones and tablets. Newer smartphones offer supplemental camera accessories capable of capturing a 360-degree view of a scene in a single exposure. The feature has also spread to a diversity of applications: semi- and fully autonomous vehicles, drones, standalone consumer cameras and professional multi-camera capture rigs, etc. And it's now being used to not only deliver "surround" still images but also high frame rate, high resolution and otherwise "rich" video. The ramping popularity of various AR (augmented reality) and VR (virtual reality) platforms for content playback has further accelerated consumer awareness and demand.

Early, rudimentary "stitching" techniques produced sub-par quality results, thereby compelling developers to adopt more advanced computational photography and other computer vision algorithms. Computer vision functions that will be showcased in the following sections implement seamless "stitching" of multiple images together, including aligning features between images and balancing exposure, color balance and other characteristics of each image. Dewarping to eliminate perspective and lens distortions is critical to a high quality result, as is calibration to adjust for misalignment between cameras (as well as to correct for alignment shifts over time and use). Highlighted functions include those for ADAS (advanced driver assistance systems) and autonomous vehicles, as well as for both professional and consumer video capture setups; the concepts discussed will also be more broadly applicable to other surround view product opportunities. And the article also introduces readers to an industry alliance created to help product creators incorporate vision capabilities into their hardware and software, along with outlining the technical resources that this alliance provides (see sidebar "Additional Developer Assistance").

Surround View for ADAS and Autonomous Vehicles

The following essay was written by Alliance member company videantis and a development partner, ADASENS. It showcases a key application opportunity for surround view functions: leveraging the video outputs of multiple cameras to deliver a distortion-free and comprehensive perspective around a car to human and, increasingly, autonomous drivers.

In automotive applications, surround view systems are often also called 360-degree video systems. These systems increase driver visibility, which is a valuable capability when undertaking low-speed parking maneuvers, for example. They present a top-down (i.e. "bird’s-eye") view of the vehicle, as if the driver was positioned above the car. Images from multiple cameras combine into a single perspective, presented on a dashboard-mounted display. Such systems typically use 4-6 wide-angle cameras, mounted on the rear, front and sides of the vehicle, to capture a full view of the surroundings. And the computer vision-based driver safety features they support, implementing various image analysis techniques, can warn the driver or even partially-to-completely autonomously operate the vehicle.

Surround view system architectures implement two primary, distinct functions:

  • Camera calibration: in order to combine the multiple camera views into a single image, knowledge of each camera’s precise intrinsic and extrinsic parameters is necessary.
  • Synthesis of the multiple video streams into a single view: this merging-and-rendering task combines the images from the different cameras into a single natural-looking image, and re-projects that resulting image on the display.

Calibration

In order to successfully combine the images captured from the different cameras into a single view, it's necessary to know the extrinsic parameters that represent the location and orientation of the camera in the 3D space, as well as the intrinsic parameters that represent the optical center and focal length of the camera. These parameters may vary on a per-camera basis due to manufacturing tolerances in the factory; they can also change after the vehicle has been manufactured due to the effects of accidents, temperature variations, etc. Each camera's extrinsic parameters can even be affected by factors such as vehicle load and tire pressure. Therefore, camera calibration must be repeated at various points in time: during camera manufacturing, during vehicle assembly, at each vehicle start, and periodically while driving the car (Figure 1).


Figure 1. Multiple calibration steps, at various points in time throughout an automotive system's life, are necessary in order to accurately align images (courtesy ADASENS and videantis).
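The role of the two parameter sets can be illustrated with an idealized pinhole camera model: the extrinsic parameters move a world point into the camera's coordinate frame, and the intrinsic parameters map it to a pixel. The sketch below deliberately ignores lens distortion, which matters for the wide-angle cameras used in surround view systems; the function name and all numeric values are hypothetical.

```python
def project_point(point, rotation, translation, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates with a pinhole model.

    rotation (3x3) and translation (3) are the extrinsic parameters;
    fx, fy (focal lengths in pixels) and cx, cy (optical center) are the intrinsics.
    """
    cam = [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    x, y, z = cam
    if z <= 0:
        raise ValueError("point is behind the camera")
    return fx * x / z + cx, fy * y / z + cy


IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

# Hypothetical camera aligned with the world frame.
print(project_point([1.0, 2.0, 4.0], IDENTITY, [0.0, 0.0, 0.0],
                    fx=100, fy=100, cx=50, cy=40))
```

A small error in any of these parameters shifts where every point lands in the image, which is why misalignment between cameras shows up as visible seams unless calibration is repeated.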

One possibility, and a growing trend, is to integrate this calibration capability within the camera itself, in essence making the cameras in a surround view system self-aware. Calibration while the car is driving, for example, is known as marker-less or target-less calibration. In addition, the camera should also be able to diagnose itself and signal the operator if the lens becomes dirty or blocked, both situations that would prevent the surround view system from operating error-free. Signaling the driver (or car) in cases when the camera can’t function properly is particularly important in situations involving driver assistance or fully automated driving. Two fundamental algorithms, briefly discussed here, address the desire to make cameras self-aware: target-less calibration based on optical flow, and view-block detection based on machine learning techniques.

Calibration can be performed using vanishing point theory, which estimates the position and orientation of the camera in 3D space (Figure 2). Consecutive frames from a monocular camera, in combination with CAN (Controller Area Network) bus-sourced data such as wheel speeds and the steering wheel angle, are inputs to the algorithm. The vanishing point is a virtual point in 2D image coordinates at which the 2D projections of a set of lines that are parallel in 3D space converge. Using this method, the roll, pitch and yaw angles of a camera can be estimated with accuracies of up to 0.2°, and without need for markers. Such continuously running calibration is also necessary in order to adapt the camera to short and long-term changes, and it can be used as an input for other embedded vision algorithms such as a crossing-traffic alert, which calculates time-to-collision that is then used to trigger a warning to the driver.


Figure 2. The vanishing point technique, which estimates the position and orientation of a camera in 3D space, is useful in periodically calibrating it (courtesy ADASENS and videantis).
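The geometric core of the vanishing point idea is finding where two image lines meet. The sketch below does this with homogeneous coordinates, where both the line through two points and the intersection of two lines are cross products. It covers only this one step, not the optical flow or CAN data processing described above, and the function names and sample coordinates are hypothetical.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def line_through(p, q):
    """Homogeneous line through two 2D image points."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def intersection(line_a, line_b):
    """Intersection of two homogeneous lines, or None if they are parallel in the image."""
    x, y, w = cross(line_a, line_b)
    if abs(w) < 1e-12:
        return None
    return x / w, y / w


# Two hypothetical image lines, e.g. the projections of two lane markings.
left_marking = line_through((0.0, 1.0), (2.0, 2.0))
right_marking = line_through((0.0, 7.0), (2.0, 5.0))
print(intersection(left_marking, right_marking))
```

In practice many noisy line or flow estimates would be combined, so the intersection is usually found by a least-squares or voting scheme rather than from a single pair of lines.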

Soil or blockage detection is based on the extraction of image quality metrics such as sharpness and saturation (Figure 3). This data can then trigger a cleaning system, for example, as well as provide a confidence level to other system functions, such as an autonomous emergency braking system that may not function correctly if one or multiple cameras are obscured. Soil and blockage detection extracts prominent image quality metrics, which are combined into a feature vector and "learned" using a support vector machine, which performs discriminative feature identification. Temporal filtering, along with a hysteresis function, is also incorporated in the algorithm in order to prevent "false positives" due to short-term changes and soil-based flickering.


Figure 3. A camera lens that becomes dirty or blocked could prevent the surround view system from operating error-free (courtesy ADASENS and videantis).
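The temporal filtering and hysteresis step described above can be sketched as follows. This is an illustrative assumption about how such a stage might look, not the authors' code: the class name, the exponential smoothing and the threshold values are all hypothetical, and the per-frame score stands in for the classifier output.

```python
class BlockageDetector:
    """Smooths a per-frame 'blocked' score and applies hysteresis to the decision."""

    def __init__(self, alpha=0.2, on_threshold=0.7, off_threshold=0.3):
        self.alpha = alpha                  # weight of the newest frame in the running average
        self.on_threshold = on_threshold    # smoothed score needed to raise the flag
        self.off_threshold = off_threshold  # smoothed score needed to clear the flag
        self.smoothed = 0.0
        self.blocked = False

    def update(self, score):
        """score: per-frame classifier output in [0, 1]. Returns the filtered flag."""
        # Temporal filtering: exponential moving average of the raw score.
        self.smoothed += self.alpha * (score - self.smoothed)
        # Hysteresis: different thresholds for switching on and switching off.
        if not self.blocked and self.smoothed >= self.on_threshold:
            self.blocked = True
        elif self.blocked and self.smoothed <= self.off_threshold:
            self.blocked = False
        return self.blocked
```

With these settings a single high-scoring frame does not raise the flag, and once raised the flag is not cleared by scores that merely hover around the middle of the range, which is the flicker suppression the text describes.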

Merging and Rendering

The next step in the process involves combining the captured images into a unified surround view image that can then be displayed. It involves re-projecting the camera-sourced images, originally taken from different viewpoints, and more generally merging the distinct video streams. In addition to the live images themselves, it utilizes the various virtual cameras' viewpoint parameters, as well as the characteristics of the surface that the rendered image will be re-projected onto.

One common technique is to simply project the camera images onto a plane that represents the ground. However, this approach results in various distortions; for example, objects rising above the ground plane, such as pedestrians, trees and street lights, will be unnaturally stretched out (Figure 4). The resulting unnatural images make it harder for the driver to accurately gauge distances to various objects. A common improved technique is to render the images onto a bowl-shaped surface instead. This approach results in a less-distorted final image, but it still contains some artifacts. Ideally, therefore, the algorithm would re-project the cameras' images onto the actual 3D structure of the vehicle’s surroundings.



Figure 4. Projection onto the ground plane tends to flatten and stretch objects (top); projection onto a "bowl" surface instead results in more natural rendering (bottom) (courtesy ADASENS and videantis).
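The stretching effect of ground-plane projection follows from simple geometry. The sketch below assumes an idealized pinhole camera mounted at a known height, with the function name and the numbers chosen purely for illustration: the ray from the camera through the top of an object is extended until it meets the ground, which is where a ground-plane renderer will paint that point.

```python
def ground_plane_footprint(camera_height, distance, object_height):
    """Ground distance at which the top of an object is drawn by ground-plane projection.

    camera_height: height of the camera above the ground (metres)
    distance:      horizontal distance from the camera to the object's base (metres)
    object_height: height of the object (metres)
    """
    if object_height >= camera_height:
        # The ray through the top of the object never descends to the ground.
        return float("inf")
    # Similar triangles: the ray from (0, H) through (d, h) reaches height 0 at d*H/(H-h).
    return distance * camera_height / (camera_height - object_height)


# A 0.5 m post standing 2 m away, seen from a camera mounted 1 m above the ground.
top_lands_at = ground_plane_footprint(1.0, 2.0, 0.5)
```

In this example the base of the post is drawn at 2 m but its top lands at 4 m, so a 0.5 m object is smeared across 2 m of the rendered ground. Objects as tall as the camera or taller stretch toward the horizon.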

System Architecture Alternatives

One typical system implementation encompasses the various cameras along with a separate ECU (electronic control unit) "box" that converts the multiple camera streams into a single surround view image, which is then forwarded to the head unit for display on the dashboard (Figure 5). Various processing architectures for the calibration, computer vision, and rendering tasks are available. Some designs leverage multi-core CPUs or GPUs, while other approaches employ low-cost and lower-power vision processors. Videantis' v-MP4000HDX and v-MP6000UDX vision processor families, for example, efficiently support all required visual computing tasks, including calibration, computer vision, and rendering. Camera interface options include LVDS and automotive Ethernet; in the latter case, videantis' processors can also handle the requisite H.264 video compression and decompression, thereby unifying all necessary visual processing at a common location.



Figure 5. One common system architecture option locates surround view and other vision processing solely in the ECU, which then sends rendered images to the head unit (top). Another approach subdivides the vision processing between the ECU and the cameras themselves (bottom) (courtesy ADASENS and videantis).

Another prevalent system architecture incorporates self-aware cameras, thereby reducing the complexity in their common surround view ECU. This approach provides an evolutionary path toward putting even more intelligence into the cameras, resulting in a scalable system with multiple options for the car manufacturer to easily provide additional vision-based features. Enabling added functionality involves upgrading to more powerful and intelligent cameras; the base cost of the simplest setup remains low. Such an approach matches up well with the car manufacturer's overall business objectives: providing multiple options for the consumer to select from while purchasing the vehicle.

Marco Jacobs
Vice President of Marketing, videantis

Florian Baumann
Technical Director, ADASENS

Surround View in Professional Video Capture Systems

Surround video and VR (virtual reality) are commonly (albeit mistakenly) interchanged terms; as Wikipedia notes, "VR typically refers to interactive experiences wherein the viewer's motions can be tracked to allow real-time interactions within a virtual environment, with orientation and position tracking. In 360-degree video, the locations of viewers are fixed, viewers are limited to the angles captured by the cameras, and [viewers] cannot interact with the environment." With that said, VR headsets (whether standalone or smartphone-based) are also ideal platforms for live- or offline-viewing both 180- and 360-degree video content, which in some cases offers only an expanded horizontal perspective but in other cases delivers a full spherical display controlled by the viewer's head location, position and motion. The following essay from AMD describes the implementation of Radeon Loom, a surround video capture setup intended for professional use, thereby supporting ultra-high image resolutions, high frame rates and other high-end attributes.

One of the key goals of the Radeon Loom project was to enable real-time preview of 360-degree video in a headset such as an Oculus Rift or HTC Vive, while simultaneously filming it with a high-quality cinematic camera setup (see sidebar "Radeon Loom: A Historical Perspective"). Existing solutions comprise either low-end cameras, which don't deliver sufficient quality levels for Hollywood expectations, or very expensive high-end cameras that take lengthy periods of time to produce well-stitched results. After several design iterations, AMD came up with several implementation options (Figure 6).



Figure 6. An example real-time stitching system block diagram (top) transformed into reality with AMD's Radeon Loom (bottom) (courtesy AMD).

Important details of the design include the fact that it uses a high-performance workstation graphics card, such as a FirePro W9100 or one of the newer Radeon Pro WX series. These higher-end cards support more simultaneously operating cameras, as well as higher per-camera resolutions. Specifically, Radeon Loom is using Black Magic cameras with HDMI outputs, converting them to SDI (Serial Digital Interface) via per-camera signal converters (SDI is common in equipment used in the broadcast and film industries). The Black Magic cameras support gen-lock (generator locking), which synchronizes the simultaneous start of multi-camera capture to an external sync-generator output signal. Other similar-featured (i.e. HDMI output and gen-lock input) cameras would work just as well.

Once the data is in the GPU's memory, a complex set of algorithms (to be discussed shortly) tackles stitching together all the images into a 360-degree spherical video. Once stitching is complete, the result is sent out over SDI to one or more PCs equipped with HMDs for immediate viewing and/or streaming to the Internet.

Practical issues on the placement of equipment for a real-time setup also require consideration. Each situation is unique; possible scenarios include filming a Hollywood production with a single equipment rig or broadcasting a live concert with multiple distributed cameras. With 360-degree camera arrays, for example, you don’t generally have an operator behind the camera, since he or she would then be visible in the captured video. In such a case, you would also probably want to locate the stitching and/or viewing PCs far away, or behind a wall or green screen, for example.

Why Stitching is Difficult

Before explaining how stitching works, let's begin with a brief explanation of why it's such a challenging problem to solve. If you've seen any high-quality 360-degree videos, you might have concluded that spherical stitching is a solved problem. It isn’t. With that said, however, algorithm pioneers deserve abundant credit for incrementally solving many issues with panoramic stitching and 360 VR stitching over the past few decades. Credit also goes to the companies that have produced commercial stitching products and helped bring VR authoring to the masses (or at least the early adopters).

Fundamental problems still exist, however: parallax, camera count versus seam count, and the exposure differences between sensors are only a few examples (Figure 7). Let's cover parallax first. Simply stated, two cameras in two different locations and positions will see the same object from two different perspectives, just as the same finger held close to your nose appears to have different backgrounds when sequentially viewed from each of your eyes (opened one at a time). Ironically, this disparity is what the human brain uses to determine depth when combining the images. But it also causes problems when trying to merge two separate images together and fool your eyes and brain into thinking they are one image.



Figure 7. Parallax (top) and lens distortion effects (bottom) are several of the fundamental problems that need to be solved in order to deliver high-quality stitching (courtesy AMD).
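A quick calculation shows why parallax matters most for nearby objects. The sketch below assumes two cameras separated by a known baseline and a point located straight ahead of the midpoint between them; the function name and the sample distances are illustrative only.

```python
import math


def parallax_angle_deg(baseline_m, distance_m):
    """Angle between the two cameras' lines of sight to the same point.

    The point is assumed to lie distance_m away from the midpoint of the
    baseline, along the perpendicular to the baseline.
    """
    return math.degrees(2.0 * math.atan2(baseline_m / 2.0, distance_m))


# Two lenses 10 cm apart looking at a near and a far object.
near = parallax_angle_deg(0.10, 0.5)    # object half a metre away
far = parallax_angle_deg(0.10, 10.0)    # object ten metres away
```

With a 10 cm baseline the two views disagree by roughly 11.4 degrees for an object 0.5 m away, but by only about 0.57 degrees at 10 m. This is why seams that look clean on distant scenery can break up when a person walks close to the rig.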

The second issue: more cameras are generally better, because you end up with a higher effective resolution and improved optical quality (due to less distortion from more, narrower-view lenses, versus fewer fisheye lenses). However, more cameras also mean more seams between per-camera captured images, a scenario that creates more opportunities for artifacts. As people and other objects move across the seams, the parallax problem repeatedly reveals itself, with small angular differences. It is also more difficult to align all of the images when multiple cameras exist; misalignment leads to "ghosting." And more seams also mean more processing time.

Each camera's sensor is also dealing with different lighting conditions. For example, if you're capturing a 360-degree video containing a sunset, you'll have a west-facing camera looking at the sun, while an east-facing camera is capturing a much darker region. Although clever algorithms exist to adjust and blend the exposure variations across images, this blending comes at the cost of lighting and color accuracy, as well as overall dynamic range. The problem is amplified in low-light conditions, potentially limiting artistic expression.

Other problems also exist; they too have solutions, but at higher cost. For example, most digital cameras use a "rolling shutter" as opposed to the more costly "global shutter." Global shutter-based cameras capture every pixel at the same time. Conversely, rolling shutter cameras sequentially capture horizontal rows of pixels at different points in time. When stitching together images shot using rolling shutter-based cameras, some of the pixels in overlapping image areas will have been captured at different times, potentially resulting in erroneous disparities.
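The size of the rolling shutter effect can be estimated with a little arithmetic. The sketch below assumes a sensor that reads its rows out at a constant rate; the function names, the readout time and the object speed are hypothetical values chosen for illustration.

```python
def row_capture_offset_s(row, total_rows, readout_time_s):
    """Time after the start of the frame at which a given row is sampled."""
    return readout_time_s * row / total_rows


def apparent_shift_px(speed_px_per_s, row_a, row_b, total_rows, readout_time_s):
    """Extra displacement of a moving object seen on row_a in one camera and row_b in another.

    Both cameras are assumed to be gen-locked and to share the same sensor timing.
    """
    dt = (row_capture_offset_s(row_b, total_rows, readout_time_s)
          - row_capture_offset_s(row_a, total_rows, readout_time_s))
    return speed_px_per_s * dt


# The same scene point falls on row 100 in one camera and row 600 in its neighbour.
shift = apparent_shift_px(500.0, 100, 600, total_rows=1000, readout_time_s=0.02)
```

Here the two rows are sampled 10 ms apart, so an object moving at 500 pixels per second appears displaced by about 5 pixels between the two views even though the cameras started their frames simultaneously.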

With those qualifiers stated, it's now time for an explanation of 360-degree video stitching and how AMD optimized its code to run in real time. To begin, let's look at the overall software hierarchy and the processing pipeline (Figure 8).


Figure 8. Radeon Loom's software hierarchy has OpenVX at its nexus (courtesy AMD).

An OpenVX™ Foundation

AMD built the Loom stitching framework on top of OpenVX, a foundation that is important for several reasons. OpenVX is an open standard supported by the Khronos Group, an organization that also developed and maintains OpenGL™, OpenCL™, Vulkan™ and many other industry standards. OpenVX is also well suited to this and similar software tasks, because it allows the underlying hardware architecture to optimally execute the compute graph (pipeline), while details of how the hardware obtains its efficiency don't need to be exposed to upper software levels. The AMD implementation of OpenVX, which is completely open-sourced on GitHub, includes a Graph Optimizer that conceptually acts like a compiler for the whole pipeline.

Additionally, and by design, the OpenVX specification allows each implementation to decide how to process each workload. For example, processing could be done out of order, in tiles, in local memory, or handled in part or in its entirety by dedicated hardware. This flexibility means that as both the hardware and software drivers improve, the stitching code can automatically take advantage of these enhancements, similar to how 3D games automatically achieve higher frame rates, higher resolutions and other improvements with new hardware and new driver versions.

The Loom Stitching Pipeline

Most of the steps in the stitching pipeline are the same, regardless of whether you are stitching in real-time or offline (i.e. in batch mode) (Figure 9). The process begins with a camera rig capturing a group of videos, likely either to SD flash memory cards, a hard drive or digital video tape, depending on the camera model. After shooting, copy all of the files to a PC and launch the stitching application.


Figure 9. The offline stitching pipeline has numerous critical stages, and is similar to the real-time processing pipeline alternative (courtesy AMD).

Before continuing with the implementation details, let's step back for a second and set the stage. The goal is to obtain a spherical output image to view in a VR headset (Figure 10). However, it's first necessary to create a flat projection of a sphere. The 360-degree video player application will then warp the flat projection into a sphere. The most common flat format is called an equirectangular projection, although other projection approaches are also possible.



Figure 10. The goal, a spherical image to view in a headset (top), first involves the rendering of a flat projection, which is then warped (bottom) (courtesy AMD).
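The equirectangular mapping itself is straightforward: image columns are proportional to longitude (yaw) and rows to latitude (pitch). The following sketch shows one common convention; the axis orientation and the function names are assumptions, and Loom's internal conventions may differ.

```python
import math


def direction_to_equirect(yaw_rad, pitch_rad, width, height):
    """Map a viewing direction to (column, row) in an equirectangular image.

    yaw is in [-pi, pi) with 0 at the image centre; pitch is in [-pi/2, pi/2]
    with +pi/2 (straight up) at the top row.
    """
    u = (yaw_rad + math.pi) / (2.0 * math.pi) * width
    v = (math.pi / 2.0 - pitch_rad) / math.pi * height
    return u, v


def equirect_to_direction(u, v, width, height):
    """Inverse mapping: pixel position to a unit vector (x right, y up, z forward)."""
    yaw = u / width * 2.0 * math.pi - math.pi
    pitch = math.pi / 2.0 - v / height * math.pi
    return (math.cos(pitch) * math.sin(yaw),
            math.sin(pitch),
            math.cos(pitch) * math.cos(yaw))
```

A player performs the inverse mapping for every display pixel: it computes the viewing direction from the headset pose, then looks up the corresponding pixel in the flat image. Note that the width of an equirectangular frame is normally twice its height, since it spans 360 degrees horizontally and 180 degrees vertically.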

The first step in the pipeline is to decode each video stream, which the camera has previously encoded into a standard format, such as H.264. Next is to perform a color space conversion to RGB. Video codecs such as H.264 store the data in a YUV format, typically YUV 4:2:0 to obtain better compression. The Loom pipeline supports color depths from 8-bit to 16-bit. Even with 8-bit inputs and outputs, some internal steps are performed with 16-bit precision to maximize quality.
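As an illustration of the color space conversion step, the sketch below converts a single 8-bit sample and shows how 4:2:0 chroma is shared between neighboring pixels. It assumes full-range BT.601 (JFIF-style) coefficients; real H.264 streams frequently use limited-range levels and BT.709 coefficients instead, and the source does not say which variant Loom applies. The function names are hypothetical.

```python
def clamp8(value):
    """Round to the nearest integer and clamp to the 8-bit range."""
    return max(0, min(255, int(round(value))))


def yuv_to_rgb(y, u, v):
    """Convert one 8-bit full-range Y'CbCr sample to 8-bit RGB (BT.601 coefficients)."""
    cb = u - 128
    cr = v - 128
    r = y + 1.402 * cr
    g = y - 0.344136 * cb - 0.714136 * cr
    b = y + 1.772 * cb
    return clamp8(r), clamp8(g), clamp8(b)


def chroma_at(plane, x, y):
    """In 4:2:0, each chroma sample covers a 2x2 block of luma pixels.

    This uses nearest-neighbour lookup; production code usually interpolates.
    """
    return plane[y // 2][x // 2]
```

Because 4:2:0 stores only one chroma pair per four luma samples, the chroma planes are a quarter of the size of the luma plane, which is where the compression benefit mentioned above comes from.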

Next comes the lens correction step. The specifics are to some degree dependent on the characteristics of the exact lenses on your cameras. Essentially what is happening, however, is that the distortion artifacts introduced by each camera's lens are corrected to make straight lines actually appear straight, both horizontally and vertically (Figure 11). Fisheye lenses and circular fisheye lenses have even more (natural) distortion that needs to be corrected.



Figure 11. Lens distortion correction (top) is particularly challenging with fisheye lenses (bottom) (courtesy AMD).
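One widely used way to model and correct lens distortion is a polynomial radial model, sketched below. This is an assumption for illustration: the article does not state which lens models Loom uses, fisheye lenses generally require a different model, and the coefficients k1 and k2 here are hypothetical calibration values.

```python
def distort(xu, yu, k1, k2=0.0):
    """Apply radial distortion to an ideal (undistorted) normalized image point."""
    r2 = xu * xu + yu * yu
    scale = 1.0 + k1 * r2 + k2 * r2 * r2
    return xu * scale, yu * scale


def undistort(xd, yd, k1, k2=0.0, iterations=20):
    """Invert the radial model by fixed-point iteration.

    Starts from the distorted point and repeatedly divides by the distortion
    factor evaluated at the current estimate. Converges for mild distortion.
    """
    xu, yu = xd, yd
    for _ in range(iterations):
        r2 = xu * xu + yu * yu
        scale = 1.0 + k1 * r2 + k2 * r2 * r2
        xu, yu = xd / scale, yd / scale
    return xu, yu
```

In a real pipeline the correction is usually precomputed once per lens as a lookup table of source coordinates, so that the per-frame work on the GPU reduces to a resampling pass.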

Once this correction is accomplished, the next step warps each image into an intermediate buffer representing an equirectangular projection (Figure 12). At this point, if you simply merge all the layers together, you'll end up with a stitched image. This overview won't discuss in detail how to deal with exposure differences between images and across overlap areas. Note, however, that Loom contains seam-finding, exposure compensation and multi-band blending modules, all of which are required in order to obtain a good quality stitch that is balanced across the camera images and has minimally visible seams. For multi-band blending, for example, the algorithms expand the internal data to 16 bits, even if the input source is 8 bits, and provide proper padding so everything can run at high speed on the GPU.




Figure 12. Warping the corrected images into equirectangular-sized intermediate buffers (top), then merging the layers together (middle), isn't alone sufficient to deliver high-quality results (bottom) (courtesy AMD).

Finding Seams

AMD's exploration and evaluation of possible seam-finding algorithms was guided by a number of desirable characteristics, beginning with "high speed" and "parallelizable." AMD chose this prioritization in order to be able to support real-time stitching with many lens types, as well as to be scalable across the lineup of GPUs. Temporal stability is also required, so that a seam would not flicker due to noise or minor motion in the scene at an overlap area. While many of the algorithms in academic literature work well for still images such as panoramas, they aren't as robust with video.

The algorithm picks a path in each overlapping region, and then stays with this same seam for a variable number of frames, periodically re-checking for a possible better seam in conjunction with decreasing the advantage for the original seam over time. Because each 360-degree view may contain many possible seams, not re-computing every seam on every frame significantly reduces the processing load. The complete solution also includes both seam finding and transition blending functions. The more definitive the seam, the less need there is for wide-area blending. Off-line (batch mode) processing supports adjustment of various stitching parameters in order to do more or less processing per seam on each frame.

While this fairly simple algorithm might work fairly well for a relatively static scene, motion across the seam could still be problematic. So AMD's approach also does a "lightweight" check of the pixels in each overlap region, in order to quickly identify significant activity in the overlap regions and flag these particular seams such that they can be promptly re-checked. All seams are grouped into priority levels; the highest priority candidates are re-checked first and queued for re-computation in order to minimize the impact on the system's real-time capabilities. For setups with a small number of cameras and/or in off-line processing scenarios, a user can default all seam candidates to the highest priority.
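The scheduling idea described above can be sketched with a priority queue. Everything here is an illustrative assumption rather than AMD's implementation: the activity measure (mean absolute difference), the two priority levels, the threshold and the per-frame budget are all hypothetical.

```python
import heapq


def overlap_activity(prev_pixels, curr_pixels):
    """Lightweight motion check: mean absolute difference of two grey-value lists."""
    total = sum(abs(a - b) for a, b in zip(prev_pixels, curr_pixels))
    return total / len(curr_pixels)


def schedule_rechecks(seams, activity_threshold=10.0, budget=2):
    """Pick the seams to re-compute this frame, most urgent first.

    seams: list of dicts with 'id', 'age' (frames since last re-computation)
           and 'activity' (result of the lightweight check).
    budget: how many seams can be re-computed without missing the frame deadline.
    """
    heap = []
    for seam in seams:
        # Priority 0 (highest) for seams with significant motion, 1 otherwise.
        priority = 0 if seam["activity"] >= activity_threshold else 1
        # Within a level, the seam that has waited longest comes first.
        heapq.heappush(heap, (priority, -seam["age"], seam["id"]))
    count = min(budget, len(heap))
    return [heapq.heappop(heap)[2] for _ in range(count)]


seams = [
    {"id": "front-left", "age": 30, "activity": 2.0},
    {"id": "front-right", "age": 5, "activity": 25.0},
    {"id": "rear-left", "age": 50, "activity": 1.0},
]
to_recompute = schedule_rechecks(seams)
```

In this example the front-right seam is re-computed first because of the motion detected in its overlap region, followed by the quiet seam that has gone longest without a refresh. Setting the threshold to zero corresponds to the off-line case in which every seam is treated as highest priority.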

How to find the optimal path for each seam? Many research papers have been published, promoting algorithms such as max-flow min-cut (or graph cut), segmentation, watershed and the like. Graph cut starts by computing a cost for making a cut at each pixel and then finding a path through the region that has the minimum total cost. If you have the perfect "cost function," you’ll get great results, of course. But the point is to find a seam that is the least objectionable and remains so over time. In real-time stitching you can’t easily account for the future; conversely, in off-line mode a best seam over time can be found (in time).

Before you can choose a cost function, you have to inherently understand what it is that you are trying to minimize and maximize (Figure 13). Good starting points are to cut along an edge and to not cross edges. The stronger the edge the better when following it; conversely, crossing over an edge is worst of all. And cutting right on an edge is better than cutting parallel to but some distance away from it, although in some cases you may not have a nearby edge to cut on.

Figure 13. Sometimes, the best seam follows an edge (top left). In other situations, however, the stronger edge ends up with a lower "cost" score (top right). In this case, the theoretically best seam has the worst score, since it's not always on an edge (bottom left). Here, the best score follows an edge (bottom right) (courtesy AMD).

Computing a cost function leverages a horizontal and vertical gradient using a 3x3 Sobel function, for both phase and magnitude (Figure 14).


Figure 14. Seam candidate cost calculations leverage 3x3 Sobel functions (courtesy AMD).
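For reference, the standard 3x3 Sobel operator can be written in a few lines. This sketch computes the gradient magnitude and phase at a single interior pixel of a greyscale image stored as a list of rows; how Loom turns these two values into a final cost is not specified here, so the example stops at the gradient itself.

```python
import math

# Standard 3x3 Sobel kernels for the horizontal (x) and vertical (y) derivatives.
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_at(image, row, col):
    """Return (magnitude, phase_radians) of the gradient at an interior pixel."""
    gx = gy = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            pixel = image[row + dr][col + dc]
            gx += SOBEL_X[dr + 1][dc + 1] * pixel
            gy += SOBEL_Y[dr + 1][dc + 1] * pixel
    return math.hypot(gx, gy), math.atan2(gy, gx)


# A vertical edge: dark on the left, bright on the right.
edge_image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]
magnitude, phase = sobel_at(edge_image, 1, 1)
```

For this vertical edge the gradient points horizontally, so the phase is zero and the magnitude is 1020. The magnitude indicates how strong an edge is, while the phase indicates its orientation, which is what allows a cost function to distinguish cutting along an edge from cutting across it.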

Graph Cuts

In classical graph cut theory, an “s-t graph” is a mesh of nodes (pixels in this particular case) linked together (Figure 15). S is the starting point (source) and T is the ending point (sink). Each vertical and horizontal link has an associated cost for breaking that connection. However, the academic description may be confusing in this particular implementation, because when considering a vertical seam and a left and right image, S and T are on the left and right images, not the top and bottom of the seam.


Figure 15. An “s-t graph” is a mesh of nodes linked together, with S the starting (source) point and T the ending (sink) point (courtesy AMD).

The graph cut method measures total cost, not average cost. Thus a shorter cut through some high-cost edges may get preference over longer cuts through areas of average cost. Some possible methodology improvements include segmenting the image and finding seams between segments, avoiding areas of high salience, computing on a pyramid of images, preferring cuts in high frequency regions, minimizing average error, etc.
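The difference between total and average cost is easy to see with made-up numbers. The per-pixel costs below are hypothetical and exist only to illustrate the point.

```python
def total_and_average(costs):
    """Return (total cost, average cost per pixel) of a candidate cut."""
    total = sum(costs)
    return total, total / len(costs)


short_cut = [9, 9, 9]   # three pixels, each crossing a strong edge
long_cut = [3] * 10     # ten pixels through moderately busy content

short_total, short_avg = total_and_average(short_cut)
long_total, long_avg = total_and_average(long_cut)

# A total-cost criterion prefers the short cut; an average-cost criterion prefers the long one.
prefers_short_by_total = short_total < long_total
prefers_long_by_average = long_avg < short_avg
```

The short cut totals 27 against 30 for the long one, so a total-cost minimizer selects it even though every pixel along it is three times as objectionable. This is the bias that the improvements listed above, such as minimizing average error, aim to counter.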

Seam Cut

The algorithm begins by computing a cost, the Sobel phase and magnitude, at each pixel location. It then sums the accumulated cost along a path that generally follows the direction of the seam. It chooses a vertical, horizontal or diagonal seam direction based on the dimensions of the given seam. It looks at a pixel and 3 possible directions of the next pixel. For example, in a vertical seam moving from top to bottom, a pixel has 3 pixels below it that can be considered: down-left, down or down-right. The lowest-cost of the three options is taken. Optionally, the algorithm provides "bonus points" if a cut is perpendicular and right next to an edge.

The above process executes in parallel for every pixel row (or column) in the overlap area (minus some boundary pixels). After the algorithm reaches the bottom (in this case) of the seam, it compares all possible paths and picks the one with the lowest overall cost. The final step is to trace back up the path and set the weights for the left and right images (Figure 16).



Figure 16. The two overlap source images were taken from different angles (upper left and right). An elementary stitch of them produces sub-par results (middle). Stitching by means of a generated seam produces a superior output (bottom) (courtesy AMD).
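One way to read the seam cut procedure is as a dynamic-programming search of the kind also used in seam carving. The sketch below follows that interpretation for a vertical seam and is not AMD's code: it takes a precomputed per-pixel cost grid as input, omits the optional "bonus points," and runs sequentially rather than in parallel. The function names are hypothetical.

```python
def find_vertical_seam(cost):
    """Return the seam column for every row of the overlap region, top to bottom.

    cost: list of rows, each a list of per-pixel costs.
    """
    rows, cols = len(cost), len(cost[0])
    acc = [list(cost[0])]   # cheapest accumulated cost of any path reaching each pixel
    back = [[0] * cols]     # column in the previous row that the cheapest path came from
    for r in range(1, rows):
        acc_row, back_row = [], []
        for c in range(cols):
            # Three neighbours in the previous row, clipped at the region borders.
            candidates = range(max(0, c - 1), min(cols, c + 2))
            best = min(candidates, key=lambda k: acc[r - 1][k])
            acc_row.append(cost[r][c] + acc[r - 1][best])
            back_row.append(best)
        acc.append(acc_row)
        back.append(back_row)
    # Compare all paths at the bottom, then trace the cheapest one back to the top.
    col = min(range(cols), key=lambda k: acc[-1][k])
    seam = [col]
    for r in range(rows - 1, 0, -1):
        col = back[r][col]
        seam.append(col)
    seam.reverse()
    return seam


def seam_weights(seam, cols):
    """Per-pixel weight of the left image: 1.0 up to the seam, 0.0 beyond it."""
    return [[1.0 if c <= s else 0.0 for c in range(cols)] for s in seam]


cost_grid = [
    [5, 1, 5],
    [5, 5, 1],
    [5, 1, 5],
]
seam = find_vertical_seam(cost_grid)
weights = seam_weights(seam, 3)
```

For this small grid the seam follows the low-cost pixels through columns 1, 2 and 1. The hard 0/1 weights shown here would normally be softened by the blending stage so that the transition between the left and right images is not abrupt.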

Mike Schmit
Director of Software Engineering, Radeon Technologies Group, AMD

Surround View in Consumer Video Capture Systems

AMD's Radeon Loom, as previously noted, focuses its attention on ultra-high-resolution (4K and 8K), high-frame rate and otherwise high-quality professional applications, and is PC-based. Many of the concepts explained in AMD's essay, however, are equally applicable to more deeply embedded system designs, potentially at standard HD video resolutions, with more modest frame rates and more mainstream quality expectations. See, for example, the presentation "Designing a Consumer Panoramic Camcorder Using Embedded Vision," delivered by CENTR (subsequently acquired by Amazon).

Lucid VR, a member of the Embedded Vision Alliance and an award winner at the 2017 Embedded Vision Summit's Vision Tank competition, has developed a consumer-targeted stereoscopic camera which captures the world the way human eyes see it, with true depth and a 180-degree field of vision. When viewed within a virtual reality headset like Oculus Rift or Google Daydream with a mobile phone, the image surrounds the user, creating complete immersion.

Conclusion

Vision technology is enabling a wide range of products that are more intelligent and responsive than before, and thus more valuable to users. Vision processing can add valuable capabilities to existing products, and can provide significant new markets for hardware, software and semiconductor suppliers. High-quality surround view "stitching" of both still images and video streams via computer vision processing can not only create compelling immersive content directly viewable on VR headsets and other platforms, but can also generate valuable visual information used by downstream computer vision algorithms for autonomous vehicles and other applications. By carefully selecting and optimizing both the "stitching" algorithms and the processing architecture(s) that run them, surround view functionality can be cost-effectively and efficiently incorporated in a diversity of products. And an industry association, the Embedded Vision Alliance, is also available to help product creators optimally implement surround view capabilities in their resource-constrained hardware and software designs.

By Brian Dipert
Editor-in-Chief, Embedded Vision Alliance

Sidebar: Additional Developer Assistance

The Embedded Vision Alliance, a worldwide organization of technology developers and providers, is working to empower product creators to transform the potential of vision processing into reality. AMD and videantis, co-authors of this article, are members of the Embedded Vision Alliance. The Embedded Vision Alliance's mission is to provide product creators with practical education, information and insights to help them incorporate vision capabilities into new and existing products. To execute this mission, the Embedded Vision Alliance maintains a website providing tutorial articles, videos, code downloads and a discussion forum staffed by technology experts. Registered website users can also receive the Embedded Vision Alliance’s twice-monthly email newsletter, Embedded Vision Insights, among other benefits.

The Embedded Vision Alliance’s annual technical conference and trade show, the Embedded Vision Summit, is coming up May 22-24, 2018 in Santa Clara, California.  Intended for product creators interested in incorporating visual intelligence into electronic systems and software, the Embedded Vision Summit provides how-to presentations, inspiring keynote talks, demonstrations, and opportunities to interact with technical experts from Embedded Vision Alliance member companies. The Embedded Vision Summit is intended to inspire attendees' imaginations about the potential applications for practical computer vision technology through exciting presentations and demonstrations, to offer practical know-how for attendees to help them incorporate vision capabilities into their hardware and software products, and to provide opportunities for attendees to meet and talk with leading vision technology companies and learn about their offerings.  More information, along with online registration, is now available.

The Embedded Vision Alliance also offers a free online training facility for vision-based product creators: the Embedded Vision Academy. This area of the Embedded Vision Alliance website provides in-depth technical training and other resources to help product creators integrate visual intelligence into next-generation software and systems. Course material in the Embedded Vision Academy spans a wide range of vision-related subjects, from basic vision algorithms to image pre-processing, image sensor interfaces, and software development techniques and tools such as OpenCL, OpenVX and OpenCV, along with Caffe, TensorFlow and other deep learning frameworks. Access is free to all through a simple registration process.

The Embedded Vision Alliance and its member companies periodically deliver webinars on a variety of technical topics, including various deep learning subjects. Access to on-demand archive webinars, along with information about upcoming live webinars, is available on the Alliance website. Also, the Embedded Vision Alliance is offering "Deep Learning for Computer Vision with TensorFlow," a series of both one- and three-day technical training classes planned for a variety of both U.S. and international locations. See the Alliance website for additional information and online registration.

Sidebar: Radeon Loom: A Historical Perspective

The following essay from AMD explains how the company came up with its "Radeon Loom" project name, in the process providing a history lesson on human beings' longstanding desires to engross themselves with immersive media.

People seemingly have an innate need to immerse themselves in 360-degree images, stories and experiences, and such desires are not unique to current generations. In fact, it's possible to trace this yearning for recording history, and for educating and entertaining ourselves through it, from modern-day IMAX, VR and AR (augmented reality) experiences all the way back to ancient cave paintings. To put today's technology in perspective, here's a sampling of immersive content and devices from the last 200+ years:

Several interesting connections exist between the historical loom, an apparatus for making fabric, and modern computers (and AMD’s software running on them). The first and most obvious linkage is the fact that looms are multi-threaded machines, capable of being fed by thousands of threads to create beautiful fabrics and images on them. Radeon GPUs also run thousands of threads (of code instructions this time), and also produce stunning images.

Of particular interest is the Jacquard Loom, invented in France in 1801. Joseph Jacquard didn’t invent the original loom; he actually worked as a child in his parent’s factory as a draw-boy, as did many children of the time. Draw-boys, directed by the master weaver, manipulated the warp threads one by one; this was a job much more easily tackled by children's small hands. Unfortunately, it also required them to be up high on the loom, in a dangerous position.

Jacquard's experience later motivated him as an adult to develop an automated punch card mechanism for the loom, thereby eliminating his childhood job. The series of punched cards controlled the intricate patterns being woven. And a few decades later, Charles Babbage intended to use the same conceptual punch card system on his (never-built) Analytical Engine, which was an ancestor of modern day computing hardware and software.

When Napoleon observed the Jacquard Loom in action, he granted a patent for it to the city of Lyon, essentially open-sourcing the design. This grant was an effort to help expand the French textile industry, especially for highly desirable fine silk fabrics. And as industry productivity consequently increased, from a few square inches per day to a square yard or two per day, what did the master weavers do with their newfound time? They now could devote more focus on creative endeavors, producing designs with new patterns and colors every year, creations which they then needed to market and convince people to try. Today this is called the "fashion" industry, somewhat removed from its elementary fabric-weaving origins.

Solving Intelligence, Vision and Connectivity Challenges at the Edge with ECP5 FPGAs

This article was originally published at Lattice Semiconductor's website. It is reprinted here with the permission of Lattice Semiconductor.

Seeing Clearer – Driving Toward Better Cameras for Safer Vehicles

This article was originally published by Dave Tokic of Alliance member company Algolux. It is reprinted here with Tokic's permission.

Fundamentals of Image Processing Systems

This article was originally published at Basler's website. It is reprinted here with the permission of Basler.

What do image processing systems have to do with keeping foodstuffs in good shape?

Machine Learning’s Fragmentation Problem — and the Solution from Khronos

This blog post was originally published at Alliance partner organization Khronos' website. It is reprinted here with the permission of the Khronos Group.

There is a wide range of open-source deep learning training networks available today offering researchers and designers plenty of choice when they are setting up their project. Caffe, Tensorflow, Chainer, Theano, Caffe2, the list goes on and is getting longer all the time.

Visual Ventures with Chris Rowen

This article was originally published as a two-part blog series at Cadence's website. It is reprinted here with the permission of Cadence.


"Even if he gives the same presentation two weeks apart, it will be different.”
—Neil Robinson, fellow attendee, on Chris Rowen

Smart Watch, Smart Home, Smart City – How the Internet of Things Helps Shape the Future

This article was originally published at Basler's website. It is reprinted here with the permission of Basler.

BOSCH Visiontec Brings Innovative Automotive IP to Market Fast Using High-Level Synthesis

This article was originally published at Mentor's website. It is reprinted here with the permission of Mentor.