
Data Sets for Machine Learning Model Training


Deep learning and other machine learning techniques have rapidly become a transformative force in computer vision. Compared to conventional computer vision techniques, machine learning algorithms deliver superior results on functions such as recognizing objects, localizing objects within a frame, and determining which pixels belong to which object. Even problems like optical flow and stereo correspondence, which had been solved quite well with conventional techniques, are now finding even better solutions using machine learning techniques. The pace of ongoing machine learning development remains rapid. And in comparison to traditional computer vision algorithms, it's easier to create effective solutions for new problems without requiring a huge team of specialists. But machine learning is also resource-intensive, as measured by its compute and memory requirements. And for machine learning to deliver its potential, it requires a sufficient amount of high quality training data, plus developer knowledge of how to properly use it.

Imagine teaching someone how to bake a cake. You can do it by writing down instructions, or you can show them how it’s done and have them learn by example. This, in a nutshell, is the difference between traditional coding and machine learning: coding is like writing a recipe, whereas machine learning is teaching by example. If you teach cake baking by example, you'll probably have to demonstrate the process many times before your student remembers all the details and can bake a cake on their own from memory. Baking a cake happens to be easy to describe with a recipe, but many other problems are very difficult to solve in this "recipe" fashion.

For example, how would you teach someone to discern between an image of a cat and an image of a dog? It would be practically impossible to write a “recipe” for cat-vs-dog classification; alternatively, though, with enough example images you could teach this skill in straightforward fashion. Many problems in computer vision and artificial intelligence that are impractical to solve with traditional coding and algorithms become solvable with deep learning or some other form of machine learning.

You can think of DNNs (deep neural networks) as universal approximators: they can be structured to map any input to any output (e.g., the input can be a picture, and the output can be the probabilities that the picture depicts a dog and a cat, respectively). When trained properly, they can provide a very good approximate solution to just about any problem you throw at them, under the important assumption that the network topology is a sufficiently effective match for the problem at hand. DNNs typically have a huge number of internal parameters—often millions of them. Training is the process by which you initially set all of these parameters in order that the network can then reliably solve a particular problem, like discerning cats vs. dogs.

Much industry attention is focused on machine learning inference, where a neural network analyzes new data it is presented with based on its previous training. But what about that prior, all-important training step? Obviously, training effectiveness is a critical factor in accuracy, responsiveness and other subsequent inference metrics. Without sufficient training, accuracy will inevitably suffer. But excessive training, using a data set that exceeds the scope of the images that the model will be tasked with interpreting in real-life usage, is also undesirable. Since training is highly compute-intensive, for example, an insufficiently bounded (i.e., limited, or constrained) data set will require more training time and cost than is necessary. The resultant model may also be more complex than required, translating into excessive memory budget and inference computation requirements.

This article discusses in detail the challenges of, as well as the means of, assembling a robust (but not excessive) training data set for machine learning-based vision processing. It answers fundamental questions such as why an optimized training set is important, what are the characteristics of an optimized training set, and how one assembles an optimized training set. And it also introduces readers to an industry alliance created to help product creators incorporate machine learning-based vision capabilities into their hardware and software, along with outlining the technical resources that this alliance provides (see sidebar "Additional Developer Assistance").

Training Fundamentals

Much of the information in the next several sections of this article, covering foundational machine learning concepts, comes from BDTI.

Today, training of deep neural networks primarily occurs via a process called SGD (stochastic gradient descent). To begin, the network parameters are initialized to random values. The network is then provided with batches of example training inputs (e.g., pictures of cats and dogs). For each input, the model computes a corresponding output based on its current parameters. The computed output is compared to the expected correct output, with the error differential fed backward through the network. In this step, called back-propagation, an adjustment is computed for each of the network's parameters in order to decrease the error.

The training process is repeated many times with hundreds, thousands, or even millions of example inputs. The intent is for the training process to gradually converge on a set of model parameters that provide an accurate solution to the problem at hand. Note that SGD as described here requires that you have a desired correct output (usually called a "labeled output”, e.g., “cat” or “dog”) associated with each example input. Machine learning performed using a combination of example inputs and their corresponding labels is called supervised learning. One of the critical challenges of machine learning, therefore, is finding or creating (or both) an effective dataset that contains correct examples and their corresponding output labels.
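To make the loop described above concrete, here is a minimal supervised-training sketch written in PyTorch (one possible framework choice, not one prescribed by this article); the model, dataset and hyperparameter values are placeholders.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, dataset, epochs=10, learning_rate=0.01, batch_size=32):
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    criterion = nn.CrossEntropyLoss()               # error between output and label
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    for epoch in range(epochs):
        for images, labels in loader:               # batches of labeled examples
            optimizer.zero_grad()
            outputs = model(images)                 # forward pass on current parameters
            loss = criterion(outputs, labels)       # compare to the expected output
            loss.backward()                         # back-propagation of the error
            optimizer.step()                        # adjust parameters to reduce error
    return model

Swapping torch.optim.SGD for torch.optim.Adam, torch.optim.Adagrad or torch.optim.RMSprop is all that's needed to experiment with the alternative optimizers mentioned below.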

Several variations of the basic SGD optimization process are available that, in some cases, can help training converge more rapidly and/or reach a final set of parameters that yields more accurate results. These SGD derivatives are called optimizers; examples include Adam, Adagrad, and RMSprop. And sometimes, rather than directly (and solely) training a DNN on the desired dataset, a combination of techniques can be employed.

In transfer learning, for example, a network that has already been trained with one dataset (e.g., ImageNet) is then re-trained with another dataset for a new application. However, in such cases, the parameters from the original training are used to initialize the new training, rather than the random values discussed previously. If the training datasets are similar enough, the original parameters can provide a good starting point for retraining; it may even be possible to retain most of the parameters and retrain only the last few network layers. Regardless, by using transfer learning when a network pre-trained on a similar dataset is available, the total training time can potentially be greatly reduced.
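A hedged sketch of the transfer learning flow just described, using torchvision (an assumed library choice): the parameters come from prior ImageNet training, most layers are frozen, and only a new final layer is retrained for a hypothetical two-class (cat vs. dog) task.

import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)       # initialized from ImageNet training
for param in model.parameters():
    param.requires_grad = False                # freeze the pretrained feature extractor

model.fc = nn.Linear(model.fc.in_features, 2)  # new final layer, randomly initialized

# Only the new layer's parameters are handed to the optimizer for retraining.
optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.001)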

In unsupervised learning, techniques such as generative adversarial networks and autoencoders are used to train neural networks using data that lacks an annotated desired output for each input. Most often, unsupervised learning (also referred to as training using unlabeled data) is combined with some supervised learning (training using labeled data), so some amount of annotated data is still needed. But that amount can sometimes be reduced by the use of unsupervised learning techniques, at least in theory.

At the moment, a significant amount of labeled data usually remains necessary in order to deploy a reliable DNN in a typical computer vision application. The primary drawback of labeled data, unfortunately, is that manual annotation of a large dataset can be expensive and time consuming. It’s therefore conceptually appealing to use unsupervised learning to reduce the amount of labeled data that’s necessary. Most current practical implementations rely mostly on annotated data, along with transfer learning. However, although unsupervised-centric training is comparatively uncommon today, research is ongoing on various techniques for improving its effectiveness.

Dataset Sources

Many annotated datasets are publicly available, although the licensing terms for most of them indicate that they are to be used only for academic or research purposes. Many of these datasets are application-specific, e.g., containing only faces, pedestrians, or cars. The quality of publicly available datasets also varies greatly, and many of them are fairly small (on the order of 200 images), and therefore perhaps not sufficient on their own to achieve reasonable accuracy for an application. While it’s conceivable to combine multiple public datasets into a larger set, this must be done with care; inconsistent annotation practices and other issues can result in an aggregated dataset that isn’t very good.

Noteworthy sources of public datasets include:

  • ImageNet: one of the best-known image datasets and a de-facto benchmark for assessing the quality of a DNN topology for image classification. For many well-known CNN (convolutional neural network, a common DNN class) topologies, models pre-trained with ImageNet are also publicly available.
  • OpenImages: a GitHub-served, Google-maintained database of ~9 million URLs that link to Creative Commons Attribution-licensed images. The database contains annotations, including labels and bounding boxes spanning thousands of categories. You can often create an application-specific dataset by selecting an appropriate subset of the database and then downloading the images.
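As a rough illustration of the subset-selection workflow just described, the sketch below filters an OpenImages-style annotation CSV by one label and downloads the matching image URLs. The file names, column names, and label identifier are hypothetical placeholders; consult the actual OpenImages documentation for the real file layout.

import csv
import urllib.request

WANTED_LABEL = "/m/0examp"                       # hypothetical machine-readable label ID

# 1. Collect the IDs of images annotated with the wanted label.
wanted_ids = set()
with open("image-labels.csv") as f:              # placeholder annotations file
    for row in csv.DictReader(f):
        if row["LabelName"] == WANTED_LABEL:
            wanted_ids.add(row["ImageID"])

# 2. Download only those images from their listed URLs (assumes a local dataset/ directory).
with open("image-urls.csv") as f:                # placeholder URL listing
    for row in csv.DictReader(f):
        if row["ImageID"] in wanted_ids:
            urllib.request.urlretrieve(row["OriginalURL"],
                                       "dataset/" + row["ImageID"] + ".jpg")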

Various websites also provide listings of open-source datasets. See, for example, the comprehensive CV Datasets on the Web page.

Some computer vision applications aren’t, however, adequately addressed by existing public datasets, whose various limitations may end up requiring the creation of a custom dataset instead. Sometimes, you can leverage Google’s OpenImages to create such a dataset, but other situations require the creation of a dataset "from scratch". In some cases, this custom dataset development process involves sourcing new images. Examples might include:

  • Industrial or medical applications, where the desired images aren't available in the public domain
  • Applications with unusual or tightly specified camera angles, lighting or optics, and/or images captured using custom sensors, none of which are well represented in publicly available images

In other cases, existing images may be usable, but custom annotation is needed, e.g., when:

  • Labels or bounding boxes are needed for categories not found in OpenImages
  • Other types of annotation are needed, e.g., segmentation

Various companies offer dataset creation and/or annotation services (see sidebar "The Benefits of Custom Training Datasets, and a Case Study"). Such suppliers may search the Internet for publicly available images specific to the application, undertake image creation themselves, and/or accept and annotate images provided by the customer. Some of these companies have invested substantial effort in creating software tools to make human annotation more efficient, as well as employing annotators in locations around the world where labor is inexpensive. Note, however, that creating a high-quality dataset requires careful planning, supervision, and quality control.

Shortcomings that you may encounter when creating and annotating a dataset (by yourself and/or in partnership with a vendor) include:

  • Insufficient or incomplete data: Sometimes it’s difficult or impossible to capture enough data to cover the diversity of inputs that a real-life application will be expected to handle.
  • Unbalanced data: A neural network will learn the statistical properties of the training data. If a network is trained to classify dog vs. cat on a dataset with ten times more dogs than cats, for example, it will learn that guessing “dog” every time nets it a 90% accuracy score. Such imbalances will ultimately limit the accuracy that the network can achieve (a mitigation is sketched in the code example after this list).
  • Misleading correlations: For example, if the previously mentioned dog vs. cat network is trained with images that always show dogs outdoors and cats indoors, that network will very likely produce incorrect results when instead shown a cat outdoors or a dog indoors. Similarly, an example from Google involved a CNN that had learned to associate dumbbells with arms, because all the images of dumbbells that the network had been trained with also included arms (from people doing curls).
  • Inconsistent annotations: Should an antique bureau be classified as a “desk,” a “dresser,” or a “bookshelf?” Should a tuk-tuk on a street in Bangkok be classified as a “car” or “motorcycle?” For a single human annotator, it’s often easy to end up annotating very similar objects in very different ways. And when the annotation effort for a large dataset is split among people from different cultural backgrounds, who might speak different languages or dialects, this problem is compounded—sometimes severely. Inconsistent annotations mean that the neural network is then fed conflicting data during training, and is therefore unable to learn to accurately predict the correct answers.
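Regarding the unbalanced-data pitfall above, one common mitigation is to weight the training loss (or to oversample the rare class) so that errors on the under-represented class count proportionally more. A minimal PyTorch-style sketch, with purely illustrative counts:

import torch
import torch.nn as nn

class_counts = torch.tensor([9000.0, 1000.0])      # e.g., 9,000 dog images vs. 1,000 cat images
class_weights = class_counts.sum() / class_counts  # the rarer class receives the larger weight

criterion = nn.CrossEntropyLoss(weight=class_weights)

# Alternative: oversample the rare class so that each training batch is roughly balanced,
# e.g., via torch.utils.data.WeightedRandomSampler(sample_weights, num_samples=len(dataset)).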

To some extent these problems can be managed with good specifications, guidelines, and education of the people doing the data collection and annotation. However, it’s generally impossible to predict all of the corner cases, as well as the subtle issues that can arise in the course of collecting and annotating data. Very often it’s necessary to iterate a multi-step process multiple times:

  1. Collect and annotate an initial dataset
  2. Train a network
  3. Evaluate the errors that the network makes in the application, and determine what problems exist in the dataset that contribute to those errors
  4. Adjust the dataset (collect more data, correct inconsistent annotations, etc.) and return to step 2.
  5. Repeat until the desired accuracy is achieved.

Dataset Augmentation

Dataset augmentation is an "umbrella" term for an important set of techniques that can reduce the need for annotated data. It creates multiple variations of the same source image, via methods such as:

  • Random cropping, rotation, and/or other random warps
  • The addition of random color gradients
  • Random blurring or non-linear transfer functions

Often these techniques are supported directly in machine learning frameworks, and can therefore be easily applied to every image automatically.
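For example, here is a minimal sketch of framework-level augmentation using torchvision transforms (one assumed framework among several that offer this); each image is randomly cropped, rotated, color-shifted and flipped anew every time it is loaded.

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),                      # random cropping
    transforms.RandomRotation(degrees=10),                  # small random rotation
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # random color variation
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Passing `train_transform` as a dataset's transform argument applies these
# variations automatically to every image, every epoch.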

Sometimes it’s also useful to use augmentation techniques off-line to help correct dataset problems. For example, with an unbalanced dataset where a certain class of object occurs far less frequently than others, randomized perspective warps can be used to create additional images from the existing images in this particular class, resulting in a more balanced dataset. The following essay from Luxoft further explores the topic of dataset augmentation, including real-life examples from past design services projects.

Often, for reasons previously discussed in this article, we need more data to train our learnable models than we have on hand, in order to improve the generalization capability of those models. The size of the training dataset has to be consistent with the model's complexity, which is a function of both the model architecture and the number of model parameters. From a statistical point of view, DNN training is a form of MLE (maximum likelihood estimation), a method proven to work only when the number of training data samples is orders of magnitude greater than the number of model parameters.

Modern state-of-the-art DNNs typically comprise millions of weights, while training datasets such as ImageNet contain only about the same number of samples, or even fewer. Such situations seriously violate the applicability conditions of MLE methods, and lead to the reality that the majority of known DNN models are prone to overfitting (a phenomenon in which the model has been trained to work so well on training data that it performs more poorly on data it hasn't seen before). Recent research by Zhang et al. suggests that such DNNs, by simply memorizing the entire training dataset, are capable of freely fitting (statistically approximating, i.e., modeling, a target function) even random labels.

Various regularization approaches are applicable in dealing with overfitting in the presence of a small training dataset. Some of them are implicit, such as:

  • A change in model architecture
  • A change of the optimization algorithm used
  • The use of early stopping during the training process (Yao et al.).
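As an illustration of the last item, here is a minimal framework-agnostic early-stopping sketch in Python (the training and validation callables are placeholders): training halts once validation loss stops improving for a set number of epochs.

def train_with_early_stopping(model, train_one_epoch, validate, patience=5, max_epochs=200):
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                   # one pass over the training data
        val_loss = validate(model)               # loss on held-out validation data
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                            # further training would likely overfit
    return model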

Techniques such as these, while helping to reduce overfitting, may not be sufficient. In such cases, explicit regularization approaches are often considered next.

Explicit approaches can be applied separately or jointly, and in various combinations. Note, however, that their use doesn't necessarily lead to an improvement in model generalization capabilities. Applying them blindly, in fact, might even increase a model's generalization error.

Data augmentation is a collection of methods used to automatically generate new data samples via the combination of existing samples and prior domain knowledge. Consider it a relatively inexpensive means of drastically increasing the training dataset size, with the intent of decreasing generalization error. Each separate augmentation method is usually designed to keep model performance invariant (i.e., unchanged) across the corresponding cases of possible inputs. And you can divide these methods into supervised and unsupervised categories, as introduced previously in this article.

Unsupervised Data Augmentation

The family of unsupervised methods includes simple techniques such as (Figure 1):

  • Geometric transformations: flipping, rotation, scaling, cropping, and distortions
  • Photometric transformations: color jittering, edge-enhancement, and fancy PCA

Such techniques are found in many frameworks and third-party libraries.
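Here is a sketch of such geometric and photometric augmentations using the imgaug library referenced in Figure 1; the specific operators and parameter ranges are illustrative only.

import numpy as np
import imgaug.augmenters as iaa

augmenter = iaa.Sequential([
    iaa.Fliplr(0.5),                                   # horizontal flip half the time
    iaa.Affine(rotate=(-15, 15), scale=(0.9, 1.1)),    # rotation and scaling
    iaa.Crop(percent=(0, 0.1)),                        # random cropping
    iaa.AddToHueAndSaturation((-20, 20)),              # color jittering
    iaa.GaussianBlur(sigma=(0.0, 1.0)),                # mild blurring
])

images = np.zeros((8, 224, 224, 3), dtype=np.uint8)    # placeholder batch of images
augmented = augmenter.augment_images(images)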


Figure 1. Geometric and photometric augmentations applied to a mushroom image were implemented with the help of the imgaug library (courtesy Luxoft).

Some of these augmentation methods can be considered extensions of the concept of dropout, applied to the input data. Examples of such methods are cutout, from DeVries and Taylor, and random erasing, by Zhong et al. Common to both of these techniques is the application of a random rectangular mask to the training images; the cutout method applies zero-masking to normalized images, while the random erasing approach fills the mask with random noise (Figure 2). The masking of continuous regions of training data makes the model more robust to occlusions and less prone to overfitting. The latter provides a distinct decrease in test errors for state-of-the-art DNN architectures.
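A simplified numpy sketch of the random rectangular masking shared by both techniques follows; the mask size and fill behavior are illustrative, and the image is assumed to be larger than the mask.

import numpy as np

def random_rectangular_mask(image, mask_size=50, fill="zero"):
    height, width = image.shape[:2]
    top = np.random.randint(0, height - mask_size)
    left = np.random.randint(0, width - mask_size)
    region = image[top:top + mask_size, left:left + mask_size]
    if fill == "zero":
        region[...] = 0                                         # cutout-style zero-masking
    else:
        region[...] = np.random.randint(0, 256, region.shape)   # random-erasing-style noise
    return image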



Figure 2. Common to both the cutout (top) and random erasing (bottom) unsupervised augmentation techniques is the application of a random rectangular mask to the training images (courtesy Luxoft).

The selection of an optimum data augmentation alternative depends not only on the learnable model and the training data but also on how the trained model is scored. Some augmentation methods are preferable from a recall perspective, for example, while others deliver better precision. A Luxoft vehicle detection test case compares precision-recall curves for common photometric (i.e. perceived light intensity-based) augmentation methods (Figure 3). Note that augmentation with too broad a range of parameter values (such as the violet curve in Figure 3) results in redundancy that degrades both precision and recall.


Figure 3. An example comparison of precision-recall curves corresponds to different photometric augmentations (courtesy Luxoft).

If, on the other hand, you focus on the IoU (intersection over union) evaluation metric, your comparison results may be completely different (Figure 4):


Figure 4. An example comparison of average IoU (intersection over union) corresponds to different photometric augmentations (courtesy Luxoft).

The excessive use of data augmentation has many negative side effects, beginning with increased required training time and extending to over-augmentation, when the differences between classes become indistinguishable.

In our experiments with pharmaceutical data, for example, we learned that data augmentation can cause performance drops if it is not accompanied by proper shuffling of the training data. Such shuffling avoids correlation between samples within batches during the training process. Applying an effective shuffling algorithm, instead of a simple random permutation, can result in both faster convergence of learning and higher accuracy on final validation (Figure 5).
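Luxoft's production shuffling algorithm isn't disclosed here, but the sketch below conveys the general idea: rather than a single random permutation over all augmented samples, spread the variants of each source image across the epoch so that no batch is dominated by near-duplicates.

import random

def spread_shuffle(variants_by_source):
    """variants_by_source: dict mapping a source-image ID to its list of augmented variants."""
    for variants in variants_by_source.values():
        random.shuffle(variants)                 # randomize order within each source
    ordering = []
    pass_index = 0
    while True:
        current_pass = [v[pass_index] for v in variants_by_source.values()
                        if pass_index < len(v)]
        if not current_pass:
            break
        random.shuffle(current_pass)             # randomize source order within the pass
        ordering.extend(current_pass)            # at most one variant per source per pass
        pass_index += 1
    return ordering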


Figure 5. Proper shuffling algorithms can result in both faster convergence and higher accuracy for various augmented dataset types (courtesy Luxoft).

Augmentation techniques are especially useful when, for example, the dataset diverges from test data due to image quality or other a priori-known parameters such as various camera factors (ISO sensitivity and other image noise sources, white balance, lens distortion, etc.). In such cases, elementary image processing methods like those mentioned earlier in this essay can find use not only in increasing the amount of data available for training but also in making training data more consistent with test data, as well as reducing the cost of collecting and labeling new data.

However, use cases still exist where simple image processing is not sufficient to result in a training dataset that's a close approximation of real data (Figure 6). In such complex cases, you may want to consider supervised augmentation techniques.



Figure 6. In some cases, image samples from training datasets (top) are insufficiently close approximations of real-life input counterparts (bottom) (courtesy Luxoft).

Supervised Data Augmentation

Supervised augmentation methods encompass both learnable and un-learnable approaches. One common example of the latter involves graphical simulation of data samples, such as the 3D simulation of a train car coupling from a past railway project (Figure 7).




Figure 7. In a case study example of un-learnable supervised data augmentation, 3D graphics simulations of a train car coupling (top and middle) found use in supplementing real-life images (bottom) (courtesy Luxoft).

Keep in mind that training the model on simulated samples, if not done carefully, can result in a model that overfits on details present in the simulated data, and is therefore not applicable to real-life situations. Note, too, that it is not always necessary to simulate photorealistic data with synthetic data extensions (Figure 8).




Figure 8. In this case study example used for fish behavior classification, the synthetic data (top) derived from motion estimation algorithms (middle) was successfully used to train the DNN model, even though it was not at all realistic-looking to the human eye (bottom) (courtesy Luxoft).

Learnable augmentation methods leverage an auxiliary generative model that is "learned" in order to produce the new data samples. The best-known examples of this approach are based on the GAN (generative adversarial nets) concept, proposed initially by Goodfellow et al. The primary characteristic of these methods is the simultaneous training of two neural networks. One neural network strives to generate images that are similar to natural ones, while the other network learns to detect imitation images. The training process completes when generated images are indistinguishable from real image samples.

Shrivastava et al. and Liu et al. applied the GAN approach in order to generate highly realistic images from simulated ones, thereby striving to eliminate a primary drawback of using simulation in data augmentation. Also, Szegedy et al. and Goodfellow et al. observed that the generation of realistic images is not the sole way of improving generalization. They proposed the use of adversarial samples, which had been initially designed to trick learning models. Extending a training dataset by means of such samples, intermixed with real ones, can decrease overfitting.

Finally, Lemley et al. have introduced a novel neural approach to augmentation, which connects the concepts of GAN and adversarial samples in order to produce new image samples. This approach uses a separate augmentation network that is trained jointly with the target classification network. The process of training is as follows:

  • At least three images of the same class are randomly selected from the original dataset.
  • All selected images, except for one, are treated as input to the augmentation network, which produces a new image of the same size.
  • The output image is compared with the remaining one from the previous step to generate a similarity (or loss) metric for the augmentation network. The authors suggest using MSE (mean squared error) as the metric.
  • Both compared images are then fed into the target classification network, with the loss computed as categorical cross-entropy.
  • Both losses (augmentation and classification) are merged together by means of various weighting parameters. Merged loss may also depend on the epoch number.
  • The total loss back-propagates from the classification network into the augmentation network. As a result, the augmentation network learns to generate augmented images that are optimal for the classification network.

The presented approach doesn’t attempt to produce realistic data. Instead, the joint information between merged data samples is employed to improve the generalization capability of the target learnable model.
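A hedged PyTorch-style sketch of the merged-loss step described above follows; the network modules, loss weights, and data handling are placeholders rather than the authors' actual implementation.

import torch
import torch.nn.functional as F

def joint_training_step(aug_net, cls_net, same_class_images, label, w_aug=0.3, w_cls=0.7):
    # `same_class_images`: a small batch of images from one class; the first is held out.
    # `label`: the class index of that class, as a scalar tensor.
    target_image, aug_inputs = same_class_images[0:1], same_class_images[1:]

    generated = aug_net(aug_inputs)                         # synthesize a new same-size image
    aug_loss = F.mse_loss(generated, target_image)          # MSE similarity metric

    logits = cls_net(torch.cat([generated, target_image], dim=0))
    labels = torch.stack([label, label])
    cls_loss = F.cross_entropy(logits, labels)              # categorical cross-entropy

    total_loss = w_aug * aug_loss + w_cls * cls_loss        # weighted merge of both losses
    total_loss.backward()                                   # gradients reach both networks
    return total_loss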

Potemkin Andrey
Deep Learning Engineer, Luxoft

Sergey Fedorov
Program Manager, Luxoft

Valiev Ildar
Deep Learning Engineer, Luxoft

Synthetic Datasets

In some cases it may be practical to synthetically (artificially) generate some or all of a dataset. For example, in an industrial inspection application, 3D models of the objects to be analyzed may already exist. These models can be used to render additional artificial images with different lighting conditions, camera angles, imperfections, etc. Because the data is generated from a model, all annotations can be automatically generated too; the annotation effort is replaced by programming effort. One interesting example of this technique is the “flying chairs” dataset, which uses 3D models of chairs to create an artificial dataset with automatically generated optical flow annotations. The following essay from AImotive further explores this topic, including its use in developing and refining the company's autonomous vehicle control software.
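Before continuing to the AImotive essay below, the key property of synthetic data, that annotations come for free, can be illustrated with this simplified compositing sketch; in a real pipeline the pasted image would come from a 3D renderer, and all names here are placeholders.

import random

def synthesize_sample(background, rendered_object, index):
    # `background` and `rendered_object` are assumed to be numpy image arrays.
    obj_h, obj_w = rendered_object.shape[:2]
    image = background.copy()
    top = random.randint(0, image.shape[0] - obj_h)
    left = random.randint(0, image.shape[1] - obj_w)
    image[top:top + obj_h, left:left + obj_w] = rendered_object   # place the object

    annotation = {
        "image": "synthetic_%06d.png" % index,
        "bbox": [left, top, obj_w, obj_h],   # known exactly; no human labeling required
        "label": "chair",
    }
    return image, annotation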

In 2015, AImotive set out to create an AI-based camera-first self-driving solution. We knew that merging computer vision with artificial intelligence would offer unique capabilities. Simply put, any sensor setup would benefit from AI. But using it with computer vision would allow aiDrive to precisely predict the behavior of human drivers, pedestrians, animals and other objects around a vehicle.

However, as previously discussed in this article, training neural networks requires a massive amount of data. Furthermore, each network has to be trained on data sets that are curated and otherwise customized to the specific purpose of the network. These basic difficulties are only emphasized by the safety-critical nature of self-driving. And to further complicate matters, the data sets must also be variable. Only when trained on a variety of images from different environments will a network achieve the necessary accuracy to support full autonomy.

To address these challenges, AImotive developed a semi-automated annotation tool to create data sets. How images are processed in the tool is itself specific to the goal of the data set. aiDrive, our autonomous driving software suite, currently uses AI-based systems for lane detection, object and free space segmentation, bounding boxing, and depth estimation. Images for lane detection training are relatively easy to generate. Lanes are marked on key frames and then propagated through the video stream. Segmentation is aided by non-real-time, high-precision segmentation networks that pre-analyze complete image sequences.

Our annotation team then scours the data set to correct any mistakes. To assist with bounding boxing, each instance of an object is marked separately; based on this pixel-precise segmentation, bounding boxing is fully automated. Raw ground truth data from LIDAR and stereo cameras is used to train depth estimation algorithms. While the depth training data is relatively easy to collect, the benefits of having a distance estimation algorithm that uses only a single camera are huge. If one camera in the stereo setup becomes degraded, for example, depth information about the surroundings will still help in stopping the car safely.

AImotive follows a test-oriented development plan. In this, all systems undergo rigorous testing in aiSim, our photorealistic simulator. Only after they are deemed safe do we test them on public roads. The simulator also gives our team immediate feedback on our neural networks and the quality of our data sets. Tracking the effectiveness of our training through this internal feedback loop gives us the flexibility to improve our data sets in a relatively short timeframe, if needed.

Bence Varga
Business Development Manager, AImotive

Conclusion

Vision technology is enabling a wide range of products that are more intelligent and responsive than before, and thus more valuable to users. Vision processing can add valuable capabilities to existing products, and can provide significant new markets for hardware, software and semiconductor suppliers. Machine learning-based vision processing is an increasingly popular and robust alternative to classical computer vision algorithms, but it tends to be comparatively resource-intensive, which is particularly problematic for resource-constrained embedded system designs. However, by optimizing the various characteristics of the dataset used to initially train the network, it's possible to develop a machine learning-based embedded vision design that achieves necessary accuracy and inference speed levels while still meeting size, weight, cost, power consumption and other requirements.

Brian Dipert
Editor-in-Chief, Embedded Vision Alliance
Senior Analyst, BDTI

Sidebar: Additional Developer Assistance

The Embedded Vision Alliance, a worldwide organization of technology developers and providers, is working to empower product creators to transform the potential of vision processing into reality. AImotive, BDTI, iMerit and Luxoft, the co-authors of this article, are members of the Embedded Vision Alliance. The Embedded Vision Alliance's mission is to provide product creators with practical education, information and insights to help them incorporate vision capabilities into new and existing products. To execute this mission, the Embedded Vision Alliance maintains a website providing tutorial articles, videos, code downloads and a discussion forum staffed by technology experts. Registered website users can also receive the Embedded Vision Alliance’s twice-monthly email newsletter, Embedded Vision Insights, among other benefits.

The Embedded Vision Alliance’s annual technical conference and trade show, the Embedded Vision Summit, is coming up May 22-24, 2018 in Santa Clara, California. Intended for product creators interested in incorporating visual intelligence into electronic systems and software, the Embedded Vision Summit provides how-to presentations, inspiring keynote talks, demonstrations, and opportunities to interact with technical experts from Embedded Vision Alliance member companies. The Embedded Vision Summit is intended to inspire attendees' imaginations about the potential applications for practical computer vision technology through exciting presentations and demonstrations, to offer practical know-how for attendees to help them incorporate vision capabilities into their hardware and software products, and to provide opportunities for attendees to meet and talk with leading vision technology companies and learn about their offerings. More information, along with online registration, is now available.

The Embedded Vision Alliance also offers a free online training facility for vision-based product creators: the Embedded Vision Academy. This area of the Embedded Vision Alliance website provides in-depth technical training and other resources to help product creators integrate visual intelligence into next-generation software and systems. Course material in the Embedded Vision Academy spans a wide range of vision-related subjects, from basic vision algorithms to image pre-processing, image sensor interfaces, and software development techniques and tools such as OpenCL, OpenVX and OpenCV, along with Caffe, TensorFlow and other machine learning frameworks. Access is free to all through a simple registration process.

The Embedded Vision Alliance and its member companies periodically deliver webinars on a variety of technical topics, including various machine learning subjects. Access to on-demand archived webinars, along with information about upcoming live webinars, is available on the Alliance website. Also, the Embedded Vision Alliance is offering "Deep Learning for Computer Vision with TensorFlow," a series of one- and three-day technical training classes planned for a variety of both U.S. and international locations. See the Alliance website for additional information and online registration.

Sidebar: The Benefits of Custom Training Datasets, and a Case Study

The following essay from iMerit, a provider of custom dataset and annotation services, further explores themes introduced in the main article regarding the importance of a relevant and comprehensive training dataset, as well as detailing the company's experiences with one of its clients.

More high quality training data leads to higher inference accuracy in model outputs, as the main article's discussion introduces. While numerous sources of large-scale "open" data now exist, producing a market-ready product may still require the use of a robust, high-quality, high-volume custom training dataset, particularly when licensing restrictions preclude the commercial use of open-source images.

The race to artificial intelligence-based problem-solving is very real, with major industry investments joining with academic breakthroughs to lead to dramatic advances in machine learning scalability and open-source modeling approaches. These advancements are accelerating the speed at which new technologies and products derived from them are created and deployed, but the models they're based on aren’t necessarily market-ready. Attempts at production deployments based on these models may then fail, due to accuracy issues. Such failures are attributable to the lack of case-specific training data at scale, among other factors.

Leveraging open source datasets or synthetic data in the early phases of development is certainly cost-effective and, depending on the project, may also serve your production needs. In other cases, however, custom training datasets are required to achieve market-ready accuracy levels.

Training Data Factors That Affect Inference Output Accuracy

To keep diseases from being misdiagnosed, say, or to prevent self-driving cars from mistaking a person for an object, computer vision models need to be trained with massive amounts of high-quality, large-scale, relevant, comprehensive and, potentially, custom data. Open-source datasets may fall short in variety, quantity, and real-life examples. For example, if you are looking to identify objects, pre-made training sets may include perfectly framed pictures of the object, while in the real world this won’t always be the case.

The training set may also include disproportionately high amounts of certain image categories while neglecting others. If you were looking to identify dogs, an open-source dataset may have an abundance of images of Labrador Retrievers but few or no pictures of Jack Russell Terriers. And it may not have adequate representation of "edge cases" that are critical in your particular application (Figure A).

Figure A. Publicly available datasets may contain typical images of various subjects (top left and top right) but omit all-important image "edge cases" (bottom left and bottom right) (courtesy iMerit and Pexels).

Quality and Quantity

Depending on your project, quality may be straightforward to define. Many times, however, defining what a quality dataset looks like is a challenge in and of itself. The initial definition is typically based on project needs, but due to variable data sources, the type of labeling task, gaps in requirement perception, and contextual relevance, the definition of quality can evolve. Once you have settled on a definition of quality data for your project that ensures accuracy and consistency, you can then scale labeling efforts to build your training datasets.

While most academic reports and blog writeups about machine learning focus on improvements to algorithms and features, the use of larger training datasets may in fact be more impactful on results. See, for example, the research paper “The Unreasonable Effectiveness of Data” (PDF) and a follow-up blog post published by Google Research. The more images or video you use to train your algorithm, the more robust and accurate it likely will be (Figure B). But if open-source datasets don’t provide the amount of data you require in order to achieve ideal solution quality, the collection and labeling of a custom dataset may be needed.


Figure B. The more images or video you use to train your algorithm, the more robust and accurate it likely will be (courtesy iMerit).

Relevance and Comprehensiveness

While both data quality and quantity are crucial to output accuracy, so too is the relevance of the dataset. To produce a high-quality training dataset, you'll likely require raw image data that directly applies to the problem that you are trying to solve. If you're trying to train a computer to identify cats via an open-source dataset of animal images, for example, only having photos of dogs and rabbits on hand will not produce the accuracy results you need (Figure C).



Figure C. Relevance (top) and comprehensiveness (bottom) are both important training dataset characteristics (courtesy iMerit).

An accurate algorithm also requires training with a diverse and extensive dataset. If your objective is to identify cats, for example, the algorithm probably needs to be able to identify all 50+ different breeds of cats. It also needs to be able to identify household cats versus wild cats, identify sitting cats versus standing cats, etc. To accomplish this objective, a training dataset must include a spectrum of cat images. Depending on the complexity of the project and the desired outcome, this might mean 5,000 labeled images, or it might mean 5 million labeled images.

How KinaTrax Leverages Custom Datasets to Optimize Pitcher Performance

As you can see, there are many factors involved in achieving market-ready output quality levels. Ensuring that you have a large amount of high quality, relevant and comprehensive data is the first step. Next is ensuring accurate image labeling and annotation. A case study from KinaTrax, one of our partners, will exemplify this concept.

KinaTrax develops 3D kinematic models that are used by MLB (Major League Baseball) teams to monitor and improve pitchers’ performance, as well as prevent injuries. Because of the need for constant coverage from multiple angles, the company is unable to leverage the vast amount of already available video of MLB pitchers captured by broadcast networks, etc. Instead, KinaTrax’s Markerless Motion Capture System comprises a suite of imagers mounted throughout a baseball stadium to capture pitchers' detailed movements. Currently, these systems are installed at the home stadiums of the Chicago Cubs and the Tampa Bay Rays, along with another undisclosed team's home stadium.

KinaTrax uses computer vision and machine learning algorithms to capture a pitcher's biomechanics at more than 300 frames per second. The video is then recovered in 3D and reconstructed frame by frame, producing an image for every motion within the pitch sequence. These images are then annotated at 20 distinct joint centers by iMerit’s team of computer vision data experts (Figure D). iMerit provides on-demand and scalable end-to-end annotation resources through KinaTrax’s data analysis workflow. The 3D kinematic models can then find use in generating comprehensive and customizable biomechanic reports for evaluating mechanics over time, and for both performance enhancement and injury prevention outcomes.


Figure D. KinaTrax’s Markerless Motion Capture System converts video frames to a series of 3D images, which are then annotated at 20 distinct joint centers by iMerit's team of data experts (courtesy iMerit).

The idea of leveraging in-game data to enhance pitcher performance “will make a profound difference in major league baseball; it will change the game,” according to Steven Cadavid, KinaTrax's president. "Billy Beane revolutionized baseball with his analytical, evidence-based approach to selecting players dubbed ‘Moneyball’. KinaTrax’s offering is similarly revolutionary, supplying team management with the data they need to make the best decisions about a pitcher’s health. From an injury prevention standpoint, for example, the datasets we’re collecting are really unprecedented.”

KinaTrax is taking a fresh approach to motion capture technology by leveraging video annotation to obtain the data required to build 3D kinematic models. This means that the subjects, in this case the MLB pitchers, don’t need to wear markers to capture the data. This key enhancement in KinaTrax’s technology enables the company to capture not only training data but also in-game information. And in combination with iMerit’s on-demand dataset service offering, technology-improved human beings are revolutionizing the game of baseball.

Jai Natarajan
Vice President, Marketing and Technology, iMerit

Implementing Vision with Deep Learning in Resource-constrained Designs


DNNs (deep neural networks) have transformed the field of computer vision, delivering superior results on functions such as recognizing objects, localizing objects within a frame, and determining which pixels belong to which object. Even problems like optical flow and stereo correspondence, which had been solved quite well with conventional techniques, are now finding even better solutions using deep learning techniques. But deep learning is also resource-intensive, as measured by its compute requirements, memory and storage demands, network latency and bandwidth needs, and other metrics. These resource requirements are particularly challenging in embedded vision designs, which often have stringent size, weight, cost, power consumption and other constraints. In this article, we review deep learning implementation options, including heterogeneous processing, network quantization, and software optimization. Sidebar articles present case studies on deep learning for ADAS applications, and for object recognition.

Traditionally, computer vision applications have relied on special-purpose algorithms that are painstakingly designed to recognize specific types of objects. Recently, however, CNNs (convolutional neural networks) and other deep learning approaches have been shown to be superior to traditional algorithms on a variety of image understanding tasks. In contrast to traditional algorithms, deep learning approaches are generalized learning algorithms trained through examples to recognize specific classes of objects.

Since deep learning is a comparatively new approach, however, usage expertise for it within the developer community is less mature than for traditional alternatives. And much of this existing expertise is focused on resource-rich PCs rather than on comparatively resource-constrained embedded and other designs, as measured by factors such as:

  • Image capture (along with, potentially, depth discernment) subsystem capabilities
  • CPU and other processors' compute capabilities
  • Local cache and chip/system memory capacities, latencies and bandwidths
  • Local mass storage capacity, latency and bandwidth, and
  • Network connectivity reliability, latency and bandwidth

This article provides information on techniques for developing robust deep learning-based vision processing SoCs, systems and software for resource-constrained applications. It showcases, for example, the opportunities for and benefits of leveraging available heterogeneous computing resources beyond the CPU, such as a GPU, DSP and/or specialized processor. It also discusses the tradeoffs of various cache and main memory technologies and architectures, implemented at both the chip and system levels. It highlights hardware and software design toolsets and methodologies that assist in the optimization process. And it also introduces readers to an industry alliance created to help product creators incorporate vision capabilities into their hardware and software, along with outlining the technical resources that this alliance provides (see sidebar "Additional Developer Assistance").

Defining the Problem

Why is developing a deep learning-based embedded vision design, along with adapting a deep learning model initially targeting a PC implementation, so challenging? In answering these questions, the following introductory essay from Au-Zone Technologies sets the stage for the sections that follow it.

At the core of the constrained-resources predicament is the reality that technology innovations typically originate in academic research. There, and often in the initial profitable commercial opportunities that follow, development and implementation typically employ a desktop- or server-class computer platform. Such systems contain a relative abundance of processing, memory, storage and connectivity resources, and they're also comparatively unhindered by the size, weight, cost, power consumption and other constraints that are common in embedded design implementations. Whereas a PC developer might not think twice about standardizing on high-precision floating-point data and calculations, for example, an embedded developer would need to rely on low-precision fixed-point alternatives in order to balance the more challenging resource "budget."

Some specific examples of constraints and other development challenges in embedded implementations include:

  • The tradeoffs between (more expensive but faster and more power efficient) SRAM and (cheaper but slower and higher power) DRAM, both at a given level in the system memory hierarchy and as subdivided (both in terms of type and capacity) among various levels.
  • The increased commonality of a unified system memory approach shared by various heterogeneous processors (versus, say, a dedicated frame buffer for a GPU in a PC), and the increased resultant likelihood of contention between multiple processors for the same memory bus and end storage location resources.
  • Vision processing pipelines, along with low-level drivers, that may be "tuned" for human perception, and therefore conversely may be sub-optimal for computer vision purposes.
  • Software APIs, frameworks, libraries and other standards that are immature and therefore in rapid evolution.
  • The frequent necessity of purpose-built development tools, custom inference engines, and the like, along with the subsequent inability to easily migrate them to other implementations.

And how do these and other factors adversely affect embedded vision development? Whether defined by available compute resources, memory and mass storage capacity and bandwidth, network connectivity characteristics, battery capacities, the thermal envelope, size, weight, bill-of-materials costs, or the combination of multiple or all of these and other factors, these constraints fundamentally define what capabilities your product is able to support and how robustly it is able to deliver those capabilities. And equally if not more importantly, they affect how long your product will spend in development, not to mention the required project budget and manpower headcount, in order to hit necessary capability targets.

Brad Scott
President, Au-Zone Technologies

Heterogeneous Processing

Computer vision and machine learning present low-power mobile and embedded systems with significant challenges. It’s important, therefore, to leverage every bit of the computing potential present in your SoC and/or system. Designing from the outset with processing efficiency in mind is key. Dual-core CPUs first came to mainstream PCs in the mid-2000s; in recent years, multi-core processors have also become common in smartphones and tablets, and even in various embedded SoCs. GPUs, whether integrated in the same die/package as the rest of the application processor or in discrete form, are also increasingly common.

Modern embedded CPUs have steadily improved in their ability to tackle some parallel tasks, as vector-based SIMD (single instruction multiple data) architecture extensions, for example, become available and are leveraged by developers. Modern embedded GPUs have conversely approached the same problem from the opposite direction, becoming more adept at some serial tasks. Together, the CPU and GPU can handle much if not all of the machine learning workload.

A key message when it comes to processor choice, in considering the spectrum of possible vision and machine learning workloads, is that a one-size solution can't possibly fit all possible use cases. Thinking heterogeneously when designing your SoC or system, leveraging a selection of processors with strengths in different areas, can reap dividends when it comes to overall efficiency.  For example, many modern SoCs already come equipped with both "big" and "little" CPUs (and clusters of them), along with a GPU. The “big” CPU cores are present for moments (often short bursts) when high performance is needed, while the “little” cores are intended for sustained processing efficiency, and the GPU delivers as-required massively parallel computation ability. When used intelligently, via use of the OpenCL API and other programming techniques, you end up with a great deal of efficient processing power available to you.

The challenge, of course, is efficiently spreading your machine learning and computer vision pipeline workloads across all available computing resources in such a heterogeneous fashion. A common technique for these kinds of pipelines, when implemented on desktop or server systems, is to make many copies of large data buffers. On mobile and embedded systems, conversely, such memory copies are highly inefficient, both in terms of the time taken to perform the copies, and (crucially) in the amount of energy these sorts of operation consume. Fortunately, however, embedded and mobile SoCs typically implement a unified, global memory array, making the “zero-copy” ideal at least somewhat feasible in reality.
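A hedged pyopencl sketch of the "zero-copy" idea on a unified-memory SoC follows: passing USE_HOST_PTR asks the OpenCL runtime to work from the existing host allocation rather than duplicating it. Whether a copy is truly avoided depends on the driver and the hardware, so treat this as illustrative only.

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

frame = np.zeros((1080, 1920, 4), dtype=np.uint8)   # e.g., a camera frame already in host memory
mf = cl.mem_flags

# USE_HOST_PTR hints that the device should reference the host buffer directly,
# unlike COPY_HOST_PTR, which explicitly duplicates the data into a new allocation.
frame_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.USE_HOST_PTR, hostbuf=frame)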

A remaining problem, at least historically: the CPU and GPU have separate caches, which means that transferring workloads between processors has typically required expensive (in terms of latency and/or power consumption) synchronization operations. Any potential performance gains made when moving workloads between processors can be negated in the process. These days, fortunately, designing your SoC using a modern, cache-coherent interconnect can make all the difference. Caches are kept up-to-date automatically, which means that you can more freely swap workloads between various processor types.

The inclusion of computer vision and deep learning accelerators alongside CPUs and GPUs in modern SoCs is becoming increasingly commonplace. The following essay from Synopsys discusses their potential to boost the performance, along with lowering the energy consumption, of deep learning-based vision algorithms, as well as providing various implementation suggestions.

Along with facing the usual embedded challenges of power and area consumption, a designer architecting an embedded SoC for computer vision and deep learning must also tackle some unique challenges and constraints, such as steadily increasing system complexity and the rapid pace of technology advancement. Power and area constraints therefore need to be balanced against performance and accuracy targets; the latter is very important for object detection and classification tasks, which are increasingly implemented using deep learning techniques. And memory bandwidth can also become a limiting factor that requires insight and understanding.

An embedded vision SoC, such as one based on Synopsys’ DesignWare EV6x processor family, combines both traditional computer vision processing units and newer deep learning engines (Figure 1). The EV6x includes a vision processor that combines both scalar and vector DSP resources, along with a programmable CNN engine. The scalar unit is programmed via a C/C++ compiler, while the vector unit is programmed using an OpenCL C compiler. A CNN graph, such as AlexNet, ResNet, or VGG16, is trained using Caffe, TensorFlow or another software framework, and is then mapped into the CNN engine’s hardware. The CNN engine is programmable in the sense that any graph can be mapped to its hardware.


Figure 1. The various components of an embedded vision SoC can together implement a robust ADAS or other processing solution (courtesy Synopsys).

These three heterogeneous processing units deliver the best performance for a given power and area budget, because each is optimized for its specific tasks. Each dedicated processing unit must be programmed, so designers need a robust set of software tools that enable mapping of embedded vision solutions across the processing units. DesignWare EV6x processors, for example, are fully programmable and supported by the MetaWare EV development toolkit, which includes software development tools based on the OpenVX, OpenCV and OpenCL C embedded vision standards.

Performance, power and area are some of the constraints that need to be balanced against each other. Different projects will have different prioritization orders for these and other constraints. An automotive ADAS (advanced driver assistance system) design based on a high-resolution front camera might prefer to prioritize performance, but limitations in clock speed defined by process technology (and associated cost), power consumption (and associated heat dissipation) and other parameters will constrain the SoC's maximum performance capabilities (see sidebar "Optimization at the System Level, An ADAS Case Study"). Selecting a processing solution that allows for scaling of the CNN engine can be an effective means of trading off various constraints.

Performance

Evaluating the performance of different embedded vision systems is not a straightforward process, since no benchmark standard currently exists. Comparing the number of multiply-accumulators (MACs) in different CNN engines can provide a first-order assessment, since CNN implementations require a large number of MACs. TeraMAC/s has therefore become a popular metric for specifying CNN engines. The EV6x, for example, can scale from 880 to 3,520 MACs, which at a 1,280 MHz clock frequency (under typical operating conditions and when fabricated on a 16 nm process node), delivers a performance range of 1.1 to 4.5 TMAC/s. Obscured by this metric, unfortunately, is the MACs' precision. Two different CNN engines, for example, may look similar from a TMAC/s standpoint:

880 12-bit MACs × 1,280 MHz = 1.1 TMAC/s

1,024 8-bit MACs × 1,000 MHz = 1.0 TMAC/s

What is missing from these high-level TMAC/s performance numbers, however, is the bit resolution of the MACs in each CNN engine.

Accuracy

Bit resolution can significantly impact system accuracy, a critical metric for an application such as a front camera in an automobile. Most embedded vision CNNs exclusively use fixed-point calculations, versus floating-point calculations, since silicon cost (and therefore the amount of silicon area consumed) is a key design parameter. The EV6x, for example, uses optimized 12-bit integer MACs that support 12- or eight-bit calculations. Most CNN graphs can be executed with eight-bit precision and no loss of accuracy; however, some graphs benefit from the additional resolution that a 12-bit MAC provides.

A deep graph such as ResNet-152, for example, which has a large path distance from start node to end node, generally does not perform well in an eight-bit system. Graphs that have a large amount of pixel processing operations, such as Denoiser, also tend to fare poorly when using eight-bit CNNs (Figure 2). Conversely, using even higher bit resolutions, such as 16 bits or 32 bits, doesn't normally produce significantly improved accuracy but does adversely impact the silicon area required to implement the solution.
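A quick way to build intuition for why bit resolution matters is to quantize the same weight distribution to different fixed-point widths and compare the resulting error. The sketch below assumes simple symmetric quantization of a synthetic weight set, not any particular vendor's quantizer:

```python
import numpy as np

def quantize(x: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric fixed-point quantization to the given bit width."""
    levels = 2 ** (bits - 1) - 1            # e.g., 127 for 8 bits
    scale = np.max(np.abs(x)) / levels
    return np.round(x / scale) * scale      # quantize, then dequantize

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.05, size=100_000)  # toy weight distribution

for bits in (8, 12, 16):
    err = np.mean((weights - quantize(weights, bits)) ** 2)
    print(f"{bits:2d}-bit quantization, mean squared error: {err:.2e}")
```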


Figure 2. A 12-bit fixed-point implementation of the Denoiser algorithm delivers results practically indistinguishable from those of a floating-point implementation, while eight-bit fixed-point results exhibit more obvious residual noise artifacts (courtesy Synopsys).

Area

Some SoCs prioritize silicon area ahead of either performance or power consumption. A designer of an application processor intended for consumer surveillance applications, for example, might have limited area available on the SoC, therefore striving to obtain the highest-achievable performance at a given die "footprint". Selecting the most advanced, i.e., smallest, process node available is often the best way to minimize area. Moving from 28 nm to 16 nm, or from 16 nm to 12 nm or lower, can increase clock frequency capabilities, reduce power consumption at a given clock speed, and decrease die size. However, process node selection is often dictated by company initiatives that are beyond the SoC designer's influence.

Selecting an optimized embedded vision processor is the next best way to minimize area, since such a processor will maximize the return on the silicon area investment. Another way to minimize area is to reduce the required memory sizes. For an EV6x core containing both a vision processor and a CNN "engine", for example, multiple memory arrays are present: for the scalar and vector DSPs as well as the CNN accelerator, along with a closely coupled partition for sharing data between the processing units. While memory size recommendations exist, the final choice is left to the SoC designer. Keep in mind, however, that reducing memory capacities could negatively impact system performance, among other ways by increasing bandwidth on the AXI bus. Another way to address memory size on the EV6x is to choose a single- versus dual-MAC configuration. Again, smaller area has to be balanced against higher performance.

Power Consumption

Consumer and mobile designers often list power consumption as the most critical constraint. A designer of an SoC for a mobile device, for example, might strive for the best performance within a 200 mW power budget. Fortunately, embedded vision processors are designed with low power in mind. Fine-tuning clock frequencies is the most straightforward way to lower the power consumption budget. A system that could perform at clock speeds as high as 1280 MHz might instead be clocked at 500 MHz or lower, at least in some operating modes, to optimize battery life and reduce heat dissipation. Doing so, of course, can degrade performance. Power consumption is affected by both area (e.g., transistor count) and frequency. Generally, the smaller the area and lower the transistor toggle rate, the less power your design will consume.

Bandwidth

Bandwidth on the AXI or equivalent interconnect bus is often a top concern, related to power consumption. Reducing the number of external bus transactions will save power. Increasing the amount of memory (at the tradeoff of higher required silicon area, as previously discussed) in order to decrease the number of required external bus accesses is one way to achieve this goal. However, deep learning research is also coming up with various pruning and compression techniques that reduce the number and type of computations, along with the amount of memory, needed to implement a given CNN graph.

Take a VGG16 graph, for example. Reducing the number of coefficients, through a pruning process that deletes coefficients close to zero and then retrains the system to retain accuracy, can significantly reduce the number of coefficients that have to be stored in memory, therefore cutting down bandwidth. However, this process doesn't necessarily lower the calculation load, and therefore the power consumption, unless (as with the EV6x) hardware support also exists to discard MAC operations with zero inputs. Using eight-bit coefficients when 12-bit support isn't needed for accuracy will also lower both bus bandwidth and external memory requirements. The EV6x, for example, supports eight-bit as well as 12-bit coefficients and feature map values.
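The sketch below shows the core of such magnitude-based pruning in principle; the threshold is a placeholder, and the retraining step needed to recover accuracy is omitted:

```python
import numpy as np

def prune_small_weights(weights: np.ndarray, threshold: float) -> np.ndarray:
    """Zero out coefficients whose magnitude falls below the threshold."""
    pruned = weights.copy()
    pruned[np.abs(pruned) < threshold] = 0.0
    return pruned

rng = np.random.default_rng(1)
conv_weights = rng.normal(scale=0.02, size=(3, 3, 256, 256))  # toy VGG-style layer

pruned = prune_small_weights(conv_weights, threshold=0.01)
print(f"Fraction of zeroed coefficients: {np.mean(pruned == 0.0):.1%}")
# Zeroed coefficients compress well in memory (lowering bandwidth), and hardware
# that skips zero-input MACs can also avoid the corresponding computations.
```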

Gordon Cooper
Product Marketing Manager for Embedded Vision Processors, Synopsys

Software Optimization

While it's highly beneficial to fully leverage all available processing resources in a SoC and/or system, as discussed in the previous section of this article, it's equally important to optimize the deep learning-based vision software that's running on those processing resources. The following essay from BDTI explores this topic in detail, including implementation suggestions.

Optimization of a compute-intensive workload can be described in terms of one or multiple of the following levels of abstraction:

  • Algorithm optimization: modify the algorithm to do less computation, and/or better fit the target hardware.
  • Software architecture optimization: come up with a software architecture that maximizes system throughput, e.g., avoiding needless data copies, making efficient use of caches, enabling efficient use of parallel resources, etc.
  • Hot-spot optimization: individually optimize the most time-consuming functions.

The greatest gains are often found at the highest levels of abstraction, and therefore highest-level optimizations should be undertaken first. A fully customized implementation of an application, one that has been thoroughly optimized at each of these levels and for the specific use case, will likely yield the best power/performance result. But such a time-consuming and expensive development path may not be practical. For DNN deployment in particular, a full-custom implementation of a CNN is a risky undertaking: network topologies are evolving rapidly, and if you spend several man-months optimizing for one topology, by the time you’re finished you might find out that you should be using some other topology instead. Because of this risk, it's rare to see DNNs optimized to the last clock cycle or the last mW.

What's typically seen in practice, instead, are frameworks for deployment that don’t assume a specific topology. In terms of software architecture, these frameworks are designed to handle well-known published topologies such as ResNet50, GoogLeNet, and SqueezeNet. The underlying individual functions are also highly optimized, using these same well-known networks as benchmarks. Because these frameworks need to have a lot of flexibility, they can’t necessarily match the peak performance of a fully customized implementation.

Keep in mind, too, that if you're striving to deploy a network that’s vastly different from the “benchmark” networks for which the framework was designed, you might end up with sub-optimal performance. A character recognition network that operates on 40x40 pixel images, for example, probably has a much smaller memory footprint than that of commonly-used image classification networks, making the character recognition network amenable to optimizations that conversely wouldn't be effective in a framework designed to deploy larger networks (see sidebar "Optimization at the System Level, An Object Recognition Case Study").

Some examples of hardware vendor-provided frameworks for efficient deployment of DNNs include (not intended to be a comprehensive list):

These particular DNN deployment frameworks are based on underlying libraries that are thoroughly optimized by the hardware vendor. In leveraging them, developers often have two implementation options:

  1. Use the deployment framework at a high level (i.e., simply feed it the network description and weights, and let automated tools do all the work), or
  2. In some cases it’s possible to make direct calls to the framework's underlying library API, thereby obtaining better results via some manual optimization of the implementation architecture.

Because the underlying library is so well optimized by the hardware vendor, both of these options can end up delivering better performance than what a developer could practically achieve otherwise, given limited budget and schedule, even though, as previously noted, a full-custom implementation would theoretically yield even better results.

One other note: since the deployment frameworks provided by hardware vendors need to be flexible enough to support various topologies, they often leave the first level of optimization listed at the beginning of this essay, algorithm optimization, to the developer. If you can design and train a smaller topology (measured by some combination of fewer weights, fewer activations, and/or fewer operations), you’ll likely end up with better performance. This may be the case even if your tailored implementation is seemingly less optimal because your resultant network now doesn't look like one of the well-known topologies for which the framework was initially designed.

Therefore, if you’ve selected a well-known topology that your research suggested would give you the smallest/fastest/etc. outcomes for your application, trained it, and ended up with acceptable results, ask yourself the following question: is the accuracy of your trained network limited by the topology design, or by the training methods (choice of optimizer, meta parameters), or by limitations of the training data, or by fundamental limitations of the application? It’s possible that you could shrink feature maps, remove layers, or tweak the network topology in various other ways to make the network smaller without losing much (if any) accuracy.
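As a purely illustrative sketch of what a deliberately small topology might look like (none of the layer counts or widths here come from the projects discussed in this article), a compact Keras classifier for small inputs can be defined and sized in a few lines, making shrink-and-retrain experiments quick to run:

```python
import tensorflow as tf

def small_classifier(num_classes: int, input_shape=(40, 40, 1)) -> tf.keras.Model:
    """A deliberately small CNN: few layers, narrow feature maps."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

model = small_classifier(num_classes=36)  # e.g., a character recognizer
model.summary()  # inspect the parameter count before committing to training
```

Removing a convolution block or narrowing the feature maps, then retraining, is then a one-line change that can be evaluated against the accuracy target.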

Conclusion

Vision technology is enabling a wide range of products that are more intelligent and responsive than before, and thus more valuable to users. Vision processing can add valuable capabilities to existing products, and can provide significant new markets for hardware, software and semiconductor suppliers. Deep learning-based vision processing is an increasingly popular and robust alternative to classical computer vision algorithms, but it tends to be comparatively resource-intensive, which is particularly problematic for resource-constrained embedded system designs. However, by effectively leveraging all available heterogeneous computing nodes, efficiently utilizing memory and interconnect bandwidth (both between various processors and their local and shared memory), and harnessing leading-edge software tools and techniques, it's possible to develop a deep learning-based embedded vision design that achieves necessary accuracy and inference speed levels while still meeting size, weight, cost, power consumption and other requirements.

By Brian Dipert
Editor-in-Chief, Embedded Vision Alliance
Senior Analyst, BDTI

Sidebar: Additional Developer Assistance

The Embedded Vision Alliance, a worldwide organization of technology developers and providers, is working to empower product creators to transform the potential of vision processing into reality. Au-Zone Technologies, BDTI and Synopsys, the co-authors of this article, are members of the Embedded Vision Alliance. The Embedded Vision Alliance's mission is to provide product creators with practical education, information and insights to help them incorporate vision capabilities into new and existing products. To execute this mission, the Embedded Vision Alliance maintains a website providing tutorial articles, videos, code downloads and a discussion forum staffed by technology experts. Registered website users can also receive the Embedded Vision Alliance’s twice-monthly email newsletter, Embedded Vision Insights, among other benefits.

The Embedded Vision Alliance’s annual technical conference and trade show, the Embedded Vision Summit, is coming up May 22-24, 2018 in Santa Clara, California. Intended for product creators interested in incorporating visual intelligence into electronic systems and software, the Embedded Vision Summit provides how-to presentations, inspiring keynote talks, demonstrations, and opportunities to interact with technical experts from Embedded Vision Alliance member companies. The Embedded Vision Summit is intended to inspire attendees' imaginations about the potential applications for practical computer vision technology through exciting presentations and demonstrations, to offer practical know-how for attendees to help them incorporate vision capabilities into their hardware and software products, and to provide opportunities for attendees to meet and talk with leading vision technology companies and learn about their offerings. More information, along with online registration, is now available.

The Embedded Vision Alliance also offers a free online training facility for vision-based product creators: the Embedded Vision Academy. This area of the Embedded Vision Alliance website provides in-depth technical training and other resources to help product creators integrate visual intelligence into next-generation software and systems. Course material in the Embedded Vision Academy spans a wide range of vision-related subjects, from basic vision algorithms to image pre-processing, image sensor interfaces, and software development techniques and tools such as OpenCL, OpenVX and OpenCV, along with Caffe, TensorFlow and other deep learning frameworks. Access is free to all through a simple registration process.

The Embedded Vision Alliance and its member companies periodically deliver webinars on a variety of technical topics, including various deep learning subjects. Access to on-demand archive webinars, along with information about upcoming live webinars, is available on the Alliance website. Also, the Embedded Vision Alliance is offering "Deep Learning for Computer Vision with TensorFlow," a series of one- and three-day technical training classes planned for a variety of both U.S. and international locations. See the Alliance website for additional information and online registration.

Sidebar: Optimization at the System Level, An ADAS Case Study

The design project discussed in the following case study from Au-Zone Technologies, developed in partnership with fellow Alliance member company NXP Semiconductors, showcases real-life implementations of the heterogeneous computing and software optimization concepts covered in the main article, along with other system-level fine-tuning optimization techniques.

As with any computer vision design exercise, system-level optimization must be taken into consideration when creating deep learning-based vision processing solutions. Considering the raw volume of data created by a real-time video stream, the compute horsepower required for real-time CNN processing of the stream, and the limitations imposed by embedded SoCs, optimization at every opportunity during development and deployment is essential to creating commercially practical solutions.

To make these system-level design optimization concepts more tangible, this case study describes the development of a practical TSR (traffic sign recognition) solution for deployment on NXP Semiconductors' i.MX8 processors. In this example, we focus on the principal areas of system optimization within the constraints imposed by a typical embedded system:

  1. Deploying a deep learning-based vision processing pipeline on a heterogeneous platform
  2. Neural network design choices and tradeoffs
  3. Inference engine performance optimizations

Although this case study focuses on a specific image classification problem and processor architecture, the design methodology and optimization principles can be generalized to solve many different embedded vision processing and classification problems, on divergent hardware.

System Design Objectives and Optimization Approaches

The overall objective of this project was to design a practical object detection and classification solution that uses as little compute horsepower as possible, characterized by the following boundary conditions:

  • Robust TSR detection and classification using only standard optics, sensors and processor(s)
  • Real-time processing of the video stream coming from a camera
  • Classification accuracy greater than 98% on test data, and greater than 90% on live video
  • Low-light performance suitable for automotive market applications
  • A cabling solution able to provide 2-3m separation between the camera head and compute engine
  • The use of automotive-grade silicon (sensor, processor and supporting components)
  • The use of best-in-class computer vision techniques to implement the vision pipeline
  • An overall system cost that is appropriate for the market application

Focusing on overall system optimization forces the developer to consider many different, often conflicting, design and performance aspects to find a balance that is appropriate for a given application. The following diagram highlights some of the most common parameters that need to be considered in any development program (Figure A). Depending on the application, market and performance objectives, the relative priority of these parameters will vary.


Figure A. The relative priority of various common system design optimization parameters will vary depending on the particular application and other variables (courtesy Au-Zone Technologies).

Taking a holistic approach to maximize accuracy while minimizing compute cost (time and energy), production hardware cost (the engineering bill of materials, or eBOM) and engineering effort often becomes a challenging balancing act for designers. System design for a deep learning-based vision solution is an iterative process; you will need both a starting point and a general framework for how to approach the problem in order to deliver a commercially viable solution within a resource-constrained design.

Although there is no single recipe to follow, the following outlined sequence provides a generalized framework for your consideration. Assuming the objective for your project will be to implement a real-time classification system with optimum accuracy on lowest-cost hardware, the developer should consider optimization in the following order of diminishing returns:

  1. Minimize data volume and processing requirements
  2. Distributed processing, pipelines, and heterogeneous computing
  3. Optimization of individual vision pipeline stages
    1. Image Processing
    2. Object Detection
    3. Object Classification

Minimize Data Volume and Processing Requirements

It may seem obvious, but an easy trap to fall into when designing an embedded imaging system involves using commonly available cameras and image sensors, which are designed to provide extremely high image quality. The development team architecting the hardware may identify sensors that are convenient from a hardware design perspective, for example, without having visibility into how this decision may make the downstream processing problem overly complex or even impossible to solve on a given platform. Often, such devices produce data streams with image quality and resolution far exceeding what is actually required to solve common recognition or classification problems, making all subsequent processing steps much more complicated than necessary.

Specific image sensor (and camera) parameters affecting the data rate include:

  • Color vs monochromatic capture
  • Pixel depth
  • Frame rate
  • Image resolution (i.e., what is the minimum viable resolution necessary to solve the problem?)
  • Scaling (nearest-neighbor or interpolation)
  • Binning (2x2, 4x4 grouping)
  • Data type/format/color space (e.g., RAW, Bayer, YUV, RGB, etc. (Figure B))

Pursuing all opportunities to minimize data input and reduce complexity at the front end will provide significant benefits downstream.
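A quick data-rate estimate makes these tradeoffs concrete. The numbers below are purely illustrative; real sensors add packing, blanking and interface overhead:

```python
def raw_data_rate_mbytes_per_s(width: int, height: int, bits_per_pixel: int,
                               channels: int, fps: float) -> float:
    """Uncompressed pixel data rate in megabytes per second."""
    bits_per_frame = width * height * bits_per_pixel * channels
    return bits_per_frame * fps / 8 / 1e6

# Illustrative comparison: full-resolution RGB vs. a binned, monochrome stream.
print(raw_data_rate_mbytes_per_s(1920, 1080, 8, 3, 30))  # ~187 MB/s
print(raw_data_rate_mbytes_per_s(960, 540, 8, 1, 30))    # ~15.6 MB/s
```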


Figure B. The output format and other characteristics of the image sensor, such as in this Bayer filter pattern approach, can greatly affect the payload size of the data stream to be subsequently processed (courtesy Au-Zone Technologies).

Distributed Processing, Pipelines, and Heterogeneous Computing

The Wikipedia definition for heterogeneous computing refers to "systems that use more than one kind of processor or cores. These systems gain performance or energy efficiency not just by adding the same type of processors, but by adding dissimilar coprocessors, usually incorporating specialized processing capabilities to handle particular tasks."

Given the overall computational load required by a deep learning-based vision solution running at a robust frame rate, the first significant design optimization opportunity is to map the vision pipeline to a heterogeneous compute platform with processing elements that are well suited to each stage in the pipeline. The following diagrams show the simplified mapping between our TSR pipeline and the corresponding hardware compute elements in the i.MX8 based system (Figure C).



Figure C. A simplified visualization of the vision processing pipeline used to implement this project (top), can be mapped onto various heterogeneous processing elements in the target SoC (bottom) (courtesy Au-Zone Technologies).

Optimization of Individual Vision Pipeline Stages

One consequence of this pipelined computing architecture is a derived system design optimization constraint for the entire pipeline: to ensure that the data type and format of the processed data from one stage maps well to the next. Unnecessary data movement or format conversions between any stages will result in needless compute time and energy consumption.

It’s worth noting that implementing the ‘fastest’ solution at each stage in the pipeline, and/or using the fewest total number of stages to construct the pipeline, will not always result in a fully optimized solution if data translations between stages are still required. For this reason, it becomes very important to evaluate the overall end-to-end performance of the system before ‘committing’ to any specific element in the vision pipeline or supporting hardware. In this case study, the two highest-cost processing functions in the vision pipeline are detection (since the entire scene must be processed) and classification (due to the high compute requirements of the CNN). Focusing first on minimizing these two stages results, at least in this case, in the best end-to-end efficiency.

The image sensor selected for the case study is one concrete example of this kind of design optimization. In order to provide good low-light performance for automotive applications, we selected a high dynamic range image sensor that also includes a basic on-chip ISP (image signal processor) block, whose image quality optimization capabilities (de-Bayering, white balance, auto exposure and aperture/gamma corrections) offload later pipeline stages from needing to accomplish these tasks. However, this image sensor only has two output color format options, YUV and RAW, which imposes a color space constraint on the following stages.

In other cases, therefore, the designer may decide to use an image sensor with support for other color space output options, alternatively perform ISP functions downstream, or design a system with no ISP in order to optimize for bill-of-materials cost. Designing a deep learning system with no ISP suggests the need for a custom training dataset that has been generated using a similar imaging pipeline, thereby shifting cost from hardware to the front-end development effort. This tradeoff may be favorable if a custom dataset is required anyway.

Color Space Conversion

Since detection and classification are the two most expensive system functions in the pipeline, finding the most efficient method to implement each of them, along with providing each of them with an optimum input format, is crucial to developing an efficient solution. The principal objective of the color space conversion stage is to transform the input frames coming from the image sensor into formats well suited for each of these stages. The following diagram summarizes this process (Figure D). The ordering of the conversion sequence may seem odd at first glance, with the YUV→RGB conversion performed first even though the RGB data isn't consumed until the later classification stage. However, this ordering is beneficial: the overall YUV→RGB→GRY (greyscale) conversion implementation cost ends up being lower than it would be using other approaches.


Figure D. The particular color space conversion sequence used in this project was selected in order to minimize the total implementation cost (courtesy Au-Zone Technologies).

Since the GTSRB (German Traffic Sign Recognition Benchmark) training images, for example, are RGB in format, as are the images operated on by most deep learning CNNs, this format is required for the classification stage. The lowest-cost method to convert from YUV to RGB makes use of a well-known technique described formulaically as:

R = 1.164(Y - 16) + 1.596(V - 128)
G = 1.164(Y - 16) - 0.813(V - 128) - 0.391(U - 128)
B = 1.164(Y - 16) + 2.018(U - 128)

The pipeline sequence buffers this image format for subsequent processing in the classification stage, after performing detection and identifying regions of interest.

Object Detection

Since the detection stage must process all pixels in the field of view in order to identify high-probability candidate regions for subsequent classification, it must be very efficient. Although implementing a CNN-based detector is possible, traditional computer vision-based detection techniques are often much more computationally efficient. These computer vision operations can also often be performed in greyscale format without loss of accuracy, thereby providing a 3:1 reduction in pixel processing at the cost of an upfront conversion. The format conversion from RGB to greyscale is described formulaically as:

GRY = (R * 0.2126 + G * 0.7152 + B * 0.0722)

Due to the highly parallel nature of both of these color space conversions, they can be accomplished very efficiently, at full resolution and frame rate, by shader kernels running on the GPU. The overall cost of these conversions is insignificant in terms of end-to-end processing time, and they provide the two critical stages with the input formats on which they operate most efficiently.
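A minimal NumPy sketch of the two conversions described above (vectorized per-pixel arithmetic; a GPU shader applies the same math per fragment). It assumes full-resolution Y, U and V planes, and the coefficients are taken directly from the formulas in this section:

```python
import numpy as np

def yuv_to_rgb(y: np.ndarray, u: np.ndarray, v: np.ndarray) -> np.ndarray:
    """YUV (video range) to RGB, using the coefficients given above."""
    y_ = y.astype(np.float32) - 16
    u_ = u.astype(np.float32) - 128
    v_ = v.astype(np.float32) - 128
    r = 1.164 * y_ + 1.596 * v_
    g = 1.164 * y_ - 0.813 * v_ - 0.391 * u_
    b = 1.164 * y_ + 2.018 * u_
    return np.clip(np.stack([r, g, b], axis=-1), 0, 255).astype(np.uint8)

def rgb_to_grey(rgb: np.ndarray) -> np.ndarray:
    """RGB to greyscale, using the luma weights given above."""
    weights = np.array([0.2126, 0.7152, 0.0722], dtype=np.float32)
    return (rgb.astype(np.float32) @ weights).astype(np.uint8)
```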

The candidate detection algorithm scans every frame to look for high-probability regions of interest where traffic signs are likely to be present. It is a first-order search, only performing detection without any attempt to classify. Using traditional computer vision techniques, the algorithm detects the basic shapes of traffic signs (such as circles, triangles, and rectangles) in the overall scene and generates a corresponding ROI (region of interest). This is done by first extracting the edges in the image with a Sobel filter and then processing those edges for contours.
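A hedged OpenCV sketch of this style of first-order detector follows; the production pipeline is more elaborate and runs on the embedded target's accelerators, but the Sobel-edges-then-contours flow is the same idea, and the threshold and area filter below are placeholder values:

```python
import cv2
import numpy as np

def detect_candidate_rois(grey: np.ndarray, min_area: int = 400) -> list:
    """Return bounding boxes of closed shapes that might be traffic signs."""
    # Edge magnitude from horizontal and vertical Sobel responses.
    gx = cv2.Sobel(grey, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(grey, cv2.CV_32F, 0, 1, ksize=3)
    edges = cv2.convertScaleAbs(cv2.magnitude(gx, gy))
    _, binary = cv2.threshold(edges, 60, 255, cv2.THRESH_BINARY)

    # Contours of the edge map become region-of-interest candidates.
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > min_area]
```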

Once regions of interest are determined by the detection stage, the corresponding ROI parameters are then passed to the CNN stage of the pipeline for classification. The overriding goal of the detection stage is high recall, i.e., false positives are acceptable as long as false negatives (missed detections) are minimized. This tradeoff ensures that all reasonable candidates will pass to the CNN stage for classification; the CNN will handle rejections as part of its subsequent processing.

Object Classification

With one or more high-probability candidate regions identified in a scene, the specific pixel data from the RGB image for each ROI is passed to the CNN for classification. Any given scene may contain multiple candidates that require classification, which are therefore processed sequentially (Figure E). Since the distance to the object is variable, and since the input side of the neural network is fixed in resolution, the candidate ROIs are first scaled to match the neural network input resolution.
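In code, the per-candidate preparation step might look like the sketch below. The 24 x 24 input size, the normalization, and the `classifier` callable are illustrative assumptions rather than details of the production implementation:

```python
import cv2
import numpy as np

NET_INPUT = (24, 24)  # assumed fixed input resolution of the classifier

def classify_candidates(rgb_frame: np.ndarray, rois: list, classifier) -> list:
    """Scale each ROI to the network input size and classify it in turn."""
    results = []
    for (x, y, w, h) in rois:
        patch = rgb_frame[y:y + h, x:x + w]
        patch = cv2.resize(patch, NET_INPUT, interpolation=cv2.INTER_LINEAR)
        patch = patch.astype(np.float32) / 255.0            # assumed normalization
        results.append(classifier(patch[np.newaxis, ...]))  # sequential inference
    return results
```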


Figure E. Any given scene may contain multiple traffic sign recognition candidates requiring sequential classification (courtesy Au-Zone Technologies).

Optimizing inference involves the consideration of several factors. First and foremost is the tradeoff between execution time and accuracy; larger, more complex networks can sometimes provide higher accuracy, but only within limits. And when optimizing execution time, you should focus on two general areas: network design optimizations and runtime inference optimizations.

Network Design Optimizations

The most important aspect to consider when designing a neural network to solve a classification problem is to make sure that it is of an appropriate size and depth. Example networks can be plotted on a graph with respect to their accuracy versus compute time at various model sizes, and then grouped into three general categories based on the complexity of the image classification problem they are appropriate for solving (Figure F). One network type or size does not solve all problems, and perhaps even more importantly, larger networks are not always better if real-time performance is important for your application.


Figure F. A comparison of various network model options organizes them with respect to both their accuracy versus compute time at various model sizes, and the complexity of the image classification problems they're appropriate to solve (courtesy Au-Zone Technologies).

In examining the example networks in the medium complexity category in more detail, for example, you'll discover the multiple tradeoffs to be made within this subgroup (Table A). The relative priority of accuracy, inference time and memory usage in your design situation will be reflected in a particular ranking for the various examples.

Network Topology | Input Resolution | Layers / Weights | Training Accuracy † | Runtime (MB) | Inference (msec)
TSR Net | 24 x 24 | 8 / 500K | 97% | 2.0 | 1.4
TSR Net | 56 x 56 | 8 / 2.5M | 97% | 10.0 | 7.2
Fully Convolutional (FCN) | 24 x 24 | 5 / 275K | 95% | 0.5 | 1.0
Multi-Layer Perceptron (MLP) | 24 x 24 | 2 / 1M | 78% | 1.5 | 0.8
ResNet * | 24 x 24 | 18 / 260K | 97% | 1.2 | 8.5
SqueezeNet * | 24 x 24 | 12+ / 5M+ | 98% | 5 | 5.0

Table A. A comparison of medium-complexity neural networks implemented on a desktop computer (notes: †=inference accuracy after 100 epochs of training, *=custom network designs with key features derived from named network) (courtesy Au-Zone Technologies)

Key takeaways from this data include the fact that the use of higher resolution inputs for training, testing and/or live video does not demonstrably improve accuracy, but will have a significant negative impact on runtime size and inference time. To be specific, further testing based on desktop computer evaluation showed that the 24 x 24 input resolution setting was optimal. Also, slightly reducing the accuracy expectation (-2%) compared to a fully convolutional network enabled reducing inference time by more than 6 ms.

With the TSRnet topology, for example, calculating the distribution of compute time across the network layers enables a developer to quickly focus on layers that are consuming the most compute time, targeting them for further optimization (Figure G). If bottlenecks become apparent with a given network model, the developer can modify the network to simplify or eliminate layers, subsequently retraining and retesting quickly.
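One framework-agnostic way to obtain such a per-layer time distribution is simply to time each stage's forward computation in isolation, as in the sketch below; `layers` is a hypothetical list of callables standing in for the network's stages, and production profiling would add warm-up runs and averaging:

```python
import time

def profile_layers(layers, x):
    """Time each layer's forward pass and report its share of the total."""
    timings = []
    for layer in layers:
        start = time.perf_counter()
        x = layer(x)                      # forward through this stage only
        timings.append(time.perf_counter() - start)
    total = sum(timings)
    for i, t in enumerate(timings):
        print(f"layer {i:2d}: {t * 1e3:6.2f} ms  ({t / total:5.1%} of total)")
    return x
```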




Figure G. With the TSRnet topology (top), calculating the distribution of compute time across the network layers rapidly identifies layers consuming the most compute time (middle), which are targets for further optimization (bottom) (courtesy Au-Zone Technologies)

Runtime Inference Optimizations

With a suitable neural network design capable of solving the classification problem evaluated on a desktop computer to use as a starting point, the next obstacles to overcome are to implement that network on the resource-constrained embedded target, and then optimize the embedded implementation for efficiency. When implementing the runtime on the embedded target, the developer has several options, each with tradeoffs, to consider:

  1. Code the model directly using existing libraries such as ACL, BLAS, Eigen and processor-specific methods. This is a particularly good option if few network design iterations or processor-specific optimizations are required to reach an appropriate solution.
  2. Modify existing open-source training frameworks to perform inference on your embedded platform.
  3. Leverage TensorFlow Lite to retrain an existing model on your data. This is a particularly good option if the problem you’re working on fits well with existing network design and supported core operations. Also, this option straightforwardly targets Android and iOS platforms.
  4. Implement a dynamic inference engine capable of loading and evaluating models. This is the "long game" option, requiring the most upfront investment to achieve a fully optimized solution capable of loading and evaluating different network topologies. The benefit of this investment is an engine fully optimized for any particular processor architecture and fully independent of the network design.

For this case study, we pursued option 4, using the DeepViewRT Run Time Inference Engine to target execution of the neural network on the processor. Table B shows inference times evaluated on the i.MX8 processor, and using the DeepViewRT dynamic inference engine for the same networks previously described in Table A.

Network Topology | Input Resolution | Layers / Weights | Training Accuracy † | Runtime (MB) | Inference (msec)
TSR Net | 24 x 24 | 8 / 500K | 97% | 2.0 | 2.8
TSR Net | 56 x 56 | 8 / 2.5M | 97% | 10.0 | 14.0
Fully Convolutional (FCN) | 24 x 24 | 5 / 275K | 95% | 0.5 | 9.5
Multi-Layer Perceptron (MLP) | 24 x 24 | 2 / 1M | 78% | 1.5 | 2.5
ResNet * | 24 x 24 | 18 / 260K | 95% | 1.2 | 3.9
SqueezeNet * | 24 x 24 | 12+ / 5M+ | <80% | 5 | >1000

Table B. A comparison of medium-complexity neural networks implemented on the i.MX8 embedded target (notes: †=inference accuracy after 100 epochs of training, *=custom network designs with key features derived from named network) (courtesy Au-Zone Technologies)

Regardless of the particular path you choose to implement run-time inference, the overall objective remains the same: to minimize the system compute load and time required to perform inference for a given network on a particular target processor. Optimizations for resource-constrained embedded processors cluster into two general categories: compute techniques, and processor architecture exploits.

Compute Techniques

Compute techniques are generally portable from one hardware platform to another, although the benefits seen on one processor are not always duplicated exactly on another. Some specific areas of optimization to consider are outlined in the following lists.

  • Neural network and linear algebra libraries
  • Computational transforms
  • Separable convolutions
  • Computational data format ordering: NHWC, NCHW, etc. (see the sketch after this list)
    • N refers to the number of images in a batch.
    • H refers to the number of pixels in the vertical (height) dimension.
    • W refers to the number of pixels in the horizontal (width) dimension.
    • C refers to the channels: for example, 1 for black-and-white or grayscale images and 3 for RGB.
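The sketch below shows what such a layout change amounts to in practice: the same tensor values, reordered in memory so that whichever dimension the target processor's kernels traverse fastest is contiguous. The shapes used are arbitrary illustration values:

```python
import numpy as np

batch = np.zeros((8, 224, 224, 3), dtype=np.float32)      # NHWC: channels last

nchw = np.ascontiguousarray(batch.transpose(0, 3, 1, 2))   # NCHW: channels first
print(batch.shape, "->", nchw.shape)  # (8, 224, 224, 3) -> (8, 3, 224, 224)
# Same values, different memory order; the preferable layout is whichever one
# the target processor's convolution kernels can stream most efficiently.
```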

Data I/O optimizations:

  • Data reuse
  • Dimensional ordering
  • Image data reuse
  • Filter data reuse
  • Tiling: blocking vs linear
  • Kernel and layer fusing

Processor Architecture Exploits

Hardware-specific optimizations are typically tied to a particular processor architecture and should not be assumed to be portable. Some specific examples are outlined in the following lists.

Processor core type and instruction set

  • CPU
  • GPU
  • Vision DSP
  • FPGA
  • Neural network processor

Memory use optimization

  • Registers
  • L1 and L2 cache
  • SRAM vs DRAM, and standard vs DDR (double data rate) memory interfaces
  • Cache line/non-strided accesses
  • Float, half, fixed data formats
  • Bit width tradeoffs
  • Non-linear quantization

Brad Scott
President, Au-Zone Technologies

Sidebar: Optimization at the System Level, An Object Recognition Case Study

In developing an optimized deep learning-based embedded vision design, putting realistic constraints on the characteristics of images (and kinds of objects in those images) that the classifier will be tasked with handling can be quite impactful in terms of reducing the required implementation resources. However, it's equally important to expand those constraints as needed to comprehend all possible image data scenarios. The following case study from BDTI, based on a real-life research project for a client, covers both of these points.

The goal of this project was to train a classifier that achieved very high accuracy (above 99%) on 25 object categories, while also being practical to implement on an embedded platform. The project was very exploratory in nature; the target hardware had not yet been defined, so we had no specific constraints on compute load, storage, or bandwidth. However, we could not assume that "cloud" connectivity was available; inference had to happen entirely in the embedded system.

We knew that we could make some reasonable assumptions about the data: the object of interest was always in a known region of the image, for example. The distance from the camera to the object of interest would also be measured separately from the image classifier, so that the classifier could assume that images were appropriately scaled and cropped using this information. These assumptions meant that the inputs to the classifier would be fairly uniform in position and scale, and we would therefore be able to design a “lightweight” classifier for the job.

In designing a classifier CNN based on these assumptions, we selected kernel sizes, feature map sizes, etc. to limit the compute load and memory footprint (including the number of weights and the size of activation matrices). In considering the effort to implement the CNN on an embedded CPU/DSP, we therefore also deliberately avoided normalization layers and other features that would have increased coding effort.

We were initially given both a training dataset and a validation dataset. At the end of the project, we would also receive additional test sets, which would include input conditions not covered in the training and validation datasets. By measuring accuracy on these test sets, we would be able to determine how well the initial training had generalized to the additional situations in the test sets. However, we were not allowed to see the test sets in advance, and we did not know what kinds of conditions would be present in the test sets. In general, this is non-ideal practice: a good rule of thumb is that artificial neural networks only learn what they are shown, and they can’t be expected to generalize to conditions not covered in training. Unfortunately in this case we didn’t even obtain any up-front information about the range of input conditions in the test sets.

The training and validation datasets consisted of images captured under uniform and otherwise optimum conditions: good visibility of the objects of interest, good lighting, etc. We were therefore able to rapidly design a lightweight CNN classifier and train it to achieve over 99.9% accuracy on the validation dataset. But we knew that the training and validation sets' conditions were too close to ideal, and it was therefore very unlikely that this initial training would generalize well to more diverse conditions. So we next used simple image processing techniques to simulate more challenging conditions such as occlusions, shadows, and other types of image noise.
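As an illustration of this kind of simple augmentation (the actual transformations and parameters used in the project were more varied and are not reproduced here), a random occlusion can be simulated with a few lines of NumPy:

```python
import numpy as np

def random_occlusion(image: np.ndarray, rng: np.random.Generator,
                     max_fraction: float = 0.3) -> np.ndarray:
    """Black out a random rectangle to simulate partial occlusion."""
    h, w = image.shape[:2]
    occ_h = rng.integers(1, int(h * max_fraction) + 1)
    occ_w = rng.integers(1, int(w * max_fraction) + 1)
    y = rng.integers(0, h - occ_h + 1)
    x = rng.integers(0, w - occ_w + 1)
    out = image.copy()
    out[y:y + occ_h, x:x + occ_w] = 0
    return out
```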

We created a new validation set with these simulated challenges. On it, the network we had originally trained scored only 66.7% accuracy; not at all surprising, because it had not been trained to deal with the simulated challenges. We then added these same simulated challenge conditions to the training set. We also created a custom layer in Caffe to simulate random occlusions on the fly during training. We retrained the network, which then achieved 100% accuracy on the original validation set, and 99.2% accuracy on the simulated-challenge validation set. And on the three test sets that we subsequently received, our network accuracy scores were as high as 99.9%.

We estimate that the CNN we designed could execute inference in a few hundred msec on a fully loaded 1 GHz NEON-capable ARM core, with no need for a GPU as a co-processor; a very lightweight and otherwise embedded-friendly design. Note that since this project focused only on initial network topology design and training, subsequent pruning or quantization of the network was not explored, and memory "footprint" estimates are therefore not available. However, we believe that pruning and quantization (with retraining) would be very effective, with a negligible resultant reduction in accuracy.

Implementing High-performance Deep Learning Without Breaking Your Power Budget

This article was originally published at Synopsys' website. It is reprinted here with the permission of Synopsys.

Computer Vision in Surround View Applications


The ability to "stitch" together (offline or in real-time) multiple images taken simultaneously by multiple cameras and/or sequentially by a single camera, in both cases capturing varying viewpoints of a scene, is becoming an increasingly appealing (if not necessary) capability in an expanding variety of applications. High quality of results is a critical requirement, one that's a particular challenge in price-sensitive consumer and similar applications due to their cost-driven quality shortcomings in optics, image sensors, and other components. And quality and cost aren't the sole factors that bear consideration in a design; power consumption, size and weight, latency and other performance metrics, and other attributes are also critical.

Seamlessly combining multiple images capturing varying perspectives of a scene, whether taken simultaneously from multiple cameras or sequentially from a single camera, is a feature which first gained prominence with the so-called "panorama" mode supported in image sensor-equipped smartphones and tablets. Newer smartphones offer supplemental camera accessories capable of capturing a 360-degree view of a scene in a single exposure. The feature has also spread to a diversity of applications: semi- and fully autonomous vehicles, drones, standalone consumer cameras and professional multi-camera capture rigs, etc. And it's now being used to not only deliver "surround" still images but also high frame rate, high resolution and otherwise "rich" video. The ramping popularity of various AR (augmented reality) and VR (virtual reality) platforms for content playback has further accelerated consumer awareness and demand.

Early, rudimentary "stitching" techniques produced sub-par quality results, thereby compelling developers to adopt more advanced computational photography and other computer vision algorithms. Computer vision functions that will be showcased in the following sections implement seamless "stitching" of multiple images together, including aligning features between images and balancing exposure, color balance and other characteristics of each image. Dewarping to eliminate perspective and lens distortions is critical to a high quality result, as is calibration to adjust for misalignment between cameras (as well as to correct for alignment shifts over time and use). Highlighted functions include those for ADAS (advanced driver assistance systems) and autonomous vehicles, as well as for both professional and consumer video capture setups; the concepts discussed will also be more broadly applicable to other surround view product opportunities. And the article also introduces readers to an industry alliance created to help product creators incorporate vision capabilities into their hardware and software, along with outlining the technical resources that this alliance provides (see sidebar "Additional Developer Assistance").

Surround View for ADAS and Autonomous Vehicles

The following essay was written by Alliance member company videantis and a development partner, ADASENS. It showcases a key application opportunity for surround view functions: leveraging the video outputs of multiple cameras to deliver a distortion-free and comprehensive perspective around a car to human and, increasingly, autonomous drivers.

In automotive applications, surround view systems are often also called 360-degree video systems. These systems increase driver visibility, which is a valuable capability when undertaking low-speed parking maneuvers, for example. They present a top-down (i.e. "bird’s-eye") view of the vehicle, as if the driver was positioned above the car. Images from multiple cameras combine into a single perspective, presented on a dashboard-mounted display. Such systems typically use 4-6 wide-angle cameras, mounted on the rear, front and sides of the vehicle, to capture a full view of the surroundings. And the computer vision-based driver safety features they support, implementing various image analysis techniques, can warn the driver or even partially-to-completely autonomously operate the vehicle.

Surround view system architectures implement two primary, distinct functions:

  • Camera calibration: in order to combine the multiple camera views into a single image, knowledge of each camera’s precise intrinsic and extrinsic parameters is necessary.
  • Synthesis of the multiple video streams into a single view: this merging-and-rendering task combines the images from the different cameras into a single natural-looking image, and re-projects that resulting image on the display.

Calibration

In order to successfully combine the images captured from the different cameras into a single view, it's necessary to know the extrinsic parameters that represent the location and orientation of the camera in the 3D space, as well as the intrinsic parameters that represent the optical center and focal length of the camera. These parameters may vary on a per-camera basis due to manufacturing tolerances in the factory; they can also change after the vehicle has been manufactured due to the effects of accidents, temperature variations, etc. Each camera's extrinsic parameters can even be affected by factors such as vehicle load and tire pressure. Therefore, camera calibration must be repeated at various points in time: during camera manufacturing, during vehicle assembly, at each vehicle start, and periodically while driving the car (Figure 1).
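As a point of reference (not the production calibration flow described here), intrinsic calibration against a known checkerboard target can be performed with OpenCV's standard routines, as sketched below; the board dimensions and square size are placeholder values:

```python
import cv2
import numpy as np

def calibrate_intrinsics(images, board_size=(9, 6), square_size=0.025):
    """Estimate the camera matrix and distortion coefficients from checkerboard views."""
    # 3D coordinates of the checkerboard corners in the board's own frame.
    objp = np.zeros((board_size[0] * board_size[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:board_size[0], 0:board_size[1]].T.reshape(-1, 2) * square_size

    obj_points, img_points = [], []
    for img in images:
        grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        found, corners = cv2.findChessboardCorners(grey, board_size)
        if found:
            obj_points.append(objp)
            img_points.append(corners)

    image_size = images[0].shape[1::-1]  # (width, height)
    # Returns RMS reprojection error, camera matrix, distortion coefficients,
    # and per-view rotation/translation vectors (the extrinsics of each view).
    return cv2.calibrateCamera(obj_points, img_points, image_size, None, None)
```

Target-less, in-the-field calibration, discussed next, replaces the known pattern with cues derived from the scene itself.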


Figure 1. Multiple calibration steps, at various points in time throughout an automotive system's life, are necessary in order to accurately align images (courtesy ADASENS and videantis).

One possibility, and a growing trend, is to integrate this calibration capability within the camera itself, in essence making the cameras in a surround view system self-aware. Calibration while the car is driving, for example, is known as marker-less or target-less calibration. In addition, the camera should also be able to diagnose itself and signal the operator if the lens becomes dirty or blocked, both situations that would prevent the surround view system from operating error-free. Signaling the driver (or car) in cases when the camera can’t function properly is particularly important in situations involving driver assistance or fully automated driving. Two fundamental algorithms, briefly discussed here, address the desire to make cameras self-aware: target-less calibration based on optical flow, and view-block detection based on machine learning techniques.

Calibration can be performed using the vanishing point theory, which estimates the position and orientation of the camera in 3D space (Figure 2). Consecutive frames from a monocular camera, in combination with CAN (Controller Area Network) bus-sourced data such as wheel speeds and the steering wheel angle, are inputs to the algorithm. The vanishing point is a virtual point in 2D image coordinates corresponding to the point where the 2D projections of a set of parallel 3D lines converge. Using this method, the roll, pitch and yaw angles of a camera can be estimated with accuracies of up to 0.2°, without need for markers. Such continuously running calibration is also necessary in order to adapt the camera to short- and long-term changes, and it can be used as an input for other embedded vision algorithms, such as a crossing-traffic alert that calculates a time-to-collision used to trigger a warning to the driver.


Figure 2. The vanishing point technique, which estimates the position and orientation of a camera in 3D space, is useful in periodically calibrating it (courtesy ADASENS and videantis).

Soil or blockage detection is based on the extraction of image quality metrics such as sharpness and saturation (Figure 3). This data can then trigger a cleaning system, for example, as well as provide a confidence level to other system functions, such as an autonomous emergency braking system that may not function correctly if one or multiple cameras are obscured. Soil and blockage detection extracts prominent image quality metrics, which are combined into a feature vector and "learned" using a support vector machine, which performs discriminative feature identification. Temporal filtering, along with a hysteresis function, are also incorporated in the algorithm in order to prevent "false positives" due to short-term changes and soil-based flickering.
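A simplified sketch of this approach follows: compute a couple of per-frame quality metrics, then let a support vector machine separate clear from soiled or blocked views. The two metrics and the scikit-learn classifier are illustrative stand-ins for the production feature vector and learned model, and the temporal filtering mentioned above is omitted:

```python
import cv2
import numpy as np
from sklearn.svm import SVC

def quality_features(bgr: np.ndarray) -> np.ndarray:
    """Two simple image quality metrics: sharpness and mean saturation."""
    grey = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    sharpness = cv2.Laplacian(grey, cv2.CV_64F).var()   # soiled/blurred views score low
    saturation = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)[..., 1].mean()
    return np.array([sharpness, saturation])

def train_blockage_detector(frames, labels) -> SVC:
    """Fit an SVM on labeled example frames (1 = blocked/soiled, 0 = clear)."""
    features = np.stack([quality_features(f) for f in frames])
    return SVC(kernel="rbf").fit(features, labels)
```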


Figure 3. A camera lens that becomes dirty or blocked could prevent the surround view system from operating error-free (courtesy ADASENS and videantis).

Merging and Rendering

The next step in the process involves combining the captured images into a unified surround view image that can then be displayed. It involves re-projecting the camera-sourced images, originally taken from different viewpoints, and more generally merging the distinct video streams. In addition to the live images themselves, it utilizes the various virtual cameras' viewpoint parameters, as well as the characteristics of the surface that the rendered image will be re-projected onto.

One common technique is to simply project the camera images onto a plane that represents the ground. However, this approach results in various distortions; for example, objects rising above the ground plane, such as pedestrians, trees and street lights, will be unnaturally stretched out (Figure 4). The resulting unnatural images make it harder for the driver to accurately gauge distances to various objects. A common improved technique is to render the images onto a bowl-shaped surface instead. This approach results in a less-distorted final image, but it still contains some artifacts. Ideally, therefore, the algorithm would re-project the cameras' images onto the actual 3D structure of the vehicle’s surroundings.



Figure 4. Projection onto the ground plane tends to flatten and stretch objects (top); projection onto a "bowl" surface instead results in more natural rendering (bottom) (courtesy ADASENS and videantis).

System Architecture Alternatives

One typical system implementation encompasses the various cameras along with a separate ECU (electronic control unit) "box" that converts the multiple camera streams into a single surround view image, which is then forwarded to the head unit for display on the dashboard (Figure 5). Various processing architectures for the calibration, computer vision, and rendering tasks are available. Some designs leverage multi-core CPUs or GPUs, while other approaches employ low-cost and lower-power vision processors. Videantis' v-MP4000HDX and v-MP6000UDX vision processor families, for example, efficiently support all required visual computing tasks, including calibration, computer vision, and rendering. Camera interface options include LVDS and automotive Ethernet; in the latter case, videantis' processors can also handle the requisite H.264 video compression and decompression, thereby unifying all necessary visual processing at a common location.



Figure 5. One common system architecture option locates surround view and other vision processing solely in the ECU, which then sends rendered images to the head unit (top). Another approach subdivides the vision processing between the ECU and the cameras themselves (bottom) (courtesy ADASENS and videantis).

Another prevalent system architecture incorporates self-aware cameras, thereby reducing the complexity in their common surround view ECU. This approach provides an evolutionary path toward putting even more intelligence into the cameras, resulting in a scalable system with multiple options for the car manufacturer to easily provide additional vision-based features. Enabling added functionality involves upgrading to more powerful and intelligent cameras; the base cost of the simplest setup remains low. Such an approach matches up well with the car manufacturer's overall business objectives: providing multiple options for the consumer to select from while purchasing the vehicle.

Marco Jacobs
Vice President of Marketing, videantis

Florian Baumann
Technical Director, ADASENS

Surround View in Professional Video Capture Systems

Surround video and VR (virtual reality) are commonly (albeit mistakenly) interchanged terms; as Wikipedia notes, "VR typically refers to interactive experiences wherein the viewer's motions can be tracked to allow real-time interactions within a virtual environment, with orientation and position tracking. In 360-degree video, the locations of viewers are fixed, viewers are limited to the angles captured by the cameras, and [viewers] cannot interact with the environment." With that said, VR headsets (whether standalone or smartphone-based) are also ideal platforms for live- or offline-viewing both 180- and 360-degree video content, which in some cases offers only an expanded horizontal perspective but in other cases delivers a full spherical display controlled by the viewer's head location, position and motion. The following essay from AMD describes the implementation of Radeon Loom, a surround video capture setup intended for professional use, thereby supporting ultra-high image resolutions, high frame rates and other high-end attributes.

One of the key goals of the Radeon Loom project was to enable real-time preview of 360-degree video in a headset such as an Oculus Rift or HTC Vive, while simultaneously filming it with a high-quality cinematic camera setup (see sidebar "Radeon Loom: A Historical Perspective"). Existing solutions comprise either low-end cameras, which don't deliver sufficient quality levels for Hollywood expectations, or very expensive high-end cameras that take lengthy periods of time to produce well-stitched results. After several design iterations, AMD came up with several implementation options (Figure 6).



Figure 6. An example real-time stitching system block diagram (top) transformed into reality with AMD's Radeon Loom (bottom) (courtesy AMD).

Important details of the design include the fact that it uses a high-performance workstation graphics card, such as a FirePro W9100 or one of the newer Radeon Pro WX series. These higher-end cards support more simultaneously operating cameras, as well as higher per-camera resolutions. Specifically, Radeon Loom uses Blackmagic cameras with HDMI outputs, converting them to SDI (Serial Digital Interface) via per-camera signal converters (SDI is common in equipment used in the broadcast and film industries). The Blackmagic cameras support gen-lock (generator locking), which synchronizes the simultaneous start of multi-camera capture to an external sync-generator output signal. Other similarly featured (i.e., HDMI output and gen-lock input) cameras would work just as well.

Once the data is in the GPU's memory, a complex set of algorithms (to be discussed shortly) tackles stitching together all the images into a 360-degree spherical video. Once stitching is complete, the result is sent out over SDI to one or more PCs equipped with HMDs for immediate viewing and/or streaming to the Internet.

Practical issues on the placement of equipment for a real-time setup also require consideration. Each situation is unique; possible scenarios include filming a Hollywood production with a single equipment rig or broadcasting a live concert with multiple distributed cameras. With 360-degree camera arrays, for example, you don’t generally have an operator behind the camera, since he or she would then be visible in the captured video. In such a case, you would also probably want to locate the stitching and/or viewing PCs far away, or behind a wall or green screen, for example.

Why Stitching is Difficult

Before explaining how stitching works, let's begin with a brief explanation of why it's such a challenging problem to solve. If you've seen any high-quality 360-degree videos, you might have concluded that spherical stitching is a solved problem. It isn't. With that said, however, algorithm pioneers deserve abundant credit for incrementally solving many issues with panoramic stitching and 360-degree VR stitching over the past few decades. Credit also goes to the companies that have produced commercial stitching products and helped bring VR authoring to the masses (or at least to the early adopters).

Fundamental problems still exist, however: parallax, camera count versus seam count, and the exposure differences between sensors are only a few examples (Figure 7). Let's cover parallax first. Simply stated, two cameras in two different locations and positions will see the same object from two different perspectives, just as the same finger held close to your nose appears to have different backgrounds when sequentially viewed from each of your eyes (opened one at a time). Ironically, this disparity is what the human brain uses to determine depth when combining the images. But it also causes problems when trying to merge two separate images together and fool your eyes and brain into thinking they are one image.



Figure 7. Parallax (top) and lens distortion effects (bottom) are two of the fundamental problems that need to be solved in order to deliver high-quality stitching (courtesy AMD).

The second issue: more cameras are generally better, because you end up with higher effective resolution and improved optical quality (due to less distortion from a larger number of narrower-view lenses, versus a smaller number of fisheye lenses). However, more cameras also means more seams between the per-camera captured images, a scenario that creates more opportunities for artifacts. As people and other objects move across the seams, the parallax problem repeatedly reveals itself, with small angular differences. It is also more difficult to align all of the images when multiple cameras exist; misalignment leads to "ghosting." And more seams also means more processing time.

Each camera's sensor is also dealing with different lighting conditions. For example, if you're capturing a 360-degree video containing a sunset, you'll have a west-facing camera looking at the sun, while an east-facing camera is capturing a much darker region. Although clever algorithms exist to adjust and blend the exposure variations across images, this blending comes at the cost of lighting and color accuracy, as well as overall dynamic range. The problem is amplified in low-light conditions, potentially limiting artistic expression.

Other problems also exist; they too have solutions, but at higher cost. For example, most digital cameras use a "rolling shutter" as opposed to the more costly "global shutter." Global shutter-based cameras capture every pixel at the same time. Conversely, rolling shutter cameras sequentially capture horizontal rows of pixels at different points in time. When stitching together images shot using rolling shutter-based cameras, some of the pixels in overlapping image areas will have been captured at different times, potentially resulting in erroneous disparities.

With those qualifiers stated, it's now time for an explanation of 360-degree video stitching and how AMD optimized its code to run in real time. To begin, let's look at the overall software hierarchy and the processing pipeline (Figure 8).


Figure 8. Radeon Loom's software hierarchy has OpenVX at its nexus (courtesy AMD).

An OpenVX™ Foundation

AMD built the Loom stitching framework on top of OpenVX, a foundation that is important for several reasons. OpenVX is an open standard supported by the Khronos Group, an organization that also developed and maintains OpenGL™, OpenCL™, Vulkan™ and many other industry standards. OpenVX is also well suited to this and similar software tasks, because it allows the underlying hardware architecture to optimally execute the compute graph (pipeline), while the details of how the hardware obtains its efficiency don't need to be exposed to upper software levels. The AMD implementation of OpenVX, which is completely open-sourced on GitHub, includes a Graph Optimizer that conceptually acts like a compiler for the whole pipeline.

Additionally, and by design, the OpenVX specification allows each implementation to decide how to process each workload. For example, processing could be done out of order, in tiles, in local memory, or handled in part or in its entirety by dedicated hardware. This flexibility means that as both the hardware and software drivers improve, the stitching code can automatically take advantage of these enhancements, similar to how 3D games automatically achieve higher frame rates, higher resolutions and other improvements with new hardware and new driver versions.
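To make this more concrete, here is a minimal sketch, written in C++ against the standard OpenVX C API, of how one small stage of such a pipeline might be expressed as a graph. It is not Loom's actual source; the image dimensions and node choices (a channel extract followed by a 3x3 Sobel) are illustrative assumptions only.

```cpp
// Minimal OpenVX graph sketch: extract one channel of a camera frame, then
// run a 3x3 Sobel filter. Illustrative only; Loom's real graph contains many
// more nodes (lens correction, warping, seam finding, blending).
#include <VX/vx.h>
#include <cstdio>

int main() {
    const vx_uint32 width = 1920, height = 1080;   // assumed camera resolution

    vx_context context = vxCreateContext();
    vx_graph   graph   = vxCreateGraph(context);

    // Input frame (RGB) plus intermediate/output images owned by the context.
    vx_image rgb    = vxCreateImage(context, width, height, VX_DF_IMAGE_RGB);
    vx_image gray   = vxCreateImage(context, width, height, VX_DF_IMAGE_U8);
    vx_image grad_x = vxCreateImage(context, width, height, VX_DF_IMAGE_S16);
    vx_image grad_y = vxCreateImage(context, width, height, VX_DF_IMAGE_S16);

    // Declare the graph; the implementation decides how to schedule/fuse nodes.
    vxChannelExtractNode(graph, rgb, VX_CHANNEL_G, gray);  // green as a luma proxy
    vxSobel3x3Node(graph, gray, grad_x, grad_y);

    if (vxVerifyGraph(graph) == VX_SUCCESS) {
        // In a real pipeline this call runs once per captured frame.
        vxProcessGraph(graph);
        std::printf("graph executed\n");
    }

    // Releasing the graph and context also releases the images they own.
    vxReleaseGraph(&graph);
    vxReleaseContext(&context);
    return 0;
}
```

Because the graph is declared up front and verified before execution, the OpenVX implementation is free to fuse, tile or offload these nodes however the underlying hardware prefers.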

The Loom Stitching Pipeline

Most of the steps in the stitching pipeline are the same regardless of whether you are stitching in real time or offline (i.e., in batch mode) (Figure 9). The process begins with a camera rig capturing a group of videos, typically to SD flash memory cards, a hard drive or digital video tape, depending on the camera model. After shooting, you copy all of the files to a PC and launch the stitching application.


Figure 9. The offline stitching pipeline has numerous critical stages, and is similar to the real-time processing pipeline alternative (courtesy AMD).

Before continuing with the implementation details, let's step back for a second and set the stage. The goal is to obtain a spherical output image to view in a VR headset (Figure 10). However, it's first necessary to create a flat projection of the sphere. The 360-degree video player application will then warp the flat projection back into a sphere. This, the most common method, is called an equirectangular projection. With that said, other projection approaches are also possible.



Figure 10. Achieving the goal, a spherical image to view in a headset (top), first involves rendering a flat projection, which is then warped (bottom) (courtesy AMD).
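For reference, the mapping underlying an equirectangular projection is simply longitude and latitude. The sketch below (generic projection math, not AMD's code) shows how a unit view direction converts to flat-image coordinates and back, which is essentially what the player evaluates per pixel when wrapping the flat image onto a sphere.

```cpp
// Standard equirectangular mapping: a unit view direction (x, y, z) maps to
// normalized texture coordinates (u, v) in the flat projection, and back.
// Generic projection math, not AMD-specific code.
#include <cmath>

constexpr float kPi = 3.14159265358979f;

struct Vec3 { float x, y, z; };
struct UV   { float u, v; };   // both in [0, 1]

// Direction -> equirectangular coordinates (longitude/latitude).
UV dirToEquirect(const Vec3& d) {
    float lon = std::atan2(d.x, d.z);   // [-pi, pi]
    float lat = std::asin(d.y);         // [-pi/2, pi/2]
    return { lon / (2.0f * kPi) + 0.5f,
             lat / kPi + 0.5f };
}

// Equirectangular coordinates -> direction (what a 360-degree player
// effectively computes per screen pixel when rendering the sphere).
Vec3 equirectToDir(const UV& t) {
    float lon = (t.u - 0.5f) * 2.0f * kPi;
    float lat = (t.v - 0.5f) * kPi;
    return { std::cos(lat) * std::sin(lon),
             std::sin(lat),
             std::cos(lat) * std::cos(lon) };
}
```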

The first step in the pipeline is to decode each video stream, which the camera has previously encoded into a standard format, such as H.264. Next is to perform a color space conversion to RGB. Video codecs such as H.264 store the data in a YUV format, typically YUV 4:2:0 to obtain better compression. The Loom pipeline supports color depths from 8-bit to 16-bit. Even with 8-bit inputs and outputs, some internal steps are performed with 16-bit precision to maximize quality.
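As a rough illustration of this stage, here is a fixed-point sketch of an 8-bit BT.601 (limited-range) YUV-to-RGB conversion. The coefficients are the widely used integer approximations; the standard actually applied depends on how the cameras encoded the streams, and Loom's own conversion code is not reproduced here.

```cpp
// Sketch of an 8-bit YUV (BT.601, limited range) to RGB conversion using
// fixed-point arithmetic with wide intermediates, illustrating why internal
// precision beyond 8 bits helps even when inputs and outputs are 8-bit.
#include <algorithm>
#include <cstdint>

static inline uint8_t clamp8(int v) {
    return static_cast<uint8_t>(std::min(255, std::max(0, v)));
}

void yuvToRgbPixel(uint8_t y, uint8_t u, uint8_t v,
                   uint8_t& r, uint8_t& g, uint8_t& b) {
    // Coefficients scaled by 256 so the math stays in integers.
    int c = (static_cast<int>(y) - 16) * 298;   // 1.164 * 256
    int d =  static_cast<int>(u) - 128;
    int e =  static_cast<int>(v) - 128;

    r = clamp8((c + 409 * e + 128) >> 8);             // 1.596 * 256
    g = clamp8((c - 100 * d - 208 * e + 128) >> 8);   // 0.391, 0.813 * 256
    b = clamp8((c + 516 * d + 128) >> 8);             // 2.018 * 256
}
```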

Next comes the lens correction step. The specifics are to some degree dependent on the characteristics of the exact lenses on your cameras. Essentially what is happening, however, is that the distortion artifacts introduced by each camera's lens are corrected to make straight lines actually appear straight, both horizontally and vertically (Figure 11). Fisheye lenses and circular fisheye lenses have even more (natural) distortion that needs to be corrected.



Figure 11. Lens distortion correction (top) is particularly challenging with fisheye lenses (bottom) (courtesy AMD).
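The sketch below illustrates the general idea with a simple radial (Brown-Conrady-style) polynomial model: for each pixel of the corrected output, compute where it maps to in the distorted source and sample from there. The model, coefficients and nearest-neighbor sampling are illustrative assumptions; real fisheye lenses typically require an angle-based model and calibrated per-lens parameters.

```cpp
// Sketch of radial lens-distortion correction using a simple polynomial model.
// The coefficients k1, k2 and the camera intrinsics are per-lens calibration
// data; fisheye lenses generally need a different (angle-based) model.
#include <cmath>
#include <cstdint>
#include <vector>

struct Intrinsics { float fx, fy, cx, cy, k1, k2; };

// For each pixel of the undistorted output, find where it came from in the
// distorted input, then sample (nearest-neighbor here for brevity).
void undistort(const std::vector<uint8_t>& src, std::vector<uint8_t>& dst,
               int width, int height, const Intrinsics& in) {
    dst.assign(static_cast<size_t>(width) * height, 0);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            // Normalized camera coordinates of the ideal (undistorted) pixel.
            float xn = (x - in.cx) / in.fx;
            float yn = (y - in.cy) / in.fy;
            float r2 = xn * xn + yn * yn;
            float scale = 1.0f + in.k1 * r2 + in.k2 * r2 * r2;
            // Corresponding location in the distorted source image.
            int sx = static_cast<int>(std::lround(xn * scale * in.fx + in.cx));
            int sy = static_cast<int>(std::lround(yn * scale * in.fy + in.cy));
            if (sx >= 0 && sx < width && sy >= 0 && sy < height)
                dst[static_cast<size_t>(y) * width + x] =
                    src[static_cast<size_t>(sy) * width + sx];
        }
    }
}
```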

Once this correction is accomplished, the next step warps each image into an intermediate buffer representing an equirectangular projection (Figure 12). At this point, if you simply merge all the layers together, you'll end up with a stitched image. This overview won't discuss in detail how to deal with exposure differences between images and across overlap areas. Note, however, that Loom contains seam-finding, exposure compensation and multi-band blending modules, all of which are required in order to obtain a good-quality stitch that is balanced across the camera images and minimizes the visibility of the seams. For multi-band blending, for example, the algorithms expand the internal data to 16 bits, even if the input source is 8 bits, and provide proper padding so everything can run at high speed on the GPU.




Figure 12. Warping the corrected images into equirectangular-sized intermediate buffers (top), then merging the layers together (middle), isn't alone sufficient to deliver high-quality results (bottom) (courtesy AMD).
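As a stand-in for the blending modules described above, here is a deliberately simple feathered (linear-ramp) blend of two already-warped layers across a vertical overlap. Loom's actual pipeline uses seam finding, exposure compensation and multi-band blending with 16-bit internal data, so treat this only as an illustration of the basic "merge the layers" idea.

```cpp
// Minimal feathered-blend sketch for merging two already-warped layers in
// their overlap region: a linear weight ramp from all-left at x0 to all-right
// at x1. Illustrative only; not Loom's multi-band blending.
#include <cstdint>
#include <vector>

void featherBlend(const std::vector<uint8_t>& left,
                  const std::vector<uint8_t>& right,
                  std::vector<uint8_t>& out,
                  int width, int height, int x0, int x1) {
    out.assign(static_cast<size_t>(width) * height, 0);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            size_t i = static_cast<size_t>(y) * width + x;
            if (x < x0)       out[i] = left[i];    // only the left layer valid
            else if (x >= x1) out[i] = right[i];   // only the right layer valid
            else {
                // Wider intermediates keep the weighted sum from losing
                // precision before the final 8-bit rounding.
                int w = ((x - x0) * 256) / (x1 - x0);                 // 0..256
                int v = (left[i] * (256 - w) + right[i] * w + 128) >> 8;
                out[i] = static_cast<uint8_t>(v);
            }
        }
    }
}
```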

Finding Seams

AMD's exploration and evaluation of possible seam-finding algorithms was guided by a number of desirable characteristics, beginning with "high speed" and "parallelizable." AMD chose this prioritization in order to be able to support real-time stitching with many lens types, as well as to be scalable across the lineup of GPUs. Temporal stability is also required, so that a seam would not flicker due to noise or minor motion in the scene at an overlap area. While many of the algorithms in academic literature work well for still images such as panoramas, they aren't as robust with video.

The algorithm picks a path in each overlapping region, and then stays with this same seam for a variable number of frames, periodically re-checking for a possible better seam in conjunction with decreasing the advantage for the original seam over time. Because each 360-degree view may contain many possible seams, not re-computing every seam on every frame significantly reduces the processing load. The complete solution also includes both seam finding and transition blending functions. The more definitive the seam, the less need there is for wide-area blending. Off-line (batch mode) processing supports adjustment of various stitching parameters in order to do more or less processing per seam on each frame.

While this fairly simple algorithm might work fairly well for a relatively static scene, motion across a seam could still be problematic. So AMD's approach also does a "lightweight" check of the pixels in each overlap region, in order to quickly identify significant activity there and flag the affected seams so they can be promptly re-checked. All seams are grouped into priority levels; the highest-priority candidates are re-checked first and queued for re-computation in order to minimize the impact on the system's real-time capabilities. For setups with a small number of cameras, and/or in offline processing scenarios, a user can default all seam candidates to the highest priority.
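A greatly simplified sketch of that kind of bookkeeping appears below: persist the current seam, decay its advantage each frame, and raise a seam's re-check priority when the lightweight activity check flags motion in its overlap. The data structure, decay factor and thresholds are assumptions for illustration, not Loom's implementation.

```cpp
// Greatly simplified sketch of temporal seam bookkeeping. All values and
// structures here are illustrative assumptions only.
#include <vector>

struct SeamState {
    std::vector<int> path;      // current seam path through the overlap
    float advantage = 0.0f;     // bonus favoring the existing seam
    int   priority  = 0;        // higher = re-check sooner
};

void updateSeam(SeamState& s, float newSeamCost, float currentSeamCost,
                bool overlapActivity) {
    s.advantage *= 0.95f;                    // decay the incumbent's edge over time
    if (overlapActivity) s.priority += 10;   // motion crossed the overlap region
    else                 s.priority += 1;    // routine periodic re-check

    // Replace the seam only if the candidate beats it by more than the
    // (decaying) advantage; this keeps seams temporally stable.
    if (newSeamCost + s.advantage < currentSeamCost) {
        s.advantage = currentSeamCost - newSeamCost;   // reset the bonus
        s.priority  = 0;
        // s.path would be replaced with the newly traced path here.
    }
}
```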

How do you find the optimal path for each seam? Many research papers have been published, promoting algorithms such as max-flow min-cut (or graph cut), segmentation, watershed and the like. Graph cut starts by computing a cost for making a cut at each pixel and then finding a path through the region that has the minimum total cost. If you have the perfect "cost function," you'll get great results, of course. But the point is to find a seam that is the least objectionable and remains so over time. In real-time stitching you can't easily account for the future; conversely, in offline mode a seam that remains best over time can be found, given sufficient processing time.

Before you can choose a cost function, you have to understand what it is that you are trying to minimize and maximize (Figure 13). Good starting points are to cut along an edge and to not cross edges. The stronger the edge, the better it is to follow it; conversely, crossing over an edge is worst of all. And cutting right on an edge is better than cutting parallel to, but some distance away from, it, although in some cases you may not have a nearby edge to cut on.

Figure 13. Sometimes, the best seam follows an edge (top left). In other situations, however, the stronger edge ends up with a lower "cost" score (top right). In this case, the theoretically best seam has the worst score, since it's not always on an edge (bottom left). Here, the best score follows an edge (bottom right) (courtesy AMD).

Computing the cost function leverages horizontal and vertical gradients obtained with a 3x3 Sobel operator, producing both phase and magnitude at each pixel (Figure 14).


Figure 14. Seam candidate cost calculations leverage 3x3 Sobel functions (courtesy AMD).
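The sketch below computes those per-pixel ingredients, gradient magnitude and phase from a 3x3 Sobel operator, and combines them into a simple cost that is cheap along strong edges. How Loom actually weights magnitude against phase is not detailed here, so the final cost expression is an assumption.

```cpp
// Per-pixel seam-cost ingredients: 3x3 Sobel gradients converted to magnitude
// and phase. The magnitude-only cost at the end is an illustrative assumption.
#include <cmath>
#include <cstdint>
#include <vector>

struct Gradient { float magnitude, phase; };

// Assumes 1 <= x < width-1 and that y has valid rows above and below it.
Gradient sobelAt(const std::vector<uint8_t>& img, int width, int x, int y) {
    auto p = [&](int xx, int yy) {
        return static_cast<int>(img[static_cast<size_t>(yy) * width + xx]);
    };
    // 3x3 Sobel kernels (gx: horizontal change, gy: vertical change).
    int gx = -p(x-1,y-1) + p(x+1,y-1)
             - 2*p(x-1,y) + 2*p(x+1,y)
             - p(x-1,y+1) + p(x+1,y+1);
    int gy = -p(x-1,y-1) - 2*p(x,y-1) - p(x+1,y-1)
             + p(x-1,y+1) + 2*p(x,y+1) + p(x+1,y+1);
    return { std::sqrt(static_cast<float>(gx*gx + gy*gy)),
             std::atan2(static_cast<float>(gy), static_cast<float>(gx)) };
}

// Example cost: cheap to cut where the gradient is strong (along an edge).
float seamCost(const Gradient& g) {
    return 1.0f / (1.0f + g.magnitude);
}
```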

Graph Cuts

In classical graph cut theory, an “s-t graph” is a mesh of nodes (pixels in this particular case) linked together (Figure 15). S is the starting point (source) and T is the ending point (sink). Each vertical and horizontal link has an associated cost for breaking that connection. However, the academic description may be confusing in this particular implementation, because when considering a vertical seam and a left and right image, S and T are on the left and right images, not the top and bottom of the seam.


Figure 15. An “s-t graph” is a mesh of nodes linked together, with S the starting (source) point and T the ending (sink) point (courtesy AMD).

The graph cut method measures total cost, not average cost. Thus a shorter cut through some high-cost edges may get preference over longer cuts through areas of average cost. Some possible methodology improvements include segmenting the image and finding seams between segments, avoiding areas of high salience, computing on a pyramid of images, preferring cuts in high frequency regions, minimizing average error, etc.

Seam Cut

The algorithm begins by computing a cost, based on the Sobel phase and magnitude, at each pixel location. It then sums the accumulated cost along a path that generally follows the direction of the seam, choosing a vertical, horizontal or diagonal seam direction based on the dimensions of the given seam. At each pixel, it considers three possible directions for the next pixel. For example, in a vertical seam traversed from top to bottom, a pixel has three candidates below it: down-left, down and down-right. The lowest-cost of the three options is taken. Optionally, the algorithm provides "bonus points" if a cut is perpendicular and right next to an edge.

The above process executes in parallel for every pixel row (or column) in the overlap area (minus some boundary pixels). After the algorithm reaches the bottom (in this case) of the seam, it compares all possible paths and picks the one with the lowest overall cost. The final step is to trace back up the path and set the blending weights for the left and right images (Figure 16).
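The procedure described above closely resembles a classic dynamic-programming seam search. The sketch below implements that generic version for a vertical seam, assuming the per-pixel cost grid has already been computed (for example, from the Sobel magnitude and phase); it illustrates the technique rather than reproducing Loom's code.

```cpp
// Dynamic-programming vertical seam search: accumulate cost from top to
// bottom, where each pixel may continue from the pixel above-left, above, or
// above-right, then pick the cheapest bottom pixel and trace the path back up.
#include <algorithm>
#include <limits>
#include <vector>

// costs: row-major [height][width] per-pixel cut costs for one overlap region.
// Returns, for each row, the x coordinate of the chosen seam pixel.
std::vector<int> findVerticalSeam(const std::vector<std::vector<float>>& costs) {
    const int h = static_cast<int>(costs.size());
    const int w = static_cast<int>(costs[0].size());
    std::vector<std::vector<float>> acc = costs;                      // accumulated cost
    std::vector<std::vector<int>>   from(h, std::vector<int>(w, 0));  // back-pointers

    for (int y = 1; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float best = std::numeric_limits<float>::max();
            int bestX = x;
            for (int dx = -1; dx <= 1; ++dx) {          // up-left, up, up-right
                int px = x + dx;
                if (px < 0 || px >= w) continue;
                if (acc[y - 1][px] < best) { best = acc[y - 1][px]; bestX = px; }
            }
            acc[y][x] += best;
            from[y][x] = bestX;
        }
    }

    // Cheapest endpoint on the bottom row, then trace back up.
    int x = static_cast<int>(std::min_element(acc[h - 1].begin(), acc[h - 1].end())
                             - acc[h - 1].begin());
    std::vector<int> seam(h);
    for (int y = h - 1; y >= 0; --y) {
        seam[y] = x;
        x = from[y][x];
    }
    return seam;
}
```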



Figure 16. The two overlapping source images were taken from different angles (upper left and right). An elementary stitch of them produces sub-par results (middle). Stitching by means of a generated seam produces superior output (courtesy AMD).

Mike Schmit
Director of Software Engineering, Radeon Technologies Group, AMD

Surround View in Consumer Video Capture Systems

AMD's Radeon Loom, as previously noted, focuses its attention on ultra-high-resolution (4K and 8K), high-frame rate and otherwise high-quality professional applications, and is PC-based. Many of the concepts explained in AMD's essay, however, are equally applicable to more deeply embedded system designs, potentially at standard HD video resolutions, with more modest frame rates and more mainstream quality expectations. See, for example, the following presentation, "Designing a Consumer Panoramic Camcorder Using Embedded Vision," delivered by CENTR (subsequently acquired by Amazon):

Lucid VR, a member of the Embedded Vision Alliance and an award winner at the 2017 Embedded Vision Summit's Vision Tank competition, has developed a consumer-targeted stereoscopic camera that captures the world the way human eyes see it, with true depth and a 180-degree field of view. When viewed within a virtual reality headset such as the Oculus Rift or a smartphone-based Google Daydream, the image surrounds the user, creating complete immersion. Here's a recent demonstration from the company:

Conclusion

Vision technology is enabling a wide range of products that are more intelligent and responsive than before, and thus more valuable to users. Vision processing can add valuable capabilities to existing products, and can provide significant new markets for hardware, software and semiconductor suppliers. High-quality surround view "stitching" of both still images and video streams via computer vision processing can not only create compelling immersive content directly viewable on VR headsets and other platforms, but can also generate valuable visual information for downstream computer vision algorithms in autonomous vehicles and other applications. By carefully selecting and optimizing both the "stitching" algorithms and the processing architecture(s) that run them, surround view functionality can be cost-effectively and efficiently incorporated into a diversity of products. And an industry association, the Embedded Vision Alliance, is also available to help product creators optimally implement surround view capabilities in their resource-constrained hardware and software designs.

By Brian Dipert
Editor-in-Chief, Embedded Vision Alliance

Sidebar: Additional Developer Assistance

The Embedded Vision Alliance, a worldwide organization of technology developers and providers, is working to empower product creators to transform the potential of vision processing into reality. AMD and videantis, co-authors of this article, are members of the Embedded Vision Alliance. The Embedded Vision Alliance's mission is to provide product creators with practical education, information and insights to help them incorporate vision capabilities into new and existing products. To execute this mission, the Embedded Vision Alliance maintains a website providing tutorial articles, videos, code downloads and a discussion forum staffed by technology experts. Registered website users can also receive the Embedded Vision Alliance’s twice-monthly email newsletter, Embedded Vision Insights, among other benefits.

The Embedded Vision Alliance’s annual technical conference and trade show, the Embedded Vision Summit, is coming up May 22-24, 2018 in Santa Clara, California.  Intended for product creators interested in incorporating visual intelligence into electronic systems and software, the Embedded Vision Summit provides how-to presentations, inspiring keynote talks, demonstrations, and opportunities to interact with technical experts from Embedded Vision Alliance member companies. The Embedded Vision Summit is intended to inspire attendees' imaginations about the potential applications for practical computer vision technology through exciting presentations and demonstrations, to offer practical know-how for attendees to help them incorporate vision capabilities into their hardware and software products, and to provide opportunities for attendees to meet and talk with leading vision technology companies and learn about their offerings.  More information, along with online registration, is now available.

The Embedded Vision Alliance also offers a free online training facility for vision-based product creators: the Embedded Vision Academy. This area of the Embedded Vision Alliance website provides in-depth technical training and other resources to help product creators integrate visual intelligence into next-generation software and systems. Course material in the Embedded Vision Academy spans a wide range of vision-related subjects, from basic vision algorithms to image pre-processing, image sensor interfaces, and software development techniques and tools such as OpenCL, OpenVX and OpenCV, along with Caffe, TensorFlow and other deep learning frameworks. Access is free to all through a simple registration process.

The Embedded Vision Alliance and its member companies periodically deliver webinars on a variety of technical topics, including various deep learning subjects. Access to on-demand archive webinars, along with information about upcoming live webinars, is available on the Alliance website. Also, the Embedded Vision Alliance is offering "Deep Learning for Computer Vision with TensorFlow," a series of both one- and three-day technical training classes planned for a variety of both U.S. and international locations. See the Alliance website for additional information and online registration.

Sidebar: Radeon Loom: A Historical Perspective

The following essay from AMD explains how the company came up with its "Radeon Loom" project name, in the process providing a history lesson on human beings' longstanding desires to engross themselves with immersive media.

People seemingly have an innate need to immerse themselves in 360-degree images, stories and experiences, and such desires are not unique to current generations. In fact, it's possible to trace this yearning for recording history, and for educating and entertaining ourselves through it, from modern-day IMAX, VR and AR (augmented reality) experiences all the way back to ancient cave paintings. To put today's technology in perspective, here's a sampling of immersive content and devices from the last 200+ years:

Several interesting connections exist between the historical loom, an apparatus for making fabric, and modern computers (and AMD’s software running on them). The first and most obvious linkage is the fact that looms are multi-threaded machines, capable of being fed by thousands of threads to create beautiful fabrics and images on them. Radeon GPUs also run thousands of threads (of code instructions this time), and also produce stunning images.

Of particular interest is the Jacquard Loom, invented in France in 1801. Joseph Jacquard didn't invent the original loom; as a child he actually worked in his parents' factory as a draw-boy, as did many children of the time. Draw-boys, directed by the master weaver, manipulated the warp threads one by one, a job much more easily tackled by children's small hands. Unfortunately, it also required them to be perched high up on the loom, in a dangerous position.

Jacquard's experience later motivated him as an adult to develop an automated punch card mechanism for the loom, thereby eliminating his childhood job. The series of punched cards controlled the intricate patterns being woven. And a few decades later, Charles Babbage intended to use the same conceptual punch card system on his (never-built) Analytical Engine, which was an ancestor of modern day computing hardware and software.

When Napoleon observed the Jacquard Loom in action, he granted a patent for it to the city of Lyon, essentially open-sourcing the design. This grant was an effort to help expand the French textile industry, especially for highly desirable fine silk fabrics. And as industry productivity consequently increased, from a few square inches per day to a square yard or two per day, what did the master weavers do with their newfound time? They now could devote more focus on creative endeavors, producing designs with new patterns and colors every year, creations which they then needed to market and convince people to try. Today this is called the "fashion" industry, somewhat removed from its elementary fabric-weaving origins.

Solving Intelligence, Vision and Connectivity Challenges at the Edge with ECP5 FPGAs

This article was originally published at Lattice Semiconductor's website. It is reprinted here with the permission of Lattice Semiconductor.

Seeing Clearer – Driving Toward Better Cameras for Safer Vehicles

This article was originally published by Dave Tokic of Alliance member company Algolux. It is reprinted here with Tokic's permission.

Fundamentals of Image Processing Systems

This article was originally published at Basler's website. It is reprinted here with the permission of Basler.

What do image processing systems have to do with keeping foodstuffs in good shape?

Machine Learning’s Fragmentation Problem — and the Solution from Khronos

This blog post was originally published at Alliance partner organization Khronos' website. It is reprinted here with the permission of the Khronos Group.

There is a wide range of open-source deep learning training frameworks available today, offering researchers and designers plenty of choice when they are setting up their projects. Caffe, TensorFlow, Chainer, Theano, Caffe2: the list goes on and is getting longer all the time.

Visual Ventures with Chris Rowen

This article was originally published as a two-part blog series at Cadence's website. It is reprinted here with the permission of Cadence.


"Even if he gives the same presentation two weeks apart, it will be different.”
—Neil Robinson, fellow attendee, on Chris Rowen