Embedded Vision Alliance: Technical Articles

Event-based Sensing Enables a New Generation of Machine Vision Solutions

This article excerpt is published in full form at Prophesee's website. It is reprinted here with the permission of Prophesee.

Event-based sensing is a new paradigm in imaging technology inspired by human biology. It promises to enable a smarter and safer world by improving the ability of machines to sense their environments and make intelligent decisions about what they see.

Camera Selection – How Can I Find the Right Camera for My Image Processing System?

This article was originally published at Basler's website. It is reprinted here with the permission of Basler.

Lost in the Jungle of Options?

Faced with the challenge of designing an image processing system, you may find yourself in a veritable jungle of options, amidst a dizzying range of camera models, relevant properties, helpful features and potential applications.

Speeding Up Semantic Segmentation Using MATLAB Container from NVIDIA NGC

This article was originally published at NVIDIA's website. It is reprinted here with the permission of NVIDIA.

Gone are the days of using a single GPU to train a deep learning model. With computationally intensive algorithms such as semantic segmentation, a single GPU can take days to optimize a model. But multi-GPU hardware is expensive, you say. Not any longer: NVIDIA multi-GPU hardware on cloud instances such as the AWS P3 allows you to pay for only what you use. Cloud instances let you take advantage of the latest generation of hardware with support for Tensor Cores, enabling significant performance boosts with modest investments. You may have heard that setting up a cloud instance is difficult, but NVIDIA NGC makes life much easier. NGC is the hub of GPU-optimized software for deep learning, machine learning, and HPC. NGC takes care of all the plumbing so developers and data scientists can focus on generating actionable insights.

This post walks through the easiest path to speeding up semantic segmentation by using NVIDIA GPUs on a cloud instance with the MATLAB container for deep learning available from NGC. First, we will explain semantic segmentation. Next, we will show performance results for a semantic segmentation model trained in MATLAB on two different P3 instances using the MATLAB R2018b container available from NGC. Finally, we’ll cover a few tricks in MATLAB that make it easy to perform deep learning and help manage memory use.

What is Semantic Segmentation?

The semantic segmentation algorithm for deep learning assigns a label or category to every pixel in an image. This dense approach to recognition provides critical capabilities compared to traditional bounding-box approaches in some applications. In automated driving, it’s the difference between a generalized area labeled “road” and an exact, pixel-level determination of the drivable surface of the road. In medical imaging, it means the difference between labeling a rectangular region as a “cancer cell” and knowing the exact shape and size of the cell.

Figure 1. Example of an image with semantic labels for every pixel
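
To make the contrast with bounding boxes concrete, here is a minimal sketch in base MATLAB. The tiny label map and its two classes are made up for illustration and are not taken from CamVid. It compares the area of the tightest bounding box around a "road" region with the number of pixels that are actually road.

```matlab
% Hypothetical 6x8 label map: 1 = background, 2 = road (illustrative values only).
labels = ones(6, 8);
labels(4, 3:6) = 2;      % narrow strip of road
labels(5, 2:7) = 2;      % wider strip
labels(6, 1:8) = 2;      % full-width strip at the bottom

roadMask = (labels == 2);                 % pixel-level answer: the exact shape
[rows, cols] = find(roadMask);

% Bounding-box answer: the smallest rectangle that contains every road pixel.
boxArea  = (max(rows) - min(rows) + 1) * (max(cols) - min(cols) + 1);
roadArea = nnz(roadMask);                 % pixels actually labeled as road

fprintf('Bounding box covers %d pixels, %d of which are road.\n', boxArea, roadArea);
```

In this toy case a quarter of the box is not road at all, which is the kind of error a pixel-level labeling avoids.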

We tested semantic segmentation using MATLAB to train a SegNet model, which has an encoder-decoder architecture with four encoder layers and four decoder layers. The dataset associated with this model is the CamVid dataset, a driving dataset with each pixel labeled with a semantic class (for example, sky, road, or vehicle). Unlike the original paper, we used stochastic gradient descent for training and pre-initialized the layers and weights from a pretrained VGG-16 model.

Performance Testing Using MATLAB on P3 Instances with NVIDIA GPUs

While semantic segmentation can be effective, it comes at a significant computational and memory cost. We ran our tests using AWS P3 instances with the MATLAB container available from NGC. Use of the container requires an AWS account and a valid MATLAB license. You can obtain a free trial MATLAB license for cloud use. MathWorks provides directions on how to set up the MATLAB container on AWS.

The original SegNet implementation in 2015 took about a week to run on the single Tesla K40 used by the authors, as mentioned in the original paper. Below is a plot of the semantic segmentation network training process in MATLAB using a single V100 NVIDIA GPU on a p3.2xlarge instance. Figure 2 shows it took about 121 minutes, which is much faster than in the original paper.

Figure 2. Training Progress for SegNet in MATLAB on a single V100 NVIDIA GPU

Next, we performed the same test using the eight V100 NVIDIA GPUs available on a p3.16xlarge instance. The only change required in the MATLAB code was setting the training option ExecutionEnvironment to multi-gpu. Figure 3 shows that training now took 37 minutes, 3.25x faster than on the p3.2xlarge instance.

Figure 3. Training Progress for SegNet in MATLAB on eight V100 NVIDIA GPUs

This 3.25x improvement in performance shows the power of the latest NVIDIA multi-GPU hardware with Tensor Cores, bringing what originally took “about a week” down to 37 minutes. I bet the SegNet authors wish they had this hardware when they were developing their algorithm!
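
The arithmetic behind these figures can be checked with a short sketch. It uses only the minute counts quoted above, which are rounded, so the computed speedup (about 3.27x) differs slightly from the 3.25x reported; presumably the reported value was derived from unrounded times. The scaling-efficiency figure is our own derived quantity, not a number from the article.

```matlab
% Training times in minutes, as quoted in the text (rounded to the minute).
tSingleGPU = 121;    % p3.2xlarge, one V100
tEightGPU  = 37;     % p3.16xlarge, eight V100s
numGPUs    = 8;

speedup    = tSingleGPU / tEightGPU;      % wall-clock speedup
efficiency = speedup / numGPUs;           % fraction of ideal linear scaling

% "About a week" on the original Tesla K40, expressed in minutes.
tOriginal   = 7 * 24 * 60;
overallGain = tOriginal / tEightGPU;

fprintf('Speedup: %.2fx, scaling efficiency: %.0f%%, versus one week: about %.0fx\n', ...
    speedup, 100 * efficiency, overallGain);
```

The efficiency below 100% is a reminder that eight GPUs do not yield an eightfold speedup here; communication and the small per-GPU batch both take their share.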

Making Deep Learning Easier with MATLAB

Now let’s dive into why you should use MATLAB for developing deep learning algorithms such as semantic segmentation. MATLAB includes many useful tools and commands to make it easier to perform deep learning. One of the most useful MATLAB commands is imageDatastore, which allows you to efficiently manage a large collection of images. The command creates a datastore that lets you work with the entire dataset as a single object without loading every image into memory at once. The MiniBatchSize training option is particularly critical for semantic segmentation, determining how many images are used in each iteration. The default value of 256 consumes too much memory for semantic segmentation, so we set the value to 4.
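
A rough back-of-the-envelope estimate shows why the mini-batch size matters so much. The sketch below is an illustration under stated assumptions, not a measurement: it assumes 360 x 480 inputs (the size used in the code later in this article), single-precision values, and a single full-resolution layer with 64 feature maps, as in the first block of VGG-16.

```matlab
% Rough activation-memory estimate for ONE full-resolution 64-channel layer.
% Assumptions: 360x480 inputs, 4 bytes per value, 64 feature maps.
imageSize     = [360 480];
numChannels   = 64;
bytesPerValue = 4;

bytesPerImage = prod(imageSize) * numChannels * bytesPerValue;

for miniBatchSize = [4 256]
    gib = miniBatchSize * bytesPerImage / 2^30;
    fprintf('MiniBatchSize %3d: %.2f GiB for one such layer\n', miniBatchSize, gib);
end
```

With 256 images, one such layer alone needs roughly 10.5 GiB, and a real network keeps many activation tensors and their gradients alive at once, so a 16 GB V100 would be exhausted quickly. With 4 images the same layer needs well under 1 GiB.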

Data augmentation, provided by the imageDataAugmenter function, is another powerful deep learning capability in MATLAB. Data augmentation extends datasets by providing more examples to the network using translations, rotations, reflections, scaling, cropping, and more, which helps improve model accuracy. Our example uses random left/right reflections and X/Y translations of +/- 10 pixels. The augmenter is combined with the training data in a pixelLabelImageDatastore, so that the operations occur at the time of each iteration and avoid unnecessary copies of the dataset.
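
For semantic segmentation, the key point is that every geometric transform must be applied identically to the image and to its label map. The following base-MATLAB sketch illustrates that idea on synthetic data. It is not the toolbox implementation: it uses integer shifts, zero fill, and a made-up 32 x 32 "image" whose labels are derived from a simple threshold.

```matlab
% Sketch of paired augmentation in base MATLAB (not the toolbox implementation).
rng(0);                                   % reproducible random draws
img    = reshape(1:32*32, 32, 32);        % synthetic stand-in for an image
labels = double(img > 512);               % synthetic per-pixel labels
pair   = cat(3, img, labels);             % stack so both get the same transform

if rand > 0.5                             % random left/right reflection
    pair = flip(pair, 2);
end
dx = randi([-10 10]);                     % random X translation in pixels
dy = randi([-10 10]);                     % random Y translation in pixels

% Shift by (dy, dx); pixels moved out of view are dropped, gaps are zero-filled.
[h, w, ~] = size(pair);
srcRows = max(1, 1 - dy) : min(h, h - dy);
srcCols = max(1, 1 - dx) : min(w, w - dx);
shifted = zeros(size(pair));
shifted(srcRows + dy, srcCols + dx, :) = pair(srcRows, srcCols, :);

augImg = shifted(:, :, 1);
augLab = shifted(:, :, 2);
```

Because both arrays travel through the same flip and shift, every surviving pixel in augImg still carries its original label in augLab.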

How We Performed Semantic Segmentation in MATLAB

This section covers key parts of the code we used for the test above. The complete MATLAB code used in this test is available here. The single line at the end is where the training occurs.

The following code downloads the dataset and unzips it on your local machine.

imageURL = 'http://web4.cs.ucl.ac.uk/staff/g.brostow/MotionSegRecData/files/701_StillsRaw_full.zip';
labelURL = 'http://web4.cs.ucl.ac.uk/staff/g.brostow/MotionSegRecData/data/LabeledApproved_full.zip';

outputFolder = fullfile(tempdir,'CamVid');

if ~exist(outputFolder, 'dir')
     % Create the folder first so websave has somewhere to write.
     mkdir(outputFolder);
     labelsZip = fullfile(outputFolder,'labels.zip');
     imagesZip = fullfile(outputFolder,'images.zip');

     disp('Downloading 16 MB CamVid dataset labels...');
     websave(labelsZip, labelURL);
     unzip(labelsZip, fullfile(outputFolder,'labels'));

     disp('Downloading 557 MB CamVid dataset images...');
     websave(imagesZip, imageURL);
     unzip(imagesZip, fullfile(outputFolder,'images'));
end

This code makes a temporary folder to unzip the files on your instance. When using the container, these files will be lost once the container shuts down. If you want to maintain a consistent location for your data, you should change the code to use an S3 bucket or some other permanent location.

The code below shows 11 classes used from the CamVid dataset to train the semantic segmentation network.  In the original dataset, there are 32 classes.

imgDir = fullfile(outputFolder,'images','701_StillsRaw_full');
imds = imageDatastore(imgDir);
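% The 11 classes used to train the network
classes = ["Sky"; "Building"; "Pole"; "Road"; "Pavement"; "Tree"; ...
     "SignSymbol"; "Fence"; "Car"; "Pedestrian"; "Bicyclist"];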

labelIDs = camvidPixelLabelIDs();
labelDir = fullfile(outputFolder,'labels');
pxds = pixelLabelDatastore(labelDir,classes,labelIDs);

In this next section, we resize the CamVid data to the resolution of the SegNet and partition the dataset into training and testing sets.

imageFolder = fullfile(outputFolder,'imagesResized',filesep);
imds = resizeCamVidImages(imds,imageFolder);
labelFolder = fullfile(outputFolder,'labelsResized',filesep);
pxds = resizeCamVidPixelLabels(pxds,labelFolder);

[imdsTrain,imdsTest,pxdsTrain,pxdsTest] = partitionCamVidData(imds,pxds);
numTrainingImages = numel(imdsTrain.Files)
numTestingImages = numel(imdsTest.Files)
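
The partitioning helper is not listed in this article. As a rough idea of what such a helper might do, the sketch below performs a seeded random split of file indices. The 60/40 ratio is an assumption chosen for illustration, and the file count of 701 is taken from the name of the image archive above.

```matlab
% Hypothetical stand-in for a partitioning helper: a seeded random index split.
rng(0);                        % fixed seed so the split is reproducible
numFiles      = 701;           % still images, per the archive name
trainFraction = 0.6;           % assumed ratio, for illustration only

shuffled = randperm(numFiles);
numTrain = round(trainFraction * numFiles);
trainIdx = shuffled(1:numTrain);
testIdx  = shuffled(numTrain+1:end);

fprintf('%d training files, %d test files\n', numel(trainIdx), numel(testIdx));
```

Whatever the ratio, the same index vectors have to be applied to both the image file list and the label file list so that each image stays paired with its labels.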

Now let’s create a SegNet network. We initialize it with VGG-16 weights and then replace the pixel classification layer with one that uses class weights to compensate for the imbalance between classes.

imageSize = [360 480 3];
numClasses = numel(classes);
lgraph = segnetLayers(imageSize,numClasses,'vgg16');

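% Count labeled pixels per class; tbl supplies Name, PixelCount, and ImagePixelCount.
tbl = countEachLabel(pxds);
imageFreq = tbl.PixelCount ./ tbl.ImagePixelCount;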
classWeights = median(imageFreq) ./ imageFreq;
pxLayer = pixelClassificationLayer('Name','labels','ClassNames',tbl.Name,'ClassWeights',classWeights);
lgraph = removeLayers(lgraph,'pixelLabels');
lgraph = addLayers(lgraph, pxLayer);
lgraph = connectLayers(lgraph,'softmax','labels');
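
The class-weight computation above is median frequency balancing: each class weight is the median class frequency divided by that class's own frequency. A small worked example with made-up counts (not the real CamVid statistics) shows the effect.

```matlab
% Hypothetical per-class statistics (not the real CamVid counts).
names           = ["Road"; "Car"; "Pedestrian"];
pixelCount      = [4.0e7; 5.0e6; 5.0e5];     % pixels labeled with each class
imagePixelCount = [1.0e8; 1.0e8; 5.0e7];     % total pixels in images containing the class

imageFreq    = pixelCount ./ imagePixelCount;       % 0.40, 0.05, 0.01
classWeights = median(imageFreq) ./ imageFreq;      % median frequency is 0.05

disp(table(names, imageFreq, classWeights));
```

The dominant class is scaled down (weight 0.125), the median class keeps a weight of 1, and the rare class is boosted (weight 5), so the loss is not overwhelmed by road and sky pixels.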

Next, select the training options. The MiniBatchSize parameter is particularly critical for semantic segmentation, determining how many images are used in each iteration. The default value of 256 requires too much memory for semantic segmentation, so we set the value to 4 per GPU. The ExecutionEnvironment option is set to multi-gpu to use multiple V100 NVIDIA GPUs as found on the p3.16xlarge instance. Check out the documentation for more details on the training options.

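options = trainingOptions('sgdm', ...
     'Momentum',0.9, ...
     'InitialLearnRate',1e-3, ...
     'L2Regularization',0.0005, ...
     'MaxEpochs',100, ...
     'MiniBatchSize',4 * gpuDeviceCount, ...
     'Shuffle','every-epoch', ...
     'Plots','training-progress', ...       % show the training-progress plot window
     'ExecutionEnvironment','multi-gpu');   % train on all available GPUs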

Another powerful capability in MATLAB for deep learning is imageDataAugmenter, which provides more examples to the network and helps improve accuracy. This example uses random left/right reflections and X/Y translations of +/- 10 pixels. The augmenter is combined with the training data in a pixelLabelImageDatastore, so that the operations occur at the time of each iteration and avoid unnecessary copies of the dataset.

augmenter = imageDataAugmenter('RandXReflection',true,...
     'RandXTranslation',[-10 10],'RandYTranslation',[-10 10]);
pximds = pixelLabelImageDatastore(imdsTrain,pxdsTrain,...
     'DataAugmentation',augmenter);

Now we can start training. The next line of code takes about 37 minutes to run on the p3.16xlarge instance. The training-progress plot window reports the elapsed training time. Refer back to Figures 2 and 3 to see the measured time on the p3.2xlarge and p3.16xlarge instances.

[net, info] = trainNetwork(pximds,lgraph,options);


MATLAB makes it easy for engineers to train deep-learning models that can take advantage of NVIDIA GPUs for accelerating the training process. With MATLAB, switching from training on a single-GPU machine to a multi-GPU machine takes just a single line of code: the ExecutionEnvironment training option described above. We showed how you can speed up deep learning applications by training neural networks in the MATLAB Deep Learning Container from NGC, which is designed to take full advantage of high-performance NVIDIA® GPUs. As a next step, download the code and try it yourself in MATLAB on an AWS P3 instance.

Bruce Tannenbaum
Manager of Technical Marketing for Vision, AI, and IoT Applications, MathWorks

Arvind Jayaraman
Senior Pilot Engineer, MathWorks

Bringing it All into Focus: Finding the Right Lens for Your Camera

This article was originally published at Basler's website. It is reprinted here with the permission of Basler.

Intel’s Recommendations for the U.S. National Strategy on Artificial Intelligence

This article was originally published at Intel's website. It is reprinted here with the permission of Intel.

Improving TensorFlow Inference Performance on Intel Xeon Processors

This article was originally published at Intel's website. It is reprinted here with the permission of Intel.

The Four Key Trends Driving the Proliferation of Visual Perception

A version of this article was previously published in the February/March edition of Imaging and Machine Vision Europe. It is reprinted here with the permission of Imaging and Machine Vision Europe.

High-Sensitivity Image Processing Cameras

This article was originally published at Basler's website. It is reprinted here with the permission of Basler.

High-sensitivity image processing cameras are essential for achieving a quality video image with low image noise even in poor lighting conditions.

Harnessing the Power of AI: An Easy Start with Lattice’s sensAI

This article was originally published at Lattice Semiconductor's website. It is reprinted here with the permission of Lattice Semiconductor.

Artificial intelligence, or AI, is everywhere. It’s a revolutionary technology that is slowly pervading more industries than you can imagine. It seems that every company, no matter what their business, needs to have some kind of AI story. In particular, you see AI seriously pursued for applications like self-driving automobiles, the Internet of Things (IoT), network security, and medicine. Company visionaries are expected to have a good understanding of how AI can be applied to their businesses, and success by early adopters will force holdouts into the fray.

Not all AI is the same, however, and different application categories require different AI approaches. The application class that appears to have gotten the most traction so far is embedded vision. AI for this category makes use of so-called convolutional neural networks, or CNNs, which attempt to mimic the way that the biological eye is believed to operate. We will focus on vision in this AI whitepaper, even though many of the concepts will apply to other applications as well.

AI Edge Requirements

AI involves the creation of a trained model of how something works. That model is then used to make inferences about the real world when deployed in an application. This gives an AI application two major life phases: training and inference.

Training is done during development, typically in the cloud. Inference, on the other hand, is required by deployed devices as an ongoing activity. Because inference can also be a computationally difficult problem, much of it is currently done in the cloud. But there is often little time to make decisions. Sending data to the cloud and then waiting until a decision arrives back can take time – and by then, it may be too late. Making that decision locally can save precious seconds.

This need for real-time control applies to many application areas where decisions are needed quickly. Many such examples detect human presence:

Other always-on applications include:

Because of this need for quick decisions, there is a strong move underway to take inference out of the cloud and implement it at the “edge” – that is, in the devices that gather data and then take action based on the AI decisions. This takes the delays inherent in the cloud out of the picture.

There are two other benefits to local inference. The first is privacy. Data en route to and from the cloud, and data stored in the cloud, are subject to hacking and theft. If the data never leaves the equipment, then there is far less opportunity for mischief.

The other benefit relates to the bandwidth available in the internet. Sending video up to the cloud for real-time interpretation chews up an enormous amount of bandwidth. Making the decisions locally frees that bandwidth up for all of the other demanding uses.
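
A quick calculation gives a feel for the bandwidth at stake. The camera parameters below are hypothetical (a 640 x 480 RGB stream at 30 fps, uncompressed), as is the assumption that an edge device would report a one-byte decision per frame; real systems would compress the video, but the gap remains several orders of magnitude.

```matlab
% Hypothetical camera: 640x480 RGB at 30 fps, uncompressed.
width = 640; height = 480; bytesPerPixel = 3; fps = 30;
rawMbps = width * height * bytesPerPixel * fps * 8 / 1e6;    % megabits per second

% Edge alternative (also an assumption): send a one-byte decision per frame.
decisionMbps = 1 * fps * 8 / 1e6;

fprintf('Raw video: %.1f Mbit/s, decisions only: %.5f Mbit/s\n', rawMbps, decisionMbps);
```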

In addition:

  • Many such devices are powered by a battery – or, if they are mains-supplied, have heat constraints that limit how much power is sustainable. In the cloud, it’s the facility’s responsibility to manage power and cooling.
  • AI models are evolving rapidly. Between the beginning and end of training, the size of the model may change dramatically, and the size of the required computing platform may not be well understood until well into the development process. In addition, small changes to the training can have a significant impact on the model, adding yet more variability. All of this makes it a challenge to size the hardware in the edge device appropriately.
  • There will always be tradeoffs during the process of optimizing the models for your specific device. That means that a model might operate differently in different pieces of equipment.
  • Finally, edge devices are often very small. This limits the size of any devices used for AI inference.

All of this leads to the following important requirements for inference at the edge. Engines for making AI inference at the edge must:

  • Consume very little power
  • Be very flexible
  • Be very scalable
  • Have a small physical footprint

Lattice’s sensAI offering lets you develop engines with precisely these four characteristics. It does so by including a hardware platform, soft IP, a neural-net compiler, development modules, and resources that will help get the design right quickly.

Inference Engine Options

There are two aspects to building an inference engine into an edge device: developing the hardware platform that will host the execution of the model, and developing the model itself.

Execution of a model can, in theory, take place on many different architectures. But execution at the edge, taking into account the power, flexibility, and scalability requirements above, limits the choices – particularly for always-on applications.


The most common way of handling AI models is by using a processor. That may be a GPU or a DSP, or it may be a microcontroller. But the processors in edge devices may not be up to the challenge of executing even simple models; such a device may have only a low-end microcontroller (MCU) available. Using a larger processor may violate the power and cost requirements of the device, so it might seem like AI would be out of reach for such devices.

This is where low-power FPGAs can play an important role. Rather than beefing up a processor to handle the algorithms, a Lattice ECP5 or UltraPlus FPGA can act as a co-processor to the MCU, providing the heavy lifting that the MCU can’t handle while keeping power within the required range. Because these Lattice FPGAs can implement DSPs, they provide computing power not available in a low-end MCU.

Figure 1. FPGA as a Co-Processor to MCU


For AI models that are more mature and will sell in high volumes, ASICs or application-specific standard products (ASSPs) may be appropriate. But, because of their activity load, they will consume too much power for an always-on application.

Here Lattice FPGAs can act as activity gates, handling wake-up activities involving wake words or recognition of some broad class of video image (like identifying something that looks like it might be a person) before waking up the ASIC or ASSP to complete the task of identifying more speech or confirming with high confidence that an artifact in a video is indeed a person (or even a specific person).

The FPGA handles the always-on part, where power is most critical. While not all FPGAs can handle this role, since many of them still consume too much power, Lattice’s ECP5 and UltraPlus FPGAs have the power characteristics necessary for this role.

Figure 2. FPGA as activity gate to ASIC/ASSP

Stand-Alone FPGA AI Engines

Finally, low-power FPGAs can act as stand-alone, integrated AI engines. The DSPs available in the FPGAs take the starring role here. Even if an edge device has no other computing resources, AI capabilities can be added without breaking the power, cost, or board-area budgets. And they have the flexibility and scalability necessary for rapidly evolving algorithms.

Figure 3. Stand-alone, Integrated FPGA Solution

Building an Inference Engine in a Lattice FPGA

Designing hardware that will execute an AI inference model is an exercise in balancing the number of resources needed against performance and power requirements. Lattice’s ECP5 and UltraPlus families provide this balance.

The ECP5 family has three members of differing sizes that can host from one to eight inference engines. They contain anywhere from 1 Mb to 3.7 Mb of local memory, consume up to 1 W of power, and have a 100 mm² footprint.

The UltraPlus family, by contrast, has power levels as low as one thousandth the power of the ECP5 family, at 1 mW. Consuming a mere 5.5 mm² of board area, it contains up to eight multipliers and up to 1 Mb of local memory.

Lattice also provides CNN IP designed to operate efficiently on these devices. For the ECP5 family, Lattice has a CNN Accelerator.

Figure 4. CNN Accelerator for the ECP5 family

For the UltraPlus family, Lattice provides a CNN Compact Accelerator.

Figure 5. Compact CNN Accelerator for the UltraPlus family

We won’t dive into the details here; the main point is that you don’t have to design your own engine from scratch. Much more information is available from Lattice regarding these pieces of IP.

Finally, you can run examples like this and test them out on development modules, with one for each device family. The Himax HM01B0 UPduino shield uses an UltraPlus device and requires 22 mm x 50 mm of space. The Embedded Vision Development Kit uses an ECP5 device and occupies 80 mm x 80 mm of space.

Figure 6. Development modules for evaluation of AI application

Given an FPGA, soft IP, and all of the other hardware details needed to move data around, the platform can be compiled using Lattice’s Diamond design tools in order to generate the bitstream that will configure the FPGAs at each power-up in the targeted equipment.

Building the Inference Model in a Lattice FPGA

Creating an inference model is very different from creating the underlying execution platform. It’s more abstract and mathematical, involving no RTL design. There are two main steps: creating the abstract model and then optimizing the model implementation for your chosen platform.

Model training takes place on any of several frameworks designed specifically for this process. The two best-known frameworks are Caffe and TensorFlow, but there are others as well.

A CNN consists of a series of layers – convolution layers, along with possible pooling and fully connected layers – each of which has nodes that are fed by the result of the prior layer. Each of those results is weighted at each node, and it is the training process that decides what the weights should be.
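
The weighted sum at each node can be illustrated with a few lines of base MATLAB. This is a minimal sketch with made-up numbers: the 5 x 5 input and the 3 x 3 kernel are arbitrary, whereas in a real network the kernel values are the weights that training determines.

```matlab
% One convolution layer output: a weighted sum over each 3x3 neighborhood.
input   = magic(5);                       % stand-in for a 5x5 feature map
weights = [0 1 0; 1 -4 1; 0 1 0];         % hypothetical 3x3 kernel

% CNN frameworks typically compute cross-correlation, so rotate the kernel
% by 180 degrees before calling conv2 (a no-op for this symmetric kernel).
featureMap = conv2(input, rot90(weights, 2), 'valid');   % 3x3 result

% Simple nonlinearity (ReLU) applied to every node of the layer.
activated = max(featureMap, 0);

% Check one node by hand: multiply the top-left 3x3 patch by the weights and sum.
manual = sum(sum(input(1:3, 1:3) .* weights));
fprintf('Node (1,1): conv2 gives %d, manual weighted sum gives %d\n', ...
    featureMap(1, 1), manual);
```

Each output value is fed by a small patch of the prior layer, which is why the multiplier count of the target device matters so much for convolution performance.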

The weights output by the training frameworks are typically floating-point numbers. This is the most precise embodiment of the weights – and yet most edge devices aren’t equipped with floating-point capabilities. This is where we need to take this abstract model and optimize it for a specific platform – a job handled by Lattice’s Neural Network Compiler.

The Compiler allows you to load and review the original model as downloaded from one of the CNN frameworks. You can run performance analysis, which is important for what is likely the most critical aspect of model optimization: quantization.

Because we can’t deal with floating-point numbers, we have to convert them to integers. That means that we will lose some accuracy simply by virtue of rounding off floating-point numbers. The question is, what integer precision is needed to achieve the accuracy you want? 16 bits is usually the highest precision used, but weights – and inputs – may be expressed as smaller integers. Lattice currently supports 16-, 8-, and 1-bit implementations. 1-bit designs are actually trained in the single-bit integer domain to maintain accuracy. Clearly, smaller data units mean higher performance, smaller hardware, and, critically, lower power. But make the precision too low, and you won’t have the accuracy required to faithfully infer the objects in a field of view.

Figure 7. A single model can be optimized differently for different equipment
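
The precision tradeoff can be seen in a generic quantization sketch. This is not how Lattice's compiler works internally, which this article does not describe; it is a common symmetric linear scheme applied to randomly generated stand-in weights, shown for 16 and 8 bits.

```matlab
% Symmetric linear quantization of hypothetical floating-point weights.
rng(0);
weights = 0.5 * randn(1000, 1);           % stand-in for trained weights

for numBits = [16 8]
    qMax     = 2^(numBits - 1) - 1;       % largest positive integer code
    scale    = max(abs(weights)) / qMax;  % real value of one integer step
    q        = round(weights / scale);    % integer codes in [-qMax, qMax]
    restored = q * scale;                 % values the hardware effectively uses
    maxErr   = max(abs(restored - weights));
    fprintf('%2d bits: step %.2e, worst-case rounding error %.2e\n', ...
        numBits, scale, maxErr);
end
```

The worst-case rounding error is half a step, so going from 16 to 8 bits makes it roughly 256 times larger. Whether the network tolerates that can only be judged by measuring accuracy on test images.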

So the neural-network compiler lets you create an instruction stream that represents the model, and those instructions can then be simulated or outright tested to judge whether the right balance has been struck between performance, power, and accuracy. This is usually measured by the percentage of images that were correctly processed out of a set of test images (different from the training images).
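
The accuracy metric described here is simple to compute. The labels and predictions below are invented for illustration; in practice they would come from running the compiled model on a held-out test set.

```matlab
% Hypothetical ground truth and predictions for ten held-out test images.
truth     = [1 0 1 1 0 0 1 0 1 1];        % 1 = person present, 0 = absent
predicted = [1 0 1 0 0 0 1 1 1 1];        % model output after quantization

accuracy = 100 * mean(predicted == truth);   % percent of images classified correctly
fprintf('Accuracy on the test set: %.0f%%\n', accuracy);
```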

Improved operation can often be obtained by optimizing a model, including pruning of some nodes to reduce resource consumption, and then retraining the model in the abstract again. This is a design loop that allows you to fine-tune the accuracy while operating within constrained resources.

Two Detection Examples

We can see how the tradeoffs play out with two different vision examples. The first is a face-detection application; the second is a human-presence-detection application. We can view how the differences in the resources available in the different FPGAs affect the performance and power of the resulting implementations.

Both of these examples take their inputs from a camera, and they both execute on the same underlying engine architecture. For the UltraPlus implementation, the camera image is downsized and then processed through eight multipliers, leveraging internal storage and using LEDs as indicators.

Figure 8. UltraPlus platform for face-detection and human-presence applications

The ECP5 family has more resources, and so it provides a platform with more computing power. Here the camera image is pre-processed in an image signal processor (ISP) before being sent into the CNN. The results are combined with the original image in an overlay engine that allows text or annotations to be overlaid on the original image.

Figure 9. ECP5 platform for face-detection and human-presence applications

We can use a series of charts to measure the performance, power, and area of each implementation of the applications. We also do two implementations of each application: one with fewer inputs and one with more inputs.

For the face-detection application, we can see the results in Figure 10. Here the two implementations use 32 x 32 inputs for the simple version and 90 x 90 inputs for the more complex one.

Figure 10. Performance, power, and area results for simple and complex implementations of the face-detection application in UltraPlus and ECP5 FPGAs

The left-hand axis shows the number of cycles required to process an image and how those cycles are spent. The right-hand axis shows the resulting frames-per-second (fps) performance for each implementation (the green line). Finally, each implementation shows the power and area.
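
The two axes are linked by a simple relationship: frames per second equals the clock frequency divided by the cycles needed per frame. The numbers in the sketch below are hypothetical and are not read from the figures; they only illustrate how cutting convolution cycles raises the frame rate.

```matlab
% Hypothetical numbers; neither value comes from the figures.
clockHz        = 27e6;     % assumed engine clock frequency
cyclesPerFrame = 1.8e6;    % assumed total cycles to process one image
convCycles     = 1.2e6;    % assumed share of those cycles spent on convolution

fps = clockHz / cyclesPerFrame;

% Doubling the multipliers would ideally halve the convolution cycles.
fpsFaster = clockHz / (cyclesPerFrame - convCycles / 2);

fprintf('Baseline: %.1f fps, with halved convolution cycles: %.1f fps\n', fps, fpsFaster);
```

Note that the non-convolution cycles are untouched, so the frame rate improves by less than a factor of two even though the convolution work was halved.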

The orange bars in the 32 x 32 example on the left represent the cycles spent on convolution. The UltraPlus has the fewest multipliers of the four examples; the other three are ECP5 devices with successively more multipliers. As the number of multipliers increases, the number of cycles required for convolution decreases.

The 90 x 90 example is on the right, and the results are quite different. There is a significant new blue contribution to the cycles on the bottom of each stack. This is the result of the more complex design using more memory than is available internally in the devices. As a result, they have to go out to DRAM, which hurts performance. Note also that this version cannot be implemented in the smaller UltraPlus device.

A similar situation holds for the human-presence application. Here the simple version uses 64 x 64 inputs, while the complex version works with 128 x 128 inputs.

Figure 11. Performance, power, and area results for simple and complex implementations of the human-presence application in UltraPlus and ECP5 FPGAs

Again, more multipliers reduce the convolution burden, and relying on DRAM hurts performance.

The performance for all versions is summarized in Table 1. This includes a measure of the smallest identifiable object or feature in an image, expressed as a percent of the full field of view. Using more inputs helps here, providing additional resolution for smaller objects.

Table 1. Performance summary of the two example applications in four different FPGAs


In summary, then, edge-inference AI designs that demand low power, flexibility, and scalability can be readily implemented in Lattice FPGAs using the resources provided by the Lattice sensAI offering. It makes available the critical elements needed for successful deployment of AI algorithms:

  • Neural network compiler
  • Neural engine soft IP
  • Diamond design tools
  • Development boards
  • Reference designs

Much more information is available from Lattice; go to www.latticesemi.com to start using the power of AI in your designs.

What Is CoaXPress?

This article was originally published at Basler's website. It is reprinted here with the permission of Basler.