Embedded Vision Alliance: Technical Articles


Using Calibration to Translate Video Data to the Real World


This article was originally published at NVIDIA's website. It is reprinted here with the permission of NVIDIA.

DeepStream SDK 3.0 is about seeing beyond pixels. DeepStream exists to make it easier for you to go from raw video data to metadata that can be analyzed for actionable insights. Calibration is a key step in this process, in which the location of objects present in a video stream is translated into real-world geo-coordinates. This post walks through the details of calibration using DeepStream SDK 3.0.

Design

The DeepStream SDK is often used to develop large-scale systems such as intelligent traffic monitoring and smart buildings. This approach to calibration is meant for complex, scalable environments like these, and does not require a physical presence at the site.

Background

One of the big issues with extracting usable data from video streams is taking an object detected by the camera and translating it into a geo-location. Take a traffic camera as an example. When the camera sees a car, the raw image of the car isn’t useful to a smart cities system on its own. The car would ideally be placed in an information grid that also projects a live bird’s eye view of the activities in the city for the operator’s use.

Doing this means translating that camera image into latitude and longitude coordinates corresponding to the car’s location on that intersection. Technically, this is a transformation from the image plane of the camera (image of the car) to a global geo-location (latitude/longitude coordinate). Transformations like this are critical to a variety of use cases beyond simple visualization. Solutions that require multi-camera object tracking, movement summarization, geo-fencing, and other geo-locating for business intelligence and safety can leverage the same technique. We call this calibration.

Let’s take a closer look at how to approach calibration for applications built using DeepStream 3.0.

Approaches to Calibration

Multiple approaches exist for calibrating cameras to yield global coordinates. Several popular methods use a process based on inferring the intrinsic and extrinsic camera parameters. Global coordinates are then inferred with a simple geometric transformation from camera world to the real world.

One way to do this is to use a “checkerboard” pattern to infer the camera parameters. From there, a homographic transformation (a mapping from the image plane to the real-world plane) can be used to infer global coordinates.

While the checkerboard approach is a high-fidelity method for calibration, it’s both labor and resource intensive. This makes it impractical for smart cities applications like parking garages and traffic intersections that regularly employ hundreds of cameras in concert. Specifically, the checkerboard approach is:

  • Not generalizable. The technique requires creation of custom checkerboards for each application and the checkerboards must be placed at various angles in the camera view.
  • Invasive in high traffic areas. Placement of checkerboards requires populated and frequently active areas to be cleared for calibration work, impractical on public roads and in other crowded spaces.
  • Not automatable. No uniform or simple way to automate the process exists. Equal time and manpower must be spent on each camera, which can be excessive in some cases.

The approach outlined in this post is suitable for camera systems where cameras are observing a fixed field-of-view (FoV). That is, the cameras are fixed and are all watching the same geo-region. This approach is not suitable for cameras mounted on moving objects (e.g., cars) or Pan-Tilt-Zoom cameras.

Additionally, image size and scaling factors must be the same across all cameras and we must be able to access still images from each camera. We also need access to a global map of the area being watched.

Note that the calibration phase usually involves estimating the camera’s intrinsic and extrinsic parameters, which are then used to transform each pixel into a global location using geometry operations. However, in several use cases similar to ours the camera is a 360-degree camera, and such simple transformations may not be possible: objects in the original 360-degree image appear distorted. Before we can actually perform the calibration steps, we need to supply pixels from a corrected image.

Consider image (A) in figure 1 below from a 360-degree camera. DeepStream first de-warps the original fisheye image to look like image (B), removing object distortion. This dewarped image is then used for car detection. This undistorted image of the car supplies the pixels for translation into a global coordinate.


Figure 1. Image A (left): Snapshot from a 360-degree camera. It is hard for object-detection algorithms to detect objects because of the warped nature of the image. B (right): A sample dewarped image of a region where objects can be easily detected by object-detection algorithms.
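
DeepStream performs this dewarping with its own dewarper component. Purely as a conceptual illustration, the Python/OpenCV sketch below shows how a fisheye frame can be undistorted when calibrated intrinsics are available; the K matrix, distortion coefficients, and file names are hypothetical placeholders, not values from this example.

# Conceptual illustration only: DeepStream uses its own dewarper, but OpenCV's
# fisheye module produces a comparable undistorted view when the camera
# intrinsics are known. K, D, and the file names below are placeholders.
import numpy as np
import cv2

fisheye_img = cv2.imread("fisheye_snapshot.png")
h, w = fisheye_img.shape[:2]

K = np.array([[700.0, 0.0, w / 2],     # hypothetical focal lengths and center
              [0.0, 700.0, h / 2],
              [0.0, 0.0, 1.0]])
D = np.array([0.1, -0.05, 0.0, 0.0])   # hypothetical fisheye distortion terms

# Build the undistortion maps once, then remap the frame.
map1, map2 = cv2.fisheye.initUndistortRectifyMap(
    K, D, np.eye(3), K, (w, h), cv2.CV_16SC2)
dewarped = cv2.remap(fisheye_img, map1, map2, interpolation=cv2.INTER_LINEAR)
cv2.imwrite("dewarped.png", dewarped)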

System Overview

At a high level, this approach constructs corresponding polygons in the camera image and global maps. A transformation matrix maps camera space to global space, as presented in the flow diagram in figure 2.


Figure 2: Overview of the calibration process

The steps below outline how an application implements this process:

  1. Draw a polygon on one of the camera images. From this we get four points on the camera plane (e.g., points Ac, Bc, Cc and Dc). Use an image annotation tool for this step.
  2. Draw the corresponding polygon on the global map, resulting in four corresponding points (e.g., points Ag, Bg, Cg, Dg). Use a GIS tool for this step.
  3. Create a CSV file which contains the information required for calibration. For each camera, insert one row with the information <CameraId, Ac, Bc, Cc, Dc, Ag, Bg, Cg, Dg>.
  4. Load the CSV file into DeepStream. DeepStream computes a per-camera transformation matrix that translates every pixel in the camera plane into global coordinates.

The transformation matrix computes global coordinates for each object detected by the camera.
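DeepStream computes this matrix internally from the calibration CSV. As an illustration of the underlying math, the following Python/OpenCV sketch builds the same kind of four-point mapping (using the example values from Table 1 later in this post) and applies it to a hypothetical detection pixel; it is not the DeepStream implementation.

# Minimal illustration of the per-camera transformation: a homography from four
# camera-image points to four global (x, y) points. Values are the Table 1
# examples; the detection pixel is hypothetical.
import numpy as np
import cv2

# Four annotated points on the camera image plane (Ac, Bc, Cc, Dc), in pixels.
camera_pts = np.float32([[510, 186], [1050, 126], [1443, 351], [21, 531]])
# The corresponding four points on the global map (Ag, Bg, Cg, Dg), in meters.
global_pts = np.float32([[-105.87, -12.58], [-105.94, -4.76],
                         [-96.01, -4.86], [-95.99, -11.81]])

# 3x3 homography mapping camera pixels to global coordinates.
H = cv2.getPerspectiveTransform(camera_pts, global_pts)

# Map the pixel location of a detected object (e.g., the bottom center of a
# car's bounding box) into global coordinates.
detection_px = np.float32([[[800.0, 300.0]]])
gx, gy = cv2.perspectiveTransform(detection_px, H)[0, 0]
print(gx, gy)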

The Calibration Process

Calibration is a multi-step process: annotating maps, annotating images, and drawing polygons. Let’s walk through the steps:

Annotating Maps

The process involves mapping coordinates in images and global maps. You can use an open source geographical information system tool like QGIS here. QGIS helps you draw polygons and lines with respect to a real map and export the resulting coordinates as a CSV file. You can use this to geo-reference a city block or a parking level image.

Annotating maps requires QGIS; the QGIS project documentation should help you learn more about installing and using it.

Annotating Images

There are many image annotation tools available; Ratsnake is a useful, freely available tool. Let’s walk through the steps for annotating images.

Step 1. Capturing image snapshots from cameras

The first step in calibration is obtaining snapshot images from all cameras. Snapshots should show clear, salient feature points of the region of interest. These salient feature points will be mapped to the features seen on a global map. For example, snapshots from cameras installed inside a parking garage should clearly show pillars, parking spot lines painted on the ground, and other features of the building itself. Take the snapshots when the area is empty, or near-empty, to ensure few vehicles, pedestrians, and other large objects block the building features.

We’ll store these snapshots in a directory and label it for easy reference. For example, make a directory called CAM_IMG_DIR=/mnt/camdata/images/ and save the images there. Individual snapshots may be named with the IP address of the camera they were taken with. For a camera with IP 10.10.10.10, save the snapshot image as ${CAM_IMG_DIR}/10_10_10_10.png.

Step 2. Blueprint/CAD image

Download a global map (blueprint or CAD image) of the location being observed — a parking area in our example, as shown in figure 3. Save the map to a directory GIS_DIR (e.g., GIS_DIR=/mnt/camdata/gis/). For example, we save the png file of the parking area as ${GIS_DIR}/parking.png.


Figure 3. Example parking area image (${GIS_DIR}/parking.png)

Step 3. Georeferencing

Georeferencing maps every point of the region being monitored into a global coordinate system, e.g., latitude and longitude. In other words, it maps every point in the garage to its latitude and longitude.

Depending on the region you are monitoring, you may be able to use existing maps — particularly for outdoor regions. Say you’re using traffic cameras to monitor an intersection. There may well already be a Google or QGIS map you can use to get the coordinates of the intersection and/or traffic light itself.

However, in many use-cases there are no pre-existing georeferenced maps suitable for use in calibration. This is especially true in indoor scenarios, like our parking garage example. That said, you can often find CAD images or blueprints of buildings, and other indoor map files (usually in pdf or picture format) that provide coordinates for at least some key points in the region of interest.

Once you have your CAD image, blueprint, or other indoor map, it’s ready for georeferencing. You do this by placing the blueprint accurately on the global map using QGIS.

Georeferencing works in our methodology if the area in question has at least a few key feature points observable in both the blueprint and the map. Examples might include pillars or corners of staircases.

The process of georeferencing is described below:

  1. Using a GPS receiver (such as a smartphone), log the latitude and longitude coordinates of various feature points.
  2. Open the QGIS application. Launch the Georeferencer plugin.
  3. Open the blueprint (jpg/png image) in the Georeferencer plugin, and follow the guide for georeferencing. Going back to our example from steps 1 and 2 above, if we want to map the parking area, then we use the corresponding png file ${GIS_DIR}/parking.png. Map each of the feature points on both the QGIS map and the blueprint image.
  4. The resultant output is a georeferenced TIFF file that provides accurate geo-coordinates to any point on the map. Georeferencing yields one image for each blueprint. Save them as ${GIS_DIR}/parking.tif.

Polygon Drawing

Let’s walk through the detailed steps for calibrating one camera (say, camera A with IP 10.10.10.10). Assume that the snapshots for each camera are stored in ${CAM_IMG_DIR}. We’ll need to repeat these steps for each camera in our setup.

Figure 4 shows the global map and the camera image for camera A.


Figure 4. The global map and the camera image for camera A.

  1. Open QGIS and load the global map. In this example, we will load the geo-referenced image of the region covered by camera A. Since this camera covers the above-mentioned parking area, we load the file ${GIS_DIR}/parking.tif.
  2. Narrow down on the region covered by camera A on the global map. In the picture above, we show the entire global map, with a zoomed-in view of it on the left-hand side.
  3. Open the image snapshot ${CAM_IMG_DIR}/10_10_10_10.png using Ratsnake.
  4. Identify the salient feature points that can be seen on both the global map and the snapshot. In this example, we see pillars and a few parking spot lines in the camera image.
  5. Draw an identifying quadrilateral on the camera image using Ratsnake. Mark the points Ac, Bc, Cc and Dc (see snapshot image on right hand side).
  6. Draw the corresponding quadrilateral on the global map. Call its points Ag, Bg, Cg and Dg. (For details on drawing polygons in QGIS and Ratsnake, see the next section below.)
  7. Note that each point on the global map has to map back to its corresponding point on the snapshot image, i.e. Ag has to map to Ac, and so on. To do this in QGIS, the quadrilateral has to be drawn in the same direction, and starting with the same corresponding point, on both the camera image and in QGIS.

Drawing Polygons in QGIS and Ratsnake

First, let’s go over how to draw a polygon in QGIS. Note the global coordinate for each point on the polygon (e.g., point Ag). A quadrilateral drawn on the map consists of four (x,y) points in global coordinates. Each global coordinate (x,y) can be expressed as the number of meters from the origin (in the x and y directions); the origin may be, for example, the center of the building. In addition, the (longitude, latitude) for each point (x,y) is also given. To get the (x,y) values from the QGIS tool, follow this procedure:

Drawing the polygons

  1. Create a new “Vector Layer” in QGIS tool for drawing polygons.
  2. Add a feature called CameraId (string) that corresponds to the id of the camera.
  3. Draw the quadrilateral for each camera. Make sure that there are exactly 4 points.
  4. Update the quadrilateral’s CameraId to the camera ID (e.g., “C_10_10_10_10”).
  5. Also, note down the longitude and latitude of the origin point (center of the building).
  6. Figure 5 shows example polygons that are drawn for an example parking area for various cameras. The background image is the map (${GIS_DIR}/parking.tif), and the gray boxes are the polygons that have been drawn.


Figure 5. Examples of polygons drawn using QGIS

Get the (longitude, latitude) for each of the polygon points

  1. Export the vector layer created in the previous step as a CSV. Make sure you have the following columns:
    CameraId, longitude0, latitude0, longitude1, latitude1, longitude2, latitude2, longitude3, latitude3
  2. Read the shapefile and get the attributes of the shape. You may use Python’s pyshp package; its documentation (https://pypi.org/project/pyshp/) shows how to read the latitude and longitude points of the shape and its attributes (in our case, CameraId).
  3. Given an origin point (longitudeOrigin, latitudeOrigin), convert the latitude and longitude of each shape point to a corresponding (x, y) based on the distance and angle of the point from the origin (a minimal sketch of steps 2 and 3 follows this list).
  4. We will now have four global coordinate points: (gx0,gy0), (gx1,gy1), (gx2,gy2), and (gx3,gy3)
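
As a minimal sketch of steps 2 and 3, the Python snippet below reads the polygon layer with pyshp and converts each (longitude, latitude) vertex to (x, y) meters from a chosen origin using a simple equirectangular approximation. The file name and origin values are placeholders, and the conversion is an approximation, not the exact projection QGIS would use.

# Hypothetical sketch: read the exported polygon layer with pyshp and convert
# each (longitude, latitude) vertex to (x, y) meters from the origin.
import math
import shapefile  # pip install pyshp

ORIGIN_LON, ORIGIN_LAT = -121.000000, 37.000000  # e.g., center of the building
M_PER_DEG_LAT = 111320.0  # approximate meters per degree of latitude

def lonlat_to_xy(lon, lat):
    # Approximate local planar coordinates, in meters, relative to the origin.
    x = (lon - ORIGIN_LON) * M_PER_DEG_LAT * math.cos(math.radians(ORIGIN_LAT))
    y = (lat - ORIGIN_LAT) * M_PER_DEG_LAT
    return x, y

sf = shapefile.Reader("camera_polygons")  # the vector layer exported from QGIS
for shape_rec in sf.shapeRecords():
    camera_id = shape_rec.record["CameraId"]
    # One quadrilateral per camera: take the four (lon, lat) vertices.
    xy = [lonlat_to_xy(lon, lat) for lon, lat in shape_rec.shape.points[:4]]
    print(camera_id, xy)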

Draw the polygon on Ratsnake

  1. Note down the camera coordinates for Ac, Bc, Cc and Dc for each camera. Let us call them (cx0,cy0), (cx1,cy1), (cx2,cy2), (cx3,cy3). Export these points for each camera.
  2. Create the calibration table (say, as a CSV file nvaisle_2M.csv) similar to the example in table 1 below. This helps DeepStream transform from camera coordinates to global coordinates.
Column Example Comments
cameraId C10_10_10_10  
ipaddress 10.10.10.10  
level P1  
gx0 -105.8660603 Global coordinates
gy0 -12.57717718 Global coordinates
gx1 -105.9378082 Global coordinates
gy1 -4.760517508 Global coordinates
gx2 -96.0054864 Global coordinates
gy2 -4.86179862 Global coordinates
gx3 -95.99345216 Global coordinates
gy3 -11.80735727 Global coordinates
cx0 510 Camera coordinates
cy0 186 Camera coordinates
cx1 1050 Camera coordinates
cy1 126 Camera coordinates
cx2 1443 Camera coordinates
cy2 351 Camera coordinates
cx3 21 Camera coordinates
cy3 531 Camera coordinates

Table 1. Example calibration table
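
As an illustration, the short Python sketch below writes one such row. The column names follow Table 1; whether DeepStream expects exactly this header and ordering should be confirmed against the SDK documentation.

# Hypothetical sketch: assemble one calibration row per camera in the format of
# Table 1 and write it to nvaisle_2M.csv. Values are the Table 1 examples.
import csv

fields = ["cameraId", "ipaddress", "level",
          "gx0", "gy0", "gx1", "gy1", "gx2", "gy2", "gx3", "gy3",
          "cx0", "cy0", "cx1", "cy1", "cx2", "cy2", "cx3", "cy3"]

row = {"cameraId": "C10_10_10_10", "ipaddress": "10.10.10.10", "level": "P1",
       "gx0": -105.8660603, "gy0": -12.57717718, "gx1": -105.9378082,
       "gy1": -4.760517508, "gx2": -96.0054864, "gy2": -4.86179862,
       "gx3": -95.99345216, "gy3": -11.80735727,
       "cx0": 510, "cy0": 186, "cx1": 1050, "cy1": 126,
       "cx2": 1443, "cy2": 351, "cx3": 21, "cy3": 531}

with open("nvaisle_2M.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerow(row)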

Transferring CSV to DeepStream Server

The CSV file created above (nvaisle_2M.csv) is added to the DeepStream configuration directory, enabling DeepStream to infer the geo-location of detected cars.

The calibration techniques you’ve learned in this post, combined with DeepStream SDK 3.0, enable you to easily create scalable applications with a rich UI to deliver complete situational awareness. Download the DeepStream SDK 3.0 today to get started.

Vinay Kolar
Senior Data Scientist, NVIDIA

Using MATLAB and TensorRT on NVIDIA GPUs


This article was originally published at NVIDIA's website. It is reprinted here with the permission of NVIDIA.

As we design deep learning networks, how can we quickly prototype the complete algorithm—including pre- and postprocessing logic around deep neural networks (DNNs) —to get a sense of timing and performance on standalone GPUs? This question comes up frequently from the scientists and engineers I work with. Traditionally, they would hand translate the complete algorithm into CUDA and compile it with the NVIDIA toolchain. However, they want to know if there’s a more automated way of short-circuiting the standard process.

Depending on the tools you’re using, compilers exist which can help automate the process of converting designs to CUDA. Engineers and scientists using MATLAB have access to tools to label ground truth and accelerate the design and training of deep learning networks that were covered in a previous post. MATLAB can also import and export using the ONNX format to interface with other frameworks. Finally, to quickly prototype designs on GPUs, MATLAB users can compile the complete algorithm to run on any modern NVIDIA GPUs, from NVIDIA Tesla to DRIVE to Jetson AGX Xavier platforms.

In this post, you’ll learn how you can use MATLAB’s new capabilities to compile MATLAB applications, including deep learning networks and any pre- or postprocessing logic, into CUDA and run it on modern NVIDIA GPUs.

Let’s use a traffic sign detection and recognition (TSDR) example to show the steps in the workflow:

  • Run and test algorithm in MATLAB
  • Compile algorithm to CUDA and run on desktop GPU
  • Compile algorithm to CUDA and integrate with external applications

Traffic Sign Detection and Recognition Algorithm

The goal of the algorithm is to detect and recognize traffic signs using cameras mounted on vehicles. We feed in input images or video to the algorithm and it returns with a listing of traffic signs detected in the input. Traffic signs are also identified by a box in the output image. Figure 1 shows a test image and successful detection of a stop sign.


Figure 1. (Left) Input image and (right) output image with identified traffic sign

Figure 2 shows that traffic sign detection and recognition happens in three steps: detection, Non-Maximal Suppression (NMS), and recognition. First, the detection network (a variant of the You Only Look Once (YOLO) network) detects traffic signs from input images. Overlapping detections from the preceding stage are then suppressed using the NMS algorithm. Finally, the recognition network classifies the detected traffic signs.


Figure 2. Traffic Sign Detection and Recognition algorithm

Detection and Recognition Networks

The detection network is trained in the Darknet framework and imported into MATLAB for inference. All traffic signs are considered a single class for training the detection network, since the size of a traffic sign is small relative to that of the image and the number of training samples per class is small in the training data.

The detection network divides the input image into a 7 x 7 grid, and each grid cell detects a traffic sign if the center of the traffic sign falls within the grid cell. Each cell predicts two bounding boxes and confidence scores for these bounding boxes. Confidence scores tell us whether the box contains an object or not. Each cell also predicts the probability of finding the traffic sign in the grid cell. The final score is the product of the above two. We apply a threshold of 0.2 on this final score to select the detections.
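
As a language-neutral illustration of that scoring step (a NumPy sketch, not the MATLAB code used in this example), the snippet below forms the final score as the product of box confidence and per-cell class probability and thresholds it at 0.2; the random values stand in for real network outputs.

# Illustrative sketch of the 7x7-cell, 2-box YOLO-style scoring described above.
import numpy as np

box_confidence = np.random.rand(7, 7, 2)   # objectness per cell, per box
class_prob = np.random.rand(7, 7, 1)       # single "traffic sign" class per cell

final_score = box_confidence * class_prob  # broadcast to shape (7, 7, 2)
keep = final_score > 0.2                   # boolean mask of selected detections
print(np.argwhere(keep))                   # (row, col, box) indices that survive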

The detection network contains 58 layers, including convolution, leaky ReLU, and fully connected layers. Table 1 shows a snippet of the layers displayed in MATLAB.

58×1 Layer array with layers:
1 ‘input’ Image Input 448x448x3 images
2 ‘conv1’ Convolution 64 7x7x3 convolutions with stride [2 2] and padding [3 3 3 3]
3 ‘relu1’ Leaky ReLU Leaky ReLU with scale 0.1
4 ‘pool1’ Max Pooling 2×2 max pooling with stride [2 2] and padding [0 0 0 0]
5 ‘conv2’ Convolution 192 3x3x64 convolutions with stride [1 1] and padding [1 1 1 1]
6 ‘relu2’ Leaky ReLU Leaky ReLU with scale 0.1
7 ‘pool2’ Max Pooling 2×2 max pooling with stride [2 2] and padding [0 0 0 0]
8 ‘conv3’ Convolution 128 1x1x192 convolutions with stride [1 1] and padding [0 0 0 0]
9 ‘relu3’ Leaky ReLU Leaky ReLU with scale 0.1
10 ‘conv4’ Convolution 256 3x3x128 convolutions with stride [1 1] and padding [1 1 1 1]
11 ‘relu4’ Leaky ReLU Leaky ReLU with scale 0.1
12 ‘conv5’ Convolution 256 1x1x256 convolutions with stride [1 1] and padding [0 0 0 0]
13 ‘relu5’ Leaky ReLU Leaky ReLU with scale 0.1
14 ‘conv6’ Convolution 512 3x3x256 convolutions with stride [1 1] and padding [1 1 1 1]
15 ‘relu6’ Leaky ReLU Leaky ReLU with scale 0.1
16 ‘pool6’ Max Pooling 2×2 max pooling with stride [2 2] and padding [0 0 0 0]

Table 1. A snippet of the 58 layers in the detection network

The recognition network is trained on the same images using MATLAB and contains 14 layers, including convolution, fully connected, and classification output layers. Table 2 shows details of the layers displayed in MATLAB.

14×1 Layer array with layers:
1 ‘imageinput’ Image Input 48x48x3 images with ‘zerocenter’ normalization and ‘randfliplr’ augmentations
2 ‘conv_1’ Convolution 100 7x7x3 convolutions with stride [1 1] and padding [0 0 0 0]
3 ‘relu_1’ ReLU ReLU
4 ‘maxpool_1’ Max Pooling 2×2 max pooling with stride [2 2] and padding [0 0 0 0]
5 ‘conv_2’ Convolution 150 4x4x100 convolutions with stride [1 1] and padding [0 0 0 0]
6 ‘relu_2’ ReLU ReLU
7 ‘maxpool_2’ Max Pooling 2×2 max pooling with stride [2 2] and padding [0 0 0 0]
8 ‘conv_3’ Convolution 250 4x4x150 convolutions with stride [1 1] and padding [0 0 0 0]
9 ‘maxpool_3’ Max Pooling 2×2 max pooling with stride [2 2] and padding [0 0 0 0]
10 ‘fc_1’ Fully Connected 300 fully connected layer
11 ‘dropout’ Dropout 90% dropout
12 ‘fc_2’ Fully Connected 35 fully connected layer
13 ‘softmax’ Softmax Softmax
14 ‘classoutput’ Classification Output crossentropyex with ‘0’ and 34 other classes

Table 2. The 14 layers of the recognition network

Run and Test Algorithm in MATLAB

The TSDR algorithm is defined in the tsdr_predict.m function. The function starts by converting the input image into BGR format before sending it to the detection network, which is specified in yolo_tsr.mat. The function loads network objects from yolo_tsr.mat into a persistent variable detectionnet so persistent objects are reused on subsequent calls to the function.


function [selectedBbox,idx] = tsdr_predict(img)
coder.gpu.kernelfun;
img_rz = imresize(img,[448,448]); % Resize the image
img_rz = img_rz(:,:,3:-1:1); % Converting into BGR format
img_rz = im2single(img_rz);

%% Traffic sign detection
persistent detectionnet;
if isempty(detectionnet)
   detectionnet = coder.loadDeepLearningNetwork('yolo_tsr.mat','Detection');
end
predictions = detectionnet.activations(img_rz,56,'OutputAs','channels');


The function then takes the output from the detection network to find bounding box coordinates in the input image before suppressing overlapping detections using the selectStrongestBbox function.


coder.varsize('selectedBbox',[98, 4],[1 0]);
[selectedBbox,~] = selectStrongestBbox(round(boxes),probs);


Finally, the function recognizes traffic signs using the recognition network. As before with detectionnet, the function loads the network objects from RecognitionNet.mat into a persistent variable recognitionnet so persistent objects are reused on subsequent calls.


persistent recognitionnet;
if isempty(recognitionnet)
   recognitionnet = coder.loadDeepLearningNetwork('RecognitionNet.mat','Recognition');
end

idx = zeros(size(selectedBbox,1),1);
inpImg = coder.nullcopy(zeros(48,48,3,size(selectedBbox,1)));
for i = 1:size(selectedBbox,1)

   ymin = selectedBbox(i,2);
   ymax = ymin+selectedBbox(i,4);
   xmin = selectedBbox(i,1);
   xmax = xmin+selectedBbox(i,3);

   % Resize Image
   inpImg(:,:,:,i) = imresize(img(ymin:ymax,xmin:xmax,:),[48,48]);

end

for i = 1:size(selectedBbox,1)
   output = recognitionnet.predict(inpImg(:,:,:,i));
   [~,idx(i)]=max(output);
end


To test tsdr_predict.m running in MATLAB using the CPU, we can write a test script that feeds a test image to tsdr_predict, then map class numbers to the class dictionary to get the type of traffic sign detected. We then draw a bounding box around the detected traffic sign and label it on the output image. The result from running the test script below is the same output image shown in Figure 1.


im = imread('stop.jpg');
im = imresize(im, [480,704]);
[bboxes,classes] = tsdr_predict(im);

% Map the class numbers to traffic sign names in the class dictionary.
classNames = {'addedLane','slow','dip','speedLimit25','speedLimit35','speedLimit40','speedLimit45',...
   'speedLimit50','speedLimit55','speedLimit65','speedLimitUrdbl','doNotPass','intersection',...
   'keepRight','laneEnds','merge','noLeftTurn','noRightTurn','stop','pedestrianCrossing',...
   'stopAhead','rampSpeedAdvisory20','rampSpeedAdvisory45','truckSpeedLimit55',...
   'rampSpeedAdvisory50','turnLeft','rampSpeedAdvisoryUrdbl','turnRight','rightLaneMustTurn',...
   'yield','yieldAhead','school','schoolSpeedLimit25','zoneAhead45','signalAhead'};

classRec = classNames(classes);
outputImage = insertShape(im,'Rectangle',bboxes,'LineWidth',3);

for i = 1:size(bboxes,1)
   outputImage = insertText(outputImage,[bboxes(i,1)+bboxes(i,3) bboxes(i,2)-20],classRec{i},...
      'FontSize',20,'TextColor','red');
end
figure;
imshow(outputImage);

Compile Algorithm to CUDA and Run on Desktop GPU

Having tested the algorithm successfully in MATLAB on the CPU, the next step is to improve performance by running the algorithm on GPUs. Let’s begin by using the newly released MATLAB GPU Coder to compile the complete algorithm into CUDA. We first create a GPU configuration object for MEX files, which are compiled functions that can be called from within MATLAB. We can specify the configuration to use either cuDNN or TensorRT with INT8 datatypes:


cfg.DeepLearningConfig = coder.DeepLearningConfig('cudnn'); % Use cuDNN
cfg.DeepLearningConfig = coder.DeepLearningConfig('tensorrt'); % Use TensorRT


Let’s use TensorRT. We’ll run the codegen command to start the compilation and specify the input to be of size [480,704,3] and type uint8. This value corresponds to the input image size of the tsdr_predict function. GPU Coder then creates a MEX file, tsdr_predict_mex.


cfg = coder.gpuConfig('mex');
cfg.TargetLang = 'C++';
cfg.DeepLearningConfig = coder.DeepLearningConfig('tensorrt');
codegen -config cfg tsdr_predict -args {ones(480,704,3,'uint8')} -report


To test the MEX file, we reuse the same test script shown in the preceding section. We make one change to use tsdr_predict_mex instead of tsdr_predict.


[bboxes,classes] = tsdr_predict_mex(im);


The result from running tsdr_predict_mex is the same as running tsdr_predict. The output image matches the one shown in Figure 1 with a bounding box around the labeled traffic sign.

We can further test the algorithm on suites of test images and videos; MATLAB provides various facilities for accessing data stored locally, on networks, and in the cloud. We can even bring in live images and video using cameras connected to our testing machines. MATLAB also provides a unit test framework to help set up and run tests in a systematic way.

Compare Performance Gain of TensorRT and cuDNN

Earlier, we mentioned we can compile tsdr_predict.m to use cuDNN or TensorRT. Let’s take a look at the performance gain of using TensorRT relative to that of using cuDNN. We will use the same machine fitted with a Titan V GPU and Intel Xeon processor to time the results.

First, let’s record the execution time of the current MEX file using TensorRT with the help of the MATLAB timeit function. Averaged over 10 executions, we see an execution time of 0.0107s, which is equivalent to about 93 images/sec.


f = @() tsdr_predict_mex(im);
measured_time=0;
for i = 1:10
   measured_time = measured_time + timeit(f);
end
measured_time = measured_time/10;


Next, let’s time the execution time of the MEX file that uses cuDNN. We will retrace our steps and configure GPU Coder to use cuDNN to create the MEX file.


cfg = coder.gpuConfig('mex');
cfg.TargetLang = 'C++';
cfg.DeepLearningConfig = coder.DeepLearningConfig('cudnn');
codegen -config cfg tsdr_predict -args {ones(480,704,3,'uint8')} -report


We then use the same timeit function to run the MEX file using cuDNN. When averaged over 10 executions, we see an execution time of 0.0131s, which is approximately 76 images/sec. Comparing these two results, we see that using TensorRT with INT8 increased throughput by about 22% (93/76 ≈ 1.22) for single-image inference using two moderately sized networks. Table 3 summarizes the execution time of running on the CPU and GPU (Titan V) with cuDNN and TensorRT.

Metric | CPU | GPU with cuDNN | GPU with TensorRT (INT8)
Execution time (s) | 0.0320 | 0.0131 | 0.0107
Equivalent images/sec | 31 | 76 | 93

Table 3. Timing results of running tsdr_predict on CPU (Intel® Xeon CPU @ 3.6 GHz) and GPU (Titan V) with cuDNN and TensorRT

As an aside, we benchmarked results of using GPU Coder with cuDNN and TensorRT on ResNet-50 using the same Titan V GPU. The results are shown in Figure 3. We found that TensorRT INT8 datatype mode increases inference performance, especially at higher batch sizes:


Figure 3. Performance benchmark of running ResNet-50 using GPU Coder with cuDNN and TensorRT on Titan V GPU

Compile Algorithm to CUDA and Integrate with External Applications

Once we made sure the algorithm ran correctly in MATLAB on our desktop GPU, we could compile the algorithm to source code or a library to integrate into larger applications. Let’s configure GPU Coder to compile the algorithm into a library.


cfg = coder.gpuConfig('lib');
cfg.TargetLang = 'C++';
cfg.DeepLearningConfig = coder.DeepLearningConfig('tensorrt');
codegen -config cfg tsdr_predict -args {ones(480,704,3,'uint8')} -report


GPU Coder then creates a static library, tsdr_predict.a. You can integrate this library with applications running on your host machine, in the cloud, or even run it on mobile and embedded systems like the Jetson Xavier.

GPU Coder provides an example main function to show how you can call the library from your application. You need to call the initialization once before calling the tsdr_predict function. Finally, you should call the terminate function to free up resources for other applications when done.


int32_T main(int32_T, const char * const [])
{
   // Initialize the application.
   // You do not need to do this more than one time.
   tsdr_predict_initialize();

   // Invoke the entry-point functions.
   // You can call entry-point functions multiple times.
   main_tsdr_predict();

   // Terminate the application.
   // You do not need to do this more than one time.
   tsdr_predict_terminate();
   return 0;
}


static void main_tsdr_predict()
{
   real32_T selectedBbox_data[392];
   int32_T selectedBbox_size[2];
   real_T idx_data[98];
   int32_T idx_size[1];
   static uint8_T b[1013760];

   // Initialize function 'tsdr_predict' input arguments.
   // Initialize function input argument 'img'.
   // Call the entry-point 'tsdr_predict'.
   argInit_480x704x3_uint8_T(b);
   tsdr_predict(b, selectedBbox_data, selectedBbox_size, idx_data, idx_size);
}


Examine the Source Code

For those inclined, we can take a deeper look at the source code, which is stored in the same folder as the library. GPU Coder creates a code generation report that provides an interface to examine the original MATLAB code and generated CUDA code. The report also provides a handy interactive code traceability tool to map between MATLAB code and CUDA. Figure 4 shows a screen capture of the tool in action.


Figure 4. Interactive code traceability report mapping MATLAB code to CUDA code and vice versa

Let’s examine parts of the compiled CUDA code. Starting with the header file tsdr_predict.h, we see there are two function declarations.


// Include Files
#include <stddef.h>
#include <stdlib.h>
#include "rtwtypes.h"
#include "tsdr_predict_types.h"

// Function Declarations
extern void tsdr_predict(const uint8_T img[1013760], real32_T selectedBbox_data[], int32_T selectedBbox_size[2], real_T idx_data[],
   int32_T idx_size[1]);
extern void tsdr_predict_init();
...


Looking inside the source file tsdr_predict.cu, we can find the tsdr_predict function. The code snippet below shows the beginning of the function.


void tsdr_predict(const uint8_T img[1013760], real32_T selectedBbox_data[],
   int32_T selectedBbox_size[2], real_T idx_data[], int32_T
   idx_size[1])
{
   int32_T auxLength;
   int32_T rowIdx;
   int32_T colIdx;
   int32_T l;
   real_T sumVal;
   real_T absx2;
   int32_T numOfBbox;
   real_T oldIdx;
   int32_T xoffset;

...


Memory allocation is taken care of through cudaMalloc calls, which the following code snippet shows.


boolean_T exitg1;
cudaMalloc(&gpu_inpImg, 55296UL);
cudaMalloc(&gpu_inpImg_data, 677376U * sizeof(uint8_T));
cudaMalloc(&b_gpu_partialResize_size, 12UL);
cudaMalloc(&d_gpu_ResizedImage, 6912UL);
cudaMalloc(&gpu_partialResize_size, 12UL);
cudaMalloc(&b_gpu_colWeightsTotal, 384UL);
cudaMalloc(&b_gpu_rowWeightsTotal, 384UL);
...


Looking further down, we see that GPU Coder generated several kernels for resizing the image. Data is moved between CPU and GPU memory spaces through cudaMemcpy calls at the appropriate locations to minimize data copies. A code snippet for part of the operation is shown below.


cudaMemcpy(gpu_img, (void *)&img[0], 1013760UL, cudaMemcpyHostToDevice);
tsdr_predict_kernel9<<<dim3(1260U, 1U, 1U), dim3(512U, 1U, 1U)>>>
   (*gpu_colWeightsTotal, *gpu_colWeights, *gpu_img, *gpu_ipColIndices,*gpu_partialResize);
tsdr_predict_kernel10<<<dim3(1176U, 1U, 1U), dim3(512U, 1U, 1U)>>>
   (*gpu_rowWeightsTotal, *gpu_rowWeights, *gpu_partialResize, *gpu_ipRowIndices, *gpu_ResizedImage);
...


Using the code traceability tool, we find the recognition network is defined in the DeepLearningNetwork_predict function. Inside, cudaMalloc calls allocate GPU memory and cudaMemcpy calls move data to the GPU before several CUDA kernels are launched. Data is moved back toward CPU memory after the CUDA kernels complete, again using cudaMemcpy, followed by cudaFree calls to free up GPU memory.


void DeepLearningNetwork_predict(b_Recognition_0 *obj, const real_T inputdata [6912], real32_T outT[35])
{
   real32_T (*gpu_inputT)[6912];
   real32_T (*gpu_out)[35];
   real_T (*gpu_inputdata)[6912];
   real32_T (*b_gpu_inputdata)[6912];
   real32_T (*gpu_outT)[35];
   cudaMalloc(&gpu_outT, 140UL);
   cudaMalloc(&gpu_out, 140UL);
   cudaMalloc(&gpu_inputT, 27648UL);
   cudaMalloc(&b_gpu_inputdata, 27648UL);
   cudaMalloc(&gpu_inputdata, 55296UL);
   cudaMemcpy(gpu_inputdata, (void *)&inputdata[0], 55296UL, cudaMemcpyHostToDevice);
c_DeepLearningNetwork_predict_k<<<dim3(14U, 1U, 1U), dim3(512U, 1U, 1U)>>>
   (*gpu_inputdata, *b_gpu_inputdata);
d_DeepLearningNetwork_predict_k<<<dim3(14U, 1U, 1U), dim3(512U, 1U, 1U)>>>
   (*b_gpu_inputdata, *gpu_inputT);
cudaMemcpy(obj->inputData, *gpu_inputT, 6912UL * sizeof(real32_T),cudaMemcpyDeviceToDevice);
obj->predict();
cudaMemcpy(*gpu_out, obj->outputData, 35UL * sizeof(real32_T), cudaMemcpyDeviceToDevice);
e_DeepLearningNetwork_predict_k<<<dim3(1U, 1U, 1U), dim3(64U, 1U, 1U)>>>
   (*gpu_out, *gpu_outT);
...


GPU Coder also generates CUDA kernels for other parts of the TSDR function to accelerate the algorithm. In total, GPU Coder created 31 CUDA kernels. The code generation report provides a listing of the kernels, along with other pertinent information.

Conclusion

In this post, we’ve covered how to run and test algorithms in MATLAB before compiling them to CUDA and accelerating them on GPUs. The generated CUDA can also be exported from MATLAB as source code or libraries and integrated with external applications running on any modern NVIDIA GPUs, from NVIDIA Tesla to DRIVE to Jetson AGX Xavier platforms. We hope this has helped you appreciate how automated CUDA compilers like GPU Coder can help short-circuit the standard process of hand translating designs into CUDA, as well as the ease by which you can tap into the powerful performance gains provided by TensorRT.

To solve the problems described in this post, I used MATLAB R2018b along with Deep Learning Toolbox, Parallel Computing Toolbox, Computer Vision System Toolbox, GPU Coder, and, of course, the NVIDIA tools, including TensorRT. You can learn more about deep learning with MATLAB and download a free 30-day trial of MATLAB using this link.

Bill Chou
Product Manager for Code Generation Products, MathWorks


Computer Vision for Augmented Reality in Embedded Designs


Augmented reality (AR) and related technologies and products are becoming increasingly popular and prevalent, led by their adoption in smartphones, tablets and other mobile computing and communications devices. While developers of more deeply embedded platforms are also motivated to incorporate AR capabilities in their products, the comparative scarcity of processing, memory, storage, and networking resources is challenging, as are cost, form factor, power consumption and other constraints. Fortunately, however, by making effective use of all available compute capabilities in the design, along with leveraging APIs, middleware and other software toolsets, these challenges are largely and increasingly surmountable.

Augmented reality (AR) and related technologies such as Microsoft's HoloLens and other "mixed reality" platforms are, along with virtual reality (VR), one of the hottest topics in technology today. Applications such as Pokémon Go have generated widespread awareness of AR in the general public, both Apple and Google have recently launched software development kits (ARKit and ARCore, respectively) to further cultivate developer activity in this area, and available middleware toolsets also promise to enable broad multi-platform support while simultaneously maximizing application efficiency on each target platform.

However, many of the existing implementations are based on smartphones and tablet computers, which are the primary topic focus of a previously published article in this series. While these platforms have cost, power consumption, and form factor challenges, they typically also offer an abundance of heterogeneous compute resources (multi-core CPUs, GPUs, DSPs, dedicated-function coprocessors, etc.), memory resources, and robust network connectivity. What about platforms that aren't resource-blessed: head-mounted displays (HMDs), smart glasses, automotive heads-up displays (HUDs), and the like?

This article discusses implementation options for delivering robust AR functionality in such deeply embedded designs, which are characterized by a scarcity of compute, memory and connectivity resources, along with having cost, power consumption (and heat dissipation), size and weight, and other constraints. It provides general concept recommendations for both hardware and software development, along with specific detailed explanations in the form of case study examples. And it also introduces readers to an industry alliance created to help product creators incorporate vision-enabled AR capabilities into their SoCs, systems and software applications, along with outlining the technical resources that this alliance provides (see sidebar "Additional Developer Assistance").

Platform and Algorithm Alternatives

AR, VR and mixed (AR-plus-VR) system designs have varying capabilities, translating into varying amounts and types of both processing resources and the software functions that run on them. The following section, authored by Synopsys, explores these topics.

AR and VR differ in fundamental ways. Using images and sounds, VR seeks to create an immersive environment for a headset user, one entirely different than the environment the user is actually in (Figure 1). A VR headset could play back a multimedia recording that the user had previously made of a café in Budapest, for example, or could use virtualized images and sound to insert the user into a video game. Such simulated environments require heavy-duty audio, graphics and video processing "engines" in order to construct the virtual worlds.


Figure 1. VR HMDs create a fully immersive and alternative environment experience (courtesy Synopsys).

AR, on the other hand, as its name implies merges simulated images, graphics and audio with those of the real world. A user could walk past that previously mentioned café in Budapest wearing AR goggles, for example, and see a graphical list of the daily specials overlaid on the café's front window. AR is only partially simulated. The real-world aspect of AR requires computer vision in order to discern and identify the surroundings, so that AR can add the virtual world to them.

Mixed reality falls in-between AR and VR and doesn't have a singular definition; instead it encompasses a range of implementations. While it might still be fully simulated, for example, it might also have the ability to classify and include real-world elements with accurate positions in the virtual world—recognizing your hands so your cartoon self can hold a wand in a wizard game, for example, or recognizing your furniture and replacing it with cartoon furniture or furniture-sized rocks from an alien landscape. Like AR, mixed reality requires the use of various computer vision techniques to detect, identify and locate real-world elements.

SLAM for Localization and Mapping

In order for hardware devices to see the real world around them and augment that reality with inserted images and/or graphics, they need to be able to both determine their position in 3D space and map the surrounding environment. In a controlled setting, markers—2D symbols such as QR codes—enable a camera to determine its position and orientation relative to a surface. However, applications such as automotive, where you can’t insert markers along every stretch of road, must work robustly in a marker-less uncontrolled environment.

In situations like these, simultaneous localization and mapping (SLAM) algorithms, which originated in robotics research, can alternatively provide geometric position for the AR system. SLAM is capable of building 3D maps of an environment, along with tracking the location and position of the camera in that environment. These algorithms estimate the position of the image sensor while simultaneously modeling the environment to create a map (Figure 2). Knowledge of the sensor’s position and pose, in combination with the generated 3D map of the environment, enables the device (therefore the user of the device) to accurately navigate the environment.


Figure 2. SLAM algorithms, used in marker-less (i.e., uncontrolled) environments, build a 3D map of the surroundings by identifying points and edges of objects and performing plane extraction from the data (courtesy Synopsys).

SLAM can be implemented in a variety of ways. Visual SLAM, for example, is a camera-centric approach that doesn’t require the inclusion of inertial measurement units (IMUs) or expensive laser sensors in the design. Monocular visual SLAM, an increasingly popular approach, relies on a single camera, as its name implies.

A typical implementation of monocular visual SLAM encompasses several key tasks:

  1. Feature extraction, or the identification of distinct landmarks (such as the lines forming the edge of a table). Feature extraction is often done with algorithms such as Features from Accelerated Segment Test (FAST), Binary Robust Independent Elementary Features (BRIEF), Oriented FAST and Rotated BRIEF (ORB), Scale-Invariant Feature Transform (SIFT), and Speeded-Up Robust Features (SURF).
  2. Feature matching between frames to determine how the motion of the camera has changed.
  3. Camera motion estimation, including loop detection and loop closure (addressing the challenge of recognizing a previously visited location).

All of these tasks are compute-intensive and will have a significant influence on the hardware chosen for an AR system.
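
As a rough illustration of tasks 1 and 2 (this is not the Synopsys implementation), the following Python/OpenCV sketch extracts and matches ORB features between two consecutive frames; the image file names are placeholders.

# Minimal OpenCV sketch of feature extraction and matching between two frames.
# The matched point pairs would feed camera-motion estimation (task 3).
import cv2

frame0 = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
frame1 = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)
kp0, des0 = orb.detectAndCompute(frame0, None)
kp1, des1 = orb.detectAndCompute(frame1, None)

# Brute-force Hamming matcher with cross-check, suited to binary ORB descriptors.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des0, des1), key=lambda m: m.distance)

# Keep the best correspondences as candidate inputs for motion estimation.
pts0 = [kp0[m.queryIdx].pt for m in matches[:200]]
pts1 = [kp1[m.trainIdx].pt for m in matches[:200]]
print(len(matches), "matches")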

Deep Learning for Perception

While SLAM provides the ability to determine a camera’s location in the environment, along with a 3D model of that environment, accurately recognizing objects in the environment requires a different technique. Deep learning algorithms such as convolutional neural networks (CNNs) are the increasingly prevalent approach to solving this particular problem.

Assuming that the neural network’s coefficients have previously been comprehensively trained, CNNs can be very accurate when subsequently performing object identification (inference) tasks, including localization (identifying the location of a particular object in an image) and classification (identifying the object, e.g., dog versus cat, or Labrador Retriever versus German Shepherd). While SLAM can help a camera (and the user of that camera) move through an environment without running into objects, a CNN can identify that a particular object is a couch, refrigerator, or desk, along with highlighting where it is located in the field of view. Popular CNN graphs for real-time object identification are You Only Look Once (YOLO) v2, Faster Region CNN (R-CNN) and Single Shot MultiBox Detector (SSD).
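
The hedged sketch below shows one way to run such an object-detection graph on a host machine using a pretrained Faster R-CNN from torchvision. It is illustrative only: the weight-selection argument varies across torchvision versions, the image path is a placeholder, and an embedded AR SoC would execute a network like this very differently.

# Illustrative detection-inference sketch using torchvision (not an embedded
# deployment). Requires torch/torchvision and Pillow; "scene.jpg" is a placeholder.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

img = to_tensor(Image.open("scene.jpg").convert("RGB"))
with torch.no_grad():
    out = model([img])[0]  # dict with 'boxes', 'labels', 'scores'

# Print detections above a simple confidence threshold.
for box, label, score in zip(out["boxes"], out["labels"], out["scores"]):
    if score > 0.5:
        print(int(label), [round(v, 1) for v in box.tolist()], float(score))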

CNN object detection graphs can be specialized to, for example, detect faces or hands. With CNN-based facial detection and recognition, AR systems are capable of adding a name and other social media-sourced information above a person’s face in the AR environment. Using a CNN to detect the user’s hands allows game developers to place an object in the player’s virtual hand. Detecting a hand’s existence is easier than determining its position; some CNN-based solutions require a depth camera output as well as RGB sensor output to train and execute a position-aware CNN graph.

CNNs can also be successfully applied to the semantic segmentation challenge. Unlike object detection, which only cares about the particular pixels in an image that might be an object of interest, semantic segmentation is concerned with every pixel. For example, in an automotive scene, a semantic segmentation CNN would label all of the pixels that represent sky, road, buildings, and individual cars as groups, a critical capability for self-driving car navigation. Applied to AR, semantic segmentation can identify the ceiling, walls and floor, as well as furniture or other objects in the space. Semantic knowledge of a scene enables realistic interactions between real and virtual objects.

Hardware Implementations

SLAM and CNN algorithms both demand a significant amount of computation per camera-captured image (frame). Creating a seamless environment for the AR user—merging the real world and virtual world without significant latency—requires a video frame rate of 20-30 frames per second (fps). This requirement means that the AR system has 33-50ms to capture, process, render and display each frame's results to the user. The faster the system can complete these tasks, the higher the frame rate and the more realistic the resultant AR seems.

When developing a SoC for a monocular SLAM system, for example, computational efficiency and memory optimization are both critical design parameters. If the camera captures 4K resolution images at 30 fps, these specifications translate into the need to store and process 8,294,400 pixels per frame, and 248,832,000 pixels per second. Commonly, embedded vision systems store each frame in an external DDR SDRAM buffer and then, as efficiently as possible, sequentially transfer portions of the frame for subsequent processing (Figure 3).
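
A quick back-of-the-envelope check of these numbers (a standalone arithmetic sketch, not part of any SoC design flow):

# Pixel throughput for a 4K stream at 30 fps, and the per-frame time budget
# implied by a 20-30 fps target.
width, height, fps = 3840, 2160, 30
pixels_per_frame = width * height            # 8,294,400
pixels_per_second = pixels_per_frame * fps   # 248,832,000
print(pixels_per_frame, pixels_per_second)
print(1000 / 30, 1000 / 20)                  # roughly 33 ms to 50 ms per frame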


Figure 3. Vision data is stored in off-chip memory and transferred to the processor over the AXI bus (courtesy Synopsys).

Running the algorithms necessary for advanced AR systems solely on a CPU is likely an inefficient approach to solving the problem. Alternatively offloading portions of the algorithms to a GPU, which is likely already present in an AR system for rendering graphics, will notably speed up SLAM and CNN calculations compared to the CPU-only approach. However, relying on the GPU for both graphics and AR processing may sub-optimize the performance of both operations, and can also come at a significant power consumption penalty.

An alternative SoC architectural approach, which trades off incremental core-count complexity for per-core operational efficiency, involves allocating embedded vision algorithm processing to dedicated-function cores. Performance and power consumption optimization can be achieved, for example, by pairing a flexible CNN engine with a vector DSP. The vector DSP is designed to handle algorithms like SLAM, while the dedicated CNN engine supports common deep learning functions (convolutions, pooling, element-wise operations, etc.) and delivers silicon area and power consumption efficiency because it is custom-designed for these parameters.

Synopsys’ EV6x Embedded Vision Processor IP is one leading option for providing an optimized solution to address performance, power consumption and other SoC design concerns. The DesignWare EV61, EV62 and EV64 Embedded Vision Processors integrate a high-performance 32-bit scalar core with a 512-bit vector DSP, along with an optional CNN engine for accurate object detection, classification, and scene segmentation.

The vector DSPs are ideal for executing the SLAM algorithm and run independently of the CNN engine. The EV6x family delivers up to 4.5 TeraMACs/sec of CNN performance when implemented in 16-nm processes under typical conditions, and supports multiple camera inputs each with up-to-4K resolution. The processors are fully programmable and configurable, combining the flexibility of software solutions with the high performance and low power consumption of dedicated hardware.

Gordon Cooper
Product Marketing Manager, Synopsys

Software Optimizations

Creating an efficient implementation of an AR application for a single target hardware platform is challenging enough for a software developer; successfully doing so across a range of platforms is even more difficult. The following section, authored by PTC, offers suggestions for actualizing this aspiration, including proposing a middleware option.

The computational complexities to be solved in order to facilitate robust AR are significant and increasingly diverse, considering that the tasks involved include image processing for 3D reconstruction, high-speed motion estimation and prediction, and a variety of 2D and 3D object recognition algorithms, along with networking support for various cloud-based functions. This interplay between hardware and software becomes even more critical when you consider that AR applications often require additional processing resources, such as for intricate 3D renderings and elaborate user interfaces to create compelling and effective AR experiences. PTC's Vuforia Chalk application, for example, significantly burdens the camera-video codec pipeline and network stack in facilitating a rich collaborative AR experience between a local technician and a remote expert (Figure 4).


Figure 4. Vuforia Chalk supports marker-less tracking and annotation (courtesy PTC).

For AR app developers, device diversity is a high hurdle to surmount. Every model of SoC (sometimes even extending to individual devices within a given model) exhibits specific idiosyncrasies related to differing hardware configurations. And today’s operating systems and libraries don't sufficiently "hide" this differentiation for the precision and accuracy that AR demands, even within a particular vendor’s SoC offerings. A cross-platform AR application development kit, such as the Vuforia Engine, provides reliable performance and a consistent user experience, largely due to its extensive, underlying platform-specific implementations and calibrations.

As a leading AR middleware toolset for application developers, the Vuforia Engine has long provided a rich and technically advanced software abstraction layer for AR functionality, including target detection, marker-less tracking, and ground plane detection. The Vuforia Engine abstracts hardware and operating system software (Apple's iOS, Google's Android, Microsoft's Universal Windows Platform (UWP) and HoloLens, and products from ODG and Vuzix, to name just a few) but makes use of underlying hardware and software capabilities whenever possible, such as IMUs, GPUs, and the ARKit and ARCore APIs.

Optimizations aside, the Vuforia Engine still needs to be extremely resource-conscious.  As such, PTC collaborates with numerous partners to improve the aforementioned hardware-and-software interplay. Manufacturers of AR products, along with the sensors and ICs contained within them, will in the future see ever-increasing demand to implement optimized hardware capabilities as well as expose core sensor characteristics and measurements in order to facilitate ever-more-optimized middleware. These improvements encompass the camera and its calibration aspects, especially as optical image stabilization and multi-lens kits become increasingly commonplace.

To date, more than 60,000 commercial AR applications have been enabled by the Vuforia Engine, resulting in more than 600 million installs. PTC continually works closely with its community of more than 500,000 developers, along with shipping its own Vuforia platform applications for the industrial enterprise market. The following recommendations, derived from these experiences, will be useful to application developers:

  • Continually profile the application on all supported devices, including static and memory footprint analyses. Build it with profiling in mind, both for troubleshooting in the test lab and for real-time feedback in the end application. Application tuning in response to O/S notifications on device state (memory usage, operating temperature, network load, etc.) is critical for a satisfying user experience.
  • AR capabilities are particularly valued by users in mobile and field applications. Expect widely and rapidly varying network coverage; the application should be bandwidth-conscious and adapt quickly as connectivity conditions change over time.
  • Utilize optimized platform software when available, but closely evaluate fit and performance before committing. For example, cutting-edge O/S capabilities might not provide the necessary quality in their initial versions.
  • Hardware sensor particularities have a significant impact on tracking accuracy. Consider careful calibration with respect to temporal and spatial aspects for every hardware sensor model, not just for the integrated device model.
  • Given the diversity in hardware capabilities among platforms, such as between smartphones and digital eyewear, tailored approaches to providing AR capabilities should be employed in various cases. For example, extensive processing is required on a typical smartphone for 3D reconstruction, whereas the Microsoft HoloLens depth sensors and API deliver these capabilities for "free."
  • Don't underestimate the importance of the user interface. An AR experience combines the physical world, a digital experience and the movement of a human being in order to map content to the environment. Guiding the human simply, yet successfully, is crucial to application success.
  • AR delivers most value if the experience references real objects in the environment and offers interaction with them. Vuforia’s Model Targets, for example, precisely register a 3D model on industrial machinery, tying the digital and physical worlds together and enabling interaction both ways, via IoT sensors embedded in the machinery and contextual, virtual information displayed accurately on physical components (Figure 5).


Figure 5. Vuforia's Model Targets implement object detection and tracking based on a 3D CAD model (courtesy PTC).

Mathias Kölsch
Senior Director for Software Engineering, Vuforia, a PTC Technology

Implementation Case Study: A Head-mounted Display

As the introductory section of this article notes, HMDs such as the Meta 2 or products from DAQRI are one key target platform for resource-intensive AR, but (like other embedded systems covered by this article) HMDs are comparatively resource-constrained versus, say, a smartphone or tablet computer. The following section, authored by videantis, details a conceptual design that's applicable to both AR and "mixed reality" products.

A high-end HMD architecture, focusing here on the imaging and video paths (therefore excluding the IMU, audio, and haptics systems), includes numerous cameras (Figure 6). Image sensors are used, for example, for gaze tracking in order to discern where the user’s eyes are looking. This feature enables gaze contingency, wherein an AR screen display dynamically alters its regional graphical resolution and other rendering characteristics depending on where within it the viewer is looking. Another use for gaze tracking is to employ the eyes as a cursor pointer. Multiple cameras are also included for head tracking purposes, since outward-looking, camera-based trackers can deliver much more accurate results than IMUs alone.


Figure 6. A high-end HMD architecture block diagram includes multiple image sensors for a variety of purposes (courtesy videantis).

Additional outward-looking cameras are included to sense the user’s surroundings. Ideally, they will capture a full, real-time 3D reconstruction of the environment around the user. This way, virtual objects can be correctly aligned to, and "locked into place" within, the real environment. A video stream of what the user sees can also be captured, both to present a view of the surroundings on the displays in front of the user’s eyes (a pair of cameras may be needed to present a correct stereo view for the two eyes) and to forward to other (remote) users. And even more cameras can be included to capture the user’s body position and gestures (again, a video stream of the same could also be captured for forwarding to remote users).

Different HMD implementations may use fewer (or potentially even more) cameras than what’s shown here. VR headsets, a functional subset of the architecture shown, typically require only head, eye, and body/gesture tracking capabilities, for example. And although in an ideal system, you’d include a separate camera for each individual sensing and capturing task, a single camera can alternatively handle multiple functions, thereby reducing the total number of cameras in the system, lowering cost and power consumption, and making the headset smaller and lighter.

Keep in mind when attempting to use a single camera for multiple functions, however, that the tasks that the camera(s) are being used for fundamentally define their requirements. For example, the ideal location and position of each camera for each function in the headset may vary. Resolutions, frame rates, rolling versus global shutter mechanisms, monochrome versus color filters, and other attributes may also be dissimilar for different visual functions. A camera intended for sensing the user’s gestures, for example, is typically based on a monochrome depth sensor optimized for short distances, while a camera for head tracking needs higher resolution, a higher frame rate, and the ability to capture more distant scenes.

Keep in mind, too, that "behind" each camera is an embedded vision processing chain; the cumulative power consumption of all of these computational modules, along with associated memory and other components, can reach high levels quickly, not to mention the added power consumption of the cameras themselves. And even more power consumption (and related heat dissipation) comes from the significant graphics processing resources in the rendering path that generates high-quality, high-resolution 3D scenes in front of each eye. Finding off-the-shelf components that can provide the many required camera interfaces and associated compute resources and still fit in a reasonably small and lightweight headset is often challenging.

This reality has resulted in some HMD designers deciding to subdivide the total system into two portions: the headset itself, and a (wired, for now, potentially wireless in the future) connected computing module worn on the waist, with the latter allowing for a larger size and weight than what a user could wear (and bear) on the head (analogously, a non-portable AR system may employ a PC as the computing module). In such cases, the vision processing and rendering happens inside the computing module, thereby presenting a different sort of challenge. The multiple cameras' outputs need to be transmitted from the HMD to the computing module, and the resulting dual display streams from the computing module back to the HMD, over the tether between them. The data will likely need to be compressed prior to transmission, and decompressed after reception, in both directions and with ultra-low latency.

The videantis v-MP6000UDX visual computing architecture is a leading example of a processor that can fulfill multiple roles within an advanced AR, VR or "mixed reality" system. First, as previously mentioned, there’s the wealth of computer vision and deep learning processing tasks that take place "behind" each camera. Second, there are the ultra-low-latency, high-resolution video compression and decompression tasks needed in a subdivided configuration with separate HMD and computing module. And last, there’s the display rendering path, which requires a programmable imaging pipeline since eye-tracking-based display systems and their associated display algorithms are still rapidly evolving. Running all of these algorithms on a unified architecture such as the v-MP6000UDX is beneficial in numerous regards compared to the alternative of a different dedicated processing architecture for each task, an approach that would result in higher system complexity and extra data movement between the different processing systems.

Marco Jacobs
Vice President of Marketing, videantis

Implementation Case Study: An AR-based Automotive Heads-up Display

While you might not automatically think of an automobile when you hear the term "embedded system," developing automotive electronics involves dealing with notable design challenges. Cost is always a concern, of course, but size and weight are also critical factors when optimizing for vehicle range. And heat dissipation is equally difficult to deal with, especially as the computing intelligence included in vehicles grows exponentially over time. The following section, authored by NXP Semiconductors, provides a pertinent example, an AR-based heads-up display (HUD).

The fundamental premise (and promise) of AR in automobiles is to reduce the risk of accidents, by visually presenting relevant information in the driver's field of view while still enabling the driver to keep his or her eyes on the road ahead. HUDs based on various display technologies have been used in cars since the 1990s, and you might think that increasing vehicle autonomy over time might make their continued inclusion unnecessary. In actuality, however, the rapid recent advancements in ADAS and autonomous vehicle technologies are driving the need for more and more natural ways of presenting information to the human driver, both to reassure him or her that the vehicle is autonomously operating in an appropriate manner and to effectively alert when he or she needs to regain manual control.

Many implementation challenges exist for HUDs used in vehicles. A HUD needs to display content visible in full sunlight, for example, to "hover" the graphics sufficiently far ahead that the driver doesn’t need to change focus while driving in order to discern them, and to have compact dimensions in order to fit in the dashboard. Additionally, the bill-of-materials cost must be low, the HUD must operate without overheating and without need for active cooling, and it must run problem-free for many years. Latest-generation HUDs are also challenged to support a wide field of view (12 degrees or more) along with delivering full-color, high-frame-rate and high-resolution images.

Today, two primary HUD technologies exist: windshield projection and optical combiners. Reflecting projected data off the windshield delivers optimum coverage, but the implementation is complex, requiring tight integration with each specific vehicle. The exact windshield curvature must be known, for example, in order to deliver optimum quality results; the projector must also output high light intensity in order for the data to be visible on a sunny day. The alternative optical combiner approach is more straightforward from an integration standpoint, since the projection takes place onto a known simpler screen surface. That surface is usually also darker to provide contrast and reduce the sun's light intensity, so the projector doesn't need to be as intense as with the alternative windshield projection approach.  But the surface area covered by an optical combiner is often quite small.

Also, until very recently, arguably none of the HUD systems available in vehicles were "true" AR; while they projected information on top of the real-life scene ahead, they didn't attempt to integrate with it. However, with the evolution of ADAS technology, high-resolution mapping, and 3D graphics, it's now possible to project data such as navigation information overlaid with the real visual content of the 3D scene—highlighting the lane to follow, for example, or color coding a suggested speed—just as if that information were actually part of the real world. Additional information likely to be included in the near future includes data sourced from Vehicle-to-Everything (V2X) infrastructure deployments, such as when a traffic light will change, or variations in traffic flow ahead (including impending stoppages)—in other words, any and all information which can prevent accidents and more generally enable reaching destinations faster and easier.

The integration of rich data into the driver's field of vision requires a 3D understanding of the scene, which is typically achieved by computing SLAM (discussed earlier in this article) on the images coming from a camera mounted such that it looks ahead of the vehicle. The reconstruction of the scene allows for placement of graphics, such as a navigation path, precisely in 3D space. Such integration of graphics within the "real" environment is a much more intuitive (not to mention less distracting) approach for the driver than classical navigation based on audio-only feedback or "bird's-eye view" visual instructions.

The display of 3D graphics using AR technologies in vehicles is currently at an early deployment stage, exemplified by the European version of the Mercedes-Benz A-Class series. This system presents data on a dedicated display that also includes an augmented video feed of the view ahead of the vehicle; the data includes navigation directions, for example, along with buildings augmented by street address numbers. This approach, however, means that the driver needs to look at the display instead of at the road ahead through the windshield. A promising alternative approach comes from a company called WayRay, whose holographic display technology and software aspire to deliver true HUD-based AR. WayRay's approach is capable of presenting 3D graphics directly in the driver's field of view without the need for an HMD, as demonstrated in Rinspeed AG’s Oasis concept car (Figure 7).


Figure 7. WayRay's holographic HUD technology delivers rich AR data to the driver (courtesy WayRay).

Stéphane François
Software Program Manager, Vision and AI R&D, NXP Semiconductors

Conclusion

Vision technology is enabling a wide range of products that are more intelligent and responsive than before, and thus more valuable to users. Vision processing can add valuable capabilities to existing products, and can provide significant new markets for hardware, software and semiconductor suppliers. Computer vision-enabled AR is one notable innovation example, although its resource demands have historically been particularly challenging to implement in deeply embedded products. However, by making effective leverage of all available compute capabilities in the design, along with leveraging APIs, middleware and other software toolsets, these challenges are increasingly surmountable.

Brian Dipert
Editor-in-Chief, Embedded Vision Alliance
Senior Analyst, BDTI

Sidebar: Additional Developer Assistance

The Embedded Vision Alliance, a worldwide organization of technology developers and providers, is working to empower product creators to transform the potential of vision processing into reality. NXP Semiconductors, PTC, Synopsys and videantis, the co-authors of this article, are members of the Embedded Vision Alliance. The Embedded Vision Alliance's mission is to provide product creators with practical education, information and insights to help them incorporate vision capabilities into new and existing products. To execute this mission, the Embedded Vision Alliance maintains a website providing tutorial articles, videos, code downloads and a discussion forum staffed by technology experts. Registered website users can also receive the Embedded Vision Alliance’s twice-monthly email newsletter, Embedded Vision Insights, among other benefits.

The Embedded Vision Alliance’s annual technical conference and trade show, the Embedded Vision Summit, is intended for product creators interested in incorporating visual intelligence into electronic systems and software. The Embedded Vision Summit provides how-to presentations, inspiring keynote talks, demonstrations, and opportunities to interact with technical experts from Embedded Vision Alliance member companies. The Embedded Vision Summit is intended to inspire attendees' imaginations about the potential applications for practical computer vision technology through exciting presentations and demonstrations, to offer practical know-how for attendees to help them incorporate vision capabilities into their hardware and software products, and to provide opportunities for attendees to meet and talk with leading vision technology companies and learn about their offerings. The next Embedded Vision Summit is scheduled for May 20-23, 2019 in Santa Clara, California. Mark your calendars and plan to attend; more information, including online registration, will be available on the Embedded Vision Alliance website in the coming months.

The Embedded Vision Alliance also offers a free online training facility for vision-based product creators: the Embedded Vision Academy. This area of the Embedded Vision Alliance website provides in-depth technical training and other resources to help product creators integrate visual intelligence into next-generation software and systems. Course material in the Embedded Vision Academy spans a wide range of vision-related subjects, from basic vision algorithms to image pre-processing, image sensor interfaces, and software development techniques and tools such as OpenCL, OpenVX and OpenCV, along with Caffe, TensorFlow and other machine learning frameworks. Access is free to all through a simple registration process. And the Embedded Vision Alliance and its member companies also periodically deliver webinars on a variety of technical topics. Access to on-demand archive webinars, along with information about upcoming live webinars, is available on the Embedded Vision Alliance website.

OpenVX Implementations Deliver Robust Computer Vision Applications

Key to the widespread adoption of embedded vision is the ease of developing software that runs efficiently on a diversity of hardware platforms, with high performance, low power consumption and cost-effective system resource needs. In the past, this combination of objectives has been a tall order, since it has historically required significant code optimization for particular device architectures, thereby hampering portability to other architectures. Fortunately, this situation is changing with the maturation of the OpenVX standard created and maintained by the Khronos Group. This article provides implementation details of several design examples that leverage various capabilities of the standard.

OpenVX, an API from the Khronos Group, is an open standard for developing computer vision applications that are portable to a wide variety of computing platforms. It uses the concept of a computation graph to abstract the compute operations and data movement required by an algorithm, so that a broad range of hardware options can be used to execute the algorithm. An OpenVX implementation targeting a particular hardware platform translates the graph created by the application programmer into the instructions needed to execute efficiently on that hardware. Such flexibility means that the programmer will not need to rewrite his or her code when re-targeting new hardware, or to write new code specific to that hardware, making OpenVX a cross-platform API.
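As a minimal illustration of this graph-based model (a sketch using standard OpenVX 1.x calls, with placeholder image dimensions and error handling omitted), the following C fragment builds, verifies and executes a two-node graph:

#include <VX/vx.h>

int main(void) {
    vx_context context = vxCreateContext();
    vx_graph graph = vxCreateGraph(context);

    /* Input and output images; the intermediate image is "virtual", so the
       implementation is free to tile it or keep it in on-chip memory */
    vx_image input   = vxCreateImage(context, 640, 480, VX_DF_IMAGE_U8);
    vx_image blurred = vxCreateVirtualImage(graph, 640, 480, VX_DF_IMAGE_U8);
    vx_image output  = vxCreateImage(context, 640, 480, VX_DF_IMAGE_U8);

    /* Two standard nodes: Gaussian smoothing followed by a median filter */
    vxGaussian3x3Node(graph, input, blurred);
    vxMedian3x3Node(graph, blurred, output);

    /* Verification is where the implementation analyzes the graph and plans
       its execution (memory, data movement, target assignment) */
    if (vxVerifyGraph(graph) == VX_SUCCESS) {
        vxProcessGraph(graph);   /* typically called once per frame */
    }

    vxReleaseImage(&input);
    vxReleaseImage(&blurred);
    vxReleaseImage(&output);
    vxReleaseGraph(&graph);
    vxReleaseContext(&context);
    return 0;
}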

A previously published article in this series covered the initial v1.0 OpenVX specification and provided an overview of the standard's objectives, along with an explanation of its capabilities, as they existed in early 2016. This follow-on article showcases several case study examples of OpenVX implementations in various applications, leveraging multiple hardware platforms along with both traditional and deep learning computer vision algorithms. And it introduces readers to an industry alliance created to help product creators incorporate practical computer vision capabilities into their hardware and software, along with outlining the technical resources that this alliance provides (see sidebar "Additional Developer Assistance").

A companion article focuses on more recent updates to the OpenVX API, up to and including latest v1.2 of the specification and associated conformance tests, along with the recently published set of extensions that OpenVX implementers can optionally provide. It also discusses the optimization opportunities available with SoCs' increasingly common heterogeneous computing architectures.

Implementation Case Study: Tiling and Vendor Custom Kernels

As the introductory section of this article notes, one key benefit of OpenVX is its ability to enable the development of computer vision applications that are portable to a wide variety of computing platforms. The following section, authored by Cadence, details an example design that leverages a vision DSP core in conjunction with a vendor-developed custom OpenVX extension for acceleration of traditional computer vision algorithms.

This section provides a detailed case study of how to leverage the graph-based OpenVX API in order to automatically implement tile-based processing, a particularly important optimization for DSP architectures. It describes a commercially available OpenVX v1.1-compliant implementation of automated tile-based processing for Cadence Tensilica Vision DSPs. In addition to automating tiling optimizations for the standard OpenVX kernels, the Cadence Tensilica implementation extends the OpenVX API to enable automated tiling of user-defined kernels that run on the Vision DSP. Tile management is handled entirely by Cadence’s OpenVX extension, so the developer can focus on the kernel computation, not the complexities of memory management and data movement.

Tiling

One of the most important considerations for a DSP, in order to deliver low power consumption and high performance, is to enable it to efficiently handle data movement. In most cases, the DSP includes relatively small, fast, tightly coupled memories (TCM), as well as a set of DMA channels to efficiently handle transfers between system memory (usually DDR SDRAM) and the tightly coupled local memories.

"Tiling" is a key optimization, particularly when processing images. Tiling consists of breaking up each image into a set of sub-images, called tiles. Instead of executing the OpenVX graph at full-image granularity, node after node, the tiling approach instead involves executing the graph tile by tile. This technique allows the DSP to keep intermediate tiles in its TCM; the degree of tiling efficiency therefore heavily depends on the data access pattern of the algorithm.

In general, the union of all tiles needs to cover the entire image surface. In certain situations, however, there will be overlap between tiles of the same image, which means that some pixels may be fetched or computed twice (or more) (Figure 1). Such overlap occurs when larger input tiles are needed to compute an output tile of a given size (the neighborhood kernel property), and this constraint propagates across the graph. A typical example is the NxM convolution kernel, for which generating an output tile of a certain size will require a larger input tile, also encompassing what is commonly called a neighborhood or halo of additional pixels.


Figure 1. The union of all tiles needs to cover the entire image surface; in certain situations, there will be overlap between tiles (courtesy Cadence).
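To make the neighborhood requirement concrete, the following small helper (illustrative arithmetic only, not part of any OpenVX or Cadence API) computes the input tile size needed to produce an output tile for an NxM convolution kernel:

/* Input tile dimensions required to produce an out_w x out_h output tile
   with an N-wide by M-tall convolution kernel (illustrative only) */
typedef struct { int width, height; } tile_dims_t;

static tile_dims_t input_tile_for_convolution(int out_w, int out_h, int N, int M) {
    tile_dims_t in;
    int halo_x = N / 2;   /* extra pixels needed on the left and on the right */
    int halo_y = M / 2;   /* extra pixels needed on the top and on the bottom */
    in.width  = out_w + 2 * halo_x;
    in.height = out_h + 2 * halo_y;
    return in;
}

/* Example: a 5x3 kernel and a 128x64 output tile require a 132x66 input tile. */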

Tiling and OpenVX

In OpenVX, many of the kernels that operate on an image are not simple pixel-wise functions, and more generally, tiling is a complex task. Multiple parameters require consideration, such as the data access pattern of the algorithm implemented by the OpenVX node (including the neighborhood), the graph connectivity, the fact that some kernels may scale images, and the available local memory size. Some nodes may also not be tilable, or may be only partially tilable (i.e., the input image is tilable but the output image is not). Also, computer vision is not all about images; other data structures, such as arrays (containing keypoints, for example), also find use. When a neighborhood is required for tiles, specific actions, which depend on the node's border mode, must be performed when the tile is at the edge of the image. All of these factors require consideration in order to ensure proper graph execution.

Cadence's OpenVX implementation supports tiling for standard kernels. In OpenVX, the tiling optimization can be done efficiently, thanks to the graph construct that upfront provides the dataflow relationships of the vision algorithm. This "ahead of time" insight is a fundamental advantage for OpenVX in comparison to other libraries such as OpenCV. The tiling optimization process happens at OpenVX graph verification time. The recent "export and import" OpenVX extension will also optionally enable this optimization process to be performed offline on the workstation (depending on the particular use case, either an online or offline solution may be preferable).

Tiling and Extensibility

Cadence's Tensilica Vision DSPs are hardware- and instruction set-extensible, and customers may therefore be interested in extending their OpenVX implementations with custom vision kernels. This is the case even though OpenVX v1.2 has significantly expanded the number of standard kernels available; application developers may still want to add their own kernels, either to implement a unique algorithm they’ve developed or (in the case of Tensilica Vision DSPs) to take advantage of hardware they've designed and included in the DSP.

The OpenVX standard has from the beginning supported so-called "user-defined kernels." The API is primarily designed to enable the execution of user kernels on the host (i.e., the CPU). Also, OpenVX user kernels operate at full-image granularity, not at tile-level granularity. The OpenVX working group has published a provisional tiling extension specification as an initial effort to standardize an approach that's more suitable for accelerators such as DSPs and GPUs. However, as currently defined, the extension is limited in terms of the scope and types of vision kernel patterns it can model.

In order to support a broader set of computer vision patterns, Cadence has developed a proprietary tilable custom kernel extension, called the vendor custom kernel (VCK), which enables users to create their own kernels running on the DSP, seamlessly integrated with the Cadence OpenVX framework. With VCK, kernels may or may not be tilable, depending on user choice. For tilable kernels, the Cadence implementation of the OpenVX framework handles tile management when these kernels are instantiated as nodes in an OpenVX graph. Automated features include:

  • Tile sizing, which is dependent on the OpenVX graph structure and the memory usage by other parts of the system (runtime, other kernels, etc.)
  • Communication between the host and the DSP
  • DSP local memory allocations
  • DMA data copy between system and local memories
  • Handling the image border mode

In terms of the OpenVX API, VCK extends the concept of OpenVX user kernels. Standard OpenVX user nodes can still be used to create a kernel to be executed on the host processor. VCK expands the Cadence Tensilica Vision DSP's capabilities to execute user-defined kernels via the OpenVX API.

Vendor Custom Kernel Model

As with standard user kernels, VCK kernels are registered to the OpenVX framework by defining a set of properties and callbacks. The VCK registration API follows the same approach as the standard user kernel API and reuses most of its functions. However, two major implementation differences exist:

  • VCK kernels have both a host-side and an accelerator/DSP-side form; kernel registration takes place on both sides (Figure 2).
  • VCK supports the notion of tilable images.


Figure 2. VCK kernel registration occurs on both the host side and the accelerator/DSP side (courtesy Cadence).

On the Vision DSP, VCK kernels are implemented with a set of callback functions that will be called at OpenVX graph execution time (Figure 3):

  • The main "processing" function: Its purpose is to process input tiles and generate output tiles.
  • An (optional) ‘start’ function: It's called at graph execution, before any of the tiles start to be processed. It allows for initialization.
  • An (optional) ‘end’ function: It's called at graph execution, after all tile processing has completed. It allows for post-processing, such as when the VCK kernel implements a reduction algorithm.


Figure 3. VCK kernels are implemented with a set of callback functions called at OpenVX graph execution time (courtesy Cadence).

In addition, the VCK developer must provide a kernel dispatch function that will be common to all VCK kernels. The role of this dispatch function is to provide the address of the callback corresponding to an identifier. This way, the Vision DSP OpenVX runtime knows how to call ‘start’, ‘process’ and ‘end’ callbacks when the OpenVX graph is executed.

The VCK kernel is registered on the host, similar to the OpenVX user kernel, before instantiating it as a node in an OpenVX graph. As part of registration, a user needs to provide the following:

  • A unique identifier and name
  • The list of kernel parameters, their type, their direction, and whether they are mandatory or optional
  • A parameter validation callback
  • (Optionally) a valid region computation callback
  • (Optionally) init and deinit callbacks

In comparison to the OpenVX user kernel, VCK includes some additions:

  • Tilable parameters: When the parameter of the kernel is declared with type TENVX_TYPE_IMAGE_TILABLE, this indicates that the image is tilable. For tilable input parameters, a neighborhood region can also be optionally declared. It specifies how many extra pixels on each side of the tile are needed to compute the output. This can be set as the kernel parameter attribute, TENVX_KERNEL_PARAM_TILE_NEIGHBORHOOD. The OpenVX framework will always ensure that this neighborhood region is correctly allocated and initialized, taking into account the case when a border mode is requested for the node.
  • Local memory: The execution of a VCK may require some fast temporary memory, which Cadence refers to as "local memory," on the DSP side. Local memory is allocated by the OpenVX implementation and provided at the execution of the VCK node callbacks. This memory is always allocated in the Vision DSP’s TCM, not in external system memory. The DSP local memory request is made by setting the standard VX_KERNEL_LOCAL_DATA_SIZE kernel attribute if the amount is constant, or VX_NODE_LOCAL_DATA_SIZE if the amount depends on some aspect(s) of the node parameters (such as the width or height of the input image); see the sketch after this list. Local memory is persistent across the node execution; the same memory will be given to the start, process and end DSP callback functions.
  • "Start" and "end" identifiers: Each callback on the DSP side has a unique identifier. In case the VCK kernel includes a "start" and/or "end" DSP callback, the additional identifier(s) need(s) to be provided.

VCK Code Example

The following example implements a simple S16 erode kernel that computes each output pixel as the minimum pixel value of a 5x3 box centered on the same pixel coordinate in the input image.

The kernel identifier must be derived from a kernel library identifier, itself being derived from the company vendor ID. In case the VCK kernel also has a Vision DSP 'start' and/or 'end' callback, a separate ID should also be allocated for each of them.

Kernel identifiers
// Pick a library ID
#define MY_LIB_VCK (VX_KERNEL_BASE(MY_VENDOR_ID, 0))

// Allocate kernel and DSP callback IDs
enum VCK_SIMPLE_EXAMPLE_KERNEL_IDS {
    // XI lib interop example
    MY_KERNEL_ERODE_5X3_S16 = MY_LIB_VCK,
};

The kernel host registration differs from standard OpenVX user nodes in that it uses a different registration function (tenvxAddVCKernel), a different parameter type (TENVX_TYPE_IMAGE_TILABLE) and an extra kernel attribute set (TENVX_KERNEL_PARAM_TILE_NEIGHBORHOOD and TENVX_KERNEL_REQ_BORDER). The parameter validation callback, called at OpenVX graph verification time, is the same for VCK and standard user kernels (it is not described in detail here).

#define ERODE_X (5 / 2)
#define ERODE_Y (3 / 2)

kernel_erode = tenvxAddVCKernel(context,
                             TENVX_TARGET_VISION_XI,
                             "my.company.vck.erode_5x3_s16",
                              MY_KERNEL_ERODE_5X3_S16,
                              2,   // Two parameters
                              node_parameter_validation, NULL,
                              NULL,  // No host init/deinit
                              TENVX_KERNEL_NO_ENTRY, TENVX_KERNEL_NO_ENTRY // No DSP start/end
);

// Register kernel parameters
vxAddParameterToKernel(kernel_erode, 0, VX_INPUT, TENVX_TYPE_IMAGE_TILABLE, VX_PARAMETER_STATE_REQUIRED);
vxAddParameterToKernel(kernel_erode, 1, VX_OUTPUT, TENVX_TYPE_IMAGE_TILABLE, VX_PARAMETER_STATE_REQUIRED);

// Request a neighborhood to the tile corresponding to the erode properties
tenvx_neighborhood_size_t nbh;
nbh.left = nbh.right = ERODE_X; nbh.top = nbh.bottom = ERODE_Y;
tenvxSetKernelParamAttribute(kernel_erode, 0, TENVX_KERNEL_PARAM_TILE_NEIGHBORHOOD, &nbh, sizeof(nbh));

// Request the Replicate border mode
vx_border_t border;
border.mode = VX_BORDER_REPLICATE;
vxSetKernelAttribute(kernel_erode, TENVX_KERNEL_REQ_BORDER, &border, sizeof(border));

// Finalize the kernel registration
vxFinalizeKernel(kernel_erode);

As with standard OpenVX user nodes, VCKs are instantiated as OpenVX nodes by calling vxCreateGenericNode and setting parameters one by one. To simplify the use of such nodes, it is usual to provide a node creation function.

vx_node erode5x3S16Node(vx_graph graph, vx_image in, vx_image out) {
    vx_node node = vxCreateGenericNode(graph, kernel_erode);

    if (vxGetStatus((vx_reference)node) == VX_SUCCESS) {
        vxSetParameterByIndex(node, 0, (vx_reference)in);
        vxSetParameterByIndex(node, 1, (vx_reference)out);
    }

    return node;
}

The main Vision DSP callback function takes an array of void * elements as input. It is the responsibility of the user code to cast this address to the type corresponding to the kernel parameter type.

#define ERODE_X (5 / 2)
#define ERODE_Y (3 / 2)

vx_status erode_process(void **args) {
    vck_vxTile in = (vck_vxTile)args[0];
    vck_vxTile out = (vck_vxTile)args[1];

    // Input and output images have the same format, tiles then have the same size and position in their respective image
    vx_uint32 tile_width = vck_vxGetTileWidth(in, 0);
    vx_uint32 tile_height = vck_vxGetTileHeight(in, 0);

    // Perform the Erode operation
    int x, y, xx, yy;
    vx_uint8 *line_in_ptr8 = (vx_uint8 *)vck_vxGetTilePtr(in, 0);
    vx_size tilePitchIn = vck_vxGetTilePitch(in, 0); // In bytes
    vx_uint8 *line_out_ptr8 = (vx_uint8 *)vck_vxGetTilePtr(out, 0);
    vx_size tilePitchOut = vck_vxGetTilePitch(out, 0); // In bytes

    for (y = 0; y < tile_height; y++) {
        vx_uint8 *pixel_in_ptr8 = line_in_ptr8;
        vx_int16 *pixel_out_ptr16 = (vx_int16 *)line_out_ptr8;

        for (x = 0; x < tile_width; x++) {
            // Compute the output value
            vx_int16 value = 0x7FFF;

            for (yy = -ERODE_Y; yy <= ERODE_Y; yy++) {
                for (xx = -ERODE_X; xx <= ERODE_X; xx++) {
                    vx_int16 *pixel_ptr = ((vx_int16 *)(pixel_in_ptr8 + yy * tilePitchIn)) + xx;
                    value = *pixel_ptr < value ? *pixel_ptr : value;
                }
            }

            // Write the output pixel
            *pixel_out_ptr16 = value;

            // Next pixel
            pixel_in_ptr8 += sizeof(vx_int16);
            pixel_out_ptr16++;
        }

        // Next line
        line_in_ptr8 += tilePitchIn;
        line_out_ptr8 += tilePitchOut;
    }

    return VX_SUCCESS;
}

The registration of kernel callbacks is done by registering a dispatch callback with the VCK kernel library via the vck_vxRegisterDispatchCallback function, prior to the application calling any OpenVX API functions. The registered callback will be called whenever a kernel with an ID derived from the library ID needs to be executed.

// VCK kernel library dispatch callback
static vck_vx_kernel_f vck_kernellib_callback(vx_enum kernel_id) {
    switch (kernel_id) {
    // Only one kernel in this simple example
    case MY_KERNEL_ERODE_5X3_S16:
        return &erode_process;

    default:
        return NULL;
    }
}

// VCK kernel library registration
void register_vck_lib(void) {
        vck_vxRegisterDispatchCallback(MY_LIB_VCK, &vck_kernellib_callback);
}

The previous discussion and code samples show one example of how the OpenVX API can be used to enable programmers to efficiently leverage powerful and highly optimized hardware without writing extensive hardware-specific (i.e., non-portable) code. Developers who are unfamiliar with a given architecture (in this case, Cadence Tensilica Vision DSPs with tightly-coupled memories and DMA engines) can nonetheless reap the benefits of high performance and efficient architectures via the OpenVX API. This combination of efficiency and portability is made possible by the OpenVX API's graph-based nature.

Thierry Lepley
Software Architect, Cadence Design Systems

Implementation Case Study: Deep Learning Inference Acceleration

Newer versions of both the base OpenVX specification and its extensions include support for the increasingly popular deep learning-based methods of computer vision algorithm development. The following section, authored by VeriSilicon, details an example design that leverages hardware acceleration of deep learning inference for computer vision.

Deep learning support is one of the major focuses for Khronos, beginning with OpenVX v1.2. CNN inferencing for computer vision is a common use case. Using the Faster R-CNN model as an example, this section will explore how OpenVX in combination with an embedded processor, such as one of VeriSilicon's VIP cores, enables CNN inference acceleration (Figure 4).


Figure 4. A Faster R-CNN model exemplifies how OpenVX, in combination with a GPU, enables CNN inference acceleration (courtesy Khronos).

Converting Floating-point CNN Models to 8-bit Integer CNN Models

Today's CNN frameworks primarily focus on floating-point data types, whereas most embedded devices favor integer data in order to achieve higher performance and lower power consumption, as well as to require lower bandwidth and storage capacity. The OpenVX vx_tensor object has been developed to naturally support the dynamic fixed-point integer data type. Specifically, the 8-bit dynamic fixed-point integer data type has been widely adopted by the industry for CNN inference acceleration.

The OpenVX neural network extension is an inference-dedicated execution API focusing on embedded devices, with 8-bit integer required as the mandatory data type. In this example, the original Faster R-CNN model, using the float data type, is created by the Caffe framework. Subsequent model quantization from float32 to int8 requires additional tools, provided by silicon vendors and/or 3rd party developers. A slight accuracy loss compared to the original float model is to be expected due to quantization. Supplemental tools and techniques (such as re-training) can restore some amount of accuracy; these particular topics are not, however, currently within the OpenVX Working Group's scope of attention.
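To illustrate the dynamic fixed-point idea itself (this is generic arithmetic, not any particular vendor's quantization tool), the following sketch converts between float32 and 8-bit dynamic fixed-point values for a given fractional length:

#include <math.h>
#include <stdint.h>

/* Quantize one float to 8-bit dynamic fixed-point with fractional length fl:
   q = clamp(round(x * 2^fl), -128, 127). Production tools additionally choose
   fl per tensor (e.g., from the maximum absolute value observed) and may
   re-train to recover lost accuracy. */
static int8_t quantize_dfp8(float x, int fl) {
    float scaled = roundf(x * ldexpf(1.0f, fl));   /* x * 2^fl */
    if (scaled >  127.0f) scaled =  127.0f;
    if (scaled < -128.0f) scaled = -128.0f;
    return (int8_t)scaled;
}

/* Recover an approximate float value from the quantized representation */
static float dequantize_dfp8(int8_t q, int fl) {
    return ldexpf((float)q, -fl);                  /* q * 2^-fl */
}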

Port CNN Models to OpenVX CNN Graphs

The graph is a fundamental algorithm structure common to both OpenVX and CNN frameworks. The OpenVX neural network extension defines standard neural network nodes to build CNN graphs. Application developers can, if they choose, write OpenVX applications that include CNN graphs built from scratch, using OpenVX neural network extension nodes.

Alternatively, it's possible to port conventional CNN models, pre-trained using CNN frameworks, to OpenVX CNN graphs (Figure 5 and Table 1). The OpenVX export and import extension is then used to load (as a subgraph) each resultant OpenVX CNN graph, merging it into an OpenVX application global graph.


Figure 5. Khronos extensions and interoperability standards are key elements in porting CNN models to OpenVX CNN graphs (courtesy Khronos).

Faster R-CNN layers (graph nodes) | OpenVX NN nodes
convolution | vxConvolutionLayer()
ReLU activation | vxActivationLayer()
Local normalization | vxNormalizationLayer()
Pooling | vxPoolingLayer()
Softmax | vxSoftmaxLayer()
FullyConnect | vxFullyConnectedLayer()

Table 1. Mapping Faster R-CNN graph nodes to OpenVX neural network extension node functions.

Each framework has unique model files used to describe the topology graph and associated weight/bias binaries. Khronos' NNEF (Neural Network Exchange Format) standard is designed to act as a bridge between various model descriptions. Intermediary NNEF models can be ported to OpenVX CNN graphs by using Khronos-supplied tools. Some vendors also provide tools that support direct (no NNEF bridge required) porting from frameworks' CNN models to OpenVX CNN graphs.

Deep learning is a fast-paced, rapidly evolving technology area. It should not be a surprise, therefore, to encounter neural network nodes beyond the scope of existing OpenVX neural network extension definitions (Table 2). Some special neural network nodes might also be suitable for CPUs, for example. Fortunately, OpenVX comprehends vendor-customized kernels and user kernels in order to support additional neural network nodes.

Faster R-CNN layers (graph nodes) | Vendor/User node functions
Reshape | Vendor node. A function to change memory layout.
Region Proposal | User node. A function including sorting, suitable for CPUs.
Dropout | Training-only. A no-op function for inferencing.

Table 2. Example Faster R-CNN graph nodes beyond the scope of OpenVX v1.2's neural network extension.

OpenVX CNN Graph Optimization

Two general types of graph optimizations are possible: at the application development level, and at the vendor implementation level. Vendor-level optimizations are usually more device-specific, relying on sophisticated graph analysis and leveraging proprietary hardware acceleration capabilities. Application-level optimizations conversely are more use-case specific, relying on deeper knowledge about application tasks. It's generally recommended to begin optimization work at the application level.

General-purpose OpenVX graph optimization techniques can also benefit CNN graphs. The following CNN-specific graph optimization techniques continue this case study's use of the Faster R-CNN model example. A conventional Faster R-CNN graph, for instance, includes Dropout nodes, which the inferencing process can disregard. The OpenVX Faster R-CNN graph can therefore remove Dropout nodes as a simplification step.

The CNN graph node is designed to process tensor type data objects. Tensor fragments are not amenable to vendor implementation-level optimizations, however. Therefore, at the application graph level, developers should as much as possible merge nodes that share the same input. In the Faster R-CNN graph example, the last two fully connected nodes share the same input, and can alternatively be merged as a single fully connected node (Figure 6). In this optimized configuration, the soft-max node can operate on the partial output tensor of the merged fully connected node. Such node merges also can reduce bandwidth requirements.


Figure 6. Merging two fully connected nodes is one example of a Faster R-CNN graph optimization (courtesy Khronos).

Packed tensors can also enable more straightforward optimizations at the subsequent vendor implementation level. And the tensor packing concept can be more broadly applied to all Faster R-CNN fully connected nodes. Without such optimizations, Faster R-CNN fully connected nodes are sequentially applied to individual ROI (region of interest) 1-D tensors, one by one, in a repeating loop (Figure 7). The more efficient alternative approach packs all ROIs together as a 2-D tensor, with the repeating fully connected operations consequently equivalent to a one-time 1x1 convolution operation. Performance can therefore be significantly improved.


Figure 7. Additional Faster R-CNN graph optimization opportunities come from replacing repeating fully-connected nodes with 1x1 convolution nodes (courtesy Khronos).
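The equivalence behind this optimization can be stated compactly (a sketch using generic symbols rather than the exact Faster R-CNN dimensions): applying the same fully connected weights to each ROI vector individually produces the same values as one 1x1 convolution over the packed tensor.

y_i = W x_i + b, \qquad i = 1, \dots, R, \quad x_i \in \mathbb{R}^{C}, \; W \in \mathbb{R}^{K \times C}

\text{Packing } X_{r,1,c} = (x_r)_c: \quad Y = \mathrm{conv}_{1 \times 1}(X, W) + b, \qquad Y_{r,1,k} = \sum_{c=1}^{C} W_{k,c}\, X_{r,1,c} + b_k = (y_r)_k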

Node merging and graph tiling are typical techniques undertaken at the vendor implementation level. Compared to traditional vision processing graphs, CNN graphs typically exhibit well-defined backbone structures with duplicated primitives. Therefore, node merging and tiling optimizations can be relatively straightforward to implement at the vendor implementation level. Performance can also be accurately predicted with the use of vendor profiling tools.

Xin Wang
Senior Director, Vision Architecture, VeriSilicon

Conclusion

Vision technology is enabling a wide range of products that are more intelligent and responsive than before, and thus more valuable to users. Vision processing can add valuable capabilities to existing products, and can provide significant new markets for hardware, software and semiconductor suppliers. Key to the widespread adoption of embedded vision is the ease of developing software that runs efficiently on a diversity of hardware platforms, with high performance, low power consumption and cost-effective system resource needs. In the past, this combination of objectives has been challenging, since it has historically required significant code optimization for particular device architectures, thereby hampering portability to other architectures. Fortunately, this situation is changing with the maturation of the OpenVX standard created and maintained by the Khronos Group.

Brian Dipert
Editor-in-Chief, Embedded Vision Alliance
Senior Analyst, BDTI

Sidebar: Additional Developer Assistance

The Embedded Vision Alliance, a worldwide organization of technology developers and providers, is working to empower product creators to transform the potential of vision processing into reality. Cadence Design Systems, Intel, NXP Semiconductors and VeriSilicon, the co-authors of this series of articles, are members of the Embedded Vision Alliance. The Embedded Vision Alliance's mission is to provide product creators with practical education, information and insights to help them incorporate vision capabilities into new and existing products. To execute this mission, the Embedded Vision Alliance maintains a website providing tutorial articles, videos, code downloads and a discussion forum staffed by technology experts. Registered website users can also receive the Embedded Vision Alliance’s twice-monthly email newsletter, Embedded Vision Insights, among other benefits.

The Embedded Vision Alliance’s annual technical conference and trade show, the Embedded Vision Summit, is intended for product creators interested in incorporating visual intelligence into electronic systems and software. The Embedded Vision Summit provides how-to presentations, inspiring keynote talks, demonstrations, and opportunities to interact with technical experts from Embedded Vision Alliance member companies. The Embedded Vision Summit is intended to inspire attendees' imaginations about the potential applications for practical computer vision technology through exciting presentations and demonstrations, to offer practical know-how for attendees to help them incorporate vision capabilities into their hardware and software products, and to provide opportunities for attendees to meet and talk with leading vision technology companies and learn about their offerings. The next Embedded Vision Summit is scheduled for May 20-23, 2019 in Santa Clara, California. Mark your calendars and plan to attend; more information, including online registration, will be available on the Embedded Vision Alliance website in the coming months.

The Embedded Vision Alliance also offers a free online training facility for vision-based product creators: the Embedded Vision Academy. This area of the Embedded Vision Alliance website provides in-depth technical training and other resources to help product creators integrate visual intelligence into next-generation software and systems. Course material in the Embedded Vision Academy spans a wide range of vision-related subjects, from basic vision algorithms to image pre-processing, image sensor interfaces, and software development techniques and tools such as OpenCL, OpenVX and OpenCV, along with Caffe, TensorFlow and other machine learning frameworks. Access is free to all through a simple registration process. And the Embedded Vision Alliance and its member companies also periodically deliver webinars on a variety of technical topics. Access to on-demand archive webinars, along with information about upcoming live webinars, is available on the Embedded Vision Alliance website.

OpenVX Enhancements, Optimization Opportunities Expand Vision Software Development Capabilities

Key to the widespread adoption of embedded vision is the ease of developing software that runs efficiently on a diversity of hardware platforms, with high performance, low power consumption and cost-effective system resource needs. In the past, this combination of objectives has been a tall order, since it has historically required significant code optimization for particular device architectures, thereby hampering portability to other architectures. Fortunately, this situation is changing with the maturation of the OpenVX standard created and maintained by the Khronos Group. This article discusses recent evolutions of the standard, along with the benefits and details of implementing it on heterogeneous computing platforms.

OpenVX, an API from the Khronos Group, is an open standard for developing high performance computer vision applications that are portable to a wide variety of computing platforms. It uses the concept of a computation graph to abstract the compute operations and data movement required by an algorithm, so that a broad range of hardware options can be used to execute the algorithm. An OpenVX implementation targeting a particular hardware platform translates the graph created by the application programmer into the instructions needed to execute efficiently on that hardware. Such flexibility means that the programmer will not need to rewrite his or her code when re-targeting new hardware, or to write new code specific to that hardware, making OpenVX a cross-platform and heterogeneous computing-supportive API.

A previously published article in this series covered the initial v1.0 OpenVX specification and provided an overview of the standard's objectives, along with an explanation of its capabilities, as they existed in early 2016. This follow-on article focuses on more recent updates to OpenVX, up to and including latest v1.2 of the specification and associated conformance tests, along with the recently published set of extensions that OpenVX implementers can optionally provide. It also discusses the optimization opportunities available with SoCs' increasingly common heterogeneous computing architectures. And it introduces readers to an industry alliance created to help product creators incorporate practical computer vision capabilities into their hardware and software, along with outlining the technical resources that this alliance provides (see sidebar "Additional Developer Assistance").

A companion article showcases several case study examples of OpenVX implementations in various applications, leveraging multiple hardware platforms along with both traditional and deep learning computer vision algorithms.

OpenVX Enhancements

The following section was authored by Cadence, representing Khronos' OpenVX Working Group.

The primary additions to the OpenVX API found in v1.2 involve feature detection, image processing, and conditional execution. Version 1.2 also introduces the tensor object (vx_tensor), which has been added primarily in support of the neural network extension but is also used in several of the new base functions. In addition to explaining these new capabilities, this section will discuss several recently defined extensions, including functionality for image classification (using both traditional and deep-learning methods) and safety-critical applications.

New Feature-detection Capabilities

The feature-detection capabilities found in OpenVX v1.2 include the popular histogram of oriented gradients (HOG) and local binary pattern (LBP) detectors, along with template matching and line finding. These algorithms are useful in tasks such as pedestrian detection, face recognition and lane detection for advanced driver assistance systems (ADAS).

HOG is implemented in OpenVX via two main functions, vxHOGCellsNode and vxHOGFeaturesNode. The vxHOGCellsNode function divides an input image into a number of grid cells and calculates the HOG information for each cell, which consists of gradient orientation histograms and average gradient magnitudes. This is a measure of how much "edge" there is in each cell and the direction of the edges. The information computed by vxHOGCellsNode for these cells can then be fed into a vxHOGFeaturesNode function, which looks at groups of cells and computes HOG "descriptors" that describe the pattern of edges in each area of the image. A set of these descriptors is sometimes called a "feature map."

The feature map is then fed into a classifier that has been trained to tell if the pattern matches what you’re looking for (such as pedestrians). OpenVX offers a classifier API as an extension, with functions to import, use, and release a classifier object. The extension includes a vxScanClassifierNode function to create an OpenVX node that takes a feature map as input and outputs a list of rectangles around the detected objects (Figure 1).


Figure 1. OpenVX's classifier API extension includes a function that takes a feature map as input and outputs a list of rectangles around the detected objects (courtesy Pexels).

LBP is useful for detecting objects, sometimes in conjunction with HOG, as well as in texture classification and face recognition. The OpenVX node for LBP is vxLBPNode, which takes an input image and a few tuning parameters and outputs an image of the same size with the value of the LBP descriptor at every point in the image.

The template matching function compares two images and computes a measure of their similarity. One of the input images is referred to as the source image, and the other, typically much smaller image is referred to as the template image. The corresponding OpenVX function is vxMatchTemplateNode; like many OpenVX functions, it is based on a similar function in OpenCV. vxMatchTemplateNode takes as input the source and template images, plus a parameter that indicates which of six available comparison functions to use. It outputs an image in which each pixel is a measure of the similarity of the template to the source image at that point in the source image.

The line-finding feature detector in OpenVX v1.2 implements the probabilistic Hough transform. The function, called vxHoughLinesPNode, takes an image and various tuning parameters as inputs, outputting a list of detected lines.

New Image-processing Features

OpenVX v1.2 adds three useful image-processing functions: a generalized nonlinear filter, non-maximum suppression, and a bilateral filter. These image-processing functions have images as both their input and output, and are often used to enhance image quality or refine feature map images.

The vxNonLinearFilterNode generalized nonlinear filter function accepts, in addition to the input and output image parameters, two additional input parameters, "mask" and "function," to describe the specific filter desired. The mask parameter defines the shape of the filter, which can be a box, cross, a disk of various sizes, or even an arbitrary shape defined by the user.

The available choices for the function parameter are median, min and max. The median filter is a standard image-processing operation usually used to reduce noise. For each pixel in the output image, the median value within the area of the input image defined by the mask is taken to be the output value. The procedure for the min and max functions is similar, but with the substitution of the min or max value instead of the median value. The max function corresponds to the standard dilate image-processing operation, while the min function corresponds to erode. Dilate and erode are usually used to refine and clean up binary images, but can be applied to gray-scale images as well.
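As a brief usage sketch (standard OpenVX v1.1/v1.2 calls; the context, graph and image objects are assumed to exist already, and exact signatures should be checked against the specification), a 5x5 box-shaped median filter node could be added to a graph as follows:

/* 5x5 box-shaped median filter node (sketch; error checking omitted) */
vx_matrix mask = vxCreateMatrixFromPattern(context, VX_PATTERN_BOX, 5, 5);
vx_node median = vxNonLinearFilterNode(graph,
                                       VX_NONLINEAR_FILTER_MEDIAN, /* or _MIN / _MAX */
                                       input_image, mask, output_image);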

The non-maximum suppression image-processing function is generally used on the output images from feature detectors, which often highlight several pixels in the area of a feature detection. Non-maximum suppression reduces these clumps of detections into a single detection at the maximum point. The associated OpenVX function, vxNonMaxSuppressionNode, accepts an input image, a “window” parameter, and an optional mask to limit where the suppression is performed. It produces an output image in which all of the pixels in a clump, except for the maximum, are set to zero or to the most negative value.

A bilateral filter is a popular image-processing operation that is often referred to as “edge-preserving,” because it can significantly reduce noise and smooth an image without blurring the edges. The associated OpenVX function is vxBilateralFilterNode. The input and output images for this function are encapsulated in the new vx_tensor object, since the bilateral filter can operate on multiple image channels. In addition to the input image, the user provides the bilateral filter with tuning parameters for “diameter” and “sigma” to control the degree of smoothing.

Control Flow and Conditional Execution

Support for control flow and conditional execution is the major new framework feature introduced in OpenVX v1.2. Previously, during graph execution, all nodes in the graph were executed every time the graph was processed, potentially adding unnecessary processing overhead to the application. In contrast, the new control flow and conditional execution features included in OpenVX v1.2 allow the OpenVX application to check various conditions and determine what processing to perform based on them. Prior to OpenVX v1.2, control flow operations could only be executed on the host processor. The conditional execution feature of OpenVX v1.2 enables execution of an if-then-else control flow on the target hardware without need for intervention from the host processor.

The conditional execution feature is implemented in OpenVX v1.2 via the vxScalarOperationNode and vxSelectNode functions. vxScalarOperationNode enables the user to construct and test conditions using simple arithmetic, comparison, and logical operations (add, subtract, greater-than, logical AND, logical OR, and so on). It takes two scalar input operands and an operation type as input, and produces a scalar output; such nodes can be combined to produce arbitrary expressions. vxSelectNode takes as input a Boolean object, which is usually the output of a vxScalarOperationNode operation, along with three object references: a true value, a false value, and an output object. These references must all be to objects of the same type and size. When the graph containing these nodes is executed, either the true value or the false value is copied into the output object, depending on the value of the Boolean object (Figure 2).


Figure 2. Support for control flow and conditional execution is the major new framework feature introduced in OpenVX v1.2 (courtesy Cadence).

The subgraph in Figure 2 forms an if-then-else structure, where, depending on the condition calculated in A, the output of the select node S is either the output of B or the output of C. In other words, if A then S ← B, else S ← C. In analyzing this subgraph, the OpenVX implementation can notice that if the condition is true, the output of C is not used, so node C doesn't need to be executed (and analogously, if A is false, B doesn't need to be executed). In this particular example, B and C are just single nodes, but the same facility can be used to skip arbitrary chains of nodes based on conditions computed in the graph, again potentially without intervention by the host processor.
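The following sketch assembles this if-then-else pattern with the v1.2 control-flow nodes; the comparison operation and object names are illustrative choices, not part of the figure.

```c
#include <VX/vx.h>

/* Build the if-then-else of Figure 2: a scalar comparison ("A") drives a
 * select node ("S") that forwards either image B or image C to the output. */
void add_conditional(vx_context context, vx_graph graph,
                     vx_scalar measured, vx_scalar threshold,
                     vx_image image_b, vx_image image_c, vx_image selected)
{
    /* Condition A: true when measured > threshold. */
    vx_bool   init = vx_false_e;
    vx_scalar cond = vxCreateScalar(context, VX_TYPE_BOOL, &init);
    vxScalarOperationNode(graph, VX_SCALAR_OP_GREATER, measured, threshold, cond);

    /* Select S: copies image_b when cond is true, image_c otherwise.
     * A conformant implementation is free to skip whichever branch's
     * producer nodes are not needed. */
    vxSelectNode(graph, cond,
                 (vx_reference)image_b, (vx_reference)image_c,
                 (vx_reference)selected);
}
```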

Finally, as mentioned earlier in the context of the bilateral filter function, OpenVX v1.2 introduces a vx_tensor object. The base specification for OpenVX 1.2 includes functions for element-wise tensor addition and subtraction, element-wise table lookup, transpose, matrix multiply, and bit-depth conversion. The neural network extension, discussed shortly, also uses the vx_tensor object extensively.
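As a brief illustration, the sketch below creates three Q7.8 fixed-point tensors and adds two of them element-wise; the dimensions and data type are illustrative.

```c
#include <VX/vx.h>

/* Create three Q7.8 fixed-point tensors (INT16 with fixed-point position 8)
 * and add two of them element-wise with saturation on overflow. */
vx_node add_tensors(vx_context context, vx_graph graph)
{
    vx_size   dims[3] = {224, 224, 3};
    vx_tensor a   = vxCreateTensor(context, 3, dims, VX_TYPE_INT16, 8);
    vx_tensor b   = vxCreateTensor(context, 3, dims, VX_TYPE_INT16, 8);
    vx_tensor sum = vxCreateTensor(context, 3, dims, VX_TYPE_INT16, 8);

    return vxTensorAddNode(graph, a, b, VX_CONVERT_POLICY_SATURATE, sum);
}
```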

Frank Brill
Design Engineering Director, Cadence Design Systems

OpenVX Extensions

The following section was authored by Intel, representing Khronos' OpenVX Working Group.

Successful ongoing development of the main OpenVX specification depends on stability and consensus, both of which take notable time and effort. When a technology is developing rapidly, however, engineers benefit from a faster path to product development. In response, Khronos created the extension mechanism to bring new features to fruition quickly. Khronos extensions still go through a ratification process, but that process is more streamlined than for the main specification, which can dramatically shorten the wait for new functionality.

To date, the OpenVX Working Group has published multiple extensions, in both finalized (Table 1) and provisional (Table 2) forms.

Extension Name   OpenVX Versions   Description
vx_khr_nn        1.2               Neural network extension
vx_khr_ix        1.2, 1.1          Export and import extension for graphs & objects

Table 1. OpenVX Finalized Extensions

Extension Name         OpenVX Versions   Description
vx_khr_import_kernel   1.2, 1.1          Import kernel from vendor binary
vx_khr_pipelining      1.2, 1.1          Pipelining, streaming, and batch processing extension
vx_khr_opencl          1.2, 1.1          OpenCL interop extension
vx_khr_class           1.2               Classifier extension
vx_khr_s16             1.1               S16 extension
vx_khr_tiling          1.0               Tiling extension
vx_khr_xml             1.0               XML schema extension
vx_khr_icd             1.0               Installable client driver extension

Table 2. OpenVX Provisional Extensions

Neural Network Extension

The neural network extension enables the use of convolutional neural networks (CNNs) inside an OpenVX graph. The neural network layers are represented as OpenVX nodes, connected by multi-dimensional tensor objects. Supported layer types include convolution, activation, pooling, fully-connected, soft-max, etc. These CNN nodes can also mix with traditional vision algorithm nodes (Figure 3).


Figure 3. OpenVX graphs can mix CNN nodes with traditional vision nodes (courtesy Khronos).

With cost- and power consumption-sensitive embedded system targets in mind, implementations of this extension must support the tensor data types INT16, INT7.8 (Q7.8 fixed point), and UINT8. Conformance tests allow some tolerance in precision in order to enable optimizations such as weight compression. Example supported networks include OverFeat, AlexNet, GoogLeNet, ResNet, DenseNet, SqueezeNet, LSTM, RNN/BiRNN, Faster R-CNN, FCN, and various Inception versions.
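The fragment below is a hedged sketch of how a convolution-plus-ReLU pair might be expressed with this extension. The structure fields and enum names follow the vx_khr_nn specification, but the header name, padding values, and tensor shapes are assumptions that should be checked against a given implementation.

```c
#include <VX/vx.h>
#include <VX/vx_khr_nn.h>   /* assumed header name for the NN extension */

/* Express a convolution layer followed by a ReLU activation as two OpenVX
 * nodes connected by tensors. The weight, bias, and output tensors are
 * assumed to have been created with shapes matching the trained model. */
void add_conv_relu(vx_graph graph,
                   vx_tensor input, vx_tensor weights, vx_tensor biases,
                   vx_tensor conv_out, vx_tensor relu_out)
{
    vx_nn_convolution_params_t p = {
        .padding_x                = 1,
        .padding_y                = 1,
        .overflow_policy          = VX_CONVERT_POLICY_SATURATE,
        .rounding_policy          = VX_ROUND_POLICY_TO_NEAREST_EVEN,
        .down_scale_size_rounding = VX_NN_DS_SIZE_ROUNDING_FLOOR,
        .dilation_x               = 0,
        .dilation_y               = 0
    };

    vxConvolutionLayer(graph, input, weights, biases, &p, sizeof(p), conv_out);
    vxActivationLayer(graph, conv_out, VX_NN_ACTIVATION_RELU, 0.0f, 0.0f, relu_out);
}
```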

Export and Import Extension

Every OpenVX implementation comes with both a graph optimizer and an execution engine. The graph optimizer can be computationally "heavy," akin to running a compiler. Its use is therefore not only burdensome but also unnecessary for use cases such as the following:

  • Embedded systems which use fixed graphs to minimize the required amount of code
  • Safety-critical systems where the OpenVX library does not have a node creation API
  • CNN extensions that require the ability to import pre-compiled binary objects

To optimize OpenVX for use cases such as these, Khronos subdivides the overall OpenVX workflow into "development" and "deployment" stages (Figure 4). The export and import extension provides mechanisms both to export a pre-compiled binary from the development stage and to import that binary into a deployed application; a sketch of the export call appears after Figure 4.


Figure 4. The overall OpenVX workflow subdivides into development and deployment stages (courtesy Khronos).
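The fragment below is a hedged sketch of the export half of this workflow, serializing a verified graph into a vendor-specific binary blob for later re-import with vxImportObjectsFromMemory. The function name, signature, and VX_IX_USE_EXPORT_VALUES enum follow the vx_khr_ix specification, but should be confirmed against the extension header of the implementation in use.

```c
#include <VX/vx.h>
#include <VX/vx_khr_ix.h>   /* assumed header name for the export/import extension */

/* Development stage: serialize a verified graph into an implementation-
 * defined binary blob that a deployed application can later re-import. */
vx_status export_graph(vx_context context, vx_graph graph,
                       const vx_uint8 **blob, vx_size *blob_size)
{
    vx_reference refs[1] = { (vx_reference)graph };
    vx_enum      uses[1] = { VX_IX_USE_EXPORT_VALUES };

    /* The blob format is vendor-specific; only the exporting vendor's
     * runtime is guaranteed to re-import it successfully. */
    return vxExportObjectsToMemory(context, 1, refs, uses, blob, blob_size);
}
```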

Provisional Extensions

Several OpenVX provisional extensions have already been developed by the OpenVX Working Group, with others to potentially follow. Provisional extensions, as their name implies, provide a means for engineers to begin their implementation development and provide feedback to the OpenVX Working Group before the extension specification is finalized. Recent provisional extensions include the pipelining extension, the OpenCL interop extension, the kernel import extension, and the classifier extension.

Heterogeneous computing architectures are increasingly dominant in modern devices, and obtaining peak utilization and throughput from heterogeneous hardware requires the ability to schedule multiple jobs by means of pipelining or batching. The pipelining extension provides a mechanism to feed multiple values to a given graph input, either simultaneously or as a stream, enabling the implementation to achieve higher hardware utilization and throughput. Figure 5 shows an example OpenVX graph and the mapping of its nodes to heterogeneous parallel hardware: executing the graph without pipelining underutilizes the compute resources and yields sub-optimal throughput, whereas pipelined execution delivers higher throughput and utilization. A sketch of the extension's queue-based programming pattern follows the figure.




Figure 5. This example demonstrates mapping of nodes to hardware (top). OpenVX graph execution without pipelining will result in underutilization of compute resources, with consequent sub-optimal throughput (middle). Conversely, pipelining will deliver higher throughput and utilization (bottom) (courtesy Khronos).
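The fragment below is a hedged sketch of the enqueue/dequeue pattern defined by this provisional extension, assuming the graph already exposes its input image as graph parameter 0. The function and type names follow the provisional vx_khr_pipelining specification and may change before finalization; the header name and buffer count are assumptions.

```c
#include <stddef.h>
#include <VX/vx.h>
#include <VX/vx_khr_pipelining.h>   /* assumed header name for the provisional extension */

#define NUM_BUFS 3

/* Configure graph parameter 0 as a queue fed from NUM_BUFS rotating input
 * buffers, prime the pipeline, and recycle one completed buffer. */
void run_pipelined(vx_graph graph, vx_image in_bufs[NUM_BUFS])
{
    vx_graph_parameter_queue_params_t q = {
        .graph_parameter_index = 0,
        .refs_list_size        = NUM_BUFS,
        .refs_list             = (vx_reference *)in_bufs
    };
    vxSetGraphScheduleConfig(graph, VX_GRAPH_SCHEDULE_MODE_QUEUE_AUTO, 1, &q);
    vxVerifyGraph(graph);

    /* Prime the pipeline so several frames are in flight at once. */
    for (vx_uint32 i = 0; i < NUM_BUFS; i++)
        vxGraphParameterEnqueueReadyRef(graph, 0, (vx_reference *)&in_bufs[i], 1);

    /* When a buffer completes, dequeue it, refill it with the next frame,
     * then enqueue it again to keep the hardware busy. */
    vx_reference done   = NULL;
    vx_uint32    n_done = 0;
    vxGraphParameterDequeueDoneRef(graph, 0, &done, 1, &n_done);
    vxGraphParameterEnqueueReadyRef(graph, 0, &done, 1);
}
```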

The OpenCL interop extension enables efficient data exchange between OpenVX and an OpenCL application or user kernel. Using the interop extension, an OpenCL-based application will be able to efficiently export OpenCL data objects into the OpenVX environment, as well as to access OpenVX data objects as OpenCL data objects. Another key feature is fully asynchronous host-device operation, an important capability for data exchanges.

The kernel import extension provides a means of importing an OpenVX kernel from a vendor binary URL. Unlike the previously discussed export and import extension, this extension (as its name suggests) only focuses on importing kernel objects, which can be sourced directly from vendor-specific binaries pre-compiled offline using either vendor- or third-party-supplied toolsets. One compelling example use case for this extension involves importing a pre-compiled neural network model as an OpenVX kernel object, which can then be instantiated in multiple different graphs.
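The sketch below illustrates the idea. The header name, type string, and binary path are hypothetical vendor-specific values, while vxCreateGenericNode and vxSetParameterByIndex are standard OpenVX calls used to instantiate and wire up the imported kernel.

```c
#include <VX/vx.h>
#include <VX/vx_khr_import_kernel.h>   /* assumed header name for the provisional extension */

/* Import a pre-compiled neural network as an OpenVX kernel, then
 * instantiate it as a node and bind its input and output tensors. */
vx_node import_cnn_node(vx_context context, vx_graph graph,
                        vx_tensor input, vx_tensor output)
{
    /* "vendor_binary" and the URL are placeholders for whatever the
     * vendor's offline toolset produces. */
    vx_kernel cnn = vxImportKernelFromURL(context, "vendor_binary",
                                          "file:///opt/models/model.bin");

    vx_node node = vxCreateGenericNode(graph, cnn);
    vxSetParameterByIndex(node, 0, (vx_reference)input);
    vxSetParameterByIndex(node, 1, (vx_reference)output);
    return node;
}
```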

Finally, the classifier extension enables the deployment of image classifiers based on methods such as support vector machines (SVMs) and decision-tree cascades.

Radhakrishna Giduthuri
Deep Learning Architect, Intel

OpenVX and Heterogeneous SoCs

The prior section of this article conceptually discussed the benefits of heterogeneous computing architectures containing multiple parallel and function-optimized processing elements, along with the extension mechanisms available to support them. The following section, authored by NXP Semiconductors, discusses an example implementation of the concept based on the company's SoCs targeting ADAS (advanced driver assistance systems) and fully autonomous vehicles.

The large-scale deployment of vision-based ADAS and autonomous vehicle technology is made possible through the use of specialized SoCs that meet the automotive quality, performance, functionality, availability, and cost targets required by OEMs. These specialized SoCs, such as the NXP S32V family, are highly heterogeneous in nature, including (for example) multiple ARM processor cores, both programmable and hardwired vision accelerators, and GPUs (Figure 6). Programming such SoCs to optimally leverage all available resources is not trivial. Fortunately, OpenVX offers a number of benefits that make the task much easier.


Figure 6. Highly integrated SoCs provide abundant heterogeneous computing opportunities (courtesy NXP Semiconductors).

The development of vision algorithms on ARM processor cores and the GPU is relatively straightforward, enabled by the use of standard programming languages and APIs. However, vision accelerators (both programmable and hardwired) are largely proprietary architectures. Even if an accelerator supplier provides a custom library or API with explicit names for various vision processing operations, it can be difficult to assess what result these functions actually produce, since many variations are possible. This variety is of particular concern when the goal is an optimized implementation of various computer vision algorithms.

In contrast, since the foundation OpenVX API includes both well-defined specifications and conformance tests, a user knows exactly what results to expect from all computer vision operations defined by OpenVX (note, however, that OpenVX provisional extensions are not similarly covered by formal conformance). Leveraging the OpenVX standard API as a hardware abstraction layer therefore enables users to invoke proprietary vision processing hardware straightforwardly, with full confidence in the results, provided the implementation is OpenVX-conformant.

Inherent in this process is the optional ability, for performance and load balancing reasons, to explicitly indicate which heterogeneous SoC resource should perform a particular computation. A user might designate an Optical Flow Pyramid function to run on a vision accelerator, for example, while a Table Lookup function concurrently executes on one or multiple ARM cores. Recall that an OpenVX graph is a collection of vision function nodes connected by data objects (e.g., images). A node encapsulates a kernel, which is an implementation of a vision function on a specific target. And a target can either be physical (such as a GPU, CPU or vision accelerator) or logical. The target of a particular node may either be explicitly specified by the developer or dynamically selected by the framework at graph verification time.
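A minimal sketch of this kind of explicit target assignment follows; the target strings are hypothetical, since actual target names are implementation-specific and documented by the SoC vendor.

```c
#include <VX/vx.h>

/* Pin an optical-flow pyramid node to a vision accelerator while a table
 * lookup node runs on the ARM cores; unassigned nodes are placed by the
 * framework at graph-verification time. */
void assign_targets(vx_node optflow_node, vx_node lut_node)
{
    vxSetNodeTarget(optflow_node, VX_TARGET_STRING, "vision-accelerator");
    vxSetNodeTarget(lut_node, VX_TARGET_STRING, "cpu");
}
```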

In the context of ADAS and autonomous vehicles, it's also important to note the availability of a safety critical (SC) version of the OpenVX specification. The SC variant of the standard enables implementations that meet ISO 26262 functional safety requirements. One key aspect of this version of OpenVX is its ability to import pre-verified graphs, meaning that no runtime graph build system validation is required. OpenVX SC also removes the vxu library (which invokes a runtime graph build); such operations need to be avoided for functional safety purposes.

Stéphane François
Software Program Manager, Vision and AI R&D, NXP Semiconductors

Conclusion

Vision technology is enabling a wide range of products that are more intelligent and responsive than before, and thus more valuable to users. Vision processing can add valuable capabilities to existing products, and can provide significant new markets for hardware, software and semiconductor suppliers. Key to the widespread adoption of embedded vision is the ease of developing software that runs efficiently on a diversity of hardware platforms, with high performance, low power consumption and cost-effective system resource needs. In the past, this combination of objectives has been challenging, since it has historically required significant code optimization for particular device architectures, thereby hampering portability to other architectures. Fortunately, this situation is changing with the maturation of the OpenVX standard created and maintained by the Khronos Group.

Brian Dipert
Editor-in-Chief, Embedded Vision Alliance
Senior Analyst, BDTI

Sidebar: Additional Developer Assistance

The Embedded Vision Alliance, a worldwide organization of technology developers and providers, is working to empower product creators to transform the potential of vision processing into reality. Cadence Design Systems, Intel, NXP Semiconductors and VeriSilicon, the co-authors of this series of articles, are members of the Embedded Vision Alliance. The Embedded Vision Alliance's mission is to provide product creators with practical education, information and insights to help them incorporate vision capabilities into new and existing products. To execute this mission, the Embedded Vision Alliance maintains a website providing tutorial articles, videos, code downloads and a discussion forum staffed by technology experts. Registered website users can also receive the Embedded Vision Alliance’s twice-monthly email newsletter, Embedded Vision Insights, among other benefits.

The Embedded Vision Alliance’s annual technical conference and trade show, the Embedded Vision Summit, is intended for product creators interested in incorporating visual intelligence into electronic systems and software. The Embedded Vision Summit provides how-to presentations, inspiring keynote talks, demonstrations, and opportunities to interact with technical experts from Embedded Vision Alliance member companies. The Embedded Vision Summit is intended to inspire attendees' imaginations about the potential applications for practical computer vision technology through exciting presentations and demonstrations, to offer practical know-how for attendees to help them incorporate vision capabilities into their hardware and software products, and to provide opportunities for attendees to meet and talk with leading vision technology companies and learn about their offerings. The next Embedded Vision Summit is scheduled for May 20-23, 2019 in Santa Clara, California. Mark your calendars and plan to attend; more information, including online registration, will be available on the Embedded Vision Alliance website in the coming months.

The Embedded Vision Alliance also offers a free online training facility for vision-based product creators: the Embedded Vision Academy. This area of the Embedded Vision Alliance website provides in-depth technical training and other resources to help product creators integrate visual intelligence into next-generation software and systems. Course material in the Embedded Vision Academy spans a wide range of vision-related subjects, from basic vision algorithms to image pre-processing, image sensor interfaces, and software development techniques and tools such as OpenCL, OpenVX and OpenCV, along with Caffe, TensorFlow and other machine learning frameworks. Access is free to all through a simple registration process. And the Embedded Vision Alliance and its member companies also periodically deliver webinars on a variety of technical topics. Access to on-demand archive webinars, along with information about upcoming live webinars, is available on the Embedded Vision Alliance website.
