# Embedded Vision Alliance: Technical Articles

## ARM Guide to OpenCL Optimizing Canny Edge Detection: Optimization Process

**Register or sign in to access the Embedded Vision Academy's free technical training content.**

The training materials provided by the Embedded Vision Academy are offered free of charge to everyone. All we ask in return is that you register, and tell us a little about yourself so that we can understand a bit about our audience. As detailed in our Privacy Policy, we will not share your registration information, nor contact you, except with your consent.

Registration is free and takes less than one minute. Click here to register, and get full access to the Embedded Vision Academy's unique technical training content.

If you've already registered, click here to sign in.

### See a sample of this page's content below:

This chapter describes some further optimizations to the kernel.

## Useful mathematical properties of convolution

Convolution operations can be optimized in several ways. One of the most obvious optimization targets is the large number of repeated reads that the unoptimized kernel performs. These reads also use large floating-point data types in the early stages of the process.

For example, on average a 3 x 3 convolution loads each pixel nine times. This happens because each time the convolution is moved to a new pixel, P, the kernel fetches every neighbor. This fetch occurs even if the pixel values have been loaded before.

The following shows example verbose compiler logs for an unoptimized 5 x 5 convolution process and 3 x 3 convolution process.

```
Entry point: __llvm2lir_entry_convolution5x5_static
16 work registers used (with spilling), 3 uniform registers used

Pipelines:                               A / L / T / Overall
Number of instruction words emitted:     219 + 81 + 0 = 300
Number of cycles for shortest code path: 113.5 / 81 / 0 = 113.5 (A bound)
Number of cycles for longest code path:  117 / 81 / 0 = 117 (A bound)

Entry point: __llvm2lir_entry_convolution3x3_static
15 work registers used, 4 uniform registers used

Pipelines:                               A / L / T / Overall
Number of instruction words emitted:     84 + 34 + 0 = 118
Number of cycles for shortest code path: 41 / 34 / 0 = 41 (A bound)
Number of cycles for longest code path:  44.5 / 34 / 0 = 44.5 (A bound)
```

These logs suggest that both of these processes are ALU bound, because many arithmetic operations are repeated. To explore this problem and possible solutions, start with a simple example. The following figure shows a simple separable convolution matrix.

**Figure 4-1 Example separable matrix**

This convolution matrix is similar to the matrices used in Canny edge detection, because it is separable. This means that it can be split into two sequential convolutions, F_{h} and F_{v}. The following figure shows the two separated parts of the convolution matrix.

**Figure 4-2 Separable matrix parts**

This means that the program can apply the two separated convolution matrices sequentially instead of applying the larger convolution matrix once. The following figure shows the application of the unseparated convolution matrix to an example set of pixels.

...
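The separability property described above can be verified numerically. The following sketch (NumPy, not part of the ARM guide; the kernel values are illustrative) shows that one pass with the full 2D matrix and two passes with its separated vectors produce identical results, while the separated form needs 6 multiplies per pixel instead of 9:

```python
import numpy as np

def conv2d(img, k):
    """Direct 'valid'-region 2D correlation; works as convolution here
    because the example kernels are symmetric."""
    kh, kw = k.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(img[y:y + kh, x:x + kw] * k)
    return out

# A separable 3 x 3 kernel is the outer product of a column and a row vector.
fv = np.array([[1.0], [2.0], [1.0]])     # vertical part F_v
fh = np.array([[1.0, 2.0, 1.0]])         # horizontal part F_h
k2d = fv @ fh                            # full 3 x 3 matrix: 9 multiplies/pixel

rng = np.random.default_rng(0)
img = rng.random((8, 8))

full = conv2d(img, k2d)                  # one 2D pass
separated = conv2d(conv2d(img, fv), fh)  # two 1D passes: 3 + 3 multiplies/pixel

assert np.allclose(full, separated)      # both forms give the same result
```

The arithmetic saving grows with kernel size: a separable 5 x 5 convolution needs 10 multiplies per pixel instead of 25.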

## Deep Learning with INT8 Optimization on Xilinx Devices

*This is a reprint of a Xilinx-published white paper which is also available here (1 MB PDF).*

Xilinx INT8 optimization provides the best performance and most power-efficient computational techniques for deep learning inference. Xilinx's integrated DSP architecture can achieve 1.75X the solution-level performance of other FPGA DSP architectures for INT8 deep learning operations.

## ABSTRACT

The intent of this white paper is to explore INT8 deep learning operations implemented on the Xilinx DSP48E2 slice, and how this contrasts with other FPGAs. With INT8, Xilinx's DSP architecture can achieve 1.75X the peak solution-level performance, in INT8 deep learning operations per second (OPS), of other FPGAs with the same resource count. Because deep learning inference can exploit lower bit precision without sacrificing accuracy, efficient INT8 implementations are needed.

Xilinx's DSP architecture and libraries are optimized for INT8 deep learning inference. This white paper describes how the DSP48E2 slice in Xilinx's UltraScale and UltraScale+ FPGAs can be used to process two concurrent INT8 multiply and accumulate (MACC) operations while sharing the same kernel weights. It also explains why 24-bit is the minimal size for an input to utilize this technique, which is unique to Xilinx. The white paper also includes an example of this INT8 optimization technique to show its relevance by revisiting the fundamental operations of neural networks.

## INT8 for Deep Learning

Deep neural networks have propelled an evolution in machine learning fields and redefined many existing applications with new human-level AI capabilities. While more accurate deep learning models have been developed, their complexity brings high compute and memory bandwidth challenges. Power efficiency is driving innovation toward deep learning inference models that require lower compute intensity and memory bandwidth without sacrificing accuracy or throughput. Reducing this overhead will ultimately increase power efficiency and lower the total power required.

In addition to saving power during computation, lower bit-width compute also lowers the power needed for memory bandwidth, because fewer bits are transferred with the same amount of memory transactions.

Research has shown that floating-point computations are not required to maintain accuracy in deep learning inference^{1,2,3}, and many applications, such as image classification, require only INT8 or lower fixed-point compute precision to maintain acceptable inference accuracy^{2,3}. **Table 1** shows fine-tuned networks with dynamic fixed-point parameters and outputs for convolutional and fully connected layers. The numbers in parentheses indicate accuracy without fine-tuning.

| Network | Layer Outputs | CONV Parameters | FC Parameters | 32-Bit Floating-Point Baseline | Fixed-Point Accuracy |
| --- | --- | --- | --- | --- | --- |
| LeNet (Exp1) | 4-bit | 4-bit | 4-bit | 99.1% | 99.0% (98.7%) |
| LeNet (Exp2) | 4-bit | 2-bit | 2-bit | 99.1% | 98.8% (98.0%) |
| Full CIFAR-10 | 8-bit | 8-bit | 8-bit | 81.7% | 81.4% (80.6%) |
| SqueezeNet top-1 | 8-bit | 8-bit | 8-bit | 57.7% | 57.1% (55.2%) |
| CaffeNet top-1 | 8-bit | 8-bit | 8-bit | 56.9% | 56.0% (55.8%) |
| GoogLeNet top-1 | 8-bit | 8-bit | 8-bit | 68.9% | 66.6% (66.1%) |

**Table 1: CNN Models with Fixed-Point Precision**

### Notes:

- Source: Gysel et al., Hardware-oriented Approximation of Convolutional Neural Networks, ICLR 2016 (Reference 2)

## INT8 Deep Learning on Xilinx DSP Slices

Xilinx's DSP48E2 slice is designed to do one multiplication and addition operation efficiently within one clock cycle, with up to 18x27-bit multiplication and up to 48-bit accumulation, as shown in **Figure 1**. By looping back to itself or by chaining multiple DSP slices together, multiply-accumulate (MACC) operations can also be performed efficiently on Xilinx devices.

**Figure 1: DSP Slice with MACC Mode**

When running INT8 computations, the wide 27-bit input is an innate advantage. In traditional applications, the pre-adder is usually used to implement (A+B) x C computations efficiently, but this type of computation is rarely seen in deep learning applications. Separating the result of (A+B) x C into A x C and B x C allows each product to be accumulated in a separate dataflow, which fits the typical deep learning computation requirement.

Having an 18x27-bit multiplier is an advantage for INT8 deep learning operations. To perform two INT8 MACCs concurrently on one DSP slice, at a minimum one of the inputs to the multiplier needs to be at least 24 bits wide and the carry accumulator needs to be 32 bits wide. The 27-bit input can be combined with the 48-bit accumulator to achieve a 1.75X deep learning solution performance improvement (a 1.75:1 ratio of INT8 deep learning MACCs to DSP multipliers). FPGAs from other vendors have only an 18x19 multiplier in a single DSP block and are limited to a 1:1 ratio of INT8 MACCs to DSP multipliers.

### Scalable INT8 Optimization

The goal is to find a way to efficiently encode the inputs a, b, and c so that the multiplication results can be easily separated into a x c and b x c.

In a reduced-precision computation such as INT8 multiplication, the upper 10 bits (of the 18-bit input) or 19 bits (of the 27-bit input) are filled with 0s or 1s and carry only one bit of information: the sign. The same is true of the upper 29 bits of the final 45-bit product. It is therefore possible to use the upper 19 bits to carry another computation while leaving the results of the lower 8-bit and 16-bit inputs unaffected.

Generally, two rules must be followed to utilize the unused upper bits for another computation:

- Upper bits should not affect the computation of the lower bits.
- Any contamination of the upper bits by the computation of the lower bits must be detectable and recoverable.

To satisfy these rules, the least significant bit of the upper product must not fall into the lower 16 bits, so the upper input must start at bit 17 or higher. For an 8-bit upper input, that requires a minimum total input size of 16 + 8 = 24 bits. This minimum 24-bit input size guarantees two concurrent multiplications with one multiplier, but it is still not enough to reach the overall 1.75X MACC throughput.

Following are the steps to compute ac and bc in parallel in one DSP48E2 slice, which is used as an arithmetic unit with a 27-bit pre-adder (both inputs and outputs are 27-bits-wide) and a 27x18 multiplier. See **Figure 2**.

- Pack the 8-bit inputs a and b into the 27-bit port p of the DSP48E2 multiplier via the pre-adder, so that the two bit-vectors are as far apart as possible.

The input a is left-shifted by only 18 bits, so that two sign bits of a appear in the 27-bit result of the first term; this prevents overflow in the pre-adder when b < 0 and a = –128. That the shift amount for a is 18, the width of the DSP48E2 multiplier port B, is coincidental.

**Figure 2: 8-Bit Optimization**

- Use the DSP48E2 27x18 multiplier to compute the product of the packed 27-bit port p and an 8-bit coefficient c represented in 18 bits in two's complement format. The resulting 45-bit product is the sum of two 44-bit terms in two's complement format: ac left-shifted by 18 bits, and bc.

The post adder can be used to accumulate the above 45-bit product, which contains separable upper and lower product terms. Correct accumulations are carried for the upper and lower terms while accumulating the single 45-bit product. The final accumulation results, if not overflowed, can be separated by simple operations.

The limitation of this technique is the number of product terms each DSP slice can accumulate. With only 2 bits remaining between the lower and upper product terms (**Figure 3**), accumulation of up to seven product terms can be guaranteed with no overflow into the lower bits. Beyond seven product terms, an additional DSP slice is required to extend this limit. As a result, eight DSP slices perform 7x2 INT8 multiply-add operations, 1.75X the INT8 deep learning operations of competitive devices with the same number of multipliers.
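As a bit-level sanity check of this packing scheme, the following sketch (plain Python, not from the white paper; function names are illustrative) emulates packing two INT8 operands into one 27-bit port via an 18-bit shift, multiplying by a shared weight, accumulating seven products, and separating the two sums afterward:

```python
def pack_mac(pairs, weights):
    """Accumulate sum((a << 18) + b) * c for up to seven INT8 (a, b) pairs
    and shared INT8 weights c, emulating one DSP48E2 MACC chain."""
    assert len(pairs) <= 7, "beyond 7 terms the 2 guard bits can overflow"
    acc = 0
    for (a, b), c in zip(pairs, weights):
        p = (a << 18) + b        # pre-adder packs a (shifted) and b into port p
        acc += p * c             # one 27x18-style multiply, then accumulate
    return acc

def separate(acc):
    """Recover sum(a*c) and sum(b*c) from the packed accumulator."""
    lo = acc & ((1 << 18) - 1)   # lower 18-bit field holds sum(b*c)
    if lo >= 1 << 17:            # sign-extend the lower field
        lo -= 1 << 18
    hi = (acc - lo) >> 18        # the remainder, shifted down, is sum(a*c)
    return hi, lo

# Worst-case INT8 values still separate correctly across seven terms,
# because |7 * 128 * 128| = 114688 < 2^17.
pairs = [(-128, -128)] * 7
weights = [127] * 7
ac, bc = separate(pack_mac(pairs, weights))
assert ac == 7 * (-128 * 127) and bc == 7 * (-128 * 127)
```

An eighth term could push the lower sum past the 2^17 boundary, which is exactly why the white paper adds one extra DSP slice per seven accumulations.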

There are many variations of this technique, depending on the requirements of actual use cases. Convolutional neural networks (CNNs) with rectified linear unit (ReLU) activations produce non-negative activations, and the unsigned INT8 format provides one more bit of precision and a 1.78X peak throughput improvement.

**Figure 3: Packing Two INT8 Multiplications with a Single DSP48E2 Slice**

### Compute Requirements for CNN

Modern neural networks are mostly derived from the original perceptron model^{4}. See **Figure 4**.

**Figure 4: Perceptron and Deep Neural Networks**

Although quite evolved from the standard perceptron structure, the basic operations of modern deep learning, also known as deep neural networks (DNNs), are still perceptron-like, but in wider ensembles and more deeply stacked perceptron structures. **Figure 4** also shows the basic operation of a perceptron, through multiple layers, ultimately repeated millions to billions of times in a typical deep learning inference. As shown in **Figure 5**, the major compute operation for each of the m perceptron/neuron outputs o_{j} (j∈[1,m]) in a layer of a neural network is: take all n input samples a_{i} (i∈[1,n]), multiply each input by the corresponding kernel weight w_{i,j} (i∈[1,n], j∈[1,m]), and accumulate the results:

o_{j} = f(∑_{i=1}^{n} a_{i} w_{i,j})

where f(x) can be any activation function of choice.

**Figure 5: Perceptron in Deep Learning**

If the precision of a_{i} and w_{i,j} are limited to INT8, this sum of products is the first of the parallel MACCs described in the INT8 optimization technique.

The second sum of products uses the same inputs a_{i} (i∈[1,n]), but a different set of kernel weights w_{i,k} (i∈[1,n], k∈[1,m], k≠j).

The result of the second perceptron/neuron output is:

o_{k} = f(∑_{i=1}^{n} a_{i} w_{i,k})

See **Figure 6**.

**Figure 6: Two Sums of Product Terms in Parallel with Shared Input**

By shifting the w_{i,k} values 18 bits to the left per the INT8 optimization technique, each DSP slice computes a partial, independent portion of the two final output values. The 48-bit accumulator of each DSP slice is chained to the next slice. This limits the number of chained slices to seven before saturation of the shifted w_{i,k} terms affects the calculation, i.e., 2n MACCs with n DSP slices for a total of n input samples.

Each layer of a typical DNN has 100s to 1000s of input samples. However, after seven terms of accumulation the lower terms of the 48-bit accumulator might saturate, so an extra DSP48E2 slice is needed to carry the summation every seven terms. This equates to 14 MACCs for every seven DSP slices plus one DSP slice to prevent oversaturation, a throughput improvement of 14/8 = 7/4, or 1.75X.

In convolutional neural networks (CNNs), the same set of weights is usually reused heavily in convolutional layers, forming a x w and b x w types of parallel MACC operations. Weight sharing instead of input sharing can therefore also be used (see **Figure 7**).

**Figure 7: Weight Sharing and Input Sharing Comparison**

## Other Methods to Create INT8 Chained MACCs

INT8 MACCs can also be constructed using the LUTs in the FPGA fabric at a similar frequency to the DSP slice. Depending on the usage of the FPGA, this could be a substantial increase in the deep learning performance, in some cases, increasing the performance by 3X. In many instances, with respect to other non-FPGA architectures, these available compute resources are not accounted for when calculating the available deep learning operations.

The programmable fabric in Xilinx FPGAs is unique because it can handle diverse workloads concurrently and efficiently. For example, Xilinx FPGAs can perform CNN image classification, networking cryptography, and data compression concurrently. Our deep-learning performance competitive analysis does not take the MACC LUTs into account because LUTs are usually more valuable while being used to perform other concurrent functions rather than to perform MACC functions.

## Competitive Analysis

Intel’s (formerly Altera) Arria 10 and upcoming Stratix 10 devices are used in this competitive analysis against Xilinx's Kintex® UltraScale™ and Virtex® UltraScale+™ families. For this compute intensive comparison, the devices chosen have the highest DSP densities in each product family: Arria 10 (AT115), Stratix 10 (SX280), Kintex UltraScale (KU115), Virtex UltraScale+ (VU9P), and Virtex UltraScale+ (VU13P) devices. This comparison focuses on general-purpose MACC performance that can be used in many applications, such as deep learning.

Intel’s MACC performance is based on operators that leverage the pre-adders. However, this implementation produces the sum of product terms and not unique separate product terms—as such, Intel’s pre-adders are not suited for deep learning operations.

The power of the Intel devices is estimated using Intel's EPE power estimation tools with the following worst-case assumptions:

- 90% DSP utilization at F_{MAX}
- 50% logic utilization with the clock rate at DSP F_{MAX}
- 90% block RAM utilization with the clock rate at half DSP F_{MAX}
- 4 DDR4 interfaces and 1 PCIe Gen3 x8
- 12.5% DSP toggle rate
- 80°C T_{J}

**Figure 8** shows the power efficiency comparison of deep learning operations. With INT8 optimization, Xilinx UltraScale and UltraScale+ devices can achieve 1.75X power efficiency on INT8 precision compared to INT16 operations (KU115 INT16 to KU115 INT8). And compared to Intel's Arria 10 and Stratix 10 devices, Xilinx devices deliver 2X–6X better power efficiency on deep learning inference operations.

**Figure 8: INT8 Deep Learning Power Efficiency Comparison: Xilinx vs. Intel**

## Conclusion

This white paper explores how INT8 deep learning operations are optimal on Xilinx DSP48E2 slices, achieving a 1.75X performance gain. The Xilinx DSP48E2 slice can be used to perform concurrent INT8 MACCs while sharing the same kernel weights. To implement INT8 efficiently, an input width of at least 24 bits is required, an advantage supported only by the DSP slices in Xilinx UltraScale and UltraScale+ FPGAs. Xilinx devices are well suited for INT8 deep learning workloads such as image classification. Xilinx is continuing to innovate new hardware- and software-based methodologies to accelerate deep learning applications.

For more information on deep learning in the data center, go to:

https://www.xilinx.com/accelerationstack

## References

1. Dettmers, T., "8-Bit Approximations for Parallelism in Deep Learning," ICLR 2016. https://arxiv.org/pdf/1511.04561.pdf
2. Gysel et al., "Hardware-oriented Approximation of Convolutional Neural Networks," ICLR 2016. https://arxiv.org/pdf/1604.03168v3.pdf
3. Han et al., "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding," ICLR 2016. https://arxiv.org/pdf/1510.00149v5.pdf
4. Rosenblatt, F., "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain," Psychological Review, Vol. 65, No. 6, 1958. http://www.ling.upenn.edu/courses/cogs501/Rosenblatt1958.pdf

By: Yao Fu, Ephrem Wu, Ashish Sirasao, Sedny Attia, Kamran Khan, and Ralph Wittig

## ARM Guide to OpenCL Optimizing Canny Edge Detection: Implementation


This chapter describes an example implementation of Canny edge detection.

## Buffer layouts and formats

This example implementation relies on convolutions to perform part of the edge detection process. This means that convolutions centered on pixels at the image edge cause unexpected behavior unless extra work is performed.

The problems that occur are:

**Legal memory access problems**

These are caused when a load attempts to use data outside the image. This occurs when the convolution is at the top or the bottom of the image.

**Algorithm correctness problems**

These occur when a convolution applied at the left or right side of the image loads data from the opposite side of an adjacent line of the source image. This is not valid data for the convolution, so the result is wrong.

### Solving the legal memory access problem

To solve the legal memory access problem, consider a simple image memory layout organization, a linear buffer with W pixels in each row stored left to right. Each new row is appended to the end of the previous row with no padding data. This means that the first pixel of a row is stored next to the last pixel of the previous row.

The following image shows a simple image layout.

**Figure 3-1 Simple buffer layout**

**Solving the legal memory access problem: y component**

This layout means that the neighbor below a pixel in the image is stored at index(P) + W, where P indicates the pixel whose neighbor is being found. Similarly, the pixel above the current pixel is stored at index(P) - W. This works in most cases. However, when the pixel above a top-row pixel is required, there is no valid neighbor in that direction. Reading from this location can cause a page fault if the problem is not addressed.
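The index arithmetic above can be sketched as follows (plain Python, purely illustrative; the image dimensions are assumptions, not values from the guide):

```python
W, H = 640, 480                      # image width and height in pixels

def index(x, y):
    """Linear index of pixel (x, y) in a row-major buffer with no padding."""
    return y * W + x

# Vertical neighbors of a pixel are exactly one full row (W pixels) away.
p = index(10, 5)
assert index(10, 6) == p + W         # neighbor below
assert index(10, 4) == p - W         # neighbor above

# For a top-row pixel, "the pixel above" falls before the start of the
# buffer entirely; this is the illegal access the kernel must guard against.
assert index(10, 0) - W < 0
```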

The easiest solutions to this problem are:

**Never attempt to perform this kind of calculation near the edges of an image. Instead, crop the result image or fill the result borders with dummy data.**

This solution is too simple and avoids the problem rather than solving it. It is not a good way to ensure legal accesses.

**Use a condition in the kernel code to decide whether the pixel is too close to an edge for the required calculation to work, and implement special-case code for this situation.**

This solution is expensive because it involves checks on every pixel to fix a problem that affects only a small proportion of pixels.

**Create a copy of the...**

## The PowerVR Imaging Framework Camera Demo

*This article was originally published at Imagination Technologies' website, where it is one of a series of articles. It is reprinted here with the permission of Imagination Technologies.*

Writing and optimizing code for heterogeneous computing can be difficult, especially if you are starting from scratch.

## ARM Guide to OpenCL Optimizing Canny Edge Detection: Theory


This chapter describes the theory of Canny edge detection.

## Blur

This stage blurs the source image to remove noise. Without this step, individual pixels which are significantly different from their neighbors would appear as edges.

Blurring reduces the effect of random noise by applying a weighted average to each pixel that incorporates the values of nearby pixels.

A 2D Gaussian matrix convolution operation is used to apply this effect. You can use any size of Gaussian matrix; however, for high performance and a good result, a 5 x 5 matrix is recommended. The ideal size of the Gaussian matrix might differ depending on the application using Canny edge detection. The size of the matrix and the coefficients used to blur the image affect the smoothness of the result.

The following figure shows the Gaussian 5 x 5 convolution matrix.

**Figure 2-1 The 5 x 5 Gaussian convolution matrix**

The following figure shows the effect of the Gaussian 5 x 5 convolution matrix on an image.

**Figure 2-2 Blur stage input and result**
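As an illustrative sketch (NumPy; the size and sigma values are assumptions, not taken from the guide), a normalized Gaussian convolution matrix like the one in Figure 2-1 can be generated as:

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Build a normalized size x size Gaussian convolution matrix."""
    r = np.arange(size) - size // 2           # offsets: -2..2 for size 5
    g = np.exp(-(r ** 2) / (2 * sigma ** 2))  # 1D Gaussian weights
    k = np.outer(g, g)                        # the 2D kernel is separable
    return k / k.sum()                        # normalize to preserve brightness

k = gaussian_kernel()
assert k.shape == (5, 5)
assert abs(k.sum() - 1.0) < 1e-12             # weights sum to one
assert k[2, 2] == k.max()                     # center pixel weighted most
```

Because the kernel is an outer product, it is separable, which matters for the optimization process discussed elsewhere in this guide.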

## Gradient analysis

The gradient of each dimension, x and y, of the image must be calculated. The results for each dimension must then be combined to show the overall gradient magnitude of the image at each pixel and the direction of change.

This stage uses another convolution matrix type, called the Sobel filter. This is normally performed using two 3 x 3 convolution matrices. Each matrix calculates a partial differential in its given direction to obtain the gradient in that direction, either x or y.

The following figure shows the Sobel horizontal matrix.

**Figure 2-3 Sobel horizontal convolution matrix**

The following figure shows the Sobel vertical matrix.

**Figure 2-4 Sobel vertical convolution matrix**

The results from these matrices are combined to create a total gradient magnitude and direction. The following equation determines the gradient magnitude, magnitude = sqrt(x...
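The gradient-combination step can be sketched as follows (NumPy; a minimal illustration using the standard Sobel matrices, not the guide's OpenCL kernel; one common sign convention is assumed):

```python
import numpy as np

# Standard Sobel matrices for the horizontal (x) and vertical (y) gradients.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

def gradient(img):
    """Per-pixel gradient magnitude and direction over the 'valid' region."""
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros_like(gx)
    for y in range(h - 2):
        for x in range(w - 2):
            window = img[y:y + 3, x:x + 3]
            gx[y, x] = np.sum(window * sobel_x)  # partial derivative in x
            gy[y, x] = np.sum(window * sobel_y)  # partial derivative in y
    magnitude = np.sqrt(gx ** 2 + gy ** 2)       # total gradient magnitude
    direction = np.arctan2(gy, gx)               # direction of change
    return magnitude, direction

# A vertical step edge produces a purely horizontal gradient.
img = np.zeros((5, 5))
img[:, 3:] = 1.0
mag, ang = gradient(img)
assert mag[1, 1] > 0 and abs(ang[1, 1]) < 1e-9   # gradient points along +x
```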

## ARM Guide to OpenCL Optimizing Canny Edge Detection: Introduction


This chapter introduces OpenCL, edge detection, assumptions that have been made when writing the sample code this document refers to, and the suitability of Canny edge detection for GPU compute.

## GPU compute and Canny edge detection

This guide provides an example optimization process for running Canny edge detection operations on an ARM® Mali™ Midgard GPU. This process can improve performance significantly.

ARM® Mali™ Midgard GPUs support the OpenCL Full Profile specification for General-Purpose computing on GPU (GPGPU) processing, also known as GPU compute.

This guide provides advice and information on the principles of GPU compute to software developers who want to make better use of the available hardware in platforms that perform Canny edge detection. It is not a comprehensive guide to optimization and GPU compute for all situations, although many principles in this guide can be applied to other tasks. The performance gains are given as examples; your results might vary.

## What is Canny edge detection?

Canny edge detection is a tunable algorithm that extracts edges from images. This particular algorithm is popular because it produces high-quality edges. The algorithm focuses on the following characteristics:

**Low error rate**

The algorithm produces few false edges.

**Good localization**

The locations of the output edges closely match the locations of the real edges in the original image.

**Minimal response**

Each edge is marked only once, producing thin, well-defined output edges.

The following figure shows an example input image and output result from Canny edge detection.

**Figure 1-1 Canny edge input and output**

The following are some applications that use edge detection:

- Feature extraction.
- Image processing.
- Image segmentation.
- Automotive applications.

## Which techniques are used in Canny edge detection?

Canny edge detection is a multistage process that uses the following stages:

- Gaussian blur filter.
- Sobel filter.
- Nonmaximum suppression.
- Hysteresis threshold application.

The following figure shows representations of each stage in an example edge detection process.

**Figure 1-2 Basic Canny edge flow**

## When is the task suitable for...

## Deep Dive: Implementing Computer Vision with PowerVR (Part 3: OpenCL Face Detection)

*This article was originally published at Imagination Technologies' website, where it is one of a series of articles. It is reprinted here with the permission of Imagination Technologies.*

## How to Build an Angstrom Linux Distribution for Intel (Altera) SoC FPGAs with OpenCV and Camera Driver Support

*This article was originally published at PathPartner Technology's website. It is reprinted here with the permission of PathPartner Technology.*

## Computer Vision Evolves Towards Ubiquity

*This column was originally published at Vision Systems Design's website. It is reprinted here with the permission of PennWell.*

## Taking on Poverty with Jobs Created by Machine Learning and Computer Vision

*This article was originally published by Embedded Vision Alliance consultant Dave Tokic. It is reprinted here with Tokic's permission.*