Bookmark and Share

ARM Guide to OpenCL Optimizing Canny Edge Detection: Optimization Process

Register or sign in to access the Embedded Vision Academy's free technical training content.

The training materials provided by the Embedded Vision Academy are offered free of charge to everyone. All we ask in return is that you register, and tell us a little about yourself so that we can understand a bit about our audience. As detailed in our Privacy Policy, we will not share your registration information, nor contact you, except with your consent.

Registration is free and takes less than one minute. Click here to register, and get full access to the Embedded Vision Academy's unique technical training content.

If you've already registered, click here to sign in.

See a sample of this page's content below:


This chapter describes some further optimizations to the kernel.

Useful mathematical properties of convolution

Convolution operations can be optimized in several ways. One of the most obvious optimization targets is the large number of repeated reads that the unoptimized kernel uses. These reads also use large data type floats for the early stages of the process.

For example, on average a 3 x 3 convolution loads each pixel nine times. This happens because each time the convolution is moved to a new pixel, P, the kernel fetches every neighbor. This fetch occurs even if the pixel values have been loaded before.

The following shows example verbose compiler logs for an unoptimized 5 x 5 convolution process and 3 x 3 convolution process.

Entry point: __llvm2lir_entry_convolution5x5_static
16 work registers used (with spilling), 3 uniform registers used

Pipelines: A / L / T / Overall
Number of instruction words emitted: 219 +81 + 0 = 300
Number of cycles for shortest code path: 113.5 /81 / 0 = 113.5 (A bound)
Number of cycles for longest code path: 117 /81 / 0 = 117 (A bound)

Entry point: __llvm2lir_entry_convolution3x3_static
15 work registers used, 4 uniform registers used

Pipelines: A / L / T / Overall
Number of instruction words emitted: 84 +34 + 0 = 118
Number of cycles for shortest code path: 41 /34 / 0 = 41 (A bound)
Number of cycles for longest code path: 44.5 /34 / 0 = 44.5 (A bound)

These logs suggest that both of these processes are ALU bound, because many mathematics operations are being repeated. To explore this problem and possible solutions, start with a simple example. The following figure shows a simple separable convolution matrix.


Figure 4-1 Example separable matrix

This convolution matrix is similar to the matrices used in Canny edge, because it is separable. This means that it can be split into two sequential convolutions Fh and Fv. The following figure shows the two separated parts of the convolution matrix.


Figure 4-2 Separable matrix parts

The means that the program can use the two separated convolution matrices sequentially instead of one application of the larger convolution matrix. The following figure shows the application of the unseparated convolution matrix to an example set of pixels.

...