ARM Guide to OpenCL Optimizing Convolution: Optimization Process


This chapter describes some of the useful optimization methods, the logic for them, and the results they provide on the test platform.

Not all methods that provide a performance gain are obvious.

The built-in function library and arithmetic simplification

The reference implementation contains some avoidable arithmetic and load/store instructions. You can improve performance by removing these instructions.

In the reference code there are some redundant arithmetic expressions that are computed three times. The following code shows the repeated expressions.

// xPx + scanline is computed three times

srcRedCh = (short)*(src + xPx + scanline);
srcGreenCh = (short)*(src + xPx + 1 + scanline);
srcBlueCh = (short)*(src + xPx + 2 + scanline);

// x*3 + y*strideByte is computed three times
// Store result in destination image
*(dst + x*3 + y*strideByte) = (uchar)dstRedCh;
*(dst + x*3 + 1 + y*strideByte) = (uchar)dstGreenCh;
*(dst + x*3 + 2 + y*strideByte) = (uchar)dstBlueCh;

Reorganize the code to avoid performing the same computation more than once. This kind of optimization strategy is general and can be used on the application processor code as well.
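As an illustration of this reorganization, the following plain-C sketch computes each shared offset once and reuses it for all three channels. The function name copy_pixel_hoisted and the local names srcPx and dstPx are hypothetical and do not come from the reference code. The convolution arithmetic between the load and the store is left out, so the pixel values simply pass through.

```c
#include <stdint.h>

/* Hypothetical helper (not from the reference code): loads and stores one
 * RGB pixel while evaluating each address expression only once. */
void copy_pixel_hoisted(const uint8_t *src, uint8_t *dst,
                        int x, int y, int xPx, int scanline, int strideByte)
{
    /* src + xPx + scanline was evaluated three times; do it once. */
    const uint8_t *srcPx = src + xPx + scanline;

    /* dst + x*3 + y*strideByte was evaluated three times; do it once. */
    uint8_t *dstPx = dst + x * 3 + y * strideByte;

    short srcRedCh   = (short)srcPx[0];
    short srcGreenCh = (short)srcPx[1];
    short srcBlueCh  = (short)srcPx[2];

    /* The convolution itself is omitted in this sketch. */
    dstPx[0] = (uint8_t)srcRedCh;
    dstPx[1] = (uint8_t)srcGreenCh;
    dstPx[2] = (uint8_t)srcBlueCh;
}
```

A compiler can often remove such common subexpressions on its own, but writing them once makes the intent explicit and does not depend on the optimizer.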

To further speed up arithmetic instructions, OpenCL provides a set of built-in functions, collected in the Built-In Function Library (BIFL). Most of these functions are hardware accelerated, and many are optimized.

For example, it is better to use the BIFL function clamp() than the C function clamp_0_255(). The following code shows the change from the C function to the BIFL function.

Using C functions:

// Clamp values between [0-255]
dstRedCh = clamp_0_255(dstRedCh);
dstGreenCh = clamp_0_255(dstGreenCh);
dstBlueCh = clamp_0_255(dstBlueCh);

Using BIFL functions:

// Clamp values between [0-255]
dstRedCh = clamp(dstRedCh, (short)0, (short)255);       // Built-in OpenCL function (BIFL)
dstGreenCh = clamp(dstGreenCh, (short)0, (short)255);   // Built-in OpenCL function (BIFL)
dstBlueCh = clamp(dstBlueCh, (short)0, (short)255);     // Built-in OpenCL function (BIFL)

Note: See http://www.khronos.org for more information on the BIFL clamp() function.
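To show why the substitution is safe, the following plain-C sketch compares the two forms. The body of clamp_0_255() is not shown in this sample, so the version here is an assumed implementation. clamp_short() is a hypothetical stand-in that follows the OpenCL definition of clamp(x, minval, maxval) as min(max(x, minval), maxval). It is not the built-in itself.

```c
/* ASSUMPTION: the reference code's clamp_0_255() is not shown in the
 * source; this is one plausible form of it. */
short clamp_0_255(short v)
{
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return v;
}

/* Hypothetical plain-C stand-in for the OpenCL built-in
 * clamp(x, minval, maxval), defined as min(max(x, minval), maxval). */
short clamp_short(short x, short minval, short maxval)
{
    short lower_bounded = (x > minval) ? x : minval;
    return (lower_bounded < maxval) ? lower_bounded : maxval;
}
```

Under these assumptions, clamp_short(v, 0, 255) returns the same value as clamp_0_255(v) for every short input, so the kernel can call the built-in without changing its results.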

To take this further, perform the clamping as part of the conversion from short to uchar by using a saturating conversion. The following code shows the result. For more detail see, ...
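The guide's own listing is not included in this sample. As a rough sketch of what a saturating conversion does, the plain-C function below mimics the behavior of the OpenCL C built-in convert_uchar_sat() for a short input: values outside the destination range are pinned to the nearest representable value instead of wrapping. The name convert_uchar_sat_c is hypothetical and is used only for this illustration.

```c
#include <stdint.h>

/* Hypothetical plain-C stand-in for the OpenCL C conversion
 * convert_uchar_sat(short): out-of-range values saturate to 0 or 255. */
uint8_t convert_uchar_sat_c(short v)
{
    if (v < 0)   return 0;     /* below the uchar range: saturate low  */
    if (v > 255) return 255;   /* above the uchar range: saturate high */
    return (uint8_t)v;         /* in range: value is preserved         */
}
```

In a kernel, this would presumably let a line such as dstRedCh = clamp(dstRedCh, (short)0, (short)255) followed by a (uchar) cast collapse into a single saturating conversion of dstRedCh, removing the separate clamp step.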