ARM Guide to OpenCL Optimizing Convolution: Fully Optimized Performance for Different Convolution Matrices

This chapter describes the performance of some common 3 x 3 convolution matrices using the fully optimized code.

Performance for common convolution filters

The fully optimized kernel was run with each of the following convolution matrices. Table 1 lists the results.

// Gaussian convolution matrix
int16_t gaussian3x3[3][3] = {
{1, 2, 1},
{2, 4, 2},
{1, 2, 1}
};

// Laplacian convolution matrix
int16_t laplacian3x3[3][3] = {
{  0, -1,  0},
{ -1,  4, -1},
{  0, -1,  0}
};

// Smooth convolution matrix
int16_t smooth3x3[3][3] = {
{1, 1, 1},
{1, 1, 1},
{1, 1, 1}
};

// Sobel Gx convolution matrix
int16_t sobelGx3x3[3][3] = {
{ -1, 0, 1},
{ -2, 0, 2},
{ -1, 0, 1}
};

// Motion blur convolution matrix
int16_t motionBlurMatrix[3][3] = {
{1, 0, 0},
{0, 1, 0},
{0, 0, 1}
};
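
To make clear how a matrix like these is applied to an image, here is a minimal scalar sketch in plain C. It is not the optimized OpenCL kernel this guide describes. The function name `convolve3x3_ref`, the 8-bit single-channel image layout, the copy-through border handling, and the `divisor` parameter are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Gaussian matrix from the text; its coefficients sum to 16. */
static const int16_t gaussian3x3[3][3] = {
    {1, 2, 1},
    {2, 4, 2},
    {1, 2, 1}
};

/*
 * Hypothetical scalar reference, not the optimized kernel.
 * Applies a 3 x 3 matrix to an 8-bit single-channel image.
 * Assumptions: border pixels are copied unchanged, the matrix is
 * applied without flipping, and the weighted sum is divided by
 * `divisor` and then clamped to the range 0..255.
 */
static void convolve3x3_ref(const uint8_t *src, uint8_t *dst,
                            size_t width, size_t height,
                            const int16_t m[3][3], int divisor)
{
    assert(width >= 3 && height >= 3 && divisor != 0);

    for (size_t y = 0; y < height; ++y) {
        for (size_t x = 0; x < width; ++x) {
            /* Border pixels lack a full 3 x 3 neighbourhood. */
            if (y == 0 || x == 0 || y == height - 1 || x == width - 1) {
                dst[y * width + x] = src[y * width + x];
                continue;
            }

            /* Weighted sum over the 3 x 3 neighbourhood. */
            int sum = 0;
            for (int ky = 0; ky < 3; ++ky) {
                for (int kx = 0; kx < 3; ++kx) {
                    sum += m[ky][kx] *
                           src[(y + ky - 1) * width + (x + kx - 1)];
                }
            }

            /* Normalize and clamp to the 8-bit range. */
            sum /= divisor;
            if (sum < 0) sum = 0;
            if (sum > 255) sum = 255;
            dst[y * width + x] = (uint8_t)sum;
        }
    }
}
```

For the Gaussian matrix a divisor of 16 keeps the overall brightness unchanged. For matrices whose coefficients sum to zero, such as the Laplacian and Sobel Gx matrices, a divisor of 1 is one possible choice.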

The following table lists the execution times of the fully optimized code, with constants implemented, for different matrices. Each time is given relative to that of the Gaussian 3 x 3 convolution matrix used during the optimization process. All tests were run on the GPU.

Convolution matrix     Execution time by image resolution
size and type          1024 x 576    2048 x 1536    4096 x 2304
-------------------    ----------    -----------    -----------
Gaussian 3 x 3         1x            1x             1x
Laplacian 3 x 3        1x            1x             1x
Smooth 3 x 3           1.4x          1.4x           1.4x
Sobel Gx 3 x 3         1x            0.97x          0.99x
Motion blur 3 x 3      1.2x          1.2x           1.1x

Table 1: Performance compared to the Gaussian 3 x 3 convolution matrix
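
The entries in Table 1 can be read as ratios of execution times. The sketch below assumes that reading: each entry is the execution time for a matrix divided by the Gaussian 3 x 3 execution time at the same resolution. The function name `relative_time` and the millisecond units are hypothetical, and the source does not give the absolute timings.

```c
#include <assert.h>

/*
 * Hypothetical helper: execution time of a matrix relative to the
 * Gaussian 3 x 3 baseline at the same resolution. Under this reading,
 * a result above 1.0 means the matrix took longer than the baseline.
 */
static double relative_time(double matrix_ms, double gaussian_ms)
{
    assert(gaussian_ms > 0.0);
    return matrix_ms / gaussian_ms;
}
```

For example, a matrix measured at twice the baseline time would be reported as 2x under this reading.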

The kernels are balanced in terms of load/store and arithmetic operations. This means that there is not always a significant speed increase...