
ARM Guide to OpenCL Optimizing Convolution: Algorithm Variations


This chapter describes how to adapt the algorithm for 3 x 3 convolution matrices to work with 5 x 5 convolution matrices.

5 x 5 convolution matrix

Optimizing for 5 x 5 convolutions follows the same process as for 3 x 3 convolutions, but the resulting code differs in the following ways:

1. Each work-item needs to load more image data. This means that the 3 x 3 code for loading data, shown in the following example, must change.

uchar16 temp = vload16(0, middle_pixel + line_offset * stride - 3);
short8 left = convert_short8(temp.s01234567);
short8 middle = convert_short8(temp.s3456789A);
short8 right = convert_short8(temp.s6789ABCD);

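For a 5 x 5 matrix, the five shifted 8-byte windows span 20 bytes per row, which is more than a single vload16 can supply, so a vload4 fetches the remaining four bytes. The code must change to the following.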

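// Load 20 bytes of one image row: 16 starting 6 bytes left of the current position, then the next 4.
uchar16 temp  = vload16(0, src + pixel_idx + (i-2) * stride - 6);
uchar4  temp2 = vload4(0, src + pixel_idx + (i-2) * stride + 10);
// Five 8-byte windows, each shifted 3 bytes from the previous one.
short8 left2  = convert_short8(temp.s01234567);
short8 left1  = convert_short8(temp.s3456789A);
short8 middle = convert_short8(temp.s6789ABCD);
// The last two windows run past the end of temp, so they take their final bytes from temp2.
short8 right1 = convert_short8( (uchar8)(temp.s9ABC, temp.sDEF, temp2.s0) );
short8 right2 = convert_short8( (uchar8)(temp.sCDEF, temp2.s0123) );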
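As a rough check of the byte arithmetic behind this change, the following plain C sketch models the five windows with scalar indexing. It is not part of the guide's kernel: bytes_needed, load_taps_5, and the constants are hypothetical names, and the 3-byte step between neighboring pixels is an assumption read off the swizzle offsets in the code above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: neighboring pixels are 3 bytes apart (for example interleaved
   RGB), and each work-item produces 8 output bytes per row. */
enum { BYTES_PER_PIXEL = 3, LANES = 8 };

/* Bytes one image row must supply so that every tap of a kernel_width-wide
   kernel row can fill all LANES lanes. */
static int bytes_needed(int kernel_width)
{
    assert(kernel_width >= 1 && kernel_width % 2 == 1);
    return LANES + (kernel_width - 1) * BYTES_PER_PIXEL;
}

/* Scalar model of the five shifted windows. row_middle points at the first
   byte of the "middle" window; taps[0] corresponds to left2, taps[1] to
   left1, taps[2] to middle, taps[3] to right1, and taps[4] to right2. */
static void load_taps_5(const uint8_t *row_middle, int16_t taps[5][LANES])
{
    for (int t = 0; t < 5; ++t)
        for (int lane = 0; lane < LANES; ++lane)
            taps[t][lane] = row_middle[(t - 2) * BYTES_PER_PIXEL + lane];
}
```

Under these assumptions, bytes_needed(3) is 14, which fits in one vload16, while bytes_needed(5) is 20, which is why the 5 x 5 version adds a vload4 for bytes 16 to 19.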

2. Five partial convolutions are needed instead of three.

The following code shows the partial convolutions for a 3 x 3 matrix.

short8 convolution_3x1( int line_offset, __global const uchar *middle_pixel, const int stride, const char left_coeff, const char middle_coeff, const char right_coeff )
{
    uchar16 temp = vload16(0, middle_pixel + line_offset * stride - 3);
    short8 left = convert_short8(temp.s01234567);
    short8 middle = convert_short8(temp.s3456789A);
    short8 right = convert_short8(temp.s6789ABCD);
    return  left * (short8)left_coeff + middle * (short8)middle_coeff + right * (short8)right_coeff;
}
...
    // Row 0
    pixels  = convolution_3x1( -1, src + pixel_idx, stride, MAT0, MAT1, MAT2);
    // Row 1
    pixels += convolution_3x1(  0, src + pixel_idx, stride, MAT3, MAT4, MAT5);
    // Row 2
    pixels += convolution_3x1( +1, src + pixel_idx, stride, MAT6, MAT7, MAT8);
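The row-by-row structure of convolution_3x1 can be illustrated with a scalar reference in plain C. This is only a sketch under stated assumptions, not the guide's implementation: conv_row and conv_kxk are hypothetical helpers, the image is treated as single-channel for simplicity, the accumulator is an int rather than the short8 used in the kernel, and the caller is assumed to keep (x, y) at least k/2 pixels away from every image border.

```c
#include <assert.h>
#include <stdint.h>

/* Partial convolution: one k-wide kernel row applied to image row y + dy,
   centered on column x. coeffs holds that row's k coefficients. */
static int conv_row(const uint8_t *img, int stride, int x, int y, int dy,
                    const int8_t *coeffs, int k)
{
    int r = k / 2;
    int sum = 0;
    for (int dx = -r; dx <= r; ++dx)
        sum += img[(y + dy) * stride + (x + dx)] * coeffs[dx + r];
    return sum;
}

/* Full k x k result as the sum of k partial (row) convolutions.
   mat is row-major, so mat[0] multiplies the top-left neighbor, matching
   the MAT0..MAT8 ordering in the 3 x 3 code above. */
static int conv_kxk(const uint8_t *img, int stride, int x, int y,
                    const int8_t *mat, int k)
{
    assert(k >= 1 && k % 2 == 1);
    int r = k / 2;
    int sum = 0;
    for (int dy = -r; dy <= r; ++dy)
        sum += conv_row(img, stride, x, y, dy, mat + (dy + r) * k, k);
    return sum;
}
```

With k = 3 the loop in conv_kxk makes three conv_row calls, mirroring the three convolution_3x1 calls; with k = 5 it makes five, one per matrix row.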

To perform the partial convolutions for a 5 x 5 matrix, the convolution_3x1 code must change to the following.

#...