ARM Guide to OpenCL Optimizing Convolution: General Optimization Guidelines

This chapter describes some concepts to consider when optimizing kernels.

Driver overhead

Driver overhead is the time the host code requires to set up buffers, parameters, and kernels before the GPU can execute instructions, plus the time between the completion of execution and the data becoming available to the host.

Overhead keeps GPU compute implementations below their ideal performance, particularly when the operation is small. Driver overhead includes the following steps; a sketch of the corresponding host code appears after this list:

  • Memory mapping.
  • Data copy.
  • Cache maintenance.
  • Parameter setup.
  • Kernel enqueue.
  • GPU job dispatch.

When the driver overhead takes a similar amount of time to the GPU or application processor execution itself, there is less benefit from GPU compute.

The structure of the application pipeline affects whether GPU compute is a suitable option. If most of the pipeline runs on the application processor, and moving a process to the GPU adds more driver overhead time than the move saves in execution time, ARM recommends keeping that process on the application processor.

If the driver overhead requires less time than the GPU execution time, the operation is suitable for GPU compute.
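
One way to judge this on a specific platform is to time the kernel with OpenCL profiling events. The following sketch is a minimal example, assuming the command queue was created with CL_QUEUE_PROFILING_ENABLE; note that the queued-to-start gap captures only the enqueue and dispatch portion of the driver overhead, not buffer preparation:

    #include <CL/cl.h>

    /* Returns the GPU execution time of a kernel event, in nanoseconds,
     * and writes the enqueue-to-start gap (one visible slice of driver
     * overhead) to *overhead_ns. */
    cl_ulong gpu_exec_time_ns(cl_event ev, cl_ulong *overhead_ns)
    {
        cl_ulong queued, start, end;
        clWaitForEvents(1, &ev);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_QUEUED,
                                sizeof(queued), &queued, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                                sizeof(start), &start, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                                sizeof(end), &end, NULL);
        *overhead_ns = start - queued;
        return end - start;
    }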

Driver overhead for convolution

Driver overhead accounts for only a small part of the total time required for convolution operations. It therefore does not significantly reduce the performance of convolutions that use GPU compute, so GPU compute is suitable for convolutions.

The following graph shows the proportion of total time spent on the GPU process, and driver overhead for a range of resolutions.


Figure 4-1: Time spent on the GPU convolution process and driver overhead

GPU compute operations normally operate on large data sets. When this is the case, most of the overhead time is spent on buffer preparation, which has the following steps:

  • Mapping.
  • Memory copy.
  • Unmapping.

Mapping and unmapping are fast and represent a small part of the buffer preparation step; the memory copy takes the longest time, as the sketch below illustrates.
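
The three steps map directly onto OpenCL API calls. This is a minimal sketch, assuming a command queue and a destination buffer already exist; the function name and parameters are illustrative:

    #include <CL/cl.h>
    #include <string.h>

    /* Copy host data into a device buffer using map/copy/unmap. */
    void prepare_buffer(cl_command_queue queue, cl_mem buf,
                        const void *src, size_t bytes)
    {
        cl_int err;
        /* Mapping: fast. */
        void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                     0, bytes, 0, NULL, NULL, &err);
        /* Memory copy: usually the longest step. */
        memcpy(p, src, bytes);
        /* Unmapping: fast. */
        clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
    }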

The following graph shows the amount of time spent on mapping, memory copy, and unmapping.


Figure 4-2: The time spent on buffer preparation steps

The following graph shows the proportions for the smallest resolutions so the processes are visible.

...