ARM Guide to OpenCL Optimizing Canny Edge Detection: Conclusion

This chapter describes some conclusions from the optimization process.


The first stages that use convolution perform well on the GPU. However, the hysteresis stage has a sequential dependency on the previous stages. Despite this, well-optimized OpenCL code can achieve a significant performance improvement compared to an application processor implementation.

The following techniques are useful for optimizing Canny edge detection:

  • Optimize the high-level algorithm before applying low-level implementation improvements. For example, merge convolution matrices where possible and, if the merged matrix is separable, apply it as separate one-dimensional passes.
  • Use a good memory layout for the buffers to improve the performance of several parts of the process. For example, a suitable layout enables vectorized loads and stores and improves the memory access pattern.
  • Use vector loads to reduce the pressure on the load/store pipeline.
  • Use padding, clamping, and selection to avoid computationally intensive boundary checks with branches or multiple enqueues.
  • Unroll loops in the hysteresis stage. This can increase the performance of kernels that are ALU-bound.
  • Reduce the size of data types to save bandwidth, improve performance, and improve power efficiency.
  • Balance kernel load to improve the execution flow.
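
Two of the techniques above, separating a convolution matrix and clamping coordinates instead of branching on boundaries, can be illustrated with a minimal sketch. It is written in plain C on the host rather than as an OpenCL kernel so that it can be compiled and checked without a GPU, and it is not taken from the guide's own implementation. The function names (`clamp_coord`, `sobel_x_full`, `sobel_x_separated`, `count_mismatches`), the choice of the horizontal Sobel matrix, and the synthetic test image are all assumptions made for illustration. The sketch applies the 3x3 matrix directly, then applies the same matrix as a row pass `[-1 0 1]` followed by a column pass `[1 2 1]`, and counts how many output pixels differ.

```c
#include <assert.h>

/* Clamp a coordinate into [0, max] so border pixels reuse the nearest edge
   value instead of needing a separate out-of-bounds branch for each tap. */
static int clamp_coord(int v, int max) {
    return v < 0 ? 0 : (v > max ? max : v);
}

/* Reference version: the full 3x3 horizontal Sobel matrix, 9 taps per pixel. */
static void sobel_x_full(const int *in, int *out, int width, int height) {
    static const int k[3][3] = { {-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1} };
    assert(width > 0 && height > 0);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int sum = 0;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    int sy = clamp_coord(y + dy, height - 1);
                    int sx = clamp_coord(x + dx, width - 1);
                    sum += k[dy + 1][dx + 1] * in[sy * width + sx];
                }
            }
            out[y * width + x] = sum;
        }
    }
}

/* Separated version: a row pass with [-1 0 1], then a column pass with
   [1 2 1]. tmp is caller-provided scratch space of width * height ints. */
static void sobel_x_separated(const int *in, int *tmp, int *out,
                              int width, int height) {
    assert(width > 0 && height > 0);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int left  = in[y * width + clamp_coord(x - 1, width - 1)];
            int right = in[y * width + clamp_coord(x + 1, width - 1)];
            tmp[y * width + x] = right - left;
        }
    }
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int up   = tmp[clamp_coord(y - 1, height - 1) * width + x];
            int mid  = tmp[y * width + x];
            int down = tmp[clamp_coord(y + 1, height - 1) * width + x];
            out[y * width + x] = up + 2 * mid + down;
        }
    }
}

/* Runs both versions on a small synthetic image (hypothetical test data)
   and returns the number of pixels where their outputs differ. */
static int count_mismatches(void) {
    enum { W = 7, H = 5 };
    int in[W * H], tmp[W * H], full[W * H], sep[W * H];
    int mismatches = 0;
    for (int i = 0; i < W * H; ++i) {
        in[i] = (i * i * 3 + i * 7) % 251;
    }
    sobel_x_full(in, full, W, H);
    sobel_x_separated(in, tmp, sep, W, H);
    for (int i = 0; i < W * H; ++i) {
        if (full[i] != sep[i]) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

Because both versions clamp coordinates to the image edge, the separated passes reproduce the full matrix at the borders as well as in the interior, while reading fewer values per pixel. In an OpenCL kernel the same idea would typically use the built-in `clamp` function on the work-item coordinates; how much this gains on a particular GPU is something to measure rather than assume.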