Embedded Vision Alliance: Technical Articles

ARM Guide to OpenCL Optimizing Convolution: General Optimization Guidelines

This chapter describes some concepts to consider when optimizing kernels.

Driver overhead

Driver overhead is the time the host code requires to set up the buffers, parameters, and kernels before the GPU can execute instructions, plus the time between the completion of execution and the data becoming available to the host.

Overhead reduces the effectiveness of GPU compute implementations below ideal performance, particularly when the operation is small. Driver overhead includes:

  • Memory mapping.
  • Data copy.
  • Cache maintenance.
  • Parameter setup.
  • Kernel enqueue.
  • GPU job dispatch.

When the driver overhead takes a similar amount of time as the GPU or application processor execution, there is less benefit from GPU compute.

The application pipeline structure affects whether GPU compute is a suitable option. If most of the pipeline uses the application processor, and moving processes to the GPU adds more driver overhead time than is reduced by the change, ARM recommends running those processes on the application processor.

If the driver overhead requires less time than the GPU execution time, the operation is suitable for GPU compute.

Driver overhead for convolution

Driver overhead accounts for a small part of the total time required for convolution operations. This means that driver overhead does not significantly reduce the performance of convolution operations using GPU compute, and GPU compute is therefore suitable for convolutions.

The following graph shows the proportion of total time spent on the GPU process and on driver overhead for a range of resolutions.


Figure 4-1: Time spent on the GPU convolution process and driver overhead

GPU compute operations normally operate on large data sets. When this is the case, most of the overhead time is buffer preparation. Buffer preparation has the following steps:

  • Mapping.
  • Memory copy.
  • Unmapping.

Mapping and unmapping are fast and represent a small part of the buffer preparation step. Memory copy operations take the longest time.

The following graph shows the amount of time spent on mapping, memory copy, and unmapping.


Figure 4-2: The time spent on buffer preparation steps

The following graph shows the proportions for the smallest resolutions so the processes are visible.

...

ARM Guide to OpenCL Optimizing Convolution: Optimization Process

This chapter describes some of the useful optimization methods, the logic for them, and the results they provide on the test platform.

Not all methods that provide a performance gain are obvious.

The built-in function library and arithmetic simplification

The reference implementation contains some avoidable arithmetic and load/store instructions. You can improve performance by removing these instructions.

In the reference code there are some redundant arithmetic expressions that are computed three times. The following code shows the repeated expressions.

// xPx + scanline is computed three times

srcRedCh = (short)*(src + xPx + scanline);
srcGreenCh = (short)*(src + xPx + 1 + scanline);
srcBlueCh = (short)*(src + xPx + 2 + scanline);

// x*3 + y*strideByte is computed three times
// Store result in destination image
*(dst + x*3 + y*strideByte) = (uchar)dstRedCh;
*(dst + x*3 + 1 + y*strideByte) = (uchar)dstGreenCh;
*(dst + x*3 + 2 + y*strideByte) = (uchar)dstBlueCh;

Reorganize the code to avoid performing the same computation more than once. This kind of optimization strategy is general and can be used on the application processor code as well.
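
As a sketch of that reorganization (plain C, with names borrowed from the snippets above; the helper itself is illustrative, not the original kernel code):

```c
#include <stdint.h>

/* Illustrative helper (not the original kernel code): hoist the repeated
   sub-expression x*3 + y*strideByte into a single base pointer, then
   address the three channels relative to it. */
void store_pixel(uint8_t *dst, int32_t x, int32_t y, int32_t strideByte,
                 uint8_t r, uint8_t g, uint8_t b)
{
    uint8_t *px = dst + x * 3 + y * strideByte; /* computed once */
    px[0] = r;  /* red   */
    px[1] = g;  /* green */
    px[2] = b;  /* blue  */
}
```

The same hoisting applies to the load side with xPx + scanline.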

To further speed up arithmetic, OpenCL provides a set of built-in functions, collected in the Built-In Function Library (BIFL). Most of these functions are hardware accelerated, and many of them are heavily optimized.

For example, it is better to use the BIFL clamp() than a C helper function such as clamp_0_255(). The following code shows the change from the C function to the BIFL function.

Using C functions:

// Clamp values between [0-255]
dstRedCh = clamp_0_255(dstRedCh);
dstGreenCh = clamp_0_255(dstGreenCh);
dstBlueCh = clamp_0_255(dstBlueCh);

Using BIFL functions:

// Clamp values between [0-255]
dstRedCh = clamp(dstRedCh, (short)0, (short)255);       // Built-in OpenCL function (BIFL)
dstGreenCh = clamp(dstGreenCh, (short)0, (short)255);   // Built-in OpenCL function (BIFL)
dstBlueCh = clamp(dstBlueCh, (short)0, (short)255);     // Built-in OpenCL function (BIFL)

Note: See http://www.khronos.org for more information on the BIFL clamp() function.

To take this further, perform the clamping as part of the conversion from short to uchar by using a saturating conversion. The following code shows the result. For more details, see ...
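
In OpenCL C, the saturating conversion is the built-in convert_uchar_sat(); here is a plain-C model of its behavior (illustrative only, not the guide's code):

```c
#include <stdint.h>

/* Plain-C model of OpenCL's convert_uchar_sat(): values below 0 clamp
   to 0, values above 255 clamp to 255. Illustrative only. */
uint8_t convert_uchar_sat_model(int16_t v)
{
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}
```

Folding the clamp into the conversion removes the separate clamp() calls entirely.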

ARM Guide to OpenCL Optimizing Convolution: The Reference Implementation

This chapter describes some initial implementations of convolutions and describes the reference implementation for the optimization process.

These are the simplest and most intuitive implementations of convolution algorithms.

Convolution example in C and C++, run on the application processor

Before any optimizations are implemented, the algorithm runs entirely on the application processor. Any optimization using OpenCL and GPU compute starts from this state.

The 2D convolution operation requires two pairs of nested loops:

  • An outer pair of loops that scans each pixel of the source image.
  • An inner pair of loops that scans each element of the convolution matrix.

The code is composed of the following main parts:

  1. Convolution computation with two loops.
    The two loops are used for computing the sum of products between all convolution matrix values H(i, j) and the corresponding source image element I(u + x, v + y).

    // Convolution computation
    for(int32_t yKernel = -1; yKernel < 2; yKernel++)
    {
        for(int32_t xKernel = -1; xKernel < 2; xKernel++)
        {
            // Compute address
            int32_t xPx = (x + xKernel)*src->pixel_size();
            int32_t scanline = (y + yKernel)*strideByte;

            // Get red, green and blue channels in source image
            srcRedCh = (int16_t)*(srcData + xPx + scanline);
            srcGreenCh = (int16_t)*(srcData + xPx + 1 + scanline);
            srcBlueCh = (int16_t)*(srcData + xPx + 2 + scanline);

            // Get kernel value
            int16_t convolution_values = *(convolution_matrix + (xKernel + 1) + (yKernel + 1)*3);

            // Store partial result
            dstRedCh += srcRedCh * convolution_values;
            dstGreenCh += srcGreenCh * convolution_values;
            dstBlueCh += srcBlueCh * convolution_values;
        }
    }

     
  2. Normalization to ensure the resulting image has the same brightness as the original. See Integer values for the convolution matrix for an explanation of the scale value.

    // Avoid divide by zero
    if(scale != 0)
    {
        dstRedCh /= scale;
    ...
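
For reference, the structure above can be condensed into a small self-contained C sketch (single-channel data and skipped borders are simplifications assumed here for brevity; the guide's reference code operates on RGB):

```c
#include <stdint.h>
#include <string.h>

/* Minimal single-channel 3x3 convolution in the style of the reference
   implementation: two loops over the image, two loops over the kernel,
   then normalization by 'scale' and a clamp to [0, 255]. Grayscale data
   and copied borders are illustrative simplifications. */
void convolve3x3(const uint8_t *src, uint8_t *dst,
                 int32_t width, int32_t height,
                 const int16_t kernel[9], int16_t scale)
{
    memcpy(dst, src, (size_t)width * height); /* leave 1-pixel border as-is */
    for (int32_t y = 1; y < height - 1; y++)
    {
        for (int32_t x = 1; x < width - 1; x++)
        {
            int32_t acc = 0;
            for (int32_t yK = -1; yK < 2; yK++)
                for (int32_t xK = -1; xK < 2; xK++)
                    acc += src[(y + yK) * width + (x + xK)] *
                           kernel[(xK + 1) + (yK + 1) * 3];
            if (scale != 0)          /* normalization, avoiding divide by zero */
                acc /= scale;
            if (acc < 0)   acc = 0;  /* clamp to [0, 255] */
            if (acc > 255) acc = 255;
            dst[y * width + x] = (uint8_t)acc;
        }
    }
}
```

With the identity kernel the image passes through unchanged; with an all-ones kernel and scale 9 it behaves as a box blur.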

ARM Guide to OpenCL Optimizing Convolution: Introduction

This chapter introduces the concepts of convolutions and GPU compute.

GPU compute and convolutions

This guide provides an example optimization process for running convolution operations using an ARM® Mali™ Midgard GPU. This process can improve performance significantly.

ARM® Mali™ Midgard GPUs support the OpenCL Full Profile specification for General Purpose computing on GPU (GPGPU) processing, also known as GPU compute.

This guide provides advice and information on the principles of GPU compute to software developers who want to make better use of the available hardware in platforms that perform convolutions. It is not a comprehensive guide to optimization and GPU compute for all situations, although many of its principles can be applied to other tasks. The performance gains are given as examples; your results might vary.

Where are convolution operations used?

Convolution operations are important in image processing, particularly in filtering. The terms filter, or linear filter, are often used when describing image processing algorithms instead of the term convolution.

Image filtering enables you to apply effects on photos, such as:

  • Blurring.
  • Smoothing.
  • Sharpening.
  • Intensifying.
  • Enhancing.
  • Matching.

These are important effects in raster graphics editors.

The following picture shows examples of sharpening and smoothing.


Figure 1-1: Sharpening and blurring images

These operations are important for many larger and more complicated operations in computer vision and photography. For example:

  • Canny edge detection.
  • Focus detection.

Because convolutions are common, it is useful to understand how they can be optimized and improved using GPU compute.

Convolution is a local operation. A local operation is an image transform where the value of each pixel in the destination image depends on a set of pixel values from the source image. The opposite of this is a point operation, where the value of each pixel in the destination image depends only on the pixel in the same position in the source image.

What is a convolution?

A convolution is a mathematical operation that takes two functions, for example I and H, and returns a third function I'.

The general convolution operation is a continuous integral of one function moved over another. The result is a measure of the overlap between the two functions as a function of the translation distance of the function that moves.
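
In standard notation, the continuous 1-D case the paragraph describes can be written as (a textbook formulation, not quoted from the guide):

```latex
(I * H)(t) = \int_{-\infty}^{\infty} I(\tau)\, H(t - \tau)\, d\tau
```

Here H is mirrored and translated by t across I, and the integral measures the overlap between the two functions at each translation.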

One type of convolution...

OpenVX Enables Portable, Efficient Vision Software

OpenVX, a maturing API from the Khronos Group, enables embedded vision application software developers to efficiently harness the various processing resources available in SoCs and systems.

Vision technology is now enabling a wide range of products that are more intelligent and responsive than before, and thus more valuable to users. Such image perception, understanding, and decision-making processes have historically been achievable only using large, heavy, expensive, and power-draining computers and cameras. Thus, computer vision has long been restricted to academic research and relatively low-volume production designs.

However, thanks to the emergence of increasingly capable and cost-effective processors, image sensors, memories and other semiconductor devices, along with robust algorithms, it's now practical to incorporate computer vision into a wide range of systems. The Embedded Vision Alliance uses the term “embedded vision” to refer to this growing use of practical computer vision technology in embedded systems, mobile devices, PCs, and the cloud.

Key to the widespread adoption of embedded vision is the ease of developing software that runs efficiently on a diversity of hardware platforms, with high performance, low power consumption and cost-effective system resource needs. In the past, this combination of objectives has been a tall order, since the combination of high performance and low power consumption has historically required significant code optimization for a particular device architecture, thereby hampering portability to other architectures. Fortunately, this situation is beginning to change with the emergence of OpenVX, an open standard created and maintained by the not-for-profit Khronos Group industry consortium.

Overview

OpenVX was developed for the cross-platform acceleration of computer vision applications, prompted by the need for high performance and low power with diverse vision processors. Processing options now in use (solely or in combination) by vision developers include single- and multi-core CPUs, GPUs, DSPs, FPGAs, ISPs (image signal processors), and dedicated vision processors and IP cores. According to the Khronos Group, OpenVX is specifically targeted at low-power, real-time applications, particularly those running on mobile and embedded platforms.

For example, a low-power host processor can set up and manage the vision-processing pipeline, with some or all of the functions running on dedicated hardware and/or one or multiple specialized coprocessors. More generally, the OpenVX specification provides a way for developers to tap into the performance of processor-specific optimized code, while still providing code portability via the API's standardized interface and higher-level abstraction compared with traditional vision software development approaches.

OpenVX graphs are at the foundation of the API’s efficiency advantages. In creating a vision-processing algorithmic pipeline, application developers construct a graph made up of operations, called nodes, which can be coded in any language and implemented using lower-level APIs. Each node, typically created by a vision processor vendor as part of its supplied OpenVX framework, is an instance of a computer vision function (i.e. a kernel) with associated data references, a return value, and performance information. And both application developers and processor vendors can optionally create user-defined nodes based on custom extensions to the OpenVX specification. Note, however, that unlike with vendor-defined custom nodes, which the vendor's OpenVX implementation inherently knows about and can therefore work with in an optimal way, developer-defined custom nodes operate outside of a vendor's OpenVX framework and therefore incur additional usage restrictions.

OpenVX's flexibility supports implementations optimized for performance, power consumption, and/or other parameters. For example, a vendor's framework implementation can "fuse" multiple nodes in order to eliminate intermediate memory transfers. Processing can also be "tiled" in order to retain data in the local memory or cache. Workload parallelism enables the framework to simultaneously execute multiple nodes on multiple processing targets. And after the host processor initially sets up an application graph, it can then execute near-autonomously, according to the Khronos Group. The final OpenVX v1.0 specification was released in October 2014, with a subsequent maintenance update released in June 2015 and further evolution ongoing.

Standards Comparisons, Contrasts and Compatibilities

Naming similarities can lead to confusion between OpenVX and two other software standards (the first de facto, the second formal) commonly used in vision processing, OpenCV and OpenCL. OpenCV, the open-source computer vision library, is a collection of more than 2,500 software components released under a BSD license and available for free use for computer vision applications and research. OpenCV is written in optimized C and C++, has C++, C, Python and Java interfaces, and supports Windows, Linux, Mac OS, iOS and Android.

Several key distinctions between OpenVX and OpenCV bear mentioning. First, OpenVX is fundamentally targeted at efficient optimization of code on embedded and mobile processors. While OpenCV is also optimized for multiple processors, OpenVX is designed to provide many straightforward optimization opportunities. OpenVX's graph API, along with its opaque data structures, makes it uncomplicated to relocate the execution of kernels from the CPU to an accelerator with dedicated memory, such as a GPU or DSP. The concept of virtual images also allows for further optimizations, such as filter stacking, that minimize data transfers through memory buses.

OpenVX has also been designed with a fundamental focus on functional portability. Any code based on OpenVX will produce functionally identical (albeit not necessarily identical-performance) results when executed on different processors. This achievement is enabled by a tightly defined specification together with conformance tests that enforce functional portability. OpenCV also has an extensive body of algorithmic tests, but they focus on making sure that the library works correctly on each of the platforms, rather than confirming that functions return the same results on multiple platforms.

Finally, while OpenCV contains an extensive set of functions that span the field of computer vision, optimizing OpenCV-sourced algorithms for a new hardware platform can consume significant time and effort, depending on what percentage of the 500+ algorithms in the library require optimization for a particular project. Since OpenVX defines a much smaller set of functions, creating an efficient OpenVX implementation is more feasible (Table 1). And the OpenVX specification is written with the intent of enabling OpenCV users to easily migrate to OpenVX at any point in time.

| Vision Function | Unsigned 8-bit | Unsigned 16-bit | Signed 16-bit | Unsigned 32-bit | Signed 32-bit | Floating Point 32-bit | 4CCC |
|---|---|---|---|---|---|---|---|
| AbsDiff | Input, Output | | | | | | |
| Accumulate | Input | | Output | | | | |
| AccumulateSquared | Input | | Output | | | | |
| AccumulateWeighted | Input, Output | | | | | | |
| Add | Input, Output | | Input, Output | | | | |
| And | Input, Output | | | | | | |
| Box3x3 | Input, Output | | | | | | |
| CannyEdgeDetector | Input, Output | | | | | | |
| ChannelCombine | Input | | | | | | Output |
| ChannelExtract | Output | | | | | | Input |
| ColorConvert | | | | | | | Input, Output |
| ConvertDepth | Input, Output | | Input, Output | | | | |
| Convolve | Input, Output | | Output | | | | |
| Dilate3x3 | Input, Output | | | | | | |
| EqualizeHistogram | Input, Output | | | | | | |
| Erode3x3 | Input, Output | | | | | | |
| FastCorners | Input, Output | | | | | | |
| Gaussian3x3 | Input, Output | | | | | | |
| HarrisCorners | Input, Output | | | | | | |
| HalfScaleGaussian | Input, Output | | | | | | |
| Histogram | Input | | | | Output | | |
| IntegralImage | Input | | | Output | | | |
| TableLookup | Input, Output | | | | | | |
| Magnitude | | | Input, Output | | | | |
| MeanStdDev | Input | | | | | Output | |
| Median3x3 | Input, Output | | | | | | |
| MinMaxLoc | Input, Output | | Input, Output | Output | | | |
| Multiply | Input, Output | | Input, Output | | | | |
| Not | Input, Output | | | | | | |
| OpticalFlowLK | Input | | | Output | | | |
| Or | Input, Output | | | | | | |
| Phase | Output | | Input | | | | |
| GaussianPyramid | Input, Output | | | | | | |
| Remap | Input, Output | | | | | | |
| ScaleImage | Input, Output | | | | | | |
| Sobel3x3 | Input | | Output | | | | |
| Subtract | Input, Output | | Input, Output | | | | |
| Threshold | Input, Output | | | | | | |
| WarpAffine | Input, Output | | | | | | |
| WarpPerspective | Input, Output | | | | | | |
| Xor | Input, Output | | | | | | |

Table 1. OpenVX v1.0 Base Vision Functions and Associated Input/Output Types

Then there's OpenCL, also developed and maintained by the Khronos Group (as well as used by OpenCV to harness the power of GPUs and other accelerators). This language-based framework was created as a tool for efficient execution of data-parallel software on heterogeneous hardware architectures. Similarities exist between OpenCL and OpenVX: both are open standards targeted at code acceleration, and both enforce functional portability. And OpenVX is designed so that it is easy to implement in coexistence with OpenCL; an OpenVX node can be implemented using OpenCL, for example. But important differences also exist.

OpenCL specifies a C-based language for programming kernels, with a focus on general-purpose math functions. OpenVX, in contrast, targets a specific application domain: vision. Many elements of OpenVX make it extremely efficient on embedded hardware, and it possesses a set of computer vision functions essential for almost any computer vision use case. Also, like OpenCV, OpenCL implements a declarative (that is, explicit) memory model. This characteristic prohibits certain types of optimizations that are possible with OpenVX, which, as mentioned earlier, specifies an opaque memory model for most of its data objects.

Finally, floating-point requirements (or lack thereof) are an important consideration for embedded platforms. Some of the processors used to accelerate computer vision functions do not support floating-point calculations. Since floating-point data types are an integral part of OpenCL, implementing it on such platforms is challenging. OpenVX conversely does not require floating-point support.

A Developer's Perspective

OpenVX's "much smaller set of functions" compared with OpenCV is something of a mixed blessing. While the narrower focus enables processor suppliers to more rapidly develop optimized OpenVX implementations, the small set of OpenVX functions currently required for conformance isn't comprehensive from the standpoint of what's necessary to develop sophisticated vision applications. To fill any function gaps, as previously mentioned, developers can create custom "user nodes" or rely on custom extensions provided by the vendor of their preferred target vision processor.

An extension-evolution path akin to that already seen with other Khronos standards is also anticipated with OpenVX. When enough customers with similar needs emerge, processor vendors will support OpenVX extensions to meet those needs. And when multiple vendors adopt similar extensions, some of those will likely be incorporated in a future version of the OpenVX specification. This evolution will not occur overnight, among other reasons because OpenCV is well known by developers. But OpenVX advocates believe that the standard's performance, power consumption and portability benefits make extensive function support (initially via extensions, later built in to the specification) a matter of "when", not "if".

OpenVX extension development by means of an API "wrapper" around an OpenCV function is a possibility, but in doing so the extension developer will likely negate a key OpenVX benefit. OpenCV is fundamentally based on processing a frame's worth of data at a time. Conversely, since optimized data flow is at the core of OpenVX, it favors handling data in structures such as tiles and pixels. More generally, it focuses on moving bits versus frames, an efficiency distinction that would be discarded if an OpenCV function were simply inserted into the OpenVX framework without modification.

Graph and Extension Examples

Figure 1 shows a typical OpenVX graph for implementing a pseudo-Canny edge detection algorithm, common in many computer vision applications.


Figure 1. A conventional Canny edge detector OpenVX graph (source: "Addressing System-Level Optimization with OpenVX Graphs")

In a typical computer vision system, the camera's image sensor captures color information by means of a Bayer pattern filter array. Figure 2 shows a typical Bayer pattern. While the OpenVX standard includes multiple standard nodes, Bayer de-mosaicing is not one of them. However, a developer can define a custom OpenVX node that will ingest raw Bayer data and output a grayscale (monochrome) frame.


Figure 2. A conventional Bayer color filter array (CFA) pattern

Such an OpenVX node first needs to implement color interpolation. Specifically, at each pixel location, the node will need to interpolate the two missing color components. The most straightforward interpolation method is the bilinear approach, which can suffer from color artifacts in regions with edges or high-frequency details. More sophisticated de-mosaicing methods include higher-order interpolants such as bicubic or Lanczos resampling. For this particular application, however, a bilinear interpolant will likely suffice, since the output of the custom node is an elementary monochrome signal.
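
For instance, here is a minimal C sketch of the bilinear case at a red site of an RGGB pattern, where the missing green value is the average of the four green neighbors (the raw-buffer layout and the helper name are illustrative assumptions, not from the article):

```c
#include <stdint.h>

/* Bilinear green interpolation at a red pixel site of an RGGB Bayer
   pattern: average the four green neighbors (left, right, up, down).
   The raw[] layout and this helper are illustrative assumptions;
   border handling is omitted for brevity. */
uint8_t green_at_red_bilinear(const uint8_t *raw, int32_t width,
                              int32_t x, int32_t y)
{
    int32_t sum = raw[y * width + (x - 1)] + raw[y * width + (x + 1)] +
                  raw[(y - 1) * width + x] + raw[(y + 1) * width + x];
    return (uint8_t)(sum / 4);
}
```

The blue component at the same site would analogously average the four diagonal neighbors.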

After calculating the R, G, and B color components at each pixel location, it's straightforward to compute the monochrome luminance:

Y = 0.299*R + 0.587*G + 0.114*B
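
On integer-only hardware this formula is often evaluated in fixed point; here is a C sketch using /256 weights (the specific integer coefficients are a common approximation, assumed here rather than taken from the article):

```c
#include <stdint.h>

/* Integer approximation of Y = 0.299*R + 0.587*G + 0.114*B using
   /256 fixed-point weights (77 + 150 + 29 = 256). */
uint8_t rgb_to_luma(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint8_t)((77 * r + 150 * g + 29 * b) >> 8);
}
```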

Vision systems additionally often employ a wide-angle or “fish-eye” lens in order to increase the field-of-view, with the consequence that lens correction must be utilized to counteract the effects of the lens's optical distortions. In this example, to accomplish the de-warping task, the grayscale frame will be routed to the standard OpenVX vxRemapNode node. A factory calibration procedure generates a remapping table that is applied to each pixel in the monochrome frame. In an optimized GPU-based OpenVX driver implementation, the vxRemapNode node will leverage the GPU texture unit to perform hardware-accelerated bilinear sampling.
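
Conceptually, the remap stage looks like the following C sketch (nearest-neighbor sampling for brevity, where a real vxRemapNode implementation would use bilinear sampling; the table layout and function name are illustrative assumptions):

```c
#include <stdint.h>

/* Sketch of lens-distortion correction via a precomputed remap table:
   for each destination pixel, the calibration tables give the source
   coordinate to sample. Nearest-neighbor sampling is used here for
   brevity; bounds checking is omitted. */
void remap_nearest(const uint8_t *src, uint8_t *dst,
                   const int32_t *map_x, const int32_t *map_y,
                   int32_t width, int32_t height)
{
    for (int32_t y = 0; y < height; y++)
        for (int32_t x = 0; x < width; x++)
            dst[y * width + x] = src[map_y[y * width + x] * width +
                                     map_x[y * width + x]];
}
```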

Both vxRemapNode and the previously discussed custom OpenVX node for Bayer de-mosaicing and monochrome conversion are inserted at the front end of the graph, thereby allowing the OpenVX driver to optimize the layout for the architecture at hand (Figure 3). Bayer de-mosaicing and pixel warping processes are both highly data-parallel, so the associated computations are good candidates for tiling to achieve peak memory efficiency. And both processes can be accelerated in a fairly straightforward manner on GPU architectures, for example.


Figure 3. Modified OpenVX graph with new front-end

The code for this OpenVX graph example follows:

vx_context c = vxCreateContext();
vx_graph g = vxCreateGraph(c);

vx_image bayer = vxCreateImage(c,w,h,FOURCC_BYR2);
vx_image y = vxCreateImage(g,0,0,FOURCC_VIRT);
vx_image yUndistort = vxCreateImage(g,0,0,FOURCC_VIRT);
vx_image ySmooth = vxCreateImage(g,0,0,FOURCC_VIRT);
// [snip]
vx_node nodes[] = {
  vxCustomBayerToLumaNode(g, bayer, y),
  vxRemapNode(g, y, yUndistort),
  vxGaussian3x3Node(g, yUndistort, ySmooth),
  // [snip]
};
vxVerifyGraph(g);
vxProcessGraph(g);

Developer Assistance

Vision technology is enabling a wide range of products that are more intelligent and responsive than before, and thus more valuable to users. Vision processing can add valuable capabilities to existing products. And it can provide significant new markets for hardware, software and semiconductor suppliers. The Embedded Vision Alliance, a worldwide organization of technology developers and providers, is working to empower product creators to transform this potential into reality. BDTI, Cadence, Itseez and Vivante, the co-authors of this article, are members of the Embedded Vision Alliance. Cadence and Vivante are processor suppliers who support OpenVX, while Itseez chairs the OpenVX Working Group at Khronos and BDTI was one of the creators of the OpenVX specification and considers OpenVX for use with its engineering services clients.

First and foremost, the Alliance's mission is to provide product creators with practical education, information, and insights to help them incorporate vision capabilities into new and existing products. To execute this mission, the Alliance maintains a website providing tutorial articles, videos, code downloads and a discussion forum staffed by technology experts. Registered website users can also receive the Alliance’s twice-monthly email newsletter, Embedded Vision Insights, among other benefits.

The Embedded Vision Alliance also offers a free online training facility for vision-based product creators: the Embedded Vision Academy. This area of the Alliance website provides in-depth technical training and other resources to help product creators integrate visual intelligence into next-generation software and systems. Course material in the Embedded Vision Academy spans a wide range of vision-related subjects, from basic vision algorithms to image pre-processing, image sensor interfaces, and software development techniques and tools such as OpenVX, OpenCL and OpenCV. Access is free to all through a simple registration process.

The Alliance also holds Embedded Vision Summit conferences in Silicon Valley. Embedded Vision Summits are technical educational forums for product creators interested in incorporating visual intelligence into electronic systems and software. They provide how-to presentations, inspiring keynote talks, demonstrations, and opportunities to interact with technical experts from Alliance member companies. These events are intended to inspire attendees' imaginations about the potential applications for practical computer vision technology through exciting presentations and demonstrations, to offer practical know-how for attendees to help them incorporate vision capabilities into their hardware and software products, and to provide opportunities for attendees to meet and talk with leading vision technology companies and learn about their offerings.

The next Embedded Vision Summit will take place on May 2-4, 2016 in Santa Clara, California, and will include accompanying half- and full-day workshops. Please reserve a spot on your calendar and plan to attend. Online registration and additional Embedded Vision Summit information are available on the Alliance website.

By Brian Dipert
Editor-in-Chief, Embedded Vision Alliance

Amit Shoham
Distinguished Engineer, BDTI

Pulin Desai
Product Marketing Director, Cadence

Victor Eruhimov
CEO, Itseez
Chairperson, OpenVX Working Group, Khronos

Shang-Hung Lin
Vice President, Vision and Image Product Development, Vivante

Optimizing Fast Fourier Transformation on ARM Mali GPUs


This article was originally published at ARM's website. It is reprinted here with the permission of ARM.

The Fast Fourier Transform (FFT) is a powerful tool in signal and image processing. One very valuable optimization technique for this type of algorithm is vectorization. This article discusses the motivation, vectorization techniques and performance of the FFT on ARM Mali GPUs. For large 1D FFT transforms (greater than 65536 samples), performance improvements of more than 7x are observed.

Background

A few months ago I was involved in a mission to optimize an image-processing application. The application had multiple 2D convolution operations using large-radii blurring filters. Image convolutions map well to the GPU since they are usually separable and can be vectorized effectively on Mali hardware. However, with a growing filter size, they eventually hit a brick wall imposed by computational complexity theory.


An illustration of a 2D convolution (Source: http://www.westworld.be/page/2/).

In the figure above, an input image is convolved with a 3x3 filter. Producing each output pixel requires 9 multiplications and 8 additions. When estimating the time complexity of this 2D convolution, the per-pixel multiplications and additions are assumed to take constant time, so the complexity for an n x n image is approximately O(n²). As the filter grows, however, the number of operations per pixel increases and the constant-time assumption no longer holds. With a non-separable filter, the time complexity approaches O(n⁴) as the filter size becomes comparable to the image size.
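To make the operation count concrete, here is a minimal CPU sketch of a direct 2D convolution with a 3x3 filter (a hypothetical helper for illustration, not code from the article); the two inner loops perform the 9 multiply-accumulate steps per output pixel discussed above:

```c
#include <stddef.h>

/* Direct 2D convolution of an H x W single-channel image with a 3x3
 * filter ("same"-size output, zero padding at the borders). For every
 * output pixel the inner loops perform the 9 multiplications and
 * 8 additions discussed above. */
enum { H = 4, W = 4 };

void convolve3x3(const float in[H][W], const float k[3][3], float out[H][W])
{
    for (int y = 0; y < H; ++y) {
        for (int x = 0; x < W; ++x) {
            float acc = 0.0f;
            for (int ky = -1; ky <= 1; ++ky) {
                for (int kx = -1; kx <= 1; ++kx) {
                    int iy = y + ky, ix = x + kx;
                    /* zero padding: skip taps that fall outside the image */
                    if (iy >= 0 && iy < H && ix >= 0 && ix < W)
                        acc += in[iy][ix] * k[ky + 1][kx + 1];
                }
            }
            out[y][x] = acc;
        }
    }
}
```

With a filter of radius r the inner loops grow to (2r+1)² taps per pixel, which is exactly the cost explosion described above.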

In the era of ever-increasing digital image resolutions, O(n⁴) time is simply not good enough for modern applications. This is where the FFT offers an alternative computing route. With the FFT, convolution operations can be carried out in the frequency domain. The forward and inverse FFT each require O(n² log n) time, a clear advantage over direct convolution in the time/spatial domain, which requires O(n⁴).

The next section assumes a basic understanding of the FFT. A brief introduction to the algorithm can be found here: http://www.cmlab.csie.ntu.edu.tw/cml/dsp/training/coding/transform/fft.html

FFT Vectorization on Mali GPUs

For simplicity, a 1D FFT vectorized implementation will be discussed here. Multi-dimensional FFTs are separable operations, thus the 1D FFT can easily be extended to accommodate higher-dimension transforms. The information flow of the FFT is best represented graphically by the classic butterfly diagram:


16-point Decimation in Frequency (DIF) butterfly diagram.

The transformation is broken into 4 individual OpenCL kernels: 2-point, 4-point, generic and final transformations. The generic and final kernels are capable of varying in size. The generic kernel handles transformations from 8-point to half of the full transformation. The final kernel completes the transformation by computing the full-size butterfly.

The FFT operates within the complex domain. The input data is sorted into a floating-point buffer of real and imaginary pairs:


The structure of the input and output buffer. A complex number consists of two floating-point values. The vector width is also shown.

The first stage of the decimation in time (DIT) FFT algorithm is a 2-point discrete Fourier transform (DFT). The corresponding kernel consists of two butterflies. Each of these two butterflies operates on two complex elements as shown:


An illustration of the first stage kernel. A yellow-grey shaded square represents a single complex number. The yellow outline encloses the butterflies evaluated by a single work item. The same operation is applied to cover all samples.

Each work item has a throughput of 4 complex numbers, 256 bits. This aligns well with the vector width.

In the second stage, the butterflies have a size of 4 elements. Similar to the first kernel, the second kernel has a throughput of 4 complex numbers, aligning with the vector width. The main distinctions are in the twiddle factors and the butterfly network:


The illustration for the second stage kernel: a single 4-point butterfly.

The generic stage is slightly more involved. In general, we would like to:

  • Re-use the twiddle factors
  • Keep the data aligned to the vector width
  • Maintain a reasonable register usage
  • Maintain a good ratio between arithmetic and data access operations

These requirements help to improve the efficiency of memory access and ALU usage. They also help to ensure that an optimal number of work items can be dispatched at a time. With these requirements in mind, each work item for this kernel is responsible for 4 complex numbers in each of 4 butterflies. The kernel essentially operates on 4 partial butterflies and has a total throughput of:

4 complex numbers * 2 floats * 32 bits per float * 4 partial butterflies = 1024 bits per work item

This is illustrated in the following graph:


The graph above represents the 8-point butterfly case for the generic kernel. The left side of the diagram shows 4 butterflies, which associate with a work item. The red boxes on the left diagram highlight the complex elements being evaluated by the work item. The drawing on the right is a close-up view of a butterfly. The red and orange lines highlight the relevant information flow of the butterfly.

Instead of evaluating a single butterfly at a time, the kernel works on portions of multiple butterflies. This essentially allows the same twiddle factors to be re-used across the butterflies. For an 8-point transform as shown in the graph above, the butterfly would be distributed across 4 work items. The kernel is parameterizable from 8-point to N/2 point transform where N is the total length of the original input.

For the final stage, only a single butterfly of size N exists; twiddle factor sharing is not possible. Therefore, the final stage is just a vectorized butterfly network that is parameterized to a size of N.

Performance

The performance of the 1D FFT implementation described in the last section is compared to a reference CPU implementation. In the graph below, the relative performance speed-up is shown from 2⁶ to 2¹⁷ samples. Please note that the x-axis is on a logarithmic scale:


GPU FFT performance gain over the reference implementation.

We have noticed in our experiments that FFT performance on the GPU tends to improve significantly between about 4096 and 8192 samples. The speed-up continues to improve as the sample size grows, as with large samples the performance gain essentially offsets the setup cost of OpenCL. This trend should be even more prominent in higher-dimension transformations.

Summary

The FFT is a valuable tool in digital signal and image processing. The 1D FFT implementation presented can be extended to higher dimension transformations. Applications such as computational photography, computer vision and image compression should benefit from this. What is more interesting is that the algorithm scales well on GPUs. The performance will further improve with more optimization and future support of half-float.

By Neil Tan
Software Engineer, ARM

Deep Dive: Implementing Computer Vision with PowerVR (Part 1: Computer Vision Algorithms)

This article was originally published at Imagination Technologies' website, where it is one of a series of articles. It is reprinted here with the permission of Imagination Technologies.

Speeding Up the Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPUs By Means of OpenCL (Part 3)


This article was originally published at ARM's website. It is reprinted here with the permission of ARM. For more information, please see ARM's developer site, which includes a variety of GPU Compute, OpenCL and RenderScript tutorials.

In this third and last part of this blog series we are going to extend the mixed-radix FFT OpenCL™ implementation to two dimensions and explain how to perform optimisations to achieve good performance on Mobile ARM® Mali™ GPUs. After a short introduction why two-dimensional Fourier transform is an important and powerful tool the first section covers the mathematical background that lets us reuse the existing one-dimensional approach to build the two-dimensional implementation. Afterwards all the steps are described individually and the details are explained. Finally, at the end of each section we show the performance improvements achieved.

So far only the one-dimensional Fourier transform has been considered to explain the mathematical background and basic optimisations. However, in the field of image processing the two-dimensional variant is much more popular. For instance, a simple high-pass filter can be realised in the frequency domain after the image has been transformed by a two-dimensional Fourier transform. The resulting image could be used to implement an edge detection algorithm.

Original image (spatial domain)

Original image (frequency domain)

Removed high frequencies

Filtered image (spatial domain)

High-pass filter via Fourier transform

In general the Fourier transform can be used to make the convolution operations used in many image processing applications more efficient. In particular when the kernel size is large, the reduction in the number of operations is significant. In the spatial domain a two-dimensional convolution requires, for every element of the M x N image, a number of multiplications and additions proportional to the m x n kernel size, so the resulting complexity is O(MNmn). In contrast, in the frequency domain the convolution corresponds to a pointwise multiplication of the transformed kernel and image, which reduces the complexity of the convolution itself to O(MN). Nevertheless, since the Fourier transforms have to be computed additionally, it is not always beneficial to perform the convolution in the frequency domain.

Background

Due to its separability the two-dimensional Fourier transformation can be expressed as two consecutive one-dimensional transformations. The mathematical principles are no different from those explained in the first part of this blog series which means that this background section will be rather short. It just shows briefly how the general form of the two-dimensional discrete Fourier transform can be split into its one-dimensional parts.

The two-dimensional DFT can be expressed in the form

X(k, l) = sum_{m=0..M-1} sum_{n=0..N-1} x(m, n) · W_M^(mk) · W_N^(nl)

where the trigonometric constant coefficients (twiddle factors) are defined as

W_M^(mk) = e^(-2πi·mk/M),  W_N^(nl) = e^(-2πi·nl/N)

The coordinates k and l of the frequency domain are in the range

0 ≤ k < M,  0 ≤ l < N

Since each of the trigonometric constant coefficients depends only on one of the summation indices the inner sum can be extracted and defined as

X'(k, n) = sum_{m=0..M-1} x(m, n) · W_M^(mk)

which is equal to the definition which was previously used to express the one-dimensional DFT. Only the parameter n had to be added to select the correct value in the two-dimensional input space.

If the newly defined one-dimensional transform is inserted into the original expression for the two-dimensional DFT it is easy to see from the following equation that just another one-dimensional transform is calculated

X(k, l) = sum_{n=0..N-1} X'(k, n) · W_N^(nl)
Because of the commutativity of the operations either of the sums can be extracted. Therefore it does not matter over which dimension the first one-dimensional transform is calculated.

Implementations

In the background section the separability of the two-dimensional Fourier transform has been used to represent the two-dimensional as two consecutive one-dimensional transforms, one along each dimension. The row-column algorithm is based on this mathematical fact and is used for this implementation. In this algorithm first a one-dimensional FFT is computed individually for every row of the two-dimensional input data. Afterwards a second one-dimensional FFT is performed for every column. But, instead of having two different functions to compute the FFT for the rows and columns, the same function is used and the input data is transposed between and after the two one-dimensional FFTs. This makes sense because the data layout is assumed to be row-major. Thus the most efficient way in terms of cache locality is to load data along the rows.

Original image

Separate 1D FFTs along the rows (first dimension)

Transpose intermediate image

Separate 1D FFTs along the rows (second dimension)

Transpose final image

Schematics and example of the row-column algorithm

Naive row-column algorithm

The naive implementation directly realises the principle of the row-column algorithm and the execution of the 2D FFT is therefore straightforward. The following diagram provides a first overview over the different stages of the 2D FFT and the pseudo code demonstrates the idea of a naive implementation.

Pseudo code

    // Run the 1D FFT along the rows of the input buffer 
    // The result is stored to a temporary buffer 
    fft_mixed_radix_2d_run_1d_part(input, tmp); 
    
    
    // Transpose the intermediate buffer in-place 
    transpose_complex_list(tmp); 
    
    // Run the 1D FFT along the rows of the (transposed) intermediate buffer 
    // Corresponds to the columns of the original buffer 
    // The result is stored to the output buffer 
    fft_mixed_radix_2d_run_1d_part(tmp, output); 
    
    // Transpose the output buffer in-place 
    transpose_complex_list(output); 

The function to compute the 1D FFT is conceptually the same as the one used in the previous parts of this blog series, except that it has to loop over all rows of the source image. As a reminder: the chosen variant of the mixed-radix FFT requires the input values to be in digit-reversed order to compute the output values in linear order. Therefore the input values have to be shuffled before the actual FFT can be computed. The reordering is repeated for all rows of the matrix. Similarly the computation of the 1D FFT is repeated for all rows independently of each other.
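For the pure radix-2 case the digit reversal reduces to the classic bit-reversal permutation. A minimal sketch of the index computation (a hypothetical helper, not the article's kernel):

```c
/* Bit-reversal permutation index for a radix-2 FFT of length 2^bits.
 * The mixed-radix variant generalises this to digit reversal, but the
 * radix-2 case is the easiest way to see the reordering. */
unsigned bit_reverse(unsigned x, unsigned bits)
{
    unsigned r = 0;
    for (unsigned i = 0; i < bits; ++i) {
        r = (r << 1) | (x & 1);  /* shift the lowest bit of x into r */
        x >>= 1;
    }
    return r;
}
```

For N = 8 this maps the input order 0 1 2 3 4 5 6 7 to 0 4 2 6 1 5 3 7, which is exactly the shuffle applied to every row before the butterfly stages.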

/** 
* @brief Computes a 1D FFT for each row of the input 
* 
* @param[in] input The input matrix 
* @param[out] output The output matrix 
*/ 
fft_mixed_radix_2d_run_1d_part(input, output) 
{ 
    // Iterate over all rows of the input matrix 
    for(row_in in input and row_out in output) 
    { 
        // Perform digit-reverse reordering 
        rev_row = digit_reverse(row_in); 

        // Calculate the mixed radix 1D FFT of the reversed row 
        row_out = ... 
    } 
}

Naive implementation based on OpenCL mixed-radix kernels

This optimisation does not introduce anything terribly new compared to the optimisation of the one-dimensional Fourier transform. The first speedup over the CPU implementation can be achieved by running the computation of the actual FFT on the GPU. For this purpose the buffer with the input values has to be mapped to an OpenCL buffer. Afterwards the reordering to digit-reversed order is performed on the GPU. The rows of the matrix can be reordered independently, so a one-dimensional kernel is enqueued for each one. The reordering is followed by the first one-dimensional part of the FFT, which is now also computed on the GPU. Once the computation has finished, the buffers are mapped again to perform the matrix transpose between the two stages of the FFT on the CPU and out-of-place. Subsequently, the rows are reordered once more and the second one-dimensional FFT is computed. Finally the results of the second stage are transposed again to restore the original data layout and copied over to the output buffer.

A loop is used to enqueue separate and independent one-dimensional kernels for each row of the input data. The kernels take the offset of the row within the linearised data representation as argument buf_offset_float.

Because each complex element of the buffer is represented through two floating point numbers (real and imaginary part) the row offset has to be doubled to get the actual offset within the buffer.

    // Digit reverse stage 
    // Uses a 1D NDRange within the loop to process all rows of the buffer 
    const size_t digit_rev_gws[] = {N}; 
    
    for(uint row = 0; row < N; ++row) 
    { 
        // Two floats per complex number 
        uint buf_offset_float = row * N * 2; 
        clSetKernelArg(digit_reverse_kernel, 0, sizeof(cl_mem), &input_buf); 
        clSetKernelArg(digit_reverse_kernel, 1, sizeof(cl_mem), &digit_reversed_buf); 
        clSetKernelArg(digit_reverse_kernel, 2, sizeof(cl_mem), &idx_bit_reverse_buf); 
        clSetKernelArg(digit_reverse_kernel, 3, sizeof(cl_uint), &buf_offset_float); 
        clEnqueueNDRangeKernel(queue, digit_reverse_kernel, 1, NULL, digit_rev_gws, NULL, 0, NULL, NULL); 
    } 
    
    // Run the first 1D FFT pass along the rows 
    // Uses a 1D NDRange within the loop to process all rows of the buffer 
    for(uint row = 0; row < N; ++row) 
    { 
        // Offset of the current row, two floats per complex number 
        uint buf_offset_float = row * N * 2; 
        uint Nx = 1; 
        uint Ny = radix_stage[0]; 
    
        // First stage 
        const size_t first_stage_gws[] = {N / Ny}; 
        clSetKernelArg(first_stage_radix_kernel, 0, sizeof(cl_mem), &digit_reversed_buf); 
        clSetKernelArg(first_stage_radix_kernel, 1, sizeof(cl_uint), &buf_offset_float); 
        clEnqueueNDRangeKernel(queue, first_stage_radix_kernel, 1, NULL, first_stage_gws, NULL, 0, NULL, NULL); 
    
        // Update Nx 
        Nx *= Ny; 
    
        // Following stages 
        for(uint s = 1; s < n_stages; ++s) 
        { 
            // ...      
        } 
    } 

 

    /**
     * @brief Computes the first stage of a radix-2 DFT.
     *
     * @param[in, out] input  The complex array.
     * @param[in]      offset The offset of the current row in the array,
     *                        in floats (multiplied by two because complex
     *                        numbers are represented through two float values).
     */ 
    kernel void radix_2_first_stage(global float* input, const uint offset) 
    { 
      // Each work-item computes a single radix-2 butterfly 
      uint idx = get_global_id(0) * 4; 
    
      // Load two complex input values 
      float4 in = vload4(0, input + offset + idx); 
    
      // Compute DFT N = 2 
      DFT_2(in.s01, in.s23); 
    
      // Store two complex output values 
      vstore4(in, 0, input + offset + idx); 
    } 

Replacing loops with two-dimensional kernels

In the first, naive GPU implementation separate one-dimensional kernels are enqueued for every row of the matrix. The performance can be significantly increased by using kernels with a two-dimensional NDRange instead of loops. The second dimension replaces the row offset within the linearised data structure which was previously computed in the loop. Besides the index in the second dimension, the offset in the buffer also depends on the size of the matrix, which is passed as the parameter N. Again the offset needs to be multiplied by two to account for the two floats that make up each complex number.

First of all this change reduces the number of kernels and therefore the overhead of enqueuing and dispatching them. Moreover it increases the number of work items which can be executed in parallel, since in principle work items from all rows can now run concurrently. Before, the kernels had to be executed one after the other, so there was only a comparably small number of work items at any point in time, which reduced the GPU's utilisation, particularly for small matrices.

    // Digit reverse stage 
    // Uses a 2D NDRange to process all rows of the buffer 
    const size_t digit_rev_gws[] = {N, N}; 
    clSetKernelArg(digit_reverse_kernel, 0, sizeof(cl_mem), &input_buf); 
    clSetKernelArg(digit_reverse_kernel, 1, sizeof(cl_mem), &digit_reversed_buf); 
    clSetKernelArg(digit_reverse_kernel, 2, sizeof(cl_mem), &idx_bit_reverse_buf); 
    clSetKernelArg(digit_reverse_kernel, 3, sizeof(cl_uint), &N); 
    clEnqueueNDRangeKernel(queue, digit_reverse_kernel, 2, NULL, digit_rev_gws, NULL, 0, NULL, NULL); 
    
    uint buf_stride_float = N * 2; 
    
    // First stage 
    // Uses a 2D NDRange to process all rows of the buffer 
    uint Nx = 1; 
    uint Ny = radix_stage[0]; 
    
    const size_t first_stage_gws[] = {N / Ny, N}; 
    clSetKernelArg(first_stage_radix_kernel, 0, sizeof(cl_mem), &digit_reversed_buf); 
    clSetKernelArg(first_stage_radix_kernel, 1, sizeof(cl_uint), &buf_stride_float); 
    clEnqueueNDRangeKernel(queue, first_stage_radix_kernel, 2, NULL, first_stage_gws, NULL, 0, NULL, NULL); 
    
    // Update Nx 
    Nx *= Ny; 
    
    // Following stages 
    for(uint s = 1; s < n_stages; ++s) 
    { 
        // ...      
    } 

 

    /**
     * @brief Computes the first stage of a radix-2 DFT.
     *
     * @param[in, out] input            The complex array.
     * @param[in]      buf_stride_float The number of floats per row in the array.
     */ 
    kernel void radix_2_first_stage(global float* input, const uint buf_stride_float) 
    { 
      // Each work-item computes a single radix-2 
      uint idx = get_global_id(0) * 4; 
    
      // Compute the offset into the buffer for the current row 
      // Needs to be multiplied by two because complex numbers 
      // are represented through two float values 
      const uint offset = get_global_id(1) * buf_stride_float; // <-- Previously computed in the host code 
    
      // Load two complex input values 
      float4 in = vload4(0, input + offset + idx); 
    
      // Compute DFT N = 2 
      DFT_2(in.s01, in.s23); 
    
      // Store two complex output values 
      vstore4(in, 0, input + offset + idx); 
    } 

Parallel transpose on the GPU

The next optimisation moves the transpose operation to the GPU as well. If it is not performed in-place there are no dependencies between the individual elements at all and the transpose can be fully executed in parallel. For in-place algorithms at least two elements have to be swapped in each step, as otherwise information would be overwritten.

Parallel elementwise out-of-place transpose

This approach resembles the CPU variant of a naive elementwise transpose of a matrix. Each work item reads one element according to its global id from the source buffer and writes it to the corresponding transposed position in the output buffer. The global work group size is thus equal to the size of the matrix (N x N). The elements themselves consist of two floating point values which represent the real and imaginary parts of the complex values.

    /**
     * @brief Transposes a quadratic matrix of complex numbers stored in row major order.
     * 
     * @param[in]  input              The complex input array.
     * @param[out] output             The complex output array.
     * @param[in]  buf_stride_complex The number of complex numbers per row in the array.
     */ 
    kernel void transpose(global float2* input, global float2* output, const uint buf_stride_complex) 
    { 
        uint ix = get_global_id(0); 
        uint iy = get_global_id(1); 
        output[ix * buf_stride_complex + iy] = input[iy * buf_stride_complex + ix]; 
    } 
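The addressing used by this kernel can be modelled on the CPU; the sketch below (illustrative, not the article's code) runs the same per-element copy for every (ix, iy) pair that a work item would handle:

```c
/* CPU model of the elementwise kernel: the loop indices ix and iy play
 * the role of get_global_id(0) and get_global_id(1), and each element is
 * a complex number stored as a pair of floats (like the kernel's float2). */
typedef struct { float re, im; } cfloat;

void transpose_elementwise(const cfloat *in, cfloat *out, int n)
{
    for (int iy = 0; iy < n; ++iy)
        for (int ix = 0; ix < n; ++ix)
            out[ix * n + iy] = in[iy * n + ix];  /* same indexing as the kernel */
}
```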

Parallel blockwise out-of-place transpose

However, this simple approach is not efficient enough to compensate for the additional driver overhead for launching kernels when performing the transpose on the GPU. To increase the performance the transpose operation has to be executed blockwise instead of elementwise. Instead of reading and writing one pixel per work item a small sub-block of the complete matrix is loaded. It is then locally transposed and written back to the corresponding position in the output matrix.

Compared to the naive transpose approach this optimisation increases the cache efficiency and the resource utilisation since each work item operates on more than one pixel. Further it helps to increase the utilisation of the GPU resources by making the kernels larger and slightly more complex. As a result the number of work items decreases, which reduces the overhead of dispatching them for execution. The choice of two-by-two sub-blocks turned out to be optimal since the local transpose of larger sub-blocks requires more registers.

    /**
     * @brief Transposes a quadratic matrix of complex numbers stored in row major order.
     * 
     * @param[in]  input            The complex input array.
     * @param[out] output           The complex output array.
     * @param[in]  buf_stride_float The number of floats per row in the array.
     */ 
    kernel void transpose2x2(global float* input, global float* output, const uint buf_stride_float) 
    { 
        const uint ix = get_global_id(0); 
        const uint iy = get_global_id(1); 
    
        float4 tmp; 
        float4 u0, u1; 
    
        // Load one sub-block from two rows 
        u0 = vload4(ix, input + (iy * 2) * buf_stride_float); 
        u1 = vload4(ix, input + (iy * 2 + 1) * buf_stride_float); 
    
        // Transpose the sub-block 
        tmp = u0; 
        u0 = (float4)(tmp.s01, u1.s01); 
        u1 = (float4)(tmp.s23, u1.s23); 
    
        // Write the sub-block to the transposed position 
        vstore4(u0, iy, output + (ix * 2) * buf_stride_float); 
        vstore4(u1, iy, output + (ix * 2 + 1) * buf_stride_float); 
    } 
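The float4 shuffle above is easy to mis-read, so here is a scalar CPU model of it (illustrative only): each float4 holds the two complex numbers of one row of a 2x2 sub-block, and recombining the .s01/.s23 halves swaps the off-diagonal complex elements:

```c
/* Scalar model of the kernel's float4 shuffle. u0 = {a00, a01} and
 * u1 = {a10, a11} each hold two complex numbers (four floats) from one
 * row of a 2x2 sub-block; after the shuffle u0 holds the first column
 * {a00, a10} and u1 the second column {a01, a11}. */
typedef struct { float s0, s1, s2, s3; } vec4;

void transpose_block2x2(vec4 *u0, vec4 *u1)
{
    vec4 tmp = *u0;
    *u0 = (vec4){ tmp.s0, tmp.s1, u1->s0, u1->s1 };  /* (tmp.s01, u1.s01) */
    *u1 = (vec4){ tmp.s2, tmp.s3, u1->s2, u1->s3 };  /* (tmp.s23, u1.s23) */
}
```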

Speedup

Parallel blockwise in-place transpose

Due to its nature a transpose operation can be realised by always swapping two corresponding elements or larger sub-blocks respectively. This allows us to load, transpose and store two sub-blocks per work item in order to reduce the number of work items and to achieve a better balance between arithmetic and load/store operations. As a side-effect the transpose algorithm can be changed from out-of-place to in-place such that the second buffer is no longer needed.

As a consequence of the transpose-and-swap approach it would be sufficient to have one work item for each block in the upper triangular matrix. However, if a one-dimensional kernel is enqueued with the smallest possible number of work items the transformation of the linear index into the two-dimensional index is less efficient than simply using a two-dimensional kernel and aborting the work items below the diagonal. The diagonal itself requires special treatment because the diagonal sub-blocks do not need to be exchanged with a different sub-block but only have to be transposed locally.

    /**
     * @brief Transposes a quadratic matrix of complex numbers stored in row major order.
     * 
     * @param[in, out] input            The complex array.
     * @param[in]      buf_stride_float The number of floats per row in the array.
     */ 
    kernel void transpose2x2(global float* input, const uint buf_stride_float) 
    { 
        const uint ix = get_global_id(0); 
        const uint iy = get_global_id(1); 
    
        // Abort for sub-blocks below the diagonal 
        if(ix < iy) 
        { 
            return; 
        } 
    
        float4 tmp; 
        float4 v0, v1, u0, u1; 
    
        // Load first sub-block 
        u0 = vload4(ix, input + (iy * 2) * buf_stride_float); 
        u1 = vload4(ix, input + (iy * 2 + 1) * buf_stride_float); 
    
        // Transpose first sub-block 
        tmp = u0; 
        u0 = (float4)(tmp.s01, u1.s01); 
        u1 = (float4)(tmp.s23, u1.s23); 
    
        // Only process a second sub-block if the first one is not on the diagonal 
        if(ix != iy) 
        { 
            // Load second sub-block 
            v0 = vload4(iy, input + (ix * 2) * buf_stride_float); 
            v1 = vload4(iy, input + (ix * 2 + 1) * buf_stride_float); 
    
            // Transpose second sub-block 
            tmp = v0; 
            v0 = (float4)(tmp.s01, v1.s01); 
            v1 = (float4)(tmp.s23, v1.s23); 
    
            // Store second sub-block to transposed position 
            vstore4(v0, ix, input + (iy * 2) * buf_stride_float); 
            vstore4(v1, ix, input + (iy * 2 + 1) * buf_stride_float); 
        }  
    
        // Store first sub-block to transposed position 
        vstore4(u0, iy, input + (ix * 2) * buf_stride_float); 
        vstore4(u1, iy, input + (ix * 2 + 1) * buf_stride_float); 
    } 
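The swap scheme can be sketched on the CPU as follows (illustrative only, using 1x1 "blocks" for brevity; the real kernel swaps 2x2 sub-blocks and additionally transposes each one locally):

```c
/* CPU sketch of the in-place swap scheme: positions with ix < iy abort,
 * the diagonal needs no exchange (a 2x2 sub-block would still have to be
 * transposed locally), and every element above the diagonal is swapped
 * with its mirrored counterpart. */
void transpose_inplace(float *m, int n)
{
    for (int iy = 0; iy < n; ++iy) {
        for (int ix = 0; ix < n; ++ix) {
            if (ix < iy)
                continue;              /* abort below the diagonal */
            if (ix != iy) {            /* swap the two mirrored elements */
                float t = m[iy * n + ix];
                m[iy * n + ix] = m[ix * n + iy];
                m[ix * n + iy] = t;
            }
        }
    }
}
```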

Speedup

Setting local work group size

For the two-dimensional transpose kernel the automatically inferred values for the local work group size were not able to achieve good performance. It turned out to be better to set the local work group size manually to 2 x 4 – or smaller if half the buffer stride (halved because each work item handles a 2 x 2 sub-block) is not a multiple of 4. The best size for the local work groups always depends on the kernel and finding a good size can be hard. However, getting the sizes right can help to achieve better caching behaviour and therefore can improve runtime performance.

    // Setting up constants 
    // For odd matrix sizes padding is necessary 
    // because the transpose kernel stores two complex values 
    const size_t padding = 2 * (N % 2); 
    const cl_uint buf_stride_float = N * 2 + padding; 
    const size_t buf_stride_half = buf_stride_float / 2; 
    
    clSetKernelArg(transpose_kernel, 0, sizeof(cl_mem), &fft_buf); 
    clSetKernelArg(transpose_kernel, 1, sizeof(cl_uint), &buf_stride_float); 
    
    // Set global work group size 
    const size_t transpose_gws[2] = {buf_stride_half, buf_stride_half}; 
    size_t transpose_lws[2]; 
    size_t *transpose_lws_ptr; 
    
    // Determine the largest possible local work group size 
    // It has to divide the global work group size 
    if(buf_stride_half % 4 == 0) 
    { 
        transpose_lws[0] = 2; 
        transpose_lws[1] = 4; 
        transpose_lws_ptr = transpose_lws; 
    } 
    else if(buf_stride_half % 2 == 0) 
    { 
        transpose_lws[0] = 1; 
        transpose_lws[1] = 2; 
        transpose_lws_ptr = transpose_lws; 
    } 
    else 
    { 
        transpose_lws_ptr = NULL; 
    } 
    
    clEnqueueNDRangeKernel(queue, transpose_kernel, 2, NULL, transpose_gws, transpose_lws_ptr, 0, NULL, NULL); 
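The divisibility logic above can be factored into a small helper (hypothetical, not from the article): given a list of candidate local sizes in descending order of preference, pick the first one that divides the global size.

```c
#include <stddef.h>

/* Hypothetical helper mirroring the host logic above: OpenCL 1.x requires
 * the local work group size to divide the global NDRange size in every
 * dimension, so pick the first (largest) candidate that divides it. */
size_t pick_local_size(size_t global, const size_t *candidates, int n)
{
    for (int i = 0; i < n; ++i)
        if (global % candidates[i] == 0)
            return candidates[i];
    return 1;  /* a local size of 1 always divides the global size */
}
```

Passing NULL instead (as the fallback branch above does) lets the driver choose, which is safe but, as measured here, not always fast.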

Speedup

Comparison to radix-2 implementation

The following diagram visualises the benefit in terms of a shorter runtime when the mixed-radix variant is used instead of a radix-2 only implementation. Note the logarithmic scaling of the y-axis. Another advantage of the mixed-radix approach over a radix-2 only FFT is the larger number of supported sizes. For instance, FFTs for matrix sizes 120, 1000 and 3000 cannot be computed with the radix-2 only approach and are therefore missing from the diagram.

Conclusion

These optimisations push the row-column algorithm close to its limits; it might be possible to achieve better performance by changing the underlying FFT algorithm. For instance, if the FFT is not computed in-place, the digit-reversal could be dropped and moreover the transpose could be merged into the mixed-radix FFT kernels. However, it is difficult to predict whether this would bring any benefit.

Useful optimisation aspects

The following list summarises the things that have been considered to increase the performance of the 2D FFT mixed-radix algorithm on Mobile ARM® Mali™ GPUs in this article:

  • Instead of enqueuing many one-dimensional kernels in a loop it is better to use fewer higher-dimensional kernels.
  • The optimal size of the sub-blocks for the blocked transpose depends on the available resources. Increasing the block size has only positive effects as long as the cache is big enough and sufficient registers are available.
  • Padding buffers can help to generalise kernels and can prevent branches in the control flow. For instance, the mapped buffers are padded to the next larger even size in both dimensions to eliminate edge cases during the transpose operation.
  • Loads and stores are most efficient if they follow the data layout, because this maximises cache efficiency.
  • It can be more efficient to conditionally abort superfluous work items in a 2D range directly instead of computing the two-dimensional index out of a linear index.
  • Find ways to parallelise sequential code and keep work items as independent from each other as possible.
  • Compute more than one pixel (or more bytes) per work item.
  • Reduce the number of load/store operations by vectorising. This is especially beneficial because the GPU comprises a vector unit (SIMD).


By Gian Marco Iodice
GPU Compute Software Engineer, ARM

Speeding Up the Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPUs By Means of OpenCL (Part 2)


This article was originally published at ARM's website. It is reprinted here with the permission of ARM. For more information, please see ARM's developer site, which includes a variety of GPU Compute, OpenCL and RenderScript tutorials.

Here we are for the second part of our blog series about the OpenCL™ implementation of the Complex-to-Complex Fast Fourier Transform, based on the mixed-radix method, on Mobile ARM® Mali™ GPUs.

Whilst the first article - Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 1 - presented the basic mathematical background, in the following we are going to look at the 3 main computation blocks behind the mixed-radix FFT, analyzing each in a step-by-step approach from the point of view of both the theory and its efficient, simple implementation in OpenCL.

The development platform used to carry out our performance analysis is the Firefly-RK3288, an all-in-one high-performance multi-core computing platform based on the Rockchip RK3288 SoC, featuring an ARM® Cortex®-A17 @ 1.8GHz and an ARM® Mali™-T760 (Mali GPU driver: r6p0-02rel0).

For all performance evaluations, DVFS has been disabled and the GPU clock has been set at the maximum frequency of 600MHz.


Implementation

Let's take a look at how we store our complex values. The input and output data are stored in a floating-point complex array with 2*N elements, where the real and imaginary parts are placed alternately (interleaved). Two different buffers with 2*N elements each have been used to store the input and output values.
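As a quick sanity check of this interleaved layout, here is a small host-side C sketch. The helper names are ours, for illustration only, and not part of the article's code:

```c
#include <assert.h>

/* Interleaved complex layout: element i keeps its real part at index 2*i
 * and its imaginary part at index 2*i + 1 of a float array of 2*N entries.
 * The helper names below are illustrative, not from the article. */
static float re_part(const float *buf, unsigned int i) { return buf[2 * i]; }
static float im_part(const float *buf, unsigned int i) { return buf[2 * i + 1]; }

/* Three complex values (1,2), (3,4), (5,6) stored interleaved. */
static int layout_ok(void)
{
    const float buf[6] = { 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f };
    return re_part(buf, 1) == 3.0f && im_part(buf, 2) == 6.0f;
}
```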

The FFT mixed-radix pipeline that we are going to study is composed mainly of 3 computation blocks:

  1. Digit-reverse
  2. Twiddle factor multiplication
  3. Radix computation

Indexing scheme

The only real complexity of the mixed-radix FFT is the indexing scheme, which expresses the relation between the indices n and k of 2 generic stages X and Y. This relation is fundamental because it allows us both to know which inputs must be processed by each radix basic element and to correctly perform the digit-reverse and twiddle factor multiplication stages.

Given 2 generic stages X and Y:

  1. n addresses the values in the stage X: n = nx + ny * Nx
  2. k addresses the values in the stage Y: k = ky + kx * Ny

with:

  • nx = n % Nx, scans the columns - nx ∈ [0, Nx - 1]
  • ny = floor(n / Nx), scans the rows - ny ∈ [0, N / Nx - 1]
  • kx = floor(k / Ny), scans the rows - kx ∈ [0, N / Ny - 1]
  • ky = k % Ny, scans the columns - ky ∈ [0, Ny - 1]

From the above picture we can easily see what nx, ny, kx and ky are. As for Nx and Ny: in the case of just 2 radix stages, they are respectively the radix orders of stages X and Y. In the case of more than 2 radix stages, Ny is still the radix order of stage Y, but Nx becomes the product of the radix orders from the first radix stage up to and including radix stage X.

Nx is also known as span or butterfly span. The span is the offset between each complex input of radix basic element and it was introduced in the first article in reference to the radix-2 FFT algorithm.

For example, let's assume we have N = 8 x 4 x 4 x 3 x 2 (5 stages, numbered from 0).
If the stages X and Y were stage 2 and stage 3, Nx would be Nx = 8 x 4 x 4 = 128 and Ny = 3.
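The span in the example can be reproduced with a small host-side helper. This is our own sketch for checking the arithmetic, not part of the article's code:

```c
#include <assert.h>

/* Hypothetical helper: the span Nx of stage X is the product of the radix
 * orders from the first stage up to and including stage X (0-indexed). */
static unsigned int span_nx(const unsigned int *radix, unsigned int stage_x)
{
    unsigned int nx = 1;
    for (unsigned int s = 0; s <= stage_x; ++s)
        nx *= radix[s];
    return nx;
}

/* N = 8 x 4 x 4 x 3 x 2; for X = stage 2 the span is 8 * 4 * 4 = 128. */
static unsigned int span_nx_example(void)
{
    const unsigned int radix[5] = { 8, 4, 4, 3, 2 };
    return span_nx(radix, 2);
}
```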

With a pipeline made up of M radix stages, we would like a relation between the indices k and n of 2 generic stages; in particular, we would like 2 mapping functions of the form:

    n = map_k_to_n(k)        k = map_n_to_k(n)

The general expressions for computing these mapping functions are:

    n = (kx % Nx) + (kx / Nx) * Ni + ky * Nx
    k = (n * Ny) % Ni + (n / Nx) % Ny + Ni * (n / Ni)

where Ni is:

    Ni = Nx * Ny

    /**
     * @brief Map index k to index n
     *
     * @param[in] k  Index k
     * @param[in] Nx It is the span
     * @param[in] Ny It is the radix order of stage Y
     *
     * @return       Index n
     */ 
    uint map_k_to_n(uint k, uint Nx, uint Ny) 
    { 
        uint Ni = Nx * Ny; 
        uint ky = k % Ny;    // Remainder 
        uint kx = k / Ny;    // Integer part 
        uint n = (kx % Nx) + (kx / Nx) * Ni + ky * Nx; 
        return n; 
    } 

 

    /**
     * @brief Map index n to index k
     *
     * @param[in] n  Index n
     * @param[in] Nx It is the span
     * @param[in] Ny It is the radix order of stage Y
     *
     * @return       Index k
     */ 
    uint map_n_to_k(uint n, uint Nx, uint Ny) 
    { 
        uint Ni = Nx * Ny; 
        uint k = (n * Ny) % Ni + (n / Nx) % Ny + Ni * (n / Ni); 
        return k; 
    } 
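Since the two mapping functions must be inverse permutations of [0, N), a host-side C port of them (ours, for testing only) can verify the round trip map_n_to_k(map_k_to_n(k)) == k exhaustively for small sizes:

```c
#include <assert.h>

/* Host-side C ports of the two mapping functions from the article. */
static unsigned int map_k_to_n(unsigned int k, unsigned int Nx, unsigned int Ny)
{
    unsigned int Ni = Nx * Ny;
    unsigned int ky = k % Ny;    /* Remainder */
    unsigned int kx = k / Ny;    /* Integer part */
    return (kx % Nx) + (kx / Nx) * Ni + ky * Nx;
}

static unsigned int map_n_to_k(unsigned int n, unsigned int Nx, unsigned int Ny)
{
    unsigned int Ni = Nx * Ny;
    return (n * Ny) % Ni + (n / Nx) % Ny + Ni * (n / Ni);
}

/* The two mappings must undo each other over the whole index range [0, N). */
static int mappings_are_inverse(unsigned int N, unsigned int Nx, unsigned int Ny)
{
    for (unsigned int k = 0; k < N; ++k)
        if (map_n_to_k(map_k_to_n(k, Nx, Ny), Nx, Ny) != k)
            return 0;
    return 1;
}
```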

Every time we compute a radix stage, it is good to keep the span Nx updated.

    /* Init Nx to 1 */ 
    uint Nx = 1; 

    
    /* Scan each radix stage */ 
    for(uint s = 0; s < n_stages; ++s) 
    { 
         /* Get radix order of stage s */ 
         uint Ny = radix[s]; 
         
         /* Body for computing twiddle factor multiplication and radix computation */ 
         ... 
         
         /* Update Nx */ 
         Nx *= Ny; 
    } 

Digit-reverse

Once we have introduced the indexing scheme, we are ready to analyze the main computation blocks of our pipeline. Let's start with the digit-reverse.

The digit-reverse is the first stage of our pipeline, which places the input elements in a specific order called "digit-reverse order".

The digit-reverse order would be exactly the order of the output elements if we left the input elements in linear order (0, 1, 2,...).

Since we know the relation between the indices n and k, a possible way of obtaining the digit-reverse order is to iterate the mapping function map_n_to_k() from the first to the last stage, in order to find where each input index n would end up.

    /**
     * @brief This function computes the digit-reverse index for each input
     *
     * @param[in]  stage             It contains the radix order for each radix stage
     * @param[out] idx_digit_reverse It contains the digit-reverse order index
     * @param[in]  n_stages          Total number of radix stages
     * @param[in]  N                 Number of inputs
     */ 
    void digit_reverse(const uint* stage, uint* idx_digit_reverse, uint n_stages, uint N) 
    { 
        /* Scan elements */ 
        for(uint n = 0; n < N; ++n) 
        { 
            uint k = n; 
            uint Nx = stage[0]; 
    
            /* Scan stages */ 
            for (uint s = 1; s < n_stages; ++s) 
            { 
                /* radix of stage s-th */ 
                uint Ny = stage[s]; 
                uint Ni = Ny * Nx; 
    
                /* Update k index */ 
                k = (k * Ny) % Ni + (k / Nx) % Ny + Ni * (k / Ni); 
    
                /* Update Nx */ 
                Nx *= Ny; 
            } 
    
            /* K is the index of digit-reverse */ 
            idx_digit_reverse[n] = k; 
        } 
    } 
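For a pure radix-2 pipeline the digit-reverse order must collapse to the classic bit-reversal order, which gives us an easy host-side check. The C port below is ours, written for testing only:

```c
#include <assert.h>

/* Host-side C port of digit_reverse() from the article. */
static void digit_reverse_idx(const unsigned int *stage, unsigned int *idx,
                              unsigned int n_stages, unsigned int N)
{
    for (unsigned int n = 0; n < N; ++n)
    {
        unsigned int k  = n;
        unsigned int Nx = stage[0];
        for (unsigned int s = 1; s < n_stages; ++s)
        {
            unsigned int Ny = stage[s];
            unsigned int Ni = Ny * Nx;
            k = (k * Ny) % Ni + (k / Nx) % Ny + Ni * (k / Ni);
            Nx *= Ny;
        }
        idx[n] = k;
    }
}

/* N = 8 with radix stages {2, 2, 2} must give plain bit-reversal order,
 * i.e. the permutation {0, 4, 2, 6, 1, 5, 3, 7}. */
static int matches_bit_reversal_8(void)
{
    const unsigned int stage[3]    = { 2, 2, 2 };
    const unsigned int expected[8] = { 0, 4, 2, 6, 1, 5, 3, 7 };
    unsigned int idx[8];
    digit_reverse_idx(stage, idx, 3, 8);
    for (unsigned int n = 0; n < 8; ++n)
        if (idx[n] != expected[n])
            return 0;
    return 1;
}
```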

Once we know the digit-reverse index, we can implement a CL kernel that places the input values in the required order.

    /**
     * @brief This kernel stores the input in digit-reverse order
     *
     * @param[in]  input             It contains the input complex values in linear order 
     * @param[out] output            It contains the output complex values in digit-reverse order
     * @param[in]  idx_digit_reverse It contains the digit-reverse order index
     */ 
    kernel void digit_reverse(global float2* input, global float2* output, global uint* idx_digit_reverse) 
    { 
        /* Each work-item stores a single complex value */ 
        const uint n = get_global_id(0); 
    
        /* Get digit-reverse index */ 
        const uint idx = idx_digit_reverse[n]; 
    
        /* Get complex value */ 
        float2 val = (float2)(input[idx]); 
    
        /* Store complex value */ 
        output[n] = val; 
    } 

The digit-reverse kernel uses 2 different cl_buffers:

  1. one for loading the input complex values
  2. one for storing the output complex values in digit-reverse order.

Twiddle Factor Multiplication

Twiddle factor multiplication is the intermediate stage between 2 generic radix stages X and Y. In particular, it multiplies each of the N complex output elements of radix stage X by a specific trigonometric complex constant before passing those values to the next stage Y.

This trigonometric complex constant - the "twiddle" factor - depends on nx and ky:

    W = e^(-j * 2 * PI * nx * ky / Ni)

The twiddle factor multiplication is also highly parallelizable and can be implemented by means of OpenCL in the following manner:

    #define M_2PI_F 6.283185482025146484375f 
    
    #define TWIDDLE_FACTOR_MULTIPLICATION(phi, input)                 \ 
    {                                                                 \ 
        float2 w, tmp;                                                \ 
        w.x = cos(phi);                                               \ 
        w.y = sin(phi);                                               \           
        tmp.x = (w.x * input.x) - (w.y * input.y);                    \ 
        tmp.y = (w.x * input.y) + (w.y * input.x);                    \ 
        input = tmp;                                                  \ 
    } 
   
    /**
     * @brief This kernel computes the twiddle factor multiplication between 2 generic radix stages X and Y 
     *
     * @param[in, out] input It contains the input and output complex values 
     * @param[in]      Nx    It is the span
     * @param[in]      Ny    It is the radix order of stage Y
     * @param[in]      Ni    Nx * Ny
     */ 
    kernel void twiddle_factor_multiplication(global float2* input, const uint Nx, const uint Ny, const uint Ni) 
    { 
        /* Each work-item computes a single complex input */ 
        const uint n = get_global_id(0); 
    
        /* Compute nx */ 
        uint nx = n % Nx; 
    
        /* Compute k index */ 
        uint k = (n * Ny) % Ni + (n / Nx) % Ny + Ni * (n / Ni); 
    
        /* Compute ky */ 
        uint ky = k % Ny; 
    
        /* Compute angle of twiddle factor */ 
        float phi = (-M_2PI_F * nx * ky) / (float)Ni; 
    
        /* Multiply by twiddle factor */ 
        TWIDDLE_FACTOR_MULTIPLICATION(phi, input[n]); 
    } 

The above kernel uses the same cl_buffer for loading and storing.

An interesting optimization concerns the computation of ky. As we know:

    ky = k % Ny

so:

    ky = [ (n * Ny) % Ni  +  (n / Nx) % Ny  +  Ni * (n / Ni) ] % Ny

In the above expression the first term, (n * Ny) % Ni, and the third term, Ni * (n / Ni), are both multiples of Ny (because Ni is a multiple of Ny), so they vanish modulo Ny:

    ky = ( (n / Nx) % Ny ) % Ny

and, by the identity property of the modulo operator, (a % m) % m = a % m, we can therefore compute ky as:

    ky = (n / Nx) % Ny

At this point we can simplify our OpenCL kernel in the following manner:

    kernel void twiddle_factor_multiplication(global float2* input, const uint Nx, const uint Ny, const uint Ni) 
    { 
        /* Each work-item computes a single complex input */ 
        const uint n = get_global_id(0); 
    
        /* Compute nx */ 
        uint nx = n % Nx; 
    
        /* Compute ky */ 
        uint ky = (n / Nx) % Ny;               // <- 
    
        /* Compute angle of twiddle factor */ 
        float phi = (-M_2PI_F * nx * ky) / (float)Ni; 
    
        /* Multiply by twiddle factor */ 
        TWIDDLE_FACTOR_MULTIPLICATION(phi, input[n]); 
    } 
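The correctness of this shortcut can be verified exhaustively on the host. The C sketch below (ours, not from the article) compares ky computed through the full mapping with the direct expression (n / Nx) % Ny:

```c
#include <assert.h>

/* Long way: ky = k % Ny, with k computed through the full mapping. */
static unsigned int ky_via_k(unsigned int n, unsigned int Nx, unsigned int Ny)
{
    unsigned int Ni = Nx * Ny;
    unsigned int k  = (n * Ny) % Ni + (n / Nx) % Ny + Ni * (n / Ni);
    return k % Ny;
}

/* Shortcut derived in the text: ky = (n / Nx) % Ny. */
static unsigned int ky_direct(unsigned int n, unsigned int Nx, unsigned int Ny)
{
    return (n / Nx) % Ny;
}

/* Both expressions must agree for every index n in [0, N). */
static int ky_shortcut_ok(unsigned int N, unsigned int Nx, unsigned int Ny)
{
    for (unsigned int n = 0; n < N; ++n)
        if (ky_via_k(n, Nx, Ny) != ky_direct(n, Nx, Ny))
            return 0;
    return 1;
}
```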

Another important and simple optimization concerns the computation of the angle for the twiddle factor multiplication. As we can observe, the term (-M_2PI_F / (float)Ni) is constant for all work-items and can therefore be passed as an argument to the CL kernel.

    /**
     * @brief This kernel computes the twiddle factor multiplication between 2 generic radix stages X and Y
     * 
     * @param[in, out] input     It contains the input and output complex values 
     * @param[in]      Nx        It is the span
     * @param[in]      Ny        It is the radix order of stage Y
     * @param[in]      Ni        Nx * Ny
     * @param[in]      exp_const (-M_2PI_F / (float)Ni)
     */ 
    kernel void twiddle_factor_multiplication(global float2* input, const uint Nx, const uint Ny, const uint Ni, const float exp_const) 
    { 
        /* Each work-item computes a single complex input */ 
        const uint n = get_global_id(0); 
    
        /* Compute nx */ 
        uint nx = n % Nx; 
    
        /* Compute ky */ 
        uint ky = (n / Nx) % Ny; 
    
        /* Compute angle of twiddle factor */ 
        float phi = (float)(nx * ky) * exp_const;                    // <- 
    
        /* Multiply by twiddle factor */ 
        TWIDDLE_FACTOR_MULTIPLICATION(phi, input[n]); 
    } 

With these 2 simple optimizations, we are able to improve the performance of this kernel by a factor of ~1.6/1.7x as illustrated in the following graph.

Radix computation

The radix stage is the main computation block of the pipeline and is composed of N / S radix-S basic elements.

Each radix-S has S inputs and S outputs and computes an optimized DFT of length S.

Since there is no dependency between the radix basic elements of the same stage, the radix stage is also an embarrassingly parallel problem.

In our implementation each radix-S of a radix stage is performed by a single work-item. Since the radix basic elements of the first stage take their input in linear order, whilst those of the following stages take it according to the span Nx, we are going to use 2 different kernels for the radix basic element:

  1. One kernel for the first stage, which takes the input in linear order
  2. One kernel for the other radix stages

Both kernels use a single cl_buffer for loading and storing.

In the following we are going to describe the implementation of radix 2/3/4/5/7.

Radix-2

Defined as:

    X(k) = x(0) + x(1) * e^(-j * PI * k),    k = 0, 1

it can be expressed in the following matrix form:

    | X(0) |   | 1    1 |   | x(0) |
    | X(1) | = | 1   -1 | * | x(1) |

    #define DFT_2(c0, c1) \ 
    {                     \ 
        float2 v0;        \ 
        v0 = c0;          \ 
        c0 = v0 + c1;     \ 
        c1 = v0 - c1;     \ 
    } 
    
    /**
     * @brief This kernel computes DFT of size 2 for the first stage
     *
     * @param[in, out] input It contains the input and output complex values
     */ 
    kernel void radix_2_first_stage(global float* input) 
    { 
        /* Each work-item computes a single radix-2 */ 
        uint idx = get_global_id(0) * 4; 
    
        /* Load two complex input values */ 
        float4 in = vload4(0, input + idx); 
    
        /* Compute DFT N = 2 */ 
        DFT_2(in.s01, in.s23); 
    
        /* Store two complex output values */ 
        vstore4(in, 0, input + idx); 
    } 
    
    /**
     * @brief This kernel computes DFT of size 2 for the radix stages after the first
     *
     * @param[in, out] input It contains the input and output complex value
     * @param[in]      Nx    It is the span
     * @param[in]      Ni    Nx * Ny
     */ 
    kernel void radix_2(global float2* input, uint Nx, uint Ni) 
    { 
        /* Each work-item computes a single radix-2 */ 
        uint kx = get_global_id(0); 
    
        /* Compute n index */ 
        uint n = (kx % Nx) + (kx / Nx) * Ni; 
    
        /* Load two complex input values */ 
        float2 c0 = input[n]; 
        float2 c1 = input[n + Nx]; 
    
        /* Compute DFT N = 2 */ 
        DFT_2(c0, c1); 
    
        /* Store two complex output values */ 
        input[n] = c0; 
        input[n + Nx] = c1; 
    } 
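A length-2 DFT satisfies X(0) = x0 + x1 and X(1) = x0 - x1, so the butterfly can be checked on the host with a direct C port of the macro (ours, for testing only):

```c
#include <assert.h>

/* Host-side C port of DFT_2, operating on {re0, im0, re1, im1}. */
static void dft2(float c[4])
{
    float v0x = c[0], v0y = c[1];
    c[0] = v0x + c[2];
    c[1] = v0y + c[3];
    c[2] = v0x - c[2];
    c[3] = v0y - c[3];
}

/* x0 = 1 + 2j, x1 = 3 - 1j  ->  X0 = 4 + 1j, X1 = -2 + 3j. */
static int dft2_ok(void)
{
    float c[4] = { 1.0f, 2.0f, 3.0f, -1.0f };
    dft2(c);
    return c[0] == 4.0f && c[1] == 1.0f && c[2] == -2.0f && c[3] == 3.0f;
}
```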

Radix-3

Defined as:

    X(k) = sum_{n=0..2} x(n) * W3^(n * k),    W3 = e^(-j * 2 * PI / 3)

it can be expressed in the following matrix form:

    | X(0) |   | 1    1      1    |   | x(0) |
    | X(1) | = | 1    W3     W3^2 | * | x(1) |
    | X(2) |   | 1    W3^2   W3   |   | x(2) |

    #define SQRT3DIV2       0.86602540378443f 
    
    #define DFT_3(c0, c1, c2)                          \ 
    {                                                  \ 
        float2 v0 = c1 + c2;                           \ 
        float2 v1 = c1 - c2;                           \ 
        c1.x = c0.x - 0.5f * v0.x + v1.y * SQRT3DIV2;  \ 
        c1.y = c0.y - 0.5f * v0.y - v1.x * SQRT3DIV2;  \ 
        c2.x = c0.x - 0.5f * v0.x - v1.y * SQRT3DIV2;  \ 
        c2.y = c0.y - 0.5f * v0.y + v1.x * SQRT3DIV2;  \ 
        c0 = c0 + v0;                                  \ 
    } 
    
    /**
     * @brief This kernel computes DFT of size 3 for the first stage
     *
     * @param[in, out] input It contains the input and output complex values
     */ 
    kernel void radix_3_first_stage(global float* input) 
    { 
        /* Each work-item computes a single radix-3 */ 
        uint idx = get_global_id(0) * 6; 
    
        /* Load three complex input values */ 
        float4 in0 = vload4(0, input + idx); 
        float2 in1 = vload2(0, input + idx + 4); 
    
        /* Compute DFT N = 3 */ 
        DFT_3(in0.s01, in0.s23, in1.s01); 
    
        /* Store three complex output values */ 
        vstore4(in0, 0, input + idx); 
        vstore2(in1, 0, input + idx + 4); 
    } 

    /**
     * @brief This kernel computes DFT of size 3 for the radix stages after the first
     *
     * @param[in, out] input It contains the input and output complex value
     * @param[in]      Nx    It is the span
     * @param[in]      Ni    Nx * Ny
     */ 
    kernel void radix_3(global float2* input, uint Nx, uint Ni) 
    { 
        /* Each work-item computes a single radix-3 */ 
        uint kx = get_global_id(0); 
    
        /* Compute n index */ 
        uint n = (kx % Nx) + (kx / Nx) * Ni; 
    
        /* Load three complex input values */ 
        float2 c0 = input[n]; 
        float2 c1 = input[n + Nx]; 
        float2 c2 = input[n + 2 * Nx]; 
    
        /* Compute DFT N = 3 */ 
        DFT_3(c0, c1, c2); 
    
        /* Store three complex output values */ 
        input[n] = c0; 
        input[n + Nx] = c1; 
        input[n + 2 * Nx] = c2; 
    } 

Radix-4

Defined as:

    X(k) = sum_{n=0..3} x(n) * W4^(n * k),    W4 = e^(-j * 2 * PI / 4) = -j

it can be expressed in the following matrix form:

    | X(0) |   | 1    1    1    1 |   | x(0) |
    | X(1) |   | 1   -j   -1    j |   | x(1) |
    | X(2) | = | 1   -1    1   -1 | * | x(2) |
    | X(3) |   | 1    j   -1   -j |   | x(3) |

    #define DFT_4(c0, c1, c2, c3) \ 
    {                             \ 
        float2 v0, v1, v2, v3;    \ 
        v0 = c0 + c2;             \ 
        v1 = c1 + c3;             \ 
        v2 = c0 - c2;             \ 
        v3.x = c1.y - c3.y;       \ 
        v3.y = c3.x - c1.x;       \ 
        c0 = v0 + v1;             \ 
        c2 = v0 - v1;             \ 
        c1 = v2 + v3;             \ 
        c3 = v2 - v3;             \ 
    } 
    
    /**
     * @brief This kernel computes DFT of size 4 for the first stage
     *
     * @param[in, out] input It contains the input and output complex values
     */ 
    kernel void radix_4_first_stage(global float* input) 
    { 
        /* Each work-item computes a single radix-4 */ 
        uint idx = get_global_id(0) * 8; 
    
        /* Load four complex input values */ 
        float8 in = vload8(0, input + idx); 
    
        /* Compute DFT N = 4 */ 
        DFT_4(in.s01, in.s23, in.s45, in.s67); 
    
        /* Store four complex output values */ 
        vstore8(in, 0, input + idx); 
    } 
    
    /**
     * @brief This kernel computes DFT of size 4 for the radix stages after the first
     *
     * @param[in, out] input It contains the input and output complex value
     * @param[in]      Nx    It is the span
     * @param[in]      Ni    Nx * Ny
     */ 
    kernel void radix_4(global float2* input, uint Nx, uint Ni) 
    { 
        /* Each work-item computes a single radix-4 */ 
        uint kx = get_global_id(0); 
    
        /* Compute n index */ 
        uint n = (kx % Nx) + (kx / Nx) * Ni; 
    
        /* Load four complex input values */ 
        float2 c0 = input[n]; 
        float2 c1 = input[n + Nx]; 
        float2 c2 = input[n + 2 * Nx]; 
        float2 c3 = input[n + 3 * Nx]; 
    
        /* Compute DFT N = 4 */ 
        DFT_4(c0, c1, c2, c3); 
    
        /* Store four complex output values */ 
        input[n] = c0; 
        input[n + Nx] = c1; 
        input[n + 2 * Nx] = c2; 
        input[n + 3 * Nx] = c3; 
    } 

Radix-5

Defined as:

    X(k) = sum_{n=0..4} x(n) * W5^(n * k),    W5 = e^(-j * 2 * PI / 5)

it can be expressed in the matrix form X = [ W5^(n * k) ] * x, with n, k = 0, ..., 4; the symmetries of the powers of W5 are exploited by the constants below:

    #define W5_A    0.30901699437494f 
    #define W5_B    0.95105651629515f 
    #define W5_C    0.80901699437494f 
    #define W5_D    0.58778525229247f 
    
    #define DFT_5(c0, c1, c2, c3, c4)               \ 
    {                                               \ 
        float2 v0, v1, v2, v3, v4;                  \ 
        v0 = c0;                                    \ 
        v1 = W5_A * (c1 + c4) - W5_C * (c2 + c3);   \ 
        v2 = W5_C * (c1 + c4) - W5_A * (c2 + c3);   \ 
        v3 = W5_D * (c1 - c4) - W5_B * (c2 - c3);   \ 
        v4 = W5_B * (c1 - c4) + W5_D * (c2 - c3);   \ 
        c0 = v0 + c1 + c2 + c3 + c4;                \ 
        c1 = v0 + v1 + (float2)(v4.y, -v4.x);       \ 
        c2 = v0 - v2 + (float2)(v3.y, -v3.x);       \ 
        c3 = v0 - v2 + (float2)(-v3.y, v3.x);       \ 
        c4 = v0 + v1 + (float2)(-v4.y, v4.x);       \ 
    } 
    
    /**
     * @brief This kernel computes DFT of size 5 for the first stage
     *
     * @param[in, out] input It contains the input and output complex values
     */ 
    kernel void radix_5_first_stage(global float* input) 
    { 
        /* Each work-item computes a single radix-5 */ 
        uint idx = get_global_id(0) * 10; 
    
        /* Load five complex input values */ 
        float8 in0 = vload8(0, input + idx); 
        float2 in1 = vload2(0, input + idx + 8); 
    
        /* Compute DFT N = 5 */ 
        DFT_5(in0.s01, in0.s23, in0.s45, in0.s67, in1.s01); 
    
        /* Store five complex output values */ 
        vstore8(in0, 0, input + idx); 
        vstore2(in1, 0, input + idx + 8); 
    } 
    
    /**
     * @brief This kernel computes DFT of size 5 for the radix stages after the first
     *
     * @param[in, out] input It contains the input and output complex value
     * @param[in]      Nx    It is the span
     * @param[in]      Ni    Nx * Ny
     */ 
    kernel void radix_5(global float2* input, uint Nx, uint Ni) 
    { 
        /* Each work-item computes a single radix-5 */ 
        uint kx = get_global_id(0); 
    
        /* Compute n index */ 
        uint n = (kx % Nx) + (kx / Nx) * Ni; 
    
        /* Load five complex input values */ 
        float2 c0 = input[n]; 
        float2 c1 = input[n + Nx]; 
        float2 c2 = input[n + 2 * Nx]; 
        float2 c3 = input[n + 3 * Nx]; 
        float2 c4 = input[n + 4 * Nx]; 
    
        /* Compute DFT N = 5 */ 
        DFT_5(c0, c1, c2, c3, c4); 
    
        /* Store five complex output values */ 
        input[n] = c0; 
        input[n + Nx] = c1; 
        input[n + 2 * Nx] = c2; 
        input[n + 3 * Nx] = c3; 
        input[n + 4 * Nx] = c4; 
    }  

Radix-7

Defined as:

    X(k) = sum_{n=0..6} x(n) * W7^(n * k),    W7 = e^(-j * 2 * PI / 7)

it can be expressed in the matrix form X = [ W7^(n * k) ] * x, with n, k = 0, ..., 6; the symmetries of the powers of W7 are exploited by the constants below:

    #define W7_A    0.62348980185873f 
    #define W7_B    0.78183148246802f 
    #define W7_C    0.22252093395631f 
    #define W7_D    0.97492791218182f 
    #define W7_E    0.90096886790241f 
    #define W7_F    0.43388373911755f 
    
    #define DFT_7(c0, c1, c2, c3, c4, c5, c6)                           \ 
    {                                                                   \ 
        float2 v0, v1, v2, v3, v4, v5, v6;                              \ 
        v0 = c0;                                                        \ 
        v1 = W7_A * (c1 + c6) - W7_C * (c2 + c5) - W7_E * (c3 + c4);    \ 
        v2 = W7_C * (c1 + c6) + W7_E * (c2 + c5) - W7_A * (c3 + c4);    \ 
        v3 = W7_E * (c1 + c6) - W7_A * (c2 + c5) + W7_C * (c3 + c4);    \ 
        v4 = W7_B * (c1 - c6) + W7_D * (c2 - c5) + W7_F * (c3 - c4);    \ 
        v5 = W7_D * (c1 - c6) - W7_F * (c2 - c5) - W7_B * (c3 - c4);    \ 
        v6 = W7_F * (c1 - c6) - W7_B * (c2 - c5) + W7_D * (c3 - c4);    \ 
        c0 = v0 + c1 + c2 + c3 + c4 + c5 + c6;                          \ 
        c1 = v0 + v1 + (float2)(v4.y, -v4.x);                           \ 
        c2 = v0 - v2 + (float2)(v5.y, -v5.x);                           \ 
        c3 = v0 - v3 + (float2)(v6.y, -v6.x);                           \ 
        c4 = v0 - v3 + (float2)(-v6.y, v6.x);                           \ 
        c5 = v0 - v2 + (float2)(-v5.y, v5.x);                           \ 
        c6 = v0 + v1 + (float2)(-v4.y, v4.x);                           \ 
    } 
    
    /**
     * @brief This kernel computes DFT of size 7 for the first stage
     *
     * @param[in, out] input It contains the input and output complex values
     */ 
    kernel void radix_7_first_stage(global float* input) 
    { 
        /* Each work-item computes a single radix-7 */ 
        uint idx = get_global_id(0) * 14; 
    
        /* Load seven complex input values */ 
        float8 in0 = vload8(0, input + idx); 
        float4 in1 = vload4(0, input + idx + 8); 
        float2 in2 = vload2(0, input + idx + 12); 
    
        /* Compute DFT N = 7 */ 
        DFT_7(in0.s01, in0.s23, in0.s45, in0.s67, in1.s01, in1.s23, in2.s01); 
    
        /* Store seven complex output values */ 
        vstore8(in0, 0, input + idx); 
        vstore4(in1, 0, input + idx + 8); 
        vstore2(in2, 0, input + idx + 12); 
    } 
    
    /**
     * @brief This kernel computes DFT of size 7 for the radix stages after the first
     *
     * @param[in, out] input It contains the input and output complex value
     * @param[in]      Nx    It is the span
     * @param[in]      Ni    Nx * Ny
     */ 
    kernel void radix_7(global float2* input, uint Nx, uint Ni) 
    { 
        /* Each work-item computes a single radix-7 */ 
        uint kx = get_global_id(0); 
    
        /* Compute n index */ 
        uint n = (kx % Nx) + (kx / Nx) * Ni; 
    
        /* Load seven complex input values */ 
        float2 c0 = input[n]; 
        float2 c1 = input[n + Nx]; 
        float2 c2 = input[n + 2 * Nx]; 
        float2 c3 = input[n + 3 * Nx]; 
        float2 c4 = input[n + 4 * Nx]; 
        float2 c5 = input[n + 5 * Nx]; 
        float2 c6 = input[n + 6 * Nx];  
   
        /* Compute DFT N = 7 */ 
        DFT_7(c0, c1, c2, c3, c4, c5, c6); 
    
        /* Store seven complex output values */ 
        input[n] = c0; 
        input[n + Nx] = c1; 
        input[n + 2 * Nx] = c2; 
        input[n + 3 * Nx] = c3; 
        input[n + 4 * Nx] = c4; 
        input[n + 5 * Nx] = c5; 
        input[n + 6 * Nx] = c6; 
    } 

Merging twiddle factor multiplication with radix stage

Generally there are a couple of important benefits to combining 2 OpenCL kernels into one:

  1. fewer GPU jobs to dispatch, thus reducing possible driver overhead. In our implementation we are going to have (log2(N) - 1) fewer GPU jobs.
  2. fewer memory accesses, which is a great thing not only for performance but also for power consumption.

Regarding this last aspect, we can easily see that, since both the twiddle factor multiplication and the radix computation use the same cl_buffer for loading and storing, if we had 2 separate kernels we would access the same memory locations twice.

Combining these 2 kernels, the new pipeline becomes:

In order to compute the twiddle factor multiplication inside the radix kernel, we need just a small tweak as we can see, for instance, in the following radix-5 OpenCL kernel:

    /**
     * @brief This kernel computes DFT of size 5 for the radix stages after the first
     *
     * @param[in, out] input     It contains the input and output complex value
     * @param[in]      Nx        It is the span
     * @param[in]      Ni        Nx * Ny
     * @param[in]      exp_const (-M_2PI_F / (float)Ni)
     */ 
    kernel void radix_5(global float2* input, uint Nx, uint Ni, float exp_const) 
    { 
        /* Each work-item computes a single radix-5 */ 
        uint kx = get_global_id(0); 
    
        /* Compute nx */ 
        uint nx = kx % Nx;                                   // <- 
    
        /* Compute n index */ 
        uint n = nx + (kx / Nx) * Ni;                        // <- 
    
        /* Load five complex input values */ 
        float2 c0 = input[n]; 
        float2 c1 = input[n + Nx]; 
        float2 c2 = input[n + 2 * Nx]; 
        float2 c3 = input[n + 3 * Nx]; 
        float2 c4 = input[n + 4 * Nx]; 
    
        /* Compute phi */ 
        float phi = (float)nx * exp_const;                  // <- Note: ky no longer appears 
    
        /* Multiply by twiddle factor */ 
        TWIDDLE_FACTOR_MULTIPLICATION(phi, c1); 
        TWIDDLE_FACTOR_MULTIPLICATION(2 * phi, c2); 
        TWIDDLE_FACTOR_MULTIPLICATION(3 * phi, c3); 
        TWIDDLE_FACTOR_MULTIPLICATION(4 * phi, c4); 
    
        /* Compute DFT N = 5 */ 
        DFT_5(c0, c1, c2, c3, c4); 
    
        /* Store five complex output values */ 
        input[n] = c0; 
        input[n + Nx] = c1; 
        input[n + 2 * Nx] = c2; 
        input[n + 3 * Nx] = c3; 
        input[n + 4 * Nx] = c4; 
    } 
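The reason ky disappears from the phi computation is that the s-th input of a radix basic element, located at index n + s * Nx with n = (kx % Nx) + (kx / Nx) * Ni, always has ky = s; this is why c1..c4 are multiplied by phi, 2*phi, 3*phi and 4*phi. A host-side C check of this property (our own sketch) is straightforward:

```c
#include <assert.h>

/* For every radix basic element (indexed by kx) and every input slot s,
 * the ky computed through the general formula (m / Nx) % Ny, with
 * m = n + s * Nx, must collapse to just s. */
static int merged_phi_ok(unsigned int Nx, unsigned int Ny, unsigned int n_radix)
{
    unsigned int Ni = Nx * Ny;
    for (unsigned int kx = 0; kx < n_radix; ++kx)
    {
        unsigned int n = (kx % Nx) + (kx / Nx) * Ni;
        for (unsigned int s = 0; s < Ny; ++s)
        {
            unsigned int m  = n + s * Nx;
            unsigned int ky = (m / Nx) % Ny;   /* general formula */
            if (ky != s)
                return 0;
        }
    }
    return 1;
}
```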

From the following graph we can see that the speed-up is considerable (~1.5x) even for small N.

Mixed-Radix vs Radix-2

We are approaching the end of this second part but before we finish, we'd like to present a final comparison between mixed-radix and radix-2.

In the first article we introduced mixed-radix as the solution for efficiently overcoming the problem of N not being a power of 2. However, if N were a power of 2, what would be the benefit of using mixed-radix rather than radix-2?

As we know, radix-2 is just a special case of mixed-radix. Since our implementation also includes radix-4, 2 consecutive radix-2 stages can be merged into a single radix-4 stage, thus reducing:

  1. The number of radix stages to compute
  2. The number of memory accesses
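The reduction in the number of stages is easy to quantify: a power-of-2 size N needs log2(N) radix-2 stages, while pairing them into radix-4 stages roughly halves that count (with one trailing radix-2 stage when log2(N) is odd). A small C sketch (ours, for illustration) computes both stage counts:

```c
#include <assert.h>

/* Number of radix-2 stages for a power-of-2 size N: log2(N). */
static unsigned int stages_radix2(unsigned int N)
{
    unsigned int s = 0;
    while (N > 1) { N /= 2; ++s; }
    return s;
}

/* Pair radix-2 stages into radix-4 stages; if log2(N) is odd, one
 * radix-2 stage remains. */
static unsigned int stages_radix4_mixed(unsigned int N)
{
    unsigned int log2n = stages_radix2(N);
    return log2n / 2 + log2n % 2;
}
```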

The following graph shows the speed-up achievable by mixed-radix with radix-4 against a pure radix-2 implementation. As we can see, the speed-up is considerable even for small N, reaching 1.7x better performance for N greater than 4096.

Summary

In this second article we presented the implementation of the 3 main FFT mixed-radix computation blocks by means of OpenCL. Regarding the twiddle factor multiplications, we showed how simple changes in the CL kernels can significantly speed up the computation.

We also saw that, although the nature of the pipeline is sequential, each stage represents an embarrassingly parallel problem that can be easily and efficiently implemented with OpenCL.

Finally, with the comparison between mixed-radix and radix-2, we saw that mixed-radix not only provides important flexibility in the choice of N, but also delivers a considerable speed-up for N a power of 2 when radix-4 is available.

In the next and final article of this blog series we are going to learn how to use the implemented FFT mixed-radix as a building block for computing the FFT for the case of 2 dimensions. This is mostly used in Image Processing, Computer Vision and Machine Learning applications.

Ciao,
Gian Marco Iodice
GPU Compute Software Engineer, ARM