Embedded Vision Alliance: Technical Articles

ARM Guide to OpenCL Optimizing Pyramid: Introduction



This chapter introduces OpenCL, the image pyramid, and the suitability of pyramid operations for GPU compute.

GPU compute and pyramid

This guide describes an example optimization process for running pyramid operations using an ARM® Mali™ Midgard GPU. This process can improve performance significantly.

ARM® Mali™ Midgard GPUs support the OpenCL 1.1 Full Profile specification for General Purpose computing on GPU (GPGPU) processing, also known as GPU compute.

This guide provides information on the principles of GPU compute, and advice for software developers who want to improve the use of the available hardware in platforms that perform pyramid operations. It is not a comprehensive guide to optimization and GPU compute. However, you can apply many of these principles to other tasks. The performance gains are given as examples; your results might vary.

What is pyramid?

Pyramid is an algorithm that produces a set of subsampled images that are smaller than the original. Each subsampled image is a level of the pyramid, numbered from the bottom to the top. Subsampling each image produces the image for the next level. Subsampling halves the image size, both vertically and horizontally.

The original image is at the base, also called level zero.

The following figure shows an example image pyramid. The largest image is level zero. The images next to it are level one and level two.


Figure 1-1 Example image pyramid

This guide uses a common pyramid creation process that uses a Gaussian blur convolution. The result is a Gaussian pyramid.
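
A minimal sketch of this process, assuming OpenCV is available (cv2.pyrDown applies a Gaussian filter and then halves each dimension); the input file name and level count are placeholders:

```python
# Gaussian-pyramid sketch using OpenCV. cv2.pyrDown blurs with a Gaussian
# kernel and then drops every other row and column, halving each dimension.
import cv2

def build_gaussian_pyramid(image, levels=3):
    pyramid = [image]                  # level zero is the original image
    for _ in range(levels):
        image = cv2.pyrDown(image)     # blur + 2x subsample -> next level
        pyramid.append(image)
    return pyramid

img = cv2.imread("input.png")          # hypothetical input file
for i, level in enumerate(build_gaussian_pyramid(img)):
    print(f"level {i}: {level.shape[1]}x{level.shape[0]}")
```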

Where are image pyramids used?

Uses of image pyramids include:

  • Computer vision.
  • Feature extraction.
  • Image compression or coding.
  • Reducing computation cost and improving robustness in low-texture areas of stereo vision applications.
  • Blending.
  • Trilinear interpolation.

Because image pyramids are used frequently in many real-time applications, reducing their execution time is an important source of performance improvement.

Computer vision applications

Image pyramids are useful in computer vision applications when an application must search for an object without prior knowledge of its size in the input image.

Size variation can occur because the object being searched for can exist at various sizes, or the object can be at different distances from the viewer. Because of this size variation, a single procedure that is applied only to the original image produces incorrect detections. The creation of multiple images at different resolutions enables the...

Imagination’s Smart, Efficient Approach to Mobile Compute

This article was originally published at Imagination Technologies' website, where it is one of a series of articles. It is reprinted here with the permission of Imagination Technologies.

Measuring GPU Compute Performance

This article was originally published at Imagination Technologies' website, where it is one of a series of articles. It is reprinted here with the permission of Imagination Technologies.

Helping Out Reality

This article was originally published at Intel's website. It is reprinted here with the permission of Intel.

Scalable Electronics Driving Autonomous Vehicle Technologies

This article was originally published at Texas Instruments' website. It is reprinted here with the permission of Texas Instruments.

Making Cars Safer Through Technology Innovation

This article was originally published at Texas Instruments' website. It is reprinted here with the permission of Texas Instruments.

Facial Analysis Delivers Diverse Vision Processing Capabilities


Computers can learn a lot about a person from their face – even if they don’t uniquely identify that person. Assessments of age range, gender, ethnicity, gaze direction, attention span, emotional state and other attributes are all now possible at real-time speeds, via advanced algorithms running on cost-effective hardware. This article provides an overview of the facial analysis market, including size and types of applications. It then discusses the facial analysis capabilities possible via vision processing, along with the means of implementing these functions, including the use of deep learning. It also introduces an industry alliance available to help product creators incorporate robust facial analysis capabilities into their vision designs.

Before you can analyze a face, you first must find it in the scene (and track it as it moves). Once you succeed, however, you can do a lot. With gaze tracking, for example, you're able to determine which direction a person is looking at any particular point in time, as well as more generally ascertain whether their eyes are closed, how often they blink, etc. Using this information, you can navigate a device's user interface by using only the eyes, a function with both flashy (gaming) and practical (medical, for example, for patients suffering from paralysis) benefits. You can also log a person's sleep patterns. You can ascertain whether a driver is dozing off or otherwise distracted from paying attention to the road ahead. And you can gain at least a rudimentary sense of a person's mental and emotional state.

Speaking of emotion discernment, plenty of other facial cues are available for analysis purposes. They include the color and pattern of the skin (both at any particular point in time and as it changes over time), the presence of a smile or a frown, the location, amount, and intensity of wrinkles, etc. The camera you're designing, for example, might want to delay tripping the shutter until the subject has a grin on his or her face. If a vehicle driver is agitated, you might want to dial up the autonomous constraints on his or her driving. A person's emotional state is also valuable information in, say, a computer game. And the retail sales opportunity to tailor the presentation of products, promotions, etc. is perhaps obvious.

In fact, retail is one of many key opportunities for yet another major facial analysis function: audience classification. How old is someone? What's their gender? Ethnicity? Real-time identification of a particular individual may not yet be feasible, unless the sample size is fairly small and pre-curated (as with social media networks) or you're the National Security Agency or another big-budget law enforcement agency. But the ability to categorize and "bin" the customers that enter your store each day, for example, is already well within reach.

In all of these facial analysis use cases, various vision processing architecture alternatives (local, locally connected, "cloud", and combinations of these in a heterogeneous computing approach) are possible, along with diversity in the algorithms that those processors will be running and the 2-D and 3-D cameras that will capture the source images to be analyzed. And performance, power consumption, cost, size and weight are all key considerations to be explored and traded off as needed in any particular implementation.

Market Status and Forecasts

Face recognition technology has been widely used in security applications for many years, although it was thrust into the public spotlight after the 9/11 attacks. The FBI database, for example, supposedly now contains more than 400 million images. Broader facial analysis technology has found applications beyond security and surveillance with the emergence of new algorithms, which enable users to extract a wide range of additional information about facial features, such as skin color and texture, gender, eye gaze, and emotional state. This information is widely being used by the business sector, especially within the healthcare, retail, and transportation industries.

Facial analysis has been extensively developed in academic and research lab settings for some time now, and the quality of its results has improved significantly in recent years. Recognition algorithms can be divided into two main approaches: geometric and photometric. The geometric approach looks at distinguishing features and tries to recognize various aspects of a face. In contrast, the photometric approach statistically converts an image into values and then compares these values with a template in order to maximize a match.

Three-dimensional (3D) image capture can improve the accuracy of facial analysis. This technique uses depth-discerning sensors to capture information about the shape of a face. The information is then used to identify distinctive features, such as the contour of the eye sockets, nose, and chin. 3D imaging typically costs more to implement than conventional 2D imaging, due to the more exotic sensors required, and is also more computationally intensive due to the added complexity of the associated algorithms.

Facial recognition suffers from the typical challenges of any computer vision technology. Poor lighting and environmental conditions, sunglasses, and long hair or other objects partially covering the subject’s face can make analysis difficult. Image resolution, color depth and other quality metrics also play a significant role in facial analysis success. In addition, the face naturally changes as a subject ages. The algorithm must “learn” these changes and then apply appropriate adjustments in order to prevent false-positive matches.

The facial biometrics market, which includes both conventional facial analysis and facial thermography (the measurement of the amount of infrared energy emitted, transmitted, and reflected by various areas of the face, which is subsequently displayed as a subject-unique image of temperature distribution), is still relatively young. Tractica’s market analysis indicates that the sector accounted for $149 million in 2015. Tractica forecasts that the market for facial biometrics will reach $882 million by 2024, representing a compound annual growth rate (CAGR) of 21.8% (Figure 1). Cumulative revenue for the 10-year period from 2015 through 2024 will reach $4.9 billion. Facial analysis technologies represent 57% of the market opportunity during this period, with facial thermography accounting for the balance.


Figure 1. Forecasted worldwide and regional revenue growth (USD) for facial biometrics (courtesy Tractica).

A wide range of emerging applications is driving the forecasted revenue growth. For instance, facial analysis technology has gained significant traction in the retail industry. Applications here include customer behavior analysis, operations optimization, and business intelligence. Facial recognition techniques can be used, for example, to identify a person entering a store. If the person is identified as a frequent visitor, he or she is given preferred treatment. More generally, by analyzing the direction of a person’s gaze, facial analysis technology can analyze the effectiveness of various marketing campaigns. And by identifying customer attributes such as age range, gender, and ethnicity, facial analysis enables retail stores to gain valuable business intelligence.

The financial and gaming markets can use facial recognition technology to identify persons of interest. At an automated teller machine (ATM), for instance, it can be used to identify whether or not the account owner is the person currently accessing the account. Casinos can use it to identify VIP customers as well as troublemakers. Hospitals are using facial recognition systems to detect patient identity theft. Facial detection, analysis and recognition algorithms are also being deployed in intelligent surveillance cameras, capable of tracking individuals as they move as well as identifying known "friendly" persons (family members, etc.). Such capabilities provide manufacturers with significant market differentiators.

Tractica forecasts the top 20 use cases for facial analysis, in terms of cumulative revenue during the period from 2015 to 2024 (Table 1). The top industry sectors for facial recognition, in order of cumulative revenue opportunity during the 2015-2024 forecast period, are:

  1. Consumer
  2. Enterprise
  3. Government
  4. Finance
  5. Education
  6. Defense
  7. Retail
  8. NGOs (non-governmental organizations)
  9. Law Enforcement

Rank   Use Case                                   Revenue, 2015-2024 ($M USD)
1      Consumer Device Authentication             2,358.67
2      Identifying Persons of Interest            1,057.30
3      Time and Attendance                        308.60
4      Cashpoint/ATM                              201.04
5      Visitor Access                             198.51
6      Border Control (entry/exit)                197.07
7      Physical Building Access                   139.46
8      De-duplication                             77.19
9      Online Courses                             60.84
10     National ID Cards                          51.04
11     Biometrics Passports                       43.53
12     Demographic Analysis/Digital Signage       38.88
13     Mobile Banking - High Value 2nd Factor     38.53
14     Border Control (on bus, etc.)              31.16
15     Point of Sale Payment                      29.18
16     Aid Distribution/Fraud Reduction           21.29
17     Fraud Investigation                        13.35
18     Healthcare Tracking                        10.63
19     Voter Registration/Tracking                4.34
20     Benefits Enrollment (proof of life)        2.80
       Total                                      4,883.41

Table 1. Cumulative facial analysis revenue by use case, world markets: 2015-2024 (courtesy Tractica).

Facial analysis technology still has a long way to go before it will realize its full potential; applications beyond security are only beginning to emerge. Tractica believes that facial recognition will find its way into many more applications as accuracy steadily improves, along with camera resolution and other image quality metrics.

Audience Classification

Facial classification (age, gender, ethnicity, etc.) approaches typically follow a four-step process flow: detection, normalization, feature extraction, and classification (Figure 2). Identifying the region of the input image that contains the face, and then isolating it, is an important initial task. This can be accomplished via face landmark detection libraries such as CLandmark and flandmark; these accurate and memory-efficient real-time implementations are suitable for embedded systems.
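
As a minimal illustration of the detect-and-isolate step, the following sketch uses OpenCV's bundled Haar cascade face detector as a stand-in; it does not show the CLandmark or flandmark APIs, and the input file name is hypothetical.

```python
# Face-region isolation sketch using OpenCV's bundled Haar cascade
# (a stand-in for landmark libraries such as CLandmark or flandmark).
import cv2

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

frame = cv2.imread("scene.jpg")                      # hypothetical input image
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

# Crop each detected face region for the later normalization steps.
face_crops = [frame[y:y + h, x:x + w] for (x, y, w, h) in faces]
print(f"isolated {len(face_crops)} face region(s)")
```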


Figure 2. Facial classification encompasses a four-step process flow.

Deep learning training (and subsequent inference) techniques with annotated datasets such as CNNs (convolutional neural networks), which will be further discussed in the next section, can also find use. In such cases, however, implementers need to comprehend and account for any biases present in the training datasets, such as possible over-representation of certain ethnicities as well as over-representation of adults versus children.

Some classification methods benefit from the alignment of the face into an upright canonical pose; this can be achieved by first locating the eyes and then rotating the face image in-plane to horizontally align the eyes. Cropping the area of the image that contains face(s) and resizing these to a specific size for further analysis is often then required. Generally, the various methods are sensitive to face illumination; varying the illumination angle or intensity of a scene can notably alter classification results.
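
A minimal sketch of this eye-based alignment, assuming the eye coordinates are already available from a landmark detector; the output size is an arbitrary choice:

```python
# In-plane alignment sketch: rotate the face so the eyes lie on a
# horizontal line, then resize to a fixed canonical resolution.
import math
import cv2

def align_face(face_img, left_eye, right_eye, out_size=(128, 128)):
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = math.degrees(math.atan2(dy, dx))          # tilt of the eye line
    center = ((left_eye[0] + right_eye[0]) / 2.0,
              (left_eye[1] + right_eye[1]) / 2.0)     # rotate about eye midpoint
    rot = cv2.getRotationMatrix2D(center, angle, 1.0)
    h, w = face_img.shape[:2]
    upright = cv2.warpAffine(face_img, rot, (w, h))   # eyes now horizontal
    return cv2.resize(upright, out_size)              # normalize to a fixed size
```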

The extracted and normalized face images then undergo a series of preprocessing steps, generated by various filters and with various effects, to prepare the images for feature extraction (Figure 3). Binarization followed by segmentation, for example, can be used to remove unnecessary background information as well as highlight feature zones such as eyebrows, the nose and lips. Gaussian blur/smoothing can be used to remove noise from the image. And the Sobel edge filter can be used to emphasize edges of lines in the face, useful in determining age based on the magnitude of wrinkles in areas where they're commonly found.
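
A hedged sketch of these pre-processing filters using OpenCV; the kernel sizes and the Otsu thresholding choice are illustrative, not prescribed by the article:

```python
# Pre-processing sketch matching the filters described above: binarization,
# Gaussian smoothing, and Sobel edge emphasis.
import cv2

gray = cv2.imread("face_crop.png", cv2.IMREAD_GRAYSCALE)        # hypothetical crop

_, binary = cv2.threshold(gray, 0, 255,
                          cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # isolate feature zones
smoothed = cv2.GaussianBlur(gray, (5, 5), 0)                    # remove noise
edges_x = cv2.Sobel(smoothed, cv2.CV_16S, 1, 0, ksize=3)        # horizontal gradients
edges_y = cv2.Sobel(smoothed, cv2.CV_16S, 0, 1, ksize=3)        # vertical gradients
edges = cv2.addWeighted(cv2.convertScaleAbs(edges_x), 0.5,
                        cv2.convertScaleAbs(edges_y), 0.5, 0)   # combined edge map
```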


Figure 3. Pre-processing filters (and their effects) can be useful in emphasizing and understating various facial features.

After pre-processing comes feature extraction. Typically the eyes, brows, nose and mouth are associated with the most important facial features regions. Additional information can also be extracted from secondary features such as the shape of the hairline, the cheeks and the chin, in order to further classify the images into more specific "bins." Gender classification is based on observed differences and variations of the shape and characteristics of various facial features (Table 2). And age classification can be determined from the ratios of measurements of various facial features, as well as from the variety of wrinkles in the cheeks, forehead and eye regions (which become more pronounced as people age).


Table 2. Gender-based feature characteristics.

Classification is done in steps, typically in a "tree" approach. Discriminating the age range after initially determining gender often works well, since facial structure differences between genders frequently exist. Middle-aged males and females often don't show the same facial aging signs, for example, due to females' better skin care habits. Differences in cranial structures and skin colors among various ethnic groups may also influence the results; a tree with a "parent" classification for ethnicity prior to tackling classifications for gender and age may further improve overall results (Figure 4).
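
A sketch of this tree-structured flow follows; the classifier objects are placeholders with a scikit-learn-style predict method, standing in for whatever per-branch models (SVMs, CNNs, etc.) a real system would train.

```python
# Tree-structured ("parent" then "child") demographic classification sketch.
class _Fixed:
    """Placeholder classifier that always returns one label."""
    def __init__(self, label):
        self.label = label
    def predict(self, features):
        return self.label                      # stand-in for a trained model

def classify_demographics(features, ethnicity_clf, gender_clfs, age_clfs):
    ethnicity = ethnicity_clf.predict(features)            # parent classification
    gender = gender_clfs[ethnicity].predict(features)      # branch-specific model
    age_range = age_clfs[(ethnicity, gender)].predict(features)
    return ethnicity, gender, age_range

# Toy wiring to show the flow; every branch gets its own classifier.
result = classify_demographics(
    features=[0.1, 0.7, 0.3],
    ethnicity_clf=_Fixed("group_a"),
    gender_clfs={"group_a": _Fixed("female")},
    age_clfs={("group_a", "female"): _Fixed("25-34")})
print(result)    # ('group_a', 'female', '25-34')
```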


Figure 4. Discriminative demographics classification can improve accuracy for "child" classes.

Video-based classification is more challenging than still-image-based techniques, since video sequences can lead to misclassifications due to frame-to-frame variations in head pose, illumination, facial expressions, etc. In these cases, a temporal majority-voting approach to making classification decisions for a partially tracked face across multiple frames can find use in stabilizing the results, until more reliable frame-to-frame tracking is restored.
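
A minimal sketch of such temporal majority voting; the window size is an arbitrary assumption:

```python
# Keep the last N per-frame labels for a tracked face and report the most
# frequent one, smoothing over frame-to-frame misclassifications.
from collections import Counter, deque

class TemporalVoter:
    def __init__(self, window=15):
        self.labels = deque(maxlen=window)     # sliding window of labels

    def update(self, frame_label):
        self.labels.append(frame_label)
        return Counter(self.labels).most_common(1)[0][0]   # stabilized label

voter = TemporalVoter()
for label in ["male", "female", "male", "male"]:           # per-frame outputs
    stable = voter.update(label)
print(stable)                                              # -> "male"
```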

Emotion Discernment

Identifying emotions is a basic skill learned by humans at an early age and critical to successful social interactions. Most of the time, for most people, a quick glance at another person’s face is sufficient to accurately recognize commonly expressed emotions such as anger, happiness, surprise, disgust, sadness, and fear. Transferring this skill to a machine, however, is a complex task. Researchers have devoted decades of engineering time to writing computer programs that recognize a particular emotion feature with reasonable accuracy, only to have to start over from the beginning in order to recognize a slightly different feature.

But what if instead of programming a machine, you could teach it to recognize emotions with great accuracy? Deep learning techniques are showing great promise in lowering error rates for computer vision recognition and classification, while simultaneously simplifying algorithm development versus traditional approaches. Implementing deep neural networks in embedded systems can give machines the ability to visually interpret facial expressions with near-human levels of accuracy.

A neural network, which can recognize patterns, is considered "deep" if it has at least one hidden middle layer in addition to the input and output layers (Figure 5). Each node is calculated from the weighted inputs sourced from multiple nodes in the previous layer. These weighting values can be adjusted to perform a specific image recognition task, in a process known as neural network training.
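
A minimal NumPy sketch of the per-node computation just described; the ReLU nonlinearity and layer sizes are illustrative assumptions:

```python
# Each node applies a nonlinearity to the weighted sum of the previous
# layer's outputs plus a bias; a whole layer is one matrix-vector product.
import numpy as np

def dense_layer(inputs, weights, biases):
    z = weights @ inputs + biases          # weighted sums, one per node
    return np.maximum(z, 0.0)              # ReLU activation

rng = np.random.default_rng(0)
x = rng.random(64)                         # outputs of the previous layer
hidden = dense_layer(x, rng.random((32, 64)), rng.random(32))
print(hidden.shape)                        # (32,) -- one value per hidden node
```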


Figure 5. A neural network is considered "deep" if it includes at least one intermediary "hidden" layer.

To teach a deep neural network to recognize a person expressing happiness, for example, you first present it with a collection of images of happiness as raw data (image pixels) at its input layer. Since it knows that the result should be happiness, the network recognizes relevant patterns in the picture and adjusts the node weights in order to minimize the errors for the "happiness" class. Each new annotated image showing happiness further refines the weights. Trained with enough inputs, the network can then take in an unlabeled image and accurately analyze and recognize the patterns that correspond to happiness, in a process called inference or deployment (Figure 6).


Figure 6. After being initially (and adequately) trained via a set of pre-annotated images, a CNN (convolutional neural network) or other deep learning architecture is then able to accurately classify emotions communicated via new images' facial expressions.

The training phase leverages a deep learning framework such as Caffe or TensorFlow, and employs CPUs, GPUs, FPGAs or other specialized processors for the training calculations and other aspects of framework usage (Figure 7). Frameworks often also provide example graphs that can be used as a starting point for training. Deep learning frameworks also allow for graph fine-tuning; layers may be added, removed or modified to achieve improved accuracy.
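
A hedged TensorFlow/Keras sketch of this training flow follows; the network layout, the seven-class emotion output, the 48x48 grayscale input shape and the training tensors (x_train, y_train) are all illustrative assumptions, not a recipe from the article.

```python
# Small CNN sketch for emotion classification, built and compiled with Keras.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(48, 48, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(7, activation="softmax"),     # one output per emotion
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Fine-tuning amounts to adding, removing, or modifying layers above and
# retraining; with an annotated dataset in hand you would then call:
# model.fit(x_train, y_train, epochs=20, validation_split=0.1)
```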


Figure 7. Deep learning frameworks provide a training starting point, along with supporting layer fine-tuning to optimize classification outcomes.

Deep neural networks require a lot of computational horsepower for training, in order to calculate the weighted values of the numerous interconnected nodes. Adequate memory capacity and bandwidth, the latter a reflection of frequent data movement during training, are also important. CNNs are the current state-of-the-art deep learning approach for efficiently implementing facial analysis and other vision processing functions. CNN implementations can be efficient, since they tend to reuse a high percentage of weights across the image and may take advantage of two-dimensional input data structures in order to reduce redundant computations.

One of the biggest challenges in the training phase is finding a sufficiently robust dataset to "feed" the network. Network accuracy in subsequent inference (deployment) is highly dependent on the distribution and quality of the initial training data. For facial analysis, options to consider are the emotion-annotated public dataset from FREC (the Facial Expression Recognition Challenge) and the multi-annotated proprietary dataset from VicarVision.

The inference/deployment phase can often be implemented on an embedded CPU, potentially in conjunction with a DSP, vision or CNN processor or other heterogeneous co-processor, and/or dedicated-function hardware pipelines (Figure 8). If a vision-plus-CNN coprocessor combination is leveraged, for example, the scalar unit and vector unit are programmed using C and OpenCL C (for vectorization), but the CNN engine does not require specific programming. The final graph and weights (coefficients) from the training phase can feed into a CNN mapping tool, with the embedded vision processor’s CNN engine then configured and ready to execute facial analysis tasks.


Figure 8. Post-training, the deep learning network inference (deployment) phase is the bailiwick of an embedded processor. Heterogeneous coprocessors, if available, efficiently tackle portions of the total calculation burden.

It can be difficult for the CNN to deal with significant variations in lighting conditions and facial poses, as was previously discussed regarding audience classification; here too, pre-processing of the input images to make the faces more uniform can deliver notable accuracy benefits. Heterogeneous multi-core processing architectures enable, for example, a CNN engine to classify one image while the vector unit simultaneously preprocesses (light normalization, image scaling, plane rotation, etc.) the next image...all at the same time that the scalar unit is determining what to do with the CNN detection results. Image resolution, frame rate, number of graph layers and desired accuracy all factor into the number of parallel multiply-accumulations needed and overall performance requirements.

Gaze Tracking and Other Implementation Details

By combining previously discussed deep learning advances with powerful computational imaging algorithms, very low false-acceptance rates for various facial analysis functions are possible. The analysis execution time can be shortened, and accuracy boosted, not only by proper network training prior to inference but also by programming facial templates to improve scalability, and by integrating various computer vision-based face tracking and analysis technologies. FotoNation, for example, has in its various solutions standardized on 12x12 pixel clusters to detect a face, along with recommending a minimum of 50 pixels' spacing between the eyes.

Leading-edge feature detection technologies examine 45 or more landmarks in real-time video, along with over 60 landmarks in a still image, in order to decipher facial features with high confidence. These features can also be detected across diverse orientations, up to +/-20° of yaw and +/-15° of pitch. Achieving adequate performance across diverse lighting conditions is also a major challenge that must be overcome in a robust facial analysis solution. And full-featured facial analysis approaches also support related technologies such as eye/gaze tracking and liveness detection.

The integration of head and eye position, for example, facilitates locating and tracking the gaze in the scene, even in cases where the eyes aren't encased in an enclosure free of interfering ambient light (such as a virtual reality headset) and more challenging bright-pupil tracking algorithms must therefore find use. Liveness detection technology determines whether the frame contains a real person or a high-resolution photograph. Its algorithm takes into account both any motion of the facial features and the skin's reflections. The integration of all these technologies results in rapid and high-reliability facial analysis that's sufficiently robust to overcome even occlusions caused, for example, by wearing glasses.

Conclusion

Vision technology is enabling a wide range of products that are more intelligent and responsive than before, and thus more valuable to users. Vision processing can add valuable capabilities to existing products. And it can provide significant new markets for hardware, software and semiconductor suppliers (see sidebar "Additional Developer Assistance"). Facial analysis in its various implementations discussed in this article – audience classification, emotion discernment, gaze tracking, etc. – is relevant and appealing to an ever-increasing set of markets and applications. And as the core technologies continue to advance, response-time and accuracy results will improve in tandem, in the process bringing the "holy grail" of near-instantaneous unique-individual identification ever closer to becoming a reality for the masses.

Sidebar: Additional Developer Assistance

The Embedded Vision Alliance, a worldwide organization of technology developers and providers, is working to empower product creators to transform the potential of vision processing into reality. FotoNation, NXP and Synopsys, three of the co-authors of this article, are members of the Embedded Vision Alliance; Tractica is a valued partner organization. The Embedded Vision Alliance's mission is to provide product creators with practical education, information and insights to help them incorporate vision capabilities into new and existing products. To execute this mission, the Embedded Vision Alliance maintains a website providing tutorial articles, videos, code downloads and a discussion forum staffed by technology experts. Registered website users can also receive the Embedded Vision Alliance’s twice-monthly email newsletter, Embedded Vision Insights, among other benefits.

The Embedded Vision Alliance also offers a free online training facility for vision-based product creators: the Embedded Vision Academy. This area of the Embedded Vision Alliance website provides in-depth technical training and other resources to help product creators integrate visual intelligence into next-generation software and systems. Course material in the Embedded Vision Academy spans a wide range of vision-related subjects, from basic vision algorithms to image pre-processing, image sensor interfaces, and software development techniques and tools such as OpenCL, OpenVX and OpenCV, along with Caffe, TensorFlow and other deep learning frameworks. Access is free to all through a simple registration process.

The Embedded Vision Alliance’s annual technical conference and trade show, the Embedded Vision Summit, will be held May 1-3, 2017 at the Santa Clara, California Convention Center.  Designed for product creators interested in incorporating visual intelligence into electronic systems and software, the Summit provides how-to presentations, inspiring keynote talks, demonstrations, and opportunities to interact with technical experts from Embedded Vision Alliance member companies. The Summit is intended to inspire attendees' imaginations about the potential applications for practical computer vision technology through exciting presentations and demonstrations, to offer practical know-how for attendees to help them incorporate vision capabilities into their hardware and software products, and to provide opportunities for attendees to meet and talk with leading vision technology companies and learn about their offerings. Online registration and additional information on the 2017 Embedded Vision Summit are now available.

By Brian Dipert
Editor-in-Chief, Embedded Vision Alliance

Durga Peddireddy
Director of Product Management, FotoNation Limited (a subsidiary of Tessera Holding Corporation)

Ali Osman Ors
Senior R&D Manager, NXP Semiconductors

Gordon Cooper
Product Marketing Manager, Synopsys

Anand Joshi
Principal Analyst, Tractica

Camera Interfaces Evolve to Address Growing Vision Processing Needs


Before a still image or video stream can be analyzed, it must first be captured and transferred to the processing subsystem. Cameras, along with the interfaces that connect them to the remainder of the system, are therefore critical aspects of any computer vision design. This article provides an overview of camera interfaces, and discusses their evolutions and associated standards, both in general and with specific respect to particular applications. It also introduces an industry alliance available to help product creators incorporate robust vision capabilities into their designs.

Cameras, along with the processors and algorithms that analyze the images captured by them, are an integral part of any computer vision system. A variety of interface options, each with a corresponding set of strengths and shortcomings, exist for connecting cameras to the remainder of the design. And the options list isn't static; legacy standards (both de facto and official) lay the foundations for the emerging standards that follow them.

These interface options cover the connections between standalone cameras and systems, as well as the links between camera modules or image sensors and other ICs within an integrated system. Sometimes, camera interface requirements are by necessity application-specific. However, whenever possible, it's beneficial from volume cost amortization and other standpoints to be able to generalize these requirements across classes of applications. And performance, power consumption, cost, breadth of existing industry support (and future trends in this regard) and other factors are all key considerations to be explored and traded off as needed in any particular implementation.

In this article we will first discuss interface standards for industrial vision systems, then turn our attention to automotive-related standards, and conclude with a discussion of intra-system camera- and sensor-to-processor interfaces.

Industrial Vision

In the machine and industrial vision market segment, digital cameras are replacing their analog forebears and find use in diverse applications, including production processes (PCB inspection, glass surface inspection, robot control, etc.) and hyperspectral imaging. Digital cameras are expected to deliver long-term reliability, real-time capability, and exceptional image quality. And as imaging requirements have evolved, so too have fidelity, frame rates and resolution needs. Gone are the days of 15 Hz video captured at 1.3 Mpixels per frame. Today's cameras are capable of operating at hundreds of frames per second and deliver images beyond 20 Mpixels in resolution, along with improved color depth, low-light performance and other quality metrics.

Machine vision applications often require complicated image processing, therefore necessitating the transfer of uncompressed image data from the camera to the host computer. This requirement further increases the demand for robust, high bandwidth interfaces. And because many applications operate in assembly line, remote or other harsh environments, they drive additional requirements for long cable lengths, efficient power transmission, and reliable communication protocols. These sometimes-contending factors have contributed to the development of several different video interfaces to connect machine vision cameras to a processing unit.

In order to meet varying performance and other requirements while maintaining a level of standardization, manufacturers of machine vision hardware and software have developed an open vision interfacing standard called the Generic Interface for Cameras (GenICam), which is hosted by the EMVA (European Machine Vision Association). The intention of GenICam is to ensure that the software API (application programming interface) remains consistent regardless of the hardware interface in use. This commonality makes it possible to combine any compliant camera with any compliant application. The GenICam standard covers a variety of old-to-new interfaces, including IEEE 1394, Camera Link, CoaXPress, GigE Vision and USB3 Vision. The latter two interfaces are the most recent, and address modern requirements for video transmission, device interoperability, and camera configuration and control.

GigE Vision uses gigabit Ethernet as its physical layer for image data transfer. The GigE Vision standard adds two application layer protocols to the UDP protocol stack: GVCP (the GigE Vision Control Protocol) and GVSP (the GigE Vision Stream Protocol). These protocols avoid the overhead that comes with TCP-based protocols while adding the needed reliability (IP packet retransmission, for example) to UDP, together making reliable data rates of up to 120 MBytes/sec feasible. The interface's Ethernet framework enables the creation of camera networks with cable lengths of up to 100 m between network nodes, which makes GigE Vision an ideal interface for multi-camera applications.

The most recent version of the standard, GigE Vision 2.0, introduced in 2011, delivers additional features such as precise PTP (Precision Time Protocol) time stamps and broadcast triggering. The latter allows users to simultaneously trigger cameras in multi-camera applications over the existing Ethernet cable, without need for additional I/O trigger cables, and in a very precise manner with only a few microseconds of latency. And many GigE Vision-compliant cameras offer built-in PoE (Power over Ethernet) capabilities, negating the need for a local power supply. GigE Vision cameras can operate via any standard Gigabit Ethernet port, from a standard desktop PC to an embedded processing board; GigE add-in cards are also available that are specifically designed for GigE Vision (providing a PoE supply or enhanced I/O capabilities, for example). And multi-port GigE cards allow operating up to four GigE Vision cameras with full bandwidth using a single PCIe add-in slot.

If more bandwidth per camera is required, USB3 Vision may be an attractive alternative. Introduced in 2013, this standard uses USB 3.0 as the underlying transmission protocol and offers image data rates up to 350 MBytes/s. Prior to the publication of the USB3 Vision specification, an industry standard for USB devices in the image processing industry did not exist. Various camera manufacturers released proprietary solutions based around the USB 2.0 interface used in the mass market. USB 2.0 capabilities were generally insufficient, however, to ensure adequate component stability and durability, along with overall solution robustness, at the levels required for industrial applications.

Today, in contrast, virtually every new PC and embedded processing board is equipped with USB 3.0 ports. USB3 Vision cameras can therefore be connected to a processing unit without additional hardware. And just as GigE Vision builds on top of UDP, USB3 Vision adds real-time capabilities and reliability to USB 3.0. The plug-and-play nature of USB 3.0 leads to straightforward integration of machine vision cameras into an application. However, USB3 cable lengths are not as long as is possible with GigE, and long USB3 cable spans can also be prone to radiating high EMI levels. Designers should be cognizant of these discrepancies when selecting a camera interface, determining whether data rate or transmission length is more important in their particular application.

ADAS and Autonomous Vehicles

ADAS (advanced driver assistance systems) and even more complex autonomous vehicles require the ability to interface to multiple cameras, located around the automobile and viewing both external and internal perspectives. These standalone cameras leverage commonly used traditional automotive interfaces such as:

  • MOST (Media Oriented Systems Transport): A high-speed network implemented over either optical or electrical physical layers, MOST uses a ring topology.
  • CAN (Controller Area Network): A low-speed (1 Mbit/s) network implemented over an electrical physical layer. CAN nodes connect as a multi-master bus.
  • FlexRay: A medium-speed (10 Mbit/s) network implemented over an electrical physical layer. It can use either a ring or multi-master bus.
  • IDB-1394 (the Intelligent Transportation Systems Data Bus Using IEEE 1394 Technology): A high-speed (400 Mbit/s) network implemented over an electrical physical layer. It uses a daisy-chained topology.
  • LVDS (Low-Voltage Differential Signaling): A point-to-point technology that protocols such as Camera Link can use for control and data transfer purposes (Figure 1).


Figure 1. The Camera Link protocol can operate in a point-to-point fashion via LVDS technology.

Each of these automotive interfaces requires a PHY to convert between the LVCMOS interface signals and levels used by the processor and those required by the protocol. In the case of the MOST protocol, the PHY must also convert between the electrical and optical domains. Should an All-Programmable SoC (single-chip FPGA plus CPU) find use as the processor, for example, either the PS (processing system) or PL (programmable logic) portion of the chip can interface with the PHY. If the design employs the PS, in order to take advantage of included I/O peripherals such as the CAN interface, DDR memory in conjunction with DMA can then be leveraged to transfer image data to the PL portion of the device for further processing.

Each of these traditional automotive protocols delivers different bandwidth capabilities, which define the resolution, frame rate and pixel size that the camera is capable of outputting. Video sent over CAN, for example, must often first be highly compressed, resulting in additional processing complexity to restore the image to adequate quality prior to further processing. More generally, network-based protocols are popular in automotive applications, since they reduce the number of interconnections required between the various nodes. This simplification reduces overall vehicle mass, thereby improving fuel economy and cost. However, the network must deliver the bandwidth required to transfer image data between each camera and the processing nexus, along with the required latency.

The architecture of the network is very important, as different topologies result in varying data transfer and processing requirements. A bus topology, with the processor master addressing various camera slaves as required to obtain image data, can be the simplest approach. Daisy-chained and ring topologies are similar, with data simply passing from one node to the next if not intended for it. In fact, a daisy-chained topology can transform into a ring network if the end daisy-chained nodes are connected. However, the use of daisy-chained and ring topologies results in more complex per-node requirements to route and process messages.

Ethernet and Point-to-point

Speaking of networks, the earlier-discussed Ethernet-based interface for industrial vision also bears consideration in automotive applications. Broadcom, for example, originally developed the Ethernet-derived BroadR-Reach physical layer, with the goal of bringing cost-effective, high-bandwidth connectivity to vehicles. This 100 Mbps communications standard was later standardized as 100BASE-T1 in IEEE 802.3bw-2015 Clause 96. It transmits data over a single copper pair, with 3 bits per symbol, and supports full duplex mode, simultaneously transmitting in both directions. In June 2016, the IEEE P802.3bp working group subsequently also standardized 1000BASE-T1, bringing 1 Gbps bandwidth to low cost, single pair cabling.

Replacing traditional wiring with 100BASE-T1 connections delivers many conceptual advantages. Unshielded twisted pair cables along with smaller and more compact connectors that deliver data at 100 Mbps can reduce connectivity costs by up to 80 percent and cabling weight up to 30 percent versus traditional alternative approaches. However, since even a 1 Mpixel, 4:2:0, 30fps video stream consumes more than 3x the bandwidth available in a 100 Mbps link, it needs to be compressed prior to transmission. Typically H.264's more complex High 10 Intra, High 4:2:2 Intra, and High 4:4:4 Intra profiles find use for video compression, since these profiles support more bits per pixel. Also, these codec variants can operate without storing a full frame in memory due to their intra-only coding schemes, thereby removing the need for external DRAM and resulting in low-cost solutions that retain high quality.
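
As a quick check of the bandwidth claim above, here is a back-of-the-envelope calculation; the 12 bits/pixel figure assumes uncompressed 4:2:0 chroma subsampling.

```python
# Uncompressed 1 Mpixel, 4:2:0, 30 fps stream versus a 100 Mbps 100BASE-T1 link.
pixels = 1_000_000
bits_per_pixel = 12          # 4:2:0 chroma subsampling
fps = 30

stream_mbps = pixels * bits_per_pixel * fps / 1e6
print(stream_mbps)           # 360.0 Mbps
print(stream_mbps / 100)     # ~3.6x the capacity of a 100 Mbps link
```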

BMW first introduced Ethernet in their cars in 2008 for software updates, diagnostics, and private links. In 2011, the company's scope expanded to additionally using Ethernet for infotainment and driver assist. BMW subsequently introduced Ethernet-based cameras for rear and surround view on the company's 2013 vehicle models. And at the Ethernet Technology Day in 2015, BMW announced that it had adopted Ethernet as the main system bus in new 7-series vehicles. Other automotive OEMs are quickly following BMW's lead: Jaguar Land Rover, VW, Audi, Porsche, and General Motors are today also in production with Ethernet-based systems.

On the other end of the implementation spectrum is a point-to-point connection implemented over LVDS. Such interfaces provide for a simple, high performance connection between the processor and each camera. The aggregate cost and complexity, however, are increased versus an alternative networked approach, due to the required cabling between each camera and the processing core. One commonly used protocol implemented over LVDS is Camera Link, which serializes 28 bits of data across five LVDS channels (4 data, 1 clock). In its base configuration, this point-to-point interface is capable of supporting data rates of up to 2.04 Gbps.
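
A rough reconstruction of the 2.04 Gbps figure, under the assumption of the commonly cited 85 MHz maximum Camera Link pixel clock with 24 image-data bits among the 28 serialized bits; treat these constants as assumptions rather than article-supplied values.

```python
# Camera Link base-configuration throughput, back-of-the-envelope.
pixel_clock_hz = 85e6            # assumed maximum base-configuration pixel clock
payload_bits_per_clock = 24      # assumed image-data bits of the 28 serialized bits

payload_gbps = pixel_clock_hz * payload_bits_per_clock / 1e9
raw_gbps = pixel_clock_hz * 28 / 1e9
print(round(payload_gbps, 2))    # 2.04 Gbps of image data
print(round(raw_gbps, 2))        # 2.38 Gbps raw across all serialized bits
```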

MIPI Alliance Standards

To this point in the article, we've discussed several application-focused examples of inter-system interfaces, which connect standalone cameras to distinct processing modules such as a computer. Now, we'll explore intra-system interfaces, intended to connect an image sensor or camera module to a locally located processor within the same system. Much of today's industry standardization effort in this area is spearheaded by the MIPI Alliance; the organization's CWG (Camera Working Group) brings together image sensor companies and application processor companies, along with test equipment and validation IP providers, to develop and advance various imaging and vision interface solutions.

MIPI has developed two different imaging architectures to address industry needs (Figure 2):

  • MIPI CSI-2 is a purpose-built integrated architecture consisting of both application and transport layers, and supporting both MIPI's C-PHY and D-PHY physical layer designs. CSI-2 has achieved broad market adoption in mobile and mobile influenced products, and continues to evolve on a two-year cadence to meet various applications' imaging and vision needs.
  • MIPI CSI-3 is an application stack developed to work with an existing UniPro transport network utilizing MIPI's M-PHY physical layer specification. The CSI-3 v1.1 specification was released in 2014, provides camera and video capabilities, and is stable with no further evolution planned at this time.


Figure 2. The MIPI Alliance's CSI-2 and CSI-3 specifications address varying transport and PHY requirements.

The CSI-2 v1.3 specification, finalized in 2014 and also referred to by MIPI as "Gen2", evolves beyond prior 6 Gbps/10-wire performance limitations (Figure 3). Its link performance gains can deliver enhanced system capabilities such as higher-resolution sensors, faster frame rates, expanded color depths, and HDR (high dynamic range) imaging enhancements. When used in conjunction with the C-PHY, CSI-2 v1.3 operates at up to 5.7 Gbps (2.5 Gsps), using a 3-wire embedded clock and data scheme, with lane scaling of +3. The C-PHY variant also uses innovative coding techniques to reduce channel toggle rate by a factor of 2.28x versus prior approaches.


Figure 3. "Gen2" CSI-2 v1.3 improves both link rate and power consumption metrics versus specification predecessors.

When used in conjunction with the D-PHY, CSI-2 v1.3 operates at up to 2.5 Gbps, using 4 wires (and lane scaling of +2) with differential data and a forwarded differential clock. The 35-40% power consumption reduction of CSI-2 v1.3-based image sensors versus precursors will be valuable in many applications. This reduced power consumption also translates into lower heat emissions, beneficial in medical applications that may involve direct contact with live tissue, such as high-definition "cold cam" imaging, for example.

An explanation of lane scaling follows: A MIPI CSI-2 over D-PHY link requires a minimum of 4 wires; 2 for the forwarded periodic clock, and 2 for the data lane. The link may realize higher bandwidth by scaling the data lanes with multiples of 2 data wires. For instance, high-performance image sensors often use 2 wires for the clock and 8 wires to realize 4 data lanes. Similarly, a MIPI CSI-2 over C-PHY link requires a minimum of 3 wires, for the embedded clock and data. The link may realize higher bandwidth by scaling the embedded clock and data lanes with multiples of 3 wires. For instance, high-performance image sensors often use 9 wires to realize 3 lanes. Moreover, the CSI-2 over C-PHY link utilizes encoding to realize a 2.28x bandwidth gain versus the channel toggle rate. For instance, a 3.5 Gsps CSI-2 over C-PHY lane delivers approximately 8 Gbps of effective bandwidth.
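
To make the lane-scaling arithmetic concrete, the following sketch reproduces the stated figures; the 3.5 Gsps symbol rate and 2.28x encoding gain come from the text, and the three-lane total matches the 24 Gbps example later in this article.

```python
# C-PHY effective bandwidth: symbols per second times encoding gain, per lane (trio).
symbol_rate_gsps = 3.5
encoding_gain = 2.28          # payload bits per symbol from the C-PHY encoding
lanes = 3                     # 3 lanes = 9 wires

per_lane_gbps = symbol_rate_gsps * encoding_gain
print(round(per_lane_gbps, 1))            # ~8.0 Gbps per 3-wire lane
print(round(per_lane_gbps * lanes, 1))    # ~23.9 Gbps over 9 wires
```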

An explanation of why the reduced toggle rate of the C-PHY version of CSI-2 v1.3 is important follows: Image sensors often use process nodes that are a few generations older than those employed by application processors. Therefore, a reduced toggle rate (i.e. switching rate) may be beneficial with these older process node technologies.

CSI-2 v2.0 and Successors

The latest published CSI-2 standard is v2.0, released in 2016 and also referred to as "Gen3". When used in conjunction with the C-PHY, CSI-2 v2.0 operates at up to 8 Gbps (3.5 Gsps), as before leveraging a 3-wire embedded clock and data scheme with lane scaling of +3. And when used in conjunction with the D-PHY, CSI-2 v2.0 operates at up to 4.5 Gbps, again leveraging 4 wires and lane scaling of +2. CSI-2 v2.0 also delivers five other key imaging enhancements versus its specification predecessors.

RAW-16 and RAW-20 Support

CSI-2 v2.0 introduces RAW-16 and RAW-20 color depth imaging capabilities derived from scaling the prior-generation RAW-8 and RAW-10 pipeline architectures. Such advancements can significantly improve HDR and SNR (signal-to-noise ratio) capabilities for a variety of vision applications.

Enhanced Virtual Channels

The proliferation of image sensors, both standalone and coupled to aggregators, along with multi-exposure HDR imaging, is driving the requirement for considerably higher VC (virtual channel) support on a variety of platforms. CSI-2 v2.0 provides much-needed VC expansion beyond prior CSI-2 specification revisions, which were limited to 4 VCs (Figure 4). For example, if an ADAS-equipped or autonomous vehicle carries three pairs of sensors supporting short-, mid- and long-range viewing, with each pair implementing both three-exposure HDR and IR (infrared) image capture capabilities, the system would require 24 VCs.


Figure 4. Expanded virtual channel support addresses increasingly common multi-camera configurations.

CSI-2 v2.0 over D-PHY now supports up to 16 VCs. In this particular physical layer implementation, the VC expansion also requires a newly architected ECC (error-correcting code) implementation in order to recover up to two bits' worth of information. Conversely, CSI-2 v2.0 over C-PHY directly supports up to 32 VCs, with a path to extend this support up to 1,024 VCs. Expanding VCs for CSI-2 2.0 over C-PHY also utilizes existing reserved (i.e. previously unused) fields.

An explanation of virtual channels follows: Image sensors often capture images in multiple data formats (RAW, compressed JPEG, metadata, etc.) and must often also support multiple-exposure HDR (high dynamic range) imaging requirements. Each distinct data format within an image sensor is assigned a unique virtual channel in order for the application processor to differentiate the incoming payloads, corresponding to the data type frames from physical sensors. Moreover, when multiple image sensors are combined using a CSI-2 Aggregator, virtual channels are used to spatially differentiate the data frames from individual sensors.

Latency Reduction and Transport Efficiency Benefits

CSI-2 v2.0's LRTE (Latency Reduction Transport Efficiency) PDQ (Packet Delimiter Quick) scheme enables the elimination of legacy packet delimiter overhead via a new and efficient packet delimiter (Figure 5). Key benefits of the LRTE PDQ include:

  1. An increase in the number of supported image sensors via CSI-2 aggregators, without any change in the underlying PHY, connector, cable and SERDES (serializer/de-serializer) infrastructure on a given platform, yielding considerable system cost savings.
  2. A considerable reduction in latency for real-time perception and decision-making applications.
  3. Optimal transport while preserving CSI-2 packet delimiter integrity.


Figure 5. CSI-2 v2.0's LRTE (latency reduction transport efficiency) feature optimizes bandwidth utilization versus legacy approaches.

Also, in addressing electrical overstress that might normally be exhibited by the advanced process nodes now used to fabricate both image sensors and application processors, LRTE provides an optional alternate low-power signaling feature, along with a transition path to eventually phase out current high-voltage (1.2 V) requirements.

The following system design example, summarized in Table 1, analyzes the benefits of CSI-2 v2.0 with LRTE PDQ for popular 2.3 Mpixel / 12 BPP (bits per pixel) / 60 FPS (frames per second) automotive image sensors; latency is reduced by more than 23% in comparison to continued use of legacy C-PHY packet delimiters. A short script reproducing this arithmetic follows the example.

Example system: CSI-2 v2.0 over C-PHY v1.2 (3.5 Gsps per lane, i.e. approximately 8 Gbps per lane)

  • 24 Gbps using 3 lanes (requires 9 wires)
  • CSI-2 v2.0 supports up to 32 VC over C-PHY

Multi-sensor aggregator with each image sensor configured as 1920x1200 pixels (2.3 Mpixels), 12 BPP, 60 FPS

Image sensor packet transport:

  • 1,920 pixels x 12 BPP = 23,040 bits per horizontal row LP (long packet)
  • 23,040 bits / 24 Gbps = 0.960 µs per LP (CSI-2 payload, ignoring the packet header and footer)
  • PH (packet header) + PF (packet footer) = 336 bits + 48 bits = 384 bits
  • 23,424 bits / 24 Gbps = 0.976 µs per LP_PHF (CSI-2 payload with PH and PF)

CSI-2 v2.0 with legacy C-PHY packet delimiters:

  • LP delimiter (SoT/EoT) = ~0.3 µs
  • Total time per LP_PHF = 1.276 µs
  • Packet delimiter overhead = 23.5%
  • Time per image frame = 1.276 µs x 1,200 rows = 1.53 ms (ignoring FS/FE)
  • Streaming 60 frames from a single image sensor requires 1.53 ms x 60 = 91.87 ms
  • Maximum supported image sensors per aggregator = FLOOR[1/0.09187] = FLOOR[10.88] = 10

CSI-2 v2.0 with LRTE PDQ:

  • Time per image frame = 0.976 µs x 1,200 rows = 1.171 ms
  • Streaming 60 frames from a single image sensor requires 1.171 ms x 60 = 70.27 ms
  • Maximum supported image sensors per aggregator = FLOOR[1/0.07027] = FLOOR[14.23] = 14

Benefits of CSI-2 v2.0 with LRTE PDQ:

  • Frame transport efficiency improvement: 23.5%
  • Additional supported image sensors on aggregator: +4 (over the same channel)
  • Alternatively, you can reduce the toggle rate and/or number of wires
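
The short sketch below reproduces the link-budget arithmetic of this example; all constants come directly from the example itself (a 24 Gbps link built from 3 C-PHY lanes at 3.5 Gsps, 1920x1200 / 12 BPP / 60 FPS sensors, and an ~0.3 µs legacy packet delimiter).

```python
# Link-budget arithmetic for the CSI-2 v2.0 aggregator example above.
import math

LINK_GBPS = 24.0
ROW_BITS = 1920 * 12                      # 23,040 bits per long packet (LP)
PH_PF_BITS = 336 + 48                     # packet header + footer
LEGACY_DELIMITER_US = 0.3                 # legacy C-PHY SoT/EoT overhead
ROWS = 1200
FPS = 60

lp_us = (ROW_BITS + PH_PF_BITS) / (LINK_GBPS * 1e3)   # 0.976 us per LP_PHF

def sensors_per_aggregator(delimiter_us):
    frame_ms = (lp_us + delimiter_us) * ROWS / 1e3    # time per image frame
    stream_ms = frame_ms * FPS                        # one sensor's 60 frames
    return math.floor(1000.0 / stream_ms), round(stream_ms, 2)

print(sensors_per_aggregator(LEGACY_DELIMITER_US))    # (10, ~91.87 ms) legacy delimiters
print(sensors_per_aggregator(0.0))                    # (14, ~70.27 ms) with LRTE PDQ
```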


Table 1. Summary of design example efficiency and other improvements

DPCM 12-10-12

Leveraging the success of DPCM (differential pulse code modulation) 12-8-12 for data transfer purposes, DPCM 12-10-12 has subsequently been developed in order to further boost image quality (Figure 6). DPCM 12-10-12 reduces the maximum absolute error of a single-bit change in pixel value by a factor of 4.43x. Furthermore, the MTF (modulation transfer function) frequency response analysis closely tracks the original input, showing that the converted image is devoid of compression artifacts over various illumination and contrast combinations and is therefore virtually indistinguishable from the original image.



Figure 6. CSI-2 2.0's DPCM 12-10-12 reduces PCM errors by a factor of 4.43x (top) along with notably reducing compression artifacts (bottom) in comparison to prior-generation DPCM 12-8-12.

Table 2 illustrates the link bandwidth reduction and cost savings benefits of using CSI-2 DPCM for relatively popular imaging formats.


Table 2. Bandwidth reduction and cost savings for CSI-2 DPCM with various imaging formats.

An explanation of why DPCM is important follows: differential pulse code modulation provides compression while preserving object edges, devoid of lossy compression artifacts. This attribute is particularly useful for vision applications; recognizing characters on a street sign, for example. And the DPCM compression techniques are passed through a myriad of objective qualifications, such as high/medium/low contrast modular transfer function response testing.
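
For illustration only, the following is a generic, toy DPCM encoder/decoder; it is not the CSI-2 DPCM 12-10-12 mapping (whose predictor and code tables are defined by the MIPI specification), but it shows the basic predict-and-quantize idea behind transmitting pixel differences rather than raw pixel values.

```python
# Generic DPCM illustration: each pixel is predicted from its left neighbor
# and only the quantized difference is carried on the link.
def dpcm_encode(row, step=4):
    prev, out = 0, []
    for pixel in row:
        diff = pixel - prev
        q = int(round(diff / step))        # coarse quantization of the residual
        out.append(q)
        prev = prev + q * step             # track the decoder's reconstruction
    return out

def dpcm_decode(codes, step=4):
    prev, out = 0, []
    for q in codes:
        prev = prev + q * step
        out.append(prev)
    return out

row = [100, 102, 101, 180, 182, 181]       # a toy scanline
print(dpcm_decode(dpcm_encode(row)))       # close to the original values
```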

Lowered Electromagnetic Interference

CSI-2 v2.0 supports Galois Field 2^16 scrambling to lower electromagnetic interference by reducing PSD (power spectral density) emissions across a broad range of frequencies, both for the CSI-2 over C-PHY channel (embedded clock and data) and the CSI-2 over D-PHY data lanes (Figure 7). Standard scrambling utilizes a lane-based seed reinitialized using the C/D-PHY Sync signals, while enhanced scrambling is provided for CSI-2 over C-PHY with unique Sync words.



Figure 7. CSI-2 over D-PHY (top) and C-PHY (bottom) both exhibit decreased PSD (power spectral density) emissions when Galois Field scrambling is active.

Table 3 summarizes CSI-2 v2.0 imaging conduit capabilities.

  • RAW-16: Improves intra-scene dynamic range. Requires C-PHY v1.2? No. Requires D-PHY v2.1? No. Notes: supports legacy C/D-PHYs.
  • RAW-20: Improves intra-scene dynamic range. Requires C-PHY v1.2? No. Requires D-PHY v2.1? No. Notes: supports legacy C/D-PHYs.
  • Galois Field 2^16 Scrambling: Reduces PSD emissions. Requires C-PHY v1.2 / D-PHY v2.1? Standard scrambling: No for either; enhanced scrambling: Yes (an updated C-PHY Sync detector is required). Notes: supports legacy C-PHY (embedded clock and data) and legacy D-PHY (data lanes only).
  • LRTE PDQ: Reduces latency, enhances sensor aggregation, improves efficiency. Requires C-PHY v1.2? Preserve integrity: No. Requires D-PHY v2.1? Preserve integrity (HS-Idle): Yes; Drop integrity: No. Notes: maintaining CSI PGCD integrity uses the legacy C-PHY Sync Code and requires the new Gen3 D-PHY v2.1 HS-Idle; dropping CSI PCCD integrity supports legacy D-PHY.
  • LRTE ALPS: Transition path to reduce/eliminate 1.2 V requirements. Requires C-PHY v1.2? Yes (ALP Codes). Requires D-PHY v2.1? Yes (LVLP). Notes: Gen3 D-PHY v2.1 supports LVLP; Gen3 C-PHY v1.2 supports ALP Codes (a C-PHY logic change is required, and LVLP is nice-to-have).
  • DPCM 12-10-12: Reduces RAW-12 bandwidth while preserving SNR. Requires C-PHY v1.2? No. Requires D-PHY v2.1? No. Notes: supports legacy C/D-PHYs.
  • Virtual Channel Extension: Sensor aggregation, multi-exposure/HDR and IR support. Requires C-PHY v1.2? No (supports 32 VCs). Requires D-PHY v2.1? No (supports 16 VCs). Notes: CSI-2 over C-PHY reuses reserved (RES) fields with a path to 1,024 VCs; CSI-2 over D-PHY requires new ECC logic to support 16 VCs.
  • Gen1/Gen2 IOT Fixes: ECC syndrome, embedded data interleave, lane de-skew and elastic store. Requires C-PHY v1.2? No. Requires D-PHY v2.1? No. Notes: fixes for issues identified in prior product-interoperability testing.

Table 3. CSI-2 v2.0 key features.

As previously mentioned, the MIPI Camera Working Group evolves CSI-2 at a two-year cadence (Figure 8). "Gen4" CSI-2, currently scheduled for finalization in 2018, will build on prior generation specifications with new features, potentially including:

  • A unified CSI-2 system architecture, including an embedded Camera Command Interface, for long-reach applications.
  • Architecture revisions to support end-to-end security, decoupled from the underlying PHY, SerDes, aggregators, repeaters, networks and other building blocks.
  • Features common to machine learning and inference applications, including CNNs (convolutional neural networks) and HoG (histogram of oriented gradients).

Such so-called "common denominators" encompass a broad array of provisions to aid imaging and vision applications. For instance, AR/VR (augmented reality/virtual reality) panels require support for extremely high resolutions in order to deliver near-20/20 vision. One Gen4 CSI-2 common denominator presently under consideration is the inclusion of (x,y) coordinates from eye-tracking image sensors, sent as a CSI-2 data type in low-latency (LRTE) mode to the graphics or other compute engine residing in an application processor. This engine could then conserve power, using reduced resources to generate high-resolution content only in the limited area surrounding the (x,y) focal region.

For more information on CSI-2, including webinars, tutorials, and contact details, visit the MIPI CWG site at http://mipi.org/working-groups/camera.

Conclusion

Vision technology is enabling a wide range of products that are more intelligent and responsive than before, and thus more valuable to users. Vision processing can add valuable capabilities to existing products. And it can provide significant new markets for hardware, software and semiconductor suppliers (see sidebar "Additional Developer Assistance"). Reliably transferring still and video information from a camera or image sensor capture source to its processing destination is a critical aspect of any vision processing system design. Ever-increasing application demands for higher quality images at increasing resolutions and frame rates drive ongoing evolutions and revolutions in these interfaces, for both intra- and inter-system configurations.

Sidebar: Additional Developer Assistance

The Embedded Vision Alliance, a worldwide organization of technology developers and providers, is working to empower product creators to transform the potential of vision processing into reality. Allied Vision, Basler, Intel, videantis and Xilinx, the co-authors of this article, are members of the Embedded Vision Alliance; the MIPI Alliance is a valued partner organization. The Embedded Vision Alliance's mission is to provide product creators with practical education, information and insights to help them incorporate vision capabilities into new and existing products. To execute this mission, the Embedded Vision Alliance maintains a website providing tutorial articles, videos, code downloads and a discussion forum staffed by technology experts. Registered website users can also receive the Embedded Vision Alliance’s twice-monthly email newsletter, Embedded Vision Insights, among other benefits.

The Embedded Vision Alliance also offers a free online training facility for vision-based product creators: the Embedded Vision Academy. This area of the Embedded Vision Alliance website provides in-depth technical training and other resources to help product creators integrate visual intelligence into next-generation software and systems. Course material in the Embedded Vision Academy spans a wide range of vision-related subjects, from basic vision algorithms to image pre-processing, image sensor interfaces, and software development techniques and tools such as OpenCL, OpenVX and OpenCV, along with Caffe, TensorFlow and other deep learning frameworks. Access is free to all through a simple registration process.

The Embedded Vision Alliance’s annual technical conference and trade show, the Embedded Vision Summit, will be held May 1-3, 2017 at the Santa Clara, California Convention Center.  Designed for product creators interested in incorporating visual intelligence into electronic systems and software, the Summit provides how-to presentations, inspiring keynote talks, demonstrations, and opportunities to interact with technical experts from Embedded Vision Alliance member companies. The Summit is intended to inspire attendees' imaginations about the potential applications for practical computer vision technology through exciting presentations and demonstrations, to offer practical know-how for attendees to help them incorporate vision capabilities into their hardware and software products, and to provide opportunities for attendees to meet and talk with leading vision technology companies and learn about their offerings. Online registration and additional information on the 2017 Embedded Vision Summit are now available.

By Brian Dipert
Editor-in-Chief, Embedded Vision Alliance

Michael Melle
Sales Development Manager, Allied Vision

Francis Obidimalor
Marketing Manager, Allied Vision

Mark Hebbel
Head of New Business Development, Basler

Frank Karstens
Product Manager, Basler

Haran Thanigasalam
Senior Platform Architect, Intel
Camera Workgroup Chairman, MIPI Alliance

Marco Jacobs
Vice President of Marketing, videantis

Aaron Behman
Director of Strategic Marketing for Embedded Vision, Xilinx

ARM Guide to OpenCL Optimizing Canny Edge Detection: The Test Method



This chapter describes how to re-use the sample implementation and test the performance of optimizations.

How to test the optimization performance

The following performance test method produces the results that are shown in this guide:

  1. Use the difference between the OpenCL event profiling values returned for CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END to measure the time that the kernel takes on the GPU (see the sketch after this list).
  2. Measure the execution-time ratio between the optimized implementation and the implementation without the optimizations, to evaluate the performance increase that each optimization achieves. This enables you to see the benefits of each optimization as they are added.
  3. Run the kernel across various image resolutions, to see how different optimizations affect different resolutions. Depending on the use case, the performance of one resolution might be more important than the others. For example, a real-time webcam feed requires different performance compared to taking a high-resolution photo with a camera.
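
A minimal host-side sketch of step 1 is shown below, written in C against the standard OpenCL API. It assumes that the command queue was created with the CL_QUEUE_PROFILING_ENABLE property and that the kernel arguments have already been set; error checking is omitted for brevity, and the function name is illustrative.

    #define CL_TARGET_OPENCL_VERSION 110   /* target OpenCL 1.1 */
    #include <CL/cl.h>

    /* Returns the GPU execution time of one kernel launch, in nanoseconds. */
    cl_ulong time_kernel_ns(cl_command_queue queue, cl_kernel kernel,
                            cl_uint work_dim, const size_t *global_size,
                            const size_t *local_size)
    {
        cl_event event;
        cl_ulong start = 0, end = 0;

        clEnqueueNDRangeKernel(queue, kernel, work_dim, NULL,
                               global_size, local_size, 0, NULL, &event);
        clWaitForEvents(1, &event);          /* wait until the kernel has finished */

        clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                                sizeof(start), &start, NULL);
        clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                                sizeof(end), &end, NULL);
        clReleaseEvent(event);

        return end - start;                  /* elapsed GPU time for this launch */
    }

Averaging the value that this function returns over several launches, and dividing the unoptimized kernel's average by the optimized kernel's average, gives the execution-time ratio described in step 2.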

The resolutions that have been tested are:

  • 1280 x 720.
  • 1920 x 1080.
  • 3840 x 2160.

To obtain the results that this guide uses, the results from 10 runs are averaged. This reduces the effects that individual runs have on the results.

After measuring the performance of the code with a new optimization, you can then add more optimizations. With each new optimization added, repeat the test steps and compare the results with the results from the code before the implementation of the new optimization.

Mali Offline Compiler

The Mali™ Offline Compiler is a command-line tool that translates OpenCL kernel source code into binaries for execution on Mali GPUs.

You can use the Offline Compiler to produce static analysis output that shows:

  • The number of work and uniform registers that the code uses when it runs.
  • The number of instruction words that are emitted for each pipeline.
  • The number of cycles for the shortest path for each pipeline.
  • The number of cycles for the longest path for each pipeline.
  • The source of the current bottleneck.

To start the Offline Compiler and produce the static analysis output, execute the command mali_clcc -v on the kernel source file.

To get the Offline Compiler, see http://malideveloper.arm.com.

Supported Zero-copy Flows Inside the PowerVR Imaging Framework

This article was originally published at Imagination Technologies' website, where it is one of a series of articles. It is reprinted here with the permission of Imagination Technologies.