Embedded Vision Alliance: Technical Articles

Smart Enhanced Back-Up Camera (SmartEBC) for CV220X Image Cognition Processors



By Tina Jeffrey
Program Manager
This  is a reprint of a CogniVue-published white paper in the company's Knowledge Center, and is also available here (1.2 MB PDF).

The CogniVue Smart Enhanced Back-Up Camera (SmartEBC) is a revolutionary automotive rear-view camera application that analyzes image data from a single image sensor to track objects in the scene and to perform feature detection and distance estimation for the nearest obstacle behind the vehicle. SmartEBC executes on CogniVue's CV220X family of SoC devices, utilizing their patented APEX parallel processing technology. SmartEBC augments the functionality of traditional backup camera products with the ability to algorithmically interpret the scene, detecting objects behind a vehicle in reverse and alerting the driver with critical real-time information to avoid a collision.

Traditional backup cameras provide only a simple, distorted view of the rear environment; the driver must scan for obstacles in the rear path unaided while reversing. CogniVue's SmartEBC application detects and recognizes the closest obstacle behind the vehicle, including moving objects such as pedestrians and cyclists. In scenarios where a collision is imminent, such as the vehicle moving towards an object or an object moving towards the vehicle, an alarm signals the obstacle's presence so the driver has adequate time to brake. When the vehicle is moving towards a stationary object, distance calculations are performed continuously to provide the driver with obstacle 'zone' alarms (green, yellow, red) representing varying distances between the detected object and the vehicle.

SmartEBC Offers Multiple Viewing Modes

SmartEBC software supports three distinct driver-selectable views of the rear vehicle environment. These views are described below, with illustrations of each that follow.

Full view, the default, provides an undistorted wide-angle view of the rear environment. It corrects the "fish-eye" (wide-angle lens) distortion and renders the corrected images to the driver's display. The figure below illustrates the effect of dewarping/perspective correction when the system operates in Full view mode.

Top view, also referred to as "Bird's-Eye view," provides an undistorted view of the rear environment from a vantage point looking down from above.

Left/Right view, also referred to as "Split view," provides undistorted views of the far left and far right regions of the rear environment of the vehicle.

Distorted "Fish-Eye" Input Image (above left), Corrected Full View (above right)

Full View (above left), Top View (above right)

Full View (above left), Left/Right Split View (above right)
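The dewarping described above is typically implemented as a per-pixel remap: for every output (corrected) pixel, a look-up table gives the source coordinate to sample in the distorted input frame. The pure-Python sketch below illustrates the idea with a simple one-parameter radial model standing in for real lens calibration; `k1`, `cx`, and `cy` are hypothetical values, not CV220X calibration data.

```python
def distort_point(xu, yu, k1, cx, cy):
    """Map an output (corrected) pixel to the source coordinate in the
    distorted input image, using a toy one-parameter radial model.
    Negative k1 pulls edge pixels inward, mimicking fisheye compression."""
    x, y = xu - cx, yu - cy          # lens-centered coordinates
    r2 = x * x + y * y
    s = 1.0 + k1 * r2                # radial scale factor
    return cx + x * s, cy + y * s

def build_dewarp_lut(width, height, k1, cx, cy):
    """Precompute a per-pixel remap table, as a dewarp LUT would be stored
    for the renderer: lut[(xu, yu)] -> source pixel to sample."""
    lut = {}
    for yu in range(height):
        for xu in range(width):
            lut[(xu, yu)] = distort_point(xu, yu, k1, cx, cy)
    return lut

# The optical center maps to itself; displacement grows toward the edges.
lut = build_dewarp_lut(8, 8, k1=-0.01, cx=4.0, cy=4.0)
```

In a production pipeline the LUT is computed once per lens/view at calibration time (this is the role CogniVue's Camera Calibration Toolkit plays) and applied per frame with bilinear interpolation.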

A single button on the camera module, connected to a general-purpose I/O (GPIO) pin, allows the user to toggle between the multiple views. Through a similar mechanism, the product may offer programming of a customized viewing-mode sequence different from the default ordering. The CV220X provides numerous GPIOs, allowing system integrators to accommodate a variety of system designs.

SmartEBC Algorithm: Vision Based Obstacle Detection and Distance Estimation

The SmartEBC algorithm detects objects through 3D reconstruction of the scene from a single rear-view parking camera, first dewarping the incoming raw video frames and then performing the necessary transformations on the image scene. Image features are detected and tracked across multiple frames using optimized versions of the Good Features To Track (GFTT) and block matching algorithms. An image feature, in the context of SmartEBC, is a corner or speckle found within a small patch of the image, detected by analyzing patches for strong gradients occurring in at least two different directions. SmartEBC detects and tracks both features on the ground, to estimate the vehicle's motion, and features associated with obstacles in the scene. To accurately calculate ground motion, a minimum vehicle speed of 3 cm/s (~0.1 km/h) is required. The algorithm is optimized to reduce false detections before reporting the distance of the nearest obstacle to the driver.
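The "strong gradients in at least two directions" test described above is the classic Shi-Tomasi (Good Features To Track) criterion: build the 2×2 structure tensor of a patch from its image gradients and keep the patch if the tensor's smaller eigenvalue is high. A minimal pure-Python sketch (the patch sizes and threshold behavior are illustrative, not SmartEBC's tuned values):

```python
import math

def min_eigenvalue(patch):
    """Shi-Tomasi corner score for a small grayscale patch (list of rows):
    accumulate the 2x2 structure tensor from central-difference gradients
    and return its smaller eigenvalue. High in two directions => corner."""
    sxx = sxy = syy = 0.0
    h, w = len(patch), len(patch[0])
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (patch[y][x + 1] - patch[y][x - 1]) / 2.0
            gy = (patch[y + 1][x] - patch[y - 1][x]) / 2.0
            sxx += gx * gx
            sxy += gx * gy
            syy += gy * gy
    # Eigenvalues of [[sxx, sxy], [sxy, syy]].
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    disc = math.sqrt(max(tr * tr / 4.0 - det, 0.0))
    return tr / 2.0 - disc

# A flat patch scores zero, a straight edge scores ~zero (gradients in only
# one direction), and a corner (dark/bright quadrants) scores high.
flat   = [[10] * 5 for _ in range(5)]
edge   = [[10, 10, 200, 200, 200] for _ in range(5)]
corner = [[10 if x < 2 or y < 2 else 200 for x in range(5)] for y in range(5)]
```

A feature detector then keeps the patches whose score exceeds a quality threshold, and block matching locates each kept patch in the next frame to form the motion tracks.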

The figure below shows the regions of interest at the rear of the vehicle: the ground region of interest is denoted 'A', and any feature located in the collision volume 'B' is determined to be an obstacle. The red circle at the rear of the vehicle marks the position of the camera module, and the blue arrow indicates the center of the camera's optical axis within the field of view.

Regions of Interest at Rear of Vehicle

For a single rear camera module, a wide-angle lens ensures that a sufficiently wide side-to-side region behind the vehicle is visible. The tradeoff is that the wider the lens's field of view (FoV), the lower the image resolution (and the higher the distortion) away from the lens's optical center. The SmartEBC algorithm operates optimally with the camera mounted such that its optical axis intersects the ground region (A) at 1.2 meters in the Y direction (i.e., behind the vehicle).

The SmartEBC application accepts camera mounting heights (along the Z axis) between 60 and 145 cm, accommodating a wide variety of vehicle models. Given the camera's mounting height, and the requirement that its optical axis point at the ground 1.2 meters directly behind the vehicle, the camera's mounting angle is automatically determined. This requirement ensures that the highest-resolution area of the image falls within the region of interest processed by SmartEBC, for optimal detection. If a sub-optimal mounting angle is used (i.e., the end of the blue arrow in the figure above does not fall at the 1.2-meter mark), the wide-angle nature of the lens leaves less image resolution, making features less distinct to track. This does not mean that features will not be detected and tracked, but rather that the further a feature lies from the optimal detection zone, the more pronounced it must be.
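The mounting-angle determination described above is basic trigonometry: with the camera at height h and the aim point on the ground 1.2 m behind the vehicle, the downward tilt from horizontal is atan(h / 1.2). A sketch of the geometry (illustrative only; the shipping application derives this internally from the configured height):

```python
import math

AIM_POINT_M = 1.2  # optical axis must intersect the ground 1.2 m behind the vehicle

def mounting_angle_deg(height_cm):
    """Downward tilt of the camera's optical axis, in degrees below
    horizontal, for a given mounting height in the accepted 60-145 cm range."""
    if not 60 <= height_cm <= 145:
        raise ValueError("SmartEBC accepts mounting heights of 60-145 cm")
    return math.degrees(math.atan((height_cm / 100.0) / AIM_POINT_M))

# A camera mounted at 120 cm is tilted 45 degrees downward so its optical
# axis hits the ground at the 1.2 m mark; higher mounts tilt more steeply.
```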

Obstacle detection and distance estimation are performed while the camera module operates in any of the viewing modes: Full, Top, or Left/Right. The SmartEBC application software consists of multiple patented algorithms operating concurrently, selected based on the movement of the vehicle and of the nearest object, as described in the following scenarios:

  • Stationary Vehicle – Moving Object: This mode uses object motion to detect that an object exists. The algorithm computes ground motion and determines that the vehicle is stationary. The motion of the moving object is tracked and highlighted, and a simple cautionary alarm is raised to the driver, with no zone-based distance estimation. The purpose of this alarm is to warn the driver of moving objects behind the vehicle before the vehicle begins reversing.
  • Moving Vehicle – Stationary Object: The algorithm computes ground motion and determines that the vehicle is in reverse motion. It then determines whether an object is stationary or moving by performing a relative motion analysis. When a stationary object is detected, the distance between the object and the camera is calculated and an alarm is presented to the driver. There are three configurable detection 'zones':
    • Green zone: the nearest object is between 1 and 2 meters behind the vehicle. A green indicator is overlaid on the display and a low-frequency tone is sounded (as required).
    • Yellow zone: the nearest object is between 60 cm and 1 meter from the rear of the vehicle. A yellow indicator is overlaid on the display and a mid-frequency tone is sounded (as required).
    • Red zone: the object is between 0 and 60 cm from the rear of the vehicle and a collision is imminent. A red indicator is overlaid on the display and a high-frequency tone is sounded (as required).
  • Moving Vehicle – Moving Object: The algorithm computes ground motion and determines that the vehicle is in reverse motion and that the nearest object is moving. This motion may be in any direction behind the vehicle. The application produces a Red-zone alarm without a distance measurement, since the moving object may change speed and direction at any time. In this mode of operation, the main goal is to draw the driver's attention to the display immediately.

Note that in Stationary Vehicle – Stationary Object scenarios, the selected view is rendered to the display in real time (30 fps) with no object detection or distance overlay, as there is no imminent danger when neither the vehicle nor the obstacle is moving.
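The three-zone alarm described above amounts to a simple threshold classifier over the measured distance. A sketch using the zone boundaries from the text (the tone descriptions mirror the text; the function and return shape are illustrative):

```python
def zone_alarm(distance_cm):
    """Map the distance to the nearest detected stationary object to the
    zone alarm described in the text: red (< 60 cm), yellow (60-100 cm),
    green (1-2 m); beyond 2 m no zone alarm is raised."""
    if distance_cm < 60:
        return ("red", "high-frequency tone")
    if distance_cm < 100:
        return ("yellow", "mid-frequency tone")
    if distance_cm <= 200:
        return ("green", "low-frequency tone")
    return (None, None)  # object outside all configured zones

# As the vehicle backs toward an obstacle, successive frames walk the
# alarm from green through yellow to red.
```

Since the text notes the zones are configurable, a production version would take the two boundaries as parameters rather than hard-coding them.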

System Solution/Hardware Requirements

The SmartEBC application requires the input of VGA or WVGA images from a well-paired sensor and lens combination. The sensor input is delivered to the CV220X as an uncompressed raw stream of YUV422-format data that is analyzed for obstacles on a frame-by-frame basis. The system must be designed to withstand all weather conditions and seasonal temperature variations.

The system is automatically activated when the vehicle is placed into reverse since it is powered directly from the vehicle’s reversing lights. Therefore, when the driver shifts into drive or park, the camera turns off.

The SmartEBC algorithm is highly developed and does not require CAN bus vehicle speed input to assess whether the vehicle is in motion (reversing). Instead, it calculates ground speed algorithmically, using 3D triangulation of points in space and point motion vectors to determine the distance between the camera and the object. Note that for factory (pre-market) installations, the algorithms could make use of CAN bus vehicle information, such as the steering angle, to trigger different parking-guide overlays on the display.
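The triangulation mentioned above can be illustrated with a simplified "motion stereo" model: two frames captured as the vehicle backs up act like a stereo pair, where the travel between frames (derived from the estimated ground speed) is the baseline and the pixel shift of a tracked feature yields its depth by similar triangles. This is a hedged sketch of the principle only, assuming a pinhole camera and lateral motion; it is not CogniVue's patented formulation, and all names and values are hypothetical.

```python
def depth_from_motion_stereo(disparity_px, baseline_m, focal_px):
    """Triangulate point depth from two frames: the camera travel between
    frames is the stereo baseline, and the pixel shift (disparity) of a
    tracked feature gives depth = baseline * focal / disparity."""
    if disparity_px <= 0:
        raise ValueError("feature must shift between frames to triangulate")
    return baseline_m * focal_px / disparity_px

# At 30 fps and the 3 cm/s minimum trackable speed, the per-frame baseline
# is 0.03 / 30 = 0.001 m; features that shift more pixels per frame are
# closer to the camera.
```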

Digital images captured from the sensor are sent to the CV220X device, which is small enough to be embedded inside the camera module and which runs the application code. The application easily accommodates different hardware implementations that provide an uncompressed raw YUV422 input stream to the CV220X device and accept an NTSC/PAL signal for display rendering. Some systems may embed the CV220X in the display unit in the vehicle's cockpit or in the rear-view mirror. Alternatively, the CV220X may be housed in an in-line module interfacing between the camera sensor and the driver display console.

The sensor/lens combination selected for the first production release of SmartEBC is the OmniVision OV7962 high-dynamic-range sensor paired with a 190-degree wide-angle lens. The output display format is NTSC/PAL.

An alarm is triggered when an object or pedestrian is detected within the distance zones in the vehicle's rear path, providing the driver with the information needed to brake appropriately.

SmartEBC "In-Vehicle" Installation

Combining Vision and Ultrasonic Technology for Smart Cameras

The SmartEBC algorithm interprets and processes data from a single image sensor to perform object detection and distance estimation for the nearest object. The algorithm operates by continuously tracking textures ("features") associated with objects in the scene. The more texture in the scene, the better the object and ground are tracked and the more accurate the detection will be. Objects lacking sufficient texture are difficult, and sometimes impossible, to track accurately, and therefore will not be detected properly; for example, a uniform wall or a very smooth object may not be detected by the algorithm. Vision-based smart camera systems are also limited by available light: in complete darkness, dense fog, or very low-light conditions, reliable detection is not possible. In most cases, the SmartEBC software is able to detect these conditions and provide a visual warning to the driver that object detection will be unreliable.

Providing highly reliable detection across as many conditions as possible is also a trade-off against generating too many false alarms. False alarms happen when the algorithm concludes that a cluster of features represents an object when the features are, in effect, an optical illusion. For instance, a bright reflection on the ground produced by overhead lights in a parking garage may be interpreted by the algorithm as an object, with a warning presented to the driver.

One way of increasing the reliability of the SmartEBC application is to enhance the solution with distance information from another sensor, such as ultrasonic sensing technology. Adding distance data from a single ultrasonic sensor, integrated inside the camera module with the image sensor, combines the strengths of vision-based SmartEBC and ultrasonic (US) technology, reducing the chance of failing to detect objects with insufficient texture or features.

US proximity detectors measure the time taken for a sound pulse to be reflected back to the sensor, providing the driver with feedback on proximity to the obstacle. While US sensors are in widespread use in vehicles, they have limitations: they cannot detect small objects or objects that lie below the sensor's cone-shaped operating range. Because the system relies on the reflection of sound waves, items that are not flat or large enough to reflect sound (for instance, a narrow pole or a longitudinal object pointed directly at the vehicle) are often not detected. These sensors can also be mis-triggered on steep slopes, when the ground itself is incorrectly detected as an object. Given that SmartEBC addresses these shortcomings, merging the two technologies increases system reliability.
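The time-of-flight principle behind US ranging is simple: the pulse travels to the obstacle and back, so the one-way distance is half the round-trip path at the speed of sound. A sketch (the function name and sample timing are illustrative):

```python
SPEED_OF_SOUND_M_S = 343.0  # in air at ~20 degrees C; varies with temperature

def ultrasonic_distance_m(echo_round_trip_s):
    """Range to the obstacle from an ultrasonic pulse's round-trip time:
    the pulse covers the distance twice, so halve the total path length."""
    return SPEED_OF_SOUND_M_S * echo_round_trip_s / 2.0

# A 5.8 ms echo corresponds to an obstacle roughly 1 m behind the bumper.
```

Real sensors must also compensate for temperature (which shifts the speed of sound) and apply an amplitude threshold to the echo, which is exactly why small or absorbent objects can be missed.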

The following table highlights the advantages and disadvantages of each sensor technology.




Image Sensor & SmartEBC

Advantages:

  • Distance estimation with a single sensor (low cost; simple, low-cost installation)
  • Tracks multiple objects and provides distance estimation for the nearest one
  • Combines visual feedback with algorithmic feedback for object detection
  • Can pick up relatively small objects (smaller than those US technology can detect)
  • Provides feedback in conditions where objects cannot be detected reliably (e.g. low light)

Disadvantages:

  • Limited by environmental lighting conditions
  • Will not detect objects that lack sufficient texture

Ultrasonic Sensor

Advantages:

  • Works in any lighting condition
  • Can detect any object large enough to provide an echo
  • Provides accurate distance estimation
  • Inexpensive bill of materials

Disadvantages:

  • No visual feedback
  • Fairly directional; objects off center are poorly detected, requiring multiple sensors for full coverage
  • May fail to detect small objects (detection is threshold based)
  • Prone to false alarms in environments with substantial signal reflection, such as dust, air turbulence, or rough road surfaces
  • Will not function properly if clogged with snow or mud, with no feedback to indicate there is a problem
  • Expensive and unattractive installation, especially for aftermarket

If the SmartEBC software has distance data from a second source (i.e., ultrasonic), then the decision making for warning the driver of objects behaves as indicated in the following table.

  • Image sensor detecting, second distance sensor detecting: Position of object approximated by a position bar on the display; zone distance estimation shown according to the image OD algorithm.
  • Image sensor detecting, second distance sensor not detecting: Position of object approximated by a position bar on the display; zone distance estimation shown according to the image OD algorithm.
  • Image sensor not detecting, second distance sensor detecting: Position of object not shown (position bar covers the full length of the display); zone distance estimation shown according to the second sensor's data.
  • Image sensor not detecting, second distance sensor not detecting: No position or distance estimation data shown; if the image sensor light level is too low, or the image too noisy, a warning regarding lack of vision is shown to the driver.
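The decision logic in the table above can be sketched as a small dispatch function. The display-state strings and the `image_ok` flag are hypothetical names for illustration; the rules themselves follow the table.

```python
def fused_display_state(image_detecting, ultrasonic_detecting, image_ok=True):
    """Combine vision and second-sensor detections per the decision table:
    vision drives the display whenever it detects; the ultrasonic sensor
    supplies the zone estimate when vision sees nothing; a warning is shown
    when the image itself is unusable (too dark or too noisy)."""
    if image_detecting:
        # Vision drives both the position bar and the zone estimate,
        # regardless of what the second sensor reports.
        return {"position": "bar at object", "zone_source": "image OD"}
    if ultrasonic_detecting:
        # No visual position is available: bar covers the full display width.
        return {"position": "full-width bar", "zone_source": "ultrasonic"}
    if not image_ok:
        return {"position": None, "zone_source": None,
                "warning": "vision unreliable"}
    return {"position": None, "zone_source": None}
```

Note the asymmetry: the second sensor never overrides a vision detection, it only fills the gap when vision reports nothing.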

Customizing A Smart Enhanced Backup Camera Application

CogniVue’s SmartEBC application is built using the ICP SDK platform and associated image processing libraries including the Camera Calibration Toolkit and Image Processing Toolkit. CogniVue’s SmartEBC application comes as a pre-packaged software solution in the form of a binary executable.

Some customization to the SmartEBC solution is possible to accommodate specific camera module hardware and custom user interfaces. These modifications include:

  • Developing sensor drivers and modifying sensor settings to accommodate different image sensors;
  • Creating custom views by changing the default look-up tables (LUTs) specific to the selected lens; LUTs may be hand-generated or produced automatically using CogniVue's Camera Calibration Toolkit;
  • Overlaying a custom graphic logo on the display; and
  • Developing or customizing a boot loader to manage the boot sequence. The application's binary image is stored in flash memory (SPI flash, NAND flash).

For customization beyond what is described above, contact CogniVue.

Not the Eye But the Heart for ADAS Vision: Analog Devices Enables Mass Deployments of Camera-Based ADAS



By Peter Voss, Marketing Manager for Advanced Driver Assistance Systems
Analog Devices
This article was originally published by Hanser Automotive Magazine. It is reprinted here with the permission of the original publisher.


Making the automotive environment safer by reducing injuries and fatalities is always a hot topic of the automotive industry, an aspiration that's only gated by the availability of commercially deployable technology. Active Safety Systems, also known as ADAS (advanced driver assistance systems), present a major emerging market trend. The next major technology innovation after ABS (anti-lock braking systems) and stability control systems, which can be considered standard today, ADAS is rapidly gaining adoption. Analog Devices has just introduced the Blackfin® BF60x family of processors with a "Pipelined Vision Processor" (PVP) targeted at automotive ADAS Vision applications in support of the newly increased Euro NCAP car safety demands, while enabling customer products to achieve ISO26262 compliance.


Vision-based ADAS deployments are not new; they have already been found in cars for a number of years, with many of our customers deploying multiple Blackfin-based ADAS vision systems. However, deployments up to this point are still moderate in volume and, with few exceptions, are seen as customer options on premium brands only. Multiple reasons exist for the slow start. The technology available was expensive, vehicle manufacturers could not find automotive-ready hardware that complied with the automotive temperature and power consumption specifications, and qualified support was difficult to find.

In addition, no clear performance guidelines were available and, as a result, these systems were not recognized in overall car safety ratings. From the end customer's point of view, the added value of such systems was often not apparent when configuring and ordering a new vehicle; and when it was, the high price tag still scared some consumers away. While current deployment numbers hardly seem to justify the R&D money spent over many years by vehicle and component manufacturers, some of them now have the "vision" of the future.

Is vision-based ADAS finally taking off?

The long startup phase for ADAS vision applications is about to start paying off, because the added safety provided by camera-based technology is beginning to be recognized by rating organizations, legislative bodies and end users. This fact immediately requires that the solutions be cost-optimized, dedicated and scalable innovations from the component suppliers, but it still leaves room for the automobile manufacturers and their suppliers (i.e. the OEMs) to differentiate with their own IP.

One recent catalyst for this market is the Euro NCAP (New Car Assessment Program) organization. Euro NCAP conducts crash tests and provides independent safety ratings (0-5 stars) for popular cars sold in Europe. Euro NCAP just increased the requirements for pedestrian protection, which will help ignite the mass launch for ADAS vision applications. Pedestrian protection is usually solved with specific crash absorbing bumper material and specially designed hoods – basically making the cars softer in areas where the pedestrian might impact. On the other hand, an electronic ADAS vision system, helping to avoid hitting the pedestrian in the first place, seems to be a lower cost and otherwise attractive alternative, once such technology exists.

With these new standards, if an OEM wants to maintain its 5-star Euro NCAP rating, the bar for pedestrian protection has just been raised by 50%. In fact, many vehicles rated 5 stars by Euro NCAP just last year would lose a star if re-rated using the new 2012 Euro NCAP metrics. This change accelerates the demand for a viable, cost-effective, mass-deployable solution. A camera-based ADAS vision system controlling autonomous braking can solve the problem for OEMs, presenting an attractive option for satisfying Euro NCAP's tightened pedestrian protection demands (Figure 1).

Figure 1. A camera-based ADAS vision system

Optimal system-level performance-to-cost ratio

With the just-announced BF60x family, Analog Devices is currently sampling the first dedicated ADAS vision processor with unprecedented performance-per-watt and per-dollar value, in support of the new Euro NCAP requirements for tightened pedestrian protection. At Analog Devices, we focused from the beginning on the desired outcome: a mass-deployable programmable solution that still allows our customers the highest degree of flexibility to innovate. It was clear to Analog Devices that the ADAS vision market cannot be addressed with pure processing power alone once this segment moves into high-volume deployments. Hence, the design targets were:

  • A cost-effective vision processor solution in support of 2014-and-beyond mass deployments
  • Up to 5 concurrent ADAS vision functions
  • Lowest-possible power consumption
  • Integral functional safety hardware support
  • A fully programmable solution enabling customers to add their own IP, and
  • A scalable solution covering the low-to-high end of the market

The result is the first two members of a dedicated ADAS vision processor family (the BF609 for megapixel processing and the BF608 for VGA processing), which support up to 5 concurrent vision functions and handle images up to megapixel size at 30 fps (frames per second), with lowest-in-class power consumption of <1.3 W @ 105°C. Since market deployments in 2014 and beyond were the target, ISO 26262-compliant hardware support was a given for Analog Devices to integrate. The BF60x typically reduces system cost by up to 30% in a 5-function system.

Popular ADAS vision functions that can be realized utilizing the BF60x include the following:

  • LDW (Lane Departure Warning) is based on a lane detection algorithm which will warn the driver when the vehicle is about to unintentionally depart a lane; that is, if no turn signal is set but the lane marking is about to be crossed. Unintentional and unannounced lane changes are a major contributing factor to accidents.
  • LKA (Lane Keep Assist) is an enhancement of LDW, with the added advantage that the system will apply pressure to the steering wheel with the intention of keeping the vehicle in the lane.
  • PD (Pedestrian Detection) is an important application to support the tightened Euro NCAP pedestrian protection requirements from 2012 onwards. Pedestrians are identified and tracked; if a pedestrian is detected within the drive path of the vehicle, a warning or braking action is issued.
  • FCW (Forward Collision Warning) is based on an object detection algorithm that searches for multiple vehicles in front of the drive path. Although a conventional (i.e. non-depth sensing) camera cannot precisely determine absolute distance, the estimated distance can still be used to warn the driver.
  • FCM (Forward Collision Mitigation) is an enhancement to FCW, with the advantage that the system will autonomously apply braking pressure to slow the vehicle or even bring it to a complete stop. In the latter case, a secondary sensor (e.g. radar) might also find use.
  • IHB (Intelligent High Beam) automatically detects when the high beam headlights need to be lowered in order to not blind other road participants.
  • TSR (Traffic Sign Recognition) in many cases currently refers simply to detecting speed signs (i.e. speed limit assist). However, some applications also interpret other traffic signs and inform the driver (e.g. the time range over which a speed limit is valid), and
  • Concurrent ADAS vision functions refers to a single system that executes two or more applications at the same time, e.g. PD + FCM + LDW + …
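As the FCW entry above notes, a conventional mono camera can only estimate distance. A common approach (described here as a general technique, not as ADI's specific algorithm) assumes a flat road and a calibrated pinhole camera: the image row where a vehicle's bottom edge meets the road determines its range. The calibration values below are hypothetical.

```python
def mono_ground_distance_m(bottom_row_px, horizon_row_px,
                           camera_height_m, focal_px):
    """Estimate range to a vehicle ahead from where its bottom edge meets
    the road in the image. Under a flat-road pinhole model,
    distance = focal * height / (y_bottom - y_horizon): objects whose
    ground contact appears closer to the horizon row are farther away."""
    dy = bottom_row_px - horizon_row_px
    if dy <= 0:
        raise ValueError("object bottom must lie below the horizon row")
    return focal_px * camera_height_m / dy

# With an 800 px focal length and the camera 1.2 m above the road, a car
# whose bottom edge sits 48 rows below the horizon is about 20 m ahead.
```

The flat-road assumption is exactly why such estimates are not absolute: road slope, pitch from braking, and calibration drift all bias the result, which is acceptable for a warning function but motivates radar fusion for FCM.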

Power consumption innovation through intelligent memory bandwidth utilization

Why is power consumption so important for ADAS vision systems? To understand this, one needs to know where these systems are typically mounted in cars. From a cost perspective (remember that we are talking about mass-deployable systems), it is best to house these systems in a single ECU (electronic control unit), including the imager. The mounting position is typically behind the windshield, in front of the rear-view mirror. Considering the direct sunlight exposure, and with no cost-effective way to provide active cooling, the components inside must themselves produce minimal incremental heat in order to keep the overall temperature within tolerance.

For the BF60x, an intelligent system partitioning between flexible, programmable hardware acceleration and a fully programmable DSP was selected. The task was to reduce external memory accesses as much as possible, since a large amount of power is consumed moving high-volume pixel data to and from external memory. As a side effect of this optimization, memory bottlenecks during development are much less likely to occur. In addition, a configurable and programmable PVP (pipelined vision processor) was introduced to handle video data inputs directly from the megapixel imager. The PVP can, in parallel, accept intermediate pre-processed data from internal or external memory. All of this pre-processing is performed without any use of the programmable Blackfin DSP cores, leaving 1,000 MIPS of additional algorithmic processing power available for other functions (Figure 2).

Figure 2. Blackfin BF60x processor block diagram

Development of the ecosystem to reduce time to market

Time to market and ease of applications development are critical elements of success. ADI enables customers with specialized system tools to accelerate algorithm development and processor adoption in the shortest-possible amount of time. In addition, ADI provides reference applications demonstrating the optimal use of BF60x processor capabilities.

Making automotive safety affordable

At Analog Devices, ADAS developments for vision and radar systems are a key focus. Typical applications can be serviced by ADAS vision, ADAR radar or a combination of both sensor types via "Sensor Fusion" (Table 1).




Table 1 indicates which sensor technology services each ADAS application: vision, radar, or fusion (vision + radar). The applications covered are lane change assist, lane departure warning, lane keeping assist, blind spot detection, pedestrian detection, forward collision warning, forward collision mitigation, cross-traffic alert, adaptive cruise control, emergency braking, and intelligent light control.

Table 1: ADAS applications by sensor technology

With the new dedicated ADAS vision solution from Analog Devices, in the form of the Blackfin BF609 and BF608, ADAS vision technology becomes affordable enough to support mass deployment into all levels of vehicles, no longer just a privilege of luxury cars. In Europe, Euro NCAP is indirectly fueling these requirements with its new safety ratings, and similar trends can be observed globally. While this article focused on forward vision ADAS applications, the same Analog Devices-developed technology also applies to intelligent rear-view cameras, which are now getting a major push from legislation in North America. This technology also has the potential to reduce the cost of currently expensive night vision systems (e.g. for detecting pedestrians and wildlife), broadening their adoption into much more attractive price segments.

For more information about the Blackfin ADSP-BF608/609 please visit www.analog.com/blackfin. And for more information about Analog Devices automotive applications, please visit www.analog.com/automotive.

20/20 Embedded Vision for Robots

By Alexandra Dopplinger
Global Segment Marketing Manager, Freescale Semiconductor
and Brian Dipert
Embedded Vision Alliance
Senior Analyst

"Get Smart" With TI’s Embedded Analytics Technology



By Gaurav Agarwal, Frank Brill, Bruce Flinchbaugh, Branislav Kisacanin, Mukesh Kumar, and Jacek Stachurski
Texas Instruments

This is a reprint of a Texas Instruments-published white paper, which is also available here (2.1 MB PDF).


When a driver starts a car, he doesn't think of it as starting an intelligent analytics system, but sometimes that's precisely what he's doing. In the future, we will encounter intelligent systems more often, as embedded analytics is added to applications such as automotive vision, security and surveillance systems, industrial and factory automation, and a host of other consumer applications.

Texas Instruments Incorporated (TI) has been innovating in embedded analytics for more than 20 years, blending real-world sensing technologies like video and audio with embedded processors and analytics algorithms. TI provides software libraries and development tools to make these intelligent applications fast and easy to develop.

Now, high-performance, programmable and low-power digital signal processors (DSPs) are providing the foundation for a new wave of embedded analytics systems capable of gathering data on their own, processing it in real time, reaching conclusions and taking actions.


This white paper explains how TI, together with members of the TI Design Network, are today empowering leading-edge embedded analytics systems in some of the most prominent application areas, including automotive, surveillance, access control and industrial inspection systems, as well as many emerging applications, including digital signage, gaming and robotics.

What is “embedded analytics”?

Embedded analytics technology unites embedded systems and the human senses to enable systems to analyze information and make intelligent decisions. Although embedded analytics technology appeals to a wide range of industries, there is a set of technical characteristics that most embedded analytics applications share. They are:

  • Diverse algorithms: Embedded analytics draws on a myriad of mathematical, statistical, signal and image-processing techniques. It combines these with machine learning, pattern recognition and other types of algorithms. The way in which these algorithms are combined tends to be unique to the application, and each of the algorithms usually needs to be adjusted a bit. This makes programmable processors and flexible software, often in the form of re-usable software libraries, very important.
  • Fast processing, predictable latency: Embedded analytics generates a tremendous computational load that must be processed in real time. Also, time allocated for processing must be bounded and deterministic. Otherwise, the timing of the system is thrown off. Advanced architectures with parallelism help in this regard.
  • Data throughput: Practically all embedded analytics applications involve some form of extreme data throughput. Huge amounts of data are brought into the system from sensors, cameras, microphones and other input devices. This data must be processed quickly, and the results, often involving huge amounts of data, must be output just as rapidly. To maintain data throughput, embedded analytics systems need advanced solutions like hierarchical memory organization, advanced direct memory access (DMA) controllers and wide memory interfaces.
  • Low power consumption: Many applications of embedded analytics are mobile or deeply embedded systems that may or may not have access to the power grid. Low power drain is often a must-have.
  • Cost: Many systems with embedded analytics – such as IP security cameras, smart TVs and games – are cost sensitive, yet the technical requirements are considerable. Balancing the two is a challenge.
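The characteristics above can be made concrete with a toy analytics pipeline that chains a signal-processing stage, a feature-extraction stage and a decision stage. This is a minimal illustrative sketch, not TI software; the filter window, energy feature and threshold are all invented for the example.

```python
# Toy embedded-analytics pipeline: filter -> feature -> decision.
# All thresholds and stage choices are illustrative.

def moving_average(samples, window=4):
    """Signal-processing stage: smooth the raw sensor samples."""
    out = []
    for i in range(len(samples) - window + 1):
        out.append(sum(samples[i:i + window]) / window)
    return out

def energy(samples):
    """Feature-extraction stage: mean squared amplitude."""
    return sum(s * s for s in samples) / len(samples)

def classify(feature, threshold=1.0):
    """Decision stage: a stand-in for a trained classifier."""
    return "event" if feature > threshold else "quiet"

raw = [0.1, 0.0, 0.2, 2.5, 2.7, 2.4, 0.1, 0.0]
print(classify(energy(moving_average(raw))))  # prints "event"
```

Even in this toy form, the characteristics listed above are visible: each stage is a distinct algorithm that must be tuned per application, and in a real system every stage must complete within a bounded, deterministic time budget.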

Automotive embedded analytics for Advanced Driver Assistance Systems (ADAS)

First introduced into the automotive market more than a decade ago, embedded analytics has become widespread to the point where it is a "must-have" feature on many cars. Outside and inside the vehicle, TI's DSPs, particularly the TMS320C6000™ DSP platform, enable the various vision and audio processing subsystems that form a vehicle's embedded analytics system (Figure 1).

Figure 1. ADAS enables the car to assist the driver in avoiding dangers on the road.

Many, but not all, of the vision processing subsystems in automobiles are outward facing. That is, image sensors monitor the space around a car and perform a wide variety of analytics functions intended to assist the driver, protect the vehicle from possible damage, and safeguard objects and pedestrians in the roadway. For example, several vision-based subsystems, widely known as Advanced Driver Assistance Systems (ADAS), process the field of vision in front of the car and provide information directly to the driver. These subsystems include a lane departure warning system, which warns drivers when the vehicle begins to move out of its lane; high-beam assist, which adjusts the level of the car’s headlights automatically when the lights from an approaching vehicle are detected; traffic sign recognition, which ensures that drivers don’t miss speed limit changes and other important road signs; forward collision warning to help drivers avoid front-end collisions; and an object detection capability that can automatically take countermeasures to avoid pedestrians or obstructions.
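To illustrate the kind of decision logic a lane departure warning system reduces to once the lane boundaries have been detected, here is a hedged sketch. The geometry (lane boundaries as image columns) and the warning margin are invented for illustration; a real ADAS pipeline derives these from calibrated camera images and tracked lane models.

```python
# Illustrative lane-departure check. Inputs are the detected left/right
# lane boundary positions (image columns) and the vehicle centerline.
# The 0.2 margin is an invented tuning parameter.

def lane_departure_warning(left_x, right_x, vehicle_x, margin=0.2):
    lane_width = right_x - left_x
    # Normalized offset: 0 at lane center, +/-0.5 at the boundaries.
    offset = (vehicle_x - (left_x + right_x) / 2) / lane_width
    if offset > 0.5 - margin:
        return "drifting right"
    if offset < -(0.5 - margin):
        return "drifting left"
    return "centered"

print(lane_departure_warning(left_x=100, right_x=500, vehicle_x=290))
# prints "centered"
```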

Other types of ADAS systems can assist with parking maneuvers, monitor the entire area around the car as well as the driver’s rear- and side-view blind spots to provide warnings, sound alarms or automatic evasive actions, and offer night vision functionality based on infrared sensors. In many cars on the road today, an adaptive cruise control system with embedded analytics will automatically detect other vehicles based on vision or radar data, calculate the distance and adjust the speed of the car to maintain a pre-determined distance.

TI’s DaVinci™ video processors, including DM81x video processors, are key to enabling ADAS technology. The parallel architecture of these processors can handle many vision algorithms with the short latency necessary for these safety applications. In addition, the processors’ high performance is balanced by the sub-3-Watt power budget, a must-have for automotive applications. In the future, TI’s smart multicore, automotive grade (AEC-Q100) OMAP™ processors will unleash the high-performance and low-power capabilities necessary for collecting, analyzing and displaying information and warnings in real time.

Inside the vehicle, embedded analytics enables various hands-free voice recognition control systems for the vehicle’s infotainment system. For more than 30 years, TI has researched and developed speech-recognition technology, and a portion of this research has been donated to the open source community in the form of the TI Embedded Speech Recognizer (TIesr).

TIesr is a medium-size speech recognition system intended for embedded applications in automotive, industrial controls, consumer products, appliances and other market segments that require that the speech recognition and analytic processing are performed locally in the device itself. It should be noted that some large-size, more powerful speech recognition/analytic applications are not true embedded systems. In certain cases, these types of applications will utilize a communication link and perform much of the processing remotely, often in a cloud computing client or server application.

Embedded analytics in the automotive industry will continue to evolve as new techniques are investigated and developed, and as technology providers like TI continue to innovate with low-power, programmable single and multicore DSPs and the tools that facilitate their rapid deployment. Three-dimensional (3D) vision systems, for example, are becoming an integral part of automotive embedded analytics. In recent years, extensive research has been compiled on stereoscopic vision, which deploys two cameras. Other vision-related techniques like structured light and time-of-flight systems could be employed with embedded vision algorithms that leverage 3D sensor measurements to solve problems requiring higher precision.

Security and surveillance embedded analytics

Security and surveillance systems have also incorporated embedded analytics for quite some time. Initially, analytics was employed in conjunction with data compression/decompression algorithms to optimize the communication bandwidth associated with security systems. This led to greater penetration of embedded analytics and, specifically, vision-related analytics for automated real-time monitoring applications of property and infrastructure, traffic conditions and others. In addition, a significant amount of off-line video analytics has been implemented for forensics purposes.

Besides vision analytics, sound-processing technologies are bringing embedded audio analytics to security applications as well. Alarms can be triggered by sounds of aggression, explosions, sirens, collisions, break-ins and other sounds of trouble. Arrays of microphones or sound sensors are also deployed in surveillance applications to determine where the source of a sound is located or the direction from which it is coming.
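The direction-finding idea behind multi-microphone surveillance can be sketched with time-difference-of-arrival (TDOA): the lag that maximizes the cross-correlation between two channels indicates which microphone the sound reached first. The signals below are synthetic and the whole example is a simplified illustration, not a production localization algorithm.

```python
# TDOA sketch: find the sample lag at which the right channel best
# matches the left channel. Positive lag => sound reached the left
# microphone first, so the source is on the left side.

def cross_corr(a, b, lag):
    total = 0.0
    for i in range(len(a)):
        j = i + lag
        if 0 <= j < len(b):
            total += a[i] * b[j]
    return total

def estimate_lag(a, b, max_lag=5):
    return max(range(-max_lag, max_lag + 1),
               key=lambda lag: cross_corr(a, b, lag))

pulse = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0]
left = pulse                    # pulse arrives at the left mic first...
right = [0, 0] + pulse[:-2]     # ...and two samples later at the right mic

lag = estimate_lag(left, right)
print("source is on the", "left" if lag > 0 else "right")  # prints "source is on the left"
```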

In addition to vision- or sound-only implementations of analytics in security applications, embedded analytics has brought these two sensory technologies together in certain systems.

In sound-assisted video analytics (SAVA), audio analytics inspect the sound scene of a surveyed environment and provide additional information about activities not readily discerned from video. A system could detect glass breaking, and as a result of embedded analytics, a surveillance camera might be redirected to the region of interest where the sound originated. Or, the sound of an intrusion might trigger an increased resolution of certain cameras for better images. Also, audio annotation may help determine the relevance of a large amount of recorded surveillance video. Sound identification may warn of potential security risks even when they are partially obstructed or hidden, or before they appear within the camera’s field-of-view. Taking advantage of the complementary aspects of video and audio provides a powerful framework that can lead to system robustness for enhanced alarm detection rates.

Security systems that require embedded analytics can leverage many of the capabilities provided by TI’s C6000™ DSPs, DaVinci™ video processors and other system-on-chip (SoC) devices. In addition to their low power and powerful processing capabilities, these programmable devices are architected for high-bandwidth data movement. A comprehensive tools environment specific to embedded analytics ensures rapid development cycles and an accelerated time-to-market.

TI’s DaVinci DMVAx video processors are equipped with capabilities targeted at embedded analytic security applications. Some of these capabilities include integrated video analytics acceleration, the industry’s first vision co-processor, an image co-processor and a complete video processing subsystem capable of face detection, video stabilization, noise filtering and other functions. Based on an ARM9™ core, TI’s DMVAx processors are supported by TI’s Smart Analytics, which includes five fundamental embedded analytics functions: camera tamper detection; intelligent motion detection; trip zone, which detects and analyzes objects moving from one zone to another; object counting; and streaming metadata, which tracks and tags objects on a frame-to-frame basis (Figure 2).

Figure 2. Smart analytics are embedded on TI’s DaVinci DMVAx video processors.

An integral part of the DMVAx processors’ embedded analytics capabilities is TI’s smart codec technology for improving codec efficiencies in analytic applications. For example, smart codec technology might function in concert with face detection to allot more bits to the face in an image and thereby achieve higher resolution for this region of interest (Figure 3).

Figure 3. DMVA2 block diagram.
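The region-of-interest idea behind smart codec technology can be sketched as a per-block quantization map: blocks inside a detected face rectangle get a finer quantization step (more bits), while background blocks are quantized coarsely. The block grid and step values below are invented for illustration and do not reflect TI's actual codec parameters.

```python
# Build a quantization map for a frame divided into blocks.
# Blocks inside the (hypothetical) detected face rectangle get a fine
# quantization step; everything else gets a coarse one.

def quant_map(blocks_w, blocks_h, face_rect, fine=4, coarse=16):
    """face_rect = (x0, y0, x1, y1) in block coordinates, inclusive."""
    x0, y0, x1, y1 = face_rect
    return [[fine if x0 <= x <= x1 and y0 <= y <= y1 else coarse
             for x in range(blocks_w)]
            for y in range(blocks_h)]

qmap = quant_map(8, 6, face_rect=(2, 1, 4, 3))
for row in qmap:
    print(row)
```

A real encoder would feed a map like this into its rate-control stage, so the bit budget shifts toward the face region without raising the overall bitrate.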

TI offers reference designs for digital cameras with Internet Protocol (IP) connectivity that simplify development and allow designers to concentrate on adding features that will differentiate their products from the competition. These reference designs are based on TI’s DaVinci video processors, including the DMVAx, as well as an IP camera software suite. TI’s Digital Media Video Analytics Library (DMVAL) contains much of the base functionality needed to assemble an embedded analytic security system. Another building block for embedded analytic applications, TI’s Vision Library (VLIB), accelerates the development of vision subsystems in embedded analytic systems for security, automotive and others.

TI’s TMS320C674x DSPs are ideal for audio analytics. The processor offers the floating- and fixed-point capabilities and parallel architecture needed for real-time processing of audio analytics algorithms, but with low power consumption and at a low cost.

Access control

Many biometric characteristics are used to verify identity, including hand and face geometry, retinal scans and fingerprint analysis. For example, fingerprint scanners are used for identity verification at public safety facilities, on cell phones and laptops, at health care facilities and even at the local gym to enable quick and easy access to personal information and secure buildings and to keep everyone else out.

Systems that process these applications take a “picture” of the hand, face, retina or fingerprint, analyze the image for biometric data, and store this data in a database used for future matching. These applications must often be ultra-low-power when they run on mobile electronics like cell phones and laptops. Slightly more performance is necessary to capture and process images of faces, irises and retinas (Figure 4).

Figure 4. Block diagram of fingerprint process system.

TI’s TMS320C55x ultra-low-power DSPs are ideal for residential or commercial fingerprint recognition systems. They fulfill the need for less than two seconds of recognition time for a system with a 100-user fingerprint template. Since the power consumption is the 16-bit DSP industry’s lowest, users only need to change the battery of battery-powered systems every few months. TI offers the C5515 DSP Fingerprint Development Kit to simplify development of this application. For face recognition, iris recognition and other higher performance biometrics applications, TI’s C674x DSPs and OMAP-L138 DSP+ARM® processor are ideal.
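The two-second, 100-template figure quoted above implies a simple per-match budget, which can be checked with back-of-the-envelope arithmetic. This ignores image capture and feature extraction, so the real per-comparison budget is tighter than the number shown.

```python
# Per-template time budget implied by "< 2 s over a 100-user template":
# an upper bound on how long each comparison may take on average.

recognition_budget_s = 2.0
templates = 100

per_template_ms = recognition_budget_s / templates * 1000
print(per_template_ms)  # prints 20.0 (milliseconds per comparison)
```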

Industrial embedded analytics

Control systems, factory automation, robotics, automated optical inspection, currency inspection, traffic management and many other types of industrial systems incorporate various aspects of embedded analytics. Often, machine vision is central to these industrial systems, but many also include a range of sensor inputs not found in other types of embedded analytic applications, such as pressure, temperature, motion, sound and other sensors.

The ongoing and seemingly constant advancements in low-power yet high-performance DSPs have enabled greater levels of intelligence in all aspects of industrial embedded analytics utilizing machine vision. As a consequence, the cameras on the factory floor and the centralized vision processing systems they are connected to are all able to function as powerful platforms for additional analytics processing. A smart camera, for instance, might perform some of the image enhancement and refinement functions locally that had previously been performed in the central vision processing system. Then, the smart camera could analyze the image and respond to it by zooming in or out, or turning for a better angle. And since the central vision processing systems are not constrained by the low power budgets or small enclosures of smart cameras, multiple single and multicore DSPs can be added to the centralized image-processing subsystem to support high-order embedded analytics like 3D object analysis, surface texture analysis and more. See Figure 5 for a diagram of a typical industrial imaging system.

Figure 5. Typical industrial imaging system.

In industrial embedded analytic applications, the scalable processing power and both fixed- and floating-point capabilities of TI’s TMS320C66x multicore DSPs give these low-power and programmable devices the characteristics required by smart cameras, vision-processing systems and other rugged processing platforms. A host of software tools and libraries, including TI’s Multicore Software Development Kits (MCSDKs) also streamlines development.

TI’s C66x DSPs integrate one to eight C66x DSP cores and are based on TI’s scalable KeyStone multicore architecture. They have a wide array of peripherals integrated on-chip, including very high throughput interfaces to FPGAs and CPLDs that accelerate system design and reduce system cost. Combining the KeyStone architecture with extensive memory resources ensures that each processing core will function at its fullest.

C66x DSPs are well suited to a wide variety of industrial applications, including optical defect inspection, part identification, high-speed barcode readers, color inspection, optical character readers (OCR), traffic management, currency inspection and high-end industrial printer/scanners.

Emerging embedded analytic applications

As an enabling technology, embedded analytics is so adaptable and malleable that it can emerge and be deployed in surprisingly unrelated and disparate places. Frequently, its appearance is unexpected. Typically, it disrupts the status quo in an application segment and takes it to a higher and more exciting level.

Embedded analytics is the engine behind robotics, augmented reality and a range of new natural user interfaces incorporating 2D or 3D gesture recognition and/or depth sensing. These capabilities play into a wide array of applications as varied as video games, medical imaging, home automation, smart TVs, e-commerce, digital signage and unmanned vehicles. The impetus underlying many of these emerging applications is simply to give machines a certain ability to analyze and respond to the real world around them. 2D and 3D vision analysis is an important capability in this regard because it moves computer vision closer to human vision.

Embedded analytics for 2D vision analysis can bring about new interactive and natural user interfaces for computers, appliances, industrial machines and other devices. For example, instead of relying on a mouse to move the cursor on a PC screen, users are able to control their computers with several hand gestures. Of course, adding the third dimension to vision analysis is considerably more complex, but it opens the door to many new applications, some of which have yet to be invented.

3D vision analysis will extend many applications that today deploy 2D vision. For example, today’s 2D hand gesture recognition can morph into a full-body tracking interface. Microsoft’s Kinect is a good example. The fastest ever consumer adoption of an embedded analytic vision technology, Kinect allows players to interact with a computer without accessories. The computer, which in Kinect’s case is the Xbox 360, perceives the players and calculates the body pose from 3D information (Figure 6).

Figure 6. Screenshot from TI’s body-tracking demo (using third-party algorithm on TI’s DaVinci™ DM3730 video processor).

Digital signage is another example of an emerging embedded analytics application. Not just a static digital advertisement, digital signage with embedded analytics is able to read the person reading the sign. Inside a retail store, such a sign will serve up an ad targeted at the demographic group of the reader.

A broad range of TI processors are adept at 2D or 3D processing tasks for a variety of applications. For 2D processing for hand tracking and other low-level applications, TI’s Sitara™ AM335x and AM37x ARM® microprocessors are a good fit. For applications requiring full-body tracking or tracking multiple users, TI’s DaVinci DM3730 and DM8148 video processors, as well as the smart multicore OMAP™ mobile applications processors, offer a variety of performance options and capabilities.

Getting smart with embedded analytics

Embedded analytics is reframing how technology is encountered in everyday life. In the past, a problem would be brought to a computer, where answers would be dispensed, and in the end, a human being would decide on a solution. Now, embedded analytics is moving digital-processing technology to the problem, with the system itself determining a solution. The technology challenges enabling embedded analytic applications are as diverse and as unique as the problems being solved. Fortunately, the embedded processor innovations from TI are meeting these challenges head on.

The sheer diversity of emerging embedded analytics applications demands a broad range of embedded processors to meet all requirements. TI’s breadth of embedded analytics processors, software and tools; additional hardware and software support from its extensive Design Network; and years of leadership in automotive, security and industrial analytics will continue to help systems “get smart” by enabling embedded analytics for new applications.

Sidebar: A good listener: TIesr

A robust and efficient open source speech recognizer, the TI Embedded Speech Recognizer targets embedded platforms with a simple, easy-to-use application programming interface (API). Capable of adapting to changing noise environments and various microphones, the downloadable TIesr balances memory requirements and processing power with its speech-recognition capabilities and robustness.

Sidebar: Hearing is believing

The ecosystem that has grown around TI’s embedded analytic technologies includes third-party companies that are developing breakthrough audio solutions.

Audio Analytic has developed a range of analytics, each detecting a specific class of sound, used individually or in combinations to address particular applications and security scenarios.

For example, detecting breaking glass or car alarms can add significant value to premises or property protection applications. Aggression and gunshot detection provide increased staff protection in lone worker locations or other public safety and potentially hostile situations such as hospital A&E, prisons or police-custody centers. Also, keyword detection allows monitoring stations to be alerted when members of staff require assistance through use of designated security keywords.

Learn more: www.audioanalytic.com

Sidebar: Easy to image-ine

TI’s Design Network includes several companies that provide hardware and software design and optimization services for imaging applications based on TI’s processors.

eInfochips’ product design services and IP portfolio reduce development time, cost and risk for developers of industrial and video surveillance analytics applications and beyond.

Learn more: eInfochips’ Video Analytics Daughter card developed around TI’s DaVinci™ DM6435 video processor and Video Analytics Services.

D3 Engineering provides a fast, low-risk path through embedded product development. Building on proven DesignCore™ modules and application software libraries, D3 Engineering speeds design through launch of embedded systems for digital video and analytics, digital power management, and precision motion control.

Learn more: www.D3Engineering.com


Leveraging Multicore Processors for Machine Vision Applications


By Mukesh Kumar
Marketing Director, Multicore Processors
Texas Instruments
This is a reprint of a Texas Instruments-published white paper, which is also available here (500 KB PDF).


Meeting the needs of practically any conceivable type of vision application, TI has a wide selection of processors, ranging from microprocessors based on ARM cores to SoCs with ARM and DSP cores to very high-performance multicore processors. These devices form a broad portfolio of solutions for processing-intensive imaging and vision-based applications, like smart cameras, centralized vision systems, commercial off-the-shelf accelerator cards and frame grabbers. Products like these are typically deployed in industrial automation systems, such as automated optical inspection systems, robotic vision subsystems, high-speed identification systems including 1D/2D barcode readers, document and textile printing and scanning equipment, and many others.

Several types of subsystems that are commonly found in imaging and vision systems can benefit from TI DSPs. For example, DSP-based image processors and frame grabbers would have low power dissipation and, as a result, could be integrated into smart cameras capable of the extensive processing required by these applications.

The processing power required by contemporary industrial inspection systems is also trending upward. The primary reason for this is the fact that modern inspection systems operate on a much larger image data set and perform much more complex algorithms in real time. Commonly, inspection systems are configured with multiple high-resolution (megapixel) and high-frame-rate cameras that stream large amounts of data for processing. At the high end of such applications, multiple cameras acquire complete 3D volume data while depth cameras generate a depth profile and stereoscopic cameras might capture a surface profile.
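The scale of the camera data streams described above is easy to quantify with rough arithmetic. The configuration below (two 1080p monochrome cameras at 60 frames per second) is an invented example, not a specific system from the text.

```python
# Rough data-rate arithmetic for a multi-camera inspection setup.
# Resolution, frame rate and pixel depth are illustrative choices.

cameras = 2
width, height = 1920, 1080   # pixels per frame
fps = 60
bytes_per_pixel = 1          # 8-bit monochrome

rate_mb_s = cameras * width * height * fps * bytes_per_pixel / 1e6
print(round(rate_mb_s, 1))   # prints 248.8 (megabytes per second)
```

Even this modest setup streams roughly a quarter gigabyte per second before any processing begins, which is why wide memory interfaces and DMA engines matter so much in these systems.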


In recent years, the processing requirements for imaging and other industrial vision processing applications such as inspection systems have followed a steep upward curve. The field of vision to be processed in real time has grown considerably. The complexity of vision processing algorithms has increased geometrically. Higher resolutions, faster frames-per-second rates and video analytics which generate decisions based on the results of vision processing algorithms are only several of the many facets of industrial vision processing that are escalating the raw processing loads on such systems. And these escalating loads show no signs of abating.

TI has responded with a wide range of microprocessors, single core and multicore digital signal processors (DSPs) and system-on-chips (SoCs) that can fill practically any level of processing needed in a vision system.

  • Image and vision processing algorithms: At the heart of inspection systems are a host of image- and vision-processing algorithms. These algorithms can be grouped into several categories, including image enhancement and formation, morphological operations, and feature extraction and detection.
  • Morphological operations: Morphological operations are non-linear operations which incorporate a “structuring element” that probes the image, providing results on how well an elemental structure fits within the image. The outputs of morphological operations could result in thickening or thinning edges, removing small objects within a larger object, connecting broken edges, eliminating small holes and filling small gaps.
  • Feature extraction and detection: Most feature extraction and detection algorithms include edge detection, line tracing, object shape analysis, a classification algorithm and template matching. Sometimes the image is transformed into a different domain such as Fourier and Wavelet before features are extracted.
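The morphological operations described above can be sketched in a few lines with a 3×3 structuring element: erosion keeps a pixel only if its whole neighborhood is foreground (thinning, removing small objects), while dilation sets a pixel if any neighbor is foreground (thickening, closing small gaps). This is a minimal pure-Python sketch; border pixels are simply treated as background, and a real implementation would use an optimized image library or DSP kernels.

```python
# Binary morphology with a 3x3 structuring element, for illustration.

def erode(img):
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = int(all(img[y + dy][x + dx]
                                for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
    return out

def dilate(img):
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = int(any(img[y + dy][x + dx]
                                for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
    return out

square = [[0] * 5 for _ in range(5)]
for y in range(1, 4):
    for x in range(1, 4):
        square[y][x] = 1          # a 3x3 block of foreground

eroded = erode(square)            # only the center pixel survives
print(sum(sum(row) for row in eroded))  # prints 1
```

Note that dilating the eroded image restores the original 3×3 block here, which is the classic "opening" behavior used to remove objects smaller than the structuring element.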

TMS320C6657 for imaging systems

The latest addition to TI’s KeyStone family of processors, the TMS320C665x series, features capabilities that make it especially suitable for machine vision applications (Figure 1). For cost- and power-sensitive applications, two single-core DSPs, the TMS320C6655 and TMS320C6654, are pin-for-pin compatible with the dual-core TMS320C6657 DSP. These devices also operate over an extended temperature range of -55°C to 100°C, which suits many image processing applications and guarantees long-term viability and availability.

Figure 1. TI’s family of TMS320C665x DSPs provides the programmable processing power that smart cameras, frame grabbers and generic imaging systems require.

Increased processing

Most industrial image processing subsystems require high performance in terms of the speed of the processing core(s) as measured in GHz (billions of hertz), instruction execution as measured in MIPS (millions of instructions per second), computational processing as measured in MMACs (millions of multiply/accumulates per second) and floating point operations as measured in GFLOPs (billions of floating point operations per second).

TI’s TMS320C665x DSPs are based on the industry’s highest performing fixed and floating point DSP core, the TMS320C66x generation. The dual-core C6657, with each core running at 1.25 GHz, is effectively a 2.5 GHz DSP providing 80 GMACs and 40 GFLOPs of processing performance. The C66x core’s instruction set architecture (ISA) is based on an eight-issue machine. That is, it can execute eight 32-bit instructions or one 256-bit very long instruction word (VLIW) per cycle with an instruction pipeline 11 deep and 64 internal registers. It can execute eight single-precision floating point MAC operations per cycle and perform double and mixed precision operations. Moreover, when compared to the C64x+ fixed point DSP core, some 90 new instructions have been added to the C66x to support floating point and vector math processing. The C66x core’s raw computational performance is an industry-leading 32 MACS/cycle and 16 FLOPs/cycle. This indicates that the core at 1.25 GHz can perform 40 GMAC/s and 20 GFLOP/s.
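The throughput figures quoted above follow directly from the per-cycle rates, which can be verified with simple arithmetic:

```python
# Checking the arithmetic behind the quoted C6657 throughput figures:
# 32 MACs/cycle and 16 FLOPs/cycle per C66x core, 1.25 GHz, two cores.

clock_ghz = 1.25
macs_per_cycle = 32
flops_per_cycle = 16
cores = 2

gmacs_per_core = clock_ghz * macs_per_cycle    # 40 GMAC/s per core
gflops_per_core = clock_ghz * flops_per_cycle  # 20 GFLOP/s per core

print(cores * gmacs_per_core, cores * gflops_per_core)  # prints 80.0 40.0
```

The device-level numbers (80 GMACs and 40 GFLOPs) are thus simply twice the per-core figures, consistent with the dual-core description above.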

Floating point processing increases the dynamic range of the C66x core, an important consideration for certain image processing algorithms. With two C66x cores in the C6657 device, system designers can parallelize the execution of the entire system and, in certain cases, even individual algorithms so that independent operations can be performed simultaneously on the two cores. Each core of the C6657 has an internal memory architecture with 32 KB L1 program and 32 KB L1 data caches and 1 MB of internal L2 SRAM. This SRAM can be partitioned and configured as RAM or cache memory. System reliability is enhanced by Error Correction and Checking (ECC) on the L1 and L2 on-chip memories.

System performance can be vastly improved with a processor architecture containing multiple levels of memory caching and a significant amount of on-chip RAM. Since image sizes are generally much larger than the size of on-chip RAM, image processing systems invariably need large external RAMs. This indicates that the DSP’s processing cores will require high-bandwidth external memory interfaces in order to move large amounts of data effectively from external memory into the processing cores.

A shared memory architecture allows the multiple cores in a multicore DSP to either operate on different sections of the same image in parallel or to perform different processing functions on the same section of image data serially. With this sort of shared memory architecture and an intelligent direct memory access controller shared by all of the cores, performance can be improved significantly by transferring data from external memory, memory-mapped peripherals and on-chip memory at the same time as instructions are being executed. This is accomplished via double buffering. A total of 3 Mbytes of on-chip memory in the C6657 can store sections of images that are being processed.
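The double-buffering (ping-pong) pattern mentioned above can be sketched as follows. Here the "DMA" is simulated by a plain function call executed in sequence; on the C6657 the transfer would run in hardware concurrently with processing, which is where the performance gain comes from. The tile size and data are invented for illustration.

```python
# Ping-pong buffering sketch: while the core processes one buffer,
# the (simulated) DMA fills the other with the next tile.

def dma_fill(source, offset, buf):
    """Stand-in for a DMA transfer of one tile from external memory."""
    buf[:] = source[offset:offset + len(buf)]

def process(buf):
    """Stand-in for the per-tile image processing kernel."""
    return sum(buf)

image = list(range(16))             # "external memory"
tile = 4
buffers = [[0] * tile, [0] * tile]  # ping and pong

results = []
dma_fill(image, 0, buffers[0])      # prime the first buffer
for i in range(len(image) // tile):
    current = buffers[i % 2]
    nxt = buffers[(i + 1) % 2]
    if (i + 1) * tile < len(image):
        dma_fill(image, (i + 1) * tile, nxt)  # would overlap with processing
    results.append(process(current))

print(results)  # prints [6, 22, 38, 54]
```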

The C6657 has a 32-bit DDR3 external memory interface which can operate at any of three speeds: 800 mega-transfers per second (MTS), 1033 MTS or 1333 MTS. It also can operate in either 16- or 32-bit modes. An on-chip Multicore Shared Memory Controller (MSMC) has an address translation block which expands the addressable memory space to eight gigabytes. The DDR3 memory controller is capable of ECC for improved system reliability. System performance is accelerated and external memory accesses reduced by a pre-fetch mechanism in the MSMC that acts as a caching mechanism for external memory. In addition, data packet movements across the chip’s TeraNet communications switching fabric cannot be blocked by memory accesses since the MSMC gives each core the ability to access on-chip memory directly, avoiding the TeraNet completely.
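The peak bandwidth implied by the DDR3 interface figures above is straightforward to compute; real sustained throughput is lower once refresh, turnaround and access patterns are accounted for.

```python
# Peak DDR3 bandwidth: 32-bit bus at 1333 mega-transfers per second.

bus_bits = 32
mts = 1333e6                        # transfers per second

peak_gb_s = bus_bits / 8 * mts / 1e9
print(round(peak_gb_s, 2))          # prints 5.33 (GB/s, theoretical peak)
```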

In most imaging systems, on-chip memory is not sufficient for storing all of the program code and data. As a result, the DSP cores must have efficient DMA mechanisms for transferring code and data back and forth among external memory, internal memory, peripherals and accelerators. An independent DMA controller can transfer data among the processing cores, memories, accelerators and peripheral interfaces without the intervention of the processing cores. In this way, the cores can be dedicated to executing program instructions while the data transfers happen under the control of the DMA. One of the DMAs implemented in the C6657 is the Enhanced Direct Memory Access-3 (EDMA3) controller. It services software-driven paging transfers such as data movements between external and internal memories, sorting or subframe extraction of various data structures, and event-driven peripherals; in all of these functions it offloads data transfers from the DSP core. EDMA channel controllers on the C6657 support performance-enhancing features such as two addressing modes (constant addressing and increment addressing). The EDMA3 can also transfer data in three-dimensional arrays, frames or blocks, and its linking mechanism allows for ping-pong buffering, circular buffering and repetitive/continuous transfers with no intervention by the processing cores. The EDMA has debug visibility into queue watermarking, thresholds, error and status recording, and other conditions. The C6657 also has a separate DMA mechanism, the Independent DMA controller, which transfers data and code between on-chip RAM and the L1 program and L1 data caches.
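The subframe-extraction style of transfer mentioned above amounts to strided addressing: copy a rectangular region out of a larger frame using an element count per line, a line count and a line pitch. The sketch below mirrors that idea only loosely; the parameter names are not the EDMA3 controller's register names.

```python
# Strided-copy sketch of an EDMA-style subframe extraction:
# pull a w x h rectangle out of a flat, row-major frame buffer.

def strided_copy(frame, frame_width, x0, y0, w, h):
    """Extract a w x h subframe starting at (x0, y0)."""
    sub = []
    for row in range(h):
        start = (y0 + row) * frame_width + x0
        sub.extend(frame[start:start + w])
    return sub

frame = list(range(6 * 6))   # a 6x6 frame stored row-major
sub = strided_copy(frame, 6, x0=2, y0=1, w=3, h=2)
print(sub)                   # prints [8, 9, 10, 14, 15, 16]
```

On the DSP, a transfer descriptor expressing the same geometry lets the hardware move just the region of interest while the cores keep executing instructions.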

Another aspect of TI’s KeyStone architecture which improves performance is Multicore Navigator, an innovative hardware-based packetized message transfer mechanism with 8,192 message queues and six Channeled-DMA channels for transferring messages. When a message is directed to one of its queues, Multicore Navigator uses a hardware accelerator to dispatch the needed task to the appropriate hardware. Multicore Navigator enables very high speed Inter-Process Communication (IPC) and Peripheral/Accelerator Interfacing which significantly simplifies the device’s software architecture and reduces the involvement of the processing cores in these functions.

High-speed connectivity

The C6657 DSP has various options for connecting the imaging and image-processing subsystems. Many imaging systems require one or more analog or digital interfaces like CameraLink. With a choice of high-speed serializer/deserializer (serdes) interfaces, designers have the option of connecting the processing system to an FPGA, which in turn is connected to the image capture subsystem(s).

The C6657’s HyperLink, a 50-Gbps serdes interface, provides a very high-speed connection to FPGAs through which image data from one or more image-sensing elements or cameras can be brought into DSP memory. HyperLink’s low protocol overhead and high throughput make it an ideal interface for chip-to-chip interconnection. HyperLink can directly link the C6657 to companion chips or dies, such as C667x devices or other C665x DSPs. Together, HyperLink and Multicore Navigator can transparently dispatch a task to a connected device, where it executes as if it were running on local resources. This provides connectivity between DSPs at the level of the switching fabric.

Other high-speed interfaces, including SGMII Gigabit Ethernet, provide high-speed network connectivity. The C6657 has a four-lane (5 Gbps/lane) interface to Serial RapidIO (SRIO v2.3), a high-performance, low-pin-count interconnect aimed at the embedded market. Deploying SRIO as the interconnect technology for a given board design can reduce system costs by lowering latencies, reducing packet-processing overhead and increasing system bandwidth.

The C6657 also features two lanes of PCI Express (PCIe) Gen II. This can function as the interface between the C6657 and PCIe-compliant devices. It might also be used to interface to PCIe boards plugged into a PCIe backplane. The PCIe interface is a low-pin-count, high-reliability, high-speed interface rated at 5 Gbps per lane, or 10 Gbps total across the two lanes.

Another interface module on the C6657, Serial Peripheral Interface (SPI), functions as an interface between the DSP and SPI-compliant devices. The primary intent of this module is to connect the processing cores to read-only memory for booting the system. The External Memory Interface (EMIF16) and Inter-Integrated Circuit (I2C) port provide other alternative interfaces for the DSP cores and external memories such as NAND/NOR flash or EEPROM.

Evaluation Module and software

A low-cost Evaluation Module (EVM) is available with one C6657 DSP on an AMC form factor card. The HyperLink interface is routed to its own connector while other interfaces, including Gigabit Ethernet, SRIO and PCIe are routed to an edge connector.

The C6657 is supported by TI’s integrated software and hardware development toolset, the Code Composer Studio™ (CCS) integrated development environment (IDE). Included in CCS is a full suite of compilers, a source code editor, a project build environment, a debugger, a profiler, simulators and many other code development capabilities. Furthermore, these features have been enhanced to support multicore software development and debugging. For example, CCS’s compilers support OpenMP, a popular open multicore programming standard. The CCS IDE is based on Eclipse, an open source software framework used by many embedded software vendors. CCS, as well as several development emulators, takes advantage of the C6657’s hardware debugging features, such as Advanced Event Triggering (AET), which enables the insertion of hardware breakpoints into code and other functions, and trace buffers for tracing code execution. The primary emulation interface is IEEE 1149.1 JTAG. Code Composer Studio and several third-party tools make it easy for developers to get started, develop and debug application software.

In addition to these development tools, a portfolio of software building blocks is available as well. TI’s Multicore Software Development Kit (MCSDK) provides developers with a well integrated software development platform encompassing efficient multicore communication layers for intercore and interchip communication, validated and optimized drivers integrated with SYS/BIOS, a real time operating system (RTOS), and Linux support with appropriate demonstration examples. Figure 2 provides a detailed diagram of the MCSDK.

Figure 2. Multicore Software Development Kit

Several libraries for image processing functions are also available on the C6657. One such library is IMGLIB, a library of functions optimized for image/video processing and written in C. It consists of over 70 functions, including the source code of many C-callable, assembly-optimized, general-purpose image/video processing routines. These routines are typically used in computationally intense real-time applications where optimal execution speed is critical. These routines ensure execution speeds considerably faster than equivalent code written in the standard ANSI C language. In addition, by providing ready-to-use DSP functions, IMGLIB can significantly shorten the development time for image/video processing applications. The rich set of software routines included in IMGLIB is organized into three functional categories as follows:

  • Compression and Decompression
      • Forward and Inverse DCT
      • Motion Estimation
      • Quantization
      • Wavelet Processing
  • Image Analysis
      • Boundary and Perimeter Estimation
      • Morphological Operations
      • Edge Detection
      • Image Histogram
      • Image Thresholding
  • Image Filtering and Format Conversion
      • Image Convolution
      • Image Correlation
      • Median Filtering
      • Color Space Conversion
      • Error Diffusion
      • Pixel Expansion

In addition to these functions, a set of 22 low-level kernels are available for performing simple image operations such as addition, subtraction, multiplication and others. These are intended as a starting point for developing more complex kernels.

Another library, the Video Analytics & Vision Library (VLIB), is made up of more than 40 royalty-free kernels that accelerate video analytics development and can increase performance by a factor of 10. VLIB is an extensible library optimized for the C66x DSP core and is distributed in object format only. This collection provides the ability to perform the following:

  • Background Modeling and Subtraction
  • Object Feature Extraction
  • Tracking and Recognition
  • Low-level Pixel Processing

VLIB provides an extensible foundation for the following applications:

  • Video Analytics
  • Video Surveillance
  • Automotive Vision
  • Embedded Vision
  • Game Vision
  • Machine Vision
  • Consumer Electronics

The MCSDK also includes an example application (Figure 3) which demonstrates the functioning of various components such as SYS/BIOS, OpenMP, IMGLIB and other components in an image processing application.

Figure 3. A typical image processing system


For many years, multicore DSPs have shown their value in a wide variety of applications across a range of industries. Homogeneous multicore DSP devices have frequently been the choice of designers of systems requiring compute-intensive signal processing within a limited power budget and a compact physical space. TI’s C6657/5/4 DSPs offer exceptional computational performance, a wide selection of I/O interfaces, expansive memory space and other key features integrated into hardware to support the high-performance needs of industrial inspection systems (Figure 4).

Figure 4. Comparison of devices in C665x family

The simplicity of developing a software-defined imaging system in a high-level language like C on DSP cores accelerates the implementation of new and innovative algorithms and reduces customers’ time-to-market.

Improve Perceptual Video Quality: Skin-Tone Macroblock Detection

Bookmark and Share

Improve Perceptual Video Quality: Skin-Tone Macroblock Detection

By Paula Carrillo, Akira Osamoto, and Adithya K. Banninthaya
Texas Instruments
Accurate skin-tone reproduction is important in conventional still and video photography applications, but it's also critical in some embedded vision implementations, such as accurate facial detection and recognition. And intermediary lossy compression between the camera and processing circuitry is common in configurations that network-link the two function blocks, either within a LAN or over a WAN (i.e., the "cloud"). More generally, the technique described in this document uses dilation and other algorithms to find regions of interest, which is relevant to many vision applications. And implementing vision algorithms efficiently, i.e., finding vision algorithms that are computationally efficient, is obviously an important concept for embedded vision. This is a reprint of a Texas Instruments-published white paper, which is also available here (800 KB PDF).


In video compression algorithms, the quantization parameter (QP) is usually selected based on the relative complexity of the region in the picture as well as the overall bit usage. However, complexity-based rate-control algorithms do not take into account the fact that more complex objects, such as human faces, are more sensitive to degradation during perceptual video compression. To improve the overall perceived quality of the image, it is important to classify human faces as regions of interest (ROI) and preserve as much detail in those regions as possible. The challenge is developing a reliable algorithm that will operate in real time. This white paper details a low-complexity solution that is able to run on a single-core digital signal processor (DSP) as part of an encoder implementation.

Skin-tone macroblock detection

The proposed solution is a low-complexity, color-based skin-tone detection that classifies skin-tone macroblocks (MBs) as ROI MBs and non-skin-tone macroblocks as non-ROI MBs. An MB is a 16×16 block of pixels. The classification of ROI MBs and non-ROI MBs is based on empirical thresholds applied to the mean of the color components. These threshold values were defined after extensive research using material that covers various races. Based on this classification, a modified rate control (RC) smoothly assigns different levels of quality, increasing visual quality (VQ) in human faces. The new RC assigns a lower QP to ROI areas than to non-ROI areas while maintaining the overall bits-per-frame budget.
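The classification step can be sketched as a threshold test on per-MB chroma means. The Cb/Cr ranges below are a commonly published YCbCr skin range, not the paper's empirical thresholds, and the QP offset is an arbitrary example value:

```python
# Illustrative skin-tone MB classifier. The Cb/Cr ranges below are a
# commonly published YCbCr skin range, NOT the paper's own thresholds.
CB_RANGE = (77, 127)
CR_RANGE = (133, 173)

def mean(values):
    return sum(values) / len(values)

def is_skin_mb(cb_block, cr_block):
    """Classify a macroblock as ROI (skin) from the mean of its
    chroma components, as the paper's threshold test does."""
    mcb, mcr = mean(cb_block), mean(cr_block)
    return (CB_RANGE[0] <= mcb <= CB_RANGE[1]
            and CR_RANGE[0] <= mcr <= CR_RANGE[1])

def roi_qp(base_qp, skin, qp_offset=4):
    """Give ROI MBs a lower QP (finer quantization) than non-ROI MBs;
    the offset of 4 is an example value, not the paper's RC formula."""
    return base_qp - qp_offset if skin else base_qp

skin = is_skin_mb([100] * 16, [150] * 16)  # both means fall in range
qp = roi_qp(30, skin)                      # ROI MB gets a lower QP
```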

Erosion and dilation

Erosion and dilation algorithms are used to refine detection by reducing false positives and missed MBs. These morphology algorithms use classified neighbors’ information to fill holes (missed MBs) and locate isolated blocks (false positives). False-positive ROI MBs lead to flawed allocation of important bits, while missed ROI MBs create a rough region perception.

Erosion helps find false positives and mark them as non-ROI. Dilation, on the other hand, finds holes in skin regions (such as the eyes or mouth in face regions) and marks them as ROI. In Figure 1, areas in pink show MBs detected as ROI. The face on the left shows detected skin regions without applying the morphology algorithms, and as a result the eyes are not part of the ROI. The face on the right shows the complete face marked as ROI after the morphology algorithms have been applied.

Figure 1. MBs detected as ROI in pink. Left image: ROI without applying morphology algorithms. Right image: ROI detection using morphology algorithms.

Erosion and dilation algorithms can be implemented in pre-processing or merged inside the encoder. When used in pre-processing, all the MBs of a frame are classified as ROI or non-ROI before being encoded. All neighboring MBs’ skin information is used to decide whether an MB’s skin classification is a false positive, a hole or correct, and the MB is then assigned the corresponding ROI or non-ROI classification.
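The pre-processing variant can be sketched on a binary MB map, where all eight neighbors of each MB vote on its final classification. The specific neighbor-count rules below (an isolated skin MB with no skin neighbors is a false positive; a non-skin MB with six or more skin neighbors is a hole) are illustrative, not the paper's exact criteria:

```python
def neighbors(mb_map, r, c):
    """Count skin-classified MBs among the 8 neighbors of (r, c)."""
    rows, cols = len(mb_map), len(mb_map[0])
    return sum(mb_map[rr][cc]
               for rr in range(max(0, r - 1), min(rows, r + 2))
               for cc in range(max(0, c - 1), min(cols, c + 2))
               if (rr, cc) != (r, c))

def refine(mb_map):
    """Erode isolated skin MBs (false positives) and dilate into
    enclosed non-skin MBs (holes), using all neighbors as in the
    pre-processing variant. Thresholds are illustrative."""
    out = [row[:] for row in mb_map]
    for r in range(len(mb_map)):
        for c in range(len(mb_map[0])):
            n = neighbors(mb_map, r, c)
            if mb_map[r][c] and n == 0:        # erosion: isolated MB
                out[r][c] = 0
            elif not mb_map[r][c] and n >= 6:  # dilation: fill hole
                out[r][c] = 1
    return out

# A hole (e.g., an eye) inside a skin region gets filled, while an
# isolated skin MB at the edge of the frame gets erased.
mb_map = [[1, 1, 1, 0, 0],
          [1, 0, 1, 0, 0],
          [1, 1, 1, 0, 1]]
refined = refine(mb_map)
```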

When the erosion and dilation algorithms are merged inside the encoder, only the top, left, top-left and top-right skin-MB neighbors’ information is available for making refinement decisions, but this version is suitable for low-latency applications. Figure 2 shows MBs detected as ROI in pink. The image on the left shows results for ROI detection merged inside the encoder. The image on the right uses all MB neighbors’ information to return the final ROI classification.

Figure 2. Left image: ROI detection inside of an encoder. Right image: ROI detection in pre-processing.

Activity gradient threshold

In addition to the erosion and dilation algorithms, an MB activity-gradient threshold is implemented to reduce the number of false positives, especially when videos contain many small faces, such as faces in a crowd. Background faces in a crowd are not given ROI treatment.

An implementation based on 8×8-pixel block detection for luma and 4×4-pixel blocks for the chroma components (in the case of the 4:2:0 video format) improves the algorithm's precision compared to 16×16-pixel block detection. If two or more of the four blocks in an MB are classified as skin blocks, the complete MB is marked as a skin MB.
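The two-of-four voting rule is straightforward to sketch (assuming a per-block skin test such as the chroma-mean threshold described earlier):

```python
def mb_is_skin(sub_block_flags):
    """A 16×16 MB is marked as a skin MB when at least two of its
    four 8×8 sub-blocks were classified as skin."""
    assert len(sub_block_flags) == 4
    return sum(sub_block_flags) >= 2

vote_a = mb_is_skin([True, True, False, False])   # 2 of 4: skin MB
vote_b = mb_is_skin([True, False, False, False])  # 1 of 4: not skin
```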

Pre-processing decision

It’s also important to implement another pre-processing step inside the ROI algorithm to eliminate frames that have too many MBs marked as ROI, in which case frame bit redistribution is pointless. If more than 30 percent of a frame is detected as skin, all the MBs are re-marked as non-ROI.
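This frame-level bail-out can be sketched as follows (the 30 percent cut-off is from the paper; the function name is illustrative):

```python
def apply_frame_roi_budget(mb_flags, max_roi_fraction=0.30):
    """If too much of the frame is skin, bit redistribution is
    pointless: clear every ROI mark for this frame."""
    roi_fraction = sum(mb_flags) / len(mb_flags)
    if roi_fraction > max_roi_fraction:
        return [False] * len(mb_flags)
    return mb_flags

kept = apply_frame_roi_budget([True, False, False, False])    # 25%
cleared = apply_frame_roi_budget([True, True, False, False])  # 50%
```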

Decimation process

Finally, a decimation process reduces processing cycles and increases channel density per core. This process skips some pixel values when computing the mean of the color-component blocks. For the luma component, rather than taking the mean of all 64 8-bit pixels, we decimate in steps of four and take the mean of only four 8-bit pixels per 8×8 block. For the 4×4 chroma component blocks, the decimation step is two. Decimation decreases ROI classification precision; however, this is a fair tradeoff when pre-processing multiple HD channels on a single core.
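The decimated mean can be sketched directly; sampling an 8×8 luma block in steps of four (or a 4×4 chroma block in steps of two) reduces 64 (or 16) pixel reads to just four, at the cost of an approximated mean:

```python
def decimated_mean(block, step):
    """Mean of a square pixel block, sampling every `step`-th pixel
    in each dimension instead of reading every pixel."""
    samples = [block[r][c]
               for r in range(0, len(block), step)
               for c in range(0, len(block), step)]
    return sum(samples) / len(samples)

# An 8×8 luma block: only pixels (0,0), (0,4), (4,0), (4,4) are read.
luma = [[r * 8 + c for c in range(8)] for r in range(8)]
m = decimated_mean(luma, 4)  # approximates the full mean of 31.5
```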

Figure 3 and Figure 4 show a VQ comparison with and without ROI detection as part of an H.264 encoder. Figure 3 was encoded at a low bit rate to stress the VQ difference. This sequence has many small water droplets in motion. Small moving objects drain the rate control’s bit budget, resulting in visual degradation of the face when the ROI RC modification is not applied.

Figure 3. Chromakey sequence, H.264 encoded. Left image: No ROI applied. Right image: ROI applied.

Figure 4 shows content similar to that found in video conferencing, where the background is usually static and faces are the critical information to transmit.

Figure 4. Video conference content. Left image: No ROI RC is applied. Right image: ROI RC is applied.

Video market trends

The current video market demands a low-complexity implementation of skin-tone detection algorithms with highly accurate classification. ROI detection can be implemented in a video frame pre-processing stage, or, with less accuracy, it can be merged inside a standard video codec. A low-complexity implementation gives the advantage of fast decision-making (fewer cycles) when determining whether an MB is part of an ROI area. Fast decisions with a low-complexity classification algorithm allow a real-time ROI detection implementation on low-power processors for high-channel-density scenarios. Additionally, a low-complexity ROI implementation allows encoders to improve overall video quality in applications including video broadcast, video conferencing, video security and smart cameras.

TI technology for skin-tone macroblock detection

The ROI detection algorithm was implemented and tested on Texas Instruments’ (TI’s) 1-GHz TMS320C674x floating-point DSP on TI’s DaVinci™ TMS320DM816x video processor. The solution is XDAIS compliant. Initially, the unoptimized implementation permitted ROI pre-processing of three 1080p60 channels. After applying DSP-specialized MAC and SIMD instructions, performance was boosted to six channels of 1080p60. In video processors with three video accelerators (IVAHD), like TI’s DM816x video processor, it is possible to encode three HD channels using ROI information and still have room on the DSP for more pre-processing, such as audio or text-detection algorithms.

Figure 5 shows the data and control flow implemented on TI’s DM816x video processor. From a data point of view, skin-tone detection takes YUV input buffers and generates an MB map, which is appended to the YUV data and used as metadata by the encoder. From a control point of view, the integrated ARM® Cortex™-A8 runs the main application, which invokes process calls on the C674x DSP with raw input data available in DDR. The C674x DSP generates ROI information in DDR and informs the ARM Cortex-A8 when the frame is done; the ARM Cortex-A8 then invokes a process call on the ARM Cortex-M3, and the ROI information is passed as metadata to the Cortex-M3. Once frame encoding is done, the ARM Cortex-M3 informs the ARM Cortex-A8, and the described process starts again.

Figure 5. Data and control flow on TI’s DaVinci DM816x video processor using ROI detection.

As part of the video transcoder (VTC) demo on TI’s DaVinci™ DM816x video processor, developers have the option to mark detected ROI MBs with white dots for real-time verification of the ROI detection algorithm. An example of this capability can be seen in Figure 6. The VTC demo also includes the option of encoding clips with ROI-detected regions using TI’s IVAHD H.264 encoder. Using the ROI-detection MB map, the IVAHD H.264 encoder’s rate control (RC) smoothly redistributes frame bits between ROI and non-ROI regions for better perceived video quality.

Figure 6. TI’s VTC demo with ROI detection visualization.


To improve the overall perceived quality of images that focus on human faces, skin-tone macroblock detection offers a low-complexity solution. This paper has shown how different techniques can be implemented in order to increase accuracy and maintain HD channel density for real-time pre-processing of the ROI detection algorithm on TI’s DSPs, including the C674x DSP on TI’s DaVinci DM816x video processor. TI’s VTC demo offers the option of ROI H.264 encoding and a real-time visualization of ROI detection.

For more information about TI’s VTC demo, please visit: www.ti.com/truviewvtcdemo

For more information about TI’s DaVinci DM816x video processor, please visit: www.ti.com/dm8168

Start Developing OpenCV Applications Immediately Using the BDTI Quick-Start OpenCV Kit (Article)

Bookmark and Share

Start Developing OpenCV Applications Immediately Using the BDTI Quick-Start OpenCV Kit (Article)

OpenCV is an open-source software component library for computer vision application development.  OpenCV is a powerful tool for prototyping embedded vision algorithms.  Originally released in 2000, it has been downloaded over 3.5 million times.  The OpenCV library supports over 2,500 functions and contains dozens of valuable vision application examples.  The library supports C, C++, and Python and has been ported to Windows, Linux, Android, Mac OS X and iOS.

The most difficult part of using OpenCV is building the library and configuring the tools.  The OpenCV development team has made great strides in simplifying the OpenCV build process, but it can still be time consuming.  To make it as easy as possible to start using OpenCV, BDTI has created the Quick-Start OpenCV Kit, a VMware image that includes OpenCV and all required tools preinstalled, configured, and built.  This makes it easy to quickly get OpenCV running and to start developing vision algorithms using OpenCV. The BDTI Quick-Start OpenCV Kit can be run on any Windows computer by using the free VMware player, or on Mac OS X using VMware Fusion. This article describes the process of installing and using the BDTI Quick-Start OpenCV Kit.  For more information about OpenCV and other OpenCV tools from BDTI, go here.

Please note that the BDTI Quick-Start OpenCV Kit contains numerous open source software packages, each with its own license terms.  Please refer to the licenses associated with each package to understand what uses are permitted.  If you have questions about any of these licenses, please contact the authors of the package in question.  If you believe that the BDTI OpenCV VMware image contains elements that should not be distributed in this way, please contact us

Figure 1. Ubuntu desktop installed in the BDTI VMware image

The BDTI Quick-Start OpenCV Kit uses Ubuntu 10.04 for the OS.  The Ubuntu desktop is intuitive and easy to use.  OpenCV 2.3.0 has been preinstalled and configured in the image, along with the GNU C compiler and tools (gcc version 4.4.3).  Various examples are included along with a framework so you can get started with your own vision algorithms immediately.  The Eclipse integrated development environment is also installed and configured for debugging OpenCV applications.  Five example Eclipse projects are included to seed your own projects.

Figure 2. Eclipse integrated development environment installed in the BDTI VMware image

A USB webcam is required to use the examples provided in the BDTI Quick-Start OpenCV Kit.  Logitech USB web cameras have been tested with this image, specifically the Logitech C160.  Be sure to install the Windows drivers provided with the camera on your Windows system.

To get started, first download the BDTI Quick-Start OpenCV Kit from the Embedded Vision Academy.  To use the image on Windows, you must also download the free VMware player. After downloading the zip file from the Embedded Vision Academy, unzip it into a folder.  Double-click the vmx file highlighted by the arrow in Figure 3.  If you have VMware player correctly installed, you should see the Ubuntu desktop as shown in Figure 1.  You may see some warnings upon opening the VMware image concerning “Removable Devices.”  If so, simply click “OK.” In addition, depending on which version of VMware player you have, you may get a window concerning “Software Updates.”  Simply click “Remind Me Later.” 

Figure 3. The unzipped BDTI OpenCV VMware image

To shut down the VMware image, click the “power button” in the upper right corner of the Ubuntu desktop and select “Shut Down.”

To connect the webcam to the VMware image, plug the webcam into your computer’s USB port and follow the menus shown in Figure 4. Find the “Virtual Machine” button in the top of the VMware window as shown highlighted in Figure 4.  Then select “Removable Devices” and look for your webcam in the list.  Select your webcam and click connect.  For correct operation your webcam should have a check mark next to it, as shown by the Logitech USB device (my web camera) in Figure 4. 

Figure 4. Connecting the camera to the VMware image

To test your camera with the VMware image, double click the “Click_Here_To_Test_Your_Camera” icon in the upper left corner of the Ubuntu Desktop.  A window should open showing a live feed from the camera. If you do not see a live video feed, verify that the camera is connected to the VMware image using the menus shown in Figure 4.  If the camera is still not working, exit the VMware image and try the camera on the Windows host.  The camera must be properly installed on the Windows host per the manufacturer’s instructions.

Command Line OpenCV Examples

There are two sets of OpenCV examples preloaded in the BDTI Quick-Start OpenCV Kit.  The first set is command-line based, the second set is Eclipse IDE based. The command line examples can be found in the “OpenCV_Command_Line_Demos” folder as shown in Figure 5.

Figure 5. OpenCV command line demos folder

Double click the “Terminal” icon and type “. ./demos” at the prompt.  That is a period, followed by a space, followed by a period, a forward slash and then the word demos.  Commands are case sensitive, so watch the Caps Lock key.

Figure 6. The command line demos

The command line examples include example makefiles to provide guidance for your own projects.  To build a demo, simply change directory into the directory for that demo and type “make”, as illustrated here (commands below are in bold):

ubuntu@ubuntu:~/Desktop/OpenCV_Command_Line_Demos$ ls
FaceDetector  gnome-terminal.desktop  MotionDetection
framework     LineDetection           OpticalFlow

ubuntu@ubuntu:~/Desktop/OpenCV_Command_Line_Demos$ cd FaceDetector/
ubuntu@ubuntu:~/Desktop/OpenCV_Command_Line_Demos/FaceDetector$ ls
example  example.cpp  example.o  haarcascade_frontalface_alt2.xml  Makefile

ubuntu@ubuntu:~/Desktop/OpenCV_Command_Line_Demos/FaceDetector$ make

All of the examples are named “example.cpp” and create an executable binary with name “example”.  To run the program, type “./example”.

ubuntu@ubuntu:~/Desktop/OpenCV_Command_Line_Demos/FaceDetector$ ./example

Figure 7. The face detector example

To exit the example, simply highlight one of the windows (other than the console window) and press any key.

To edit a command line example, use the “gedit” command to launch a graphical editor.

ubuntu@ubuntu:~/Desktop/OpenCV_Command_Line_Demos/FaceDetector$ gedit example.cpp &

This opens the file named “example.cpp” in the graphical editor as shown in Figure 8.

Figure 8. Using gedit to edit an OpenCV example C file

Eclipse Graphical Integrated Development Environment OpenCV Examples

The Eclipse examples are the same as the command line examples but configured to build in the Eclipse environment.  The source code is identical, but the makefiles are specialized for building OpenCV applications in an Eclipse environment. To start the Eclipse IDE, double click the “Eclipse_CDT” icon on the Ubuntu Desktop.  Eclipse will open as shown in Figure 2. 

The left Eclipse pane lists the OpenCV projects.  The center pane is the source debugger.  To debug a project, simply highlight the desired project in the left pane and click the green bug on the top toolbar.  When the debug window opens, push F8 (see Figure 9).

Figure 9. Debugging an OpenCV example in Eclipse

The Eclipse IDE makes it easy to debug your OpenCV application by allowing you to set breakpoints, view variables, and step through code.  For more information about debugging with Eclipse, read this excellent guide. To stop debugging, simply click the red square in the IDE debugger.

There are five example projects, each provided in both the Eclipse and command line build formats.  The demonstration examples have been chosen to show common computer vision functions in OpenCV.  Each example uses OpenCV sliders to control the parameters of the algorithm on the fly.  Moving these sliders with your mouse changes the specified parameter in real time, letting you experiment with the behavior of the algorithm without writing any code.  The four demonstration examples are motion detection, line detection, optical flow and face detection; the fifth project is a bare framework for seeding your own applications. Each is described briefly below.  This article does not go into the details of each algorithm; look for future articles covering each of the algorithms used in these examples in detail. For now, let’s get to the examples.

Motion Detection

As the name implies, motion detection uses the change of pixels between frames to classify pixels as unique features (Figure 10). The algorithm considers pixels that do not change between frames as being stationary and therefore part of the background. Motion detection or background subtraction is a very practical and easy-to-implement algorithm. In its simplest form, the algorithm looks for differences between two frames of video by subtracting one frame from the next. In the output display, white pixels are moving, black pixels are stationary.

Figure 10. The user interface for the motion detection example

This example adds an additional element to the simple frame subtraction algorithm: a running average of the frames. The frame-averaging routine runs over a time period specified by the LearnRate parameter. The higher the LearnRate, the longer the running average. Setting LearnRate to 0 disables the running average, and the algorithm simply subtracts one frame from the next.

The Threshold parameter sets the change level required for a pixel to be considered moving. The algorithm subtracts the current frame from the previous frame, giving a result. If the result is greater than the threshold, the algorithm displays a white pixel and considers that pixel to be moving.

LearnRate: Regulates the update speed (how fast the accumulator "forgets" about earlier images).

Threshold: The minimum value for a pixel difference to be considered moving.
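The per-pixel test these two parameters control can be sketched as follows. The mapping of LearnRate to the blend weight below is illustrative, not the demo's exact formula, and frames are flattened to 1-D pixel lists for brevity:

```python
def motion_mask(prev_avg, frame, learn_rate, threshold):
    """Per-pixel motion test: update a running average of past frames
    (learn_rate = 0 reduces to plain frame-to-frame subtraction), then
    mark pixels whose difference exceeds the threshold as moving
    (white, 255) and the rest as stationary (black, 0)."""
    alpha = 1.0 if learn_rate == 0 else 1.0 / (1 + learn_rate)
    new_avg = [a + alpha * (f - a) for a, f in zip(prev_avg, frame)]
    mask = [255 if abs(f - a) > threshold else 0
            for f, a in zip(frame, prev_avg)]
    return new_avg, mask

# A 4-pixel "frame": only the last pixel changed enough to be moving.
avg, mask = motion_mask([10, 10, 10, 10], [12, 10, 10, 200],
                        learn_rate=0, threshold=30)
```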

Line Detection

Line detection classifies straight edges in an image as features (Figure 11). The algorithm relegates to the background anything in the image that it does not recognize as a straight edge, thereby ignoring it. Edge detection is another fundamental function in computer vision.

Figure 11. The user interface for the line detection example

Image processing determines an edge by sensing close-proximity pixels of differing intensity. For example, a black pixel next to a white pixel defines a hard edge. A gray pixel next to a black (or white) pixel defines a soft edge. The Threshold parameter sets a minimum on how hard an edge must be in order to be classified as an edge. A Threshold of 255 would require a white pixel next to a black pixel to qualify as an edge. As the Threshold value decreases, softer edges in the image appear in the display.

After the algorithm detects an edge, it must make a difficult decision: is this edge part of a straight line? The Hough transform, employed to make this decision, attempts to group pixels classified as edges into a straight line. It uses the MinLength and MaxGap parameters to classify a group of edge pixels as either a straight continuous line or ignored background information (edge pixels that are not part of a continuous straight line are considered background, and therefore not a feature).

Threshold: Sets the minimum difference between adjoining groups of pixels to be classified as an edge.

MinLength: The minimum number of "continuous" edge pixels required to classify a potential feature as a straight line.

MaxGap: The maximum allowable number of missing edge pixels that still enable classification of a potential feature as a "continuous" straight line.
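The effect of MinLength and MaxGap can be sketched in one dimension, on the run of edge flags along a single candidate line (this is the post-classification step only, not the full Hough transform):

```python
def group_line_segments(edge_flags, min_length, max_gap):
    """Hough-style grouping along one candidate line: runs of edge
    pixels whose internal gaps never exceed max_gap are kept as line
    segments if they contain at least min_length edge pixels;
    everything else is treated as background."""
    segments, start, gap, count = [], None, 0, 0
    # Pad with non-edges so the final run is flushed.
    for i, is_edge in enumerate(edge_flags + [False] * (max_gap + 1)):
        if is_edge:
            if start is None:
                start = i
            count, gap = count + 1, 0
        elif start is not None:
            gap += 1
            if gap > max_gap:          # the run has ended
                if count >= min_length:
                    segments.append((start, i - gap))
                start, count = None, 0
    return segments

# One 5-pixel run with a single-pixel gap survives; the lone edge
# pixel at the end is rejected as background.
edges = [True, True, True, False, True, False, False, False, True]
segs = group_line_segments(edges, min_length=3, max_gap=1)
```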

Optical Flow

Optical flow estimates motion by analyzing how groups of pixels in the current frame changed position from the previous frame of a video sequence (Figure 12). The "group of pixels" is a feature. Optical flow estimation finds use in predicting where objects will be in the next frame. Many optical flow estimation algorithms exist; this particular example uses the Lucas-Kanade approach. The algorithm's first step is finding "good" features to track between frames. Specifically, the algorithm looks for groups of pixels containing corners or points.

Figure 12. The user interface for the optical flow example

The qlevel variable determines the quality of a selected feature. Consistency is the end objective of all the math used to find quality features. A "good" feature (a group of pixels surrounding a corner or point) is one that an algorithm can find under various lighting conditions and as the object moves. The goal is to find these same features in each frame. Once the same feature appears in consecutive frames, tracking an object is possible. The lines in the output video represent the optical flow of the selected features.

The MaxCount parameter determines the maximum number of features to look for. The minDist parameter sets the minimum distance between features. The more features used, the more reliable the tracking. The features are not perfect, and sometimes a feature used in one frame disappears in the next frame. Using multiple features decreases the chances that the algorithm will not be able to find any features in a frame.

MaxCount: The maximum number of good features to look for in a frame.

qlevel: The acceptable quality of the features. A higher-quality feature is more likely to be unique, and therefore to be correctly found in the next frame. A low-quality feature may get lost in the next frame or, worse yet, be confused with another point in the image.

minDist: The minimum distance between selected features.
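For readers curious what the Lucas-Kanade step computes under the hood, here is a minimal single-window sketch in NumPy. This is an illustration of the underlying least-squares math only, not OpenCV's pyramidal, multi-feature implementation; all function and variable names here are invented for the example:

```python
import numpy as np

def lucas_kanade_flow(prev, curr):
    """Single-window Lucas-Kanade: solve the 2x2 least-squares system
    built from spatial (Ix, Iy) and temporal (It) derivatives."""
    Iy, Ix = np.gradient(prev.astype(float))      # spatial gradients
    It = curr.astype(float) - prev.astype(float)  # temporal derivative
    A = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    b = -np.array([np.sum(Ix * It), np.sum(Iy * It)])
    return np.linalg.solve(A, b)  # flow vector (vx, vy)

# Two frames of a Gaussian blob moving one pixel to the right
y, x = np.mgrid[0:32, 0:32]
frame = lambda cx: np.exp(-((x - cx) ** 2 + (y - 16) ** 2) / 18.0)
vx, vy = lucas_kanade_flow(frame(15), frame(16))
print(round(float(vx), 2), round(float(vy), 2))  # vx near 1.0, vy near 0.0
```

The 2x2 matrix is well-conditioned only where the window contains gradients in two directions, which is exactly why the algorithm prefers corners over edges as features.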

Face Detector

The face detector used in this example is based on the Viola-Jones feature detector algorithm (Figure 13). Throughout this article, we have been working with different algorithms for finding features, i.e., closely grouped pixels in an image or frame that are unique in some way. The motion detector subtracted one frame from the next to find pixels that moved, classifying these pixel groups as features. In the line detector example, features were groups of pixels organized in a straight line. And in the optical flow example, features were groups of pixels organized into corners or points in an image.

Figure 13. The user interface for the face detector example

The Viola-Jones algorithm uses a discrete set of six Haar-like features (the OpenCV implementation adds additional features). Haar-like features in a 2D image include edges, corners, and diagonals. They are very similar to features in the optical flow example, except that detection of these particular features occurs via a different method.

As the name implies, the face detector example detects faces. Detection occurs within each individual frame; the detector does not track the face from frame to frame. The face detector can also detect objects other than faces. An XML file "describes" the object to detect. OpenCV includes various Haar cascade XML files that you can use to detect various object types. OpenCV also includes tools to allow you to train your own cascade to detect any object you desire and save it as an XML file for use by the detector.

MinSize: The smallest face to detect. As a face gets further from the camera, it appears smaller; this parameter therefore also defines the furthest distance at which a face can still be detected.

MinN: The minimum neighbor parameter groups faces that are detected multiple times into one detection. The face detector actually detects each face multiple times in slightly different positions; this parameter simply defines how to group those detections together. For example, a MinN of 20 would group all detections within 20 pixels of each other as a single face.

ScaleF: The scale factor determines the number of sizes at which the face detector runs at each pixel location. The Haar cascade XML file that describes the to-be-detected object is designed for an object of only one size, so detecting objects of various sizes (faces close to the camera as well as far away, for example) requires scaling the detector. This scaling process must occur at every pixel location in the image and is computationally expensive. A scale factor that is too large will miss faces whose size falls between detector sizes; a scale factor that is too small, conversely, can use a huge amount of CPU resources. You can see this phenomenon in the example if you first set the scale factor to its maximum value of 10: as each face moves closer to or further from the camera, the detector fails to detect it at certain distances, where the face size is in between detector sizes. If you instead decrease the scale factor to its minimum, the required CPU resources skyrocket, as shown by the extended detection time.
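The CPU/accuracy trade-off follows directly from how the scale factor generates the set of detector sizes. This toy calculation (not OpenCV code; the function name and size bounds are made up for illustration) shows how the number of scales, and hence the per-frame work, grows as the factor shrinks:

```python
def detector_scales(min_size, max_size, scale_factor):
    """Window sizes a multi-scale detector would evaluate, each one
    scale_factor times larger than the last."""
    sizes, s = [], float(min_size)
    while s <= max_size:
        sizes.append(round(s))
        s *= scale_factor
    return sizes

# A coarse factor checks only a few sizes; faces in between are missed.
print(detector_scales(24, 480, 2.0))       # [24, 48, 96, 192, 384]
# A fine factor checks many sizes, multiplying the per-frame work.
print(len(detector_scales(24, 480, 1.1)))  # 32 scales
```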

Canny Edge Detector

Many algorithms exist for finding edges in an image. This example focuses on the Canny algorithm (Figure 14). Considered by many to be the best edge detector, the Canny algorithm was developed in 1986 by John F. Canny of U.C. Berkeley. In his paper, "A computational approach to edge detection," Canny describes three criteria to evaluate the quality of edge detection:

  1. Good detection: There should be a low probability of failing to mark real edge points, and low probability of falsely marking non-edge points. This criterion corresponds to maximizing signal-to-noise ratio.
  2. Good localization: The points marked as edge points by the operator should be as close as possible to the center of the true edge.
  3. Only one response to a single edge: This criterion is implicitly also captured in the first one, since when there are two responses to the same edge, one of them must be considered false.

Figure 14. The user interface for the Canny edge detector example

The example allows you to modify the Canny parameters on the fly using simple slider controls.

Low Thres: Canny low threshold parameter (T2) (LowThres)

High Thres: Canny high threshold parameter (T1) (HighThres)

Gaus Size: Gaussian filter size (Fsize)

Sobel Size: Sobel operator size (Ksize)

The example also opens six windows representing the stages in the Canny edge detection algorithm. All windows are updated in real-time.

Gaussian Filter: This window shows the output of the Gaussian filter.

GradientX: The result of the horizontal derivative (Sobel) of the image in the Gaussian Filter window.

GradientY: The result of the vertical derivative (Sobel) of the image in the Gaussian Filter window.

Magnitude: This window shows the result of combining the GradientX and GradientY images using the approximation G = |Gx| + |Gy|.

Angle: Color-coded result of the angle equation, combining GradientX and GradientY using arctan(Gy/Gx).

Black = 0 degrees
Red = 1 to 45 degrees
White = 46 to 135 degrees
Blue = 136 to 225 degrees
Green = 226 to 315 degrees
Red = 316 to 359 degrees

The 0-degree marker indicates the left-to-right direction, as shown in Figure 15.

Canny: The Canny edgemap

Figure 15. The Direction Color Code for the Angle Window in the Canny Edge Detector Example. Left to Right is 0 Degrees
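The window contents described above can be reproduced numerically. This NumPy sketch is a simplified stand-in: np.gradient replaces the Sobel operator, and the final non-maximum suppression and hysteresis steps are omitted; only the intermediate stages the example's windows display are computed:

```python
import numpy as np

def canny_stages(img, sigma=1.0):
    """Compute the intermediate images shown in the example's windows:
    Gaussian-smoothed image, GradientX, GradientY, magnitude, and angle.
    (np.gradient stands in for Sobel; hysteresis is omitted.)"""
    r = int(3 * sigma)
    k = np.exp(-np.arange(-r, r + 1) ** 2 / (2 * sigma ** 2))
    k /= k.sum()  # separable 1-D Gaussian kernel
    smooth = np.apply_along_axis(
        lambda m: np.convolve(m, k, mode="same"), 0, img.astype(float))
    smooth = np.apply_along_axis(
        lambda m: np.convolve(m, k, mode="same"), 1, smooth)
    gy, gx = np.gradient(smooth)                 # GradientY, GradientX windows
    mag = np.abs(gx) + np.abs(gy)                # Magnitude: G = |Gx| + |Gy|
    ang = np.degrees(np.arctan2(gy, gx)) % 360   # Angle window
    return smooth, gx, gy, mag, ang

# A vertical step edge: the magnitude should peak at the boundary column
img = np.zeros((16, 16))
img[:, 8:] = 255.0
smooth, gx, gy, mag, ang = canny_stages(img)
print(int(mag[8].argmax()))  # peak lands at the edge (column 7 or 8)
```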

Detection Time

Each of these examples writes the detection time to the console while the algorithm is running. This time represents the number of milliseconds the particular algorithm took to execute; a longer time represents higher CPU utilization. The OpenCV library as built in these examples does not have hardware acceleration enabled; however, OpenCV currently supports CUDA and NEON acceleration.
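Reproducing that measurement in your own code is a matter of bracketing the detector call with a high-resolution timer. In this sketch the workload line is only a stand-in for an actual per-frame detector call:

```python
import time

start = time.perf_counter()
sum(i * i for i in range(100_000))  # stand-in for running a detector on one frame
elapsed_ms = (time.perf_counter() - start) * 1000.0
print(f"Detection time: {elapsed_ms:.1f} ms")
```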


The intent of this article and accompanying BDTI Quick-Start OpenCV Kit software is to help the reader quickly get up and running with OpenCV. The examples discussed in this article represent only a minuscule subset of the algorithms available in OpenCV; I chose them because, at a high level, they represent a broad variety of computer vision functions. Leveraging these algorithms in combination with, or alongside, other algorithms can help you solve various industrial, medical, automotive, and consumer electronics design problems.

About BDTI

Berkeley Design Technology, Inc. (BDTI) provides world-class engineering services for the design and implementation of complex, reliable, low-cost video and embedded computer vision systems.  For details of BDTI technical competencies in embedded vision and a listing of example projects, go to www.BDTI.com.

For 20 years, BDTI has been the industry's trusted source of analysis, advice, and engineering for embedded processing technology and applications. Companies rely on BDTI to prove and improve the competitiveness of their products through benchmarking and competitive analysis, technical evaluations, and embedded signal processing software engineering. For free access to technology white papers, benchmark results for a wide range of processing devices, and presentations on embedded signal processing technology, visit www.BDTI.com.

Appendix: References

OpenCV v2.3 Programmer's Reference Guide

OpenCV Wiki

Embedded Vision Alliance

Viola-Jones Object Detector (PDF)

Good Features to Track (PDF)

Lucas-Kanade Optical Flow (PDF)

Revision History

April 25, 2012 Initial version of this document
May 3, 2012 Added Canny edge detection example

Designing High-Performance Video Systems in 7 Series FPGAs with the AXI Interconnect


By Sateesh Reddy Jonnalagada and Vamsi Krishna
Xilinx Corporation
Embedded vision applications deal with a lot of data; a single 1080p60 (1920x1080 pixels per frame, 60 frames per second) 24-bit color video stream requires nearly 3 Gbps of bandwidth, and 8-bit alpha (transparency) or 3-D depth data further amplifies the payload by 33% in each case. Transferring that data from one node to another quickly and reliably is critical to robust system operation. As such, advanced interconnect technologies such as Xilinx's AXI are valuable in embedded vision designs. This is a reprint of a Xilinx-published application note, which is also available here (2.1 MB PDF).

This application note covers the design considerations of a video system using the performance features of the LogiCORE™ IP Advanced eXtensible Interface (AXI) Interconnect core. The design focuses on high system throughput, using approximately 80% of DDR memory bandwidth through the AXI Interconnect core, with FMAX and area optimizations in certain portions of the design.

The design uses eight AXI video direct memory access (VDMA) engines to simultaneously move 16 streams (eight transmit video streams and eight receive video streams), each in 1920 x 1080 pixel format at 60 or 75 Hz refresh rates, and up to 32 data bits per pixel. Each VDMA is driven from a video test pattern generator (TPG) with a video timing controller (VTC) block to set up the necessary video timing signals. Data read by each AXI VDMA is sent to a common on-screen display (OSD) core capable of multiplexing or overlaying multiple video streams to a single output video stream. The output of the OSD core drives the onboard high-definition multimedia interface (HDMI) video display interface through the color space converter.

The performance monitor block is added to capture DDR performance metrics. DDR traffic passes through the AXI Interconnect to move 16 video streams over eight VDMA pipelines. All 16 video streams moved by the AXI VDMA blocks are buffered through a shared DDR3 SDRAM memory and are controlled by a MicroBlaze™ processor.

The reference system is targeted for the Kintex-7 FPGA XC7K325TFFG900-1 on the Xilinx KC705 evaluation board (revision C or D) (Reference 1).

Included Systems

The reference design is created and built using version 13.4 of the Xilinx Platform Studio (XPS) tool, which is part of the ISE® Design Suite: System Edition. XPS helps simplify the task of instantiating, configuring, and connecting IP blocks together to form complex embedded systems. The design also includes software built using the Xilinx Software Development Kit (SDK). The software runs on the MicroBlaze processor subsystem and implements control, status, and monitoring functions. Complete XPS and SDK project files are provided with this application note to allow the user to examine and rebuild the design or to use it as a template for starting a new design.


High-performance video systems can be created using Xilinx AXI IP. The use of AXI Interconnect, Memory Interface Generator (MIG), and VDMA IP blocks can form the core of video systems capable of handling multiple video streams and frame buffers sharing a common DDR3 SDRAM memory. AXI is a standardized IP interface protocol based on the Advanced Microcontroller Bus Architecture (AMBA®) specification. The AXI interfaces used in the reference design consist of AXI4, AXI4-Lite, and AXI4-Stream interfaces as described in the AMBA AXI4 specifications (Reference 2). These interfaces provide a common IP interface protocol framework around which to build the design.

Together, the AXI interconnect and AXI MIG implement a high-bandwidth, multi-ported memory controller (MPMC) for use in applications where multiple devices share a common memory controller. This is a requirement in many video, embedded, and communications applications where data from multiple sources moves through a common memory device, typically DDR3 SDRAM.

AXI VDMA implements a high-performance, video-optimized DMA engine with frame buffering, scatter gather, and two-dimensional (2D) DMA features. AXI VDMA transfers video data streams to or from memory and operates under dynamic software control or static configuration modes.

A clock generator and processor system reset block supplies clocks and resets throughout the system. High-level control of the system is provided by an embedded MicroBlaze processor subsystem containing I/O peripherals and processor support IP. To optimize the system to balance performance and area, multiple AXI Interconnect blocks are used to implement segmented/hierarchical AXI Interconnect networks with each AXI Interconnect block individually tuned and optimized.

Hardware Requirements

The hardware requirements for this reference system are:

  • Xilinx KC705 evaluation board (revision C or D)
  • Two USB Type-A to Mini-B 5-pin cables
  • HDMI to DVI cable
  • Display monitor supporting 1920 x 1080 pixel resolution up to 75 frames/sec (The reference design was tested using a Dell P2210 monitor)

The installed software tool requirements for building and downloading this reference system are:

  • Xilinx Platform Studio 13.4
  • ISE Design Suite 13.4
  • SDK 13.4

Reference Design Specifics

In addition to the MicroBlaze processor, the reference design includes these cores:

  • MDM
  • LMB block RAM
  • AXI2AXI Connector
  • csc_rgb_to_ycrcb422

Figure 1 and Table 1 show a block diagram and address map of the system, respectively.

Figure 1. Reference System Block Diagram

Table 1. Reference System Address Map


Hardware System Specifics

This section describes the high-level features of the reference design, including how to configure the main IP blocks. Information about useful IP features, performance/area trade-offs, and other configuration information is also provided. This information is applied to a video system, but the principles used to optimize the system performance apply to a wide range of high-performance AXI systems. For information about AXI system optimization and design trade-offs, see AXI Reference Guide (Reference 3).

This application note assumes the user has some general knowledge of XPS. See EDK Concepts, Tools, and Techniques: A Hands-On Guide to Effective Embedded System Design (Reference 4) for more information about the XPS tools.

Video-Related IP

The reference design implements eight video pipelines, each running in a 1920 x 1080 pixel format at 60 or 75 frames/sec. Each pixel occupies four bytes, representing an upper bound for high-quality video formats such as RGBA (with alpha channel information). Each video pipeline running at 60 frames/sec requires a bandwidth of 497.7 MB/s (~4 Gb/s), whereas at 75 frames/sec each video pipeline requires a bandwidth of 622 MB/s (~5 Gb/s).
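The per-pipeline figures follow directly from width x height x bytes per pixel x frame rate. A quick check, using decimal MB (10^6 bytes) to match the numbers above:

```python
def stream_bw_mb_s(width, height, bytes_per_pixel, fps):
    """Raw bandwidth of one uncompressed video stream in MB/s (10^6 bytes)."""
    return width * height * bytes_per_pixel * fps / 1e6

print(round(stream_bw_mb_s(1920, 1080, 4, 60), 1))  # 497.7 MB/s per pipeline
print(round(stream_bw_mb_s(1920, 1080, 4, 75), 1))  # 622.1 MB/s per pipeline
```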

Note: The source code supplied with the reference design is for 1920 x 1080 pixels running at 75 Hz. To operate the same design at 60 Hz, the user should change the sixth port input frequency of the clock generator to 148500000 in the microprocessor hardware specification (MHS) file and run the design.

The video traffic is generated by TPG IP cores and displayed by the OSD core. The total aggregate read/write bandwidth generated is equivalent to 16 video streams requiring 9.95 GB/s (~80 Gb/s).

This application note demonstrates AXI system performance using 16 high-definition video streams. At a minimum, video systems must include a source, some internal processing, and a destination. There can be multiple stages internally using a variety of IP modules. The canonical video system in Figure 2 shows that most video systems consist of input, pre-processing, main processing, post-processing, and output stages. Many of the video stages illustrated require memory access at video rates. Video data goes in or out of memory according to the requirements of internal processing stages. In this application note, a series of test pattern generators create the internal IP block memory traffic to simulate typical conditions.

Figure 2. Typical Video System

AXI Interconnects

This design contains multiple AXI Interconnects each tuned to balance for throughput, area, and timing considerations (see LogiCORE IP AXI Interconnect Product Specification (v1.05.a) (Reference 5). The AXI_MM0, AXI_MM1, and AXI_MM2 instances are used for high-speed masters and slaves that include high throughput and high FMAX optimizations. The AXI_MM0, AXI_MM1, and AXI_MM2 interconnects are optimized for higher throughput. They are used to buffer frame data generated by the TPG and to access the same data from the buffer through the VDMA to display on the LCD. The AXI_Lite and AXI_Lite_Video Interconnect instances are generally optimized for area. They are used by the processor to access slave registers and to write to the VDMA register space for control of the AXI VDMA. The AXI VDMA operation and its register descriptions are described in detail in LogiCORE IP AXI Video Direct Memory Access (axi_vdma) (v3.01.a) data sheet (Reference 6).

AXI Interconnect (AXI_MM Instance)

This AXI Interconnect instance provides the highest FMAX and throughput for the design by having a 512-bit core data width and running at 200 MHz. The AXI Interconnect core data width and clock frequency match the capabilities of the attached AXI MIG so that width and clock converters between them are not needed. Sizing the AXI Interconnect core data width and clock frequency below the native width and clock frequency of the memory controller creates a system bandwidth bottleneck in the system. To help meet the timing requirements of a 512-bit AXI interface at 200 MHz, a rank of register slices are enabled between AXI_MM Interconnect and AXI MIG. Together, AXI Interconnect and AXI MIG form an 18-port AXI MPMC connected to MicroBlaze processor instruction cache (ICache) and data cache (DCache) ports, eight AXI VDMA MM2S ports, and eight AXI VDMA S2MM ports. The configuration of this AXI Interconnect is consistent with the system performance optimization recommendations for an AXI MPMC based system as described in the AXI Reference Guide (Reference 3).

AXI VDMA Instances

The AXI VDMA core is designed to provide video read/write transfer capabilities from the AXI4 domain to the AXI4-Stream domain, and vice versa. The AXI VDMA provides high-speed data movement between system memory and AXI4-Stream based target video IP. AXI4 interfaces are used for the high-speed data movement and buffer descriptor fetches across the AXI Interconnect.

The AXI VDMA core incorporates video-specific functionality, i.e., Gen-Lock and Frame Sync, for fully synchronized frame DMA operations and 2D DMA transfers. In addition to synchronization, frame store numbers and scatter gather or register direct mode operations are available for ease-of-control by the central processor.

In this design, the AXI VDMA scatter gather feature is not used because the system could be implemented sufficiently using the simpler register direct mode of AXI VDMA, which would remove the area cost of the scatter gather feature. Scatter gather should only be enabled if the system requires relatively complex software control of the AXI VDMA operations.

Initialization, status, and management registers in the AXI VDMA core are accessed through an AXI4-Lite slave interface.

This design uses eight instances of AXI VDMA, each using two 64-bit interfaces toward the AXI4 memory map and two 32-bit interfaces toward the streaming side. The upsizer in the VDMA is used to convert 32-bit transactions from the streaming side to 64-bit wide transactions to the memory map side of the VDMA core. Similarly, downsizers are used to convert 64-bit memory-mapped transactions to 32-bit streaming side transactions.
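Conceptually, the upsizer packs consecutive narrow beats into one wide beat, and the downsizer unpacks them. A sketch of the 32-to-64-bit case (illustrative only; the actual hardware also handles write strobes, alignment, and partial beats, and the function name here is invented):

```python
def upsize_32_to_64(beats32):
    """Pack pairs of 32-bit data beats into 64-bit beats,
    first beat in the low half (little-endian lane ordering)."""
    assert len(beats32) % 2 == 0, "needs an even number of beats"
    return [(beats32[i + 1] << 32) | beats32[i]
            for i in range(0, len(beats32), 2)]

print(hex(upsize_32_to_64([0x11111111, 0x22222222])[0]))
# 0x2222222211111111
```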

The 64-bit wide MM2S and S2MM interfaces from the AXI VDMA instances are connected to the AXI_MM instance of the AXI Interconnect. The masters run at 148.5 MHz (60 Hz frame rate)/185 MHz (75 Hz frame rate) (video clock), which require asynchronous clock converters to the 200 MHz AXI Interconnect core frequency. Upsizers in the AXI Interconnect are used to convert 64-bit transactions from the AXI VDMA to 512-bit wide transactions to the AXI Interconnect core.

For maximum throughput for the AXI VDMA instances, the maximum burst length is set to 256. In addition, the master interfaces have a read and write issuance of 4 and a read and write FIFO depth of 512 to maximize throughput. These settings all follow performance recommendations for AXI endpoint masters as described in the AXI Reference Guide (Reference 3).

In addition, line buffers inside the AXI VDMA for the read and write sides are set to 1K deep, and the store and forward features of the AXI VDMA are enabled on both channels to improve system performance and reduce the risk of system throttling. See the LogiCORE IP AXI Video Direct Memory Access (axi_vdma) (v3.01.a) data sheet (Reference 6) for more information.

If the design sets the parameter C_PRMRY_IS_ACLK_ASYNC to 1, follow these steps:

  1. Right-click on the core instance and select Make This IP Local to make the pcore local to the XPS project.
  2. Navigate to the pcores/axi_vdma_v5_00_a/data/ directory.
  3. Open the axi_vdma_2_1_0.tcl file.
  4. Comment out any lines from 77 to 136 in the TCL file that incorrectly constrain signals in the same clock domain. For example, if the core is set to asynchronous mode (C_PRMRY_IS_ACLK_ASYNC=1) and m_axi_mm2s_aclk and s_axi_lite_aclk use the same clock source, comment out these timing ignore (TIG) constraints:

    puts $outputFile "TIMESPEC TS_${instname}_from_s_axi_lite_aclk_to_m_axi_mm2s_aclk = FROM \"s_axi_lite_aclk\" TO \"m_axi_mm2s_aclk\" TIG;"

    puts $outputFile "TIMESPEC TS_${instname}_from_m_axi_mm2s_aclk_to_s_axi_lite_aclk = FROM \"m_axi_mm2s_aclk\" TO \"s_axi_lite_aclk\" TIG;"
  5. Save the file.
  6. In XPS, select Project and click Rescan User Repositories.

MicroBlaze Processor ICache and DCache

The MicroBlaze processor ICache and DCache masters are connected to the AXI Interconnect and run at 100 MHz because the MicroBlaze processor runs a software application from main memory that sets up and monitors the video pipelines. Running the MicroBlaze processor at this frequency helps timing and area.

See the MicroBlaze Processor Reference Guide: Embedded Development Kit EDK 13.4 (Reference 7) for more information. The 100 MHz clock setting ensures that synchronous integer ratio clock converters in the AXI Interconnect can be used, which offers lower latency and less area than asynchronous converters.


The single slave connected to the AXI Interconnect is the axi_7series_ddrx memory controller (a block that integrates the MIG tool into XPS). The memory controller’s AXI Interface is 512 bits wide running at 200 MHz and disables narrow burst support for optimal throughput and timing. This configuration matches the native AXI interface clock and width corresponding to a 64-bit DDR3 DIMM at 800 MHz memory clock, which is the maximum performance of the memory controller for a Kintex-7 device in -1 speed grade.
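Those figures put the 16-stream workload close to the interface's theoretical ceiling. A back-of-the-envelope check, assuming an ideal DDR3 bus (64 bits wide, 800 MHz memory clock, two transfers per clock) with no refresh or turnaround overhead:

```python
peak_gb_s = (64 // 8) * 800e6 * 2 / 1e9       # 12.8 GB/s theoretical peak
video_gb_s = 16 * 1920 * 1080 * 4 * 75 / 1e9  # ~9.95 GB/s of video traffic
print(round(video_gb_s / peak_gb_s, 2))       # ~0.78: roughly 80% utilization
```

This is consistent with the design goal of sustaining approximately 80% of DDR memory bandwidth through the AXI Interconnect.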

The slave interface has a read/write issuance of eight. Register slices are enabled to ensure that the interface meets timing at 200 MHz. These settings help ensure that a high degree of transaction pipelining is active to improve system throughput. See the 7 Series FPGAs Memory Interface Solutions User Guide (Reference 8) for more information about the memory controller.

AXI Interconnect (AXI_Lite, AXI_Lite_Video)

The MicroBlaze processor data peripheral (DP) interface master writes and reads to all AXI4-Lite slave registers in the design for control and status information.

These interconnects are 32 bits and do not require high FMAX and throughput. Therefore, they are connected to a slower FMAX portion of the design by a separate AXI Interconnect.

Because there are more than 16 AXI4-Lite slave interfaces in the design, AXI2AXI connectors and additional AXI Interconnect instances are required to allow the processor to access all the AXI4-Lite interfaces in the system.

The AXI_Lite and AXI_Lite_Video AXI Interconnect blocks are configured for shared-access mode because high throughput is not required in this portion of the design. Therefore, area can be optimized over performance on these interconnect blocks. Also, these two interconnects are clocked at 50 MHz to ensure that synchronous integer ratio clock converters in the AXI Interconnect can be used, which offer lower latency and less area than asynchronous clock converters.

AXI_Lite Interconnect

The slaves on the AXI_Lite Interconnect are for MDM, AXI_UARTLITE, AXI_IIC, AXI_INTC, AXI_VTC (two instances), AXI OSD, and the slave AXI2AXI connectors to the AXI_Lite_Video interconnect.

AXI_Lite_Video Interconnect

An AXI2AXI connector connects the AXI_Lite Interconnect to the AXI_Lite_Video Interconnect as a master. The slaves on this AXI Interconnect are AXI_TPG (eight instances) and the AXI VDMA slave interface (eight instances).


The AXI VTC is a general-purpose video timing generator and detector. The input side of this core automatically detects horizontal and vertical synchronization pulses, polarity, blanking timing, and active video pixels. The output side of the core generates the horizontal and vertical blanking and synchronization pulses used in a standard video system including support for programmable pulse polarity.

The AXI VTC contains an AXI4-Lite Interface to access slave registers from a processor. For more information about the AXI VTC, see the LogiCORE IP Video Timing Controller v3.0 data sheet (Reference 9).

In this design, two AXI VTC instances are used without detection. The first instance is used for the video input portion of the video pipelines. The second instance is used for the AXI OSD, which is the read portion of the video pipelines.

The Video Timing Controller v3.0 core is provided under license and can be generated using the CORE Generator™ tool v13.2 or higher.


The AXI TPG contains an AXI4-Lite Interface to access slave control registers from a processor.

In this reference design, the video traffic to DDR3 memory is generated by a series of TPGs. Each TPG block can generate several video test patterns that are commonly used in the video industry for verification and testing. The TPG is used as a replacement for other video IP because only the amount of traffic generated, which demonstrates the performance of the system, is of interest. The control software demonstrates generation of flat colors, color bars, horizontal and vertical burst patterns, and zone plates. No matter which test pattern is chosen, the amount of data generated is the same, namely 1080p HD video. For example, an RGBA (32-bit), 1080p60 pattern generates 497.7 MB/s, which is nearly a 4 Gb/s data stream. Similarly, an RGBA (32-bit), 1920 x 1080 pixel pattern at a 75 Hz frame rate generates 622 MB/s, which is nearly a 5 Gb/s data stream.

Several operating modes are accessible through software control. In this application note, the TPG always generates a test pattern that could be one of flat colors, color bars, horizontal ramp, vertical ramp, or zoneplates. These patterns are meant for testing purposes only and are not calibrated to broadcast industry standards.


The OSD LogiCORE IP provides a flexible video-processing block for alpha blending, compositing up to eight independent layers, and generating simple text and graphics capable of handling images up to 4K x 4K sizes in YUVA 4:4:4 or RGBA image formats in 8, 10, or 12 bits per color component. In this application note, the OSD blends the eight video streams as separate display layers. Because the video streams generated by the TPG cores are enabled through software control, the display shows the blended layers on top of each other. Figure 3 shows a three-level block diagram of the OSD core.

Figure 3. Sample Three-Layer OSD Core Block Diagram

The AXI OSD contains an AXI4-Lite interface to access the slave registers from a processor. For more information about the AXI OSD, see the LogiCORE IP Video On-Screen Display v2.0 data sheet (Reference 10).

The Video On-Screen Display core is provided under the SignOnce IP site license and can be generated using the CORE Generator tool, which is a part of the Xilinx ISE Design Suite software tool.

A simulation evaluation license for the core is shipped with the CORE Generator system. To access the full functionality of the core, including FPGA bit-stream generation, a full license must be obtained from Xilinx.

AXI Performance Monitor

The AXI performance monitor core (AXI PERFORMANCE MONITOR) measures throughput for a DDR3 memory connected to the AXI Interconnect. The processor accesses the AXI performance monitor core registers through a slave AXI4-Lite interface contained in the core. The AXI performance monitor core only monitors the read and write channels between the AXI slave and the AXI Interconnect. The core does not modify or change any of the AXI transactions it is monitoring. The core also calculates the glass-to-glass delay of the system by connecting appropriate signals to it.

Note: In this application note, glass-to-glass delay is defined as the number of clock cycles consumed to display a frame from the TPG (video source) on an LCD screen (video sink).

Several signals must be connected in the system to measure the throughput. The DDR slave interconnect (AXI_MM) is connected to one of the four slots of the monitor. In addition, the AXI_Lite bus interface is connected so that the processor can access the core registers. Beyond the signals of these two bus interfaces, the core clock (the higher of the two bus interface clock frequencies) must be connected. To evaluate the glass-to-glass delay of the system, "Vid_clk", "Vtc0_Fsync", "Vsync_osd", "Tpg_Active_video_in", "Tpg_Data", "Osd_Active_Video_In", and "Osd_data" are also connected. The Fsync signal generated by the VTC and the Vsync signal generated by the color space converter are used to evaluate glass-to-glass delay.

The core can measure performance metrics such as total read byte count, write byte count, read requests, write requests, and write responses. Count start and count end conditions come from the processor through the register interface. The global clock counter of the core measures the number of clocks between the count start and count end events. The counters used for the performance monitor can be configured for 32 or 64 bits through the register interface. Final user-selectable metrics can also be read through the register interface.

In this application note, the DDR3 slave is connected to one of the slots of the AXI performance monitor core to measure the throughput of the core. Valid, ready, strobe, and other AXI signals connected to the performance monitor slots are used to enable various counters for measuring events on the bus.
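Converting the raw counters into a throughput figure is simple arithmetic: bytes moved divided by the elapsed cycle count over the clock rate. A sketch (the function and sample numbers are mine, not part of the core's register interface):

```python
def counter_throughput_gb_s(byte_count, clock_cycles, clock_hz):
    """Throughput implied by a byte counter sampled over a cycle window."""
    return byte_count / (clock_cycles / clock_hz) / 1e9

# e.g. 2 x 10^9 bytes moved in 10^8 cycles of the 200 MHz interconnect clock
print(counter_throughput_gb_s(2e9, 1e8, 200e6))  # 4.0 GB/s
```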

Software Applications: AXI VDMA DISPLAY Application

The software application starts up the video pipelines allowing the user to examine bandwidth in real time and display separate layers or alpha blend all layers on the LCD screen.

Application-level software for controlling the system is written in C using the provided drivers for each IP. The programmer’s model for each IP describes the particular API used by the drivers. Alternatively, application software can be written to use the IP control registers directly and handle the interrupts at the application level, but using the provided drivers and a layer of control at the application level is a far more convenient option.

The application software in the reference design performs these actions:

  1. The software application first resets the HDMI port on the KC705 board through the IIC interface.
  2. The TPG instances are set to write a default gray pattern that does not start until the AXI VTC instances are started.
  3. The AXI VDMA instances are started, with the processor writing to their registers. The program then starts the read/write channels to begin the transfers for the VDMA instances.
  4. The AXI VTC instances are started with 1920 x 1080 pixels (75 Hz) timing configuration.
  5. The AXI OSD is configured for 1920 x 1080 resolution output. The eight TPG instances in the design are configured to write:
    • Color bars (layer 0)
    • Zone plate patterns (layer 1)
    • Vertical bars (layer 2)
    • Horizontal bars (layer 3)
    • Tartan bars (layer 4)
    • Flat red (layer 5)
    • Flat green (layer 6)
    • Flat blue (layer 7)

After the initial setup sequence, the user can choose to view a particular layer by selecting a number (option 0–7). When a layer is selected, the OSD registers are modified so that the alpha value of that layer is set to the maximum while all other layers are set to the minimum. When option 8 is selected (alpha blending of all layers), a different value is written to the alpha blending register of each layer so that all layers appear on the LCD screen at the same time. Option 9 reads performance metrics from the core, and option d displays the glass-to-glass delay of the system.
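The layer-selection logic amounts to writing a full-scale alpha to the chosen layer and the minimum to the rest. A sketch of that logic (the alpha codes and the equal-weight scheme for option 8 are hypothetical; the real application writes these values through the AXI OSD driver):

```python
ALPHA_MAX, ALPHA_MIN = 0x100, 0x00   # hypothetical full-scale / zero alpha codes
NUM_LAYERS = 8

def layer_alphas(option):
    """Return per-layer alpha values for a menu option.

    Options 0-7 show a single layer; option 8 blends all layers.
    """
    if option == 8:
        # equal weighting across all eight layers
        return [ALPHA_MAX // NUM_LAYERS] * NUM_LAYERS
    # selected layer fully opaque, all others fully transparent
    return [ALPHA_MAX if i == option else ALPHA_MIN
            for i in range(NUM_LAYERS)]
```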

Executing the Reference Design in Hardware

This section provides instructions to execute the reference design in hardware. This reference design runs on the KC705 board shown in Figure 4.

Figure 4. KC705 Board

In these instructions, numbers in parentheses correspond to callout numbers in Figure 4. Not all callout numbers are referenced.

  1. Connect a USB cable from the host PC to the USB JTAG port (6). Ensure the appropriate device drivers are installed.
  2. Connect a second USB cable from the host PC to the USB UART port (12). Ensure that the USB UART drivers described in Hardware Requirements have been installed.
  3. Connect the KC705 HDMI connector (18) to a video monitor capable of displaying a 1920 x 1080 resolution and displaying up to 75 Hz video signal.
  4. Connect a power supply cable.
  5. Set power ON (27).
  6. Start a terminal program (e.g., HyperTerminal) on the host PC with these settings:
    • Baud Rate: 9600
    • Data Bits: 8
    • Parity: None
    • Stop Bits: 1
    • Flow Control: None

Executing the Reference System Using the Pre-Built Bitstream and the Compiled Software Application

These are the steps to execute the system using files in the ready_for_download directory of the <unzip_dir>/kc705_video_8x_pipeline/ directory:

  1. In a command shell or terminal window, change directories to the ready_for_download directory. Move into one of the directories 60Hz or 75Hz (75Hz is shown in this example):

    % cd <unzip dir>/kc705_video_8x_pipeline/ready_for_download/75Hz
  2. Invoke the Xilinx Microprocessor Debugger (XMD) tool:

    % xmd
  3. Download the bitstream inside XMD:

    XMD% fpga -f download.bit
  4. Connect to the processor inside XMD:

    XMD% connect mb mdm
  5. Download the processor code (ELF) file:

    XMD% dow axi_vdma_display.elf
  6. Run the software:

    XMD% run

Results from Running Hardware and Software

The Dell P2210 LCD monitor connected to the KC705 board displays a color bar pattern, and the HyperTerminal screen displays the output shown in Figure 5.

Figure 5. HyperTerminal Output

The user can choose one of the eleven options displayed on the HyperTerminal screen:

  • 0 = Color bars (layer 0)
  • 1 = Zoneplate patterns (layer 1)
  • 2 = Vertical ramp (layer 2)
  • 3 = Horizontal ramp (layer 3)
  • 4 = Tartan bars (layer 4)
  • 5 = Flat red (layer 5)
  • 6 = Flat green (layer 6)
  • 7 = Flat blue (layer 7)
  • 8 = Alpha blend of all layers simultaneously (layers 0–7)
  • 9 = Real-time system performance (one second of transfers)
  • d = Real-time system glass-to-glass delay of one frame


The AXI_MM interconnect is 512 bits running at 200 MHz. The theoretical maximum bandwidth on each channel is 12.8 GB/s.

The DDR3 PHY is set for 64 bits with a memory clock frequency of 800 MHz. The theoretical throughput on DDR3 is 12.8 GB/s, which is the total bandwidth available in the design.
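Both theoretical figures follow directly from the bus width and clock rate; a quick arithmetic check:

```python
# Sanity check of the theoretical bandwidth figures quoted above.

# AXI_MM interconnect: 512-bit data path at 200 MHz.
axi_bw = (512 // 8) * 200e6        # bytes per beat * beats per second

# DDR3 PHY: 64-bit interface, 800 MHz memory clock, two transfers
# per clock (double data rate).
ddr_bw = (64 // 8) * 800e6 * 2

print(axi_bw, ddr_bw)              # both 12.8e9 bytes/sec = 12.8 GB/s
```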

Using option 9 of the software application should show this output (the numbers might vary slightly from the values shown):

---------DDR3, AXI4 Slave Profile Summary........

Theoretical DDR Bandwidth = 12800000000 bytes/sec
Practical DDR bandwidth = 9975795872 bytes/sec

Percentage of DDR Bandwidth consumed by eight Video Pipelines (Approx.)

= 77.9359%

The total bandwidth is approximately 9,975 MB/s out of 12,800 MB/s, which is around 77% of the total theoretical bandwidth of the main memory.

Using option d of the software application should display this output:

Processing Time Per Frame (Glass to Glass delay) = 13.572015 ms

Note: The numbers might vary slightly from the values shown.

Building Hardware

This section covers rebuilding the hardware design.

Before rebuilding the project, the user must ensure that the licenses for AXI OSD and AXI VTC are installed. To obtain evaluation licenses for the AXI VTC or AXI OSD, refer to these websites:

  • Xilinx Video Timing Controller (Reference 11)
  • Xilinx On-Screen Display LogiCORE IP (Reference 12)

Note: The source code in the reference design only applies to the 75 Hz frame rate. The user can change the input frequency of the sixth clock port in the MHS file to 148000000 and generate a bitstream to operate in 60 Hz mode. The generated bitstream is at


  1. Open kc705_video_8x_pipeline/HW/k7_MB_video_pipelines/system.xmp in XPS.
  2. Select Hardware > Generate Bitstream to generate a bitstream for the system.
  3. Select Device Configuration > Update Bitstream to initialize the block RAM with a bootloop program. This ensures that the processor boots up with a stable program in memory.

Compiling Software in SDK

  1. Start SDK. In Linux, type xsdk to start SDK.
  2. In the workspace launcher, select this workspace:

    <unzip dir>/kc705_video_8x_pipeline/SW/SDK_Workspace
  3. Click OK.
  4. Set the repository by selecting Xilinx Tools > Repositories.
  5. For local repositories, click New....
  6. Change directories to <unzip dir>/kc705_video_8x_pipeline/SW/repository.
  7. Click OK.
  8. Import the board support package (BSP), hardware platform, and software applications by selecting File > Import > General > Existing Projects into the workspace.
  9. Click Next, then browse to <unzip dir>/kc705_video_8x_pipeline/SW.
  10. Click OK.
  11. Ensure that all checkboxes are selected (including axi_vdma_display and K7_MB_video_pipelines_hw_platform).
  12. Ensure that the associated software applications are selected.
  13. Click Finish.

The BSP and software applications compile at this step. The process takes 2 to 5 minutes. The user can now modify existing software applications and create new software applications in SDK.

Running the Hardware and Software through SDK

  1. Select Xilinx Tools > Program FPGA.

    Note: Ensure bootloop is used for microblaze_0.
  2. Click Program.
  3. In the Project Explorer window, right click and select axi_vdma_display > Run As > Launch on Hardware.

Design Characteristics

The reference design is implemented in a Kintex-7 FPGA (XC7K325TFFG900-1) using the ISE Design Suite: Embedded Edition 13.4.

The resources used are:

  • Total LUTs used: 97,101 out of 203,800 (47%)
  • Total I/Os used: 163 out of 500 (32%)
  • Total internal memory used:
    • RAMB36E1s: 236 out of 445 (53%)
    • RAMB18E1s: 57 out of 890 (6%)

Note: Device resource utilization results depend on the implementation tool versions. Exact results can vary. These numbers should be used as a guideline.

Reference Design

The reference design has been fully verified and tested in hardware, and the design files include details on the functions of the various modules. The design has been successfully placed and routed with the main AXI interfaces to the memory controller running at 200 MHz, using the ISE Design Suite 13.4.

The reference design files for this application note can be downloaded from:


The reference design matrix is shown in Table 2.

Table 2. Reference Design Matrix

Utilization and Performance

Table 3 shows device and utilization information.

Table 3. Device and Utilization

Device resource utilization is detailed in Table 4 for the IP cores shown in Figure 1. The information in Table 4 is taken from the Design Summary tab in XPS under the Design Overview > Module Level Utilization report selection. The utilization information is approximate due to cross-boundary logic optimizations and logic sharing between modules.

Table 4. Module Level Resource Utilization

Table 4. Module Level Resource Utilization (Cont’d)

Table 5 summarizes the bandwidth calculations for the physical memory interface.

Table 5. DDR3 Memory Physical Interface Maximum Theoretical Bandwidth

Table 6 summarizes the total bandwidth of video data moved through memory.

Table 6. Average Bandwidth Used for Video Traffic

Table 7 summarizes the percentage of the maximum theoretical bandwidth used by the video streams.

Table 7. Percentage of the Maximum Theoretical Bandwidth Used


This application note describes a video system using an AXI Interconnect core configured to operate at a bandwidth of approximately 10 GB/s. Eight video pipelines, each processing a high-definition 1920 x 1080 video stream at 75 frames/sec, are connected to the DDR memory through the AXI Interconnect. To meet the high-performance design requirements, the DDR3 controller (800 MHz memory clock, 64-bit data width) sustains this traffic at approximately eighty percent of its available bandwidth.


This application note uses the following references:

  1. UG810, KC705 Evaluation Board for the Kintex-7 FPGA User Guide
  2. AMBA AXI4 specifications
  3. UG761, AXI Reference Guide
  4. UG683, EDK Concepts, Tools, and Techniques: A Hands-On Guide to Effective Embedded System Design (v13.4)
  5. DS768, LogiCORE IP AXI Interconnect Product Specification (v1.05.a)
  6. DS799, LogiCORE IP AXI Video Direct Memory Access (axi_vdma) Product Specification (v3.01.a)
  7. UG081, MicroBlaze Processor Reference Guide: Embedded Development Kit EDK 13.4
  8. UG586, 7 Series FPGAs Memory Interface Solutions User Guide
  9. DS857, LogiCORE IP Video Timing Controller v3.0 Product Specification
  10. DS837, LogiCORE IP Video On-Screen Display v2.0 Product Specification
  11. Xilinx Video Timing Controller
  12. Xilinx On-Screen Display LogiCORE IP
  13. UG111, Embedded System Tools Reference Manual: EDK v13.4

HDR Sensors for Embedded Vision


By Michael Tusch
Founder and CEO
Apical Limited

At the late-March 2012 Embedded Vision Alliance Summit, Eric Gregori and Shehrzad Qureshi from BDTI presented a helpful overview of CCD and CMOS image sensor technology. I thought it might be interesting to extend this topic to cover so-called HDR (High Dynamic Range) / WDR (Wide Dynamic Range) sensors. HDR and WDR mean the same thing; it's just a matter of how you use each axis of your dynamic range graph. I'll employ the common terminology "HDR" throughout this particular article.

I think that this is an interesting topic because many embedded vision applications require equivalent functionality in all real-scene environments. We know that conventional cameras, even high-end DSLRs, aren’t able to capture as much information in very high contrast scenes as our eyes can discern. This fact explains why we have rules of photography such as “make sure the sun is behind you”. Indeed, conventional image sensors do have problems in such conditions, but the industry has devoted significant work over many years to HDR sensors which extend raw capture capability far beyond what is available in conventional consumer and industrial cameras. The reliability of the image capture component is of course one key element of the overall system performance. 

The dynamic range (DR) of the sensor is the ratio of the brightest pixel intensity to the darkest pixel intensity that the camera can capture within a single frame. This number is often expressed in decibels (dB), i.e.

DR in dB = 20 * log10 (DR)

The human eye does very well and, depending on exactly how the quantity is measured, is typically quoted as being able to resolve around 120-130 dB in daytime conditions.

Image sensors are analog devices that convert pixel intensities to digital values via an analog-digital converter (ADC). The bit depth of the output pixels sets an upper limit on the sensor dynamic range, as shown in Table 1.

Type of sensor                     Maximum intensity levels recorded    Maximum sensor dynamic range (dB)
Very low-cost standard (8-bit)     256                                  48
Average standard (10-bit)          1,024                                60
Higher quality standard (12-bit)   4,096                                72
Table 1. Dynamic range potential of various image sensor types

In reality, the maximum dynamic range is never quite achieved, since in practice the noise level takes up to ~2 bits off the useful pixel range.
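The bit-depth-to-dB relationship underlying Table 1, along with the ~2-bit noise penalty, can be computed directly from the formula above:

```python
import math

def max_dr_db(bits):
    """Maximum dynamic range of an ideal N-bit sensor, in dB."""
    return 20 * math.log10(2 ** bits)

for bits in (8, 10, 12):
    print(f"{bits}-bit: {max_dr_db(bits):.0f} dB max, "
          f"~{max_dr_db(bits - 2):.0f} dB after the ~2-bit noise penalty")
```

Each bit contributes about 6 dB, which is why a 12-bit sensor tops out near the ~72 dB quoted below.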

Standard CMOS and CCD sensors achieve up to ~72 dB dynamic range. This result is sufficient for the great majority of scene conditions. However, some commonly encountered scenes exist which overwhelm such sensors. Well-known examples are backlit conditions (i.e. a subject standing in front of a window), outdoor scenes with deep shadows and sunsets, and nighttime scenes with bright artificial lights (Figure 1).

Figure 1. This backlit scene has a dynamic range of around 80 dB.

Such scenes typically exhibit a dynamic range of around 100 dB and, in rare cases, up to 120 dB (Figure 2). If captured with a conventional sensor, the image either loses detail in shadows or has blown-out (i.e. clipped) highlights.

Figure 2. This high-contrast scene has a dynamic range of around 100 dB.

Numerous attempts have been made to extend standard CMOS and CCD technology, overcoming the limitations of pixel sensitivity and ADC precision, in order to capture such scenes. Pixim developed the first really successful HDR sensor, based on CCD technology, and it was the industry standard for many years. However, the technology, which effectively processes each pixel independently, is relatively costly. More recently, other vendors have concentrated on sensors built from more conventional CMOS technology. Numerous solutions are available; the remainder of this article surveys the main vendors and the techniques they use.

Multi-frame HDR is an HDR method that does not rely on custom CMOS or CCD technology. Operating as a video camera, the sensor is programmed to alternate between a long and a short exposure on a frame-by-frame basis, with successive images blended together by the image sensor processor (ISP) in memory to produce a single HDR image (Figure 3). If the blending algorithm is robust, an exposure ratio of around 16 is comfortably achievable, adding an extra 4 bits to the single-exposure dynamic range. For example, a 12-bit sensor can produce images characteristic of a 16-bit sensor.

Figure 3. Blending together short- and long-exposure versions of a scene creates a multi-frame HDR result.
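A minimal sketch of the blending step, assuming a simple saturation-based weight (the weighting scheme here is purely illustrative; real ISPs use considerably more sophisticated, motion-aware blending):

```python
import numpy as np

def blend_hdr(long_exp, short_exp, ratio=16, sat=0.95):
    """Blend a long and a short exposure into one linear HDR frame.

    long_exp, short_exp: float arrays normalized to [0, 1].
    ratio: exposure ratio between the two frames (16 adds ~4 bits).
    sat: level above which the long exposure is considered clipped.
    """
    # Bring the short exposure onto the long exposure's intensity scale.
    short_scaled = short_exp * ratio
    # Trust the long exposure until it nears saturation, then fade
    # over to the (noisier but unclipped) short exposure.
    w = np.clip((sat - long_exp) / 0.1, 0.0, 1.0)
    return w * long_exp + (1.0 - w) * short_scaled
```

Mid-tones come entirely from the long exposure (better signal-to-noise); only near-clipped regions fall back to the rescaled short exposure, which is where the ghosting discussed next originates.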

As with all HDR technologies, there is a catch. In this particular case, it is the potential generation of motion artifacts, most noticeable as "ghosting" along the edges of objects that have moved between the two frames. Such artifacts are very expensive to eliminate even partially, although specific processing in the ISP can significantly suppress their appearance. Further, the effective frame rate is reduced. If the input frame rate is 60 fps, the output can remain at 60 fps, but highlights and shadows will exhibit an effective frame rate closer to 30 fps, and mid-tones will be somewhere between 30 and 60 fps depending on how clever the blending algorithm is.

The Altasens A3372 12-bit CMOS sensor uses a “checkerboard” pixel structure, wherein alternating Bayer RGGB pixel quad clusters are set to long- and short-exposure configurations (Figure 4). In HDR scenes, the long-exposure pixels capture dark information, while short-exposure pixels handle bright details.

Figure 4. The Altasens A3372 checkerboard array devotes alternating quad-pixel clusters to capturing dark and light scene details.

Long exposure delivers improved signal-to-noise but results in the saturation of pixels corresponding to bright details; the short exposure pixels conversely capture the bright details properly. Dynamic range reaches ~100 dB. The cost of HDR in this case is in the heavy processing required to convert the checkerboard pattern to a normal linear Bayer pattern. This reconstruction requires complex interpolation because, for example, in highlight regions of an HDR image, half of the pixels are missing (clipped). An algorithm must estimate these missing values. While such interpolation can be done with remarkable effectiveness, some impact on effective resolution inevitably remains. However, this tradeoff is rather well controlled, since the sensor only needs to employ the dual-exposure mode when the scene demands it; the A3372 reverts to non-HDR mode when it's possible to capture the scene via the standard 12-bit single-exposure model.
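The reconstruction step can be illustrated with a toy single-channel example, replacing clipped long-exposure pixels with the average of their short-exposure neighbors (the real sensor operates on Bayer quad clusters with far more elaborate interpolation; the ratio and layout here are illustrative):

```python
import numpy as np

def reconstruct_checkerboard(img, long_mask, ratio=8, sat=0.99):
    """Toy reconstruction of a checkerboard dual-exposure frame.

    img: 2-D array of raw pixel values in [0, 1].
    long_mask: True where the pixel used the long exposure.
    ratio: long/short exposure ratio.
    Returns a frame on a common linear scale, with clipped
    long-exposure pixels estimated from short-exposure neighbors.
    """
    out = np.where(long_mask, img, img * ratio)  # common intensity scale
    clipped = long_mask & (img >= sat)
    for y, x in zip(*np.nonzero(clipped)):
        # average the 4-connected short-exposure neighbors
        vals = []
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < img.shape[0] and 0 <= nx < img.shape[1] \
                    and not long_mask[ny, nx]:
                vals.append(img[ny, nx] * ratio)
        if vals:
            out[y, x] = sum(vals) / len(vals)
    return out
```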

A very different HDR method is the so-called “companding” technique employed by sensors such as Aptina's MT9M034 and AR0330, along with alternatives from other vendors. Such sensors use line buffers to accumulate multiple exposures (up to 4, in some cases) line by line. The output pixels retain a 12-bit depth, set by the ADC precision, but those 12 bits pack in up to 20 or more effective bits of linear intensity data. Companding is conceptually similar to the way gamma correction is used to encode 2 bits of additional data in a color space such as sRGB. Inverting this non-linear encoding recovers an HDR Bayer image.
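Decompanding is essentially the inverse of a piecewise-linear transfer curve. A hypothetical 3-segment example (the knee points and step sizes below are invented for illustration; real sensors document their own curves in their datasheets):

```python
# Hypothetical 3-segment companding curve: 12-bit codes -> ~20-bit linear.
# Each segment maps a range of output codes back to linear intensities
# with a progressively coarser step size.
SEGMENTS = [
    # (first_code, last_code, linear_start, step)
    (0,    2048, 0,     1),     # 1:1 in the shadows
    (2048, 3072, 2048,  16),    # 16 linear counts per code
    (3072, 4096, 18432, 1024),  # 1024 linear counts per code
]

def decompand(code):
    """Map a 12-bit companded code back to a linear intensity."""
    for first, last, lin_start, step in SEGMENTS:
        if first <= code < last:
            return lin_start + (code - first) * step
    raise ValueError("code out of 12-bit range")

# Full-scale code 4095 reaches a linear value of ~2^20,
# i.e. ~20 effective bits packed into 12 output bits:
print(decompand(4095))
```

The sensitivity of this inversion to the exact knee positions is why calibration errors show up as noise bands at specific intensity levels.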

This method produces the highest dynamic ranges; one vendor claims 160 dB. But it again comes with associated costs. First, the data inversion relies on very accurate and stable knowledge of where the various exposures begin and end. In practice, imperfections lead to noise at specific intensity levels that can be hard to eliminate. Second, the sequential exposures in time create the motion artifacts discussed earlier. These can be suppressed but are difficult to remove. Standard techniques for avoiding flicker (the "beating" of the exposure with the 50 Hz or 60 Hz flicker of indoor lighting) also don't work when more than one exposure time exists.

Yet another HDR sensor implementation is the dual-pixel structure employed by Omnivision in sensors such as the OV10630. It consists of a non-Bayer array of pixels made up of two physically different types: a “dark” pixel and a “bright” pixel, which can be of different sizes. The dark pixels are more sensitive to light and therefore handle dark areas well, with good signal-to-noise. Conversely, the bright pixels are less light sensitive and therefore don't saturate as readily in bright regions. In principle, the dual-pixel approach is a very "clean" HDR technology. It avoids motion artifacts and requires no complex non-linear processing. Penalties include the fact that two pixels are blended into one, so the effective resolution is half of the actual resolution. The dual-pixel structure is also more costly on a per-pixel basis, and the output raw pixel pattern cannot be processed by standard ISPs.
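Combining one dark/bright pixel pair into a single output value can be sketched very simply (the sensitivity ratio and saturation threshold here are hypothetical; the vendor's actual combination logic is proprietary):

```python
def combine_pair(dark, bright, sensitivity_ratio=8, sat=0.98):
    """Combine one dark/bright pixel pair into a single HDR value.

    dark: reading from the more light-sensitive pixel, in [0, 1].
    bright: reading from the less sensitive pixel, in [0, 1].
    sensitivity_ratio: how much less sensitive the bright pixel is.
    """
    if dark < sat:
        return dark                    # dark pixel unclipped: best SNR
    return bright * sensitivity_ratio  # rescale bright pixel to match
```

Because two physical pixels collapse into one output value, the halving of effective resolution mentioned above is inherent to the scheme.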

More generally, each of the sensor types discussed in this article requires a different image-processing pipeline to convert its captured images into a standard output type. This fact means that it is not typically possible to directly connect an HDR sensor to a standard camera DSP and obtain an HDR result. Figure 5 below shows the pipelines for Bayer-domain processing of the multi-frame, Altasens and Aptina-type HDR sensor raw inputs. Standard processing is possible subsequent to color interpolation.

Figure 5. The image processing flow varies depending on what type of HDR sensor is being employed.

Obtaining genuine HDR imagery is also not just a matter of leveraging an HDR sensor coupled with an HDR ISP. For scenes with dynamic range beyond 100 dB, optics also plays a central role. Unless the lens is of sufficient quality and the optical system has the necessary internal anti-reflection coatings to prevent back-reflection from the sensor, it is impossible to avoid flare and glare in many HDR scenes, creating artifacts that effectively negate much of the sensor's capture capabilities. To put it simply, building an HDR camera suitable for the full range of scene conditions is not inexpensive.

In conclusion, a variety of sensor and ISP technologies exist for capturing and processing HDR imagery. They all involve some kind of image quality trade-off in exchange for the extended dynamic range, either in resolution or in time. It is worth remembering that although the technology may be elaborate, the purpose is simply to extend effective pixel bit depth and reduce noise. To see this, compare the images shown in Figure 6.

Figure 6. A comparison of two images reveals HDR shadow strengths.

The upper image was captured using a 12-bit CMOS sensor in normal mode. The image below it harnesses the exact same sensor but employs the multi-exposure mode discussed earlier. The effect of the HDR mode closely resembles that of noise reduction. In the first image, strong local tone mapping is used to increase the digital gain so that shadows are visible, while exposure is kept low enough to avoid highlight clipping. This technique in effect captures the window area at ISO 100 and the shadow area at ISO 3200, and it does not require any non-standard capture technology. The HDR image, conversely, obtains the same exposure values for shadows and highlights, but this time by varying the exposure times, leading to greater sensitivity and lower noise in the shadow region.

High-performance temporal and spatial noise reduction technology can extend dynamic range by up to ~12 dB. And high-performance dynamic range compression technology can map input dynamic range to a standard output without loss of information. So a standard 12-bit CMOS sensor with good NR can achieve around 84 dB, which is “pretty good HDR”, while a 14-bit CMOS sensor with good NR can achieve nearly 100 dB, which is “mainstream HDR”. However, specific HDR sensors are required for truly high dynamic range scenes.
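These closing figures are consistent with ~6 dB per bit: 12 dB of noise-reduction gain is worth roughly two extra bits of effective pixel depth:

```python
import math

db_per_bit = 20 * math.log10(2)   # ~6.02 dB per bit of pixel depth
nr_gain_db = 12                   # NR extension claimed above (~2 bits)

print(round(12 * db_per_bit + nr_gain_db))   # 84 dB: "pretty good HDR"
print(round(14 * db_per_bit + nr_gain_db))   # 96 dB: "mainstream HDR"
```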