
Computer Vision Metrics: Chapter Seven (Part B)


For Part A of Chapter Seven, please click here.

Bibliography references are set off with brackets, i.e. "[XXX]". For the corresponding bibliography entries, please click here.

Scene Composition and Labeling

Ground truth data is composed of labeled features such as foreground, background, and objects or features to recognize. The labels define exactly what features are present in the images, and these labels may be a combination of on-screen labels, associated label files, or databases. Sometimes a randomly composed scene from the wild is preferred as ground truth data, and then only the required items in the scene are labeled. Other times, ground truth data is scripted and composed the way a scene for a movie would be.

In any case, the appropriate objects and actors in the scene must be labeled, and in some applications the position of each must also be known and recorded. A database or file containing the labels must therefore be created and associated with each ground truth image to allow for testing. See Figure 7-4, which shows annotated or labeled ground truth dataset images for a scene analysis of cuboids [62]. See also the LabelMe database described in Appendix B, which allows contributors to provide labeled databases.
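To make the label-file idea concrete, here is a minimal sketch of one way a per-image annotation record might be stored and queried. The record layout (field names such as "image", "objects", "label", and "polygon") is a hypothetical example in the spirit of LabelMe-style polygon annotations, not the format used by any particular dataset:

```python
import json

# Hypothetical minimal ground-truth record: one record per image, where each
# labeled object carries a class name and a polygon ((x, y) pixel coordinates)
# outlining it in the image.
annotation = {
    "image": "scene_0001.png",  # ground-truth image this record describes
    "objects": [
        {"label": "cuboid",
         "polygon": [(120, 80), (220, 80), (220, 180), (120, 180)]},
        {"label": "background",
         "polygon": [(0, 0), (640, 0), (640, 480), (0, 480)]},
    ],
}

def labels_in(record):
    """Return the set of object labels present in one annotation record."""
    return {obj["label"] for obj in record["objects"]}

# Serialize to an associated label file (one per ground-truth image),
# then read it back and list the labeled features it contains.
label_file = json.dumps(annotation)
print(sorted(labels_in(json.loads(label_file))))
```

Storing one such record per ground-truth image keeps the label data separate from the pixels, so a test harness can iterate over the records to check which features each image is supposed to contain.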

Figure 7-4. Annotated or labeled ground-truth dataset images for scene analysis of cuboids (left and center). The labels are annotated manually into the ground-truth dataset, in yellow (light gray in B&W version), marking the cuboid edges and corners. (Right) The ground-truth data contains pre-computed 3D corner HOG descriptor sets, which are matched against live detected cuboid HOG feature sets; successful matches are shown in green (dark gray in B&W version). (Images used by permission © Bryan Russel, Jianxiong Xiao, and Antonio Torralba)


Establishing the right set of ground truth data is like assembling a composition; several variables are involved, including:

  • Scene Content: Designing the visual content, including fixed objects (those that do not move), dynamic objects (those that enter and leave the scene), and dynamic variables (such as position and movement of objects in the scene).
  • Lighting: Casting appropriate lighting onto the scene.
  • Distance: Setting and labeling the correct distance for each object to get the pixel resolution needed—too far away means not enough pixels...