Chapter 16. Tuning the Pipeline


Providing code fragments and examples as appropriate, this chapter presents a variety of techniques for optimizing the different parts of the pipeline. The following topics are covered:

CPU Tuning: Basics

The first stage of the rendering pipeline is the traversal of the data and sending the current rendering data to the rest of the pipeline. In theory, the entire rendering database (scene graph) must be traversed in some fashion for each frame because both scene content and viewer position can be dynamic.

To get the best possible CPU performance, use the following overall guidelines:

Compile your application for optimum speed.

Compile all object files with at least -O2. Note that the compiler option for debugging, -g, turns off all optimization. If you must run the debugger on optimized code, you can use -g3 with -O2 with limited success. If you are not compiling with -xansi (the default) or -ansi, you may need to include -float for faster floating-point operations.

On certain platforms, other compile-time options (such as -mips3 or -mips4) are available.
On IRIX systems, always compile for the n32 ABI, instead of the obsolete o32 ABI. n32 is now the default for the IRIX compilers.
Use a simple data structure and a fast traversal method.

The CPU tuning strategy focuses on developing fast database traversal for drawing with a simple, easily accessed data structure. The fastest rendering is achieved with an inner loop that traverses a completely flattened (non-hierarchical) database. Most applications cannot achieve this level of simplicity for a variety of reasons. For example, some databases occupy too much memory when completely flattened. Note also that you run a greater risk of cache misses if you flatten the data.

When an application is CPU-limited, the entire graphics pipeline may be sitting idle for periods of time. The following sections describe techniques for structuring application code so that the CPU does not become the bottleneck.

Immediate Mode Drawing Versus Display Lists and Vertex Buffer Objects

When deciding whether you want to use display list or immediate mode drawing, consider the amount of work you do in constructing your databases and using them for purposes other than graphics. The following are three cases to consider:

If you create models that never change and are used only for drawing, then OpenGL display lists or vertex buffer objects are the right representation.

Display lists can be optimized in hardware-specific ways, loaded into dedicated display list storage in the graphics subsystem, downloaded to on-board dlist RAM, and so on. See “CPU Tuning: Display Lists” for more information on display lists.
If you create models that are subject to infrequent change but are rarely used for any purpose other than drawing, then vertex buffer objects or vertex arrays are the right representation.

Vertex arrays are relatively compact and have modest impact on the cache. Software renderers can process the vertices in batches; hardware renderers can process a few triangles at a time to maximize parallelism. As long as the vertex arrays can be retained from frame to frame so that you do not incur a lot of latency by building them afresh each frame, they are the best solution for this case. See “Using Vertex Arrays” for more information.
If you create very dynamic models or if you use the data for heavy computations unrelated to graphics, then the glVertex()-style interface (immediate mode drawing) is the best choice.

Immediate mode drawing allows you to do the following:
- Maximize parallelism for hardware renderers
- Optimize your database for the other computations you need to perform
- Reduce cache thrashing
Overall, this results in higher performance than forcing the application to use a graphics-oriented data structure like a vertex array. Use immediate mode drawing for large databases (which might have to be paged into main memory) and for dynamic databases; for example, for morphing operations where the number of vertices is subject to change, or for progressive refinement. See “CPU Tuning: Immediate Mode Drawing” for tuning information.

If you are still not sure whether to choose display lists or immediate mode drawing, consider the following advantages and disadvantages of display lists.

Display lists have the following advantages:

You do not have to optimize traversal of the data yourself; display list traversal is well-tuned and more efficient than user programs.
Display lists manage their own data storage. This is particularly useful for algorithmically generated objects.
Display lists are significantly better for remote graphics over a network. The display list can be cached on the remote CPU so that the data for the display list does not have to be re-sent every frame. Furthermore, the remote CPU handles much of the responsibility for traversal.
Display lists are preferable for direct rendering if they contain enough primitives (a total of about 10) because display lists are stored efficiently. If the lists are short, the setup performance cost is not offset by the more efficient storage or saving in CPU time.

Display lists do have the following drawbacks that may affect some applications:

The most troublesome drawback of display lists is data expansion. To achieve fast, simple traversal on all systems, all data is copied directly into the display list. Therefore, the display list contains an entire copy of all application data plus additional overhead for each command. If the application has no need for the data other than drawing, it can release the storage for its copy of the data and the penalty is negligible.
If vertices are shared in structures more complex than the OpenGL primitives (line strip, triangle strip, triangle fan, and quad strip), they are stored more than once.
If the database becomes sufficiently large, paging eventually hinders performance. Therefore, when contemplating the use of OpenGL display lists for really large databases, consider the amount of main memory.
The compile time for display lists may be significant.

CPU Tuning: Display Lists

In display list mode, pieces of the database are compiled into static chunks that can then be sent to the graphics pipeline. In this case, the display list is a separate copy of the database that can be stored in main memory in a form optimized for feeding the rest of the pipeline.

For example, suppose you want to apply a transformation to some geometric objects and then draw the result. If the geometric objects are to be transformed in the same way each time, it is better to store the matrix in the display list. The database traversal task is to hand the correct chunks to the graphics pipeline. Display lists can be recreated easily with some additional performance cost.

Tuning for display lists focuses mainly on reducing storage requirements. Performance improves if the data fit in the cache because this avoids cache misses when the data is traversed again.

Follow these rules to optimize display lists:

If possible, compile and execute a display list in two steps instead of using GL_COMPILE_AND_EXECUTE.
Call glDeleteLists() to delete display lists that are no longer needed.

This frees storage space used by the deleted display lists and expedites the creation of new display lists.
Avoid the duplication of display lists.

For example, if you have a scene with 100 spheres of different sizes and materials, generate one display list that is a unit sphere centered about the origin. Then, for each sphere in the scene, follow these steps:
1. Set the material for the current sphere.
2. Issue the necessary scaling and translation commands for sizing and positioning the sphere. Watch for the scaling of normals.
3. Invoke glCallList() to draw the unit sphere display list.
In this way, a reference to the unit sphere display list is stored instead of all of the sphere vertices for each instance of the sphere.
Make the display list as flat as possible, but be sure not to exceed the cache size.

Avoid using an excessive hierarchy with many invocations of glCallList(). Each glCallList() invocation results in a lookup operation to find the designated display list. A flat display list requires less memory and yields simpler and faster traversal. It also improves cache coherency.

Display lists are best used for static objects. Do not put dynamic data or operations in display lists. Instead, use a mixture of display lists for static objects and immediate mode for dynamic operations.

Note: See Chapter 18, “System-Specific Tuning”, for potential display list optimizations on the system you are using.

CPU Tuning: Immediate Mode Drawing

Immediate mode drawing means that OpenGL commands are executed when they are called rather than from a display list. This style of drawing provides flexibility and control over both storage management and drawing traversal. The trade-off for the extra control is that you have to write your own optimized subroutines for data traversal. Tuning, therefore, has two parts: optimizing the data organization and optimizing the database rendering code.

While you may not use each technique in this section, minimize the CPU work done at the per-vertex level and use a simple data structure for rendering traversal.

There is no recipe for writing a peak-performance immediate mode renderer for a specific application. To predict the CPU limitation of your traversal, design potential data structures and traversal loops and write small benchmarks that mimic the memory demands you expect. Experiment with optimizations and benchmark the effects. Experimenting on small examples can save time in the actual implementation.

Optimizing the Data Organization

It is common for scenes to have hierarchical definitions. Scene management techniques may rely on specific hierarchical information. However, a hierarchical organization of the data raises the following performance concerns and should be used with care:

The time spent traversing pointers to different sections of a hierarchy can create a CPU bottleneck.

This is partly because of the number of extra instructions executed, but it is also a result of the inefficient use of cache and memory. Overhead data not needed for rendering is brought through the cache and can push out needed data, to cause subsequent cache misses.
Traversing hierarchical structures can cause excessive memory paging.

Hierarchical structures can be distributed throughout memory. It is difficult to be sure of the exact amount of data you are accessing and of its exact location; therefore, traversing hierarchical structures can access a costly number of pages.
Complex operations may need access to both the geometric data and other scene information, complicating the data structure.
Caching behavior is often difficult to predict for dynamic hierarchical data structures.

The following are rules for optimizing data organization:

In general, store the geometry data used for rendering in static, contiguous buffers rather than in the hierarchical data structures.
Do not interlace data used to render frames and infrequently used data in memory. Instead, include a pointer to the infrequently used data and store the data itself elsewhere.
Flatten your rendering data (minimize the number of levels in the hierarchy) as much as cache and memory considerations and your application constraints permit.

The appropriate amount of flattening depends on the system on which your application will run.
Balance the data hierarchy. This makes application culling (the process of eliminating objects that do not fall within the viewing frustum) more efficient and effective.
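
The first and third rules above can be combined into a one-time flattening pass that runs at initialization. The following is a minimal sketch under stated assumptions, not a definitive implementation: the Node type and the function names are hypothetical, and each node is assumed to own a small block of floats that is already in drawing order.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical scene node: a block of drawing floats plus child links. */
typedef struct Node {
    const float  *verts;      /* this node's geometry, nfloats values */
    size_t        nfloats;
    struct Node **children;   /* NULL when the node is a leaf */
    size_t        nchildren;
} Node;

/* Total number of floats stored in the subtree rooted at n. */
static size_t count_floats(const Node *n)
{
    size_t total = n->nfloats;
    for (size_t i = 0; i < n->nchildren; i++)
        total += count_floats(n->children[i]);
    return total;
}

/* Copy the subtree's geometry into dst in depth-first order.
 * Returns the next free slot in dst. */
static float *copy_floats(const Node *n, float *dst)
{
    if (n->nfloats > 0)
        memcpy(dst, n->verts, n->nfloats * sizeof(float));
    dst += n->nfloats;
    for (size_t i = 0; i < n->nchildren; i++)
        dst = copy_floats(n->children[i], dst);
    return dst;
}

/* Build one contiguous rendering buffer; the caller frees it.
 * Returns NULL if the allocation fails. */
static float *flatten_scene(const Node *root, size_t *out_nfloats)
{
    size_t total = count_floats(root);
    float *flat = malloc((total ? total : 1) * sizeof(float));
    if (flat != NULL) {
        copy_floats(root, flat);
        *out_nfloats = total;
    }
    return flat;
}
```

The render loop can then walk the returned buffer with a single pointer, while the hierarchy stays available for culling and other scene management.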

Optimizing Database Rendering Code

This section includes some suggestions for writing peak-performance code for inner rendering loops.

During rendering, an application ideally spends most of its time traversing the database and sending data to the graphics pipeline. Hot spots are instructions in the display loop that are executed many times every frame. Any extra overhead in a hot spot is greatly magnified by the number of times it is executed.

When using simple, high-performance graphics primitives, the application is even more likely to be CPU-limited. The data traversal must be optimized so that it does not become a bottleneck.

During rendering, the sections of code that actually issue graphics commands should be the hot spots in application code. These subroutines should use peak-performance coding methods. Small improvements to a line that is executed for every vertex in a database accumulate to have a noticeable effect when the entire frame is rendered.

The rest of this section looks at examples and techniques for optimizing immediate mode rendering:

Examples for Optimizing Data Structures for Drawing

Follow these suggestions for optimizing how your application accesses data:

One-Dimensional Arrays. Use one-dimensional arrays traversed with a pointer that always holds the address for the current drawing command. Avoid array-element addressing or multidimensional array accesses.
Bad		`glVertex3fv(&data[i][j][k]);`
Good		`glVertex3fv(dataptr);`
Adjacent structures. Keep all static drawing data for a given object together in a single contiguous array traversed with a single pointer. Keep this data separate from other program data, such as pointers to drawing data or interpreter flags.

Flat structures. Use flat data structures and do not use multiple-pointer indirection when rendering, as shown in the following:

Bad		`glVertex3fv(object->data->vert);`
OK		`glVertex3fv(dataptr->vert);`
Good		`glVertex3fv(dataptr);`

The following code fragment is an example of efficient code to draw a single smooth-shaded, lit polygon. Notice that a single data pointer is used. It is updated once at the end of the polygon after the glEnd() call.

glBegin(GL_QUADS);
glNormal3fv(ptr);
glVertex3fv(ptr+3);
glNormal3fv(ptr+6);
glVertex3fv(ptr+9);
glNormal3fv(ptr+12);
glVertex3fv(ptr+15);
glNormal3fv(ptr+18);
glVertex3fv(ptr+21);
glEnd();
ptr += 24;
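
The fragment above assumes that each quad's normals and positions are already interleaved in one array, six floats per vertex. If a model starts out as two separate arrays, a one-time packing step can produce that layout. This is a sketch only; the function name and the FLOATS_PER_VERTEX constant are illustrative and not part of OpenGL.

```c
#include <stddef.h>

enum { FLOATS_PER_VERTEX = 6 };   /* 3 normal floats followed by 3 position floats */

/*
 * Interleave separate normal and position arrays (3 floats per vertex each)
 * into one buffer laid out as normal, position, normal, position, ...
 * `out` must have room for nverts * FLOATS_PER_VERTEX floats.
 * This runs once at load time, so the short inner loop is not a hot spot.
 */
static void pack_normals_and_positions(const float *normals,
                                       const float *positions,
                                       size_t nverts, float *out)
{
    for (size_t v = 0; v < nverts; v++) {
        for (int k = 0; k < 3; k++) {
            out[v * FLOATS_PER_VERTEX + k]     = normals[v * 3 + k];
            out[v * FLOATS_PER_VERTEX + 3 + k] = positions[v * 3 + k];
        }
    }
}
```

With four vertices packed this way, the normals land at offsets 0, 6, 12, and 18 and the positions at offsets 3, 9, 15, and 21, which are the offsets the drawing fragment uses.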

Examples for Optimizing Program Structure

The following are areas for optimizing your program structure:

Loop unrolling (1). Avoid short, fixed-length loops especially around vertices. Instead, unroll these loops:

Bad		`for(i=0; i < 4; i++){ glColor4ubv(poly_colors[i]); glVertex3fv(poly_vert_ptr[i]); }`
Good		`glColor4ubv(poly_colors[0]); glVertex3fv(poly_vert_ptr[0]); glColor4ubv(poly_colors[1]); glVertex3fv(poly_vert_ptr[1]); glColor4ubv(poly_colors[2]); glVertex3fv(poly_vert_ptr[2]); glColor4ubv(poly_colors[3]); glVertex3fv(poly_vert_ptr[3]);`

Loop unrolling (2). Minimize the work done in a loop to maintain and update variables and pointers. Unrolling can often assist in this:

Bad

glNormal3fv(*(ptr++));
glVertex3fv(*(ptr++));

or

glNormal3fv(ptr); ptr += 4;
glVertex3fv(ptr); ptr += 4;

Good

glNormal3fv(*(ptr));
glVertex3fv(*(ptr+1));
glNormal3fv(*(ptr+2));
glVertex3fv(*(ptr+3));

or

glNormal3fv(ptr);
glVertex3fv(ptr+4);
glNormal3fv(ptr+8);
glVertex3fv(ptr+12);

Note: On current MIPS processors, loop unrolling may hurt performance more than it helps; so, use it with caution. In fact, unrolling too far hurts on any processor because the loop may use an excessive portion of the cache. If it uses a large enough portion of the cache, it may interfere with itself; that is, the whole loop will not fit (not likely) or it may conflict with the instructions of one of the subroutines it calls.

Loops accessing buffers. Minimize the number of different buffers accessed in a loop:

Bad
glNormal3fv(normaldata); glTexCoord2fv(texdata); glVertex3fv(vertdata);

Good
glNormal3fv(dataptr); glTexCoord2fv(dataptr+3); glVertex3fv(dataptr+5);
Loop end conditions. Make end conditions on loops as trivial as possible; for example, compare the loop variable to a constant, preferably zero. Decrementing loops are often more efficient than their incrementing counterparts:

Bad
for (i = 0; i < (end-beginning)/size; i++) {...}
Better
for (i = beginning; i < end; i += size) {...}
Good
for (i = total; i > 0; i--) {...}
Conditional statements.
- Use switch statements instead of multiple if-else-if control structures.
- Avoid if tests around vertices; use duplicate code instead.
Subroutine prototyping. Prototype subroutines in ANSI C style to avoid run-time typecasting of parameters:
void drawit(float f, int count) { }

Multiple primitives. Send multiple primitives between glBegin()/glEnd() pairs whenever possible:

glBegin(GL_TRIANGLES);
....
..../* many triangles */
....
glEnd();

glBegin(GL_QUADS);
....
..../* many quads */
....
glEnd();

Using Specialized Drawing Subroutines and Macros

This section describes several ways to improve performance by making appropriate choices about display modes, geometry, and so on.

Make decisions about which geometry to display and which modes to use at the highest possible level in the program organization.

The drawing subroutines should be highly specialized leaves in the program's call tree. Decisions made too far down the tree can be redundant. For example, consider a program that switches back and forth between flat-shaded and smooth-shaded drawing. Once this choice has been made for a frame, the decision is fixed and the flag is set. For example, the following code is inefficient:

/* Inefficient way to toggle modes */
void draw_object(float *data, int npolys, int smooth)
{
    int i;
    glBegin(GL_QUADS);
    for (i = npolys; i > 0; i--) {
        if (smooth) glColor3fv(data);
        glVertex3fv(data + 4);
        if (smooth) glColor3fv(data + 8);
        glVertex3fv(data + 12);
        if (smooth) glColor3fv(data + 16);
        glVertex3fv(data + 20);
        if (smooth) glColor3fv(data + 24);
        glVertex3fv(data + 28);
        data += 32;    /* advance to the next quad */
    }
    glEnd();
}

Even though the program chooses the drawing mode before entering the draw_object() routine, the flag is checked for every vertex in the scene. A simple if test may seem innocuous; however, when done on a per-vertex basis, it can accumulate a noticeable amount of overhead.

Compare the number of instructions in the disassembled code for a call to glColor3fv() first without and then with the if test.

Assembly code for a call without the if test (six instructions):

lw a0,32(sp)
lw t9,glColor3fv
addiu a0,a0,32
jalr ra,t9
nop
lw gp,24(sp)

Assembly code for a call with an if test (eight instructions):

lw t7,40(sp)
beql t7,zero,0x78
nop
lw t9,glColor3fv
lw a0,32(sp)
jalr ra,t9
addiu a0,a0,32
lw gp,24(sp)

Notice the two extra instructions required to implement the if test. The extra if test per vertex increases the number of instructions executed for this otherwise optimal code by 33%. These effects may not be visible if the code is used only to render objects that are always graphics-limited. However, if the process is CPU-limited, then moving decision operations such as this if test higher up in the program structure improves performance.
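
One way to move the test higher up is to write one specialized routine per mode and choose between them once, before the vertex loop is entered. The sketch below keeps the 32-float-per-quad layout of draw_object(). It is an illustration under stated assumptions: emit_color() and emit_vertex() are hypothetical stand-ins for glColor3fv() and glVertex3fv() that only count calls so the example runs without an OpenGL context, and the glBegin()/glEnd() pair is omitted for the same reason.

```c
#include <stddef.h>

/* Hypothetical stand-ins for glColor3fv()/glVertex3fv(): they only count calls. */
static size_t color_calls, vertex_calls;
static void emit_color(const float *c)  { (void)c; color_calls++; }
static void emit_vertex(const float *v) { (void)v; vertex_calls++; }

/* Specialized leaf: one color per vertex, no mode test inside the loop. */
static void draw_quads_smooth(const float *data, int npolys)
{
    for (int i = npolys; i > 0; i--) {
        emit_color(data);       emit_vertex(data + 4);
        emit_color(data + 8);   emit_vertex(data + 12);
        emit_color(data + 16);  emit_vertex(data + 20);
        emit_color(data + 24);  emit_vertex(data + 28);
        data += 32;
    }
}

/* Specialized leaf: positions only, same 32-float layout per quad. */
static void draw_quads_flat(const float *data, int npolys)
{
    for (int i = npolys; i > 0; i--) {
        emit_vertex(data + 4);
        emit_vertex(data + 12);
        emit_vertex(data + 20);
        emit_vertex(data + 28);
        data += 32;
    }
}

typedef void (*DrawQuadsFn)(const float *data, int npolys);

/* The mode is tested once per frame or per object, not once per vertex. */
static DrawQuadsFn choose_draw_routine(int smooth)
{
    return smooth ? draw_quads_smooth : draw_quads_flat;
}
```

A caller would fetch the routine once, for example `DrawQuadsFn draw = choose_draw_routine(smooth);`, and then call `draw(data, npolys)` for each object in the frame.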

Preprocessing Drawing Data (Meshes and Vertex Loops)

Putting some extra effort into generating a simpler database makes a significant difference when traversing that data for display. A common tendency is to leave the data in a format that is good for loading or generating the object, but not optimal for actually displaying it. For peak performance, do as much of the work as possible before rendering.

Preprocessing turns a difficult database into a database that is easy to render quickly. This is typically done at initialization or when changing from a modeling to a fast-rendering mode. This section describes preprocessing meshes and vertex loops to illustrate this point.

Preprocessing Meshes Into Fixed-Length Strips

Preprocessing can be used to turn general meshes into fixed-length strips.

The following sample code shows a commonly used, but inefficient, way to write a triangle strip render loop:

float* dataptr;
...
while (!done) switch(*dataptr) {
    case BEGINSTRIP:
        glBegin(GL_TRIANGLE_STRIP);
        dataptr++;
        break;
    case ENDSTRIP:
        glEnd();
        dataptr++;
        break;
    case EXIT:
        done = 1;
        break;
    default: /* have a vertex !!! */
        glNormal3fv(dataptr);
        glVertex3fv(dataptr + 4);
        dataptr += 8;
}

This traversal method incurs a significant amount of per-vertex overhead. The loop is evaluated for every vertex and every vertex must also be checked to make sure that it is not a flag. These checks waste time and also bring all of the object data through the cache, reducing the performance advantage of triangle strips. Any variation of this code that has per-vertex overhead is likely to be CPU-limited for most types of simple graphics operations.

Preprocessing Vertex Loops

Preprocessing is also possible for vertex loops, as shown in the following:

glBegin(GL_TRIANGLE_STRIP);
for (i=num_verts; i > 0; i--) {
    glNormal3fv(dataptr);
    glVertex3fv(dataptr+4);
    dataptr += 8;
}
glEnd();

For peak immediate mode performance, precompile strips into specialized primitives of fixed length. Only a few fixed lengths are needed. For example, use strips that consist of 12, 8, and 2 primitives.

Note: The optimal strip length may vary depending on the hardware platform. For more information, see Chapter 18, “System-Specific Tuning”.

The specialized strips are sorted by size, resulting in the efficient loop shown in this sample code:

/* dump out N 8-triangle strips */
for (i=N; i > 0; i--) {
    glBegin(GL_TRIANGLE_STRIP);
    glNormal3fv(dataptr);
    glVertex3fv(dataptr+4);
    glNormal3fv(dataptr+8);
    glVertex3fv(dataptr+12);
    glNormal3fv(dataptr+16);
    glVertex3fv(dataptr+20);
    glNormal3fv(dataptr+24);
    glVertex3fv(dataptr+28);
    ...
    glEnd();
    dataptr += 64;
}

A mesh of length 12 is about the maximum for unrolling. Unrolling helps to reduce the per-loop overhead but, after a point, it produces no further gain.

Over-unrolling eventually hurts performance by increasing code size and reducing effectiveness of the instruction cache. The degree of unrolling depends on the processor; run some benchmarks to understand the optimal program structure on your system.
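
The preprocessing step that feeds such fixed-length loops has to decide how to cut each general strip. The following sketch plans that cut greedily, using the example lengths of 12, 8, and 2 from the text. It rests on assumptions: the lengths are interpreted as triangles per strip, any triangle that fits none of the lengths is reported as a leftover to be drawn separately, and the type and function names are hypothetical.

```c
/* Fixed strip lengths, in triangles, taken from the example in the text. */
enum { NUM_SIZES = 3 };
static const int strip_sizes[NUM_SIZES] = { 12, 8, 2 };

typedef struct {
    int count[NUM_SIZES]; /* number of strips to emit at each fixed length */
    int leftover;         /* triangles that fit no fixed length (assumption:
                             the application draws these separately) */
} StripPlan;

/* Greedy plan: take as many of the longest strips as fit, then the next
 * length, and so on. */
static StripPlan plan_strips(int triangles)
{
    StripPlan plan = { {0, 0, 0}, 0 };
    for (int s = 0; s < NUM_SIZES; s++) {
        plan.count[s] = triangles / strip_sizes[s];
        triangles %= strip_sizes[s];
    }
    plan.leftover = triangles;
    return plan;
}
```

For a strip of 20 triangles this plan yields one strip of 12 and one of 8. Once every strip has been planned, the pieces can be sorted by length so that each unrolled loop runs over one contiguous run of data.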

Optimizing Cache and Memory Use

This section first provides some background information about the structure of the cache and about memory lookup. It then gives some tips for optimizing cache and memory use.

Memory Organization

On most systems, memory is structured as a hierarchy that contains a small amount of faster, more expensive memory at the top and a large amount of slower memory at the base. The hierarchy is organized from registers in the CPU at the top down to the disks at the bottom. As memory locations are referenced, they are automatically copied into higher levels of the hierarchy; so, data that is referenced most often migrates to the fastest memory locations.

The following are the areas of concern:

The cache feeds data to the CPU, and cache misses can slow down your program.

Each processor has instruction caches and data caches. The purpose of the caches is to feed data and instructions to the CPU at maximum speed. When data is not found in the cache, a cache miss occurs and a performance penalty is incurred as data is brought into the cache.
The translation-lookaside buffer (TLB) keeps track of the location of frequently used pages of memory. If a page translation is not found in the TLB, a delay is incurred while the system looks up the page and enters its translation.

The goal of machine designers and programmers is to maximize the chance of finding data as high up in the memory hierarchy as possible. To achieve this goal, algorithms for maintaining the hierarchy, embodied in the hardware and the operating system, assume that programs have locality of reference in both time and space; that is, programs keep frequently accessed locations close together. Performance increases if you respect the degree of locality required by each level in the memory hierarchy.

Even applications that appear not to be memory-intensive, in terms of total number of memory locations accessed, may suffer unnecessary performance penalties for inefficient allocation of these resources. An excess of cache misses, especially misses on read operations, can force the most optimized code to be CPU-limited. Memory paging causes almost any application to be severely CPU-limited.

Minimizing Paging

This section provides some guidelines for minimizing memory paging:

Minimizing Lookups

To minimize page lookups, follow these guidelines:

Keep frequently used data within a minimal number of pages. Starting with IRIX 6.5, each page consists of 16 KB. In earlier versions of IRIX, each page consists of 4 KB (16 KB in high-end systems). Minimize the number of pages referenced in your program by keeping data structures within as few pages as possible. Use osview to verify that no TLB misses are occurring.
Store and access data in flat, sequential data structures, particularly for frequently referenced data. Every pointer indirection could result in the reading of a new page. This is guaranteed to cause performance problems with CPUs like the R10000 that try to execute instructions in parallel.
In large applications (which cause memory swapping), use mpin() to lock important memory into RAM.

Minimizing Cache Misses

Each processor may have first-level instruction and data caches on chip and have second-level caches that are bigger but somewhat slower. The sizes of these caches vary; you can use the hinv command to determine the sizes on your system. The first-level data cache is always a subset of the data in the second-level cache.

Memory access is much faster if the data is already loaded into the first-level cache. When your program accesses data that is not in one of the caches, a cache miss results. Caches are divided into lines, typically 32 to 128 bytes each, and a miss causes the whole line containing the data you just accessed to be loaded from the next level down in the hierarchy. The line size varies from machine to machine.

Because cache misses are costly, try to minimize them by following these steps:

Keep frequently accessed data together. Store and access frequently used data in flat, sequential files and structures and avoid pointer indirection. In this way, the most frequently accessed data remains in the first-level cache wherever possible.
Access data sequentially. If you are accessing words sequentially, each cache miss brings in 32 or more words of needed data; if you are accessing every 32nd word, each cache miss brings in one needed word and 31 unneeded words, degrading performance by up to a factor of 32.
Avoid simultaneously traversing several large independent buffers of data, such as an array of vertex coordinates and an array of colors within a loop. There can be cache conflicts between the buffers. Instead, pack the contents into one interleaved buffer when possible. If this packing forces a big increase in the size of the data, it may not be the right optimization for that program. If you are using vertex arrays, try using interleaved arrays.
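
The factor of 32 in the sequential-access guideline can be checked with a little arithmetic. The sketch below counts how many distinct cache lines a strided read touches. It assumes that the first word sits at the start of a line and that lines are filled whole on a miss; the function name is illustrative.

```c
/*
 * Number of distinct cache lines touched when reading `count` words that are
 * `stride` words apart, assuming the first word sits at the start of a line.
 */
static unsigned long lines_touched(unsigned long count, unsigned long stride,
                                   unsigned long words_per_line)
{
    if (count == 0)
        return 0;
    if (stride >= words_per_line)
        return count;   /* every access lands on a new line */
    /* With a stride shorter than a line, no line between the first and the
     * last is skipped, so the index of the last line gives the count. */
    return ((count - 1) * stride) / words_per_line + 1;
}
```

With 32 words per line, reading 1024 consecutive words touches 32 lines, while reading 1024 words at a stride of 32 touches 1024 lines, which is 32 times as many misses for the same amount of useful data.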

Second-level data cache misses also increase bus traffic, which can be a problem in a multiprocessing application. This can happen with multiple processes traversing very large data sets. See “Immediate Mode Drawing Versus Display Lists and Vertex Buffer Objects” for additional information.

Measuring Cache-Miss and Page-Fault Overhead

To find out if cache and memory usage are a significant part of your CPU limitation, follow these guidelines:

Use osview to monitor your application.
A more rigorous way to estimate the time spent on memory access is to compare the execution-profiling results collected with PC sampling with those of basic block counting. Perform each test with and without calls to glVertex3fv().
- PC sampling in Speedshop gives a real-time estimate of the time spent in different sections of the code.
- Basic block counting from Speedshop gives an ideal estimate of how much time should be spent, not including memory references.
See the speedshop man page or the Speedshop User's Guide for more information.

PC sampling includes time for system overhead; so, it always predicts longer execution than basic block counting. However, your PC sample time should not be more than 1.5 times the time predicted by basic block counting.

The CASEVision/WorkShop tools, in particular the performance analyzer, can also help with those measurements. The WorkShop Overview introduces the tools.

CPU Tuning: Advanced Techniques

After you have applied the techniques discussed in the previous sections, consider using the following advanced techniques to tune CPU-limited applications:

Mixing Computation With Graphics

When you are fine-tuning an application, interleaving computation and graphics can make it better balanced and therefore more efficient. Key places for interleaving are after glXSwapBuffers(), glClear(), and drawing operations that are known to be fill-limited (such as drawing a backdrop or a ground plane or any other large polygon).

A glXSwapBuffers() call creates a special situation. After calling glXSwapBuffers(), an application may be forced to wait for the next vertical retrace (in the worst case, up to 16.7 msecs at a 60 Hz refresh rate) before it can issue more graphics calls. For a program drawing 10 frames per second, roughly 17% of each 100-msec frame (worst case) can be spent waiting for the buffer swap to occur.

In contrast, non-graphic computation is not forced to wait for a vertical retrace. Therefore, if there is a section of computation that must be done every frame that includes no graphics calls, it can be done after the glXSwapBuffers() instead of causing a CPU limitation during drawing.
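
The worst-case wait after a buffer swap is the retrace period divided by the frame time. The following is a small sketch of that arithmetic, assuming a 60 Hz display (which is what a 16.7 msec retrace period implies); the function name is illustrative.

```c
/* Worst-case share of each frame spent waiting for vertical retrace
 * after a buffer swap. */
static double worst_case_swap_wait_fraction(double refresh_hz,
                                            double frames_per_second)
{
    double retrace_period = 1.0 / refresh_hz;        /* seconds between retraces */
    double frame_time     = 1.0 / frames_per_second; /* seconds per drawn frame */
    return retrace_period / frame_time;
}
```

At 60 Hz and 10 frames per second the result is one sixth of the frame. At 30 frames per second it rises to one half, so the payoff from moving non-graphics computation after the swap grows with the frame rate.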

Clearing the screen is a time-consuming operation. Doing non-graphics computation immediately after the clear is more efficient than sending additional graphics requests down the pipeline and being forced to wait when the pipeline's input queue overflows.

Experimentation is required to do the following:

Determine where the application is demonstrably graphics-limited
Ensure that inserting the computation does not create a new bottleneck

For example, if a new computation references a large section of data that is not in the data cache, the data for drawing may be swapped out for the computation, then swapped back in for drawing. The result is worse performance than the original organization.

Examining Assembly Code

When tuning inner rendering loops, examining assembly code can be helpful. You need not be an expert assembly coder to interpret the results. Just looking at the number of extra instructions required for an apparently innocuous operation is often informative.

On IRIX systems, use the dis command to disassemble optimized code for a given procedure and to correlate assembly code lines with line numbers from the source code file. This correlation is especially helpful for examining optimized code. The -S option to the cc command produces a .s file of assembly output, complete with your original comments.

On Silicon Graphics Prism systems, use the objdump -d [-S] command instead of the dis command. The -S option is available on the gcc command but comments are not included.

Using Additional Processors for Complex Scene Management

If your application is running on systems with multiple processors, consider supplying an option for doing scene management on additional processors to relieve the rendering processor from the burden of expensive computation.

Using additional processors may also reduce the amount of data rendered for a given frame. Simplifying or reducing rendering for a given scene can help reduce bottlenecks in all parts of the pipeline, as well as the CPU. One example is removing unseen or backfacing objects. Another common technique is to use an additional processor to determine when objects are going to appear very far away and use a simpler model with fewer polygons and less expensive modes for distant objects.
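
Choosing a simpler model for distant objects usually comes down to a small table lookup that a scene management process can run ahead of the renderer. This is a minimal sketch, assuming a per-object table of distance ranges; the LodLevel type, the model_id field, and the function name are hypothetical.

```c
#include <stddef.h>

/* Hypothetical level-of-detail entry: use `model_id` out to `max_distance`. */
typedef struct {
    float max_distance;
    int   model_id;
} LodLevel;

/*
 * Return the model for the first level whose range covers `distance`.
 * Levels must be sorted by increasing max_distance, and nlevels must be at
 * least 1. Beyond the last range, the coarsest (last) model is used.
 */
static int select_lod(const LodLevel *levels, size_t nlevels, float distance)
{
    for (size_t i = 0; i < nlevels; i++) {
        if (distance <= levels[i].max_distance)
            return levels[i].model_id;
    }
    return levels[nlevels - 1].model_id;
}
```

The rendering processor then receives only the selected model for each object, so it never has to examine the models that were rejected.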

Modeling to the Graphics Pipeline

The modeling of the database directly affects the rendering performance of the resulting application and therefore has to match the performance characteristics of the graphics pipeline and make trade-offs with the database traversals. Graphics pipelines that support connected primitives, such as triangle meshes, benefit from having long meshes in the database. However, the length of the meshes affects the resulting database hierarchy, and long strips through the database do not cull well with simple bounding geometry.

Model objects with an understanding of inherent bottlenecks in the graphics pipeline:

Pipelines that are severely fill-limited benefit from objects modeled with cut polygons, which use more vertices but have fewer overlapping parts and therefore lower depth complexity.
Pipelines that are easily geometry- or host-limited benefit from modeling with fewer polygons.

There are several other modeling tricks that can reduce database complexity:

Use textured polygons to simulate complex geometry. This is especially useful if the graphics subsystem supports the use of textures where the alpha component of the texture marks the transparency of the object. Textures can be used as cut-outs for objects like fences and trees.
Use textures for simulating particles, such as smoke.
Use textured polygons as single-polygon billboards. Billboards are polygons that are fixed at a point and rotated about an axis, or about a point, so that the polygon always faces the viewer. Billboards are useful for symmetric objects such as light posts and trees and also for volume objects, such as smoke. Billboards can also be used for distant objects to save geometry. However, the managing of billboard transformations can be expensive and affect both the cull and the draw processes.

The sprite extension can be used for billboards on certain platforms; see “SGIX_sprite—The Sprite Extension” in Chapter 9.

Tuning the Geometry Subsystem

The geometry subsystem is the part of the pipeline in which per-polygon operations, such as coordinate transformations, lighting, texture coordinate generation, and clipping are performed. The geometry hardware may also be used for operations that are not strictly transform operations, such as convolution.

This section presents the following techniques for tuning the geometry subsystem:

Using Peak-Performance Primitives for Drawing

This section describes how to draw geometry with optimal primitives. Consider the following guidelines to optimize drawing:

Use connected primitives (line strips, triangle strips, triangle fans, and quad strips). Put at least 8 primitives in a sequence—12 to 16 if possible.

Connected primitives are desirable because they reduce the amount of data sent to the graphics subsystem and the amount of per-polygon work done in the pipeline. Typically, about 12 vertices per glBegin()/glEnd() block are required to achieve peak rates (but this can vary depending on your hardware platform). For lines and points, it is especially beneficial to put as many vertices as possible in a glBegin()/glEnd() sequence. For information on the most efficient vertex numbers for the system you are using, see Chapter 18, “System-Specific Tuning”.
Use “well-behaved” polygons, convex and planar, with only three or four vertices.

If you use concave and self-intersecting polygons, they are broken down into triangles by OpenGL. For high-quality rendering, you must pass the polygons to GLU to be tessellated. This can make them prohibitively expensive. Nonplanar polygons and polygons with large numbers of vertices are more likely to exhibit shading artifacts.

If your database has polygons that are not well-behaved, perform an initial one-time pass over the database to transform the troublemakers into well-behaved polygons and use the new database for rendering. Using connected primitives results in additional gains.
Minimize the data sent per vertex.

Polygon rates can be affected directly by the number of normals or colors sent per polygon. Setting a color or normal per vertex, regardless of the glShadeModel() used, may be slower than setting only a color per polygon, because of the time spent sending the extra data and resetting the current color. The number of normals and colors per polygon also directly affects the size of a display list containing the object.
Group like primitives and minimize state changes to reduce pipeline revalidation.
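
To make the benefit of connected primitives concrete, the following sketch counts the vertices sent for the same triangles drawn independently and as a single strip. The function names are hypothetical; the counts follow from the definitions of GL_TRIANGLES and GL_TRIANGLE_STRIP.

```c
#include <stddef.h>

/* Vertices sent when every triangle is specified independently
 * (GL_TRIANGLES): three per triangle. */
size_t independent_triangle_vertices(size_t triangle_count)
{
    return 3 * triangle_count;
}

/* Vertices sent for the same triangles as one strip (GL_TRIANGLE_STRIP):
 * the first triangle needs three, each later one adds a single vertex. */
size_t triangle_strip_vertices(size_t triangle_count)
{
    return triangle_count == 0 ? 0 : triangle_count + 2;
}
```

For a sequence of 12 triangles, the strip sends 14 vertices instead of 36, which is why longer sequences approach one vertex per triangle.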

Using Vertex Arrays

Vertex arrays offer the following benefits:

The OpenGL implementation can take advantage of uniform data formats.
The glInterleavedArrays() call lets you specify packed vertex data easily. Packed vertex formats are typically faster for OpenGL to process.
The glDrawArrays() call reduces the overhead for subroutine calls.
The glDrawElements() call reduces the overhead for subroutine calls and also reduces per-vertex calculations because vertices are reused.
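
As an illustration of packed vertex data, the following sketch interleaves separate texture coordinate, normal, and position arrays into the layout used by the GL_T2F_N3F_V3F format of glInterleavedArrays(). The PackedVertex type and the pack_vertices() helper are hypothetical names introduced here; the drawing calls appear only in a comment because they need a current OpenGL context.

```c
#include <assert.h>
#include <stddef.h>

/* One vertex in the order that GL_T2F_N3F_V3F expects:
 * texture coordinate, then normal, then position. */
typedef struct {
    float texcoord[2];
    float normal[3];
    float position[3];
} PackedVertex;

/* Hypothetical helper: interleave three separate arrays into one packed
 * array so that each vertex occupies contiguous memory. */
void pack_vertices(PackedVertex *out, const float *texcoords,
                   const float *normals, const float *positions,
                   size_t vertex_count)
{
    size_t i, k;
    assert(out != NULL);
    for (i = 0; i < vertex_count; i++) {
        for (k = 0; k < 2; k++) out[i].texcoord[k] = texcoords[2 * i + k];
        for (k = 0; k < 3; k++) out[i].normal[k]   = normals[3 * i + k];
        for (k = 0; k < 3; k++) out[i].position[k] = positions[3 * i + k];
    }
}

/* With a current OpenGL context, the packed array could then be drawn with:
 *   glInterleavedArrays(GL_T2F_N3F_V3F, 0, packed);
 *   glDrawArrays(GL_TRIANGLES, 0, (GLsizei)vertex_count);
 */
```

Doing the packing once, at load time, keeps the per-frame work down to the two drawing calls.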

Using Display Lists Appropriately

You can often improve geometry performance by storing frequently-used commands in a display list. If you plan to redraw the same geometry multiple times, or if you have a set of state changes that are applied multiple times, consider using display lists. Display lists allow you to define the geometry or state changes once and execute them multiple times. Some graphics hardware stores display lists in dedicated memory or stores data in an optimized form for rendering (see also “CPU Tuning: Display Lists”).

Storing Data Efficiently

Putting some extra effort into generating a more efficient database makes a significant difference when traversing the data for display. A common tendency is to leave the data in a format that is good for loading or generating the object but not optimal for actually displaying the data. For peak performance, do as much work as possible before rendering. Preprocessing of data is typically performed at initialization time or when changing from a modeling mode to a fast rendering mode.

Minimizing State Changes

Your program will almost always benefit if you reduce the number of state changes. A good way to do this is to sort your scene data according to what state is set and render primitives with the same state settings together. Primitives should be sorted by the most expensive state settings first. Typically it is expensive to change texture binding, material parameters, fog parameters, texture filter modes, and the lighting model. However, some experimentation will be required to determine which state settings are most expensive on your system. For example, on systems that accelerate rasterization, it may not be very expensive to disable or enable depth testing or to change rasterization controls such as the depth test function. But if your system has software rasterization, this may cause the graphics pipeline to be revalidated.

It is also important to avoid redundant state changes. If your data is stored in a hierarchical database, make decisions about which geometry to display and which modes to use at the highest possible level. Decisions that are made too far down the tree can be redundant.
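
Sorting by state can be sketched with the standard qsort() routine. The DrawItem record and the assumption that a texture change costs more than a material change are hypothetical; measure on your own system before fixing the key order.

```c
#include <stdlib.h>

/* Hypothetical per-object record: the state it needs and what to draw. */
typedef struct {
    unsigned texture_id;   /* assumed to be the most expensive state change */
    unsigned material_id;  /* assumed to be cheaper than a texture change */
    int      mesh_index;
} DrawItem;

/* qsort() comparator: order by texture first, then by material. */
int compare_by_state(const void *a, const void *b)
{
    const DrawItem *x = (const DrawItem *)a;
    const DrawItem *y = (const DrawItem *)b;
    if (x->texture_id != y->texture_id)
        return x->texture_id < y->texture_id ? -1 : 1;
    if (x->material_id != y->material_id)
        return x->material_id < y->material_id ? -1 : 1;
    return 0;
}

void sort_by_state(DrawItem *items, size_t count)
{
    qsort(items, count, sizeof items[0], compare_by_state);
}

/* Count how often the texture binding would change during traversal. */
size_t count_texture_changes(const DrawItem *items, size_t count)
{
    size_t i, changes = 0;
    for (i = 1; i < count; i++)
        if (items[i].texture_id != items[i - 1].texture_id)
            changes++;
    return changes;
}
```

Comparing count_texture_changes() before and after sort_by_state() gives a quick estimate of how many texture binds the sort saves per frame.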

Optimizing Transformations

OpenGL implementations are often able to optimize transform operations if the matrix type is known. Use the following guidelines to achieve optimal transform rates:

Call glLoadIdentity() to initialize a matrix rather than loading your own copy of the identity matrix.
Use specific matrix calls such as glRotate*(), glTranslate*(), and glScale*() rather than composing your own rotation, translation, or scale matrices and calling glLoadMatrix() or glMultMatrix().
If possible, use single-precision calls such as glRotatef(), glTranslatef(), and glScalef(). On most systems, this may not be critical because the CPU converts doubles to floats.

Optimizing Lighting Performance

OpenGL offers a large selection of lighting features. Some are virtually free in terms of computational time and others offer sophisticated effects with some performance penalty. For some features, the penalties may vary depending on your hardware. Be prepared to experiment with the lighting configuration.

As a general rule, use the simplest possible lighting model, a single infinite light with an infinite viewer. For some local effects, try replacing local lights with infinite lights and a local viewer.

You normally will not notice a performance degradation when using one infinite light, unless you use lit textures or color index lighting.

Use the following settings for peak-performance lighting:

Single infinite light
- Ensure that GL_LIGHT_MODEL_LOCAL_VIEWER is set to GL_FALSE in glLightModel() (the default).
- Ensure that GL_LIGHT_MODEL_TWO_SIDE is set to GL_FALSE in glLightModel() (the default).
- Local lights are noticeably more expensive than infinite lights. Avoid lighting where the fourth component of GL_LIGHT_POSITION is nonzero.
- There may be a sharp drop in lighting performance when switching from one light to two lights, but the drop for additional lights is likely to be more gradual.
RGB mode
GL_COLOR_MATERIAL disabled
GL_NORMALIZE disabled

Because GL_NORMALIZE is usually necessary when the modelview matrix includes a scaling transformation, consider preprocessing the scene to eliminate scaling.
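
One way to eliminate scaling ahead of time, assuming the scale is uniform and constant for an object, is to multiply the scale into the vertex positions once when the data is loaded. The helper name bake_uniform_scale is hypothetical, and this minimal sketch does not cover nonuniform or animated scales.

```c
#include <stddef.h>

/* Hypothetical preprocessing step: multiply every vertex position by a
 * uniform scale factor once, at load time. positions holds x, y, z for
 * each vertex. Normals are left untouched, so unit-length normals stay
 * unit length and the modelview matrix no longer needs a scale. */
void bake_uniform_scale(float *positions, size_t vertex_count, float scale)
{
    size_t i;
    for (i = 0; i < 3 * vertex_count; i++)
        positions[i] *= scale;
}
```

After this step, the object can be drawn without a glScale*() call and with GL_NORMALIZE disabled, provided the supplied normals are unit length.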

Lighting Operations With Noticeable Performance Costs

Follow these additional guidelines to achieve peak lighting performance:

Do not change material parameters frequently.

Changing material parameters can be expensive. If you need to change the material parameters many times per frame, consider rearranging the scene traversal to minimize material changes. Also, consider using glColorMaterial() to change specific parameters automatically rather than using glMaterial() to change parameters explicitly.

The following code fragment illustrates how to change ambient and diffuse material parameters at every polygon or at every vertex:

glColorMaterial(GL_FRONT_AND_BACK, GL_AMBIENT_AND_DIFFUSE);
glEnable(GL_COLOR_MATERIAL);
/* Draw triangles: */
glBegin(GL_TRIANGLES);
/* Set ambient and diffuse material parameters: */
glColor4f(red, green, blue, alpha);
glVertex3fv(...);glVertex3fv(...);glVertex3fv(...);
glColor4f(red, green, blue, alpha);
glVertex3fv(...);glVertex3fv(...);glVertex3fv(...);
...
glEnd();

Disable two-sided lighting unless your application requires it.

Two-sided lighting illuminates both sides of a polygon. This is much faster than the alternative of drawing polygons twice. However, using two-sided lighting is significantly slower than one-sided lighting for a single rendering object.
Disable GL_NORMALIZE.

If possible, provide unit-length normals and do not call glScale*() to avoid the overhead of GL_NORMALIZE. On some OpenGL implementations, it may be faster to simply rescale the normal instead of renormalizing it when the modelview matrix contains a uniform scale matrix.
Avoid scaling operations if possible.
Avoid changing the GL_SHININESS material parameter if possible. Setting a new GL_SHININESS value requires significant computation each time.

Choosing Modes Wisely

OpenGL offers many features that create sophisticated effects with excellent performance. For each feature, consider the trade-off between effects, performance, and quality.

Turn off features when they are not required.

Once a feature has been turned on, it can slow the transform rate even when it has no visible effect.

For example, the use of fog can slow the transform rate of polygons even when the polygons are too close to show fog and even when the fog density is set to zero. For these conditions, turn off fog explicitly with the following call:
glDisable(GL_FOG);
Minimize expensive mode changes and sort operations by the most expensive mode. Specifically, consider these tips:
- Use small numbers of texture maps to avoid the cost of switching between textures. If you have many small textures, consider combining them into a single larger, tiled texture. Rather than switching to a new texture before drawing a textured polygon, choose texture coordinates that select the appropriate small texture tile within the large texture.
- Avoid changing the projection matrix or changing glDepthRange() parameters.
- When fog is enabled, avoid changing fog parameters.
- Turn fog off for rendering with a different projection (for example, orthographic) and turn it back on when returning to the normal projection.
Use flat shading whenever possible.

Flat shading reduces the number of lighting computations from one per vertex to one per primitive and also reduces the amount of data that must be passed from the CPU through the graphics pipeline for each primitive. This is particularly important for high-performance line drawing.
Beware of excessive mode changes, even mode changes considered cheap, such as changes to shade model, depth buffering, and blending function.

Advanced Transform-Limited Tuning Techniques

This section describes advanced techniques for tuning transform-limited drawing. Use the following guidelines to draw objects with complex surface characteristics:

Use textures to replace complex geometry.

Textured polygons can be significantly slower than their non-textured counterparts. However, texture can be used instead of extra polygons to add detail to a geometric object. This can provide simplified geometry with a net speed increase and an improved picture, as long as it does not cause the program to become fill-limited. Texturing performance varies across the product line; so, this technique might not be equally effective on all systems. Experimentation is usually necessary.
Use glAlphaFunc() in conjunction with one or more textures to give the effect of rather complex geometry on a single polygon.

Consider drawing an image of a complex object by texturing it onto a single polygon. Set alpha values to zero in the texture outside the image of the object. (The edges of the object can be antialiased by using alpha values between zero and one.) Orient the polygon to face the viewer. To prevent pixels with zero alpha values in the textured polygon from being drawn, make the following call:
glAlphaFunc(GL_NOTEQUAL, 0.0);
This effect is often used to create objects like trees that have complex edges or many holes through which the background should be visible (or both).
Eliminate objects or polygons that will be out of sight or too small.
Use fog to increase visual detail without drawing small background objects.
Use culling on a separate processor to eliminate objects or polygons that will be out of sight or too small to see.
Use occlusion culling: draw large objects that are in front first, then read back the depth buffer, and use it to avoid drawing objects that are hidden.
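
Culling objects that are out of sight is often done on the host with bounding volumes. The following sketch tests a bounding sphere against a set of clipping planes. The Plane type and function name are hypothetical, and the planes are assumed to have unit-length normals that point toward the inside of the view volume.

```c
#include <stddef.h>

/* A plane a*x + b*y + c*z + d = 0 with a unit-length normal (a, b, c)
 * pointing toward the inside of the view volume. */
typedef struct { float a, b, c, d; } Plane;

/* Returns 1 if the bounding sphere may be visible, 0 if it lies
 * completely outside at least one plane. */
int sphere_may_be_visible(const float center[3], float radius,
                          const Plane *planes, size_t plane_count)
{
    size_t i;
    for (i = 0; i < plane_count; i++) {
        float dist = planes[i].a * center[0] + planes[i].b * center[1] +
                     planes[i].c * center[2] + planes[i].d;
        if (dist < -radius)
            return 0;  /* entirely on the outside of this plane */
    }
    return 1;
}
```

The test is conservative: a sphere that straddles a plane is reported as possibly visible, so no visible object is discarded.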

Tuning the Raster Subsystem

In the raster subsystem, per-pixel and per-fragment operations take place. The operations include writing color values into the framebuffer or more complex operations like depth buffering, alpha blending, and texture mapping.

An explosion of both data and operations is required to rasterize a polygon as individual pixels. Typically, the operations include depth comparison, Gouraud shading, color blending, logical operations, texture mapping, and possibly antialiasing. This section describes the following techniques for tuning fill-limited drawing:

Using Backface/Frontface Removal

To reduce fill-limited drawing, use backface/frontface removal. For example, if you are drawing a sphere, half of its polygons are backfacing at any given time. Backface/frontface removal is done after transformation calculations but before per-fragment operations. This means that backface removal may make transform-limited polygons somewhat slower but make fill-limited polygons significantly faster. You can turn on backface removal when you are drawing an object with many backfacing polygons, then turn it off again when drawing is completed.
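
In an application, backface removal is enabled with glEnable(GL_CULL_FACE) and glCullFace(GL_BACK). The following sketch illustrates the kind of facing test involved: the sign of a triangle's area in window coordinates. The function names are hypothetical, and the code assumes the default glFrontFace(GL_CCW) convention.

```c
/* Twice the signed area of a triangle in window coordinates.
 * Positive for counterclockwise vertex order, negative for clockwise. */
float signed_area_x2(const float a[2], const float b[2], const float c[2])
{
    return (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1]);
}

/* Returns 1 if the triangle is back-facing when counterclockwise
 * order is front-facing, 0 otherwise. */
int is_backfacing(const float a[2], const float b[2], const float c[2])
{
    return signed_area_x2(a, b, c) < 0.0f;
}
```

Because the test needs window coordinates, it can only run after the vertices have been transformed, which is why culling saves fill work but not transform work.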

Minimizing Per-Pixel Calculations

One way to improve fill-limited drawing is to reduce the work required to render fragments. This section describes the following ways to do this:

Avoiding Unnecessary Per-Fragment Operations

Turn off per-fragment operations for objects that do not require them and structure the drawing process to minimize their use without causing excessive toggling of modes.

For example, if you are using alpha blending to draw some partially transparent objects, make sure that you disable blending when drawing the opaque objects. Also, if you enable alpha testing to render textures with holes through which the background can be seen, be sure to disable alpha testing when rendering textures or objects with no holes. It also helps to sort primitives so that primitives that require alpha blending or alpha testing to be enabled are drawn at the same time. Finally, you may find it faster to render polygons such as terrain data in back-to-front order.

Organizing Drawing to Minimize Computation

Organizing drawing to minimize per-pixel computation can significantly enhance performance. For example, to minimize depth buffer requirements, disable depth buffering when drawing large background polygons, then draw more complex depth-buffered objects.

Using Expensive Per-Fragment Operations Efficiently

Use expensive per-fragment operations with care. Per-fragment operations, in the rough order of increasing cost (with flat shading being the least expensive and multisampling the most expensive), are as follows:

Flat shading
Gouraud shading
Depth buffering
Alpha blending
Texturing
Multisampling

Note: The actual order depends on your system.

Each operation can independently slow down the pixel fill rate of a polygon, although depth buffering can help reduce the cost of alpha blending or multisampling for hidden polygons.

Using Depth Buffering Efficiently

Any rendering operation can become fill-limited for large polygons. Clever structuring of drawing can eliminate the need for certain fill operations. For example, if large backgrounds are drawn first, they do not need to be depth-buffered. It is better to disable depth buffering for the backgrounds and then enable it for other objects where it is needed.

For example, flight simulators use this technique. Depth buffering is disabled; the sky, ground, and then the polygons lying flat on the ground (runway and grid) are drawn without suffering a performance penalty. Then depth buffering is enabled for drawing the mountains and airplanes.

There are other special cases in which depth buffering might not be required. For example, terrain, ocean waves, and 3D function plots are often represented as height fields (X-Y grids with one height value at each lattice point). It is straightforward to draw height fields in back-to-front order by determining which edge of the field is furthest away from the viewer, then drawing strips of triangles or quadrilaterals parallel to that starting edge and working forward. The entire height field can be drawn without depth testing, provided it does not intersect any piece of previously-drawn geometry. Depth values need not be written at all, unless subsequently-drawn depth-buffered geometry might intersect the height field; in that case, depth values for the height field should be written, but the depth test can be avoided by calling the following function:

glDepthFunc(GL_ALWAYS);
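
The choice of starting edge for a height field can be sketched as follows. This is a simplified illustration with hypothetical names: it assumes an axis-aligned field, a viewer outside the field's footprint, and strips numbered from the minimum edge to the maximum edge. It does not address ordering within a strip.

```c
typedef enum { EDGE_X_MIN, EDGE_X_MAX, EDGE_Y_MIN, EDGE_Y_MAX } FieldEdge;

static float abs_diff(float a, float b) { return a > b ? a - b : b - a; }

/* Pick the edge of the field whose supporting line is farthest from
 * the viewer's (x, y) position. */
FieldEdge farthest_edge(float viewer_x, float viewer_y,
                        float x_min, float x_max, float y_min, float y_max)
{
    FieldEdge best = EDGE_X_MIN;
    float best_dist = abs_diff(viewer_x, x_min);
    float d;

    d = abs_diff(viewer_x, x_max);
    if (d > best_dist) { best = EDGE_X_MAX; best_dist = d; }
    d = abs_diff(viewer_y, y_min);
    if (d > best_dist) { best = EDGE_Y_MIN; best_dist = d; }
    d = abs_diff(viewer_y, y_max);
    if (d > best_dist) { best = EDGE_Y_MAX; best_dist = d; }
    return best;
}

/* Strips parallel to the far edge are numbered 0..strip_count-1 from the
 * MIN edge to the MAX edge. Sets *first and *step so that first,
 * first+step, ... visits the strips starting at the far edge. */
void strip_order(FieldEdge far_edge, int strip_count, int *first, int *step)
{
    if (far_edge == EDGE_X_MAX || far_edge == EDGE_Y_MAX) {
        *first = strip_count - 1;
        *step = -1;
    } else {
        *first = 0;
        *step = 1;
    }
}
```

A drawing loop would then issue one triangle or quad strip per index, starting at first and advancing by step for strip_count iterations.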

Balancing Polygon Size and Pixel Operations

The pipeline is generally optimized for polygons that have 10 pixels on a side. However, you may need to work with polygons larger or smaller than that depending on the other operations taking place in the pipeline:

If the polygons are too large for the fill rate to keep up with the rest of the pipeline, the application is fill-rate limited. Smaller polygons balance the pipeline and increase the polygon rate.
If the polygons are too small for the rest of the pipeline to keep up with filling, then the application is transform-limited. Larger and fewer polygons, or fewer vertices, balance the pipeline and increase the fill rate.

If you are drawing very large polygons such as backgrounds, performance will improve if you use simple fill algorithms. For example, do not set glShadeModel() to GL_SMOOTH if smooth shading is not required. Also, disable per-fragment operations such as depth buffering, if possible. If you need to texture the background polygons, consider using GL_REPLACE as the texture environment.

Other Considerations

The following are other ways to improve fill-limited drawing:

Use alpha blending with discretion.

Alpha blending is an expensive operation. A common use of alpha blending is for transparency, where the alpha value denotes the opacity of the object. For fully opaque objects, disable alpha blending with glDisable(GL_BLEND).
Avoid unnecessary per-fragment operations.

Turn off per-fragment operations for objects that do not require them and structure the drawing process to minimize their use without causing excessive toggling of modes.

Using Clear Operations

When considering clear operations, use the following guidelines:

If possible, avoid clear operations.

For example, you can avoid clearing the depth buffer by setting the depth test to GL_ALWAYS.
Avoid clearing the color and depth buffers independently.

The most basic per-frame operations are clearing the color and depth buffers. On some systems, there are optimizations for common special cases of these operations.

Whenever you need to clear both the color and depth buffers, do not clear each buffer independently. Instead, make the following call:
glClear(GL_COLOR_BUFFER_BIT|GL_DEPTH_BUFFER_BIT);
Be sure to disable dithering before clearing.

Optimizing Texture Mapping

Follow these guidelines when rendering textured objects:

Avoid frequent switching between texture maps.

If you have many small textures, consider combining them into a single, larger mosaic texture. Rather than switching to a new texture before drawing a textured polygon, choose texture coordinates that select the appropriate small texture tile within the large texture.
Use texture objects to encapsulate texture data.

Place all the glTexImage*() calls (including mipmaps) required to completely specify a texture and the associated glTexParameter*() calls (which set texture properties) into a texture object and bind this texture object to the rendering context. This allows the implementation to compile the texture into a format that is optimal for rendering and, if the system accelerates texturing, to efficiently manage textures on the graphics adapter.
When using texture objects, call glAreTexturesResident() to make sure that all texture objects are resident during rendering.

On systems where texturing is done on the host, glAreTexturesResident() always returns GL_TRUE. If necessary, reduce the size or internal format resolution of your textures until they all fit into memory. If such a reduction creates intolerably fuzzy textured objects, you may give some textures lower priority.
If possible, use glTexSubImage*D() to replace all or part of an existing texture image rather than the more costly operations of deleting and creating an entire new image.
Avoid expensive texture filter modes.

On some systems, trilinear filtering is much more expensive than nearest or linear filtering.
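
The mosaic-texture guideline above depends on computing texture coordinates that select one tile. The following is a minimal sketch that assumes equally sized tiles laid out in a regular grid and numbered row by row from the texture origin. The TileRect type and tile_texcoords() function are hypothetical names.

```c
#include <assert.h>

/* Texture-coordinate rectangle of one tile inside a mosaic texture. */
typedef struct { float s0, t0, s1, t1; } TileRect;

/* Hypothetical helper: return the (s, t) range covered by tile_index in a
 * grid of tiles_per_row by tiles_per_column equally sized tiles. */
TileRect tile_texcoords(int tile_index, int tiles_per_row, int tiles_per_column)
{
    TileRect r;
    int col, row;
    assert(tiles_per_row > 0 && tiles_per_column > 0);
    assert(tile_index >= 0 && tile_index < tiles_per_row * tiles_per_column);
    col = tile_index % tiles_per_row;
    row = tile_index / tiles_per_row;
    r.s0 = (float)col / (float)tiles_per_row;
    r.t0 = (float)row / (float)tiles_per_column;
    r.s1 = (float)(col + 1) / (float)tiles_per_row;
    r.t1 = (float)(row + 1) / (float)tiles_per_column;
    return r;
}
```

With linear or mipmapped filtering, texels from neighboring tiles can bleed across tile edges, so an application may need to inset these coordinates slightly or pad each tile.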

Tuning the Imaging Pipeline

This section briefly lists some ways in which you can improve pixel processing. Example 17-1 provides a code fragment that shows how to set the OpenGL state so that subsequent calls to glDrawPixels() or glCopyPixels() will be fast.

To improve performance in the imaging pipeline, follow these guidelines:

Disable all per-fragment operations.
Define images in the native hardware format so that type conversion is not necessary.
For texture download operations, match the internal format of the texture with that on the host.
Byte-sized components, particularly unsigned byte components, are fast. Use pixel formats where each of the components (red, green, blue, alpha, luminance, or intensity) is 8 bits long.
Use fewer components; for example, use GL_LUMINANCE_ALPHA or GL_LUMINANCE.

Use a color matrix and a color mask to store four luminance values in the RGBA framebuffer and to work with one component at a time. Convolution is much more efficient when only one component is being processed. Once four luminance images are stored in the four components, process all four in parallel. Processing four images together is usually faster than processing them individually as single-component images.

The following code fragment uses the green component as the data source and writes the result of the operation into some (possibly all) of the other components:

/* Matrix is in column major order */
GLfloat smearGreenMat[16] = {
    0, 0, 0, 0,
    1, 1, 1, 1,
    0, 0, 0, 0,
    0, 0, 0, 0,
};
/* The variables updateR, updateG, updateB, and updateA indicate
 * whether the corresponding component is updated.
 */
GLboolean updateR, updateG, updateB, updateA;

...
 
/* Check for availability of the color matrix extension */
 
/* Set proper color matrix and mask */
glMatrixMode(GL_COLOR);
glLoadMatrixf(smearGreenMat);
glColorMask(updateR, updateG, updateB, updateA);
 
/* Perform the imaging operation */    
glEnable(GL_SEPARABLE_2D_EXT);
glCopyTexSubImage2DEXT(...);
/* Restore an identity color matrix.  Not needed when the same
 * smear operation is to be used over and over.
 */
glLoadIdentity();
 
/* Restore previous matrix mode (assuming it is modelview) */
glMatrixMode(GL_MODELVIEW);
...

Load the identity matrix into the color matrix to turn the color matrix off.

When using the color matrix to broadcast one component into all others, avoid manipulating the color matrix with transformation calls such as glRotate*(). Instead, load the matrix explicitly using glLoadMatrix*().
Locate the bottleneck.

Similar to polygon drawing, there can be a pixel-drawing bottleneck due to overload in host bandwidth, processing, or rasterizing. When all modes are off, the path is most likely limited by host bandwidth, and a wise choice of host pixel format and type pays off tremendously. This is also why byte components are sometimes faster. For example, use packed pixel format GL_RGB5_A1 to load a texture with a GL_RGB5_A1 internal format.

When either many processing modes or several expensive modes such as convolution are on, the processing stage is the bottleneck. Such cases benefit from one-component processing, which is much faster than multicomponent processing.

Zooming up pixels may create a raster bottleneck.
A large pixel rectangle has a higher throughput (that is, pixels per second) than a small rectangle. Because the imaging pipeline is tuned to trade off a relatively large setup time with a high pixel-transfer efficiency, a large rectangle distributes the setup cost over many pixels to achieve higher throughput.
Having no mode changes between pixel operations results in higher throughput. New high-end hardware detects pixel mode changes between pixel operations. When there is no mode change between pixel operations, the setup overhead is drastically reduced. This is done to optimize for image tiling where an image is painted on the screen by drawing many small tiles.
On most systems, glCopyPixels() is faster than glDrawPixels().
Tightly packing data in memory (for example, row length=0, alignment=1) is slightly more efficient for host transfer.
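
To see what the smear-green color matrix in the guidelines above does to a pixel, the following sketch applies a column-major 4x4 matrix to an RGBA value on the host. The function name is hypothetical and the code only illustrates the arithmetic. In OpenGL the multiplication happens in the imaging pipeline, and the color mask then decides which components are written.

```c
/* Apply a 4x4 color matrix stored in column-major order (as OpenGL
 * stores matrices) to an RGBA color: out = m * in. */
void apply_color_matrix(const float m[16], const float in[4], float out[4])
{
    int row, col;
    for (row = 0; row < 4; row++) {
        out[row] = 0.0f;
        for (col = 0; col < 4; col++)
            out[row] += m[col * 4 + row] * in[col];
    }
}
```

With the smear-green matrix, whose second column is all ones and whose other entries are zero, every output component equals the incoming green value.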


Bad		`glNormal3fv(normaldata); glTexCoord2fv(texdata); glVertex3fv(vertdata);`
Good		`glNormal3fv(dataptr); glTexCoord2fv(dataptr+3); glVertex3fv(dataptr+5);`