Chapter 18. System-Specific Tuning


This chapter first describes some general issues regarding system-specific tuning and then provides tuning information that is relevant for particular Silicon Graphics systems. Use these techniques as needed if you expect your program to be used primarily on one kind of system or a group of systems. The chapter covers the following topics: an introduction to system-specific tuning, optimizing performance on InfiniteReality systems, and optimizing performance on Onyx4 and Silicon Graphics Prism systems.

Some points are also described in earlier chapters but repeated here because they result in particularly noticeable performance improvement on certain platforms.

Note: To determine your particular hardware configuration, use /usr/gfx/gfxinfo. See the man page for gfxinfo for more information. You can also call glGetString() with the GL_RENDERER argument. For information about the renderer strings for different systems, see the man page for glGetString().

Introduction to System-Specific Tuning

Many of the performance tuning techniques described in the previous chapters (such as minimizing the number of state changes and disabling features that are not required) are good ideas, regardless of your platform. Other tuning techniques need to be customized for certain systems. For example, before you sort your database based on state changes, you need to determine which state changes are the most expensive for your target system.
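The sorting step mentioned above can be sketched as follows. This is a minimal illustration, not code from the guide: the DrawItem record and both function names are hypothetical, and the sort order assumes that texture binding is the expensive change on the target system and shade model the cheap one. On another platform you would reorder the comparison keys according to your own measurements.

```c
#include <assert.h>
#include <stdlib.h>

/* One drawable item and the state it needs (hypothetical record layout). */
typedef struct {
    unsigned texture;     /* texture object name */
    unsigned shade_model; /* small code standing for flat or smooth */
} DrawItem;

/* Compare by the expensive state first (texture), then by the cheap one. */
static int compare_items(const void *pa, const void *pb)
{
    const DrawItem *a = pa;
    const DrawItem *b = pb;

    if (a->texture != b->texture)
        return a->texture < b->texture ? -1 : 1;
    if (a->shade_model != b->shade_model)
        return a->shade_model < b->shade_model ? -1 : 1;
    return 0;
}

/* Reorder the items so that equal expensive state ends up adjacent. */
void sort_by_state(DrawItem *items, size_t count)
{
    assert(items != NULL || count == 0);
    if (count > 1)
        qsort(items, count, sizeof *items, compare_items);
}

/* Count how often the texture would be rebound if the items were drawn
   in array order. */
size_t count_texture_binds(const DrawItem *items, size_t count)
{
    size_t binds = 0;

    for (size_t i = 0; i < count; i++)
        if (i == 0 || items[i].texture != items[i - 1].texture)
            binds++;
    return binds;
}
```

Comparing count_texture_binds() before and after sorting gives a quick estimate of how many expensive state changes the sort removes.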

In addition, you may want to modify the behavior of your program depending on which modes are fast. This is especially important for programs that must run at a particular frame rate. To maintain the frame rate on certain systems, you may need to disable some features. For example, if a particular texture mapping environment is slow on one of your target systems, you must disable texture mapping or change the texture environment whenever your program is running on that platform.
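One way to organize such per-platform behavior is a small lookup table keyed on the renderer string. The sketch below is only an illustration: the PlatformProfile type, the function name, and both table entries are hypothetical placeholders, not real renderer strings. Consult the glGetString() man page for the actual strings on your systems.

```c
#include <assert.h>
#include <string.h>

typedef struct {
    const char *renderer_substring; /* text expected inside GL_RENDERER */
    int use_texturing;              /* 0 means disable texture mapping */
} PlatformProfile;

/* Hypothetical entries: placeholder names, not real renderer strings. */
static const PlatformProfile profiles[] = {
    { "SLOW-TEXTURE-SYSTEM", 0 },
    { "FAST-TEXTURE-SYSTEM", 1 },
};

/* Decide whether texturing should stay enabled for this renderer. */
int should_use_texturing(const char *renderer)
{
    assert(renderer != NULL);
    for (size_t i = 0; i < sizeof profiles / sizeof profiles[0]; i++)
        if (strstr(renderer, profiles[i].renderer_substring) != NULL)
            return profiles[i].use_texturing;
    return 1; /* unknown platform: leave the feature on */
}
```

In a program with a current context, the argument would be the result of glGetString(GL_RENDERER) cast to const char *, and a zero result would be followed by glDisable(GL_TEXTURE_2D) or a change of texture environment.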

Before you can tune your program for each of the target platforms, you must do some performance measurements. This is not always straightforward. Often a particular device can accelerate certain features, but not all at the same time. It is therefore important to test the performance for combinations of features that you will be using. For example, a graphics adapter may accelerate texture mapping but only for certain texture parameters and texture environment settings. Even if all texture modes are accelerated, you have to experiment to see how many textures you can use at the same time without causing the adapter to page textures in and out of the local memory.

A more complicated situation arises if the graphics adapter has a shared pool of memory that is allocated to several tasks. For example, the adapter may not have a framebuffer deep enough to contain a depth buffer and a stencil buffer. In this case, the adapter would be able to accelerate both depth buffering and stenciling but not at the same time; or, perhaps, depth buffering and stenciling can both be accelerated but only for certain stencil buffer depths.

Typically, per-platform testing is done at initialization time. You should do some trial runs through your data with different combinations of state settings and calculate the time it takes to render in each case. You may want to save the results in a file so that your program does not have to do this test each time it starts. You can find an example of how to measure the performance of particular OpenGL operations and to save the results using the isfast program from the OpenGL website.
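The trial-run approach can be sketched as below. This is not the isfast program; it is a minimal illustration with hypothetical function names and a one-value-per-line file format chosen for simplicity. It assumes a POSIX system that provides clock_gettime() and assumes that the draw callback ends with glFinish() so that queued commands are included in the measurement.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Wall-clock seconds per call of draw(), averaged over `frames` calls.
   For OpenGL work, draw() should finish with glFinish(). */
double time_trial(void (*draw)(void), int frames)
{
    struct timespec start, end;

    assert(draw != NULL && frames > 0);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < frames; i++)
        draw();
    clock_gettime(CLOCK_MONOTONIC, &end);
    return ((double)(end.tv_sec - start.tv_sec) +
            (double)(end.tv_nsec - start.tv_nsec) * 1e-9) / frames;
}

/* Index of the smallest time, that is, the fastest state combination. */
int fastest_trial(const double *seconds, int count)
{
    int best = 0;

    assert(seconds != NULL && count > 0);
    for (int i = 1; i < count; i++)
        if (seconds[i] < seconds[best])
            best = i;
    return best;
}

/* Write one time per line so that later runs can skip the measurement.
   Returns 0 on success and -1 on a write error. */
int save_results(FILE *out, const double *seconds, int count)
{
    assert(out != NULL && seconds != NULL);
    for (int i = 0; i < count; i++)
        if (fprintf(out, "%.9g\n", seconds[i]) < 0)
            return -1;
    return 0;
}

/* Read up to `max` times; returns how many values were read. */
int load_results(FILE *in, double *seconds, int max)
{
    int n = 0;

    assert(in != NULL && seconds != NULL);
    while (n < max && fscanf(in, "%lf", &seconds[n]) == 1)
        n++;
    return n;
}
```

At startup, a program would try load_results() first and fall back to running time_trial() for each state combination only when no saved file is found.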

Optimizing Performance on InfiniteReality Systems

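This section describes how to optimize performance on InfiniteReality systems. It covers texture management, offscreen rendering and framebuffer management, state changes, and miscellaneous performance hints.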

Managing Textures on InfiniteReality Systems

The following texture management strategies are recommended for InfiniteReality systems:

Using the texture_object extension (OpenGL 1.0) or texture objects (OpenGL 1.1) usually yields better performance than using display lists.
OpenGL will make a copy of your texture if needed for context switching; so, deallocate your own copy as soon as possible after loading it.

On InfiniteReality systems, only the copy on the graphics pipe exists. If you run out of texture memory, OpenGL must save the texture that did not fit from the graphics pipe to the host, clean up texture memory, and then reload the texture. To avoid these multiple moves of the texture, always clean up textures you no longer need so that you do not deplete texture memory.

This approach has the advantage of very fast texture loading because no host copy is made.
To load a texture immediately, perform the following steps:
1. Enable texturing.
2. Bind your texture.
3. Call glTexImage*().
To define a texture without loading it into the hardware until the first time it is referenced, perform the following steps:
1. Disable texturing.
2. Bind your texture.
3. Call glTexImage*().
In this case, a copy of your texture is placed in main memory.
Do not overflow texture memory; otherwise, texture swapping will occur.
If you want to implement your own texture memory management policy, use subtexture loading. You have the following two options:
- Allocate one large empty texture, call glTexSubImage*() to load it piecewise, and then use the texture matrix to select the relevant portion.
- Allocate several textures, then fill them in by calling glTexSubImage*() as appropriate.
For both options, it is important that after initial setup, you never create and destroy textures but reuse existing ones.
Use 16-bit texels whenever possible; RGBA4 can be twice as fast as RGBA8. As a rule, remember that bigger formats are slower.
If you need a fine color ramp, start with 16-bit texels and then use a texture lookup table and texture scale/bias.
Texture subimages should be multiples of 8 texels wide for maximum performance.
For loading textures, use pixel formats on the host that match texel formats on the graphics system.
Avoid OpenGL texture borders; they consume large amounts of texture memory. For clamping, use the GL_CLAMP_TO_EDGE_SGIS style defined by the SGIS_texture_edge_clamp extension.
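For the first subtexture option above (one large texture, with the texture matrix selecting a portion), the scale and offset can be computed as in the following sketch. The TileTransform type and both function names are hypothetical. The second function rounds a subimage width up to a multiple of 8 texels, in line with the width guideline in the list.

```c
#include <assert.h>

/* Scale and translation that map texture coordinates in [0,1] onto one
   rectangular tile of a larger texture. */
typedef struct {
    float scale_s, scale_t;
    float offset_s, offset_t;
} TileTransform;

TileTransform tile_transform(int atlas_w, int atlas_h,
                             int tile_x, int tile_y,
                             int tile_w, int tile_h)
{
    TileTransform t;

    assert(atlas_w > 0 && atlas_h > 0);
    assert(tile_x >= 0 && tile_y >= 0 && tile_w > 0 && tile_h > 0);
    assert(tile_x + tile_w <= atlas_w && tile_y + tile_h <= atlas_h);

    t.scale_s  = (float)tile_w / (float)atlas_w;
    t.scale_t  = (float)tile_h / (float)atlas_h;
    t.offset_s = (float)tile_x / (float)atlas_w;
    t.offset_t = (float)tile_y / (float)atlas_h;
    return t;
}

/* Round a subimage width up to the next multiple of 8 texels. */
int align_width_to_8(int width)
{
    assert(width > 0);
    return (width + 7) & ~7;
}
```

With a current context, the result would typically be applied by calling glMatrixMode(GL_TEXTURE), glLoadIdentity(), glTranslatef(t.offset_s, t.offset_t, 0.0f), and then glScalef(t.scale_s, t.scale_t, 1.0f), so that each coordinate is scaled first and then offset into the tile.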

Offscreen Rendering and Framebuffer Management

InfiniteReality systems support offscreen rendering through a combination of OpenGL features and extensions:

Pixel buffers		Pixel buffers (pbuffers) are offscreen pixel arrays that behave much like windows, except that they are invisible. See “SGIX_pbuffer—The Pixel Buffer Extension” in Chapter 6.
Framebuffer configurations		Framebuffer configurations define color buffer depths, determine presence of Z buffers, and so on. See “Using Visuals and Framebuffer Configurations” in Chapter 4.
Concurrent reads/writes		The function `glXMakeCurrentReadSGI()` allows you to read from one window or pbuffer while writing to another. See “SGI_make_current_read—The Make Current Read Extension” in Chapter 6.

In addition, glCopyTexImage*() allows you to copy from a pbuffer or window to texture memory. This function is supported through an extension in OpenGL 1.0 but is part of OpenGL 1.1.

For framebuffer memory management, consider the following tips:

Use pbuffers. pbuffers are allocated by “layer” in unused portions of the framebuffer.
If you have deep windows, such as multisampled or quad-buffered windows, then you will have less space in the framebuffer for pbuffers.
A pbuffer is swappable (to avoid collisions with windows) but is not completely virtualized; that is, there is a limit to the number of pbuffers you can allocate. The sum of all allocated pbuffer space cannot exceed the size of the framebuffer.
A pbuffer can be volatile (subject to destruction by window operations) or nonvolatile (swapped to main memory in order to avoid destruction). Volatile pbuffers are recommended because swapping is slow. Treat volatile pbuffers as if they were windows, subject to exposure events.

Optimizing State Changes

The following items provide guidelines for optimizing state changes:

As a rule, it is more efficient to change state when the relevant function is disabled than when it is enabled.

For example, when changing line width for antialiased lines, make the following calls:
glLineWidth(width); glEnable(GL_LINE_SMOOTH);
As a result of these calls, the line filter table is computed just once when line antialiasing is enabled. The table may be computed twice (once when antialiasing is enabled and again when the line width is changed) if you make the following calls:
glEnable(GL_LINE_SMOOTH); glLineWidth(width);
As a result, it may be best to disable a feature if you plan to change state and then enable it after the change.
The following mode changes are fast: sample mask, logic op, depth function, alpha function, stencil modes, shade model, cullface, texture environment, matrix transforms.
The following mode changes are slow: texture binding, matrix mode, lighting, point size, line width.
For the best results, map the near clipping plane to 0.0 and the far clipping plane to 1.0 (this is the default). Note that a different mapping (for example 0.0 and 0.9) will still yield a good result. A reverse mapping, such as near = 1.0 and far = 0.0, noticeably decreases depth-buffer precision.
When using a visual with a 1-bit stencil, it is faster to clear both the depth buffer and stencil buffer than it is to clear the depth buffer alone.
Use the color matrix extension for swapping and smearing color channels. The implementation is optimized for cases in which the matrix is composed of zeros and ones.
Be sure to check for the usual things: indirect contexts, drawing images with depth buffering enabled, and so on.
Triangle strips work best when their length is a multiple of 10 triangles (a 10-triangle strip has 12 vertices).
InfiniteReality systems optimize 1-component pixel draw operations. They are also faster when the pixel host format matches the destination format.
Bitmaps have high setup overhead. Consider these approaches:
- If possible, draw text using textured polygons. Put the entire font in a texture and use texture coordinates to select letters.
- To use bitmaps efficiently, compile them into display lists. Consider combining more than one into a single bitmap to save overhead.
- Avoid drawing bitmaps with invalid raster positions. Pixels are eliminated late in the pipeline and drawing to an invalid position is almost as expensive as drawing to a valid position.
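As an illustration of the triangle strip guideline in the list above, the following sketch cuts a long strip into pieces of 10 triangles (12 vertices) each. The StripChunk type and function names are hypothetical, and cutting at exactly 10 triangles is only one way to follow the guideline. The final piece may be shorter when the total is not a multiple of 10.

```c
#include <assert.h>

#define TRIANGLES_PER_STRIP 10

typedef struct {
    int first_vertex; /* index of the first vertex of the piece */
    int vertex_count; /* number of vertices to draw for the piece */
} StripChunk;

/* Number of pieces needed for a strip of `triangles` triangles. */
int strip_chunk_count(int triangles)
{
    assert(triangles > 0);
    return (triangles + TRIANGLES_PER_STRIP - 1) / TRIANGLES_PER_STRIP;
}

/* Piece `index` of a long strip. Consecutive pieces share two vertices.
   Every piece starts on an even vertex index, so the triangle winding
   of the original strip is preserved. */
StripChunk strip_chunk(int triangles, int index)
{
    StripChunk c;
    int first_triangle;
    int remaining;

    assert(index >= 0 && index < strip_chunk_count(triangles));
    first_triangle = index * TRIANGLES_PER_STRIP;
    remaining = triangles - first_triangle;
    if (remaining > TRIANGLES_PER_STRIP)
        remaining = TRIANGLES_PER_STRIP;

    c.first_vertex = first_triangle; /* triangle k uses vertices k..k+2 */
    c.vertex_count = remaining + 2;
    return c;
}
```

Each piece could then be drawn with a call such as glDrawArrays(GL_TRIANGLE_STRIP, c.first_vertex, c.vertex_count) on vertex arrays that hold the original strip.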

Miscellaneous Performance Hints

The following are some miscellaneous performance hints:

Minimize the amount of data sent to the pipeline.
- Use display lists as a cache for geometry. Using display lists is critical on Onyx systems. It is less critical, but still recommended, on Onyx2 systems. The performance of the two systems differs because the bus between the host and the graphics is faster on Onyx2 systems.
  
  The display list priority extension (see “SGIX_list_priority—The List Priority Extension” in Chapter 12) can be used to manage display list memory efficiently.
- Use texture memory or offscreen framebuffer memory (pbuffers) as a cache for pixels.
- Use small data types aligned for immediate-mode drawing (such as RGBA color packed into a 32-bit word, surface normals packed as three shorts, texture coordinates packed as two shorts). Smaller data types mean, in effect, less data to transfer.
Render with exactly one thread per pipe.
Use multiple OpenGL rendering contexts sparingly.

Assuming no texture swapping, the rendering context-switching rate is about 60,000 calls per second. Therefore, each call to glXMakeCurrent() costs the equivalent of 100 textured triangles or 800 32-bit pixels.
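The cost of context switching can be put in per-frame terms with a little arithmetic. The sketch below uses only the two figures quoted above (about 60,000 switches per second and 100 textured triangles per switch); the function names are hypothetical, and the results are estimates only.

```c
#include <assert.h>

/* Figures quoted in the text for InfiniteReality systems. */
#define SWITCHES_PER_SECOND  60000.0
#define TRIANGLES_PER_SWITCH 100.0

/* Fraction of each frame spent in glXMakeCurrent() calls. */
double switch_frame_fraction(int switches_per_frame, double frames_per_second)
{
    assert(switches_per_frame >= 0 && frames_per_second > 0.0);
    return switches_per_frame * frames_per_second / SWITCHES_PER_SECOND;
}

/* Textured triangles that could have been drawn instead of switching. */
double triangles_forgone(int switches_per_frame)
{
    assert(switches_per_frame >= 0);
    return switches_per_frame * TRIANGLES_PER_SWITCH;
}
```

For example, 100 context switches per frame at 60 frames per second amount to 6,000 switches per second, which is one tenth of the quoted rate and therefore about one tenth of every frame.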

Optimizing Performance on Onyx4 and Silicon Graphics Prism Systems

This section describes OpenGL performance optimizations for Onyx4 and Silicon Graphics Prism systems. Both Onyx4 and Silicon Graphics Prism systems use commodity GPUs. Compared to older SGI graphics systems such as InfiniteReality and VPro, graphics hardware of this type differs substantially in features and in how peak performance (the fast paths) is achieved.

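This section covers geometry optimizations, texturing optimizations, pixel optimizations, and the differences between Onyx4 and Silicon Graphics Prism systems.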

For a more complete discussion of performance issues, including higher-level issues such as multipipe scaling, see the document Silicon Graphics UltimateVision Graphics Porting Guide. Also refer to the latest platform-specific documentation and release notes for your system, because additional performance optimizations and fast paths continue to be added.

Geometry Optimizations: Drawing Vertices

On older SGI graphics systems, immediate-mode rendering could usually reach peak performance of the geometry pipeline. However, the geometry pipeline capacity in Onyx4 and Silicon Graphics Prism GPUs greatly exceeds the available CPU-to-graphics bandwidth. The fastest paths for geometry on Onyx4 and Silicon Graphics Prism systems are either display lists or vertex buffer objects.

It is usually easiest to use display lists when porting older applications. When constructing display lists, a variety of optimizations are performed by the system. Some of these optimizations may be controlled by environment variables, as defined in platform-specific documentation.

Vertex buffer objects (using the ARB_vertex_buffer_object extension) are the preferred fast path when writing new code. When drawing indexed geometry, make sure to store both vertex array data and the array index data in buffer objects.
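The following sketch prepares indexed geometry in the form that buffer objects expect: one array of vertex data and one array of indices. It builds a simple grid and makes no OpenGL calls, so the upload step is only described afterward. The function names and the grid layout are hypothetical choices for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Fill `xy` with (cols+1)*(rows+1) grid points, two floats per vertex,
   covering the unit square. Returns the number of vertices written. */
size_t build_grid_vertices(float *xy, int cols, int rows)
{
    size_t n = 0;

    assert(xy != NULL && cols > 0 && rows > 0);
    for (int r = 0; r <= rows; r++)
        for (int c = 0; c <= cols; c++) {
            xy[2 * n]     = (float)c / (float)cols;
            xy[2 * n + 1] = (float)r / (float)rows;
            n++;
        }
    return n;
}

/* Fill `indices` with two counterclockwise triangles per grid cell.
   Returns the number of indices written (6 per cell). */
size_t build_grid_indices(unsigned short *indices, int cols, int rows)
{
    size_t n = 0;

    assert(indices != NULL && cols > 0 && rows > 0);
    assert((cols + 1) * (rows + 1) <= 65536); /* must fit in 16 bits */
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++) {
            unsigned short v0 = (unsigned short)(r * (cols + 1) + c);
            unsigned short v1 = (unsigned short)(v0 + 1);
            unsigned short v2 = (unsigned short)(v0 + cols + 1);
            unsigned short v3 = (unsigned short)(v2 + 1);

            indices[n++] = v0; indices[n++] = v1; indices[n++] = v2;
            indices[n++] = v1; indices[n++] = v3; indices[n++] = v2;
        }
    return n;
}
```

With the ARB_vertex_buffer_object extension, the vertex array would be uploaded with glBufferDataARB() on a buffer bound to GL_ARRAY_BUFFER_ARB and the index array on a buffer bound to GL_ELEMENT_ARRAY_BUFFER_ARB, after which glDrawElements() with type GL_UNSIGNED_SHORT draws from both buffers.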

Drawing geometry using vertex buffer objects or display lists can be more than five times faster than immediate-mode rendering. The performance gain is typically larger on Onyx4 systems than on Silicon Graphics Prism systems.

Texturing Optimizations: Loading and Rendering Texture Images

The GPUs in Onyx4 and Silicon Graphics Prism systems support less texture memory than InfiniteReality systems. In addition, texture memory is shared with framebuffer and display list memory. This memory sharing may further reduce available texture memory depending on the framebuffer configuration, use of pixel buffers, size of display lists, and so on. However, you can reduce the texture memory requirements through the use of compressed texture formats.

Using OpenGL 1.3 core features and the EXT_texture_compression_s3tc extension, Onyx4 and Silicon Graphics Prism systems both support compressed texture formats. Compressed textures use approximately one-sixth of the space required for an equivalent uncompressed texture and require correspondingly less graphics memory bandwidth when rendering. Texture compression should be used whenever the resulting image quality loss is acceptable. When texture compression is not acceptable, use the fastest uncompressed texture formats, as described in the following section “Pixel Optimizations: Reading and Writing Pixel Data”.
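The one-sixth figure can be checked with a short calculation. S3TC stores each 4x4 block of texels in a fixed number of bytes: 8 bytes per block for the DXT1 formats and 16 bytes per block for the DXT3 and DXT5 formats. The sketch below compares these sizes with uncompressed storage for a single image level; the function names are hypothetical, and the sizes ignore mipmaps and any padding the driver may add.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes for one uncompressed image level. */
size_t uncompressed_bytes(int width, int height, int bytes_per_texel)
{
    assert(width > 0 && height > 0 && bytes_per_texel > 0);
    return (size_t)width * (size_t)height * (size_t)bytes_per_texel;
}

/* Bytes for one S3TC image level. Use 8 bytes per block for DXT1 and
   16 bytes per block for DXT3 or DXT5. Partial blocks count as whole. */
size_t s3tc_bytes(int width, int height, int bytes_per_block)
{
    size_t blocks_wide;
    size_t blocks_high;

    assert(width > 0 && height > 0);
    assert(bytes_per_block == 8 || bytes_per_block == 16);
    blocks_wide = ((size_t)width + 3) / 4;
    blocks_high = ((size_t)height + 3) / 4;
    return blocks_wide * blocks_high * (size_t)bytes_per_block;
}
```

For a 256x256 image, DXT1 needs 32,768 bytes against 196,608 bytes for 24-bit RGB, which is the one-sixth ratio. DXT5 against 32-bit RGBA gives one-quarter instead, so the saving depends on the formats being compared.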

In some cases, the graphics drivers may automatically compress textures by default. Refer to platform-specific documentation for more information about controlling this process.

Pixel Optimizations: Reading and Writing Pixel Data

When you use functions like glDrawPixels(), glReadPixels(), and glTexImage2D() to transfer pixel and uncompressed texture data between the CPU and graphics pipeline, it is much faster when you use pixel format and type combinations that are efficiently supported by the GPUs and drivers.

When reading and writing pixel data, the format GL_RGBA and type GL_UNSIGNED_BYTE are fastest. When reading and writing uncompressed texture images, the same format and type are fastest, as well as the internal texture format GL_RGBA. When writing format GL_DEPTH_COMPONENT (depth buffer data), the type GL_UNSIGNED_SHORT is fastest.

Other combinations of pixel format and type require additional conversion and packing/unpacking steps. Some additional format/type combinations may be optimized in the future; refer to the platform release notes for more information.
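A program can check its transfers against the combinations listed above before issuing them. The following sketch encodes only the combinations named in this section; the PixelLayout type and function names are hypothetical. The enumerant values are the standard ones from the OpenGL headers and are defined here only so that the sketch compiles without those headers.

```c
#include <assert.h>
#include <stddef.h>

/* Standard OpenGL enumerant values, normally supplied by <GL/gl.h>. */
#ifndef GL_RGBA
#define GL_DEPTH_COMPONENT 0x1902
#define GL_RGB             0x1907
#define GL_RGBA            0x1908
#define GL_UNSIGNED_BYTE   0x1401
#define GL_UNSIGNED_SHORT  0x1403
#endif

typedef struct {
    unsigned format; /* for example GL_RGBA */
    unsigned type;   /* for example GL_UNSIGNED_BYTE */
    int writing;     /* nonzero for writes, zero for reads */
} PixelLayout;

/* Returns 1 for the combinations this section names as fastest. */
int is_fast_pixel_path(unsigned format, unsigned type, int writing)
{
    if (format == GL_RGBA && type == GL_UNSIGNED_BYTE)
        return 1; /* named as fastest for both reading and writing */
    if (writing && format == GL_DEPTH_COMPONENT && type == GL_UNSIGNED_SHORT)
        return 1; /* named as fastest for depth writes only */
    return 0;     /* not named here; may need conversion */
}

/* Count the planned transfers that fall off the listed fast paths. */
size_t count_slow_transfers(const PixelLayout *layouts, size_t count)
{
    size_t slow = 0;

    assert(layouts != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        if (!is_fast_pixel_path(layouts[i].format, layouts[i].type,
                                layouts[i].writing))
            slow++;
    return slow;
}
```

A nonzero count does not prove that a transfer is slow on a given driver release; it only marks combinations that the section does not list, which are worth measuring.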

Differences Between Onyx4 and Silicon Graphics Prism Systems

In contrast to older SGI graphics systems, Onyx4 and Silicon Graphics Prism systems both use commodity GPUs. The optimizations cited earlier in this section are applicable to commodity GPUs on any system. However, the following differences between Onyx4 and Silicon Graphics Prism systems may affect performance:

Onyx4 systems use MIPS CPUs while Silicon Graphics Prism systems use Intel Itanium CPUs. In general, Silicon Graphics Prism systems have higher CPU performance and greater memory bandwidth compared to Onyx4 systems. This affects compute-bound applications.
Onyx4 systems run the IRIX operating system while Silicon Graphics Prism systems run Linux. The OpenGL and X feature sets of the two systems are very similar, and the operating system differences generally do not, in and of themselves, affect performance.
Onyx4 systems use a PCI-X interface between the CPU and GPU while Silicon Graphics Prism systems use an AGP 8x interface. The AGP interface offers considerably higher bandwidth, which will improve performance for immediate-mode rendering, pixel and texture uploads and downloads, and other operations that must shift large amounts of data between the CPU and GPU.

However, even on Silicon Graphics Prism systems, it is important to follow the advice cited earlier in this section regarding use of display lists and vertex buffer objects, efficient texture and pixel formats, etc. Neither the PCI-X interface nor the AGP interface is capable of feeding data to GPUs at a transfer rate equal to their processing rate. Hence, caching data on GPUs is critical to peak performance.
Silicon Graphics Prism systems are a more recent product line and will have more opportunities for CPU and GPU upgrades, resulting in greater performance and more OpenGL features.
