One often-heard claim during the past 30 years is that performance doesn't matter because the cost of computational power is constantly dropping. Therefore, buying a faster machine or extending the RAM of an existing one can make up for the sluggish performance of software written in a high-level programming language. In other words, a hardware upgrade is more cost-effective than the laborious task of hand-tuning code. That might be correct for client applications that execute on a standard personal computer. A modestly priced personal computer these days offers more computational power than a mainframe did two decades ago, and that power still doubles every 18 months or so. However, in many other application domains, a hardware upgrade is less favorable because it is too expensive or because it simply is not an option. In proprietary embedded systems with 128K of RAM or less, extending the RAM requires redesigning the entire system from scratch, as well as investing several years in the development and testing of the new chips. In this case, code optimization is the only viable way to achieve satisfactory performance.
But optimization is not confined to esoteric application domains such as embedded systems or hard real-time applications. Even in mainstream application domains such as financial and billing systems, code optimization is sometimes necessary. For a bank that owns a $1,500,000 mainframe computer, rewriting a few thousand lines of critical code is preferable to buying a faster machine. Code optimization is also the primary tool for achieving satisfactory performance from server applications that support numerous users, such as Relational Database Management Systems and Web servers.
Another common belief is that code optimization implies less readable and harder to maintain software. This is not necessarily true. Sometimes, simple code modifications such as relocating the declarations in a source file or choosing a different container type can make all the difference in the world. Yet none of these changes entails unreadable code, nor do they incur any additional maintenance overhead. In fact, some optimization techniques can even improve the software's extensibility and readability. More aggressive optimizations range from simplifying a class hierarchy to embedding inline assembly code; the result in those cases is code that is less readable, harder to maintain, and less portable. Optimization can be viewed as a continuum; the extent to which it is applied depends on a variety of considerations.
Optimization is a vast subject that can easily fill a few thick volumes. This chapter discusses various optimization techniques, most of which can be applied to C++ code without requiring a deep understanding of the underlying hardware architecture of a particular platform. The intent is to give you a rough estimate of the performance cost of choosing one programming strategy over another (you can experiment with the programs that are discussed in the following sections on your computer). The purpose is to provide practical guidelines and notions rather than to delve into theoretical aspects of performance analysis, the efficiency of algorithms, or Big-O notation.
Detecting a program's bottlenecks is the first step in optimizing it. It is important, however, to profile the release version rather than the debug version of the program, because the debug version contains additional code. A debug-enabled executable can be about 40% larger than the equivalent release executable. The extra code is required for symbol lookup and other debug "scaffolding". Most implementations provide distinct debug and release versions of operator new and other library functions. Usually, the debug version of new initializes the allocated memory with a unique value and adds a header at the start of each block; the release version of new performs neither of these tasks. Furthermore, a release version of an executable might already have been optimized in several ways, including the elimination of unnecessary temporary objects, loop unrolling (see the sidebar "A Few Compiler Tricks"), moving variables into registers, and inlining. For these reasons, you cannot reliably deduce from a debug version where the performance bottlenecks are actually located.
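To illustrate the difference, here is a minimal sketch of what a debug version of operator new might do. The header layout and the 0xCD fill value are illustrative assumptions for this example, not the scheme of any particular vendor:

#include <cstdlib>
#include <cstring>
#include <new>
using namespace std;

void* operator new(size_t size) //illustrative debug version only
{
    //allocate extra room and record the requested size in a header at block start
    unsigned char *raw = static_cast<unsigned char*>(malloc(sizeof(size_t) + size));
    if (raw == 0)
        throw bad_alloc();
    memcpy(raw, &size, sizeof(size_t));
    unsigned char *user = raw + sizeof(size_t);
    memset(user, 0xCD, size); //fill the block with a unique, recognizable value
    return user; //a matching debug operator delete (not shown) must step back over the header
}

A release version of operator new would simply allocate size bytes and return them untouched, which is one reason the two builds behave so differently under a profiler.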
A Few Compiler Tricks

A compiler can automatically optimize the code in several ways. The named return value and loop unrolling are two instances of such automatic optimizations.
Consider the following code:
int *buff = new int[3];
for (int i = 0; i < 3; i++)
    buff[i] = 0;

This loop is inefficient: on every iteration, it assigns a value to the next array element. However, precious CPU time is also wasted on testing and incrementing the counter's value and performing a jump. To avoid this overhead, the compiler can unroll the loop into a sequence of three assignment statements, as follows:
buff[0] = 0;
buff[1] = 0;
buff[2] = 0;

The named return value is a C++-specific optimization that eliminates the construction and destruction of a temporary object. When a temporary object is copied to another object using a copy constructor, and when both of these objects are cv-unqualified, the Standard allows the implementation to treat the two objects as one, and not perform a copy at all. For example

class A
{
public:
    A();
    ~A();
    A(const A&);
    A& operator=(const A&);
};

A f()
{
    A a;
    return a;
}

A a2 = f();

The object a does not need to be copied when f() returns. Instead, the return value of f() can be constructed directly into the object a2, thereby avoiding both the construction and destruction of a temporary object on the stack.
Remember also that debugging and optimization are two distinct tasks. The debug version is used to trap bugs and to verify that the program is free from logical errors. The tested release version is used for performance tuning and optimization. Of course, applying the code optimization techniques that are presented in this chapter can enhance the performance of the debug version as well, but the release version is the one that should be used for performance evaluation.
NOTE: It is not uncommon to find a "phantom bottleneck" in the debug version, which the programmer strains hard to fix, only to discover later that it has disappeared anyway in the release version. Andrew Koenig wrote an excellent article that tells the story of an evasive bottleneck that automatically dissolved in the release version ("An Example of Hidden Library Overhead", C++ Report vol. 10:2, February 1998, page 11). The lesson that can be learned from this article is applicable to everyone who practices code optimization.
The placement of declarations of variables and objects in the program can have a significant effect on performance. Likewise, the choice between the postfix and prefix operators can also affect performance. This section concentrates on four issues: initialization versus assignment, relocating declarations to the part of the program that actually uses them, a constructor's member initialization list, and prefix versus postfix operators.
C allows declarations only at a block's beginning, before any program statements. For example
void f();
void g()
{
    int i;
    double d;
    char * p;
    f();
}
In C++, a declaration is a statement; as such, it can appear almost anywhere within the program. For example
void f();
void g()
{
    int i;
    f();
    double d;
    char * p;
}
The motivation for this change in C++ was to allow declarations of objects right before they are used. There are two benefits to this practice. First, it guarantees that an object cannot be tampered with by other parts of the program before it is used. When objects are declared at the beginning of a block but are used only 20 or 50 lines later, there is no such guarantee. For instance, a pointer to an object that was allocated on the free store might be accidentally deleted somewhere before it is actually used. Declaring the pointer right before it is used, however, reduces the likelihood of such mishaps.
The second benefit of declaring objects right before their use is the capability to initialize them immediately with the desired value. For example
#include <string>
using namespace std;

void func(const string& s)
{
    bool emp = s.empty(); //local declaration enables immediate initialization
}
For fundamental types, initialization is only marginally more efficient than assignment; on some implementations, the two are identical in performance. Consider the following version of func(), which applies assignment rather than initialization:
void func2() //less efficient than func()? Not necessarily
{
    string s;
    bool emp;
    emp = s.empty(); //late assignment
}
My compiler produces the same assembly code for both versions. However, as far as user-defined types are concerned, the difference between initialization and assignment can be quite noticeable. The following example demonstrates the performance gain in this case by modifying the preceding example: instead of a bool variable, a full-blown class object is used, one that defines all four special member functions:
int constructor, assignment_op, copy, destr; //global counters

class C
{
public:
    C();
    C& operator=(const C&);
    C(const C&);
    ~C();
};

C::C() { ++constructor; }
C& C::operator=(const C& other) { ++assignment_op; return *this; }
C::C(const C& other) { ++copy; }
C::~C() { ++destr; }
As in the previous example, two different versions of the same function are compared; one uses assignment and the other uses object initialization:
void assign(const C& c1)
{
    C c2;
    c2 = c1;
}

void initialize(const C& c1)
{
    C c2 = c1;
}
Calling assign() causes three member function invocations: one for the constructor, one for the assignment operator, and one for the destructor. initialize() causes only two member function invocations: the copy constructor and the destructor. Initialization saves one function call. For a nonsensical class such as C, the additional runtime penalty that results from a superfluous constructor call might not be crucial. However, bear in mind that constructors of real-world objects also invoke constructors of their base classes and embedded objects. When there is a choice between initialization and assignment, therefore, initialization is always preferable.
Preferring initialization of objects over assignment is one aspect of localizing declarations. On some occasions, the performance boost that can result from moving declarations is even more appreciable. Consider the following example:
bool is_C_Needed();

void use()
{
    C c1;
    if (is_C_Needed() == false)
    {
        return; //c1 was not needed
    }
    //use c1 here
    return;
}
The local object c1 is unconditionally constructed and destroyed in use(), even if it is not used at all. The compiler transforms the body of use() into something that looks like this:
void use()
{
    C c1;
    c1.C::C(); //1. compiler-added constructor call
    if (is_C_Needed() == false)
    {
        c1.C::~C(); //2. compiler-added destructor call
        return; //c1 was not needed but was constructed and destroyed anyway
    }
    //use c1 here
    c1.C::~C(); //3. compiler-added destructor call
    return;
}
As you can see, when is_C_Needed() returns false, the unnecessary construction and destruction of c1 are still unavoidable. Can a clever compiler optimize away the unnecessary construction and destruction in this case? The Standard allows the compiler to suppress the creation (and consequently, the destruction) of an object if it is not needed, and if neither its constructor nor its destructor has any side effects. In this example, however, the compiler cannot perform this feat for two reasons. First, both the constructor and the destructor of c1 have side effects -- they increment counters. Second, the result of is_C_Needed() is unknown at compile time; therefore, there is no guarantee that c1 is actually unnecessary at runtime. Nevertheless, with a little help from the programmer, the unnecessary construction and destruction can be eliminated. All that is required is the relocation of the declaration of c1 to the point where it is actually used:
void use()
{
    if (is_C_Needed() == false)
    {
        return; //c1 was not needed
    }
    C c1; //moved from the block's beginning
    //use c1 here
    return;
}
Consequently, the object c1 is constructed only when it is really needed -- that is, when is_C_Needed() returns true. On the other hand, if is_C_Needed() returns false, c1 is neither constructed nor destroyed. Thus, simply by moving the declaration of c1, you managed to eliminate two unnecessary member function calls! How does it work? The compiler transforms the body of use() into something such as the following:
void use()
{
    if (is_C_Needed() == false)
    {
        return; //c1 was not needed
    }
    C c1; //moved from the block's beginning
    c1.C::C(); //1. compiler-added constructor call
    //use c1 here
    c1.C::~C(); //2. compiler-added destructor call
    return;
}
To realize the effect of this optimization, change the body of use(). Instead of constructing a single object, you now use an array of 1000 C objects:
void use()
{
    if (is_C_Needed() == false)
    {
        return; //c1 was not needed
    }
    C c1[1000];
    //use c1 here
    return;
}
In addition, you define is_C_Needed() to return false:
bool is_C_Needed()
{
    return false;
}
Finally, the main() driver looks similar to the following:
int main()
{
    for (int j = 0; j < 100000; j++)
        use();
    return 0;
}
The two versions of use() differ dramatically in their performance. They were compared on a Pentium II, 233MHz machine. To corroborate the results, the test was repeated five times. When the optimized version was used, the for loop in main() took less than 0.02 seconds, on average. However, when the same for loop was executed with the original, nonoptimized version of use(), it took 16 seconds. The dramatic variation in these results isn't too surprising; after all, the nonoptimized version incurs 100,000,000 constructor calls as well as 100,000,000 destructor calls, whereas the optimized version calls none. These results also hint at the performance gain that can be achieved simply by preallocating sufficient storage for container objects, rather than allowing them to reallocate repeatedly (see also Chapter 10, "STL and Generic Programming").
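The same principle can be demonstrated with a standard container; the following is a minimal sketch that assumes nothing beyond std::vector and its reserve() member:

#include <vector>
using namespace std;

void fill_no_prealloc(vector<int>& v)
{
    for (int i = 0; i < 1000000; i++)
        v.push_back(i); //may trigger repeated reallocations and element copies
}

void fill_prealloc(vector<int>& v)
{
    v.reserve(1000000); //allocate sufficient storage once, up front
    for (int i = 0; i < 1000000; i++)
        v.push_back(i); //no reallocation occurs inside the loop
}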
As you read in Chapter 4, "Special Member Functions: Default Constructor, Copy Constructor, Destructor, and Assignment Operator," a member initialization list is needed for the initialization of const and reference data members, and for passing arguments to a constructor of a base or embedded subobject. Otherwise, data members can either be assigned inside the constructor body or initialized in a member initialization list. For example
class Date //mem-initialization version
{
private:
    int day;
    int month;
    int year;
    //constructor and destructor
public:
    Date(int d = 0, int m = 0, int y = 0) : day(d), month(m), year(y) {}
};
Alternatively, you can define the constructor as follows:
Date::Date(int d, int m, int y) //assignment within the constructor body
{
    day = d;
    month = m;
    year = y;
}
Is there a difference in terms of performance between the two constructors? Not in this example. All the data members in Date are of a fundamental type. Therefore, initializing them by a mem-initialization list is identical in terms of performance to assignment within the constructor body. However, with user-defined types, the difference between the two forms is significant. To demonstrate that, return to the member function counting class, C, and define another class that contains two instances thereof:
class Person
{
private:
    C c_1;
    C c_2;
public:
    Person(const C& c1, const C& c2) : c_1(c1), c_2(c2) {}
};
An alternative version of Person's constructor looks similar to the following:
Person::Person(const C& c1, const C& c2)
{
    c_1 = c1;
    c_2 = c2;
}
Finally, the main() driver is defined as follows:
int main()
{
    C c; //created only once, used as dummy arguments in Person's constructor
    for (int j = 0; j < 30000000; j++)
    {
        Person p(c, c);
    }
    return 0;
}
The two versions were compared on a Pentium II, 233MHz machine. To corroborate the results, the test was repeated five times. When a member initialization list was used, the for loop in main() took 12 seconds, on average. The nonoptimized version took 15 seconds, on average. In other words, assignment inside the constructor body is about 25% slower than member initialization. The member function counters give a clue as to the reason for the difference. Table 12.1 presents the number of member function calls of class C for the member-initializing constructor and for the assigning constructor.
Table 12.1  Member Function Calls of Class C

Initialization Method           Default Constructor Calls   Assignment Operator Calls   Copy Constructor Calls   Destructor Calls
Member initialization list      0                           0                           60,000,000               60,000,000
Assignment within constructor   60,000,000                  60,000,000                  0                        60,000,000
When a member initialization list is used, only the copy constructor and the destructor of the embedded object are called (note that Person has two embedded members), whereas the assignment within the constructor body also adds a default constructor call per embedded object. In Chapter 4, you learned how the compiler inserts additional code into the constructor's body before any user-written code. The additional code invokes the constructors of the base classes and embedded objects of the class. In the case of polymorphic classes, this code also initializes the vptr. The assigning constructor of class Person is transformed into something such as the following:
Person::Person(const C& c1, const C& c2) //assignment within constructor body
{
    //pseudo C++ code inserted by the compiler before user-written code
    c_1.C::C(); //invoke default constructor of embedded object c_1
    c_2.C::C(); //invoke default constructor of embedded object c_2
    //user-written code comes here:
    c_1 = c1;
    c_2 = c2;
}
The default construction of the embedded objects is unnecessary because they are reassigned new values immediately afterward. The member initialization list, on the other hand, appears before any user-written code in the constructor. Because the constructor body does not contain any user-written code in this case, the transformed constructor looks similar to the following:
Person::Person(const C& c1, const C& c2) //member initialization list ctor
{
    //pseudo C++ code inserted by the compiler before user-written code
    c_1.C::C(c1); //invoke copy constructor of embedded object c_1
    c_2.C::C(c2); //invoke copy constructor of embedded object c_2
    //user-written code comes here (note: there's no user code)
}
You can conclude from this example that for a class that has subobjects, a member initialization list is preferable to an assignment within the constructor's body. For this reason, many programmers use member initialization lists across the board, even for data members of fundamental types.
The prefix operators ++ and -- tend to be more efficient than their postfix versions because when postfix operators are used, a temporary copy is needed to retain the value of the operand before it is changed. For fundamental types, the compiler can eliminate the extra copy. However, for user-defined types, this is nearly impossible. A typical implementation of the overloaded prefix and postfix operators demonstrates the difference between the two:
class Date
{
private:
    //...
    int AddDays(int d);
public:
    Date operator++(int unused);
    Date& operator++();
};

Date Date::operator++(int unused) //postfix
{
    Date temp(*this); //create a copy of the current object
    this->AddDays(1); //increment current object
    return temp; //return by value a copy of the object before it was incremented
}

Date& Date::operator++() //prefix
{
    this->AddDays(1); //increment current object
    return *this; //return by reference the current object
}
The overloaded postfix ++ is significantly less efficient than the prefix for two reasons: It requires the creation of a temporary copy, and it returns that copy by value. Therefore, whenever you are free to choose between postfix and prefix operators of an object, choose the prefix version.
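Iterator loops are a typical case in point. In the following sketch, the loop control variable is a full-blown object, so the prefix form spares the creation of a temporary iterator on every iteration:

#include <vector>
using namespace std;

void reset(vector<int>& v)
{
    for (vector<int>::iterator it = v.begin(); it != v.end(); ++it) //prefix ++
        *it = 0;
}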
Inline functions can eliminate the overhead incurred by a function call and still provide the advantages of ordinary functions. However, inlining is not a panacea. In some situations, it can even degrade the program's performance. It is important to use this feature judiciously.
The exact cost of an ordinary function call is implementation-dependent. It usually involves storing the current stack state, pushing the arguments of the function onto the stack and initializing them, and jumping to the memory address that contains the function's instructions -- only then does the function begin to execute. When the function returns, a sequence of reverse operations also takes place. In other languages (such as Pascal and COBOL), the overhead of a function call is even more noticeable because there are additional operations that the implementation performs before and after a function call. For a member function that merely returns the value of a data member, this overhead can be unacceptable. Inline functions were added to C++ to allow efficient implementation of such accessor and mutator member functions (getters and setters, respectively). Nonmember functions can also be declared inline.
The benefits of inlining a function are significant: From a user's point of view, the inlined function looks like an ordinary function. It can have arguments and a return value; furthermore, it has its own scope, yet it does not incur the overhead of a full-blown function call. In addition, it is remarkably safer and easier to debug than using a macro. But there are even more benefits. When the body of a function is inlined, the compiler can optimize the resultant code even further by applying context-specific optimizations that it cannot perform on the function's code alone.
All member functions that are implemented inside the class body are implicitly declared inline. In addition, compiler-synthesized constructors, copy constructors, assignment operators, and destructors are implicitly declared inline. For example
class A
{
private:
    int a;
public:
    int Get_a() { return a; } //implicitly inline
    virtual void Set_a(int aa) { a = aa; } //implicitly inline
    //compiler synthesized canonical member functions also declared inline
};
It is important to realize, however, that the inline specifier is merely a recommendation to the compiler. The compiler is free to ignore this recommendation and outline the function; it can also inline a function that was not explicitly declared inline. Fortunately, C++ guarantees that the function's semantics cannot be altered by the compiler just because it is or is not inlined. For example, it is possible to take the address of a function that was not declared inline, regardless of whether it was inlined by the compiler (the result, however, is the creation of an outline copy of the function). How do compilers determine which functions are to be inlined and which are not? They have proprietary heuristics that are designed to pick the best candidates for inlining, depending on various criteria. These criteria include the size of the function body, whether it declares local variables, its complexity (for example, recursion and loops usually disqualify a function from inlining), and additional implementation- and context-dependent factors.
Theoretically, when the compiler refuses to inline a function, that function is then treated like an ordinary function: The compiler generates the object code for it, and invocations of the function are transformed into a jump to its memory address. Unfortunately, the implications of outlining a function are more complicated than that. It is a common practice to define inline functions in the class declaration. For example
// filename Time.h
#include <ctime>
#include <iostream>
using namespace std;

class Time
{
public:
    inline void Show()
    {
        for (int i = 0; i < 10; i++)
            cout << time(0) << endl;
    }
};
// Time.h
Because the member function Time::Show() contains a local variable and a for loop, the compiler is likely to ignore the inline request and treat it as an ordinary member function. However, the class declaration itself can be #included in separately compiled translation units:
// filename f1.cpp
#include "Time.h"
void f1()
{
    Time t1;
    t1.Show();
}
// f1.cpp

// filename f2.cpp
#include "Time.h"
void f2()
{
    Time t2;
    t2.Show();
}
// f2.cpp
As a result, the compiler generates two identical copies of the same member function for the same program:
void f1();
void f2();

int main()
{
    f1();
    f2();
    return 0;
}
When the program is linked, the linker is faced with two identical copies of Time::Show(). Normally, function redefinition causes a link-time error. Un-inlined functions are a special case, however. Older implementations of C++ coped with this situation by treating an un-inlined function as if it had been declared static. Consequently, each copy of the compiled function was only visible within the translation unit in which it was declared. This solved the name clashing problem at the cost of multiple local copies of the same function. In this case, the inline declaration did not boost performance; on the contrary, every call of the un-inlined function was resolved as an ordinary function call with the regular overhead. Even worse, the multiple copies of the function code increased compilation and linkage time and bloated the size of the executable. Ironically, not declaring Time::Show() inline might have yielded better performance! Remember that the programmer is generally not aware of all the actual costs of this -- the compiler strains quietly, the linker sighs silently, and the resultant executable is more bloated and sluggish than ever. But it still works, and the users scratch their heads, saying, "This object-oriented programming is really awful! I'm sure this app would run much faster if I'd written it in C!".
Fortunately, the Standard's specification regarding un-inlined functions was recently changed. A Standard compliant implementation generates only a single copy of such a function, regardless of the number of translation units that define it. In other words, an un-inlined function is treated similarly to an ordinary function. However, it might take some time for all compiler vendors to adopt the new specifications.
There are two more conundrums associated with inline functions. The first has to do with maintenance. A function can begin its life as a slim inline function, offering the benefits that were previously described. At a later phase in the lifetime of the system, the function body can be extended to include additional functionality, resulting from changes in the implementation of its class. Suddenly, the inline substitution can become inefficient or even impossible. It is therefore important to reconsider whether such functions should remain inline. For member functions that are defined in the class body, the change is more complicated because the function definition has to be moved to a separate source file.
Another problem might arise when inline functions are used in code libraries. It is impossible to maintain binary compatibility if the definition of an inline function changes. In this case, the users must recompile their code to reflect the change. For a non-inline function, the users only need to relink their code, which is considerably less of a burden than a massive recompilation and relink.
The lesson here is that inline is not a magical potion for enhancing performance. For very short functions -- for example, accessors, mutators, and function wrappers (see Chapter 13, "C Language Compatibility Issues") -- the inline specifier can be profitable in terms of both execution speed and program size. If the inlined function is not very short and it is called extensively, however, the result can be an increase in the size of the executable. Furthermore, many processors cache the machine instructions of frequently used parts of the program; excessive inlining can reduce the instruction cache hit rate and, consequently, overall performance. The real annoyance occurs when the compiler refuses to inline a function even though it was declared inline. On older implementations, the result was quite painful. On Standard compliant implementations, the consequences of un-inlining are less detrimental, but they are still undesirable. Some compilers are clever enough to figure out on their own which functions are to be inlined. However, most compilers are less inline-savvy, so it is best to examine the effect of an inline declaration empirically. If the inline declaration does not enhance performance, avoid it.
Optimization has several aspects: faster execution speed, efficient usage of system resources, and minimal usage of memory. In general, code optimization attempts to improve all these aspects. The declaration relocation technique that was demonstrated earlier eliminates the unnecessary creation and destruction of objects, thereby reducing the program's size and accelerating its runtime speed. However, other optimization techniques are heavily biased toward one direction -- speedier code or a smaller memory footprint. Sometimes, though, these goals are mutually exclusive; that is, compacting the memory footprint engenders slower code, whereas a faster code implies a larger memory footprint. This section presents various techniques for optimizing, or compacting, the memory requirements of a program.
In both C and C++ it is possible to store and access data directly in the tiniest possible unit: a single bit. Because a bit is not the natural storage unit for C/C++ implementations, the use of bit fields can increase the size of the executable due to the additional maneuvers that are exercised by the processor in accessing a sequence of one or more bits. This is a clear-cut case of sacrificing runtime speed for the sake of minimizing memory usage.
NOTE: Note, however, that some hardware architectures provide special processor instructions for accessing bits. Therefore, whether bit fields affect the program's speed or not is very much platform-dependent.
Normally, you don't use bit fields just to save a few bytes. For some applications, however, the tradeoff between execution speed and storage compaction is definitely worthwhile. For example, the billing system of an average international telephone company stores every phone call as a record in a relational database. These records are processed in batches periodically to calculate each customer's monthly bill. The database stores millions of new records every day, and it has to keep the customers' billing information for at least one year. The complete database contains around one billion records at any given time. Because the database is also backed up periodically, and because it might also be a distributed database, every record is stored in more than one physical location. In fact, there might be 20 billion records stored in different backup generations and distributed portions of the database at any given time. A minimal billing record contains the customer's ID, a timestamp, a code that indicates the type of the call (for example, local or long distance), and a code for the tariff (off-peak or peak time). Literally, every bit counts here -- one redundant bit implies 2.5GB of wasted storage!
A non-space-optimizing definition of the billing record might look like this:
struct BillingRec
{
    long cust_id;
    long timestamp;
    enum CallType
    {
        toll_free,
        local,
        regional,
        long_distance,
        international,
        cellular
    } type;
    enum CallTariff
    {
        off_peak,
        medium_rate,
        peak_time
    } tariff;
};
A BillingRec occupies no fewer than 16 bytes of memory on my 32-bit machine. Clearly, space is wasted here. The first two fields occupy four bytes each, as expected. However, the two enum variables occupy an additional eight bytes, even though they both can be safely represented in less than a single byte. A tweaked version of BillingRec can squeeze the enum values into two bit fields:
struct BillingRec
{
    long cust_id;
    long timestamp;
    enum CallType
    {
        toll_free,
        local,
        regional,
        long_distance,
        international,
        cellular
    };
    enum CallTariff
    {
        off_peak,
        medium_rate,
        peak_time
    };
    unsigned call: 3;   //three bits
    unsigned tariff: 2; //two bits
};
The size of BillingRec is now 12 bytes. The four bytes that are saved are equal to megabytes of data storage per day. Still, the size can be reduced even more. The two bit fields occupy five bits in total, which is less than a byte. One might therefore expect BillingRec to occupy 9 bytes rather than 12. The problem is that the compiler inserts three additional padding bytes after the bit fields to align the size of BillingRec on a word boundary (more on member alignment in Chapter 11, "Memory Management"). The additional padding bytes ensure faster access time -- at the cost of three wasted bytes. There are two ways to overcome this problem: You can change the compiler's setting to allow alignment on a byte boundary, or you can change the size of the other members so that -- in total -- it reaches exactly eight bytes.
NOTE: Note that both solutions might not be portable, and on some hardware architectures, the compiler will nonetheless insist on word boundary alignment. Check your compiler's specifications regarding member alignment settings.
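As an illustration of the first approach, many compilers accept the nonstandard (but widespread) #pragma pack directive. The following is only a sketch; the directive's exact spelling and effect vary among implementations, and the struct name is invented for the example:

#pragma pack(push, 1) //request byte-boundary alignment (nonstandard)
struct PackedBillingRec
{
    long cust_id;
    long timestamp;
    unsigned call: 3;
    unsigned tariff: 2;
};
#pragma pack(pop) //restore the previous alignment setting
//sizeof(PackedBillingRec) would be 9 on a typical 32-bit implementation

Remember that the packed layout trades access speed for space, exactly as discussed earlier.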
Changing the size of the members is somewhat tricky because the first two members have to become bit fields as well:
struct BillingRec
{
    int cust_id: 24; // 23 bits + 1 sign bit
    int timestamp: 24;
    enum CallType
    {
        //...
    };
    enum CallTariff
    {
        //...
    };
    unsigned call: 3;
    unsigned tariff: 2;
};
This time, BillingRec occupies eight bytes in total, which is half of its original size. The storage that is saved in this example can amount to 10GB annually. Considering the cheap prices of magnetic storage media these days, saving a few thousand dollars might not seem to be a compelling argument -- but there is another reason for favoring smaller data storage: the costs of digital communication. A distributed database has synchronized copies in multiple sites. The synchronization process is usually done by means of digital data transfer from the central database to its synchronized copies, and vice versa. The transmission of millions of records on leased lines is pretty expensive. But for a phone company that owns these lines, this is not an issue of special concern; suppose, however, that the company is an international bank that pays hundreds of dollars for every hour of data transfer. In this case, halving the data volume is unquestionably profitable. Another point to remember is the Web; if the telephone company has a Web site that enables its customers to view their billing information online, the download time of hundreds of records through analog dialup lines can be cut in half by this tweak.
Unions can also be used to minimize memory waste by locating two or more data members at the same memory address, where the value of (at most) one of the data members is active at any time. The size of a union is sufficient to hold the largest of its data members. A union can have member functions, including a constructor and destructor, but it cannot have virtual member functions. A union cannot serve as a base class of, nor can it inherit from, another class. In addition, a union cannot store objects that have nontrivial special member functions. C++ also supports anonymous unions. An anonymous union is an unnamed object of an unnamed type (anonymous unions are also discussed in Chapter 11). For example
union //anonymous
{
    long n;
    void * p;
};
n = 1000L; //members are directly accessed
p = 0;     //n is now also 0
Unlike a named union, an anonymous one cannot have member functions or nonpublic data members.
When are unions useful? The following class retrieves a person's data from a database. The key can be either a unique ID number or a person's last name, but never both at once:
class PersonalDetails
{
private:
    char * name;
    long ID;
    //...
public:
    PersonalDetails(const char *nm); //char * key used
    PersonalDetails(long id) : ID(id) {} //numeric key used
};
Memory is wasted here because only one of the keys can be used at a time. An anonymous union can be used in this case to minimize memory usage. For example
class PersonalDetails
{
private:
    union //anonymous
    {
        char * name;
        long ID;
    };
public:
    PersonalDetails(const char *nm);
    PersonalDetails(long id) : ID(id) {/**/} // direct access to a member
    //...
};
By using a union, the size of class PersonalDetails is halved. Again, saving four bytes of memory is not worth the trouble unless this class serves as a mold for millions of database records or if the records are transmitted on slow communication lines. Note that unions do not incur any runtime overhead, so there is no speed tradeoff in this case. The advantage of an anonymous union over a named one is that its members can be accessed directly.
In time-critical applications, every CPU cycle counts. This section presents a few simple guidelines for speed optimization. Some of them have been around since the early days of C; others are C++ specific.
The overhead of a function call is increased when the function has a long list of arguments. The runtime system has to initialize the stack with the values of the arguments; naturally, this operation takes longer when there are more arguments. For example, executing the following function 100,000,000 times takes 8.5 seconds on average on my machine:
void retrieve(const string& title, //5 arguments
              const string& author,
              int ISBN,
              int year,
              bool& inStore)
{}
Packing the argument list into a single class and passing it by reference as the only argument reduces the execution time to five seconds, on average. Of course, for functions that take a long time to execute, the stack initialization overhead is negligible. However, for short and fast functions that are called very often, packing a long parameter list into a single object and passing it by reference can improve performance.
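The following is a minimal sketch of the technique; the BookQuery class and its members are hypothetical names invented for this example:

#include <string>
using namespace std;

struct BookQuery //packs retrieve()'s five parameters into one object
{
    string title;
    string author;
    int ISBN;
    int year;
    bool inStore;
};

void retrieve(BookQuery& query) //one reference is pushed instead of five arguments
{}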
The storage specifier register can be used as a hint to the compiler that an object will be heavily used in the program. For example
void f()
{
    int *p = new int[3000000];
    register int *p2 = p; //store the address in a register
    for (register int j = 0; j < 3000000; j++)
    {
        *p2++ = 0;
    }
    //...use p
    delete [] p;
}
Loop counters are good candidates for being declared as register variables. When they are not stored in a register, a substantial amount of the loop's execution time is wasted in fetching the variable from memory, assigning a new value to it, and storing it back in memory repeatedly. Storing it in a machine register reduces this overhead. Note, however, that register is only a recommendation to the compiler. As with function inlining, the compiler can refuse to store the object in a machine register. Furthermore, modern compilers optimize loop counters and move them to the machine's registers anyway. The register storage specification is not confined to fundamental types. Rather, it can be used for any type of object. If the object is too large to fit into a register, the compiler can still store the object in a faster memory region, such as the cache memory (cache memory is about ten times faster than the main memory).
NOTE: Some compilers ignore the register specification altogether and automatically store the program's variables according to a set of built-in optimization rules. Please consult your vendor's specifications for more details on the compiler's handling of register declarations.
Declaring function parameters with the register storage specifier is a recommendation to pass the arguments on the machine's registers rather than passing them on the stack. For example
void f(register int j, register Date d);
In addition to the other boons of declaring constant objects as const, an optimizing compiler can take advantage of this declaration, too, and store such an object in a machine register instead of in ordinary memory. Note that the same optimization can be applied to function parameters that are declared const. On the other hand, the volatile qualifier disables such an optimization (see Appendix A, "Manual of Programming Style"), so use it only when it is unavoidable.
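For example, in the following sketch the loop bound is declared const, so an optimizing compiler is free to keep it in a register (whether it actually does so is, again, implementation-dependent):

void count_up()
{
    const int iterations = 1000000; //known never to change; may be kept in a register
    for (int i = 0; i < iterations; i++)
    {
        //...
    }
}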
When a virtual function is called through a pointer or a reference of an object, the call doesn't necessarily impose additional runtime penalties. If the compiler can resolve the call statically, no extra overhead is incurred. Furthermore, a very short virtual function can be inlined in this case. In the following example, a clever compiler can resolve the calls of the virtual member functions statically:
#include <iostream>
using namespace std;

class V
{
public:
    virtual void show() const { cout << "I'm V" << endl; }
};

class W : public V
{
public:
    void show() const { cout << "I'm W" << endl; }
};

void f(V & v, V *pV)
{
    v.show();
    pV->show();
}

void g()
{
    V v;
    f(v, &v);
}

int main()
{
    g();
    return 0;
}
If the entire program appears in a single translation unit, the compiler can perform an inline substitution of the call of the function g() in main(). The invocation of f() within g() can also be inlined, and because the dynamic type of the arguments that are passed to f() is known at compile time, the compiler can resolve the virtual function calls inside f() statically. There is no guarantee that every compiler actually inlines all the function calls; however, some compilers certainly take advantage of the fact that the dynamic type of the arguments of f() can be determined at compile time, and avoid the overhead of dynamic binding in this case.
The benefits of using function objects instead of function pointers (function objects are discussed in Chapter 10 and in Chapter 3, "Operator Overloading") are not limited to genericity and easier maintenance. In addition, compilers can inline the call of a function object, thereby enhancing performance even further; inlining a call through a function pointer is rarely possible.
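As a small sketch of the difference, consider sorting the same vector with a function object and with a function pointer; with the object, the compiler sees the body of the comparison at the call site and can inline it:

#include <algorithm>
#include <vector>
using namespace std;

struct Less //function object: the comparison's body is visible at the call site
{
    bool operator()(int a, int b) const { return a < b; }
};

bool less_fn(int a, int b) //reached only through a pointer at runtime
{
    return a < b;
}

void sort_both(vector<int>& v)
{
    sort(v.begin(), v.end(), Less());  //the comparison is a candidate for inlining
    sort(v.begin(), v.end(), less_fn); //each comparison is an indirect call
}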
The optimization techniques that have been presented thus far do not dictate design compromises or less readable code. In fact, some of them improve the software's robustness and the ease of maintenance. Packing a long argument list within a class object, const declarations, and using function objects rather than function pointers provide additional benefits on top of the performance boost. Under strict time and memory constraints, however, these techniques might not suffice; additional tweaks are sometimes required, which affect the portability and extensibility of the software. The techniques that are presented in this section are to be used only as a last resort, and only after all the other optimizations have been applied.
When you port pure C code to a C++ compiler, you might discover a slight performance degradation. This is not a fault in the programming language or the compiler, but a matter of compiler tuning. All you have to do to gain the same (or better) performance that you might get from a C compiler is switch off the compiler's RTTI and exception handling support. Why is this? In order to support RTTI or exception handling, a C++ compiler inserts additional "scaffolding" code into the original source file. This increases the executable size a little and imposes slight runtime overhead (the overheads of exception handling and RTTI are discussed in Chapter 6, "Exception Handling," and Chapter 7, "Runtime Type Identification," respectively). When pure C is used, this additional code is unnecessary. Note, however, that you should not attempt to apply this tweak to C++ code or to C code that uses any C++ constructs, such as operator new or virtual functions.
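For example, with GNU g++ the relevant switches are -fno-exceptions and -fno-rtti; other compilers expose the same settings under different option names or as IDE checkboxes, and the file name below is only a placeholder:

g++ -O2 -fno-exceptions -fno-rtti legacy_module.cpp -o legacy_module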
Time-critical sections of C++ code can be rewritten in native assembly code. The result can be a significant increase in speed. Note, however, that this measure is not to be taken lightly because it makes future modifications much more difficult. Programmers who maintain the code might not be familiar with the particular assembly language that is used, or they might have no prior experience in assembly language at all. Furthermore, porting the software to other platforms requires rewriting of the assembly code parts (in some instances, upgrading the processor can also necessitate rewriting). In addition, developing and testing assembly code is an arduous task that can take much more time than developing and testing code that is written in a high-level language.
Generally, operations that are coded in assembly are low-level library functions. On most implementations, for example, the standard library functions memset() and strcpy() are written in native assembly code. C and C++ enable the programmer to embed inline assembly code within an asm block. For example
asm
{
    mov a, ecx
    //...
}
API functions and classes enable you to interact with the operating system. Sometimes, however, executing a system command directly can be much faster. For this purpose, you can use the standard function system() that takes a shell command as a const char *. For example, on a DOS/Windows system, you can display the files in the current directory as follows:
#include <cstdlib>
using namespace std;

int main()
{
    system("dir"); //execute the "dir" command
    return 0;
}
Here again, the tradeoff is between speed on the one hand and portability and future extensibility on the other hand.
In an ideal world, software designers and developers might focus their efforts on robust, extensible, and readable code. Fortunately, the current state of affairs in the software world is much closer to that ideal than it was 15, 30, or 50 years ago. Notwithstanding that, performance tuning and optimizations will probably remain a necessity for a long time. The faster hardware becomes, the more the software that runs on it is required to meet higher demands. Speech recognition, online translation of natural languages, neural networks, and complex mathematical computations are only a few examples of resource-hungry applications that will evolve in the future and require careful optimizations.
Textbooks often recommend that you defer optimization considerations to the final stages of testing. Indeed, the primary goal is to get the system to work correctly. Nonetheless, some of the techniques presented here -- such as declaring objects locally, preferring prefix to postfix operators, and using initialization instead of assignment -- should become a natural habit. It is a well-known fact that programs usually spend 90% of their time executing only 10% of their code (the numbers might vary, but they typically range between 80/20 and 95/5). The first step in optimization is, therefore, identifying that critical 10% of your program and optimizing it. Many automated profiling and optimization tools can assist you in identifying these critical code parts. Some of these tools can also suggest solutions to enhance performance. Still, many optimization techniques are implementation-specific and always require human expertise. It is important to empirically verify your suspicions and to test the effect of every suggested code modification to ensure that it indeed improves the system's performance. Programmers' intuitions regarding the cost of certain operations are often misleading. For example, shorter code is not necessarily faster code. Similarly, writing convoluted code to avoid the cost of a simple if statement is not worth the trouble because it saves only one or two CPU cycles.