One of the
biggest time-wasters in C is character arrays: keeping track of the difference
between static quoted strings and arrays created on the stack and the heap, and
the fact that sometimes you’re passing around a char* and sometimes
you must copy the whole array.
(This is the general problem of shallow copy vs.
deep copy.) Especially because string manipulation is so common,
character arrays are a great source of misunderstandings and bugs.
Despite this, creating string classes remained a common
exercise for beginning C++ programmers for many years. The Standard C++ library
string class solves the problem of character array manipulation once and
for all, keeping track of memory even during assignments and
copy-constructions. You simply don’t need to think about it.
This chapter examines the Standard C++ string class,
beginning with a look at what constitutes a C++ string and how the C++ version
differs from a traditional C character array. You’ll learn about
operations and manipulations using string objects, and see how C++
strings accommodate variation in character sets and string data
conversion.
Handling text is perhaps one of the oldest of all programming
applications, so it’s not surprising that the C++ string draws
heavily on the ideas and terminology that have long been used for this purpose
in C and other languages. As you begin to acquaint yourself with C++
strings this fact should be reassuring, in the respect that no matter
what programming idiom you choose, there are really only about three things you
can do with a string: create or modify the sequence of characters stored
in the string, detect the presence or absence of elements within the
string, and translate between various schemes for representing
string characters.
In C, a string is simply an array of characters that always
includes a binary zero (often called the null terminator) as its final
array element. There are two significant differences between C++ strings
and their C progenitors. First, C++ string objects associate the array of
characters which constitute the string with methods useful for managing
and operating on it. A string also contains certain
“housekeeping” information about the size and storage location of
its data. Specifically, a C++ string object knows its starting location
in memory, its content, its length in characters, and the length in characters
to which it can grow before the string object must resize its internal
data buffer. This gives rise to the second big difference between C char
arrays and C++ strings. A C++ string does not use a null
terminator to mark the end of its data (the length is stored separately), nor do the C++ string handling member functions rely on the
existence of a null terminator to perform their jobs. C++ strings greatly
reduce the likelihood of making three of the most common and destructive C
programming errors: overwriting array bounds, trying to access arrays through
uninitialized or incorrectly valued pointers, and leaving pointers
“dangling” after an array ceases to occupy the storage that was once
allocated to it.
The exact implementation of memory layout for the string class
is not defined by the C++ Standard. This architecture is intended to be flexible
enough to allow differing implementations by compiler vendors, yet guarantee
predictable behavior for users. In particular, the exact conditions under which
storage is allocated to hold data for a string object are not defined. String
allocation rules were formulated to allow but not require a reference-counted
implementation, but whether or not the implementation uses reference counting,
the semantics must be the same. To put this a bit differently, in C, every
char array occupies a unique physical region of memory. In C++,
individual string objects may or may not occupy unique physical regions
of memory, but if reference counting is used to avoid storing duplicate copies
of data, the individual objects must look and act as though they do exclusively
own unique regions of storage. For example:
//: C01:StringStorage.cpp
#include <string>
#include <iostream>
using namespace std;

int main() {
  string s1("12345");
  // Set the iterator to indicate the first element
  string::iterator it = s1.begin();
  // This may copy the first to the second or
  // use reference counting to simulate a copy
  string s2 = s1;
  // Either way, this statement may ONLY modify first
  *it = '0';
  cout << "s1 = " << s1 << endl;
  cout << "s2 = " << s2 << endl;
} ///:~
Reference counting may serve to make an implementation more
memory efficient, but it is transparent to users of the string
class.
Creating and initializing strings is a straightforward
proposition, and fairly flexible as well. In the example shown below, the first
string, imBlank, is declared but contains no initial value. Unlike
a C char array, which would contain a random and meaningless bit pattern
until initialization, imBlank does contain meaningful information. This
string object has been initialized to hold “no characters,”
and can properly report its 0 length and absence of data elements through the
use of class member functions.
The next string, heyMom, is initialized by the
literal argument "Where are my socks?". This form of initialization uses a
quoted character array as a parameter to the string constructor. By
contrast, standardReply is simply initialized with an assignment. The
last string of the group, useThisOneAgain, is initialized using an
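existing C++ string object. Put another way, this example illustrates
that string objects let you create an empty string and defer filling it with character data, initialize a string by passing a quoted character array to the constructor, initialize a string using the equal sign, and use one string to initialize another: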
//: C01:SmallString.cpp
#include <string>
using namespace std;

int main() {
  string imBlank;
  string heyMom("Where are my socks?");
  string standardReply = "Beamed into deep "
    "space on wide angle dispersion?";
  string useThisOneAgain(standardReply);
} ///:~
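These are the simplest forms of string initialization,
but there are other variations which offer more flexibility and control. You can
copy a portion of another string, combine different sources of initialization data using operator+, and use the substr( ) member function to create a substring: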
//: C01:SmallString2.cpp
#include <string>
#include <iostream>
using namespace std;

int main() {
  string s1
    ("What is the sound of one clam napping?");
  string s2
    ("Anything worth doing is worth overdoing.");
  string s3("I saw Elvis in a UFO.");
  // Copy the first 8 chars
  string s4(s1, 0, 8);
  // Copy 6 chars from the middle of the source
  string s5(s2, 15, 6);
  // Copy from middle to end
  string s6(s3, 6, 15);
  // Copy all sorts of stuff
  string quoteMe = s4 + "that" +
    // substr() copies 10 chars at element 20
    s1.substr(20, 10) + s5 +
    // substr() copies up to either 100 char
    // or eos starting at element 5
    "with" + s3.substr(5, 100) +
    // OK to copy a single char this way
    s1.substr(37, 1);
  cout << quoteMe << endl;
} ///:~
The string member function substr( ) takes
a starting position as its first argument and the number of characters to select
as the second argument. Both of these arguments have default values and if you
say substr( ) with an empty argument list you produce a copy of the
entire string, so this is a convenient way to duplicate a string.
Here’s what the string quoteMe contains
after the initialization shown above:
"What is that one clam doing with Elvis in a UFO.?"
Notice the final initialization in the example above. C++ allows
string initialization techniques to be mixed in a single statement, a
flexible and convenient feature. Also note that the last initializer copies
just one character from the source string.
Another slightly more subtle initialization technique involves
the use of the string iterators string.begin( ) and
string.end( ). This treats a string like a container
object (which you’ve seen primarily in the form of vector so far in
this book – you’ll see many more containers soon) which has
iterators indicating the start and end of the “container.”
This way you can hand a string constructor two iterators and it will copy
from one to the other into the new string:
//: C01:StringIterators.cpp
#include <string>
#include <iostream>
using namespace std;

int main() {
  string source("xxx");
  string s(source.begin(), source.end());
  cout << s << endl;
} ///:~
The iterators are not restricted to begin( ) and
end( ), so you can choose a subset of characters from the source
string.
C++ strings may not be initialized with single
characters or with ASCII or other integer values.
//: C01:UhOh.cpp
#include <string>
using namespace std;

int main() {
  // Error: no single char inits
  //! string nothingDoing1('a');
  // Error: no integer inits
  //! string nothingDoing2(0x37);
} ///:~
If you’ve programmed in C, you are accustomed to the
convenience of a large family of functions for writing, searching, rearranging,
and copying char arrays. However, there are two unfortunate aspects of
the Standard C library functions for handling char arrays. First, there
are three loosely organized families of them: the “plain” group, the
group that manipulates the characters without respect to case, and the
ones which require you to supply a count of the number of characters to be
considered in the operation at hand. The roster of function names in the C
char array handling library literally runs to several pages, and though
the kind and number of arguments to the functions are somewhat consistent within
each of the three groups, to use them properly you must be very attentive to
details of function naming and parameter passing.
The second inherent trap of the standard C char array
tools is that they all rely explicitly on the assumption that the character
array includes a null terminator. If by oversight or error the null is omitted
or overwritten, there’s very little to keep the C char array
handling functions from manipulating the memory beyond the limits of the
allocated space, sometimes with disastrous results.
C++ provides a vast improvement in the convenience and safety
of string objects. For purposes of actual string handling operations,
there are a modest two or three dozen member function names. It’s worth
your while to become acquainted with these. Each function is overloaded, so you
don’t have to learn a new string member function name simply
because of small differences in their
parameters.
One of the most valuable and convenient aspects of C++ strings
is that they grow as needed, without intervention on the part of the programmer.
Not only does this make string handling code inherently more trustworthy, it
also almost entirely eliminates a tedious “housekeeping” chore
– keeping track of the bounds of the storage in which your strings live.
For example, if you create a string object and initialize it with a string of 50
copies of ‘X’, and later store in it 50 copies of
“Zowie”, the object itself will reallocate sufficient storage to
accommodate the growth of the data. Perhaps nowhere is this property more
appreciated than when the strings manipulated in your code change in size, and
you don’t know how big the change is. Appending, concatenating, and
inserting strings often give rise to this circumstance, but the string member
functions append( ) and insert( ) transparently
reallocate storage when a string grows.
//: C01:StrSize.cpp
#include <string>
#include <iostream>
using namespace std;

int main() {
  string bigNews("I saw Elvis in a UFO. ");
  cout << bigNews << endl;
  // How much data have we actually got?
  cout << "Size = " << bigNews.size() << endl;
  // How much can we store without reallocating
  cout << "Capacity = "
    << bigNews.capacity() << endl;
  // Insert this string in bigNews immediately
  // before bigNews[1]
  bigNews.insert(1, " thought I ");
  cout << bigNews << endl;
  cout << "Size = " << bigNews.size() << endl;
  cout << "Capacity = "
    << bigNews.capacity() << endl;
  // Make sure that there will be this much space
  bigNews.reserve(500);
  // Add this to the end of the string
  bigNews.append("I've been working too hard.");
  cout << bigNews << endl;
  cout << "Size = " << bigNews.size() << endl;
  cout << "Capacity = "
    << bigNews.capacity() << endl;
} ///:~
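Here is the output from one particular compiler (the sizes follow from the data, but the capacity figures depend on the library implementation):
I saw Elvis in a UFO. 
Size = 22
Capacity = 31
I thought I  saw Elvis in a UFO. 
Size = 33
Capacity = 63
I thought I  saw Elvis in a UFO. I've been working too hard.
Size = 60
Capacity = 511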
This example demonstrates that even though you can safely
relinquish much of the responsibility for allocating and managing the memory
your strings occupy, C++ strings provide you with several tools to
monitor and manage their size. The size( ), resize( ),
capacity( ), and reserve( ) member functions can be very
useful when it’s necessary to work back and forth between data contained in C++
style strings and traditional null terminated C char arrays. Note the
ease with which we changed the size of the storage allocated to the
string.
The exact fashion in which the string member functions
will allocate space for your data is dependent on the implementation of the
library. When one implementation was tested with the example above, it appeared
that reallocations occurred on even word boundaries, with one byte held back.
The architects of the string class have endeavored to make it possible to
mix the use of C char arrays and C++ string objects, so it is likely that
figures reported by StrSize.cpp for capacity reflect that in this
particular implementation, a byte is set aside to easily accommodate the
insertion of a null terminator.
insert( ) is particularly nice because it absolves
you of making sure the insertion of characters in a string won’t overrun
the storage space or overwrite the characters immediately following the
insertion point. Space grows and existing characters politely move over to
accommodate the new elements. Sometimes, however, this might not be what you
want to happen. If the data in string needs to retain the ordering of the
original characters relative to one another or must be a specific constant size,
use the replace( ) function to overwrite a particular sequence of
characters with another group of characters. There are quite a number of
overloaded versions of replace( ), but the simplest one takes three
arguments: an integer telling where to start in the string, an integer telling
how many characters to eliminate from the original string, and the replacement
string (which can be a different number of characters than the eliminated
quantity). Here’s a very simple example:
//: C01:StringReplace.cpp
// Simple find-and-replace in strings
#include <string>
#include <iostream>
using namespace std;

int main() {
  string s("A piece of text");
  string tag("$tag$");
  s.insert(8, tag + ' ');
  cout << s << endl;
  // find() returns an unsigned size_type
  string::size_type start = s.find(tag);
  cout << "start = " << start << endl;
  cout << "size = " << tag.size() << endl;
  s.replace(start, tag.size(), "hello there");
  cout << s << endl;
} ///:~
The tag is first inserted into s (notice that
the insert happens before the value indicating the insert point, and that
an extra space was added after tag), then it is found and
replaced.
You should actually check to see if you’ve found
anything before you perform a replace( ). The above example
replaces with a char*, but there’s an overloaded version that
replaces with a string. Here’s a more complete demonstration
of replace( ):
//: C01:Replace.cpp
#include <string>
#include <iostream>
using namespace std;

void replaceChars(string& modifyMe,
  const string& findMe, const string& newChars) {
  // Look in modifyMe for the "find string"
  // starting at position 0
  string::size_type i = modifyMe.find(findMe, 0);
  // Did we find the string to replace?
  if(i != string::npos)
    // Replace the find string with newChars.
    // The count is the length of the text found,
    // not the length of the replacement:
    modifyMe.replace(i, findMe.size(), newChars);
}

int main() {
  string bigNews =
    "I thought I saw Elvis in a UFO. "
    "I have been working too hard.";
  string replacement("wig");
  string findMe("UFO");
  // Find "UFO" in bigNews and overwrite it:
  replaceChars(bigNews, findMe, replacement);
  cout << bigNews << endl;
} ///:~
The output from Replace.cpp looks like
this:
I thought I saw Elvis in a wig. I have been working too hard.
If find( ) doesn’t locate the search string, it returns
npos. npos is a static constant member of the basic_string
class.
Unlike insert( ), replace( ) overwrites existing characters.
When the replacement has the same number of characters as the sequence it replaces,
the size of the string doesn’t change. However, the string
will grow if the replacement is longer than the sequence it replaces, or if you make a “replacement” that
writes beyond the end of an existing array. Here’s an example:
//: C01:ReplaceAndGrow.cpp
#include <string>
#include <iostream>
using namespace std;

int main() {
  string bigNews("I saw Elvis in a UFO. "
    "I have been working too hard.");
  string replacement("wig");
  // The first arg says "replace chars
  // beyond the end of the existing string":
  bigNews.replace(bigNews.size(),
    replacement.size(), replacement);
  cout << bigNews << endl;
} ///:~
The call to replace( ) begins
“replacing” beyond the end of the existing array. The output looks
like this:
I saw Elvis in a UFO. I have been working too hard.wig
Notice that replace( ) expands the array to
accommodate the growth of the string due to “replacement” beyond the
bounds of the existing array.
You may have been hunting through this chapter trying to do
something relatively simple like replace all the instances of one character with
a different character. Upon finding the above section on replacing, you thought
you found the answer but then you started seeing groups of characters and counts
and other things that looked a bit too complex. Doesn’t string have
a way to just replace one character with another everywhere?
The string class by itself doesn’t solve all
possible problems. The remainder are relegated to the STL algorithms, because
the string class can look just like an STL container (the STL algorithms
work with anything that looks like an STL container). All the STL algorithms
work on a “range” of elements within a container. Usually that range
is just “from the beginning of the container to the end.” A
string object looks like a container of characters: to get the beginning
of the range you use string::begin( ) and to get the end of the
range you use string::end( ). The following example shows the use of
the STL replace( ) algorithm to replace all the instances of
‘X’ with ‘Y’:
//: C01:StringCharReplace.cpp
#include <string>
#include <algorithm>
#include <iostream>
using namespace std;

int main() {
  string s("aaaXaaaXXaaXXXaXXXXaaa");
  cout << s << endl;
  replace(s.begin(), s.end(), 'X', 'Y');
  cout << s << endl;
} ///:~
Notice that this replace( ) is not called
as a member function of string. Also, unlike the
string::replace( ) functions which only perform one replacement, the
STL replace is replacing all instances of one character with another.
The STL replace( ) algorithm only works with
single objects (in this case, char objects), and will not perform
replacements of quoted char arrays or of string objects.
Since a string looks like an STL container, there are a
number of other STL algorithms that can be applied to it, which may solve other
problems you have that are not directly addressed by the string member
functions. See Chapter XX for more information on the STL
algorithms.
One of the most delightful discoveries awaiting a C programmer
learning about C++ string handling is how simply strings can be
combined and appended using operator+ and operator+=. These
operators make combining strings syntactically equivalent to adding
numeric data.
//: C01:AddStrings.cpp
#include <string>
#include <iostream>
using namespace std;

int main() {
  string s1("This ");
  string s2("That ");
  string s3("The other ");
  // operator+ concatenates strings
  s1 = s1 + s2;
  cout << s1 << endl;
  // Another way to concatenate strings
  s1 += s3;
  cout << s1 << endl;
  // You can index the string on the right
  s1 += s3 + s3[4] + "oh lala";
  cout << s1 << endl;
} ///:~
The output looks like this:
This That 
This That The other 
This That The other The other ooh lala
operator+ and
operator+= are a very flexible and convenient means of combining
string data. On the right hand side of the statement, you can use almost
any type that evaluates to a group of one or more characters.
The find family of string member functions
allows you to locate a character or group of characters within a given string.
Here are the members of the find family and their general
usage:
string find member function | What/how it finds
find( ) | Searches a string for a specified character or group of characters and returns the starting position of the first occurrence found or npos if no match is found. (npos is a constant equal to the largest value the string’s unsigned size type can hold, which is what –1 becomes when converted to that type; it indicates that a search failed.)
find_first_of( ) | Searches a target string and returns the position of the first match of any character in a specified group. If no match is found, it returns npos.
find_last_of( ) | Searches a target string and returns the position of the last match of any character in a specified group. If no match is found, it returns npos.
find_first_not_of( ) | Searches a target string and returns the position of the first element that doesn’t match any character in a specified group. If no such element is found, it returns npos.
find_last_not_of( ) | Searches a target string and returns the position of the element with the largest subscript that doesn’t match any character in a specified group. If no such element is found, it returns npos.
rfind( ) | Searches a string from end to beginning for a specified character or group of characters and returns the starting position of the match if one is found. If no match is found, it returns npos.
String searching member functions and their general uses
The simplest use of find( ) searches for one or
more characters in a string. This overloaded version of find( )
takes a parameter that specifies the character(s) for which to search, and
optionally one that tells it where in the string to begin searching for the
occurrence of a substring. (The default position at which to begin searching is
0.) By setting the call to find inside a loop, you can easily move
through a string, repeating a search in order to find all of the occurrences of
a given character or group of characters within the string.
Notice that we define the string object sieveChars
using a constructor idiom which sets the initial size of the character array and
writes the value ‘P’ to each of its members.
//: C01:Sieve.cpp
#include <string>
#include <iostream>
using namespace std;

int main() {
  // Create a 50 char string and set each
  // element to 'P' for Prime
  string sieveChars(50, 'P');
  // By definition neither 0 nor 1 is prime.
  // Change these elements to "N" for Not Prime
  sieveChars.replace(0, 2, "NN");
  // Walk through the array:
  for(string::size_type i = 2;
      i <= (sieveChars.size() / 2) - 1; i++)
    // Mark all the multiples of i:
    for(string::size_type factor = 2;
        factor * i < sieveChars.size(); factor++)
      sieveChars[factor * i] = 'N';

  cout << "Prime:" << endl;
  // Return the index of the first 'P' element:
  string::size_type j = sieveChars.find('P');
  // While not at the end of the string:
  while(j != sieveChars.npos) {
    // If the element is P, the index is a prime
    cout << j << " ";
    // Move past the last prime
    j++;
    // Find the next prime
    j = sieveChars.find('P', j);
  }
  cout << "\n Not prime:" << endl;
  // Find the first element value not equal P:
  j = sieveChars.find_first_not_of('P');
  while(j != sieveChars.npos) {
    cout << j << " ";
    j++;
    j = sieveChars.find_first_not_of('P', j);
  }
} ///:~
The output from Sieve.cpp looks like this:
Prime:
2 3 5 7 11 13 17 19 23 29 31 37 41 43 47
 Not prime:
0 1 4 6 8 9 10 12 14 15 16 18 20 21 22 24 25 26 27 28 30 32 33 34 35 36 38 39 40 42 44 45 46 48 49
find( ) allows you to walk forward through a
string, detecting multiple occurrences of a character or group of
characters, while find_first_not_of( ) allows you to test for the
absence of a character or group.
The find member is also useful for detecting the
occurrence of a sequence of characters in a string:
//: C01:Find.cpp
// Find a group of characters in a string
#include <string>
#include <iostream>
using namespace std;

int main() {
  string chooseOne("Eenie, meenie, miney, mo");
  string::size_type i = chooseOne.find("een");
  while(i != string::npos) {
    cout << i << endl;
    i++;
    i = chooseOne.find("een", i);
  }
} ///:~
Find.cpp produces a single line of output:
8
This tells us that the first ‘e’ of the search
group “een” was found in the word “meenie,” at
index 8 in the string (the ninth character, since indexing starts at zero). Notice that find passed over the
“Een” group of characters in the word “Eenie”. The
find member function performs a case sensitive search.
There are no functions in the string class to change
the case of a string, but these functions can be easily created using the
Standard C library functions toupper( ) and tolower( ),
which change the case of one character at a time. A few small changes will make
Find.cpp perform a case insensitive search:
//: C01:NewFind.cpp
#include <cctype>
#include <string>
#include <iostream>
using namespace std;

// Make an uppercase copy of s:
string upperCase(const string& s) {
  char* buf = new char[s.length()];
  s.copy(buf, s.length());
  for(string::size_type i = 0; i < s.length(); i++)
    buf[i] = toupper(
      static_cast<unsigned char>(buf[i]));
  string r(buf, s.length());
  delete[] buf; // Array form to match new[]
  return r;
}

// Make a lowercase copy of s:
string lowerCase(const string& s) {
  char* buf = new char[s.length()];
  s.copy(buf, s.length());
  for(string::size_type i = 0; i < s.length(); i++)
    buf[i] = tolower(
      static_cast<unsigned char>(buf[i]));
  string r(buf, s.length());
  delete[] buf; // Array form to match new[]
  return r;
}

int main() {
  string chooseOne("Eenie, meenie, miney, mo");
  cout << chooseOne << endl;
  cout << upperCase(chooseOne) << endl;
  cout << lowerCase(chooseOne) << endl;
  // Case sensitive search
  string::size_type i = chooseOne.find("een");
  while(i != string::npos) {
    cout << i << endl;
    i++;
    i = chooseOne.find("een", i);
  }
  // Search lowercase:
  string lcase = lowerCase(chooseOne);
  cout << lcase << endl;
  i = lcase.find("een");
  while(i != lcase.npos) {
    cout << i << endl;
    i++;
    i = lcase.find("een", i);
  }
  // Search uppercase:
  string ucase = upperCase(chooseOne);
  cout << ucase << endl;
  i = ucase.find("EEN");
  while(i != ucase.npos) {
    cout << i << endl;
    i++;
    i = ucase.find("EEN", i);
  }
} ///:~
Both the upperCase( ) and lowerCase( )
functions follow the same form: they allocate storage to hold the data in the
argument string, copy the data and change the case. Then they create a
new string with the new data, release the buffer and return the result
string. The c_str( ) function cannot be used to produce a
pointer to directly manipulate the data in the string because
c_str( ) returns a pointer to const. That is, you’re
not allowed to modify string data through that pointer, only through the string’s own member
functions. If you need to use the more primitive char array manipulation,
you should use the technique shown above.
The output looks like this:
Eenie, meenie, miney, mo
EENIE, MEENIE, MINEY, MO
eenie, meenie, miney, mo
8
eenie, meenie, miney, mo
0
8
EENIE, MEENIE, MINEY, MO
0
8
The case insensitive searches found both occurrences of the
“een” group.
NewFind.cpp isn’t the best solution to the case
sensitivity problem, so we’ll revisit it when we examine string
comparisons.
Sometimes it’s necessary to search through a
string from end to beginning, if you need to find the data in “last
in / first out” order. The string member function rfind( )
handles this job.
//: C01:Rparse.cpp
// Reverse the order of words in a string
#include <string>
#include <iostream>
#include <vector>
using namespace std;

int main() {
  // The ';' characters will be delimiters
  string s("now.;sense;make;to;going;is;This");
  cout << s << endl;
  // To store the words:
  vector<string> strings;
  // One past the end of the current word:
  string::size_type last = s.size();
  // The delimiter in front of the current word:
  string::size_type current = s.rfind(';');
  // Walk backward through the string:
  while(current != string::npos) {
    // Push each word into the vector, starting
    // one past the delimiter to avoid copying it.
    strings.push_back(
      s.substr(current + 1, last - current - 1));
    // The delimiter just found marks the end
    // of the next word
    last = current;
    // Find the next delimiter, taking care not
    // to step back past the start of the string
    current = (current == 0) ? string::npos
      : s.rfind(';', current - 1);
  }
  // Pick up the first word - it's not
  // preceded by a delimiter
  strings.push_back(s.substr(0, last));
  // Print them in the new order:
  for(vector<string>::size_type j = 0;
      j < strings.size(); j++)
    cout << strings[j] << " ";
} ///:~
Here’s how the output from Rparse.cpp
looks:
now.;sense;make;to;going;is;This
This is going to make sense now.
rfind( ) backs through the string looking for
tokens, reporting the array index of matching characters or string::npos
if it is unsuccessful.
The find_first_not_of( ) and
find_last_not_of( ) member functions can be conveniently put to work to
create a little utility that will strip whitespace characters off of both ends
of a string. Notice it doesn’t touch the original string, but instead
returns a new string:
//: C01:trim.h
#ifndef TRIM_H
#define TRIM_H
#include <string>

// General tool to strip spaces from both ends:
inline std::string trim(const std::string& s) {
  if(s.length() == 0)
    return s;
  std::string::size_type b =
    s.find_first_not_of(" \t");
  std::string::size_type e =
    s.find_last_not_of(" \t");
  if(b == std::string::npos) // No non-spaces
    return "";
  return std::string(s, b, e - b + 1);
}
#endif // TRIM_H ///:~
The first test checks for an empty string; in that case
no tests are made and a copy is returned. Notice that once the end points are
found, the string constructor is used to build a new string from
the old one, giving the starting count and the length. This form also utilizes
the “return value optimization” (see the index for more
details).
Testing such a general-purpose tool needs to be
thorough:
//: C01:TrimTest.cpp
#include "trim.h"
#include <iostream>
using namespace std;

string s[] = {
  " \t abcdefghijklmnop \t ",
  "abcdefghijklmnop \t ",
  " \t abcdefghijklmnop",
  "a", "ab", "abc", "a b c",
  " \t a b c \t ", " \t a \t b \t c \t ",
  "", // Must also test the empty string
};

void test(string s) {
  cout << "[" << trim(s) << "]" << endl;
}

int main() {
  for(int i = 0; i < sizeof s / sizeof *s; i++)
    test(s[i]);
} ///:~
In the array of string s, you can see that the
character arrays are automatically converted to string objects. This
array provides cases to check the removal of spaces and tabs from both ends, as
well as ensuring that spaces and tabs do not get removed from the middle of a
string.
My word processor/page layout program (Microsoft Word) will
save a document in HTML, but it doesn’t recognize that the code listings
in this book should be tagged with the HTML “preformatted” tag
(<PRE>), and it puts paragraph marks (<P> and </P>) around
every listing line. This means that all the indentation in the code listings is
lost. In addition, Word saves HTML with reduced font sizes for body text, which
makes it hard to read.
To convert the book to HTML
form[5], then, the original output must be
reprocessed, watching for the tags that mark the start and end of code listings,
inserting the <PRE> and </PRE> tags at the appropriate places,
removing all the <P> and </P> tags within the listings, and
adjusting the font sizes. Removal is accomplished with the erase( )
member function, but you must correctly determine the starting and ending points
of the substring you wish to erase. Here’s the program that reprocesses
the generated HTML file:
//: C01:ReprocessHTML.cpp
// Take Word's html output and fix up
// the code listings and html tags
#include "../require.h"
#include <cstring>
#include <iostream>
#include <fstream>
#include <string>
using namespace std;

// Produce a new string which is the original
// string with the html paragraph break marks
// stripped off:
string stripPBreaks(string s) {
  string::size_type br;
  while((br = s.find("<P>")) != string::npos)
    s.erase(br, strlen("<P>"));
  while((br = s.find("</P>")) != string::npos)
    s.erase(br, strlen("</P>"));
  return s;
}

// After the beginning of a code listing is
// detected, this function cleans up the listing
// until the end marker is found. The first line
// of the listing is passed in by the caller,
// which detects the start marker in the line.
void fixupCodeListing(istream& in, ostream& out,
  string& line, string::size_type tag) {
  out << line.substr(0, tag)
    << "<PRE>" // Means "preformatted" in html
    << stripPBreaks(line.substr(tag)) << endl;
  string s;
  while(getline(in, s)) {
    string::size_type endtag =
      s.find("/""/""/"":~");
    if(endtag != string::npos) {
      endtag += strlen("/""/""/"":~");
      string before = s.substr(0, endtag);
      string after = s.substr(endtag);
      out << stripPBreaks(before) << "</PRE>"
        << after << endl;
      return;
    }
    out << stripPBreaks(s) << endl;
  }
}

string removals[] = {
  "<FONT SIZE=2>",
  "<FONT SIZE=1>",
  "<FONT FACE=\"Times\" SIZE=1>",
  "<FONT FACE=\"Times\" SIZE=2>",
  "<FONT FACE=\"Courier\" SIZE=1>",
  "SIZE=1", // Eliminate all other '1' & '2' size
  "SIZE=2",
};
const int rmsz =
  sizeof(removals)/sizeof(*removals);

int main(int argc, char* argv[]) {
  requireArgs(argc, 2);
  ifstream in(argv[1]);
  assure(in, argv[1]);
  ofstream out(argv[2]);
  string line;
  while(getline(in, line)) {
    // The "Body" tag only appears once:
    if(line.find("<BODY") != string::npos) {
      out << "<BODY BGCOLOR=\"#FFFFFF\" "
        "TEXT=\"#000000\">" << endl;
      continue; // Get next line
    }
    // Eliminate each of the removals strings:
    for(int i = 0; i < rmsz; i++) {
      string::size_type found =
        line.find(removals[i]);
      if(found != string::npos)
        line.erase(found, removals[i].size());
    }
    string::size_type tag1 = line.find("/""/"":");
    string::size_type tag2 = line.find("/""*"":");
    if(tag1 != string::npos)
      fixupCodeListing(in, out, line, tag1);
    else if(tag2 != string::npos)
      fixupCodeListing(in, out, line, tag2);
    else
      out << line << endl;
  }
} ///:~
Notice that the lines that detect the start and end listing tags spell out
each tag as a series of separately quoted pieces. These tags are treated in a
special way by the logic in the Extractcode.cpp tool for extracting code
listings. To present the code for the tool in the text of the book, the tag
sequence itself must not occur in the listing. This was accomplished by taking
advantage of a C++ feature that causes adjacent string literals (text strings
delimited by double quotes, with nothing but optional whitespace between them)
to be merged into a single string in an early phase of translation, before the
code is compiled.
int tag1 = line.find("/""/"":");
The effect of the sequence of char arrays is to produce
the starting tag for code listings.
Sometimes it’s useful to take an HTML file and strip its
tags so you have something approximating the text that would be displayed in the
Web browser, only as an ASCII text file. The string class once again
comes in handy. The following program is a variation on the theme of the
previous example:
//: C01:HTMLStripper.cpp
// Filter to remove html tags and markers
#include "../require.h"
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

string replaceAll(string s, string f, string r) {
  string::size_type found = s.find(f);
  while(found != string::npos) {
    s.replace(found, f.length(), r);
    // Resume after the replacement so that a
    // replacement containing f is not rescanned:
    found = s.find(f, found + r.length());
  }
  return s;
}

string stripHTMLTags(string s) {
  while(true) {
    string::size_type left = s.find('<');
    if(left == string::npos) break;
    // Look for the closing bracket after the opening one:
    string::size_type right = s.find('>', left);
    if(right == string::npos) break;
    s.erase(left, right - left + 1);
  }
  s = replaceAll(s, "&lt;", "<");
  s = replaceAll(s, "&gt;", ">");
  s = replaceAll(s, "&amp;", "&");
  s = replaceAll(s, "&nbsp;", " ");
  // Etc...
  return s;
}

int main(int argc, char* argv[]) {
  requireArgs(argc, 1,
    "usage: HTMLStripper InputFile");
  ifstream in(argv[1]);
  assure(in, argv[1]);
  string s;
  // getline() into a string has no fixed line limit:
  while(getline(in, s))
    cout << stripHTMLTags(s) << endl;
} ///:~
The string class can replace one string with another
but there’s no facility for replacing all the strings of one type with
another, so the replaceAll( ) function does this for you, inside a
while loop that keeps finding the next instance of the find string
f. That function is used inside stripHTMLTags( ) after it uses
erase( ) to remove everything that appears inside angle brackets
(‘<’ and ‘>’). Note that I probably
haven’t gotten all the necessary replacement values, but you can see what
to do (you might even put all the find-replace pairs in a table...). In
main( ) the arguments are checked, and the file is read and
converted. It is sent to standard output so you must redirect it with
‘>’ if you want to write it to a
file.
Comparing strings is inherently different from comparing
numbers. Numbers have constant, universally meaningful values. To evaluate the
relationship between the magnitude of two strings, you must make a lexical
comparison. Lexical comparison means that when you test a character to see
if it is “greater than” or “less than” another
character, you are actually comparing the numeric representation of those
characters as specified in the collating sequence of the character set being
used. Most often, this will be the ASCII collating sequence, which assigns the
printable characters for the English language numbers in the range from 32 to
126 decimal. In the ASCII collating sequence, the first “character”
in the list is the space, followed by several common punctuation marks, the
digits, and then uppercase and lowercase letters. With respect to the alphabet,
this means that the letters nearer the front have lower ASCII values than those
nearer the end, and that every uppercase letter has a lower value than every
lowercase letter. With these details in mind, it becomes easier to remember
that when a lexical comparison reports that s1 is “greater than” s2, it simply
means that when the two were compared, the first differing character in s1 came
later in the collating sequence than the character in that same position in s2.
C++ provides several ways to compare strings, and each has
its advantages. The simplest to use are the nonmember overloaded operator
functions operator ==, operator !=, operator >, operator <, operator
>=, and operator <=.
//: C01:CompStr.cpp
#include <string>
#include <iostream>
using namespace std;

int main() {
  // Strings to compare
  string s1("This ");
  string s2("That ");
  for(int i = 0; i < s1.size() && i < s2.size(); i++)
    // See if the string elements are the same:
    if(s1[i] == s2[i])
      cout << s1[i] << " " << i << endl;
  // Use the string inequality operators
  if(s1 != s2) {
    cout << "Strings aren't the same:" << " ";
    if(s1 > s2)
      cout << "s1 is > s2" << endl;
    else
      cout << "s2 is > s1" << endl;
  }
} ///:~
Here’s the output from CompStr.cpp:
T 0
h 1
  4
Strings aren’t the same: s1 is > s2
The overloaded comparison operators are useful for comparing
both full strings and individual string elements.
Notice in the code fragment below the flexibility of argument
types on both the left and right hand side of the comparison operators. The
overloaded operator set allows the direct comparison of string objects, quoted
literals, and pointers to C style strings.
// The left operand is a quoted literal and
// the right operand is a string
if("That " == s2)
  cout << "A match" << endl;
// The left operand is a string and the right operand
// is a pointer to a C style null terminated string
if(s1 != s2.c_str())
  cout << "No match" << endl;
You won’t find the logical not (!) or the logical
operators (&& and ||) among the operators for string. (Neither
will you find overloaded versions of the bitwise C operators &, |, ^, or ~.)
The overloaded nonmember comparison operators for the string class are limited
to the subset which has clear, unambiguous application to single characters or
groups of characters.
The compare( ) member function offers more sophisticated and precise
comparison than the nonmember operator set. It returns a lexical comparison
value (negative if the string sorts before its argument, zero if the two are
equal, and positive otherwise), and it provides for comparisons that consider
subsets of the string data. Only the sign of a nonzero result is specified, so
test it with < 0 and > 0 rather than against -1 and 1. The overloaded versions
allow you to compare two complete strings, part of either string to a complete
string, and subsets of two strings. This example compares complete strings:
//: C01:Compare.cpp
// Demonstrates compare(), swap()
#include <string>
#include <iostream>
using namespace std;

int main() {
  string first("This");
  string second("That");
  // Which is lexically greater?
  int result = first.compare(second);
  if(result == 0) { // The same
    cout << first << " and " << second
         << " are lexically equal" << endl;
  } else {
    // Only the sign of the result is specified,
    // so test for < 0 rather than == -1:
    if(result < 0) // Less than
      first.swap(second);
    cout << first << " is lexically greater than "
         << second << endl;
  }
} ///:~
The output from Compare.cpp looks like this:
This is lexically greater than That
To compare a subset of the characters in one or both strings,
you add arguments that define where to start the comparison and how many
characters to consider. For example, we can use the overloaded version of
compare( ):
s1.compare(s1StartPos, s1NumberChars, s2, s2StartPos,
s2NumberChars);
If we substitute the above version of compare( )
in the previous program so that it only looks at the first two characters of
each string, the program becomes:
//: C01:Compare2.cpp
// Overloaded compare()
#include <string>
#include <iostream>
using namespace std;

int main() {
  string first("This");
  string second("That");
  // Compare first two characters of each string:
  int result = first.compare(0, 2, second, 0, 2);
  if(result == 0) { // The same
    cout << first << " and " << second
         << " are lexically equal" << endl;
  } else {
    // Only the sign of the result is specified:
    if(result < 0) // Less than
      first.swap(second);
    cout << first << " is lexically greater than "
         << second << endl;
  }
} ///:~
The output is:
This and That are lexically equal
which is true, for the first two characters of
“This” and “That.”
In the examples so far, we have used C style array indexing
syntax to refer to an individual character in a string. C++ strings provide an
alternative to the s[n] notation: the at( ) member. These two
idioms produce the same result in C++ if all goes well:
//: C01:StringIndexing.cpp
#include <string>
#include <iostream>
using namespace std;

int main() {
  string s("1234");
  cout << s[1] << " ";
  cout << s.at(1) << endl;
} ///:~
The output from this code looks like this:
2 2
However, there is one important difference between [ ]
and at( ). When you try to reference an array element that is out of
bounds, at( ) will do you the kindness of throwing an exception,
while ordinary [ ] subscripting syntax will leave you to your own
devices:
//: C01:BadStringIndexing.cpp
#include <string>
#include <iostream>
using namespace std;

int main() {
  string s("1234");
  // Runtime problem: goes beyond array bounds:
  cout << s[5] << endl;
  // Saves you by throwing an exception:
  cout << s.at(5) << endl;
} ///:~
Using at( ) in place of [ ] will give you a
chance to gracefully recover from references to array elements that don’t
exist. at( ) throws an object of class out_of_range. By
catching this object in an exception handler, you can take appropriate remedial
actions such as recalculating the offending subscript or growing the array. (You
can read more about Exception Handling in Chapter
XX)
In the example program NewFind.cpp, we used a lot of
messy and rather tedious C char array handling code to change the case of
the characters in a string and then search for the occurrence of matches to a
substring. Sometimes the “quick and dirty” method is justifiable,
but in general, you won’t want to sacrifice the advantages of having your
string data safely and securely encapsulated in the C++ object where it lives.
Here is a better, safer way to handle case insensitive
comparison of two C++ string objects. Because no data is copied out of the
objects and into C style strings, you don’t have to use pointers and you
don’t have to risk overwriting the bounds of an ordinary character array.
In this example, we use the string iterator. Iterators are themselves
objects which move through a collection or container of other objects, selecting
them one at a time, but never providing direct access to the implementation of
the container. Iterators are not pointers, but they are useful for many
of the same jobs.
//: C01:CmpIter.cpp
// Case insensitive comparison of two strings
#include <cctype>
#include <string>
#include <iostream>
using namespace std;

// Case insensitive compare function:
int stringCmpi(const string& s1, const string& s2) {
  // Select the first element of each string:
  string::const_iterator
    p1 = s1.begin(), p2 = s2.begin();
  // Don't run past the end:
  while(p1 != s1.end() && p2 != s2.end()) {
    // Compare upper-cased chars:
    if(toupper(*p1) != toupper(*p2))
      // Report which was lexically greater:
      return (toupper(*p1) < toupper(*p2)) ? -1 : 1;
    p1++;
    p2++;
  }
  // If they match up to the detected eos, the
  // shorter string is the lesser one. Return 0
  // if the sizes are the same.
  if(s1.size() == s2.size()) return 0;
  return (s1.size() < s2.size()) ? -1 : 1;
}

int main() {
  string s1("Mozart");
  string s2("Modigliani");
  cout << stringCmpi(s1, s2) << endl;
} ///:~
Notice that the iterators p1 and p2 use the same
syntax as C pointers: the ‘*’ operator makes the
value of the element at the location given by each iterator available to the
toupper( ) function. toupper( ) doesn’t actually
change the content of the element in the string. In fact, it can’t. This
definition of p1 tells us that we can only use the elements p1
points to as constants.
string::const_iterator p1 = s1.begin();
The way toupper( ) and the iterators are used in
this example is called a case preserving case insensitive comparison.
This means that the string didn’t have to be copied or rewritten to
accommodate case insensitive comparison. Both of the strings retain their
original data, unmodified.
Just as the standard C pointer gives us the increment (++) and
decrement (--) operators to move in either direction through an array, C++
string iterators come in two basic varieties. You’ve seen
begin( ) and end( ), which return the iterators that delimit a
forward pass through a string one element at a time. The member functions
rbegin( ) and rend( ) return reverse iterators, which allow you to step
backward through a string. Here’s how they work:
//: C01:RevStr.cpp
// Print a string in reverse
#include <string>
#include <iostream>
using namespace std;

int main() {
  string s("987654321");
  // Use this iterator to walk backwards:
  string::reverse_iterator rev;
  // "Incrementing" the reverse iterator moves
  // it to successively lower string elements:
  for(rev = s.rbegin(); rev != s.rend(); rev++)
    cout << *rev << " ";
} ///:~
The output from RevStr.cpp looks like this:
1 2 3 4 5 6 7 8 9
Reverse iterators act like pointers to elements of the
string’s character array, except that when you apply the increment
operator to them, they move backward rather than forward.
rbegin( ) and rend( ) supply string locations that are
consistent with this behavior: rbegin( ) refers to the last character
of the string, and rend( ) refers to the position just before the
first character. Aside from this, the main thing to remember about reverse
iterators is that they aren’t type equivalent to ordinary iterators. A
member function whose parameter is specifically an ordinary string iterator,
such as the single-position form of erase( ), won’t accept a reverse iterator
in its place; you must first convert it with the reverse iterator’s
base( ) member. Member functions that are templates over the iterator type
are more accommodating. The string constructor that takes a pair of iterators
is one of these, so in standard C++ you can build a reversed copy directly:
// Creates a reversed copy of s
string sBackwards(s.rbegin(), s.rend());
The same is true of the iterator-range versions of string
members such as insert( ), append( ), and
assign( ). (Compilers that predate member templates reject all of these.)
We seem to have worked our way around the margins of case
insensitive string comparisons using C++ string objects, so maybe it’s
time to ask the obvious question: “Why isn’t case-insensitive
comparison part of the standard string class ?” The answer provides
interesting background on the true nature of C++ string objects.
Consider what it means for a character to have
“case.” Written Hebrew, Farsi, and Kanji don’t use the concept
of upper and lower case, so for those languages this idea has no meaning at all.
This is the first impediment to built-in C++ support for case-insensitive
character search and comparison: the idea of case sensitivity is not universal,
and therefore not portable.
It would seem that if there were a way of designating that
some languages were “all uppercase” or “all lowercase”
we could design a generalized solution. However, some languages which employ the
concept of “case” also change the meaning of particular
characters with diacritical marks: the tilde in Spanish, the circumflex in
French, and the umlaut in German. For this reason, any case-insensitive
collating scheme that attempts to be comprehensive will be nightmarishly
complex to use.
Although we usually treat the C++ string as a class,
strictly speaking it is not one. string is a typedef for a specialization of
a more general constituent, the basic_string< > template. Observe how
string is declared in the standard C++ header file:
typedef basic_string<char> string;
To really understand the nature of strings, it’s helpful
to delve a bit deeper and look at the template on which it is based.
Here’s the declaration of the basic_string< >
template:
template<class charT, class traits = char_traits<charT>, class allocator = allocator<charT> > class basic_string;
Earlier in this book, templates were examined in a great deal
of detail. The main thing to notice about the two declarations above is that
the string type is created when the basic_string template is
instantiated with char. Inside the basic_string< >
template declaration, the line
class traits = char_traits<charT>,
tells us that the behavior of the class made from the
basic_string< > template is specified by a class based
on the template char_traits< >. Thus, the
basic_string< > template provides for cases where you
need string oriented classes that manipulate types other than char (wide
characters or Unicode, for example). To do this, the
char_traits< > template controls the content and
collating behaviors of a variety of character sets using the character
comparison functions eq( ) (equal) and lt( ) (less than), upon which the
basic_string< > string comparison functions
rely. (Some early library implementations also supplied ne( ), for not
equal, but it is not part of the standard traits interface.)
This is why the string class doesn’t include case
insensitive member functions: That’s not in its job description. To change
the way the string class treats character comparison, you must supply a
different traits class, because that defines
the behavior of the individual character comparison member functions.
This information can be used to make a new type of string
class that ignores case. First, we’ll define a new case insensitive
traits struct, ichar_traits, that inherits from char_traits<char>.
Next, we’ll redefine only the static members we need to change in order to
make character-by-character comparison case insensitive. (In addition to the
single-character comparison members mentioned above, we’ll also have to
supply new implementations of find( ) and compare( ).)
Finally, we’ll typedef a new class based on basic_string,
but using the case insensitive ichar_traits struct for its second
argument.
//: C01:ichar_traits.h
// Creating your own character traits
#ifndef ICHAR_TRAITS_H
#define ICHAR_TRAITS_H
#include <string>
#include <cctype>
#include <cstddef>

struct ichar_traits : std::char_traits<char> {
  // We'll only change character by
  // character comparison functions
  static bool eq(char c1st, char c2nd) {
    return std::toupper(c1st) == std::toupper(c2nd);
  }
  // Not part of the standard traits interface;
  // basic_string never calls it:
  static bool ne(char c1st, char c2nd) {
    return std::toupper(c1st) != std::toupper(c2nd);
  }
  static bool lt(char c1st, char c2nd) {
    return std::toupper(c1st) < std::toupper(c2nd);
  }
  static int compare(const char* str1,
    const char* str2, std::size_t n) {
    for(std::size_t i = 0; i < n; i++) {
      if(std::toupper(*str1) > std::toupper(*str2))
        return 1;
      if(std::toupper(*str1) < std::toupper(*str2))
        return -1;
      str1++; str2++; // Compare the other chars
    }
    return 0;
  }
  static const char* find(const char* s1,
    std::size_t n, char c) {
    while(n-- > 0) {
      if(std::toupper(*s1) == std::toupper(c))
        return s1;
      s1++;
    }
    return 0; // Not found
  }
};
#endif // ICHAR_TRAITS_H ///:~
If we typedef an istring class like
this:
typedef basic_string<char, ichar_traits, allocator<char> > istring;
Then this istring will act like an ordinary
string in every way, except that it will make all comparisons without
respect to case. Here’s an example:
//: C01:ICompare.cpp
#include "ichar_traits.h"
#include <string>
#include <iostream>
using namespace std;

typedef basic_string<char, ichar_traits,
  allocator<char> > istring;

int main() {
  // The same letters except for case:
  istring first = "tHis";
  istring second = "ThIS";
  cout << first.compare(second) << endl;
} ///:~
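The output from the program is “0”, indicating
that the strings compare as equal. This is just a simple example. In
order to make istring fully equivalent to string, we’d have
to create the other functions necessary to support the new istring
type. For example, the standard stream inserter (operator<<) accepts only
strings whose traits match those of the stream, so cout << first would not
compile without an overload written for istring.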
My friend Daniel (who designed the cover and page layout for
this book) does a lot of work with Web pages. One tool he uses creates a
“site map” consisting of a Java applet to display the map and an
HTML tag that invokes the applet and provides it with the necessary data to
create the map. Daniel wanted to use this data to create an ordinary HTML page
(sans applet) that would contain regular links as the site map. The resulting
program turns out to be a nice practical application of the string class,
so it is presented here.
The input is an HTML file that contains the usual stuff along
with an applet tag with a parameter that begins like this:
<param name="source_file" value="
The rest of the line contains encoded information about the
site map, all combined into a single line (it’s rather long, but
fortunately string objects don’t care). Each entry may or may not
begin with a number of ‘#’ signs; each of these indicates one
level of depth. If no ‘#’ sign is present the entry will be
considered to be at level one. After the ‘#’ is the text to
be displayed on the page, followed by a ‘%’ and the URL to
use as the link. Each entry is terminated by a ‘*’. Thus, a
single entry in the line might look like this:
###|Useful Art%./Build/useful_art.html*
The ‘|’ at the beginning is an artifact
that needs to be removed.
My solution was to create an Item class whose
constructor would take input text and create an object that contains the text to
be displayed, the URL and the level. The objects essentially parse themselves,
and at that point you can read any value you want. In main( ), the
input file is opened and read until the line contains the parameter that
we’re interested in. Everything but the site map codes is stripped away
from this string, and then it is parsed into Item
objects:
//: C01:SiteMapConvert.cpp
// Using strings to create a custom conversion
// program that generates HTML output
#include "../require.h"
#include <iostream>
#include <fstream>
#include <string>
#include <cstdlib>
using namespace std;

class Item {
  string id, url;
  int depth;
  string removeBar(string s) {
    if(!s.empty() && s[0] == '|')
      return s.substr(1);
    else
      return s;
  }
public:
  Item(string in, string::size_type& index)
    : depth(0) {
    // Check the bounds before reading in[index]:
    while(index < in.size() && in[index] == '#') {
      depth++;
      index++;
    }
    // 0 means no '#' marks were found:
    if(depth == 0) depth = 1;
    while(index < in.size() && in[index] != '%')
      id += in[index++];
    id = removeBar(id);
    index++; // Move past '%'
    while(index < in.size() && in[index] != '*')
      url += in[index++];
    url = removeBar(url);
    index++; // To move past '*'
  }
  string identifier() { return id; }
  string path() { return url; }
  int level() { return depth; }
};

int main(int argc, char* argv[]) {
  requireArgs(argc, 1,
    "usage: SiteMapConvert inputfilename");
  ifstream in(argv[1]);
  assure(in, argv[1]);
  ofstream out("plainmap.html");
  string line;
  while(getline(in, line)) {
    if(line.find("<param name=\"source_file\"")
       != string::npos) {
      // Extract data from start of sequence
      // until the terminating quote mark:
      line = line.substr(line.find("value=\"")
        + string("value=\"").size());
      line = line.substr(0, line.find_last_of("\""));
      string::size_type index = 0;
      while(index < line.size()) {
        Item item(line, index);
        string startLevel, endLevel;
        if(item.level() == 1) {
          startLevel = "<h1>";
          endLevel = "</h1>";
        } else
          for(int i = 0; i < item.level(); i++)
            for(int j = 0; j < 5; j++)
              out << "&nbsp;";
        string htmlLine = "<a href=\""
          + item.path() + "\">"
          + item.identifier() + "</a><br>";
        out << startLevel << htmlLine
            << endLevel << endl;
      }
      break; // Out of while loop
    }
  }
} ///:~
Item contains a private member function
removeBar( ) that is used internally to strip off the leading bars,
if they appear.
The constructor for Item initializes depth to
0 to indicate that no ‘#’ signs were found yet; if none are
found then it is assumed the Item should be displayed at level one. Each
character in the string is examined using operator[ ] to find
the depth, id and url values. The other member functions
simply return these values.
After opening the files, main( ) uses
string::find( ) to locate the line containing the site map data. At
this point, substr( ) is used in order to strip off the information
before and after the site map data. The subsequent while loop performs
the parsing, but notice that the value index is passed by
reference into the Item constructor, and that constructor increments
index as it parses each new Item, thus moving forward in the
sequence.
If an Item is at level one, then an HTML h1 tag
is used, otherwise the elements are indented using HTML non-breaking spaces.
Note in the initialization of htmlLine how easy it is to construct a
string – you can just combine quoted character arrays and other
string objects using operator+.
When the output is written to the destination file,
startLevel and endLevel produce output only if they were
assigned the h1 tags; otherwise they keep their default initialization
values, which are empty strings.
C++ string objects provide developers with a number of great
advantages over their C counterparts. For the most part, the string class
makes referring to strings through the use of character pointers unnecessary.
This eliminates an entire class of software defects that arise from the use of
uninitialized and incorrectly valued pointers. C++ strings dynamically and
transparently grow their internal data storage space to accommodate increases in
the size of the string data. This means that when the data in a string grows
beyond the limits of the memory initially allocated to it, the string object
will make the memory management calls that take space from and return space to
the heap. Consistent allocation schemes prevent memory leaks and have the
potential to be much more efficient than “roll your own” memory
management.
The string class member functions provide a fairly
comprehensive set of tools for creating, modifying, and searching in strings.
string comparisons are always case sensitive, but you can work around
this by copying string data to C style null terminated strings and using case
insensitive string comparison functions, by temporarily converting the data
held in string objects to a single case, or by creating a case insensitive
string class which replaces the character traits used to create the
basic_string object.
[4] Much of the
material in this chapter was originally created by Nancy
Nicolaisen.
[5] I subsequently
found better tools to accomplish this task, but the program is still
interesting.