Previous: 4.1.1 The ASCII Character Set
Up: 4.1 A New Data Type: char
Next: 4.1.3 Character I/O Using getchar() and putchar()

4.1.2 Operations on Characters

As we just saw, in C, characters have numeric values and, therefore, may be used in numeric expressions. It is the ASCII code value of a character that is used in these expressions. For example (referring to Table 4.2), the value of 'a' is 97, and that of 'A' is 65. So, the expression 'a' - 'A' is evaluated as 97 - 65, which is 32. As we shall see, this ability to do arithmetic with character data simplifies character processing. When a character variable or constant appears in an expression, it is replaced by its ASCII value of type integer. When a character cell is assigned an integer value, the value is interpreted as an ASCII value. In other words, a character and its ASCII value are used interchangeably as required by the context. While a cast operator can be used, we do not need it to go from character type to integer type, and vice versa. Here are some expressions using character variables and constants.

     ch = 97;                 /* ch <--- ASCII value 97, i.e., 'a' */
     ch = '\141';             /* ch <--- 'a'; octal 141 is decimal 97 */
     ch = '\x61';             /* ch <--- 'a'; hexadecimal 61 is decimal 97 */
     ch = 'a';                /* ch <--- 'a' */

     ch = ch - 'a' + 'A';     /* ch <--- 'A' */

     ch = 'd';
     ch = ch - 'a' + 'A';     /* ch <--- 'D' */
     ch = ch - 'A' + 'a';     /* ch <--- 'd' */

The first group of four statements merely assigns lower case 'a' to ch in four different ways: the first assigns a decimal ASCII value, the second assigns a character in octal form, the third assigns a character in hexadecimal form, the fourth assigns a character in a printable symbolic form. All of these statements have exactly the same effect.

The next statement, after the first group, assigns the value of an expression to ch. The right hand side of the assignment is:

ch - 'a' + 'A'
Since the value of ch is 'a' from the previous four statements, the above expression evaluates to the value of 'a' - 'a' + 'A', i.e. the value of 'A'. In other words, the right hand side expression converts lower case 'a' to its upper case version, 'A', which is then assigned to ch. Since the values of lower case letters are contiguous and increasing (as are those of upper case letters) 'a' is less than 'b', 'b' less than 'c', and so forth. Also, the offset value of each letter from the base of the alphabet is the same for lower case letters as it is for upper case letters. For example, 'd' - 'a' is the same as 'D' - 'A'. So, if ch is any lower case letter, then the expression
ch - 'a' + 'A'
results in the upper case version of ch. This is because the value of ch - 'a' is the offset of ch from the lower case base 'a'; adding that value to the upper case base 'A' results in the upper case version of ch. So for example, if ch is 'f' then the value of the above expression is 'F'. Similarly, if ch is an upper case letter, then the expression
ch - 'A' + 'a'
results in the lower case version of ch which may then be assigned to a variable.

Using this fact, the last group of three statements in the above set of statements first assigns a lower case letter 'd' to ch. Then the lower case value of ch is converted to its upper case version, and then back to lower case.

As we mentioned, all lower case and upper case letters have contiguous and increasing values. The same is true for digit characters. Such a contiguous ordering makes it easy to test if a given character, ch, is a lower case letter, an upper case letter, or a digit. For example, any lower case letter has a value that is greater than or equal to that of 'a' AND less than or equal to that of 'z'. From this, we can write a C expression that is True if and only if ch is a lower case letter:

(ch >= 'a' && ch <= 'z')

Here is a code fragment that checks whether a character is a lower case letter, an upper case letter, a digit, etc.

     if (ch >= 'a' && ch <= 'z')
          printf("%c is a lower case letter\n", ch);
     else if (ch >= 'A' && ch <= 'Z')
          printf("%c is an upper case letter\n", ch);
     else if (ch >= '0' && ch <= '9')
          printf("%c is a digit symbol\n", ch);
     else
          printf("%c is neither a letter nor a digit\n", ch);
Observe the multiway decision and branch: if ... else if ... else if ... else. Only one of the branches is executed. The first if expression checks if the value of ch is between the values of 'a' and 'z', a lower case letter. Only if ch is not a lower case letter, does control proceed to the first else if part, which tests if ch is an upper case letter. Only if ch is not an upper case letter, does control proceed to the next else if part, which tests if ch is a digit. Finally, if ch is not a digit, the last else part is executed. Depending on the value of ch, only one of the paths is executed with its corresponding printf() statement.

Let us see how the expression

(ch >= 'a' && ch <= 'z')
is evaluated. First, the comparison ch >= 'a' is performed; then, ch <= 'z' is evaluated (if the first comparison is False, the AND operator already yields False and the second comparison is skipped); finally, the results of the two sub-expressions are logically combined by the AND operator. The expression is grouped this way because the precedence of the binary relational operators (>=, <=, ==, etc.) is higher than that of the binary logical operators (&&, ||). We could have used parentheses for clarity, but the precedence rules ensure the expression is evaluated as desired.

One very common error is to write the above expression analogous to mathematical expressions:

('a' <= ch <= 'z')
This would not be found to be an error by the compiler, but the effect will not be as expected. In the above expression, since the precedence of the operators is the same, they will be evaluated from left to right according to their associativity. The result of 'a' <= ch will be either True or False, i.e. 1 or 0, which will then be compared with 'z'. The result will be True since 1 or 0 is always less than 'z' (ASCII value 122). So the value of the above expression will always be True regardless of the value of ch --- not what we would expect.

Let's write a program using all this information. Our next task is to read characters until end of file and to print each one with its ASCII value and what we will call the attributes of the character. The attributes are a character's category, such as a lower case or an upper case letter, a digit, a punctuation mark, a control character, or a special symbol.

Task

ATTR: For each character input, print out its category and ASCII value in decimal, octal, and hexadecimal forms.

The algorithm requires a multiway decision for each character read. A character can only be in one category, so each character read will lead to the execution of one of the paths in a multiway decision. Here is the algorithm.

     read the first character
     repeat as long as end of file is not reached
          if the character is a lower case letter
               print the various character representations, and
               print that it is a lower case letter
          else if it is an upper case letter
               print the various character representations, and
               print that it is an upper case letter
          else if it is a digit
               print the various character representations, and
               print that it is a digit
          etc.
          read the next character
Notice we have abstracted the printing of the various representations of the character (as a character and its ASCII value in decimal, octal and hex) into a single step in the algorithm: print the various character representations, and we perform the same step in every branch of the algorithm. This is a classic situation calling for the use of a function: abstract the details of an operation and use that abstraction in multiple places. The code implementing the above algorithm is shown in Figure 4.2.

We have declared a function print_reps() which is passed a single character argument and expect it to print the various representations of the character. We have used the function in the driver without knowing how print_reps() will perform its task.

We must now write the function print_reps(). The character's value is its ASCII value. When the character value is printed as a character with conversion specification %c, the symbol is printed; when printed as a decimal integer with conversion specification %d, the ASCII value is printed in decimal form. Conversion specification %o prints an integer value in octal form, and %x prints an integer value in hexadecimal form. We simply need a printf() call with these four conversion specifiers to print the character four times. The code for print_reps() is shown in Figure 4.3.

The function simply prints its parameter as a character, a decimal integer, an octal integer, and a hexadecimal integer.

Sample Session:

The last line printed refers to the newline character. Remember, every character including the newline is placed in the keyboard buffer for reading and, while scanf() skips over leading white space when reading a numeric data item, it does not do so when reading a character.

Can we improve this program? The driver (main()) spells out every detail of the character tests, which obscures the logic of what is being performed and makes the code less readable. Perhaps we should define a set of macros to hide the details of the character testing expressions. For example, we might write a macro:

     #define IS_LOWER(ch)     ((ch) >= 'a' && (ch) <= 'z')
Then the first if test in main() would be coded as:
     if ( IS_LOWER(ch) )  {
          ...
which directly expresses the logic of the program. The remaining expressions can be recoded using macros similarly and this is left as an exercise at the end of the chapter.

One other thought may occur to us to further improve the program. Can we make the function print_reps() a little more abstract and have it print the various representations as well as the category? To do this we would have to give additional information to our new function, which we will call print_category(). We need to tell print_category() the character to print as well as its category. To pass the category, we assign a unique code to each category and pass the appropriate code value to print_category(). To avoid using "magic numbers" we define the following macros:

#define  LOWER     0
#define  UPPER     1
#define  DIGIT     2
#define  PUNCT     3
#define  SPACE     4
#define  CONTROL   5
#define  SPECIAL   6
Placing these defines (together with the comparison macros) in a header file, category.h, we can now recode the program as shown in Figure 4.4.

The code for print_category() is also shown. Looking at this code, it may seem inefficient in that we are testing the category twice: once in main() using the character, and again in print_category() using the encoded parameter. Later in this chapter we will see another way to code the test in print_category() which is more efficient and even more readable. The contents of the header file, category.h, are left as an exercise. The program shown in Figure 4.4 will behave exactly the same as the code in Figure 4.2, producing the same sample session shown earlier.




tep@wiliki.eng.hawaii.edu
Wed Aug 17 08:29:11 HST 1994