4.2.3 Counting Words

The next task we will consider is counting words in an input text file (a file of characters). A word is a sequence of characters separated by delimiters, namely, white space or punctuation. The first word may or may not be preceded by a delimiter and we will assume the last word is terminated by a delimiter.

Task

CNT: Count the number of characters, words, and lines in the input stream until end of file.

Counting characters and lines is simple: a counter, chrs, can be incremented every time a character is read, and a counter, lns, can be incremented every time a newline character is read. Counting words requires us to know when a word starts and when it ends as we read the sequence of characters. For example, consider the sequence:

Lucky     luck
^    ^    ^   ^

We have shown the start and the end of each word by the symbol ^. There are several cases to consider:

As long as no word has started yet AND the next character read is a delimiter, no new word has started.
If no word has started AND the next character read is NOT a delimiter, then a new word has just started.
If a word has started AND the next character is NOT a delimiter, then the word is continuing.
If a word has started AND the character read is a delimiter, then a word has ended.

We can talk about the state of our text changing from "a word has not started" to "a word has started" and vice versa. We can use a variable, inword, as a flag to keep track of whether a word has started or not. It will be set to True if a word has started; otherwise, it will be set to False. If inword is False AND the character read is NOT a delimiter, then a word has started, and inword becomes True. If inword is True AND the new character read is a delimiter, then the word has ended and inword becomes False. With this information about the state, we can count a word either when it starts or when it ends. We choose the former, so each time the flag changes from False to True, we will increment the counter, wds. The algorithm is:

     initialize all counters to zero, set inword to False
     while the character read, ch, is not EOF
          increment character count chrs
          if ch is a newline
               increment line count lns
          if NOT inword AND ch is NOT delimiter
               increment word count wds
               set inword to True
          else if inword AND ch is delimiter
               set inword to False
     print results

We first count characters and newlines. After that, only changes in the state, inword, need to be considered; otherwise we ignore the character and read in the next one. Each time the flag changes from False to True, we count a word. We will use a function delimitp() that checks if a character is a delimiter, i.e. if it is white space or punctuation. (The name delimitp stands for "delimit predicate" because it tests if its argument is a delimiter and returns True or False.) White space and punctuation, in turn, will be tested by other functions. The code for the driver is shown in Figure 4.13.

After printing the program title, the counts are initialized:

lns = wds = chrs = 0;

Assignment operators associate from right to left so the rightmost operator is evaluated first; chrs is assigned 0, and the value of the assignment operation is 0. This value, 0, is then assigned to wds, and the value of that operation is 0. Finally, that value is assigned to lns, and the value of the whole expression is 0. Thus, the statement initializes all three variables to 0 as a concise way of writing three separate assignment statements.
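To see the right-to-left grouping concretely, the following minimal sketch places the chained form next to three separate assignments and checks that the results agree. The function name chained_init_matches is hypothetical and is used only for this illustration.

```c
/* Illustration only: chained_init_matches() is a hypothetical helper,
   not part of the program described in the text. */
int chained_init_matches(void)
{
    int lns, wds, chrs;      /* initialized by the chained form */
    int a, b, c;             /* initialized by separate statements */

    lns = wds = chrs = 0;    /* groups as lns = (wds = (chrs = 0)); */

    c = 0;                   /* rightmost assignment happens first */
    b = c;                   /* its value is then assigned leftward */
    a = b;

    /* Returns 1 when both forms leave the same values behind. */
    return lns == a && wds == b && chrs == c;
}
```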

The program driver follows the algorithm very closely. The function delimitp() is used to test if a character is a delimiter and is yet to be written. Otherwise, the program is identical to the algorithm. It counts every character and every newline, and it counts a word each time the flag inword changes from False to True.
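As a rough illustration of how the algorithm maps onto C, here is a minimal sketch of the counting loop. It is not the listing of Figure 4.13. The struct counts, the functions count_char() and count_stream(), and the stand-in delimiter test is_delim_sketch() are all assumptions introduced for this sketch; the program in the text uses delimitp() from chrutil.c instead, and the exact punctuation set it accepts may differ.

```c
#include <stdio.h>

#define TRUE  1
#define FALSE 0

/* Hypothetical container for the three counters and the state flag. */
struct counts {
    long lns, wds, chrs;
    int inword;
};

/* Assumption: a stand-in for delimitp() so the sketch compiles alone. */
static int is_delim_sketch(int c)
{
    return c == ' ' || c == '\t' || c == '\n' ||
           c == '.' || c == ',' || c == ';' ||
           c == ':' || c == '?' || c == '!';
}

/* Process one character: the body of the while loop in the algorithm. */
void count_char(struct counts *k, int ch)
{
    k->chrs++;
    if (ch == '\n')
        k->lns++;
    if (!k->inword && !is_delim_sketch(ch)) {
        k->wds++;                /* a word is counted when it starts */
        k->inword = TRUE;
    } else if (k->inword && is_delim_sketch(ch)) {
        k->inword = FALSE;       /* the word has ended */
    }
}

/* Read characters from in until EOF and return the totals. */
struct counts count_stream(FILE *in)
{
    struct counts k = {0, 0, 0, FALSE};
    int ch;                      /* int, not char, so EOF is distinguishable */

    while ((ch = getc(in)) != EOF)
        count_char(&k, ch);
    return k;
}
```

A main() built on this sketch could call count_stream(stdin) and print the three fields of the result. Keeping the per-character step in its own function is a design choice of the sketch, made so that the state transitions can be exercised without a file.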

Source File Organization

We can add the source code for delimitp() to the source file chrutil.c we have been building with character utility functions. In the last section we wrote a dummy driver in that file to test our utilities. Since we would like to use these utilities in many different programs, we should not have to keep copying a driver into this file. We will soon see how the code in chrutil.c will be made a part of the above program without combining the two files into one (and without using the #include directive to include a code file). In our program file, cnt.c, we also include two header files besides stdio.h. These are: tfdef.h which defines symbolic constants TRUE and FALSE; and chrutil.h which declares prototypes for the functions defined in chrutil.c and any related macros. Since we use these constants and functions in main(), we should include the header files at the head of our source file. Figure 4.14 shows the file tfdef.h and the additions to chrutil.h.
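As a rough idea of what such headers might contain, here is a minimal sketch showing both files in one listing. The include guards, the values 1 and 0, and the int parameter and return types are assumptions made for this sketch; the actual files may differ in all of these details.

```c
/* ---- tfdef.h (sketch): symbolic constants for truth values ---- */
#ifndef TFDEF_H
#define TFDEF_H

#define TRUE  1   /* assumed value */
#define FALSE 0   /* assumed value */

#endif /* TFDEF_H */

/* ---- chrutil.h (sketch): only the prototypes added in this section ---- */
#ifndef CHRUTIL_H
#define CHRUTIL_H

int delimitp(int c);   /* TRUE if c is white space or punctuation */
int whitep(int c);     /* TRUE if c is white space */
int punctp(int c);     /* TRUE if c is a punctuation mark */

#endif /* CHRUTIL_H */
```

In a real project the two parts would live in separate files, and a program file such as cnt.c would pull them in with #include "tfdef.h" and #include "chrutil.h".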

The function delimitp() tests if a character is white space or punctuation. It uses two functions for its tests: whitep(), which tests if a character is white space, and punctp(), which tests if a character is punctuation. (We could have also implemented these as macros, but chose functions in this case.) All these functions are added to the source file, chrutil.c, and are shown in Figure 4.15.

This source file also includes tfdef.h and chrutil.h: the functions in the file use the symbolic constants TRUE and FALSE defined in tfdef.h, and delimitp() needs the prototypes for whitep() and punctp() declared in chrutil.h.

The source code for the functions is simple enough: delimitp() returns TRUE if its parameter, c, is either white space or punctuation; whitep() returns TRUE if c is a space, newline, or tab; and punctp() returns TRUE if c is one of the punctuation marks shown. All functions return FALSE if the primary test is not satisfied.
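The three tests described above might be sketched as follows. This is not the listing of Figure 4.15. In particular, the set of punctuation marks in punctp() is an assumption chosen for illustration, as are the int parameter types; TRUE and FALSE are defined locally here only so the sketch compiles without tfdef.h.

```c
#define TRUE  1
#define FALSE 0

int whitep(int c);
int punctp(int c);

/* TRUE if c is a delimiter, i.e. white space or punctuation. */
int delimitp(int c)
{
    if (whitep(c) || punctp(c))
        return TRUE;
    return FALSE;
}

/* TRUE if c is a space, newline, or tab. */
int whitep(int c)
{
    if (c == ' ' || c == '\n' || c == '\t')
        return TRUE;
    return FALSE;
}

/* TRUE if c is a punctuation mark.
   Assumption: this particular set of marks is illustrative only. */
int punctp(int c)
{
    if (c == '.' || c == ',' || c == ';' ||
        c == ':' || c == '?' || c == '!')
        return TRUE;
    return FALSE;
}
```

With these definitions, delimitp(' ') and delimitp('.') both return TRUE, while delimitp('a') returns FALSE.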

Our entire program is now contained in the two source files cnt.c and chrutil.c which must be compiled separately and linked together to create an executable code file. Commands to do so are implementation dependent; but on Unix systems, the shell command line is:

cc -o cnt cnt.c chrutil.c

The command will compile cnt.c to the object file, cnt.o, then compile chrutil.c to the object file, chrutil.o, and finally link the two object files as well as any standard library functions into an executable file, cnt, as directed by the -o cnt option. (If the -o option is omitted, the executable file will be called a.out.) For other systems, the commands are generally similar; for example, compilers for many personal computers also provide an integrated environment which allows one to edit, compile, and run programs. In such an environment, the programmer may be asked to prepare a project file listing all source files. Once a project file is prepared and the project option activated, a simple command compiles the source files, links them into an executable file, and executes the program. Consult your implementation manuals for details. This technique of splitting the source code for an entire program into multiple files is called separate compilation and is a good practice as programs grow larger.

Once the above two files, cnt.c and chrutil.c, are compiled and linked, the resulting program may be executed, producing a sample session as shown below:

***Line, Word, Character Count Program***

Type characters, EOF to quit
Now is the time for all good men
To come to the aid of their country.
^D
Lines = 2, Words = 16, Characters = 70

Henceforth, we will assume separate compilation of source code whenever it is spread over more than one file. Since main() is the program driver, we will refer to the source file that contains main() as the program file. Other source files needed for a complete program will be listed in the comment at the head of the program file. In the comment, we will also list header files needed for the program. Refer to cnt.c in Figure 4.13 for an example of a listing which enumerates all the files needed to build an executable program. (The file stdio.h is not listed since it is assumed to be present in all source files.)

Header files typically include groups of related symbolic constant definitions and/or prototype declarations. Source files typically contain definitions of functions used by one or more program files. We will organize our code so that a source file contains the code for a related set of functions, and a header file with the same name contains prototype declarations for these functions, e.g. chrutil.c and chrutil.h. As we add source code for new functions to the source files, corresponding prototypes will be assumed to be added in the corresponding header files.

Separate compilation has several advantages. Program development can take place in separate modules, and each module can be separately compiled, tested, and debugged. Once debugged, a compiled module need not be recompiled but merely linked with other separately compiled modules. If changes are made in one of the source modules, only that source module needs recompiling and linking with other already compiled modules. Furthermore, compiled modules of useful functions can be used and reused as building blocks to create new and diverse programs.

In summary, separate compilation saves compilation time during program development, allows development of compiled modules of useful functions that may be used in many diverse programs, and makes debugging easier by allowing incremental program development.

tep@wiliki.eng.hawaii.edu
Wed Aug 17 08:29:11 HST 1994