The next task we will consider is counting words in an input text file (a file of characters). A word is a sequence of characters separated by delimiters, namely, white space or punctuation. The first word may or may not be preceded by a delimiter and we will assume the last word is terminated by a delimiter.
CNT: Count the number of characters, words, and lines in the input stream until end of file.
Counting characters and lines is simple: a counter, chrs, can be incremented every time a character is read, and a counter, lns, can be incremented every time a newline character is read. Counting words requires us to know when a word starts and when it ends as we read the sequence of characters. For example, consider the sequence:
    Lucky luck
    ^    ^^   ^

We have shown the start and the end of each word by the symbol ^. There are several cases to consider:
    initialize all counters to zero, set inword to False
    while the character read, ch, is not EOF
        increment character count chrs
        if ch is a newline
            increment line count lns
        if NOT inword AND ch is NOT delimiter
            increment word count wds
            set inword to True
        else if inword AND ch is delimiter
            set inword to False
    print results

We first count characters and newlines. After that, only changes in the state, inword, need to be considered; otherwise we ignore the character and read in the next one. Each time the flag changes from False to True, we count a word. We will use a function delimitp() that checks if a character is a delimiter, i.e. if it is white space or punctuation. (The name delimitp stands for ``delimit predicate'' because it tests if its argument is a delimiter and returns True or False.) White space and punctuation, in turn, will be tested by other functions. The code for the driver is shown in Figure 4.13.
After printing the program title, the counts are initialized:
    lns = wds = chrs = 0;

Assignment operators associate from right to left, so the rightmost operator is evaluated first; chrs is assigned 0, and the value of the assignment operation is 0. This value, 0, is then assigned to wds, and the value of that operation is 0. Finally, that value is assigned to lns, and the value of the whole expression is 0. Thus, the statement initializes all three variables to 0; it is a concise way of writing three separate assignment statements.
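As a small illustration of the right-to-left grouping just described, the sketch below writes the chained assignment with explicit parentheses. The function name chained_assign_ok() is hypothetical and is not part of cnt.c; it exists only to make the value of the whole expression visible.

```c
/* Hypothetical helper (not part of cnt.c): shows that a chained
 * assignment groups from right to left and yields the assigned value. */
int chained_assign_ok(int v)
{
    int lns, wds, chrs;

    /* Same as: lns = wds = chrs = v; */
    int value = (lns = (wds = (chrs = v)));

    /* Every variable, and the expression itself, now holds v. */
    return value == v && lns == v && wds == v && chrs == v;
}
```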
The program driver follows the algorithm very closely. The function delimitp(), which tests if a character is a delimiter, is yet to be written. Otherwise, the program is identical to the algorithm. It counts every character and every newline, and it counts a word each time the flag inword changes from False to True.
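The counting loop described above can be sketched as follows. This is not the listing from Figure 4.13 but a minimal version written under stated assumptions: the loop is wrapped in a hypothetical function count_stream() so it can be applied to any open stream, TRUE and FALSE are defined locally rather than taken from tfdef.h, and is_delim() is a stand-in for delimitp() built on the standard <ctype.h> tests.

```c
#include <stdio.h>
#include <ctype.h>

#define TRUE  1
#define FALSE 0

/* Stand-in for delimitp(), which the text develops separately: here a
 * delimiter is anything <ctype.h> classifies as white space or punctuation. */
static int is_delim(int c)
{
    return isspace(c) || ispunct(c);
}

/* Hypothetical wrapper around the counting loop: reads fp until EOF and
 * stores the character, word, and line counts through the pointers. */
void count_stream(FILE *fp, long *chrs, long *wds, long *lns)
{
    int ch;                 /* int, not char, so that EOF can be detected */
    int inword = FALSE;     /* TRUE while we are inside a word */

    *lns = *wds = *chrs = 0;

    while ((ch = getc(fp)) != EOF) {
        ++*chrs;                              /* count every character */
        if (ch == '\n')
            ++*lns;                           /* count every newline */
        if (!inword && !is_delim(ch)) {       /* a word starts here */
            ++*wds;
            inword = TRUE;
        } else if (inword && is_delim(ch)) {  /* the word has ended */
            inword = FALSE;
        }
    }
}
```

Calling count_stream(stdin, &chrs, &wds, &lns) from main() and printing the three counts would give a driver along the lines the text describes.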
We can add the source code for delimitp() to the source file chrutil.c, which we have been building up with character utility functions. In the last section we wrote a dummy driver in that file to test our utilities. Since we would like to use these utilities in many different programs, we should not have to keep copying a driver into this file. We will soon see how the code in chrutil.c will be made a part of the above program without combining the two files into one (and without using the #include directive to include a code file).
In our program file, cnt.c, we also include two header files besides stdio.h. These are tfdef.h, which defines the symbolic constants TRUE and FALSE, and chrutil.h, which declares prototypes for the functions defined in chrutil.c and any related macros. Since we use these constants and functions in main(), we should include the header files at the head of our source file. Figure 4.14 shows the file tfdef.h and the additions to chrutil.h.
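For readers without the figure at hand, the two header files might look roughly like the sketch below. Both files are shown in one block, separated by comments. The include guards, the values 1 and 0, and the int parameter and return types are assumptions made for this sketch rather than details taken from Figure 4.14.

```c
/* ---- tfdef.h (sketch): symbolic constants for truth values ---- */
#ifndef TFDEF_H
#define TFDEF_H

#define TRUE  1
#define FALSE 0

#endif /* TFDEF_H */

/* ---- chrutil.h (sketch): prototypes added for the delimiter tests ---- */
#ifndef CHRUTIL_H
#define CHRUTIL_H

int delimitp(int c);   /* TRUE if c is white space or punctuation */
int whitep(int c);     /* TRUE if c is white space */
int punctp(int c);     /* TRUE if c is punctuation */

#endif /* CHRUTIL_H */
```

The guards simply make it harmless to include a header more than once in the same source file.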
The function delimitp() tests if a character is white space or punctuation. It uses two functions for its tests: whitep(), which tests if a character is white space, and punctp(), which tests if a character is punctuation. (We could also have implemented these as macros, but chose functions in this case.) All these functions are added to the source file, chrutil.c, and are shown in Figure 4.15.
This source file also includes tfdef.h and chrutil.h: the functions in the file use the symbolic constants TRUE and FALSE defined in tfdef.h, and they also need the prototypes for whitep() and punctp() declared in chrutil.h.
The source code for the functions is simple enough: delimitp() returns TRUE if its parameter, c, is either white space or punctuation; whitep() returns TRUE if c is a space, newline, or tab; and punctp() returns TRUE if c is one of the punctuation marks shown. All functions return FALSE if the primary test is not satisfied.
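The three functions just described can be sketched as below. This is an illustration rather than the listing in Figure 4.15. In particular, the set of punctuation marks tested in punctp() is an assumption, and TRUE, FALSE, and the prototypes are written out inline so that the block stands alone; in chrutil.c they would come from tfdef.h and chrutil.h.

```c
#define TRUE  1        /* normally supplied by tfdef.h */
#define FALSE 0

int whitep(int c);     /* normally declared in chrutil.h */
int punctp(int c);

/* Returns TRUE if c is white space or punctuation, FALSE otherwise. */
int delimitp(int c)
{
    if (whitep(c) || punctp(c))
        return TRUE;
    return FALSE;
}

/* Returns TRUE if c is a space, newline, or tab, FALSE otherwise. */
int whitep(int c)
{
    if (c == ' ' || c == '\n' || c == '\t')
        return TRUE;
    return FALSE;
}

/* Returns TRUE if c is one of an assumed set of punctuation marks;
 * the exact set used in the original figure is not reproduced here. */
int punctp(int c)
{
    if (c == '.' || c == ',' || c == ';' || c == ':' || c == '?' || c == '!')
        return TRUE;
    return FALSE;
}
```

With this assumed set, characters such as the apostrophe and the hyphen are not delimiters, so a string like "don't" would count as a single word.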
Our entire program is now contained in the two source files cnt.c and chrutil.c which must be compiled separately and linked together to create an executable code file. Commands to do so are implementation dependent; but on Unix systems, the shell command line is:
    cc -o cnt cnt.c chrutil.c

The command will compile cnt.c to the object file cnt.o, then compile chrutil.c to the object file chrutil.o, and finally link the two object files, together with any standard library functions, into an executable file, cnt, as directed by the -o cnt option. (If the -o option is omitted, the executable file will be called a.out.) For other systems, the commands are generally similar; for example, compilers for many personal computers also provide an integrated environment which allows one to edit, compile, and run programs. In such an environment, the programmer may be asked to prepare a project file listing all source files. Once a project file is prepared and the project option activated, a simple command compiles the source files, links them into an executable file, and executes the program. Consult your implementation manuals for details. This technique of splitting the source code for an entire program into multiple files is called separate compilation and is good practice as programs grow larger.
Once the two files, cnt.c and chrutil.c, are compiled and linked, the resulting program may be executed, producing a sample session as shown below:
Henceforth, we will assume separate compilation of source code whenever it is spread over more than one file. Since main() is the program driver, we will refer to the source file that contains main() as the program file. Other source files needed for a complete program will be listed in the comment at the head of the program file. In the comment, we will also list header files needed for the program. Refer to cnt.c in Figure 4.13 for an example of a listing which enumerates all the files needed to build an executable program. (The file stdio.h is not listed since it is assumed to be present in all source files.)
Header files typically include groups of related symbolic constant definitions and/or prototype declarations. Source files typically contain definitions of functions used by one or more program files. We will organize our code so that a source file contains the code for a related set of functions, and a header file with the same name contains prototype declarations for these functions, e.g. chrutil.c and chrutil.h. As we add source code for new functions to the source files, corresponding prototypes will be assumed to be added in the corresponding header files.
Separate compilation has several advantages. Program development can take place in separate modules, and each module can be separately compiled, tested, and debugged. Once debugged, a compiled module need not be recompiled but merely linked with other separately compiled modules. If changes are made in one of the source modules, only that source module needs recompiling and linking with other already compiled modules. Furthermore, compiled modules of useful functions can be used and reused as building blocks to create new and diverse programs. In summary, separate compilation saves compilation time during program development, allows development of compiled modules of useful functions that may be used in many diverse programs, and makes debugging easier by allowing incremental program development.