BR Publishing KWIC Programmer's Manual


Contents

Source Code Documentation


Introduction

KWIC is a program that prints Key Words In Context. It reads in one or more files, parses them, and prints the user-designated "keywords" in a variety of ways according to the user's specifications.

Programmer's Manual

How to Compile KWIC and Necessary File List

The KWIC program can be compiled by typing "make -f Makefile.lib" in an X-term window. Note that this creates both the library libprop.so and the kwic executable needed to run KWIC. To create just the library, type "make -f Makefile.lib libprop.so"; this builds the library from the propedit.cpp and properties.cpp files. If the library has already been created, you can type "make kwic" to build an executable that links against the existing dynamic library (-lprop). Note that you must be in a UNIX environment to build KWIC. Once compiled, the program can be run by typing kwic [commandline options] filenames at the command prompt. The only required item is the filename(s) or directory to parse. The complete explanation of the command line context is given page.

The files necessary for Make to build kwic are the Makefile itself, Makefile.lib (to create the library), kwic.cpp, kwicparser.cpp / kwicparser.h, properties.cpp / properties.h, word.h, wordholder.cpp / wordholder.h, stdprinter.cpp / stdprinter.h, htmlprinter.cpp / htmlprinter.h, htmltags.cpp / htmltags.h, filter.h, minfilter.cpp / minfilter.h, comfilter.cpp / comfilter.h, regfilter.cpp / regfilter.h, myinsertiterator.h, and comparitor.h.

Basic Program Structure Overview

The main() function lives in kwic.cpp. It creates a Properties object, which resolves the command line parameters, .properties file, environment variables, and default values, and it creates a filter that decides which words are keywords. Then it creates a KwicParser, which reads in the files, sorts the important words, and later prints them. The KwicParser contains a wordHolder object, which hides the vector of Word objects and the vector of iterators from the user. The wordHolder has a print call that creates a new STDPrinter pointer, choosing the correct printer based on the bool values stored in the Properties object passed to it. The printer then prints the vector of words using the iterators passed to it from the vector of iterators in wordHolder. The sorting is based on the information in the Properties object passed to it. Finally, HTML printing is done by the HTMLPrinter class, which is the base class for several different HTML printers that print differently depending on the user-specified options.

kwic.cpp


kwic.cpp: This is where the main() function is. First, the default values for the Properties object are set at the beginning of the program. Then, a new Properties object is created to take the command line values, .properties file, and environment variables and store them for later use. Then, a new Filter is created from a minFilter, comFilter, and regFilter. Next, a KwicParser is created, which takes the Properties object and the newly created Filter. This class does all of the input handling needed to store the vector of Words. Finally, the print function of KwicParser is called. This function calls all the other functions necessary for printing.

Properties Class and propEdit Class


properties.cpp / properties.h: This class was created to parse the command line, read from a .properties file, and read from environment variables. It can do all of these things without knowing anything specific about the program that calls it, which is why it made sense to build it as a library. There are four levels of precedence for values specified in more than one place. The command line overrides everything, the environment variables come next, then the kwic.properties file, and lastly, if nothing else has been specified, the default values are used. This class uses the propEdit class to parse the .properties file. Once everything has been parsed, the getStrings, getBools, and getInts functions each take a string key (like "before" or "occurrences") and return the value associated with that key according to the precedence hierarchy listed above. Because this class is so versatile, and because nothing has to be hardcoded in, we compiled it into a library for use by anyone at a later time.
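
The precedence rule above can be sketched as a lookup over four tables ordered from highest to lowest priority. This is only an illustration under stated assumptions: Table and lookup are hypothetical names, not the actual interface of the Properties class.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical type: one key/value table per source of settings.
using Table = std::map<std::string, std::string>;

// Return the value for `key` from the first table that defines it.
// `sources` is ordered from highest to lowest priority, for example
// command line, environment, .properties file, defaults.
// Returns an empty string if no table defines the key.
std::string lookup(const std::vector<const Table*>& sources,
                   const std::string& key) {
    for (const Table* table : sources) {
        Table::const_iterator it = table->find(key);
        if (it != table->end()) {
            return it->second;
        }
    }
    return "";
}
```

With the tables passed in that order, a key given on the command line hides the same key in the environment, which in turn hides the .properties file and the defaults.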


propedit.cpp / propedit.h: This class works with the properties class to parse a .properties file. It is referred to inside the Properties Class, where it is created with the "program name.properties" parameter as the filename. A .properties file has KEY=VALUE pairs that are read in and placed in a map (the KEY term is lowercased). Then, the Properties class just asks for a certain element, like "before" or "min" and it returns the correct value or "" if the value wasn't specified in the .properties file.

KwicParser Class, myinsertiterator.h, and comparitor.h


kwicparser.cpp / kwicparser.h; myinsertiterator.h: This class exists mainly to hide the wordHolder class from the user and to handle input storage with the STL. It works very closely with myinsertiterator.h to read in a file. The KwicParser constructor takes a filter and a Properties object as parameters. It cycles through each filename and calls a copy command on the input stream and the wordHolder "myHolder", which stores all the necessary data in myHolder. myinsertiterator.h defines an insertion iterator that keeps track of the current word and line number as it cycles through the input stream. On every call of its assignment operator (=), it calls the evaluate(string, int, string) function of myHolder to store each individual set of (word, linenum, filename). Then, KwicParser calls the sort function of myHolder to alphabetize the important words. Finally, KwicParser has a print routine, which just tells myHolder to print itself.
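
As a rough illustration of the insertion iterator described above, the sketch below shows an output iterator whose assignment operator forwards each word to an evaluate call. Holder, HolderInserter, and readInto are hypothetical stand-ins, not the real wordHolder or myinsertiterator.h code. The sketch also makes a simplifying assumption: it reads the stream one line at a time and hands the line number to the iterator, whereas the original iterator tracks the line number itself.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Stand-in for wordHolder: records every (word, line, filename) triple.
struct Holder {
    struct Entry {
        std::string word;
        int line;
        std::string file;
    };
    std::vector<Entry> entries;

    void evaluate(const std::string& word, int line, const std::string& file) {
        Entry e = {word, line, file};
        entries.push_back(e);
    }
};

// Output iterator: assigning a word to it calls Holder::evaluate.
class HolderInserter {
public:
    typedef std::output_iterator_tag iterator_category;
    typedef void value_type;
    typedef std::ptrdiff_t difference_type;
    typedef void pointer;
    typedef void reference;

    HolderInserter(Holder& holder, int lineNo, std::string file)
        : holder_(&holder), lineNo_(lineNo), file_(std::move(file)) {}

    HolderInserter& operator=(const std::string& word) {
        holder_->evaluate(word, lineNo_, file_);
        return *this;
    }
    // Dereference and increment do nothing; only assignment has an effect.
    HolderInserter& operator*() { return *this; }
    HolderInserter& operator++() { return *this; }
    HolderInserter operator++(int) { return *this; }

private:
    Holder* holder_;
    int lineNo_;
    std::string file_;
};

// Feed a stream to the holder one line at a time so the line number is known.
void readInto(Holder& holder, std::istream& in, const std::string& file) {
    std::string text;
    int lineNo = 0;
    while (std::getline(in, text)) {
        ++lineNo;
        std::istringstream words(text);
        std::copy(std::istream_iterator<std::string>(words),
                  std::istream_iterator<std::string>(),
                  HolderInserter(holder, lineNo, file));
    }
}
```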


comparitor.h: This file holds all of the comparators used with the STL sort algorithm on the vector of iterators. There are three comparators, each inheriting from a BaseComp class and overloading the () operator: AlphaComp, NumComp, and LengthComp, which sort Words alphabetically, by number of occurrences, and by word length respectively. Also in comparitor.h is the Comparitor class, which takes a Properties object and uses its values to decide how its own () operator behaves. After choosing the correct BaseComp *, it returns either the value of that comparator's () operator or the ! of it. This saves us from having to write another comparator just to negate a previous one. Note that, being the nice programmers that we are, we templated all of the comparator functions so that they work for any type of object instead of just vector::iterators.

Word Class and wordHolder Class


word.h: This is the class that stores the filename, word, line number, and number of occurrences of a word. We would have made this a struct, except that it needed a setCounts function to set the number of occurrences, because the total number of occurrences of a word isn't known until the entire file has been read. So, there is a public call to access each of the four stored variables, and one call to set the number of occurrences. This one class is where all of the important information for printing and sorting comes from, because there is a Word object for every single word in the file.


wordholder.cpp / wordholder.h: This is probably the single most important class in the entire kwic program. It contains two vectors in which it stores the file contents and the keywords. One vector contains Words and holds every word in all the files. This vector is filled by the calls to the evaluate function made from myinsertiterator.h. The other vector contains iterators that point to the positions of the important words in the first vector. During the call to mySort(), wordHolder goes through the Words and decides which ones to create an iterator to, based on whether they satisfy the conditions of the filter. During that phase, if the user has specified sorting by occurrences, it updates a map with the number of times each keyword has occurred. This avoids the wasted space that would result if it always stored these values, even when the user wasn't sorting Words by number of occurrences. Then, it sorts the iterators using the Comparitor class declared in comparitor.h, which overloads the () operator to compare two iterators based on the values of the Words they point to. Finally, wordHolder has a print call, which takes a Properties object as a parameter. During this print call, it decides whether to create the new STDPrinter * as a plain STDPrinter or as a special HTML-specific printer. It prints the header of the Printer, cycles through the vector of iterators, calls the print routine of the Printer on each iterator (which points to the location of a keyword in the vector of Words), and then prints the footer of the Printer.
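
The two-vector layout described above can be sketched as follows. This is a simplified stand-in, not the real wordHolder: the class and member names are hypothetical, the keyword test is passed in as a plain predicate instead of a Filter, occurrences are always counted, and the sort is alphabetical only (using std::stable_sort so equal words keep their file order).

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Simplified Word: text, line number, and an occurrence count set later.
class Word {
public:
    Word(std::string text, int line)
        : text_(std::move(text)), line_(line), count_(0) {}
    const std::string& text() const { return text_; }
    int line() const { return line_; }
    int count() const { return count_; }
    void setCount(int n) { count_ = n; }

private:
    std::string text_;
    int line_;
    int count_;
};

class Holder {
public:
    typedef std::vector<Word>::iterator Iter;

    // Phase 1: store every word. No iterators are taken yet.
    void evaluate(const std::string& text, int line) {
        words_.push_back(Word(text, line));
    }

    // Phase 2: once the vector is complete, collect iterators to the
    // keywords, count occurrences in a map, and sort the iterators.
    template <class Pred>
    void sortKeywords(Pred isKeyword) {
        keys_.clear();
        std::map<std::string, int> counts;
        for (Iter it = words_.begin(); it != words_.end(); ++it) {
            if (isKeyword(it->text())) {
                keys_.push_back(it);
                ++counts[it->text()];
            }
        }
        for (std::size_t i = 0; i < keys_.size(); ++i) {
            keys_[i]->setCount(counts[keys_[i]->text()]);
        }
        std::stable_sort(keys_.begin(), keys_.end(),
                         [](Iter a, Iter b) { return a->text() < b->text(); });
    }

    const std::vector<Iter>& keywords() const { return keys_; }

private:
    std::vector<Word> words_;  // every word, in file order
    std::vector<Iter> keys_;   // iterators to the keywords only
};
```

Sorting the small vector of iterators leaves the vector of Words in file order, so the context around each keyword is still available for printing.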

Inheritance Hierarchy With Filter Classes


We created a Filter class whose main purpose is to tell whether or not a word is a keyword. What's key about the Filter class and its child classes is that you can construct a filter with another filter as a parameter. Since the KWIC program has three different ways to filter a word (min, concordance, reg_exp), it would be repetitive to check three separate filters to see whether a word is OK. Instead, you can chain filters together in any order you like, and the outermost filter returns true only if all of the filters in the chain return true. So, the code in main looks something like this:

Filter * myFilter = new regFilter(p.getStrings("include"),
                    new comFilter(p.getStrings("exclude"),
                    new minFilter(p.getInts("min"),
                    new Filter())));

filter.cpp / filter.h: This class is the base class for each of the child filter classes. Specifically, the Filter class has an isOk(string) function that returns true if the string is a keyword. The criteria for deciding whether or not a word is a keyword are passed in through the constructor. The isOk function always returns true in this base class, but is overridden in each of the child classes.


minfilter.cpp / minfilter.h: This class inherits from the Filter class and works as described in the Filter class section above. More specifically, a minFilter is created with an int in the constructor. This int is used in the isOk function, where a word is a keyword if its length is greater than or equal to that number.


comfilter.cpp / comfilter.h: This class inherits from the Filter class and works as described in the Filter class section above. More specifically, a comFilter is created with a string holding a filename in the constructor. This file contains the words that a keyword cannot be equal to. So, in the isOk function, a word is tested to see whether it is in the exclude file; if it is, then it's not a keyword, otherwise it is a keyword.


regfilter.cpp / regfilter.h: This class inherits from the Filter class and works as described in the Filter class section above. More specifically, a regFilter is created with a string holding a regular expression. Then, inside the isOk function, a word is tested to see whether it matches the regular expression. If it does, then it is considered a keyword, otherwise it is not.
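
The chaining idea behind these filter classes can be sketched as below. This is an illustration under stated assumptions rather than the project's code: the sketch owns the next filter through std::unique_ptr instead of a raw Filter *, comFilter takes a set of excluded words instead of a filename, and regFilter uses std::regex_match (whole-word match), which may differ from the regular expression library and matching rule of the original.

```cpp
#include <cstddef>
#include <memory>
#include <regex>
#include <set>
#include <string>
#include <utility>

// Base filter: accepts everything and ends the chain.
class Filter {
public:
    virtual ~Filter() {}
    virtual bool isOk(const std::string&) const { return true; }
};

// Accepts words of at least minLength characters.
class minFilter : public Filter {
public:
    minFilter(std::size_t minLength, std::unique_ptr<Filter> next)
        : minLength_(minLength), next_(std::move(next)) {}
    bool isOk(const std::string& word) const override {
        return word.size() >= minLength_ && next_->isOk(word);
    }

private:
    std::size_t minLength_;
    std::unique_ptr<Filter> next_;
};

// Rejects words found in the exclusion set.
class comFilter : public Filter {
public:
    comFilter(std::set<std::string> excluded, std::unique_ptr<Filter> next)
        : excluded_(std::move(excluded)), next_(std::move(next)) {}
    bool isOk(const std::string& word) const override {
        return excluded_.count(word) == 0 && next_->isOk(word);
    }

private:
    std::set<std::string> excluded_;
    std::unique_ptr<Filter> next_;
};

// Accepts words that match the regular expression as a whole.
class regFilter : public Filter {
public:
    regFilter(const std::string& pattern, std::unique_ptr<Filter> next)
        : pattern_(pattern), next_(std::move(next)) {}
    bool isOk(const std::string& word) const override {
        return std::regex_match(word, pattern_) && next_->isOk(word);
    }

private:
    std::regex pattern_;
    std::unique_ptr<Filter> next_;
};
```

Each filter checks its own condition and then asks the next filter in the chain, so the outermost isOk call returns true only when every filter agrees.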

Inheritance Hierarchy With Printer Classes


For the KWIC program, we created an STDPrinter class whose behavior can be changed depending on how you want to print the concordance. This allows for expandability, such as printing in HTML as opposed to on the standard cout. More specifically, there are two levels of inheritance. First, the HTMLPrinter class inherits from the STDPrinter class. In turn, the HTMLPrinter class is the base class for an AlphaHTMLPrinter, an OcurHTMLPrinter, and a LengthHTMLPrinter.


stdprinter.cpp / stdprinter.h: This class is the base class for the HTMLPrinter. Its print function takes an iterator and coordinates the printing of the precontext, the word itself, and the postcontext, using other functions that each print one part of the concordance. The prePrint function takes an iterator and prints all of the words in the precontext up to the word at the iterator. Then, the printMe function prints the word at the iterator in all caps. Next, the postPrint function prints all of the words in the postcontext, starting at the first word after the one pointed to by the iterator. Finally, the printEndofLine function prints the line numbers and filename in which the words occurred. Note that two private variables are set during the print procedure. The int firstLine, set during the prePrint call, is the line number of the first word printed. The int lastLine, set during the postPrint call, is the line number of the last word printed. These two values are used in the printEndofLine call to print the correct line numbers. Note that there are also printHeader and printFooter functions, which are necessary for the inheriting HTMLPrinter but not for the STDPrinter. Therefore, these functions are left empty in STDPrinter, since they don't need to do anything.


htmlprinter.cpp / htmlprinter.h: This class inherits from the STDPrinter class. It uses the HTMLtags class so that HTML tags don't have to be hardcoded into the HTMLPrinter. HTMLtags has syntax where:
mytags.h3("hello") creates: <h3>hello</h3>

This allows for simpler coding and avoids hardcoding the tags inside the HTMLPrinter class. Also inside the HTMLPrinter .cpp and .h files are an AlphaHTMLPrinter, an OcurHTMLPrinter, and a LengthHTMLPrinter. Each of these printers overrides the printName and printHeader functions so that, when printing in HTML with different sorting methods, the table of contents at the beginning is created correctly, along with the links to the words later. AlphaHTMLPrinter prints an alphabetic list, OcurHTMLPrinter prints a list from 1 to 20 of the number of times a word occurs, and LengthHTMLPrinter prints a list from 1 to 20 of the length of the word.
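
A tag helper in the spirit of HTMLtags could be as small as the sketch below. Only the h3 call is taken from the description above; the class name TagWriter and the other members are hypothetical, and the sketch does not escape special characters in the text.

```cpp
#include <string>

// Hypothetical stand-in for HTMLtags: builds tagged strings so that
// callers never spell out angle brackets themselves.
class TagWriter {
public:
    std::string h3(const std::string& text) const { return wrap("h3", text); }
    std::string bold(const std::string& text) const { return wrap("b", text); }

    // A named anchor, usable as a link target inside the same page.
    std::string anchor(const std::string& name, const std::string& text) const {
        return "<a name=\"" + name + "\">" + text + "</a>";
    }

private:
    static std::string wrap(const std::string& tag, const std::string& text) {
        return "<" + tag + ">" + text + "</" + tag + ">";
    }
};
```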

Extra Desired Features

Currently, this KWIC program does everything it was designed to do, with ample use of STL calls and some added command line features. One thing we would like to add is the code shown in class for parsing directories. Our code was much shorter, but not as elegant. We would also like to try a different method for storing words. The current method of storing words in one vector and iterators in another seems OK at the moment, but for very large files it is rather inefficient and therefore slow. Using the PString factory and FileStringStore classes would be a simple solution, if only we had known about the code more than 8 hours before the project was due. Finally, letting the user specify an output file would be a simple and useful command line addition.

Discussion of Encountered Problems

There were a lot of small errors and glitches that all ended up making the entire project take more time. Specifically, we had a lot of trouble at first because we didn't realize that pointers to objects and iterators weren't the same thing. So, instead of creating a vector of Word pointers, we created a vector of vector iterators. This solved the problem of storing the location of the keywords in a different vector. We then tried to do a comparison in the wordHolder evaluate function that would create iterators pointing to the desired location in the vector of Words. But this solution didn't work, for an interesting reason. We resorted to printing out memory addresses to solve our problems, and we came across something interesting. Whenever we created an iterator as the file was being read into the vector of Words, the iterator no longer pointed to the right place once the vector of Words was completed. After many hours of struggling, the cause turned out to be that a vector moves its elements to new memory addresses as it grows: whenever an insertion exceeds the current capacity, the vector reallocates its storage, and every iterator created before that point becomes invalid. Because of this, we had to do the comparison and the storing of the iterators after the entire vector of Words was stored. This solution is slower because it must basically read through the data twice, but since C++ was being mean, we couldn't see any other solution to the problem.
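
The effect described above can be reproduced in a few lines, and the sketch below also shows one alternative that was not used in the project: recording positions as indices, which stay valid when the vector reallocates. Both function names are hypothetical, and the minimum-length test merely stands in for the real filter.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Returns true if growing the vector past its capacity moved element 0.
// The old address is kept as an integer so that no stale pointer is used.
bool storageMovedOnGrowth() {
    std::vector<std::string> words;
    words.push_back("first");
    std::uintptr_t before = reinterpret_cast<std::uintptr_t>(&words[0]);
    std::size_t cap = words.capacity();
    while (words.size() <= cap) {
        words.push_back("filler");  // the last push exceeds the capacity
    }
    std::uintptr_t after = reinterpret_cast<std::uintptr_t>(&words[0]);
    return before != after;
}

// Record keyword positions while reading, as indices instead of iterators.
// An index stays valid even when push_back reallocates the vector.
std::vector<std::size_t> keywordPositions(const std::vector<std::string>& input,
                                          std::size_t minLength,
                                          std::vector<std::string>& words) {
    std::vector<std::size_t> positions;
    for (std::size_t i = 0; i < input.size(); ++i) {
        words.push_back(input[i]);  // may reallocate
        if (input[i].size() >= minLength) {
            positions.push_back(words.size() - 1);
        }
    }
    return positions;
}
```

Calling reserve with the final size before reading would also keep iterators valid, but only when the number of words is known in advance.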

Also, for printing out the number of occurrences, we had to settle for re-iterating through the vector of words and storing the occurrences in each Word object. We found no way around this, and thus were forced to spend more valuable processing time going through the list again. Another problem we spent much time on was comparator inheritance. We must have spent 5 hours on it one day without it ever working, but we went back to it today and it finally worked with some small changes. Last, but not least, there were many, many problems with making a library: finding the right command, getting the right Makefile, setting the right options, and so on. Finally, it all came together earlier today, and we now have our two makefiles.


This document can be found at http://www.duke.edu/~jwr6/cps108/prog-man.html

Copyright BR Publishing 2002 Josh Robinson, Aaron Barasz