Phillips et al.: "Semiotic Modeling for Text String De-Duplication"


The main body of the program begins at line 78. Two String objects, referred to as stack and stack2, are declared as variables, and a Boolean variable called dup is declared and initialized to unity (the default value, signifying equality between stack and stack2). Any comparison that indicates a difference between stack and stack2 sets dup to zero, which leads to the overall conclusion that the two strings are different.

Integers called topofstack and topofstack2 are initialized to zero, which signifies that, initially, there are no data members in either stack.

Lines 89 through 97 solicit input data and add the data to stack. A single line of input may contain up to 81 characters of text. A message directs the user to input data, and as long as the user does not input the tilde character (~), more words are added to the stack. Input is treated as individual consecutive characters, with the exception of the space character and the tilde. The space character (actually, one or more consecutive space characters) tells the program to end the current word and begin the input of the next word; the tilde character tells the program to stop adding to the stack. Line 96 calls the set function to add data to the stack, and increments the topofstack variable so that the next word added to the stack will go in the following memory location. Lines 99 through 107 conduct a similar input process on stack2. This program calls for character data to be input manually, but when the code in the Appendix is applied to the task at hand, the input will instead come from a scanner via OCR software. This point, although minor, merits being mentioned.
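The input rules described above (words separated by one or more spaces, with a tilde ending the input) can be sketched as follows. This is only an illustrative sketch, not the Appendix code: the names readWords, readWordsFromLine, and checkReadWords are hypothetical, a std::vector<std::string> stands in for the program's String stack and its set function, and the 81-character limit is not enforced.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the Appendix): reads words from `in` until a
// tilde (~) is seen or the input ends. A std::vector stands in for the
// program's stack; words.size() plays the role of topofstack.
std::vector<std::string> readWords(std::istream& in) {
    std::vector<std::string> words;
    std::string token;
    while (in >> token) {  // operator>> skips runs of consecutive spaces
        const std::string::size_type tilde = token.find('~');
        if (tilde == std::string::npos) {
            words.push_back(token);  // ordinary word: add it to the stack
            continue;
        }
        if (tilde > 0) {
            // Assumption: characters typed before the tilde still count.
            words.push_back(token.substr(0, tilde));
        }
        break;  // the tilde stops all further input
    }
    return words;
}

// Convenience wrapper for a line of text that is already in memory.
std::vector<std::string> readWordsFromLine(const std::string& line) {
    std::istringstream in(line);
    return readWords(in);
}

// Small usage check; call it from main() if desired.
void checkReadWords() {
    const std::vector<std::string> words =
        readWordsFromLine("the quick   brown ~ fox");
    assert(words.size() == 3);  // "fox" follows the tilde and is ignored
    assert(words[2] == "brown");
}
```

Passing std::cin to readWords would reproduce the manual input described above; any other stream, such as text produced by OCR software, could be passed in the same way.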

Line 109 compares the two stacks for equality of word count. If there are an equal number of words in the two stacks (i.e., if topofstack == topofstack2) [N.B., again, a double equals sign is used to denote a Boolean comparison], then each word in stack is compared to its counterpart in stack2 (see line 113). If there are different numbers of words in the two stacks (i.e., topofstack is different from topofstack2), then the word-by-word comparison is skipped, and the Boolean variable dup is set to zero (line 133). Line 132 denotes that a difference in word count has been detected.

The word-by-word comparison is conducted in line 113. (The "for" statement in line 111 moves the comparison from one word in the stack to the next.) The Boolean equality operator, as defined in lines 53 through 55, is employed in this section. If any word in stack is identical to its counterpart in stack2, then that result is printed (line 115) along with the word (line 116). If any word in stack is different from its counterpart in stack2, then the two words are printed (lines 122 through 124), and the Boolean variable dup is set to zero (line 126).
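The comparison logic, including the word-count check that precedes it, might look roughly like the sketch below. It is an approximation under stated assumptions rather than the Appendix code: compareStacks and checkCompareStacks are hypothetical names, std::string's built-in == stands in for the equality operator defined in lines 53 through 55, and the wording of the printed messages is invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the comparison step. Returns true when the two
// stacks hold the same words in the same order; this plays the role of dup.
bool compareStacks(const std::vector<std::string>& stack,
                   const std::vector<std::string>& stack2,
                   std::ostream& out) {
    // Different word counts: skip the word-by-word comparison entirely.
    if (stack.size() != stack2.size()) {
        out << "Different word counts: " << stack.size() << " vs "
            << stack2.size() << '\n';
        return false;
    }

    bool dup = true;  // assume a duplicate until a difference is found
    for (std::size_t i = 0; i < stack.size(); ++i) {
        if (stack[i] == stack2[i]) {
            out << "Same word: " << stack[i] << '\n';
        } else {
            out << "Different words: " << stack[i] << " / " << stack2[i]
                << '\n';
            dup = false;  // keep going so every differing pair is printed
        }
    }
    return dup;
}

// Small usage check; call it from main() if desired.
void checkCompareStacks() {
    const std::vector<std::string> a = {"to", "be"};
    const std::vector<std::string> b = {"to", "bee"};
    std::ostringstream log;
    assert(compareStacks(a, a, log));   // identical stacks are duplicates
    assert(!compareStacks(a, b, log));  // one differing word clears dup
}
```

Note that the loop does not stop at the first mismatch, which matches the description above: every pair of words is reported, and dup simply records whether any pair differed.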

Both stacks are then printed out in full (lines 138 through 144). Finally, the variable dup is evaluated. If no condition was encountered that would indicate a difference between the two inputs, then dup will have remained unchanged throughout the run and will have retained the value of unity. That overall result is printed (line 147). Otherwise, the overall result is that a difference was encountered, and the variable dup has been set to zero. The output corresponding to that situation is in line 149.
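The final reporting step can be sketched in the same spirit. The names printStack, reportResult, and checkReportResult are hypothetical, and the message text is an assumption; the actual output strings are those in the Appendix listing.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: prints one stack in full on a single line.
void printStack(const char* label, const std::vector<std::string>& words,
                std::ostream& out) {
    out << label << ':';
    for (const std::string& word : words) {
        out << ' ' << word;
    }
    out << '\n';
}

// Hypothetical helper: prints both stacks, then the overall verdict held in
// dup (true corresponds to unity, false to zero).
void reportResult(const std::vector<std::string>& stack,
                  const std::vector<std::string>& stack2,
                  bool dup, std::ostream& out) {
    printStack("stack", stack, out);
    printStack("stack2", stack2, out);
    out << (dup ? "The two inputs are duplicates."
                : "The two inputs are different.")
        << '\n';
}

// Small usage check; call it from main() if desired.
void checkReportResult() {
    const std::vector<std::string> a = {"a", "b"};
    const std::vector<std::string> b = {"a", "c"};
    std::ostringstream out;
    reportResult(a, b, false, out);
    assert(out.str() ==
           "stack: a b\nstack2: a c\nThe two inputs are different.\n");
}
```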

AS/SA Nº 6/7, Article 1 : Page 10 / 13

© 1999, AS/SA
