Phillips et al.: "Semiotic Modeling for Text String De-Duplication"

Input 1) has exactly one space between each consecutive pair of words. Input 2) has three spaces between each consecutive pair of words. Figure 1 displays the output from this comparison using the code in the Appendix. Note that a word-by-word comparison of the two inputs was conducted, and each word was evaluated as identical to its counterpart in the other input, regardless of the spacing differences. Therefore, even though there is a significant difference in the spacing between the two inputs, the comparison code deems the two input strings overall (i.e., in the final line of the output) to be identical. This evidences the success of the program in ignoring superfluous spaces between words.
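The space-insensitive behavior described above can be illustrated with a minimal Python sketch. It is not the code from the Appendix: the function name `word_by_word` is hypothetical, and the sample strings are borrowed from the next test, with the spacing chosen only for illustration. The sketch relies on whitespace-based splitting, so any run of spaces between two words acts as a single separator.

```python
def word_by_word(a, b):
    """Pair up the words of two strings and flag whether each pair matches.

    Hypothetical helper for illustration; assumes both strings contain the
    same number of words (zip stops at the shorter list otherwise).
    """
    # str.split() with no arguments splits on runs of whitespace,
    # so one space and three spaces produce the same list of words.
    words_a = a.split()
    words_b = b.split()
    return [(wa, wb, wa == wb) for wa, wb in zip(words_a, words_b)]


# Assumed sample inputs: same words, different spacing.
one_space = "THE SNOWMAN WALKS ALONE STOP"
three_spaces = "THE   SNOWMAN   WALKS   ALONE   STOP"

pairs = word_by_word(one_space, three_spaces)
for wa, wb, match in pairs:
    print(wa, wb, "identical" if match else "different")

# Overall verdict, analogous to the final line of the program output.
all_identical = all(match for _, _, match in pairs)
print("Overall:", "identical" if all_identical else "different")
```

Because the spacing is discarded before any comparison takes place, the overall verdict in this sketch depends only on the words themselves.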

In order to demonstrate the word count comparison feature of the code, the following two inputs were provided:

1) THE SNOWMAN WALKS ALONE STOP

vs.

2) THE SNOWMAN WALKS ALONE NOW STOP

The only difference between the two strings is the addition of the word "NOW" at the end of the sentence. Figure 2 displays the output from this comparison. A word count is conducted, and the program correctly concludes that the two text strings contain different numbers of words. On this basis, the overall conclusion (i.e., the final line of the output) is that the two text strings are different. Note that a word-by-word comparison is not conducted, since the word count is a more "global" (i.e., less detailed) measure by which to compare the two strings. The mechanism of the program is to compare word counts first, and if they are equal, to conduct a word-by-word comparison. If the word counts are not equal, the word-by-word comparison is omitted and a conclusion of inequality is made solely on the basis of the word count. This will be examined in more detail in the Discussion Section (vide infra).

Another consideration is that the word count is less computationally intensive than a word-by-word comparison, so omitting the word-by-word comparison for text strings with unequal word counts introduces a measure of computational efficiency. Thus, this test evidences the success of the program in identifying differences in the word counts of input strings.
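The two-stage mechanism (word counts first, word-by-word comparison only when the counts agree) can be sketched as follows. This is an illustrative Python rendering under stated assumptions, not the Appendix code: the function name `compare_strings` and the labels it returns are hypothetical.

```python
def compare_strings(a, b):
    """Return (verdict, basis) for two text strings.

    verdict is "identical" or "different"; basis names the stage that
    decided the outcome. Both labels are assumptions made for this sketch.
    """
    words_a = a.split()
    words_b = b.split()

    # Stage 1: the cheaper, more "global" check. Unequal word counts
    # settle the question, so the word-by-word stage is skipped.
    if len(words_a) != len(words_b):
        return ("different", "word count")

    # Stage 2: reached only when the word counts are equal.
    for wa, wb in zip(words_a, words_b):
        if wa != wb:
            return ("different", "word-by-word")
    return ("identical", "word-by-word")


result = compare_strings("THE SNOWMAN WALKS ALONE STOP",
                         "THE SNOWMAN WALKS ALONE NOW STOP")
print(result)  # ('different', 'word count')
```

For the two inputs of this test, the sketch stops at the first stage, mirroring the behavior reported for Figure 2.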






AS/SA Nº 6/7, Article 1 : Page 6 / 11

© 1999, AS/SA

1999.05.31