Phillips et al.: "Semiotic Modelling for Text String De-Duplication"

345






Experimental

C++ (Borland C++ version 4.52, courtesy of Borland) was chosen as the programming language. The advent of object-oriented programming languages such as C++ (Sedgewick, 1992), Visual Basic (Aitken, 1996), and Java (Walsh, 1996) has revolutionized scientific programming, and the advantages of object-oriented methodologies have become widely recognized and accepted (Rumbaugh et al., 1991).

Within that context, an object-oriented methodology was implemented (Sodhi and Sodhi, 1996). One part of an object-oriented approach consists of using classes to protect data, and of controlling access to, and manipulation of, that data through class member functions. The concept of a class is rooted in set theory (Muller, 1976), and lends itself well to problems such as the one at hand.

The code that was used to implement the word stack is contained in the Appendix (see note 1). The word stack class code was adapted from string classes developed previously (McMonnies and McSporran, 1995; Capper, 1994; Prata, 1995). It has been tailored to compare text strings on a word-by-word basis, while ignoring spaces between words (vide infra, Discussion Section). If the number of words in each of the two inputs is the same, and each word in each input is the same as the corresponding word in its counterpart input, then the two inputs are considered to be identical text strings, and one is merely the duplicate of the other. If either condition is not met, i.e.:

1) The two inputs have different numbers of words

OR

2) Any word in an input is different (even in spelling) from the corresponding word in its counterpart input

then the two inputs are considered to be distinct.
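The comparison rule stated above can be sketched in a few lines of C++. This is a minimal illustration only, not the authors' Appendix code: the class name WordStack and the member function isDuplicateOf are hypothetical, the sketch relies on the modern standard library rather than the original string classes, and it assumes that words are separated by whitespace and compared exactly, including letter case.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical class for illustration; it is not the Appendix code.
// It holds the words of one input and exposes them only through
// member functions, so the underlying data stay protected.
class WordStack {
public:
    explicit WordStack(const std::string& text) {
        std::istringstream in(text);
        std::string word;
        // operator>> skips any run of whitespace, so the amount of
        // spacing between words has no effect on the stored words.
        while (in >> word) {
            words_.push_back(word);
        }
    }

    // Number of words found in the input.
    std::size_t size() const { return words_.size(); }

    // True only if both inputs have the same number of words and every
    // word equals its counterpart exactly (assumption: case-sensitive).
    bool isDuplicateOf(const WordStack& other) const {
        if (words_.size() != other.words_.size()) {
            return false;  // condition 1: different numbers of words
        }
        for (std::size_t i = 0; i < words_.size(); ++i) {
            if (words_[i] != other.words_[i]) {
                return false;  // condition 2: a word differs
            }
        }
        return true;
    }

private:
    std::vector<std::string> words_;  // reachable only via member functions
};
```

Under these assumptions, WordStack("alpha beta").isDuplicateOf(WordStack("alpha    beta")) yields true, while a change in word count or in the spelling of any single word yields false.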


Results


As a basic trial of the program, consider the following two inputs:

1) THE SNOWMAN WALKS ALONE STOP

vs.

2) THE    SNOWMAN    WALKS    ALONE    STOP




___________

1. The executable is available for download as Dedupe.exe. The uncompiled code is also available as Dedupe.cpp.







AS/SA Nº 6/7, Article 1 : Page 5 / 13

© 1999, AS/SA


1999.05.31