Phillips et al.: "Semiotic Modeling for Text String De-Duplication"


In order to evaluate the ability of the program to detect differences in individual words, the following two inputs were provided:




This comparison highlights a misspelling commonly found in cable traffic. A typist intends to type the word "THE" and instead types the pseudo-word "HTE", thereby introducing a typographical error into the hard copy. When the hard copy is subsequently scanned into a text file, the error is carried along, and a comparison against an otherwise identical text string without the error should report a single-word difference. An error of this nature represents one of the smallest differences that can exist between two text strings, since the word counts are the same, the number of characters in each word is the same, and the spacing is identical.

The output from this comparison is displayed in Figure 3. The word counts are the same, so a word-by-word comparison is conducted. The program correctly concludes that there is a difference between the two inputs' first words (each referred to as Word # 0, due to the zero-based indexing of the word stack; "SNOWMAN" is the second word, and is referred to as Word # 1, etc.). Each subsequent word is correctly evaluated as identical to its counterpart. Finally, the overall conclusion (i.e., the last line in the output) is that, based on the difference in one or more words, the two strings are different.
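The comparison logic described above can be sketched in C++ as follows. This is only a minimal illustration, not the code from Appendix A: the names `splitWords`, `Comparison`, and `compareStrings` are hypothetical, and splitting on whitespace is an assumption about how words are delimited.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a text string into whitespace-delimited words.
std::vector<std::string> splitWords(const std::string& text) {
    std::istringstream stream(text);
    std::vector<std::string> words;
    std::string word;
    while (stream >> word) {
        words.push_back(word);
    }
    return words;
}

// Hypothetical result type: whether the word counts matched, plus the
// zero-based indices (Word # 0, Word # 1, ...) of the words that differ.
struct Comparison {
    bool sameWordCount;
    std::vector<std::size_t> differingWords;

    // The strings are identical only if the counts match and no word differs.
    bool identical() const { return sameWordCount && differingWords.empty(); }
};

// Compare word counts first; only if they match, compare word by word.
Comparison compareStrings(const std::string& first, const std::string& second) {
    const std::vector<std::string> a = splitWords(first);
    const std::vector<std::string> b = splitWords(second);

    Comparison result{a.size() == b.size(), {}};
    if (!result.sameWordCount) {
        return result;  // different counts: the word-by-word pass is skipped
    }
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i]) {
            result.differingWords.push_back(i);
        }
    }
    return result;
}
```

For illustrative inputs such as "THE SNOWMAN" and "HTE SNOWMAN" (chosen here for the example, not taken from the article's inputs), `differingWords` would hold only index 0 and `identical()` would return false, mirroring the behavior described for Figure 3.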


The functionality of each part of the code in Appendix A illustrates some fundamental concepts that apply to an object-oriented methodology. The major difference between a procedural methodology and an object-oriented methodology is that an object-oriented methodology allows a program to designate objects that are instances of a pre-defined class, and then uses class member functions to manipulate the objects (Stroustrup, 1991). In contrast, a procedural methodology calls procedures to manipulate data, but does not automatically associate a set of variables with each other (or with an object, as in the case of object-oriented programming).

An object-oriented approach to programming focuses on data and the behavior that relates to the data, as opposed to procedural methodologies, which focus on the means of manipulating data. In object-oriented programming, data and the functions that process the data are grouped into classes, whose instances are objects. Objects are variables belonging to a class (generally a user-defined class). Related to this is the concept of an Abstract Data Type (ADT), which can be considered a user-defined extension to the base data structures provided by a high-level language. An ADT comprises a set of values together with a group of functions defined over those values.
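As a minimal sketch of this idea, the following C++ class bundles a set of values (the characters of a string) with the functions that operate on them. The class name `TextString` and its member functions are hypothetical and are not taken from Appendix A.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical ADT: the data (characters) and the functions that
// operate on that data are declared together in one class.
class TextString {
public:
    explicit TextString(std::string characters)
        : characters_(std::move(characters)) {}

    // Number of characters in the string.
    std::size_t length() const { return characters_.size(); }

    // True if both objects hold exactly the same characters.
    bool matches(const TextString& other) const {
        return characters_ == other.characters_;
    }

private:
    std::string characters_;  // reachable only through the member functions
};
```

An object is then simply a variable of this class, for example `TextString intended("THE");`, and it is manipulated through its member functions rather than through free-standing procedures.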

OOP embodies a number of properties that are designed to ease the implementation of ADTs. One of the most useful of these is the concept of inheritance, in which a new data structure can be derived from an existing one. As a case in point, text structure is in many respects similar to this mechanism. For example, words and phrases are types of text strings. If the information encoding text strings applies to all instances of that class, then creating the class "word" from the class "string" (i.e., creating a subclass) avoids duplication of effort.
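The derivation of "word" from "string" might look like the following C++ sketch. The classes `String` and `Word` are hypothetical stand-ins for the classes named in the text, and the whitespace test in `isSingleToken` is an assumed definition of a word, used only for illustration.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical base class standing in for the class "string".
class String {
public:
    explicit String(std::string characters)
        : characters_(std::move(characters)) {}

    std::size_t length() const { return characters_.size(); }

    bool matches(const String& other) const {
        return characters_ == other.characters_;
    }

protected:
    // Read-only access for derived classes.
    const std::string& characters() const { return characters_; }

private:
    std::string characters_;
};

// Hypothetical subclass standing in for the class "word": it inherits
// length() and matches() unchanged and adds one word-specific query.
class Word : public String {
public:
    using String::String;  // reuse the base-class constructor

    // Assumed definition: a word contains no whitespace.
    bool isSingleToken() const {
        return characters().find_first_of(" \t\n") == std::string::npos;
    }
};
```

Because `Word` reuses everything `String` already provides, only the behavior specific to words has to be written, which is the saving in effort that the paragraph above describes.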


AS/SA Nº 6/7, Article 1 : Page 7 / 11

© 1999, AS/SA
