Phillips et al.: "Semiotic Modeling for Text String De-Duplication"


Note that, according to the above definition, a function is a set of ordered pairs in which the first member is an element of a set X and the second member is an element of a set Y. X and Y can be any sets. If X is a set of documents and Y is a set of text files (or, equivalently, a set of admittedly quite long binary numbers encoding text files), then scanning a document into a text file is a function if and only if the same text file is produced every time a given document is scanned (Gilbert, 1970).
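
The single-valuedness condition above can be illustrated with a minimal Python sketch. It treats a log of scans as a set of ordered pairs and checks only that no document is paired with two different text files; it does not check that every document in X has been scanned. The name is_function and the sample scan logs are hypothetical, introduced purely for illustration.

```python
from collections import defaultdict


def is_function(pairs):
    """Return True if no first member is paired with two different second members."""
    outputs = defaultdict(set)
    for x, y in pairs:
        outputs[x].add(y)
    # Single-valued: every first member maps to exactly one second member.
    return all(len(ys) == 1 for ys in outputs.values())


# Hypothetical scan logs: (document id, text produced by one scan of that document).
consistent_scans = [("doc1", "MN AV"), ("doc1", "MN AV"), ("doc2", "Hello.")]
inconsistent_scans = [("doc1", "MN AV"), ("doc1", "MN  AV")]  # extra space on rescan

print(is_function(consistent_scans))
print(is_function(inconsistent_scans))
```

Under these assumptions, the first log behaves as a function, while the second does not, because rescanning the same document produced a different text string.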

Clearly, we want scanning to be a function, since constancy of the scanning process is desired. Unfortunately, this goal is complicated by the need for representations of documents in a database to evolve as additional documents are compiled into that database (Bouchaffra and Meunier, 1995; Rijsbergen, 1979; Robertson and Harding, 1984). If a keyword system (or an equivalent indexing system) is used to represent the content of text strings in a database, then a small database will require fewer keywords to match document content than a large database will. Neural networks (Belew, 1989; William, 1990) and a Markovian Random Field Machine (Bouchaffra and Meunier, 1995) have been proposed as solutions to this portion of the problem, and both methodologies have shown promise. Regardless of the methodology, some system must be implemented that compares UNits of INformation (otherwise referred to as UNIFs; Meunier et al., 1993; Meunier et al., 1987) between new additions to a database and existing database members.

How do these concepts relate to semiotic modeling? The scientific method tells us that if a process (or function, in the case at hand) is carried out consistently on an identical input, the results should also be identical. Unfortunately, when text strings are scanned for input into a text file, this is not always the case. Some examples are:

  • Punctuation: The small size of many punctuation marks can cause them to be scanned inconsistently.

  • Misread letters: Depending on the font, an upper case "A" (for example) can sometimes be scanned as "Fl" (upper case F, lower case ell).

  • Spaces: OCR software varies considerably in how it inserts spaces between words. For example, there is less space between M and N than there is between A and V (MN vs. AV, by inspection). The variation in the space required between different pairs of letters can result in an extra space being inserted in the middle of a word or (more commonly) two or more spaces being inserted between adjacent words when only a single space is intended.
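
To make the effect of the punctuation and spacing inconsistencies concrete, the following Python sketch builds a comparison key by discarding punctuation and whitespace before two scanned strings are compared. This is an illustrative assumption, not the method described in this article: the name comparison_key and the sample strings are hypothetical, only ASCII punctuation is covered, and misread letters (such as "A" scanned as "Fl") are not handled.

```python
import string

# Translation table that deletes ASCII punctuation and whitespace characters.
_PUNCT_AND_SPACE = str.maketrans("", "", string.punctuation + string.whitespace)


def comparison_key(text):
    """Return the text with ASCII punctuation and all whitespace removed."""
    return text.translate(_PUNCT_AND_SPACE)


# Hypothetical scans of the same printed line.
scan_a = "Semiotic modeling, for text strings."
scan_b = "Semiotic  model ing for text strings"  # dropped punctuation, extra spaces

print(scan_a == scan_b)
print(comparison_key(scan_a) == comparison_key(scan_b))
```

Removing all whitespace, rather than collapsing runs of spaces into one, is a deliberate choice in this sketch: it also absorbs a stray space inserted in the middle of a word, at the cost of treating strings that differ only in word boundaries as identical.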


AS/SA Nº 6/7, Article 1 : Page 2 / 13

© 1999, AS/SA
