Phillips et al.: "Semiotic Modeling for Text String De-Duplication"







Therefore, one of the earliest crucial steps in declassifying the documents is determining which hard copies are duplicates of each other. Information retrieval has been examined in some detail with respect to similar applications (Bouchaffra and Meunier, 1995). Multiple hard copies of cable traffic have commonly been stored in different locations, often for decades. Cable traffic poses some unique challenges with respect to de-duplication. First, all of the characters (at least in most cable traffic) are upper case letters; the text contains no lower case characters. Second, there is no punctuation. Third, in contrast to modern word processors, there is no adjustment for spacing between words, so handling superfluous spaces is a problem requiring special attention. As an example of the appearance of cable traffic, what normally would be written:

The quick, brown fox jumped over the lazy dog.

becomes, in cable traffic

THE QUICK COMMA BROWN FOX JUMPED OVER THE LAZY DOG STOP

The text generally is typed in standard block letters, either with no punctuation or with punctuation spelled out as words (e.g., "COMMA" denoting a comma and "STOP" denoting the end of a sentence in lieu of a period, as in the example above). Upper case letters and spelled-out punctuation therefore pose little difficulty, and the major remaining roadblock of the three presented above is the problem of spacing.



Approach


Although there are several ways to approach the problem of variable spacing, each has some inherent disadvantages. For example, one could simply process the text as one long string of characters, with a function or procedure that deletes spaces whenever they are encountered. The text would then exist as a single long character string variable, however, and long documents would require large amounts of memory to store it. Moreover, manipulating a variable of this nature would be cumbersome.
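The space-deletion approach described above can be illustrated with a minimal sketch. The function name strip_spaces and the two sample strings are hypothetical, chosen only for illustration; the article does not specify an implementation or a programming language.

```python
def strip_spaces(text: str) -> str:
    """Return the text with every space character deleted.

    Hypothetical helper: the whole document ends up as one long string.
    """
    return text.replace(" ", "")


# Two scans of the same cable that differ only in their spacing
# (sample data invented for this illustration).
copy_a = "THE QUICK COMMA BROWN FOX JUMPED OVER THE LAZY DOG STOP"
copy_b = "THE  QUICK COMMA   BROWN FOX JUMPED OVER THE LAZY  DOG STOP"

# After the spaces are deleted, the two copies compare as equal.
print(strip_spaces(copy_a) == strip_spaces(copy_b))
```

The result of strip_spaces is a single string whose length grows with the document, which is the storage and handling concern raised in the paragraph above.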

In contrast, a word stack methodology stores each word as a separate character string variable. Spaces between words are not stored, because a space in the input simply signifies that a word has ended and that the next variable begins with the next non-space character. No matter how many spaces separate two words, none of them is stored in the word stack. This approach is ideally suited to the problem at hand.
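As a rough sketch of the word stack idea, the following code scans the input one character at a time and treats a space only as the signal that a word has ended. The name build_word_stack, the use of a Python list as the stack, and the sample strings are assumptions made for illustration, not the authors' implementation.

```python
from typing import List


def build_word_stack(text: str) -> List[str]:
    """Split text into words, discarding any number of spaces between them.

    Hypothetical helper: each word becomes its own string in the list.
    """
    words: List[str] = []
    current: List[str] = []          # characters of the word being read
    for ch in text:
        if ch == " ":
            if current:              # a space ends the current word, if any
                words.append("".join(current))
                current = []
            # additional spaces are skipped and never stored
        else:
            current.append(ch)
    if current:                      # flush the final word
        words.append("".join(current))
    return words


# Sample data invented for this illustration.
copy_a = "THE QUICK COMMA BROWN FOX JUMPED OVER THE LAZY DOG STOP"
copy_b = "THE  QUICK COMMA   BROWN FOX JUMPED OVER THE LAZY  DOG STOP"

# Both copies yield the same sequence of words, whatever their spacing.
print(build_word_stack(copy_a) == build_word_stack(copy_b))
```

For input containing only letters and spaces, Python's built-in str.split() called with no arguments produces the same list; the explicit loop is shown only to mirror the description in the text.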








AS/SA Nº 3, Article 1 : Page 4 / 13

© 1999, AS/SA


1999.05.31