Therefore, one of the earliest crucial steps in declassifying the documents is
determining which hard copies are duplicates of each other. The subject of
information retrieval has been examined in some detail with respect to applications
similar to these (Bouchaffra and Meunier, 1995). It has been common for multiple
hard copies of cable traffic to be stored in different locations, often for decades.
Cable traffic poses some unique challenges with respect to the problem of
de-duplication. First, all of the characters (at least in most cable traffic) are upper
case letters; no lower case characters appear in the text. Second, there is no
punctuation. Third, in contrast to text from modern word processors, the spacing
between words is not adjusted, so superfluous spaces in the scanned text are a
problem requiring special attention. As an example of the appearance of cable
traffic, what normally would be
The quick, brown fox jumped over the lazy dog.
becomes, in cable traffic
THE QUICK COMMA BROWN FOX JUMPED OVER THE LAZY DOG STOP
The text generally is typed in standard block letters, with no punctuation,
or with punctuation spelled out in text (e.g., "COMMA" denoting a comma and
"STOP" denoting the end of a sentence in lieu of a period, as in the example above).
Therefore, the major remaining roadblock of the three presented above is the
problem of spacing.
Although there are a number of means of approaching the problem of
variable spacing, each has some inherent disadvantages associated with it. For
example, one simply could process the text string as a long string of characters, with
a function or procedure that deletes spaces whenever they are encountered. However,
the text would then exist as a single long character string, and long documents
would require large amounts of memory to store it. Moreover, manipulating a
variable of this nature would be cumbersome.
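The space-deleting approach can be sketched as follows. This is only an
illustration in Python under stated assumptions, not the implementation used for
the documents: the function name strip_spaces and the two sample strings are
hypothetical, and only the ordinary space character is treated as spacing.

```python
def strip_spaces(text):
    """Return text with every space character removed (hypothetical helper)."""
    kept = []
    for ch in text:
        # Delete spaces whenever they are encountered; keep everything else.
        if ch != " ":
            kept.append(ch)
    # The whole document ends up as one long character string.
    return "".join(kept)


# Two hypothetical transcriptions that differ only in spacing.
copy_a = "THE QUICK COMMA BROWN FOX"
copy_b = "THE  QUICK COMMA   BROWN FOX"

print(strip_spaces(copy_a))                          # THEQUICKCOMMABROWNFOX
print(strip_spaces(copy_a) == strip_spaces(copy_b))  # True
```

The sketch also suggests why such a string is awkward to work with: once the
spaces are gone, the word boundaries are gone as well, so "THE REIN" and
"THERE IN" would both reduce to "THEREIN".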
In contrast, a word stack methodology involves storing each word as a
character string variable. Moreover, spaces between words are not stored, because
the presence of a space in the input simply signifies that a word has ended, and that
the next variable begins with the next non-space character. However many spaces
exist between two words, they are not stored in the word stack. This approach is
ideally suited to the problem at hand.
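A minimal sketch of the word stack idea is given below, assuming Python and
assuming that the ordinary space character is the only separator present in the
scanned text. The name build_word_stack and the two sample copies are
hypothetical and serve only to illustrate the description above.

```python
def build_word_stack(text):
    """Split text into words, discarding every space between them."""
    stack = []    # completed words, in order of appearance
    current = []  # characters of the word currently being read
    for ch in text:
        if ch == " ":
            # A space signals that a word has ended; runs of spaces add nothing.
            if current:
                stack.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:   # flush the final word if the text does not end in a space
        stack.append("".join(current))
    return stack


# Two hypothetical copies of the same cable that differ only in spacing.
copy_a = "THE QUICK COMMA BROWN FOX JUMPED OVER THE LAZY DOG STOP"
copy_b = "THE  QUICK COMMA   BROWN FOX JUMPED  OVER THE LAZY DOG STOP "

stack_a = build_word_stack(copy_a)
stack_b = build_word_stack(copy_b)

print(stack_a[:3])         # ['THE', 'QUICK', 'COMMA']
print(stack_a == stack_b)  # True
```

In Python the built-in str.split(), called with no arguments, gives a comparable
result by treating any run of whitespace as a single separator; the explicit loop
is shown only because it mirrors the description of how each space ends one word
and the next non-space character begins the next.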