Phillips et al.: "Semiotic Modeling for Text String De- Duplication"


Therefore, when a hard copy of text is scanned into a text file, it is not only conceivable but likely that two consecutive scans will yield different results. (Even such factors as variation in the position, including the angle, of the hard copy on the scanner can have a significant effect.) This raises an interesting question: is the "text" the hard copy, or is it the file? In other words, is the file a representation of the "real" object (the hard copy), or is the hard copy a representation of the "real" object (the file)? Ultimately, the author's thoughts and words must be recognized as the true object (although one can argue that even the words are only a representation of the thoughts, i.e., the "true" object), and both the hard copy and the file are attempts to communicate those thoughts to other individuals or to machines. Herein lie the roots of artificial intelligence and machine-based understanding of human thought (Partridge and Wilks, 1990). Fortunately, computers have been constructed to function in a manner similar to the human mind: just as the mind is innately a symbolic system that constructs symbols, manipulates them, and relates them to real objects external to the mind, so do computers receive symbols as input, manipulate them, and emit symbols as output (Meunier, 1998). For the purposes of the discussion at hand, this question need not be resolved immediately, but it remains of interest and has implications for future work.

What can be concluded about this lack of constancy in the results of a document scanning procedure? The hard copy itself clearly does not undergo meaningful change from one scan to the next (although its position on the scanner can vary). However, one would be hard pressed to argue that the scanning procedure is not a true function in the mathematical sense of the word. (In an ideal scanning process, identical input should yield identical output.) Natural scatter in the output data is acceptable, at least to a degree; moreover, such scatter is not inconsistent with the concept of a mathematical function. The problem is complicated by the fact that one goal of a de-duplication procedure is to scan different hard copies and determine which ones contain text identical to text previously scanned.
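The practical consequence of this natural scatter is that exact string comparison cannot, by itself, decide whether two scans came from the same text. A minimal sketch of one way to tolerate scatter (not a method described by the authors) is to compute a similarity ratio between the two scanned strings and treat them as copies when the ratio exceeds a threshold; the function name and the 0.95 threshold below are illustrative assumptions:

```python
from difflib import SequenceMatcher

def likely_duplicates(text_a, text_b, threshold=0.95):
    """Return True when two scanned strings are similar enough to be
    considered copies of the same source text. The threshold value is
    an illustrative choice, not one taken from the article."""
    ratio = SequenceMatcher(None, text_a, text_b).ratio()
    return ratio >= threshold

# Two scans of the same hard copy, with OCR scatter in the second
# ("m" misread as "rn", "M" misread as "N"):
scan_1 = "The committee met on 4 March to review the cable traffic."
scan_2 = "The cornmittee met on 4 Narch to review the cable traffic."

print(likely_duplicates(scan_1, scan_2))  # near-identical despite scatter
```

The threshold embodies the judgment that scatter is acceptable "to a certain degree": set too high, genuine copies are missed; set too low, distinct texts are merged.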

Within this context, the following problem has been presented to the scientific community: How can hard copies of classified material (much of which is cable traffic), which can be years or decades old, be compared with text previously scanned into a database? This problem stems from the Freedom of Information Act and recent executive orders to declassify any material not currently considered to be vital to national security. (Although this work was conducted with an eye toward the American government, and text in English, the concepts are universally applicable.)

Declassifying the millions of documents that apply to the situation at hand is expected to be a labor-intensive process. A significant step in reducing the labor requirements is to determine which documents (or text strings within documents) are copies of each other. The project will entail scanning all of the documents, discarding copies, and processing the remaining documents for a decision on declassification.
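The scan-then-discard step described above can be sketched as a simple pipeline. The version below handles only the exact-match case, using a light normalization so that trivial scan differences (whitespace, letter case) do not defeat the comparison; the function names, the normalization rules, and the sample documents are all illustrative assumptions rather than details from the project itself:

```python
import hashlib
import re

def normalize(text):
    """Collapse whitespace and case so trivial scan differences do not
    defeat exact matching (an illustrative normalization, not the
    project's actual procedure)."""
    return re.sub(r"\s+", " ", text).strip().lower()

def deduplicate(documents):
    """Process a list of document texts in order, discarding any whose
    normalized content has already been seen; return the survivors for
    the declassification decision."""
    seen = set()
    survivors = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            survivors.append(doc)
    return survivors

docs = [
    "CABLE 041:  release  approved.",
    "cable 041: release approved.",   # same text, different scan artifacts
    "CABLE 042: hold for review.",
]
print(len(deduplicate(docs)))  # the second entry is discarded
```

In practice, the OCR scatter discussed earlier means exact matching after normalization would catch only a fraction of true copies; a fuzzy comparison stage would have to follow, which is where the labor savings (and the difficulty) of the project lie.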


AS/SA Nº 6/7, Article 1 : Page 3 / 13

© 1999, AS/SA
