It is noteworthy that, according to the above definition, a function is a set of ordered
pairs, with the first member being an element of set X, and the second member
being an element of set Y. X and Y can be any sets, so if we state that X is a set of
documents, and Y is a set of text files (or equivalently, a set of binary numbers,
admittedly quite long binary numbers, encoding text files), then we may state that
scanning a document into a text file is a function if and only if every time a
document is scanned, the same text file is produced (Gilbert, 1970).
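
The single-valued condition in this definition can be checked mechanically. The following is a minimal sketch, assuming that scan results are logged as (document, text) pairs; the name `is_function` and the sample data are hypothetical and serve only as illustration. It tests only that no document is paired with two different text files, not that every document in X has been scanned.

```python
from collections import defaultdict

def is_function(pairs):
    """Return True if no first member is paired with two different second members."""
    images = defaultdict(set)
    for x, y in pairs:
        images[x].add(y)
    return all(len(ys) == 1 for ys in images.values())

# Hypothetical scan logs: (document id, text produced by one scan).
consistent = [("doc1", "Hello, world."), ("doc1", "Hello, world."), ("doc2", "Semiotics")]
inconsistent = [("doc1", "Hello, world."), ("doc1", "Hello  world")]

print(is_function(consistent))
print(is_function(inconsistent))
```

In this sketch, repeated scans of the same document that yield identical text leave the relation a function, while a single divergent scan breaks it.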
Clearly, we want scanning to be a function, since consistency of the scanning
process is desired. Unfortunately, this goal is complicated by the need for
representations of documents in a database to evolve as additional documents are
compiled into said database (Bouchaffra and Meunier, 1995; Rijsbergen, 1979;
Robertson and Harding, 1984). If a keyword system (or equivalent indexing system)
is used to represent the content of text strings in a database, then obviously a small
database will require fewer keywords to match document content than a large
database will. Neural networks (Belew, 1989; William, 1990) and a Markovian
Random Field Machine (Bouchaffra and Meunier, 1995) have been proposed as
solutions to this portion of the problem. Both methodologies have displayed promise.
Regardless of the methodology, some system that compares UNits of INformation
(otherwise referred to as UNIFs; Meunier et al., 1993; Meunier et al., 1987)
between new additions to a database and existing database members must be
implemented.
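
As a rough illustration of such a comparison, the sketch below treats each lowercase word token as a stand-in for a unit of information and scores overlap with the Jaccard coefficient. Both choices are assumptions made only for this example and are not the UNIF definition used by Meunier et al.; the function names are likewise hypothetical.

```python
import re

def units(text):
    """Assumed stand-in for units of information: the set of lowercase word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap(new_text, existing_text):
    """Jaccard coefficient between the unit sets of two texts (0.0 if both are empty)."""
    a, b = units(new_text), units(existing_text)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def most_similar(new_text, database):
    """Return the id of the existing member that overlaps most with the new text.

    `database` is assumed to be a dict mapping document ids to text strings.
    Returns None for an empty database.
    """
    return max(database, key=lambda doc_id: overlap(new_text, database[doc_id]), default=None)
```

A new addition would be compared against every existing member in this way; a real system would need a far richer unit than the word token assumed here.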
How do these concepts relate to semiotic modeling? The scientific method
tells us that if a process (or function, in the case at hand) is carried out consistently
on an identical input, the results also should be identical. Unfortunately, when
scanning text strings for input into a text file, this is not always the case. Some
common sources of inconsistency are:
- Punctuation: The small size of many punctuation marks can cause punctuation to be scanned inconsistently.
- Misread letters: Depending on the font, an uppercase "A" (for example) can sometimes be scanned as "Fl" (uppercase F, lowercase ell).
- Spaces: OCR software varies considerably in how it inserts spaces between words. For example, there is less space between M and N than there is between A and V (compare MN with AV). Because the spacing differs between pairs of letters, an extra space can be inserted in the middle of a word or, more commonly, two or more spaces can be inserted between adjacent words where only a single space is intended.
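
The more common spacing problem, runs of spaces between adjacent words, can be reduced before two scans are compared. The following is a minimal sketch under that assumption; `collapse_spaces` and `scans_agree` are hypothetical names, and the approach does not repair a space inserted inside a word, inconsistent punctuation, or misread letters.

```python
import re

def collapse_spaces(text):
    """Replace each run of spaces or tabs with one space and trim every line."""
    return "\n".join(re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())

def scans_agree(scan_a, scan_b):
    """Return True if two scans are identical once extra spaces are collapsed."""
    return collapse_spaces(scan_a) == collapse_spaces(scan_b)

print(scans_agree("the  cat   sat", "the cat sat"))  # differs only by extra spaces
print(scans_agree("the c at sat", "the cat sat"))    # space inside a word remains
```

Normalizing in this way brings repeated scans closer to the single-valued behavior that a function requires, although only for this one class of error.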