July 2, 2004
K-Praxis has been interested in text analysis for quite some time, and with feedback from our sister concern, Textual Analytics Solutions, we believe it is now time to define and demarcate more clearly the domain of text analysis. In this series of articles we attempt some preliminary definitions and identify differences.
Natural Language?
The term 'natural language' could gain currency only once languages other than 'natural' ones were available. For centuries, human beings simply used language, with no qualifier. In the twentieth century, when computer programs came to be written, the sets of instructions came to be called 'programming languages'. Strictly speaking, this is misleading nomenclature. One has only to look at some assembly-level code, which is what every program is reduced to before it is finally transformed into numbers. One knows immediately that this is most definitely a set of instructions. These instructions are given as abbreviations of English words for certain actions (JMP, MOV, ADD, SUB, INC, MULT etc.). The abbreviations are used in the first place because the full English words (JUMP, MOVE, ADD, SUBTRACT, INCREMENT, MULTIPLY etc.) correspond to certain physical or arithmetical actions.
At the level of assembly, then, one could not really speak of language. Higher-level sets of instructions, whether FORTRAN, C++, Scheme or Lisp (and there are countless others), are still mere sets of instructions that include a minimal number of words from the English language. Apart from conventional usage, there is very little justification for calling these sets of instructions 'programming languages'. We must also examine the kinds of English words that have entered into programming languages. Consider the well-known set of instructions that begin with "if" and are followed by "then" and quite often by "else". It is fundamentally important to ask whether these are 'linguistic' operations or, properly speaking, 'logical' operations; whether such 'logical' operations count as 'linguistic' at all remains a moot question. A word like 'love' could hardly enter this realm. There has been much academic discussion of the relationship between logic and language in the long history of language use: consider, for example, why all discussions, including discussions of logic and language, have to take place in language!
Computation and Computers
When we used to punch cards that were 'read' by brushes, we did not really call these punched cards anything but punched cards. Programming amounted to correctly punching cards and feeding them to the more or less mechanical, more or less analog machine. When we gossiped while punching cards, the language we used was not called 'natural' language. The beast that we fed these punched cards to was contemptuously called a computer, almost to mean a stupid, dumb thing whose size was out of all proportion to its outputs.
It seems that in almost no time, punched cards gave way to 'programming languages'. It is very important to remember that the word 'language' here is used metaphorically. Like many other metaphors, this one too has generated a lot of unnecessary debate and, in this particular case, artificial divisions, animosities, specializations, and, we must admit, some wonderful advances.
Language and Natural Language Processing
NLP thrives on the distinction between artificial (programming) languages and 'natural' languages. We have already pointed out that this distinction is made possible by that metaphorical use, 'programming languages'.
Attempts to apply increased computational power to the analysis of language use have focused on written texts, which are more easily converted to electronic text formats, usually plain text. It is here that we might have to make a preliminary distinction between plain text documents and language use. Spoken language tends to be accompanied by other communication protocols (gestures, paralinguistic features etc.). Electronic texts are not really capable of representing these. As a result, most computation of language tends to focus on either the syntactic (read: grammatical) or the semantic (read: meaning) features of language.
Grammar-oriented processing necessarily has to tag the various grammatical features of words (number, tense, gender etc.), which in itself is a non-trivial process. Semantics-oriented processing, on the other hand, must focus on meanings (and relationships), and tends towards what in the parlance is called ontology. While it is true that the construction of any useful ontology is also a non-trivial process, an ontology is of limited use, since it has to be defined well in advance, before a text can be processed. This further means that updating the ontology is a huge task, and one is always in need of updating.
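The contrast between the two approaches can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the lexicon, the ontology, and the function names `tag` and `ancestors` are all invented here and do not describe any particular NLP system.

```python
# Toy lexicon of grammatical features. Every entry is invented for
# illustration; a real tagger would need a vastly larger resource.
LEXICON = {
    "the": {"pos": "determiner"},
    "dog": {"pos": "noun", "number": "singular"},
    "dogs": {"pos": "noun", "number": "plural"},
    "barks": {"pos": "verb", "tense": "present", "number": "singular"},
    "barked": {"pos": "verb", "tense": "past"},
}

# Toy ontology of is-a links, fixed before any text is processed.
ONTOLOGY = {
    "dog": "mammal",
    "mammal": "animal",
}


def tag(sentence):
    """Pair each word with its grammatical features (None if unknown)."""
    return [(word, LEXICON.get(word)) for word in sentence.lower().split()]


def ancestors(concept):
    """Follow is-a links upward until the ontology has nothing more to say."""
    chain = []
    while concept in ONTOLOGY:
        concept = ONTOLOGY[concept]
        chain.append(concept)
    return chain


if __name__ == "__main__":
    print(tag("The dogs barked"))
    print(ancestors("dog"))    # concept defined in advance
    print(ancestors("blog"))   # concept the ontology never anticipated
```

A word or concept that was not anticipated simply falls through: `tag` returns `None` for it and `ancestors` returns an empty list. That is the updating problem in miniature.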
The area where statistical methods have proved most useful is stylometry, and there might be something to learn from these methods. The applicability issue, however, is important: apart from generating interesting algorithms that enable one to identify authorship (with varying degrees of success), one is hard-pressed to find uses for advances in stylometry.
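To give a flavour of what a statistical approach to authorship looks like, here is a minimal Python sketch. It assumes one simple technique among many: comparing how often a handful of common function words occur in each text. The word list, the sample texts, and the names `profile`, `distance` and `closest_author` are our own illustrative choices, not an established stylometric method.

```python
import math
from collections import Counter

# A short, arbitrary list of function words chosen for illustration.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it"]


def profile(text):
    """Relative frequency of each function word in the text.

    Tokenisation is a plain whitespace split, so punctuation attached
    to a word is not stripped; a real system would do better.
    """
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]


def distance(p, q):
    """Euclidean distance between two frequency profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def closest_author(unknown, samples):
    """Name of the sample whose profile lies nearest to the unknown text."""
    target = profile(unknown)
    return min(samples, key=lambda name: distance(target, profile(samples[name])))


if __name__ == "__main__":
    samples = {
        "A": "the cat and the dog and the bird",
        "B": "it is said that it is known that it rains",
    }
    print(closest_author("the fox and the hen", samples))
```

On texts this short the result means very little; the sketch only shows the shape of the computation, which counts surface features and never touches meaning.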
In the next article, we shall discuss these issues further, along with some related ones.
Please feel free to send your comments and questions to us at info@k-praxis.com.
