July 5, 2004
In the previous article, we discussed the difference between language (misleadingly called 'natural' language), and rules and protocols of writing sets of instructions (misleadingly called programming 'languages'). In this article we discuss the difference between text and language and a few other issues.
Text and Language
By language we shall mean spoken language, and in this particular context, by 'text' we shall mean language use available in digital formats (of various kinds). We have already indicated that language is always accompanied by other communication protocols (gesture, facial expressions, tone, etc.), whereas in a text the only protocol is that of words themselves, usually experienced through visual means. A text contains very few other markers.
While speaking, our syntax is often uncertain, whereas texts are written with a settled, almost rigid syntax. In texts, it is syntax that makes communication precise.
The major difference between language and text is that texts, once fixed, remain that way, whereas in communicative situations involving language, one can always modify one's communication ("No, no, that's NOT what I meant!"). It follows that once a text is fixed in a medium, it continues to communicate even if the producer of the text is absent from the communicative situation. Linguistic communication situations are interactive; textual communication situations are non-interactive (the author of the text is generally absent from the reading situation).
It follows that the methods and concepts used for analyzing text would be a little different from methods and concepts used for analyzing language. This is the fundamental difference between 'natural' language processing, and textual analytics.
Textual Analytics and NLP
Quite a lot of NLP applications and research suffer from neglecting this difference. Consider how the majority of NLP applications begin with a 'stop-word' list. These are words that are said to carry little meaning (most prepositions, articles, etc.). Having stopped these words from 'cluttering' the processing, such applications use either clustering methods or tagging methods to extract semantically valuable words (usually nouns and verbs only).
In a linguistic communicative situation, words like 'here' and 'there' indicate certain spatial relationships (which can be taken for granted). A textual recording of the same utterances loses these relationships almost entirely, unless the spatial coordinates are recorded separately (they can no longer be taken for granted). Processing a textual recording through a method that first throws away words like 'here' and 'there' loses the spatial coordinates as well, making the analysis context-blind. It is not surprising that contextual understanding of words is the weakest area in NLP!
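The loss described above can be seen in a minimal Python sketch. The stop-word list and the sample utterance are illustrative assumptions and are not taken from any particular NLP toolkit; real stop-word lists are longer and differ from tool to tool.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for
# illustration); it includes the spatial words 'here' and 'there'.
STOP_WORDS = {"the", "a", "an", "is", "on", "and", "it",
              "in", "of", "to", "not", "here", "there"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

utterance = "Put the box here, not there."
tokens = tokenize(utterance)
kept = remove_stop_words(tokens)
dropped = [t for t in tokens if t in STOP_WORDS]

print("kept:   ", kept)
print("dropped:", dropped)
```

After filtering, nothing in the remaining tokens records where the box should or should not go; that information was carried entirely by the discarded words.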
Textual analytics, on the other hand, if developed properly, might retain what NLP throws away immediately, and hope to identify the spatio-temporal coordinates through these retained words, making it, surprisingly perhaps, context-aware. These processes are not easy, we must admit.
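As a rough sketch of what retaining such words could look like, the Python fragment below keeps every token and merely labels the ones that may carry spatial or temporal context. The word lists, the function names and the sample sentence are all hypothetical, chosen only for illustration. The sketch flags candidate markers and nothing more; working out what 'here' or 'now' actually refers to is the difficult part, and it is not attempted.

```python
import re

# Hypothetical word lists, assumed for illustration only.
SPATIAL = {"here", "there", "above", "below", "nearby"}
TEMPORAL = {"now", "then", "today", "yesterday", "tomorrow"}

def annotate(text):
    """Return every token with its position and a coarse context label.

    No token is discarded; words outside both lists get the label None.
    """
    annotated = []
    for position, match in enumerate(re.finditer(r"[a-z']+", text.lower())):
        word = match.group()
        if word in SPATIAL:
            label = "spatial"
        elif word in TEMPORAL:
            label = "temporal"
        else:
            label = None
        annotated.append((position, word, label))
    return annotated

def context_markers(annotated):
    """Select only the tokens that were labelled as context markers."""
    return [item for item in annotated if item[2] is not None]

text = "Put the box here now, not there."
annotated = annotate(text)
markers = context_markers(annotated)

print(len(annotated), "tokens retained")
for position, word, label in markers:
    print(position, word, label)
```

Because positions are kept alongside the words, a later stage could relate each marker to its neighbours (for instance, 'not' immediately before 'there') instead of treating the text as an unordered bag of nouns and verbs.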
On the other hand, textual analytics will certainly benefit from the decades of research that have enhanced NLP, especially the various grammar-tagging methods, and possibly latent semantic indexing. NLP is, after all, an important contributor to the relationship between computation and language use.
Textual Analytics
What one needs to look for, then, are ways of analyzing text that do not, emphatically, discard any piece of textual information, and thus use all the information available in a given text, or across various texts.
It is important that textual analytics does not repeat the errors of commission or omission that NLP tends to make, and yet delivers more analytical capability and scope than NLP.
This, we believe, is the challenge ahead for textual analytics.
Please feel free to send your comments and questions to us at info@k-praxis.com.
