Tuesday, October 29, 2013

Natural Language Processing: Key Terms



Case-Folding : Case-folding reduces all letters to lower case. Case-folding can equate words that might better be kept apart, for example Fed (the central bank) and fed (the past tense of feed).
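
A minimal sketch of case-folding in Python (the word list is an invented example, not from any corpus):

# Case-folding: map every letter to lower case.
terms = ["The", "Fed", "raised", "rates", "and", "fed", "the", "cat"]
folded = [t.lower() for t in terms]
print(folded)
# ['the', 'fed', 'raised', 'rates', 'and', 'fed', 'the', 'cat']
# Note how "Fed" (the central bank) and "fed" (the verb) collapse into one
# form -- an example of words that might better be kept apart.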

Tokens : Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens.
OR
               A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.
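
A minimal sketch of tokenization using a simple regular expression; this is only an illustration, and real tokenizers (e.g. in NLTK or spaCy) handle many more edge cases:

import re

def tokenize(text):
    # Keep runs of letters, digits, and internal apostrophes as tokens.
    return re.findall(r"[A-Za-z0-9']+", text)

print(tokenize("Friends, Romans, Countrymen, lend me your ears;"))
# ['Friends', 'Romans', 'Countrymen', 'lend', 'me', 'your', 'ears']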

Type : A type is the class of all tokens containing the same character sequence; in other words, the set of distinct words of the document, with repetitions counted only once.
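
A small sketch of the token/type distinction, counting types with Python's set (the sentence is an invented example):

text = "to be or not to be"
tokens = text.split()          # ['to', 'be', 'or', 'not', 'to', 'be']
types = set(tokens)            # {'to', 'be', 'or', 'not'}
print(len(tokens), "tokens")   # 6 tokens
print(len(types), "types")     # 4 types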

Normalize : We normalize words by canonicalizing tokens so that matches occur in spite of superficial differences in the character sequences of the tokens. For instance, if we search for USA we might hope to also match documents containing U.S.A.
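
A minimal sketch of normalization via an equivalence table; the CANONICAL mapping below is a made-up illustration, not a standard resource:

# Map surface variants of a token to one canonical form.
CANONICAL = {
    "u.s.a.": "usa",
    "u.s.a": "usa",
    "usa": "usa",
}

def normalize(token):
    t = token.lower()
    return CANONICAL.get(t, t)

print(normalize("U.S.A."))  # 'usa'
print(normalize("USA"))     # 'usa'
# A query for "USA" now also matches documents containing "U.S.A."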

