Tech moves fast! Stay ahead of the curve with Techopedia!
Join nearly 200,000 subscribers who receive actionable tech insights from Techopedia.
Tokenization is the act of breaking up a sequence of strings into pieces such as words, keywords, phrases, symbols and other elements called tokens. Tokens can be individual words, phrases or even whole sentences. In the process of tokenization, some characters like punctuation marks are discarded. The tokens become the input for another process like parsing and text mining.
Tokenization is used in computer science, where it plays a large part in the process of lexical analysis.
Tokenization relies mostly on simple heuristics in order to separate tokens by following a few steps:
Tokens themselves can also be separators. For example, in most programming languages, identifiers can be placed together with arithmetic operators without white spaces. Although it seems that this would appear as a single word or token, the grammar of the language actually considers the mathematical operator (a token) as a separator, so even when multiple tokens are bunched up together, they can still be separated via the mathematical operator.
Join nearly 200,000 subscribers who receive actionable tech insights from Techopedia.