Building a Programming Language: What is a Tokenizer?

Let’s say you want to make a programming language. The first thing your implementation needs to do is tokenize the input. What does that mean?

Your implementation will be a program that reads an input string and interprets its meaning, analogous to the way we read sentences and interpret their meanings. Consider the following sentence:

The quick brown fox jumps over the lazy dog.

When a human reads this sentence, they look at it and see words. But when a computer reads the same sentence, it just sees a bunch of characters:

'T', 'h', 'e', ' ', 'q', 'u', 'i', 'c', 'k', ' ', 'b', 'r', 'o', 'w', 'n', ' ', 'f', 'o', 'x', ' ', 'j', 'u', 'm', 'p', 's', ' ', 'o', 'v', 'e', 'r', ' ', 't', 'h', 'e', ' ', 'l', 'a', 'z', 'y', ' ', 'd', 'o', 'g', '.'

Tokenizing is the process of interpreting that series of characters as a sentence of words:

'The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.'
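If it helps to see that step in code, here is a naive sketch in TypeScript. The helper name and the period handling are assumptions of mine for illustration, not a real English tokenizer:

// A naive word splitter: break on whitespace, then peel a trailing
// period off into its own entry. Illustrative sketch only.
function words(sentence: string): string[] {
  return sentence
    .split(/\s+/)
    .flatMap((w) => (w.endsWith(".") ? [w.slice(0, -1), "."] : [w]));
}

words("The quick brown fox jumps over the lazy dog.");
// => ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']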

A list of words is much easier to understand, but we’re not done yet. A list of words followed by a period does not necessarily make a valid sentence. To know whether the sentence is valid and what it might mean, we need to know what part of speech each word is. Something like:

{ type: Article, source: 'The' },
{ type: Adjective, source: 'quick' },
{ type: Adjective, source: 'brown' },
{ type: Noun, source: 'fox' },
{ type: Verb, source: 'jumps' },
{ type: Preposition, source: 'over' },
{ type: Article, source: 'the' },
{ type: Adjective, source: 'lazy' },
{ type: Noun, source: 'dog' },
{ type: Period, source: '.' }

Now we have something that a computer program could begin to interpret. Each of the items listed above is called a token. A token has a type in addition to its source, which is the original string of characters it was scanned from.
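In code, a token can be modeled as a small record. Here is a minimal sketch in TypeScript (the interface itself is my framing; only the two field names come from the examples above):

// A token pairs a type tag with the original characters it was
// scanned from.
interface Token {
  type: string;   // e.g. Article, Noun, Period
  source: string; // the exact substring of the input
}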

Just because we have a list of tokens doesn’t necessarily mean we have a valid sentence, but we now have enough information to determine whether the sentence follows the rules of the language’s grammar.

Of course, we’re making a programming language, so the “sentences” look more like this:

local (text = op.getsuboutline ());

And the tokens look like this:

{ type: Local, source: "local" },
{ type: LeftParen, source: "(" },
{ type: Identifier, source: "texttosearch" },
{ type: Equal, source: "=" },
{ type: Identifier, source: "op" },
{ type: Dot, source: "." },
{ type: Identifier, source: "getsuboutline" },
{ type: LeftParen, source: "(" },
{ type: RightParen, source: ")" },
{ type: RightParen, source: ")" },
{ type: Semicolon, source: ";" }

But the concept is the same. The job of a tokenizer is to scan through the input string and convert it into a list of tokens.
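To make that concrete, here is a minimal tokenizer in TypeScript that handles the statement above. It’s a sketch built on my own assumptions: the token type names mirror the example, but the keyword table, the punctuation table, and the character tests are illustrative, not a full language.

type TokenType =
  | "Local" | "Identifier" | "Equal" | "Dot"
  | "LeftParen" | "RightParen" | "Semicolon";

interface Token {
  type: TokenType;
  source: string;
}

// Words that are keywords rather than identifiers.
const keywords: Record<string, TokenType> = { local: "Local" };

// Single-character punctuation tokens.
const punctuation: Record<string, TokenType> = {
  "(": "LeftParen", ")": "RightParen",
  "=": "Equal", ".": "Dot", ";": "Semicolon",
};

function tokenize(input: string): Token[] {
  const tokens: Token[] = [];
  let i = 0;
  while (i < input.length) {
    const ch = input[i];
    if (/\s/.test(ch)) {
      i++; // skip whitespace between tokens
    } else if (/[A-Za-z_]/.test(ch)) {
      // Scan a whole word, then decide: keyword or identifier?
      const start = i;
      while (i < input.length && /[A-Za-z0-9_]/.test(input[i])) i++;
      const source = input.slice(start, i);
      tokens.push({ type: keywords[source] ?? "Identifier", source });
    } else if (punctuation[ch]) {
      tokens.push({ type: punctuation[ch], source: ch });
      i++;
    } else {
      throw new Error(`Unexpected character '${ch}' at index ${i}`);
    }
  }
  return tokens;
}

tokenize("local (text = op.getsuboutline ());");
// => the eleven tokens listed above

A real tokenizer also has to handle things like string and number literals, comments, and keeping track of line numbers for error messages, but the loop is always the same: skip what doesn’t matter, scan one token, classify it, emit it.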

Once we have a tokenizer, we can start building a compiler that takes the list of tokens, interprets it according to the rules of the language’s grammar, and generates instructions for the computer to execute. But that’s for another day.

Ted C. Howard @ted