Building a Programming Language: What is a Tokenizer?

Let’s say you want to make a programming language. The first thing your implementation needs to do is tokenize the input. What does that mean?

Your implementation will be a program that reads an input string and interprets its meaning, analogous to the way we read sentences and interpret their meanings. Consider the following sentence:

The quick brown fox jumps over the lazy dog.

When a human reads this sentence, they look at it and see words. But when a computer reads the same sentence, it just sees a bunch of characters:

'T', 'h', 'e', ' ', 'q', 'u', 'i', 'c', 'k', ' ', 'b', 'r', 'o', 'w', 'n', ' ', 'f', 'o', 'x', ' ', 'j', 'u', 'm', 'p', 's', ' ', 'o', 'v', 'e', 'r', ' ', 't', 'h', 'e', ' ', 'l', 'a', 'z', 'y', ' ', 'd', 'o', 'g', '.'

Tokenizing is the process of interpreting that series of characters as a sentence of words:

'The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.'
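If it helps to see that step in code, here is a naive sketch in TypeScript. The helper name and the period handling are assumptions of mine for illustration, not a real English tokenizer:

// A naive word splitter: break on whitespace, then peel a trailing
// period off into its own entry. Illustrative sketch only.
function words(sentence: string): string[] {
  return sentence
    .split(/\s+/)
    .flatMap((w) => (w.endsWith(".") ? [w.slice(0, -1), "."] : [w]));
}

words("The quick brown fox jumps over the lazy dog.");
// => ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']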

A list of words is much easier to understand, but we’re not done yet. A list of words followed by a period does not necessarily make a valid sentence. To know whether the sentence is valid and what it might mean, we need to know what part of speech each word is. Something like:

{ type: Article, source: 'The' },
{ type: Adjective, source: 'quick' },
{ type: Adjective, source: 'brown' },
{ type: Noun, source: 'fox' },
{ type: Verb, source: 'jumps' },
{ type: Preposition, source: 'over' },
{ type: Article, source: 'the' },
{ type: Adjective, source: 'lazy' },
{ type: Noun, source: 'dog' },
{ type: Period, source: '.' }

Now we have something that a computer program could begin to interpret. Each of the items listed above is called a token. A token has a type in addition to its source, which is the original string of characters it was scanned from.
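In code, a token can be modeled as a small record. Here is a minimal sketch in TypeScript (the interface itself is my framing; only the two field names come from the examples above):

// A token pairs a type tag with the original characters it was
// scanned from.
interface Token {
  type: string;   // e.g. Article, Noun, Period
  source: string; // the exact substring of the input
}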

Just because we have a list of tokens doesn’t necessarily mean we have a valid sentence, but we now have enough information to determine whether the sentence follows the rules of the language’s grammar.

Of course, we’re making a programming language, so the “sentences” look more like this:

local (text = op.getsuboutline ());

And the tokens look like this:

{ type: Local, source: "local" },
{ type: LeftParen, source: "(" },
{ type: Identifier, source: "texttosearch" },
{ type: Equal, source: "=" },
{ type: Identifier, source: "op" },
{ type: Dot, source: "." },
{ type: Identifier, source: "getsuboutline" },
{ type: LeftParen, source: "(" },
{ type: RightParen, source: ")" },
{ type: RightParen, source: ")" },
{ type: Semicolon, source: ";" }

But the concept is the same. The job of a tokenizer is to scan through the input string and convert it into a list of tokens.
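To make that concrete, here is a minimal tokenizer in TypeScript that handles the statement above. It’s a sketch built on my own assumptions: the token type names mirror the example, but the keyword table, the punctuation table, and the character tests are illustrative, not a full language.

type TokenType =
  | "Local" | "Identifier" | "Equal" | "Dot"
  | "LeftParen" | "RightParen" | "Semicolon";

interface Token {
  type: TokenType;
  source: string;
}

// Words that are keywords rather than identifiers.
const keywords: Record<string, TokenType> = { local: "Local" };

// Single-character punctuation tokens.
const punctuation: Record<string, TokenType> = {
  "(": "LeftParen", ")": "RightParen",
  "=": "Equal", ".": "Dot", ";": "Semicolon",
};

function tokenize(input: string): Token[] {
  const tokens: Token[] = [];
  let i = 0;
  while (i < input.length) {
    const ch = input[i];
    if (/\s/.test(ch)) {
      i++; // skip whitespace between tokens
    } else if (/[A-Za-z_]/.test(ch)) {
      // Scan a whole word, then decide: keyword or identifier?
      const start = i;
      while (i < input.length && /[A-Za-z0-9_]/.test(input[i])) i++;
      const source = input.slice(start, i);
      tokens.push({ type: keywords[source] ?? "Identifier", source });
    } else if (punctuation[ch]) {
      tokens.push({ type: punctuation[ch], source: ch });
      i++;
    } else {
      throw new Error(`Unexpected character '${ch}' at index ${i}`);
    }
  }
  return tokens;
}

tokenize("local (text = op.getsuboutline ());");
// => the eleven tokens listed above

A real tokenizer also has to handle things like string and number literals, comments, and keeping track of line numbers for error messages, but the loop is always the same: skip what doesn’t matter, scan one token, classify it, emit it.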

Once we have a tokenizer, we can start building a compiler that takes the list of tokens, interprets it according to the rules of the language’s grammar, and generates instructions for the computer to execute. But that’s for another day.

Ted C. Howard @ted