Ted's Tidbits - development

Building a Programming Language: What is a Tokenizer?

Let’s say you want to make a programming language. The first thing your implementation needs to do is tokenize the input. What does that mean?

Your implementation will be a program that reads an input string and interprets its meaning, analogous to the way we read sentences and interpret their meanings. Consider the following sentence:

The quick brown fox jumps over the lazy dog.

When a human reads this sentence, they look at it and see words. But when a computer reads the same sentence, it just sees a bunch of characters:

'T', 'h', 'e', ' ', 'q', 'u', 'i', 'c', 'k', ' ', 'b', 'r', 'o', 'w', 'n', ' ', 'f', 'o', 'x', ' ', 'j', 'u', 'm', 'p', 's', ' ', 'o', 'v', 'e', 'r', ' ', 't', 'h', 'e', ' ', 'l', 'a', 'z', 'y', ' ', 'd', 'o', 'g', '.'

The process of tokenizing is the process of interpreting a series of characters as a sentence of words.

'The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.'

That’s much easier to understand now, but we’re not done yet. A list of words followed by a period does not necessarily make a valid sentence. To know if the sentence is valid and what it might mean, we need to know what part of speech each of the words are. Something like:

{ type: Article, source: 'The' },
{ type: Adjective, source: 'quick' },
{ type: Adjective, source: 'brown' },
{ type: Noun, source: 'fox' },
{ type: Verb, source: 'jumps' },
{ type: Preposition, source: 'over' },
{ type: Article, source: 'the' },
{ type: Adjective, source: 'lazy' },
{ type: Noun, source: 'dog' },
{ type: Period, source: '.' }

Now we have something that a computer program could begin to interpret. Each of those items listed above is called a token. It has a type in addition to its source which is the original string of characters.

Just because we have a list of tokens doesn’t necessarily mean that we have a valid sentence, but we now have enough information to be able to determine if this sentence follows the rules of the grammar of the language.

Of course, we’re making a programming language, so the “sentences” look more like this:

local (text = op.getsuboutline ());

And the tokens look like this:

{ type: Local, source: "local" },
{ type: LeftParen, source: "(" },
{ type: Identifier, source: "texttosearch" },
{ type: Equal, source: "=" },
{ type: Identifier, source: "op" },
{ type: Dot, source: "." },
{ type: Identifier, source: "getsuboutline" },
{ type: LeftParen, source: "(" },
{ type: RightParen, source: ")" },
{ type: RightParen, source: ")" },
{ type: Semicolon, source: ";" }

But the concept is the same. The job of a tokenizer is to scan through the input string and convert it into a list of tokens.

Once we have a tokenizer, we can start building a compiler which takes the list of tokens, interprets it according to the rules of the language’s grammar, and generates instructions for the computer to execute. But that’s for another day.

Sending Legitimate Bulk Email

This is for all those people who are trying to run a web business that need to send bulk email messages and don’t want them to go directly into their recipients' spam folders.

Yesterday, I (and several others) dedicated several hours to the task of determining why every email we sent went directly into the spam folders of those we were trying to reach. When you search Google for information about spam filters, you find plenty of information about blocking unwanted email, but hardly anything about making sure your legitimate bulk email is not discarded with the trash. We were able to solve our issues, and so I thought I’d share our findings with the community.

Send only plain text. Attachments and HTML content raise flags with content filters.
Set the message header: "Precedence: bulk"
You must set a subject, body, from address, and reply-to address (not having reply-to was my problem

In addition, if you are hosting your own mail server you should:

Publish an SPF record in your DNS configuration
Configure your MTA to and DNS to use DKIM. (Acronyms FTW!)

I hope this info is helpful to someone. I wish I had it.

Sources: