Lexing 2.2

Definition

In computer science, lexical analysis, lexing, or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens.

Tokens

Commonly stored as (name, value) tuples. Example:

my_variable = 17

Becomes:

(identifier, "my_variable")
(operator, "=")
(integer, 17)

Calculator Example

Supported operations

  • 1 + 2
  • 4 - 2
  • 2 * 3
  • 8 / 4

Goal

  • Scan a string such as “1 + 2 * 3” and create a list of (type, value) tuples:
(number, 1)
(operator, +)
(number, 2)
(operator, *)
(number, 3)
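This goal can be sketched with Python's re module; the combined pattern and the function name below are my own, not part of the slides:

```python
import re

# One combined pattern: group 1 captures a number, group 2 an operator.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|([+\-*/]))")

def scan(text):
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise ValueError(f"Unexpected character at position {pos}")
        number, operator = m.groups()
        if number is not None:
            tokens.append(("number", int(number)))
        else:
            tokens.append(("operator", operator))
        pos = m.end()
    return tokens

print(scan("1 + 2 * 3"))
# [('number', 1), ('operator', '+'), ('number', 2), ('operator', '*'), ('number', 3)]
```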

Chomsky Hierarchy

Grammar / Automaton (Computer)

  1. Unrestricted / Turing Machines
  2. Context Free / Pushdown Automata
  3. Regular / Finite State Automata

Regular Languages

  • Are insufficient to parse most programming languages
  • Are useful for parsing tokens
  • Can be processed by a simple finite automaton

Deterministic Finite Automaton (DFA)

  • Finite set of states \(Q\)
  • Finite set of input symbols called the alphabet \(\Sigma\)
  • Transition function \(\delta : Q \times \Sigma \rightarrow Q\)
  • Initial or start state \(q_0 \in Q\)
  • Set of accept states \(F \subseteq Q\)

Drawing DFAs

  • States are nodes on the graph
  • Start state indicated by arrow
  • Accept states indicated by double border
  • Transitions indicated as labeled arrows

DFA to accept a string containing an even number of zeroes

Regular Expressions

A string used as a search pattern for a member of a regular language.

RE Basics

  • Characters are generally matched literally.
import re

s = "the quick brown fox jumped over the lazy dog"
re.findall("dog", s)
['dog']

RE Boolean Or

  • Pipes (|) can be used for boolean or
import re

s = "the quick brown fox jumped over the lazy dog"
re.findall("dog|fox", s)
['fox', 'dog']

RE Quantifiers

  • ? - zero or one occurrences of preceding element
  • * - zero or more occurrences of preceding element
  • + - one or more occurrences of preceding element
  • {n} - exactly n occurrences of preceding element
import re

s = "the quick brown fox jumped over the lazy dog"
re.findall("ov?", s)
['o', 'o', 'ov', 'o']

RE Grouping

  • Parens can be used for grouping
import re

s = "the quick brown fox jumped over the lazy dog"
re.findall("(d|f)o(g|x)", s)
[('f', 'x'), ('d', 'g')]

  • When the pattern contains groups, findall returns the group values rather than the whole match

RE Bracket Expressions

  • Brackets [] may be used to match a single character against a set of characters
import re

s = "the quick brown fox jumped over the lazy dog"
re.findall("[df]o[gx]", s)
['fox', 'dog']

RE Character Classes

Several special character classes are provided:

  • \w - alphanumeric characters and underscore
  • \d - digits
  • \s - whitespace characters
  • . - any character except a newline
import re

s = "the quick brown fox jumped over the lazy dog"
re.findall(r"\s...\s", s)
[' fox ', ' the ']

DFA RE Equivalence

  • A DFA can be created to match any regular expression
  • A regular expression can be created to match a language accepted by any DFA

Match Even Number of Zeroes

import re

inputs = [
    "", "0", "00", "11", "1", "010",
    "110110110", "00100", "1000000",
]

for text in inputs:
    print(f"{text:<9}",
        re.fullmatch(r"1*(01*01*)*", text) is not None)

Simple Lexer in Python

import re

def lex_one(value):
    # fullmatch ensures the whole chunk is digits, not just a prefix
    if re.fullmatch(r'\d+', value):
        return ('number', value)
    elif re.fullmatch(r'[+\-*/]', value):
        return ('operator', value)
    else:
        raise ValueError(f'Invalid token: {value}')

def lex(text):
    return [lex_one(v) for v in text.split()]
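A quick usage sketch; the lexer is repeated here (with fullmatch, so a chunk like "12a" is rejected) so the snippet runs on its own:

```python
import re

def lex_one(value):
    if re.fullmatch(r'\d+', value):
        return ('number', value)
    elif re.fullmatch(r'[+\-*/]', value):
        return ('operator', value)
    else:
        raise ValueError(f'Invalid token: {value}')

def lex(text):
    # split() assumes tokens are separated by whitespace, so "1+2" would fail
    return [lex_one(v) for v in text.split()]

print(lex("1 + 2 * 3"))
# [('number', '1'), ('operator', '+'), ('number', '2'), ('operator', '*'), ('number', '3')]
```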