Lexing 2.2

Definition

In computer science, lexical analysis, lexing, or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens.

Tokens

Commonly stored as (name, value) tuples. Example:

my_variable = 17

Becomes:

(identifier, "my_variable")
(operator, "=")
(integer, 17)

Calculator Example

Supported operations

  • 1 + 2
  • 4 - 2
  • 2 * 3
  • 8 / 4

Goal

  • Scan a string such as “1 + 2 * 3” and create a list of (type, value) tuples:
(number, 1)
(operator, +)
(number, 2)
(operator, *)
(number, 3)
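This goal can be sketched with Python's re module; the combined pattern and the function name below are my own, not part of the slides:

```python
import re

# One combined pattern: group 1 captures a number, group 2 an operator.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|([+\-*/]))")

def scan(text):
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise ValueError(f"Unexpected character at position {pos}")
        number, operator = m.groups()
        if number is not None:
            tokens.append(("number", int(number)))
        else:
            tokens.append(("operator", operator))
        pos = m.end()
    return tokens

print(scan("1 + 2 * 3"))
# [('number', 1), ('operator', '+'), ('number', 2), ('operator', '*'), ('number', 3)]
```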

Chomsky Hierarchy

Grammar / Automaton (Computer)

  1. Unrestricted / Turing Machines
  2. Context Free / Pushdown Automata
  3. Regular / Finite State Automata

Regular Languages

  • Are insufficient to parse most programming languages
  • Are useful for parsing tokens
  • Can be processed by a simple finite automaton

Deterministic Finite Automaton (DFA)

  • Finite set of states \(Q\)
  • Finite set of input symbols called the alphabet \(\Sigma\)
  • Transition function \(\delta : Q \times \Sigma \rightarrow Q\)
  • Initial or start state \(q_0 \in Q\)
  • Set of accept states \(F \subseteq Q\)

Drawing DFAs

  • States are nodes on the graph
  • Start state indicated by arrow
  • Accept states indicated by double border
  • Transitions indicated as labeled arrows

DFA to accept a string containing an even number of zeroes

Regular Expressions

A string used as a search pattern for a member of a regular language.

RE Basics

  • Characters are generally matched literally.
import re

s = "the quick brown fox jumped over the lazy dog"
re.findall("dog", s)
['dog']

RE Boolean Or

  • Pipes (|) can be used for boolean or
import re

s = "the quick brown fox jumped over the lazy dog"
re.findall("dog|fox", s)
['fox', 'dog']

RE Quantifiers

  • ? - zero or one occurrences of preceding element
  • * - zero or more occurrences of preceding element
  • + - one or more occurrences of preceding element
  • {n} - exactly n occurrences of preceding element
import re

s = "the quick brown fox jumped over the lazy dog"
re.findall("ov?", s)
['o', 'o', 'ov', 'o']

RE Grouping

  • Parens can be used for grouping
import re

s = "the quick brown fox jumped over the lazy dog"
re.findall("(d|f)o(g|x)", s)
[('f', 'x'), ('d', 'g')]

  • When the pattern contains groups, findall returns the group values rather than the whole match

RE Bracket Expressions

  • Brackets [] may be used to match a single character against a set of characters
import re

s = "the quick brown fox jumped over the lazy dog"
re.findall("[df]o[gx]", s)
['fox', 'dog']

RE Character Classes

Several special character classes are provided:

  • \w - alphanumeric characters and underscore
  • \d - digits
  • \s - whitespace characters
  • . - any character except a newline
import re

s = "the quick brown fox jumped over the lazy dog"
re.findall(r"\s...\s", s)
[' fox ', ' the ']

DFA RE Equivalence

  • A DFA can be created to match any regular expression
  • A regular expression can be created to match a language accepted by any DFA

Match Even Number of Zeroes

import re

inputs = [
    "", "0", "00", "11", "1", "010",
    "110110110", "00100", "1000000",
]

for text in inputs:
    print(f"{text:<9}",
        re.fullmatch(r"1*(01*01*)*", text) is not None)

Simple Lexer in Python

import re

def lex_one(value):
    # fullmatch ensures the whole chunk is digits, not just a prefix
    if re.fullmatch(r'\d+', value):
        return ('number', value)
    elif re.fullmatch(r'[+\-*/]', value):
        return ('operator', value)
    else:
        raise ValueError(f'Invalid token: {value}')

def lex(text):
    return [lex_one(v) for v in text.split()]
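A quick usage sketch; the lexer is repeated here (with fullmatch, so a chunk like "12a" is rejected) so the snippet runs on its own:

```python
import re

def lex_one(value):
    if re.fullmatch(r'\d+', value):
        return ('number', value)
    elif re.fullmatch(r'[+\-*/]', value):
        return ('operator', value)
    else:
        raise ValueError(f'Invalid token: {value}')

def lex(text):
    # split() assumes tokens are separated by whitespace, so "1+2" would fail
    return [lex_one(v) for v in text.split()]

print(lex("1 + 2 * 3"))
# [('number', '1'), ('operator', '+'), ('number', '2'), ('operator', '*'), ('number', '3')]
```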