从‘Hello World’到编译器：用Python手写一个简单的语法树生成器（附完整代码）-二趣网

从‘Hello World’到编译器：用Python手写一个简单的语法树生成器（附完整代码）

编译原理常被视为计算机科学中最艰深的领域之一，那些晦涩的术语和抽象概念让许多开发者望而却步。但当我第一次看到算术表达式被拆解成树状结构时，突然意识到：语法树不过是代码的另一种可视化形式。本文将用不到200行Python代码，带您实现一个能解析四则运算并生成语法树的微型编译器前端，这种"从做中学"的方式远比死记硬背文法规则更有效。

1. 环境准备与基础概念

在开始编码前，我们需要明确几个关键概念。上下文无关文法（Context-Free Grammar）是描述编程语言语法的标准工具，它由一组产生式规则组成。以简单算术表达式为例，其文法可以表示为：

Expr → Term + Expr | Term - Expr | Term Term → Factor * Term | Factor / Term | Factor Factor → ( Expr ) | Number Number → [0-9]+

这个文法清晰地定义了运算优先级：括号>乘除>加减。递归下降解析器正是基于这种分层结构设计的——每个非终结符（如Expr、Term）对应一个解析函数，形成自然的调用层次。

准备Python 3.8+环境，我们只需要标准库：

import sys import re from dataclasses import dataclass from typing import Union, List

2. 词法分析器实现

词法分析器（Lexer）负责将原始字符串转换为标记流（Token Stream）。我们首先定义标记类型：

@dataclass class Token: type: str # 类型如NUMBER, PLUS, LPAREN等 value: Union[str, int, float] pos: int # 在源字符串中的位置 class Lexer: TOKEN_SPEC = [ ('NUMBER', r'\d+(\.\d*)?'), ('PLUS', r'\+'), ('MINUS', r'-'), ('MUL', r'\*'), ('DIV', r'/'), ('LPAREN', r'\('), ('RPAREN', r'\)'), ('SKIP', r'[ \t]+'), # 忽略空格和制表符 ] def __init__(self, text): self.text = text self.pos = 0 self.tokens = self._tokenize() def _tokenize(self): tokens = [] while self.pos < len(self.text): for token_type, pattern in self.TOKEN_SPEC: regex = re.compile(pattern) match = regex.match(self.text, self.pos) if match: value = match.group() if token_type != 'SKIP': if token_type == 'NUMBER': value = float(value) if '.' in value else int(value) tokens.append(Token(token_type, value, self.pos)) self.pos = match.end() break else: raise SyntaxError(f"Unexpected character: {self.text[self.pos]}") return tokens

测试词法分析器：

lexer = Lexer("3 + 4 * (10 - 5)") print([(t.type, t.value) for t in lexer.tokens]) # 输出: [('NUMBER', 3), ('PLUS', '+'), ('NUMBER', 4), ('MUL', '*'), # ('LPAREN', '('), ('NUMBER', 10), ('MINUS', '-'), ('NUMBER', 5), ('RPAREN', ')')]

3. 递归下降语法分析

语法分析器（Parser）将标记流转换为抽象语法树（AST）。我们首先定义AST节点类型：

class ASTNode: pass @dataclass class BinOp(ASTNode): left: ASTNode op: Token right: ASTNode @dataclass class Num(ASTNode): token: Token @dataclass class UnaryOp(ASTNode): op: Token expr: ASTNode

递归下降解析器的核心是分层解析函数，每个函数对应文法中的一个非终结符：

class Parser: def __init__(self, tokens): self.tokens = tokens self.current_token = None self.pos = -1 self.advance() def advance(self): self.pos += 1 self.current_token = self.tokens[self.pos] if self.pos < len(self.tokens) else None def parse(self): return self.expr() def expr(self): node = self.term() while self.current_token and self.current_token.type in ('PLUS', 'MINUS'): op = self.current_token self.advance() node = BinOp(left=node, op=op, right=self.term()) return node def term(self): node = self.factor() while self.current_token and self.current_token.type in ('MUL', 'DIV'): op = self.current_token self.advance() node = BinOp(left=node, op=op, right=self.factor()) return node def factor(self): token = self.current_token if token.type == 'LPAREN': self.advance() node = self.expr() if self.current_token.type != 'RPAREN': raise SyntaxError("Expected closing parenthesis") self.advance() return node elif token.type == 'NUMBER': self.advance() return Num(token) elif token.type in ('PLUS', 'MINUS'): self.advance() return UnaryOp(op=token, expr=self.factor()) else: raise SyntaxError(f"Unexpected token: {token.type}")

测试解析器：

tokens = Lexer("3 + 4 * (10 - 5)").tokens ast = Parser(tokens).parse() print(ast) # 输出类似: BinOp(left=Num(token=Token(type='NUMBER', value=3, pos=0)), # op=Token(type='PLUS', value='+', pos=2), # right=BinOp(left=Num(token=Token(type='NUMBER', value=4, pos=4)), # op=Token(type='MUL', value='*', pos=6), # right=...))

4. 可视化语法树

为了直观理解解析结果，我们实现一个简单的树形打印器：

def print_ast(node, indent=0): if isinstance(node, Num): print(' ' * indent + f"Number({node.token.value})") elif isinstance(node, UnaryOp): print(' ' * indent + f"UnaryOp({node.op.value})") print_ast(node.expr, indent + 2) elif isinstance(node, BinOp): print(' ' * indent + f"BinOp({node.op.value})") print_ast(node.left, indent + 2) print_ast(node.right, indent + 2) print_ast(ast) """ 输出示例: BinOp(+) Number(3) BinOp(*) Number(4) BinOp(-) Number(10) Number(5) """

更专业的可视化可以使用Graphviz生成图片。添加以下代码：

from graphviz import Digraph def visualize_ast(node, graph=None): if graph is None: graph = Digraph() graph.node(str(id(node)), label=f"Root: {node.__class__.__name__}") if isinstance(node, Num): graph.node(str(id(node)), label=f"Number\n{node.token.value}") elif isinstance(node, UnaryOp): graph.node(str(id(node)), label=f"Unary {node.op.value}") graph.edge(str(id(node)), str(id(node.expr))) visualize_ast(node.expr, graph) elif isinstance(node, BinOp): graph.node(str(id(node)), label=f"Binary {node.op.value}") graph.edge(str(id(node)), str(id(node.left))) graph.edge(str(id(node)), str(id(node.right))) visualize_ast(node.left, graph) visualize_ast(node.right, graph) return graph visualize_ast(ast).render('ast', view=True) # 生成PDF可视化文件

5. 错误处理与扩展建议

当前实现已经能处理基本算术表达式，但还需要增强健壮性：

错误恢复：当遇到语法错误时，可以尝试跳过当前标记继续解析
更丰富的语法：支持变量、函数调用等特性
语义分析：检查类型一致性等语义规则

添加错误恢复的改进版解析器示例：

class AdvancedParser(Parser): def _error(self, message): print(f"SyntaxError at position {self.current_token.pos}: {message}") # 尝试跳过当前标记继续解析 self.advance() return None def factor(self): try: token = self.current_token if not token: return None if token.type == 'LPAREN': self.advance() node = self.expr() if not self.current_token or self.current_token.type != 'RPAREN': return self._error("Expected ')'") self.advance() return node elif token.type == 'NUMBER': self.advance() return Num(token) elif token.type in ('PLUS', 'MINUS'): self.advance() expr = self.factor() return UnaryOp(op=token, expr=expr) if expr else None else: return self._error(f"Unexpected token: {token.type}") except IndexError: return self._error("Unexpected end of input")

实际项目中，您可能还需要考虑以下优化：

词法分析性能：使用预编译的正则表达式
语法树遍历：实现Visitor模式便于后续分析
运算符优先级表：动态调整优先级而不用修改文法

# Visitor模式示例 class ASTVisitor: def visit(self, node): method_name = f'visit_{type(node).__name__}' visitor = getattr(self, method_name, self.generic_visit) return visitor(node) def generic_visit(self, node): raise Exception(f'No visit_{type(node).__name__} method') class EvalVisitor(ASTVisitor): def visit_Num(self, node): return node.token.value def visit_BinOp(self, node): left = self.visit(node.left) right = self.visit(node.right) if node.op.type == 'PLUS': return left + right elif node.op.type == 'MINUS': return left - right elif node.op.type == 'MUL': return left * right elif node.op.type == 'DIV': return left / right def visit_UnaryOp(self, node): value = self.visit(node.expr) return -value if node.op.type == 'MINUS' else value eval_visitor = EvalVisitor() result = eval_visitor.visit(ast) print(f"计算结果: {result}") # 输出: 计算结果: 23.0

这个微型编译器前端虽然简单，但已经包含了现代编译器的核心思想。当您下次使用Babel或TypeScript编译器时，会发现它们本质上也是在构建和转换语法树——只是处理的语法更复杂，优化的程度更深而已。

企业官网建设流程全解析