A compiler works in phases to translate high-level source code into machine code or an intermediate representation (i.e., the orange box in the image below). Compilation phases are categorized into two parts: analysis and synthesis.
Within these two phases, there are seven sub-phases (1-3: analysis; 4-7: synthesis):
- Lexical analysis – tokenizes source code.
- Syntax analysis – builds an AST based on grammar.
- Semantic analysis – ensures correct meaning and types.
- Intermediate code generation – produces a machine-independent representation.
- Optimization – improves intermediate code.
- Code generation – produces machine-specific assembly code.
- Code optimization (optional) – improves generated code.
Let’s explore each compilation phase.
Analysis phase
1. Lexical analysis (scanner)
The lexical analysis phase is responsible for reading the source code character by character and grouping them into sequences called tokens.
Tokens represent syntax components (e.g., keywords, identifiers, literals, and operators).
- Output – a stream of tokens passed to the next phase.
- Example – from the line int x = 10;, the lexer might produce tokens like: int (keyword), x (identifier), = (operator), 10 (literal), ; (semicolon).
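To make this concrete, here is a minimal lexer sketch in Python. The token categories and regular expressions are a simplification for this one statement, not taken from any real compiler:

```python
import re

# A toy token specification for statements like "int x = 10;".
# The categories and patterns are simplified for illustration.
TOKEN_SPEC = [
    ("KEYWORD",    r"\bint\b"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("LITERAL",    r"\d+"),
    ("OPERATOR",   r"="),
    ("SEMICOLON",  r";"),
    ("SKIP",       r"\s+"),
]

def tokenize(source):
    """Scan the source string left to right and yield (kind, text) tokens."""
    pattern = "|".join(f"(?P<{name}>{regex})" for name, regex in TOKEN_SPEC)
    for match in re.finditer(pattern, source):
        kind = match.lastgroup
        if kind != "SKIP":          # whitespace is discarded, not tokenized
            yield kind, match.group()

print(list(tokenize("int x = 10;")))
# [('KEYWORD', 'int'), ('IDENTIFIER', 'x'), ('OPERATOR', '='),
#  ('LITERAL', '10'), ('SEMICOLON', ';')]
```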
2. Syntax analysis (parser)
Abstract Syntax Tree (source).
The parser takes the stream of tokens produced by the lexical analyzer and checks if they follow the language’s grammatical rules. It typically produces an abstract syntax tree (AST) or parse tree.
- Output – a tree-like structure (AST) representing the source code’s grammatical structure.
- Example – the tokens from int x = 10; might be parsed into an AST representing the variable’s declaration and assignment.
Read more: Parse Tree vs Syntax Tree – GeeksforGeeks
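Here is a minimal recursive-descent-style sketch in Python that turns the token stream from the previous phase into one AST node. The VarDecl node and the single-rule grammar are simplifications invented for this example:

```python
from dataclasses import dataclass

# Simplified AST node for a declaration like "int x = 10;".
@dataclass
class VarDecl:
    type_name: str
    name: str
    value: int

def parse_declaration(tokens):
    """Parse the rule: KEYWORD IDENTIFIER '=' LITERAL ';' into a VarDecl node."""
    def expect(kind):
        actual_kind, text = tokens.pop(0)
        if actual_kind != kind:
            raise SyntaxError(f"expected {kind}, got {actual_kind} ({text!r})")
        return text

    type_name = expect("KEYWORD")
    name = expect("IDENTIFIER")
    expect("OPERATOR")                 # the '=' sign
    value = int(expect("LITERAL"))
    expect("SEMICOLON")
    return VarDecl(type_name, name, value)

tokens = [("KEYWORD", "int"), ("IDENTIFIER", "x"),
          ("OPERATOR", "="), ("LITERAL", "10"), ("SEMICOLON", ";")]
print(parse_declaration(tokens))   # VarDecl(type_name='int', name='x', value=10)
```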
3. Semantic analysis
In this phase, the compiler ensures that the parsed structure (AST) adheres to the language’s semantic rules.
Semantic analysis includes type checking, checking scope rules, and ensuring variable declarations and uses are correct.
- Output – an annotated AST that includes type information and other semantic details.
- Example – ensuring that I am not trying to assign a string to an integer variable, and that the variable x has been declared before use.

Syntax and semantics (source).

Annotated AST (source).
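As a rough illustration, here is a tiny semantic-check sketch in Python. The symbol-table layout and the error types are my own simplification of what a real type checker does:

```python
# Toy semantic checks: declared-before-use and type compatibility.
symbol_table = {}

def declare(name, type_name):
    """Record a variable's declared type, rejecting duplicate declarations."""
    if name in symbol_table:
        raise TypeError(f"variable {name!r} already declared")
    symbol_table[name] = type_name

def check_assignment(name, value):
    """Ensure the variable exists and the value matches its declared type."""
    if name not in symbol_table:
        raise NameError(f"variable {name!r} used before declaration")
    if symbol_table[name] == "int" and not isinstance(value, int):
        raise TypeError(f"cannot assign {type(value).__name__} to int variable {name!r}")

declare("x", "int")
check_assignment("x", 10)        # fine
check_assignment("x", "hello")   # raises TypeError: cannot assign str to int variable 'x'
```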
Synthesis phase
4. Intermediate code generation
After the semantic analysis phase completes, the compiler generates an intermediate representation (IR) of the source code.
The IR is usually independent of the target machine, making it easier to optimize later on, and retarget the compiler for different architectures.
- Output – intermediate code, such as three-address code (TAC), control flow graphs (CFGs), or static single assignment (SSA).
- Example – high-level code like x = a + b might be translated to a lower-level IR like t1 = a + b, followed by x = t1.

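Here is a small Python sketch of lowering an expression tree to three-address code. The tuple-based AST and the t1, t2 temporary-naming scheme are assumptions made for illustration:

```python
# Sketch of lowering a small expression tree to three-address code (TAC).
temp_counter = 0

def new_temp():
    global temp_counter
    temp_counter += 1
    return f"t{temp_counter}"

def lower(node, code):
    """Return the name holding the node's value, appending TAC lines to `code`."""
    if isinstance(node, str):            # a plain variable reference
        return node
    op, left, right = node               # e.g. ("+", "a", "b")
    l = lower(left, code)
    r = lower(right, code)
    t = new_temp()
    code.append(f"{t} = {l} {op} {r}")
    return t

code = []
result = lower(("+", "a", ("*", "b", "c")), code)   # x = a + b * c
code.append(f"x = {result}")
print("\n".join(code))
# t1 = b * c
# t2 = a + t1
# x = t2
```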
5. Optimization
In this phase, the compiler tries to improve the intermediate code to make it more efficient, reducing resource usage (e.g., CPU and memory) or execution time.
Optimization can happen at multiple levels, including peephole optimization, loop optimization, constant folding, dead code elimination, etc.
- Output – an optimized version of the intermediate code.
- Example – if a variable’s value is constant, the compiler may replace all instances of that variable with the constant itself (constant propagation).
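As a rough sketch, here is what a constant-propagation pass over a tiny list of TAC instructions could look like in Python. The two-field instruction format is a simplification I chose for this example:

```python
# Sketch of constant propagation over a small list of TAC instructions,
# where each instruction is a (target, expression) pair.
def propagate_constants(instructions):
    constants = {}
    optimized = []
    for target, expr in instructions:
        # Replace any operand whose value is already known to be constant.
        expr = " ".join(str(constants.get(p, p)) for p in expr.split())
        if expr.isdigit():                 # the whole right-hand side is a constant
            constants[target] = int(expr)
        optimized.append((target, expr))
    return optimized

before = [("n", "10"), ("t1", "n + 5")]
print(propagate_constants(before))
# [('n', '10'), ('t1', '10 + 5')]  -- 'n' replaced by its constant value
```

A constant-folding pass (see the vocab list below) could then evaluate 10 + 5 down to 15 at compile time.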
Vocab
- Peephole optimization – local technique where small sequences of instructions are examined and replaced with more efficient ones, improving performance in a small code region.
- Loop optimization – enhances loop performance through unrolling, invariant code motion, or loop fusion, reducing the number of iterations or redundant computations. Here is an explanation of these loop optimization techniques:
- Unrolling – expands a loop’s iterations by duplicating the loop body, reducing loop-control overhead and potentially enabling further optimizations.
- Invariant code motion – moves computations or statements that produce the same result on every iteration (loop-invariant) outside the loop, avoiding redundant calculations within the loop.
- Loop fusion – merges two or more adjacent loops that iterate over the same range into a single loop, reducing loop overhead and improving cache locality.
- Constant folding – a process where constant expressions are evaluated at compile-time rather than runtime, reducing calculations during execution.
- Dead code elimination – removing code that is never executed or whose results are never used, reducing unnecessary computations and resource usage.
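To illustrate loop-invariant code motion specifically, here is a source-level before/after sketch in Python. A real compiler performs this transformation on the IR rather than on source code, and the function names here are made up for the example:

```python
# Before: the invariant expression is recomputed on every iteration.
def scale_all_before(values, scale):
    result = []
    for v in values:
        factor = scale * scale      # does not depend on the loop variable
        result.append(v * factor)
    return result

# After: the invariant computation is hoisted out of the loop.
def scale_all_after(values, scale):
    factor = scale * scale          # computed once, before the loop
    result = []
    for v in values:
        result.append(v * factor)
    return result

assert scale_all_before([1, 2, 3], 4) == scale_all_after([1, 2, 3], 4) == [16, 32, 48]
```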
6. Code generation
Then, the compiler translates the optimized intermediate code into target machine code (e.g., assembly or binary code) for the specific hardware architecture. This phase involves selecting instructions, allocating registers, and mapping variables to memory locations.
- Output – target machine code (e.g., assembly code).
- Example – a high-level statement like x = 10 might be translated into assembly instructions such as MOV R1, #10 and MOV [x], R1.
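Here is a naive code-generation sketch in Python that maps simple TAC-style assignments onto pseudo-assembly. The MOV mnemonics and the single scratch register are illustrative assumptions, not a real instruction set or register allocator:

```python
# Naive code generation: map simple (target, value) assignments to pseudo-assembly.
def generate(instructions):
    asm = []
    for target, value in instructions:
        if str(value).isdigit():
            asm.append(f"MOV R1, #{value}")   # load the constant into a register
        else:
            asm.append(f"MOV R1, [{value}]")  # load from the source variable's memory
        asm.append(f"MOV [{target}], R1")     # store the register into the target
    return asm

print("\n".join(generate([("x", "10")])))
# MOV R1, #10
# MOV [x], R1
```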
7. Code optimization (optional)
After generating the target code, additional machine-level optimizations may be performed (e.g., minimizing register usage, instruction reordering, or eliminating unnecessary instructions).
- Output – an even more efficient version of machine code.
- Example – removing redundant load and store instructions that don’t affect the program’s outcome.
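As a sketch of that kind of clean-up, the following Python pass drops a load that immediately re-reads a value just stored from the same register. It assumes the same made-up pseudo-assembly syntax as the earlier examples:

```python
# Peephole pass: remove a load that immediately re-reads a just-stored value.
def peephole(asm):
    optimized = []
    for line in asm:
        if optimized:
            prev = optimized[-1]
            # "MOV [x], R1" followed by "MOV R1, [x]" -- the load is redundant.
            if prev.startswith("MOV [") and line.startswith("MOV R"):
                dest, src = prev[4:].split(", ")      # "[x]", "R1"
                reg, mem = line[4:].split(", ")       # "R1", "[x]"
                if dest == mem and src == reg:
                    continue                          # skip the redundant load
        optimized.append(line)
    return optimized

code = ["MOV R1, #10", "MOV [x], R1", "MOV R1, [x]", "ADD R2, R1, #1"]
print(peephole(code))
# ['MOV R1, #10', 'MOV [x], R1', 'ADD R2, R1, #1']
```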
After the compiler: assembler and linker
The generated code might still need to be assembled and linked into an executable format. The assembly phase converts assembly code into machine-readable object code, while the linking phase resolves external references (e.g., function calls to external libraries) and produces the final executable.
- Output – the final executable file.
- Example – linking compiled code with external libraries and system calls to produce the final executable binary.
Read more: Converting High Level Languages to Machine Language – Olivia A. Gallucci
Compilation phases
Overall, each compilation phase is needed (or at least helpful) for converting source code into an executable program that adheres to the syntax and semantics of the original high-level language. In short, there are two primary phases of compilation, analysis and synthesis, and these divide into sub-phases (1-3: analysis; 4-7: synthesis):
- Lexical analysis – tokenizes source code.
- Syntax analysis – builds an AST based on grammar.
- Semantic analysis – ensures correct meaning and types.
- Intermediate code generation – produces a machine-independent representation.
- Optimization – improves intermediate code.
- Code generation – produces machine-specific assembly code.
- Code optimization (optional) – improves generated code.
If you enjoyed this post on compilation phases, consider reading how to use ROP to bypass security mechanisms.

