How does a Compiler work

Lexer:

Takes lexemes as an input an generate tokens out of it.(Lexemes are the basic units of meaning in a language in isolation. ). (tokens are the smallest individual units of a program that the compiler recognizes and processes during lexical analysis, the first phase of compilation. The C compiler breaks down the source code into tokens before proceeding to syntax analysis and further stages. Tokens are the building blocks of a program, and understanding them is fundamental to grasping how C code is interpreted.)

Here is an example of how lexer working:

x = a+b*c;
LexemesTokens
xidentifier
=operator
aidentifier
+operator
bidentifier
*operator
cidentifier
;semicolon

Parser:

The parser takes a stream of tokens from the lexer and checks whether they form a valid C++ program according to the grammar. It do it by creating a tree-like  Intermediate Representation (IR) using the existing grammar.A typical representation is a syntax tree in which each interior node represents an operation and the children of the node represent the arguments of the operation.

Here is an example of a working of parser:

x = a+b*c;

Here is an example of our grammar

statement
    → assignment SEMICOLON;
assignment
    → IDENTIFIER "=" expression;
expression
    → term (("+" | "-") term)*;
term
    → factor (("*" | "/") factor)*;
factor
    → IDENTIFIER;

Here is our tree:

Semantic analysis:

The semantic analyzer uses the syntax tree and the information in the symbol table to check the source program for semantic consistency with the language definition. It also gathers type information and saves it in either the syntax tree or the symbol table, for subsequent use during intermediate-code generation.

What are the responsibilities of semantic analysis?

Symbol Table Construction: The symbol table is a data structure that stores information about identifiers.

Here is what each table contain:

  • Name
  • Kind (variable, function)
  • Type (int, float)
  • Scope (global, local)
  • Storage info (offset, size)

For example:

int x;

Here is how it would look in the table:

x → { Name:x,Kind: x, type: int, scope: global }

Scope Management: Semantic analysis tracks where identifiers are visible.

int x;

void f() {
    int x;   // different x
}
{
    int x;
    int x;   // redeclaration error
}

Name Resolution: Checks if each identifier is declared before use.

x = 5;

Not declare before use. It will give an error.

Look up x in symbol table

Not found → error

Type Checking:Ensures operations are applied to compatible types.

Here is an example of invalid type:

int a;
int b;
a + "text";

Here is an example of valid:

int a;
int b;
a + b;

Type Propagation: each tree syntax tree element gets a type.

This information is used later by the code generator.

Assignment Compatibility: Checks that left-hand and right-hand sides are compatible.

float f;
int x;
f = x;     // allowed (implicit conversion)
x = f;     // may be warning or error

Function Semantics:For function calls, semantic analysis checks:

  • Function exists
  • Correct number of arguments
  • Correct argument types
  • Correct return type usage
int add(int a, int b);

add(1);        // ❌ wrong argument count
add(1, "x");  // ❌ wrong type

Control Flow Semantics:Checks rules such as:

  • break only inside loops
  • return matches function return type
  • All paths return a value (if required)
int f() {
    if (x > 0)
        return 1;
}  // ❌ missing return

Constant Expressions & Lvalues:Semantic analyzer also checks:

What is an lvalue (can appear on left of =)

Whether expressions are constant expressions

❌ Invalid:

(a + b) = 5;

Optimization

The machine-independent code-optimization phase attempts to improve the intermediate code so that better target code will result. Usually better means faster, but other objectives may be desired, such as shorter code, or target code that consumes less power.

Types of optimisations:

Machine-Independent Optimizations:

Constant Folding

Evaluate constants at compile time.

int x = 3 * 4;
int x = 12;

Constant Propagation

Replace variables with known constant values.

int a = 5;
int b = a + 3;
int b = 8;

Dead Code Elimination

Remove code that never affects output.

int x = 5;
x = 6;   // x=5 is dead

Common Subexpression Elimination (CSE)

Avoid recomputing the same expression.

a = b * c;
d = b * c;
t = b * c;
a = t;
d = t;

Copy Propagation

Replace copies with original values.

a = b;
c = a;
c = b;

2. Control Flow Optimizations

🔹 Unreachable Code Elimination

return;
x = 5;   // unreachable

🔹 Loop Optimizations

Loop-Invariant Code Motion

for (i=0; i<n; i++)
    x = a * b;   // invariant

t = a * b;
for (i=0; i<n; i++)
    x = t;

Strength Reduction

x = i * 2;

x = i << 1;

3. Data Flow Analysis (Foundation of Optimization)

The compiler computes:

  • Reaching definitions
  • Live variables
  • Available expressions

Example:

  • If a variable is not live, its assignment can be removed.

4. Machine-Dependent Optimizations

(Target CPU aware)

🔹 Register Allocation

  • Assign frequently used variables to registers
  • Reduces memory access

🔹 Instruction Scheduling

  • Reorder instructions to avoid pipeline stalls

5. Inlining

Replace function calls with function bodies.

int add(int a, int b) { return a+b; }

x = add(1, 2);

x = 1 + 2;

Here is an example code optimization:

x=a+b*c
t1 = b * cx = a + t1

 Codegen

The code generator takes as input an intermediate representation of the source

program and maps it into the target language. If the target language is machine

code, registers or memory locations are selected for each of the variables used bythe program. Then, the intermediate instructions are translated into sequences

of machine instructions that perform the same task. A crucial aspect of code

generation is the judicious assignment of registers to hold variables.

t1 = b * c
t2 = a + t1
X  = t2
mov eax, [b]
imul eax, [c]
add eax, [a]
mov [X], eax

What the Assembler Does

The assembler takes assembly instructions and:

  1. Translates them into machine instructions (binary opcodes)
  2. Assigns addresses to instructions and data
  3. Creates object files
  4. Produces symbol tables
  5. Records relocation information