Codementor Events

Compiler - from Basic to Advanced - Part A

Published Jun 14, 2021
Compiler - from Basic to Advanced - Part A

This tutorial introduces the method to build a compiler from scratch.

Objectives

  1. Define and write the user-defined language
  2. Refresh Java
  3. Improve the debugging skill
  4. Understand the step by step to create your compiler

Part A - language and basic theory
Assume that you write the C/C++ code (text) and use the Java compiler to build it. There are a lot of compile errors.
Similarly, you write Java code and use gcc or g++ compiler to build it. There are errors too.

Each language has the specic syntax.

In this part, we define one simple language then you can extend it to your own language (and hope that it maybe popularly).
Our language starts with the 'begin' and ends with 'end' words.

begin

end

let's add some basic syntax:

expression: var1 + var2
assignment statement: total = var1 + var2
logic expression: var1 < var
loop statement: loop (var1 < var2) total = var1 + var

Example of the correct code:

begin 

var1 = 2;
var2 = 4;
total = var1 + var2; 
loop (var1 < var2) 
     loop ( var3 > var4)
   var2 = var2 - var1
end

Generally, the language is:

<program>  ->  begin <statement_list> end
<statement_list> -> <statement> {;<statement_list>}
<statement>  ->  <assignment_statement> | <loop_statement>
<assignment_statement> -> <variable> = <expression>
<variable> -> identifier
<expression> -> <variable> { (+|-) <variable>}           
<loop_statement> -> loop (<logic_expression>)  <statement>
<logic_expression> -> <variable> (< | >) <variable>

Where:
identifier is an identifier is a string that begins with a letter followed by 0 or more letters and/or digits

It is easy to match our example with the language:

begin <--> begin
end <--> end
<assignment_statement> <--> var1 = 2
where
<expression> <--> <variable>
<variable> <--> identifier
identifier <--> 2

The above code is used as the input of our compiler.
The compiler will read the input character by character and regconize:

Token: such as "begin", "end", semicolor, loop, greater than comparison, less than comparision.
Note: begin, end, loop are keywords

Token has 2 parts:

  • The token type such as IDENTIFIER, DIGIT, EOF, ERROR, LEFT_PAREN...
  • Lexeme such as '<', "begin", "var1"...

Note: the above language also accepts the recursive such as

begin 

var1 = 2;
var2 = 4;
total = var1 + var2; 
loop (var1 < var2) 
     loop ( var3 > var4)
   var2 = var2 - var1
end

where the loop is inside the loop...

The next parts are:
Part B - lexer
Part C - parser
Part D - code generation and automatic tools

Discover and read more posts from Quang Dang Van
get started