Compiler - from Basic to Advanced - Part A
This tutorial introduces the method to build a compiler from scratch.
Objectives
- Define and write the user-defined language
- Refresh Java
- Improve the debugging skill
- Understand the step by step to create your compiler
Part A - language and basic theory
Assume that you write the C/C++ code (text) and use the Java compiler to build it. There are a lot of compile errors.
Similarly, you write Java code and use gcc or g++ compiler to build it. There are errors too.
Each language has the specic syntax.
In this part, we define one simple language then you can extend it to your own language (and hope that it maybe popularly).
Our language starts with the 'begin' and ends with 'end' words.
begin
end
let's add some basic syntax:
expression: var1 + var2
assignment statement: total = var1 + var2
logic expression: var1 < var
loop statement: loop (var1 < var2) total = var1 + var
Example of the correct code:
begin
var1 = 2;
var2 = 4;
total = var1 + var2;
loop (var1 < var2)
loop ( var3 > var4)
var2 = var2 - var1
end
Generally, the language is:
<program> -> begin <statement_list> end
<statement_list> -> <statement> {;<statement_list>}
<statement> -> <assignment_statement> | <loop_statement>
<assignment_statement> -> <variable> = <expression>
<variable> -> identifier
<expression> -> <variable> { (+|-) <variable>}
<loop_statement> -> loop (<logic_expression>) <statement>
<logic_expression> -> <variable> (< | >) <variable>
Where:
identifier is an identifier is a string that begins with a letter followed by 0 or more letters and/or digits
It is easy to match our example with the language:
begin <--> begin
end <--> end
<assignment_statement> <--> var1 = 2
where
<expression> <--> <variable>
<variable> <--> identifier
identifier <--> 2
The above code is used as the input of our compiler.
The compiler will read the input character by character and regconize:
Token: such as "begin", "end", semicolor, loop, greater than comparison, less than comparision.
Note: begin, end, loop are keywords
Token has 2 parts:
- The token type such as IDENTIFIER, DIGIT, EOF, ERROR, LEFT_PAREN...
- Lexeme such as '<', "begin", "var1"...
Note: the above language also accepts the recursive such as
begin
var1 = 2;
var2 = 4;
total = var1 + var2;
loop (var1 < var2)
loop ( var3 > var4)
var2 = var2 - var1
end
where the loop is inside the loop...
The next parts are:
Part B - lexer
Part C - parser
Part D - code generation and automatic tools