Compiler - from Basic to Advanced - Part B

Published Jun 14, 2021

This part uses the language defined in Part A to write the lexer.

To extract tokens one by one from source text written in that language, we take the following steps:

  1. Define the character classes and token types:
  public static final int LETTER = 0;  //letter 
  public static final int DIGIT = 1;   //digit
  public static final int UNKNOWN = 2; //unknown, use lookup to find class

  public static final int LEFT_PAREN = 3; //left parenthesis
  public static final int RIGHT_PAREN = 4; //right parenthesis
  public static final int ADD_OP = 5; //add operator
  public static final int SUB_OP = 6; //sub operator
  public static final int ASSIG_OP = 7; //assignment operator

  public static final int EOF = 8; //end of file
  public static final int LESS_OP = 9; //less than comparison operator
  public static final int GREATER_OP = 10; //greater than comparison operator
  public static final int COMMA = 11; //comma character
  public static final int ERROR = -1; //unrecognized character

  public static final int IDENT = 100; //identifier
  2. Open the input file, read it character by character, and return each token (the token type together with its lexeme)
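
As a rough illustration of step 2, the sketch below opens a file and walks through it one character at a time. The class name `SourceReaderSketch`, both method names, and the UTF-8 encoding are assumptions made for this example; they are not taken from the original lexer.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class SourceReaderSketch {
    /** Opens the file and returns its contents, read one character at a time. */
    static String readAllChars(Path file) {
        StringBuilder seen = new StringBuilder();
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            int c;
            while ((c = in.read()) != -1) { // read() returns -1 at end of file
                seen.append((char) c);      // a real lexer would classify c here
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return seen.toString();
    }

    /** Demo helper (hypothetical): writes text to a temporary file, then reads it back. */
    static String roundTrip(String text) {
        try {
            Path tmp = Files.createTempFile("lexer-input", ".txt");
            try {
                Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
                return readAllChars(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a full lexer the loop body would hand each character to the classification step instead of collecting it in a string.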

We need three methods, called in order, to retrieve a token (the parser in Part C will consume these tokens):

  a. read a character and recognize its character class
  b. look up the character (if its class is unknown)
  c. lex method (returns the completed token)

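a) read character
In Java, a java.io.Reader (for example, a BufferedReader) can read one character at a time with read(), which returns -1 at the end of the input; note that java.util.Scanner has no read() method. Our method also determines the character class of what it read: LETTER, DIGIT, or UNKNOWN.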
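
A minimal sketch of method (a), assuming the input is wrapped in a `java.io.Reader`. The class name `CharReaderSketch` and the fields `nextChar` and `charClass` are hypothetical names chosen for this example; the constant values are copied from the list above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class CharReaderSketch {
    static final int LETTER = 0;
    static final int DIGIT = 1;
    static final int UNKNOWN = 2;
    static final int EOF = 8;

    private final Reader in;
    int nextChar;  // last character read, or -1 at end of input
    int charClass; // class of nextChar

    CharReaderSketch(Reader in) {
        this.in = in;
    }

    /** Reads one character, records its class, and returns that class. */
    int getChar() {
        try {
            nextChar = in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (nextChar == -1) {
            charClass = EOF;
        } else if (Character.isLetter(nextChar)) {
            charClass = LETTER;
        } else if (Character.isDigit(nextChar)) {
            charClass = DIGIT;
        } else {
            charClass = UNKNOWN; // operators, punctuation, whitespace
        }
        return charClass;
    }

    public static void main(String[] args) {
        CharReaderSketch r = new CharReaderSketch(new StringReader("a1+"));
        while (r.getChar() != EOF) {
            System.out.println((char) r.nextChar + " has class " + r.charClass);
        }
    }
}
```

Reporting end of input as the EOF class is a choice made for this sketch, so that the caller can test a single field.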

b) lookup method
This method uses a switch statement:
if the character is '(', it is appended to the lexeme and the token type is LEFT_PAREN
if the character is '+', it is appended to the lexeme and the token type is ADD_OP
...
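
The lookup step might be sketched as follows. Only the '(' and '+' cases come from the text; the remaining character-to-token pairs are guesses based on the constant names (for instance, '=' for ASSIG_OP), and the class name `LookupSketch` is hypothetical.

```java
class LookupSketch {
    static final int LEFT_PAREN = 3;
    static final int RIGHT_PAREN = 4;
    static final int ADD_OP = 5;
    static final int SUB_OP = 6;
    static final int ASSIG_OP = 7;
    static final int LESS_OP = 9;
    static final int GREATER_OP = 10;
    static final int COMMA = 11;
    static final int ERROR = -1;

    /** Appends ch to the lexeme and returns the matching token type. */
    static int lookup(char ch, StringBuilder lexeme) {
        lexeme.append(ch);
        switch (ch) {
            case '(': return LEFT_PAREN;
            case ')': return RIGHT_PAREN;
            case '+': return ADD_OP;
            case '-': return SUB_OP;
            case '=': return ASSIG_OP;   // assumed spelling of the assignment operator
            case '<': return LESS_OP;
            case '>': return GREATER_OP;
            case ',': return COMMA;
            default:  return ERROR;      // character is not part of the language
        }
    }

    public static void main(String[] args) {
        StringBuilder lexeme = new StringBuilder();
        int type = lookup('(', lexeme);
        System.out.println(type + " " + lexeme); // prints "3 ("
    }
}
```
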
c) lex method
This method calls methods (a) and (b) to determine the token type and the lexeme. The parser calls it each time it needs the next token.
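
Putting the three methods together, one possible shape of the lexer is sketched below. It rests on several assumptions that the text does not confirm: an identifier is a letter followed by letters or digits, whitespace separates tokens and is skipped, and a run of digits is reported as ERROR because the constant list defines no number token. The class name `LexerSketch` and the operator characters other than '(' and '+' are likewise hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class LexerSketch {
    static final int LETTER = 0, DIGIT = 1, UNKNOWN = 2;
    static final int LEFT_PAREN = 3, RIGHT_PAREN = 4, ADD_OP = 5, SUB_OP = 6, ASSIG_OP = 7;
    static final int EOF = 8, LESS_OP = 9, GREATER_OP = 10, COMMA = 11;
    static final int ERROR = -1, IDENT = 100;

    private final Reader in;
    private int nextChar;   // current character, or -1 at end of input
    private int charClass;  // class of nextChar
    private final StringBuilder lexeme = new StringBuilder();

    LexerSketch(String source) {
        in = new StringReader(source);
        getChar(); // prime the first character
    }

    String lexeme() { return lexeme.toString(); }

    // (a) read one character and classify it
    private void getChar() {
        try {
            nextChar = in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (nextChar == -1) charClass = EOF;
        else if (Character.isLetter(nextChar)) charClass = LETTER;
        else if (Character.isDigit(nextChar)) charClass = DIGIT;
        else charClass = UNKNOWN;
    }

    // (b) map a single unknown-class character to a token type
    private int lookup(char ch) {
        lexeme.append(ch);
        switch (ch) {
            case '(': return LEFT_PAREN;
            case ')': return RIGHT_PAREN;
            case '+': return ADD_OP;
            case '-': return SUB_OP;
            case '=': return ASSIG_OP;
            case '<': return LESS_OP;
            case '>': return GREATER_OP;
            case ',': return COMMA;
            default:  return ERROR;
        }
    }

    // (c) return the next token type; the matching text is in lexeme()
    int lex() {
        lexeme.setLength(0);
        while (charClass == UNKNOWN && Character.isWhitespace(nextChar)) getChar();
        switch (charClass) {
            case LETTER: // assumed rule: letter (letter | digit)*
                do {
                    lexeme.append((char) nextChar);
                    getChar();
                } while (charClass == LETTER || charClass == DIGIT);
                return IDENT;
            case DIGIT: // no number token is defined, so report the whole run as ERROR
                do {
                    lexeme.append((char) nextChar);
                    getChar();
                } while (charClass == DIGIT);
                return ERROR;
            case UNKNOWN: {
                int type = lookup((char) nextChar);
                getChar();
                return type;
            }
            default: // EOF
                lexeme.append("EOF");
                return EOF;
        }
    }

    public static void main(String[] args) {
        LexerSketch lexer = new LexerSketch("(sum + x1)");
        int type;
        do {
            type = lexer.lex();
            System.out.println(type + " " + lexer.lexeme());
        } while (type != EOF);
    }
}
```

A parser would call `lex()` in a similar loop, asking for one token at a time until it receives EOF.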

Summary:
So far, we have covered the following:

  • language and syntax
  • lexeme
  • token
  • lexer

The next parts are:
Part C - parser
Part D - code generation and automatic tools

Discover and read more posts from Quang Dang Van