CSE4251/CSE5251: Assignment 3

Write a lexer in JavaCC for MiniJava. By default the lexer should report the numbers of errors (including 0) found in the file.

Due: Friday, 30 January 2009

Do "Lexical Analysis", page 34. You must make a executable jar file named scan.jar. This program performs lexical analysis of MiniJava programs. Your scanner must not terminatate abnormally or raise an exception regardless of the input.

Your scanner must detect input that cannot be part of any legal token. Your scanner must issue error messages containing the file name, the line number, and the character position.

./TEST/lexical_errors/Error02.java:001.011: ERROR -- illegal character &
./TEST/lexical_errors/Error02.java:001.014: ERROR -- illegal character ^
./TEST/lexical_errors/Error02.java:002.002: ERROR -- illegal character \
filename=./TEST/lexical_errors/Error02.java, errors=3
There are no other kinds of lexical errors. There are no strings, so there are no unterminated strings. An unterminated comment /* ..., is just an illegal slash character. All characters are legal in a comment.

We will use JavaCC.

MiniJava Language

The definition of the language Mini-Java can be found at the back of the text. A summary of the grammar can be found on the WWW.

In Java comments do not nest. You do not have to implemented nested comments in MiniJava (despite what Appel says on page 484). You may use nested comment if you wish, as that is an interesting exercise.

White Space

The definition of white space is found in 3.6 White Space We will take white space to mean any one or more of the following characters.
  Java  octal dec  hex  name
HT  \t   011   9  0x09  horizontal tab
LF  \n   012  10  0x0A  line feed
FF  \f   014  12  0x0C  form feed
CR  \r   015  13  0x0D  carriage return
SP       040  32  0x20  space

Of particular interest is the definition of a newline, because, without it, we cannot agree on line numbers for errors. A line is determined by the underlying operating system. This is so ugly we might be tempted to write a pre-scanner that takes care of this. Instead we will define a newline to be the line-feed character (the way Unix does). Thus, the number of lines in a program is the number of line-feed characters (if the program ends in a line feed). Since carriage-return characters are ignored, DOS/Windows files will work in the expected fashion.

Be careful with the regular expression '.' --- it may not match \n or \r.

Compiling and Testing the Scanner

Since your tools may be in a different place than mine, you are requested to have a JAVACC variable in your makefile. I will test your scanner like this:

make compiler JAVA=... JAVAC=... JAVACC=...     ;; produces scan.jar
java -jar scan.jar test01.mj
java -jar scan.jar test02.mj
java -jar scan.jar test03.mj
Notice the use of the target "compiler" in the makefile.

If test01.mj contains no errors, then "errors=0" must appear in the one and only line of output to standard output. If test01.mj contains errors, then "errors=x" must appear in the one and only line of output to standard output, where x is a positive number.

Makefile

JAVAC := java
JAVAC := javac
JAVACC:= ../bin/javacc
JFLAGS:=

default	:	compiler

compiler :      scan.jar

scan.jar	:	...
	-@/bin/rm mc.mf
	echo "Main-Class: main.Scan" > mc.mf
	jar cvmf mc.mf $@ error/*.class parser/*.class main/*.class
	-@/bin/rm mc.mf

error/ErrorMsg.class	:  error/ErrorMsg.java
	$(JAVAC) $<

clean	:
	-/bin/rm *~ */*~
	-/bin/rm */*.class
Makefiles must have tabs, and must end in \n not \n\r!

JavaCC

  1. Where can I find JavaCC? Go to JavaCC Home Page and download the binaries for version 4.0.
  2. What does a scanner specification look like in JavaCC?
    PARSER_BEGIN(ExampleScanner)
    package parser;
    public class ExampleScanner {}
    PARSER_END(ExampleScanner)
    
    TOKEN : {
         < IF:      "if" >
       | < ELSE:    "else" >
       | < GT:      ">" >
       | < INVALID: ~[" ", "\t", "\n", "\b", "\f", "\r"] >
    
    }
    
    SKIP : {
         <"//" (~["\n", "\r"])* ("\n" | "\r" | "\r\n")>
       | " "
    }
    
  3. How do I run JavaCC? Use the sh or batch script in the bin directory.
  4. What happens when I run JavaCC? Seven Java files are created from the .jj file. Three of these files actually depend on the your intput and must be recreated everytime the .jj input is changed.
  5. Do you have a sample makefile? Note the scanner files are in a package/directory called "parser".
    JACC_SCANNER_FILE_NAME  := scanner
    JACC_SCANNER_CLASS_NAME := MiniJavaScanner
    JACC_GENERATED_SRC :=\
       $(JACC_SCANNER_CLASS_NAME).java\
       $(JACC_SCANNER_CLASS_NAME)Constants.java\
       $(JACC_SCANNER_CLASS_NAME)TokenManager.java\
       Token.java\
       TokenMgrError.java\
       SimpleCharStream.java\
       ParseException.java
    
    jacc_src := $(addprefix parser/, $(JACC_GENERATED_SRC))
    
    parser/Scan.class	:  $(jacc_src)
    
    # Only these three files actually depend on the particular scanner,
    # so only they need to be generated when $(JACC_SCANNER_FILE_NAME).jj is changed.
    parser/$(JACC_SCANNER_CLASS_NAME).java :  parser/$(JACC_SCANNER_FILE_NAME).jj
    parser/$(JACC_SCANNER_CLASS_NAME)Constants.java :  parser/$(JACC_SCANNER_FILE_NAME).jj
    parser/$(JACC_SCANNER_CLASS_NAME)TokenManager.java :  parser/$(JACC_SCANNER_FILE_NAME).jj
    
    $(jacc_src)		:  parser/$(JACC_SCANNER_FILE_NAME).jj
    	$(JAVACC) -OUTPUT_DIRECTORY=parser $^
    
  6. Do I need any productions (parsing rules) to get the scanner to work? No, you may omit all productions. Productions are not needed for the scanner, but are needed, naturally, for the parser. The function getNextToken will be generated.
  7. How do you call the scanner from the main program?
      final MiniJavaScanner lexer = new MiniJavaScanner (new FileInputStream (filename));
      if (trace) lexer.enable_tracing(); else lexer.disable_tracing(); // for parsing only
      final Token t = lexer.getNextToken ();
    
  8. Do I need anything from JavaCC when I run my scanner/parser as opposed to when I create it? No, with the seven Java files, you no longer need any class files from JavaCC.

Example Output

For one of Appel's sample programs at Cambridge WWW site.

Normal

Just print the file and the number of errors to the standard output.
> java -jar scan.jar ../../minijava/appel/Factorial.java      
filename=../../minijava/appel/Factorial.java, errors=0
Your program must print exactly one line to the standard output. (Error messages get printed to the standard error stream.)

Verbose

java -Dverbose -jar scan.jar Factorial.java
../../minijava/appel/Factorial.java:001.001: CLASS "class"
../../minijava/appel/Factorial.java:001.007: IDENTIFIER "Factorial"
../../minijava/appel/Factorial.java:001.016: LeftBrace "{"
../../minijava/appel/Factorial.java:002.005: PUBLIC "public"
../../minijava/appel/Factorial.java:002.012: STATIC "static"
       * * *
../../minijava/appel/Factorial.java:014.027: MULT "*"
../../minijava/appel/Factorial.java:014.029: LeftParen "("
../../minijava/appel/Factorial.java:014.030: THIS "this"
../../minijava/appel/Factorial.java:014.034: DOT "."
../../minijava/appel/Factorial.java:014.035: IDENTIFIER "ComputeFac"
../../minijava/appel/Factorial.java:014.045: LeftParen "("
../../minijava/appel/Factorial.java:014.046: IDENTIFIER "num"
../../minijava/appel/Factorial.java:014.049: MINUS "-"
../../minijava/appel/Factorial.java:014.050: INTEGER_LITERAL "1"
../../minijava/appel/Factorial.java:014.051: RightParen ")"
../../minijava/appel/Factorial.java:014.052: RightParen ")"
../../minijava/appel/Factorial.java:014.054: SEMICOLON ";"
../../minijava/appel/Factorial.java:015.009: IDENTIFIER "return"
../../minijava/appel/Factorial.java:015.016: IDENTIFIER "num_aux"
../../minijava/appel/Factorial.java:015.024: SEMICOLON ";"
../../minijava/appel/Factorial.java:016.005: RightBrace "}"
../../minijava/appel/Factorial.java:018.001: RightBrace "}"
../../minijava/appel/Factorial.java:018.002: EOF
filename=../../minijava/appel/Factorial.java, errors=0

Token Names

Javacc uses the tokens names provided in the .jj to generate the scanner. But to Java the tokens are just numbers. This is inconvenient in debugging. It would be nice to recover the user's token names as in the verbose output above. (Of course, different programmers may choose different names.) This is easy with reflection since javacc uses the token names as Java variables names. Study the following class:
import java.util.TreeMap;
import java.lang.reflect.Field;

public class FieldNames extends TreeMap  {

   public FieldNames (String class_name) {
      try {
	 final Class c = Class.forName (class_name);
	 final Field [] f = c.getDeclaredFields();
	 for (int i=0; i<f.length; i++) {
	    if (f[i].getType()==int.class) put (f[i].getInt(null), f[i].getName());
	 }
      } catch (Exception e) {
	 e.printStackTrace (System.err);
      }
   }
}

Turning it in

Please turn in one Java jar file asgn3.jar using the submission server.

For example, I expect you to submit something approximately like the following list of files:

  1. Makefile -- must not be in a subdirectory
  2. parser/scanner.jj -- should put all code in subdirectory/packages
  3. main/Scan.java -- hypothetical main program
  4. global/Files.java -- hypothetical support files
  5. scan.mf -- manifest file for executable jar
  6. compile -- script 'java -jar scan.jar $1'
Turn in all source files that you need to make your project in the single jar file. No Java *.class files are permitted. Notice that the subdirectory structure must be preserved in order for Java packages to work correctly.

The commmands I will run are like this:

jar -xf asgn3.jar    # unpack submission
make compiler JAVAC=javac JAVACC=javacc  # build compiler from source
compile directory/testcase1.java         # run compiler on a test case

The submit project tag for this assignment is asgn3. For your convenience, here is a submission form for this assignment.

File 1
Control code:
Course=cse5251
Project=asgn3


Ryan Stansifer <ryan@cs.fit.edu>
Last modified: Tue Mar 31 16:20:52 EDT 2009