CSE1502 Homework - Spring 2007

Current assignment due Wed. Apr. 25, 2007, midnight.

Homework 1

Due Feb. 5, 2007 at midnight (20 out of 100 homework points). Write a calculator program. The program should prompt the user to enter an expression of the form number operator number such as 5.2 x 3, then print the answer (15.6) and ask if the user wants to enter another problem. If the user answers y (for yes), then prompt for another problem. If the user answers n (for no), then exit. If the user enters any other letter, then repeat the question. Operators may be + - x or /. Numbers may have decimal points and/or a leading - sign. You may assume the user enters a space between the numbers and operator. For example (bold is user input):

  Enter a problem: 3 + 10.5
  13.5
  Another problem (y/n)? y
  Enter a problem: -2 / 3
  -0.66666
  Another problem (y/n)? z
  Another problem (y/n)? a
  Another problem (y/n)? y
  Enter a problem: -2 - 3.1
  -5.1
  Another problem (y/n)? n

Submit one plain text source code (.cpp) file as an attached file by email to Matt Mahoney at mmahoney@cs.fit.edu or matmahoney@yahoo.com Do not send executables, project files, etc.

I will test your program with g++. If you use Visual Studio, be sure when you create a new project to select "Win32 console program" and uncheck the box "use precompiled headers". VS will create a new project with some initial code that you should delete. You should not have any code that looks like this:

  #include "stdafx.h";  // no, this is not a standard header file
  int _tmain(int argc, TCHAR* argv)   // no, int main() with no parameters

Homework 2

Due Feb. 26, 2007, midnight. This assignment is worth 25 points (5 for the specification and 20 for the code). Your program will read a text file and write a second file where the output file contains a list of words that appear at least once in the input file. The output list should be sorted by increasing length. Among words of equal length, the order should be the same as the order of the first occurrence in the input file. Each output word should appear only once, regardless of the number of occurrences in the input file. A word is defined as a sequence of 1 or more letters (a-z, A-Z) without any other characters (spaces, punctuation, digits, etc). Thus, "can't" is 2 words, CAN and T. Two words are considered the same if they differ only in case. For example, "the", "THE", and "The" are considered 3 instances of the same word, THE. The output list should be all upper case.

The program should prompt the user for the names of the input and output files and deliver appropriate error messages (and exit) if the input file does not exist or the output file cannot be created or overwritten (e.g. it is read-only). There should be no limits on the contents of the input file (e.g. file size, number of words, length of the shortest or longest word, character set, etc). The program must not modify the input file. For example, suppose that the file in.txt contains the following text:

  It's a test.
  This is another   ...   Test.
  THIS is -- yet "ANOTHER" test!!!

and the program is run as follows (user input in bold):

  Input file? in.txt
  Output file? out.txt

Then the program will create (or overwrite) out.txt with the following contents:

  S
  A
  IT
  IS
  YET
  TEST
  THIS
  ANOTHER

You may solve this problem how you like, but I recommend the following algorithm. Convert the input into upper case and parse into words. Then store the uppercase words in a vector<string>. For each word, check if it is already in the vector, and if not, append it. When finished, print all the words of length 1, 2, 3,... up to the length of the longest word (which you will have to find).

Solution

Homework 3

Due Mon. Apr. 2, midnight (20 points). In GuessNumber.cpp the computer thinks of a number from 1 through n inclusive (n = 1000) and you try to guess what it is, using as few guesses as possible. In your program, the roles will be reversed. You will think of a number and the computer will try to guess it. After each guess, you will tell the computer either h (too high), l (too low) or c correct.

The program should use the best possible strategy to minimize the number of guesses, e.g. a binary search. It should never need more than 1 + log₂ n guesses. For example, if n = 1000, then it should always be correct by the tenth guess. If n is any number 4 through 7 then it should take at most 3 guesses.

The program will start by printing a brief description of the rules of the game and asking for a number n. Then it will guess at most 1 + log₂ n numbers, after which you enter the letter h, l, or c and press ENTER. On the last guess, if there is only one possible correct answer, the program will simply print "your number is x" rather than ask for a response. After the program has guessed your number, it will print the number of guesses it took for this game (including the correct guess at the end), the number of games played, the average number of guesses per game, and ask if you want to play again. If you enter y (and ENTER) then the game repeats (with the same value of n). If you enter anything else, then the program exits. For example:

In this game, you think of a number from 1 through n and I will try
to guess what it is.  After each guess, enter h if my guess is too high,
l if too low, or c if correct.

Please enter a number n: 100
50? h
25? l
37? l
43? h
40? h
38? l
Your number is 39.
It took me 7 guesses.
I averaged 7 guesses per game for 1 game(s).
Play again (y/n)? y
50? l
75? l
88? l
94? c
Your number is 94.
It took me 4 guesses.
I averaged 5.5 guesses per game for 2 game(s).
Play again (y/n)? n

To help you with this program, I have already written part of it, the two files guessgame.h and s07hw3.cpp. You will supply one other file (named how you like, e.g. yourcode.cpp) that I will compile and link to this code as follows:

  g++ s07hw3.cpp yourcode.cpp

Your code must contain the member functions for class GuessGame, defined in guessgame.h as follows:

The constructor should initialize all data members. It should print a brief description of the rules and prompt for and read n.
playGame() should play a game, update gamesPlayed and totalGuesses and print the number of guesses for this game, number of games, and average number of guesses per game.
playAgain() should prompt the user to play again and return true if the user enters y or false otherwise.

Each member function should have a specification above its code (20% of grade). What does it do? What are the inputs and outputs? The specification may refer to parameters, return values, and data members in addition to user input and output. It should not describe how the code works internally. It should not refer to any local variables.

Comments describing how the code works should appear in the code as needed (not in the specificaiton). Such comments are only needed to describe major sections of the code in larger functions, and to describe the purpose of local variables (or use self descriptive names).

Submit only one file. I already have the other two.

Solution.

Homework 4

Due Wed. Apr. 25, 2007, midnight (35 points). This program will be a dictionary based text file compressor. Most English text files never use the characters with ASCII codes 128 through 255. Therefore, you can replace the 128 most common words with one of these characters. The output will be shorter because these common words take only one byte of space.

The decoder will need to know how to translate these characters back to the original words. To do that, you have to store the dictionary in the compressed file. You can do that as a list of up to 128 words, one per line, representing the codes for 128, 129, ..., 255 followed by a blank line to mark the end.

Of course no space is saved by coding a one-character word like "a" or "I". Also, no space is saved by assigning a code to a word unless the word appears at least twice. If it appears only once, then any savings is lost by the extra dictionary entry. The best strategy is to rank the words by score = (length-1)*(count-1) and assign codes to the 128 highest scoring words that have a score of at least 1. It is possible the dictionary could be smaller than 128 words.

Your assignment is to write a file compressor (compress.cpp) that takes 3 arguments on the command line. The first argument is either c (to compress) or d (to decompress). The second argument is the input file. The third is the output file. For example:

  compress c file.txt file.zzz

reads file.txt and creates file.zzz, which should be smaller than file.txt for most text files (like this web page). The command:

  compress d file.zzz file.out

should create file.out, which will be identical to file.txt. In Windows you can check this with the FC command:

  fc file.txt file.out
  FC: no differences encountered

For example, if file.txt is:

  This is a test.
  This is another test.

file.zzz will contain:

  This
  test
  is

  (128) (130) a (129).
  (128) (130) another (129).

where (128), (129), etc. are actually single bytes, which will look like strange characters in a text editor.

A word is any sequence of characters containing A-Z or a-z. Words are case sensitive. "This" and "this" are two different words.

For this assignment, you may assume that the input is a text file and the ASCII codes 128-255 never occur. You may also assume that lines of text end with a carriage-return and linefeed (if in Windows) including the last line. Your program does not have to work otherwise. The following compression approach is suggested:

Read the input file twice. In the first pass, store a list of word counts in a map<string, int>.
Rank the words by score and store in a vector<string>. Make 128 passes through the map to find the highest scoring word, push it on the vector and set the count to 0. Stop when the vector is size 128 or no more words have a positive score. (You could also use another map (mapping words to their codes) instead of a vector).
Print the vector and a blank line to the output.
Rewind (or close and reopen) the input file.
Output any nonletters directly.
For each word, search the vector. If found, output the corresponding code. If not found, output the word.
To decompress, read the dictionary (up to the blank line) into a vector<string>, and replace any characers from the rest of the file in the range 128-255 with the corresponding word.

You can test your program on Alice in Wonderland. After downloading, make sure the size is 152,088 bytes. This is a Windows text file with all lines ending with a carriage-return and linefeed. After you compress and decompress it, make sure the file size is the same and use the Windows FC command to compare it with the original.

Extra credit part 1

For 15 points extra credit, make sure your program works on non-text files that already contain characters 128-255 (like .exe files or Unicode text files), and where the carriage-return and linefeed characters have no special meaning. Normally, C++ in Windows outputs "\n" as "\r\n" and reads "\r\n" as "\n" to maintain compatibility with UNIX/Linux text file format ("\r" is a carriage return - ASCII 13, and "\n" is a linefeed - ASCII 10). Binary mode prevents this extra processing:

  ifstream in(filename, ios::in | ios::binary);
  ofstream out(filename, ios::out | ios::binary);

Also, you will need to encode characters 128-255 so that they decode properly. You can "escape" them by using one of these codes (say 128) as an escape character. Thus, you would code (130) as (128)(130) to signal to the decompressor not to decode (130) from the dictionary. You would code (128) as (128)(128). Note that binary files with lots of these characters would get larger, not smaller.

You might compare your compressor with others on some benchmarks:
Maximum Compression benchmark
Large Text Benchmark and Hutter Prize
Calgary Corpus Challenge
The last two benchmarks offer prize money, but also require binary file compression. This compression scheme does not compress very well by itself, but has the interesting property that you can further compress the compressed file with one of these other compressors. In other words:

  compress c file.txt file.zzz
  zip file.zzz

will often produce a smaller file than

  zip file.txt

Most other compressors do not have this property. Once you compress a file with one compressor, you cannot compress it further with that or any other compressor.

Here are some benchmarks using this program as a preprocessor to other compressors on the 100 MB file enwik8 from the Large Text Benchmark/Hutter prize. I called the program s07hw4.exe (instead of compress) and compressed enwik8 to e8:

  s07hw4 c enwik8 e8

Then I compressed e8 with various compressors in Windows XP on a 2.2 GHz Athlon-64 with 2 GB memory (which some of the better compressors require, with run times up to 3 hours for paq8l). Sizes are shown in bytes. Note that compression improves for every compressor type except those that already do dictioary preprocessing (xml-wrt and paq8hp10any).

Compressor        Options           enwik8         e8       Type
----------        -------        -----------  -----------   ----
None                             100,000,000   88,162,480   Dict
fpaq                              63,391,013   60,359,268   Order 0
compress 4.3d                     45,763,941   44,467,462   LZW
thor 0.9a         ex              41,670,916   40,948,112   LZP
Info-ZIP 2.3.1    -9              36,445,373   35,543,466   LZ77
gzip 1.3.5        -9              36,445,248   35,543,275   LZ77
bzip2 1.0.2       -9              29,008,736   28,516,066   BWT
quad v1.11        -x              29,110,579   28,044,273   ROLZ
cabarc 1.00.0601  -m lzx:21       28,465,607   27,897,201   LZX
boa 0.58b         -m15            24,322,643   24,308,855   PPM
GRZipII 0.2.4     -b8m            23,846,878   23,626,484   LZP+BWT
sbc 0.970r2       -ad -m3 -b63    22,470,539   22,313,333   BWT
hook v0.8e        c 1700 3 1 6    22,039,935   21,751,761   LZP+DMC
ppmd J1           -m256 -o10 -r1  21,388,296   21,182,821   PPM
ccm 1.20a                         21,350,295   21,139,983   CM
bbb               m1000           20,847,290   20,790,469   BWT
ash 04a           /m700 /o10      19,963,105   19,941,851   CM
ppmonstr J        -m1700 -o16     19,055,092   18,831,805   PPM
xml-wrt 3.0 -l11 -b255 -m255 -f24 18,592,499   19,817,480   Dict+CM
paq8l            -8               17,916,450   17,694,859   CM
paq8hp10any      -8               16,335,197   16,990,633   Dict+CM

s07hw4.exe implements the extra credit (which is necessary to properly process the Unicode and newline characters in enwik8). It uses byte 128 as an escape code, and 129-255 as dictionary codes. The dictionary is ordered by decreasing score. I will publish the source code (as GPL open source) after the assignment is due.

And here it is. s07hw4.cpp source code (GPL).

Extra credit part 2

Can you improve these results? Can you write a preprocessor that, when combined with any of the compressors listed above, compresses enwik8 smaller? Extra credit is a follows:

First place, 25 points
Second place, 15 points
Third place and any other improvement of at least one byte, 10 points

Programs are scored by the difference between your results and mine using your preprocessor and choice of one of the above compressors. To earn extra credit, tell me which of the above compressors you want me to use. You must choose only one compressor. If you want to use a compressor not in the list, then I will test my program against yours, and if yours compresses smaller then you get the extra credit. For example, if your program plus gzip compresses to 35,000,000 bytes, and my program plus gzip compresses to 35,543,275 (as shown in the table) then your score is 543,275 bytes. The highest score gets 25 extra credit points. I will also announce the top 3 winners (unless you prefer to remain anonymous).

Hints:

You need to do extra credit part 1 (binary files and escape codes) or your program will not work on enwik8.
ASCII codes 0-8, 11-12, and 14-31 rarely appear in text files. You could use these to code words in addition to 128-255.
You could code words that are embedded in longer words. For example, if the dictionary codes "the" as (150) but has no code for "there", you could code "there" as "(150)re". My program does not do this.
You could code only lowercase words and reserve a code to mean that the next word should be capitalized. My program codes "the" and "The" as two different words.
You might only code words that are preceded by a space character, then remove the space.
Or you might also code words that are not preceded by a space, and reserve one symbol to tell the decompressor not to insert a space.
You could have a much larger dictionary by using 2 byte codes. This is what xml-wrt and paq8hp10any do.