Current assignment due Wed. Apr. 25, 2007, midnight.
Due Feb. 5, 2007 at midnight (20 out of 100 homework points). Write a calculator program. The program should prompt the user to enter an expression of the form number operator number such as 5.2 x 3, then print the answer (15.6) and ask if the user wants to enter another problem. If the user answers y (for yes), then prompt for another problem. If the user answers n (for no), then exit. If the user enters any other letter, then repeat the question. Operators may be + - x or /. Numbers may have decimal points and/or a leading - sign. You may assume the user enters a space between the numbers and operator. For example (bold is user input):
Enter a problem: 3 + 10.5 13.5 Another problem (y/n)? y Enter a problem: -2 / 3 -0.66666 Another problem (y/n)? z Another problem (y/n)? a Another problem (y/n)? y Enter a problem: -2 - 3.1 -5.1 Another problem (y/n)? n
Submit one plain text source code (.cpp) file as an attached file by email to Matt Mahoney at mmahoney@cs.fit.edu or matmahoney@yahoo.com Do not send executables, project files, etc.
I will test your program with g++. If you use Visual Studio, be sure when you create a new project to select "Win32 console program" and uncheck the box "use precompiled headers". VS will create a new project with some initial code that you should delete. You should not have any code that looks like this:
#include "stdafx.h"; // no, this is not a standard header file int _tmain(int argc, TCHAR* argv) // no, int main() with no parameters
Due Feb. 26, 2007, midnight. This assignment is worth 25 points (5 for the specification and 20 for the code). Your program will read a text file and write a second file where the output file contains a list of words that appear at least once in the input file. The output list should be sorted by increasing length. Among words of equal length, the order should be the same as the order of the first occurrence in the input file. Each output word should appear only once, regardless of the number of occurrences in the input file. A word is defined as a sequence of 1 or more letters (a-z, A-Z) without any other characters (spaces, punctuation, digits, etc). Thus, "can't" is 2 words, CAN and T. Two words are considered the same if they differ only in case. For example, "the", "THE", and "The" are considered 3 instances of the same word, THE. The output list should be all upper case.
The program should prompt the user for the names of the input and output files and deliver appropriate error messages (and exit) if the input file does not exist or the output file cannot be created or overwritten (e.g. it is read-only). There should be no limits on the contents of the input file (e.g. file size, number of words, length of the shortest or longest word, character set, etc). The program must not modify the input file. For example, suppose that the file in.txt contains the following text:
It's a test. This is another ... Test. THIS is -- yet "ANOTHER" test!!!and the program is run as follows (user input in bold):
Input file? in.txt Output file? out.txtThen the program will create (or overwrite) out.txt with the following contents:
S A IT IS YET TEST THIS ANOTHER
You may solve this problem how you like, but I recommend the following algorithm. Convert the input into upper case and parse into words. Then store the uppercase words in a vector<string>. For each word, check if it is already in the vector, and if not, append it. When finished, print all the words of length 1, 2, 3,... up to the length of the longest word (which you will have to find).
Due Mon. Apr. 2, midnight (20 points). In GuessNumber.cpp the computer thinks of a number from 1 through n inclusive (n = 1000) and you try to guess what it is, using as few guesses as possible. In your program, the roles will be reversed. You will think of a number and the computer will try to guess it. After each guess, you will tell the computer either h (too high), l (too low) or c correct.
The program should use the best possible strategy to minimize the number of guesses, e.g. a binary search. It should never need more than 1 + log2 n guesses. For example, if n = 1000, then it should always be correct by the tenth guess. If n is any number 4 through 7 then it should take at most 3 guesses.
The program will start by printing a brief description of the rules of the game and asking for a number n. Then it will guess at most 1 + log2 n numbers, after which you enter the letter h, l, or c and press ENTER. On the last guess, if there is only one possible correct answer, the program will simply print "your number is x" rather than ask for a response. After the program has guessed your number, it will print the number of guesses it took for this game (including the correct guess at the end), the number of games played, the average number of guesses per game, and ask if you want to play again. If you enter y (and ENTER) then the game repeats (with the same value of n). If you enter anything else, then the program exits. For example:
In this game, you think of a number from 1 through n and I will try to guess what it is. After each guess, enter h if my guess is too high, l if too low, or c if correct. Please enter a number n: 100 50? h 25? l 37? l 43? h 40? h 38? l Your number is 39. It took me 7 guesses. I averaged 7 guesses per game for 1 game(s). Play again (y/n)? y 50? l 75? l 88? l 94? c Your number is 94. It took me 4 guesses. I averaged 5.5 guesses per game for 2 game(s). Play again (y/n)? n
To help you with this program, I have already written part of it, the two files guessgame.h and s07hw3.cpp. You will supply one other file (named how you like, e.g. yourcode.cpp) that I will compile and link to this code as follows:
g++ s07hw3.cpp yourcode.cppYour code must contain the member functions for class GuessGame, defined in guessgame.h as follows:
Each member function should have a specification above its code (20% of grade). What does it do? What are the inputs and outputs? The specification may refer to parameters, return values, and data members in addition to user input and output. It should not describe how the code works internally. It should not refer to any local variables.
Comments describing how the code works should appear in the code as needed (not in the specificaiton). Such comments are only needed to describe major sections of the code in larger functions, and to describe the purpose of local variables (or use self descriptive names).
Submit only one file. I already have the other two.
Solution.
Due Wed. Apr. 25, 2007, midnight (35 points).
This program will be a dictionary
based text file compressor. Most English text files never use the characters
with ASCII codes 128 through 255. Therefore, you can replace the 128
most common words with one of these characters. The output will
be shorter because these common words take only one byte of space.
The decoder will need to know how to translate these characters back
to the original words. To do that, you have to store the dictionary in
the compressed file. You can do that as a list of up to 128 words, one per line,
representing the codes for 128, 129, ..., 255 followed by a blank line
to mark the end.
Of course no space is saved by coding a one-character word like "a"
or "I". Also, no space is saved by assigning a code to a word unless
the word appears at least twice. If it appears only once, then any savings is
lost by the extra dictionary entry. The best strategy is to rank
the words by score = (length-1)*(count-1) and assign codes to the
128 highest scoring words that have a score of at least 1.
It is possible the dictionary could be smaller than 128 words.
Your assignment is to write a file compressor (compress.cpp)
that takes 3 arguments
on the command line. The first argument is either c (to compress)
or d (to decompress). The second argument is the input file. The
third is the output file. For example:
A word is any sequence of characters containing A-Z or a-z.
Words are case sensitive. "This" and "this" are two different words.
For this assignment, you may assume that the input is a text file
and the ASCII codes 128-255 never occur. You may also assume that
lines of text end with a carriage-return and linefeed (if in Windows)
including the last line. Your program does not have to work otherwise.
The following compression approach is suggested:
You can test your program on Alice in Wonderland.
After downloading, make sure the size is 152,088 bytes. This is a
Windows text file with all lines ending with a carriage-return and linefeed.
After you compress and decompress it, make sure the file size is the same
and use the Windows FC command to compare it with the original.
You might compare your compressor with others on some benchmarks: Here are some benchmarks using this program
as a preprocessor to other compressors on the 100 MB file enwik8 from the
Large Text Benchmark/Hutter prize.
I called the program s07hw4.exe (instead of compress)
and compressed enwik8 to e8:
s07hw4.exe implements the extra credit (which is necessary to
properly process the Unicode and newline characters in enwik8).
It uses byte 128 as an escape code, and 129-255 as dictionary codes. The
dictionary is ordered by decreasing score. I will publish the
source code (as GPL open source) after the assignment is due.
And here it is. s07hw4.cpp source codeHomework 4
compress c file.txt file.zzz
reads file.txt and creates file.zzz, which should be smaller than
file.txt for most text files (like this web page). The command:
compress d file.zzz file.out
should create file.out, which will be identical to file.txt. In
Windows you can check this with the FC command:
fc file.txt file.out
FC: no differences encountered
For example, if file.txt is:
This is a test.
This is another test.
file.zzz will contain:
This
test
is
(128) (130) a (129).
(128) (130) another (129).
where (128), (129), etc. are actually single bytes, which will look
like strange characters in a text editor.
Extra credit part 1
For 15 points extra credit, make sure your program works on non-text files that
already contain characters 128-255 (like .exe files or Unicode text files),
and where the
carriage-return and linefeed characters have no special meaning. Normally,
C++ in Windows outputs "\n" as "\r\n" and reads "\r\n" as "\n" to maintain
compatibility with UNIX/Linux text file format ("\r" is a carriage return -
ASCII 13, and "\n" is a linefeed - ASCII 10). Binary
mode prevents this extra processing:
ifstream in(filename, ios::in | ios::binary);
ofstream out(filename, ios::out | ios::binary);
Also, you will need to encode characters 128-255 so that they decode properly.
You can "escape" them by using one of these codes (say 128) as an escape
character. Thus, you would code (130) as (128)(130) to signal to the
decompressor not to decode (130) from the dictionary. You would code
(128) as (128)(128). Note that binary files with lots of these characters
would get larger, not smaller.
Maximum Compression benchmark
Large Text Benchmark and Hutter Prize
Calgary Corpus Challenge
The last two benchmarks offer prize money, but also require binary file compression.
This compression scheme does not compress
very well by itself, but has the interesting property that you can further
compress the compressed file with one of these other compressors.
In other words:
compress c file.txt file.zzz
zip file.zzz
will often produce a smaller file than
zip file.txt
Most other compressors do not have this property. Once you compress a
file with one compressor, you cannot compress it further with that or
any other compressor.
s07hw4 c enwik8 e8
Then I compressed e8 with various compressors in Windows XP on a 2.2 GHz
Athlon-64 with 2 GB memory (which some of the better compressors require,
with run times up to 3 hours for paq8l).
Sizes are shown in bytes. Note that compression improves for every
compressor type except those that already do dictioary preprocessing
(xml-wrt and paq8hp10any).
Compressor Options enwik8 e8 Type
---------- ------- ----------- ----------- ----
None 100,000,000 88,162,480 Dict
fpaq 63,391,013 60,359,268 Order 0
compress 4.3d 45,763,941 44,467,462 LZW
thor 0.9a ex 41,670,916 40,948,112 LZP
Info-ZIP 2.3.1 -9 36,445,373 35,543,466 LZ77
gzip 1.3.5 -9 36,445,248 35,543,275 LZ77
bzip2 1.0.2 -9 29,008,736 28,516,066 BWT
quad v1.11 -x 29,110,579 28,044,273 ROLZ
cabarc 1.00.0601 -m lzx:21 28,465,607 27,897,201 LZX
boa 0.58b -m15 24,322,643 24,308,855 PPM
GRZipII 0.2.4 -b8m 23,846,878 23,626,484 LZP+BWT
sbc 0.970r2 -ad -m3 -b63 22,470,539 22,313,333 BWT
hook v0.8e c 1700 3 1 6 22,039,935 21,751,761 LZP+DMC
ppmd J1 -m256 -o10 -r1 21,388,296 21,182,821 PPM
ccm 1.20a 21,350,295 21,139,983 CM
bbb m1000 20,847,290 20,790,469 BWT
ash 04a /m700 /o10 19,963,105 19,941,851 CM
ppmonstr J -m1700 -o16 19,055,092 18,831,805 PPM
xml-wrt 3.0 -l11 -b255 -m255 -f24 18,592,499 19,817,480 Dict+CM
paq8l -8 17,916,450 17,694,859 CM
paq8hp10any -8 16,335,197 16,990,633 Dict+CM
Can you improve these results? Can you write a preprocessor that, when combined with any of the compressors listed above, compresses enwik8 smaller? Extra credit is a follows:
Programs are scored by the difference between your results and mine using your preprocessor and choice of one of the above compressors. To earn extra credit, tell me which of the above compressors you want me to use. You must choose only one compressor. If you want to use a compressor not in the list, then I will test my program against yours, and if yours compresses smaller then you get the extra credit. For example, if your program plus gzip compresses to 35,000,000 bytes, and my program plus gzip compresses to 35,543,275 (as shown in the table) then your score is 543,275 bytes. The highest score gets 25 extra credit points. I will also announce the top 3 winners (unless you prefer to remain anonymous).
Hints: