CSE 4051: Project #10 -- SSDI Screen Scraper [again]

Due: Friday, 28 March 2008.

Task

Read a list of social security numbers (SSN) from the standard input. Social security numbers have the form DDD-DD-DDDD (including the two hyphens).
java Scrape < list.txt
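The input step might look like the following minimal sketch. It assumes one SSN per line and silently skips lines that do not match DDD-DD-DDDD (the assignment does not say how to treat malformed lines). The class name SsnInput and its method names are illustrative only; the submitted program must be Scrape.java.

```java
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

class SsnInput {
    // DDD-DD-DDDD, including the two hyphens
    static final Pattern SSN = Pattern.compile("\\d{3}-\\d{2}-\\d{4}");

    static boolean isSsn(String line) {
        return SSN.matcher(line.trim()).matches();
    }

    // Collects every well-formed SSN, one per input line.
    // Assumption: malformed lines are skipped rather than reported.
    static List<String> readSsns(Readable in) {
        List<String> ssns = new ArrayList<>();
        Scanner scanner = new Scanner(in);
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            if (isSsn(line)) {
                ssns.add(line.trim());
            }
        }
        return ssns;
    }

    public static void main(String[] args) {
        for (String ssn : readSsns(new InputStreamReader(System.in))) {
            System.out.println(ssn);
        }
    }
}
```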
Fetch the information about each SSN in the list from RootsWeb using CGI:
http://ssdi.rootsweb.com/cgi-bin/ssdi.cgi
(Both post and get methods work.) Use the name ssn for an SSN as in
http://ssdi.rootsweb.com/cgi-bin/ssdi.cgi?ssn=303-20-3777
to have the WWW server produce an HTML file. Try:
curl --silent 'http://ssdi.rootsweb.com/cgi-bin/ssdi.cgi?ssn=303-20-3777'
or
wget --quiet -O- 'http://ssdi.rootsweb.com/cgi-bin/ssdi.cgi?ssn=303-20-3777'
The resulting HTML (like most WWW pages) is not completely correct and is not formatted to be read. A formatted version can be found at the following link: sample SSDI output. Study the source of the WWW page to learn the structure of the HTML produced by the server. (I have commented out the advertisements.)
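From Java, the same GET request can be issued with java.net.URL. The sketch below is one possible approach, not a required design. The class name SsdiFetch is illustrative, and the ISO-8859-1 character set is an assumption about what the server sends.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.URL;
import java.net.URLEncoder;

class SsdiFetch {
    static final String CGI = "http://ssdi.rootsweb.com/cgi-bin/ssdi.cgi";

    // Builds the GET URL using the parameter name "ssn" given in the assignment.
    static String queryUrl(String ssn) {
        try {
            return CGI + "?ssn=" + URLEncoder.encode(ssn, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // every JVM supports UTF-8
        }
    }

    // Downloads the HTML page for one SSN as a single string.
    // Assumption: the page is encoded as ISO-8859-1.
    static String fetch(String ssn) throws IOException {
        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new URL(queryUrl(ssn)).openStream(), "ISO-8859-1"))) {
            for (String line = in.readLine(); line != null; line = in.readLine()) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }
}
```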

“Scrape the screen” using the Java package javax.swing.text.html. (Many third-party XML/HTML parsers are clearer, more efficient, and easier to use.)
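One way to use this package is to subclass HTMLEditorKit.ParserCallback and let ParserDelegator drive it. The sketch below collects the text of every table cell, row by row. It assumes the rows of interest are plain TR elements holding TD or TH cells and makes no attempt to handle nested tables; the class name TableScraper is illustrative.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class TableScraper extends HTMLEditorKit.ParserCallback {
    private final List<List<String>> rows = new ArrayList<>();
    private List<String> row;    // cells of the row being read, or null
    private StringBuilder cell;  // text of the cell being read, or null

    private static boolean isCell(HTML.Tag t) {
        return t == HTML.Tag.TD || t == HTML.Tag.TH;
    }

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.TR) {
            row = new ArrayList<>();
        } else if (isCell(t) && row != null) {
            cell = new StringBuilder();
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (cell != null) {
            cell.append(data);  // text outside a cell is ignored
        }
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if (isCell(t) && cell != null) {
            row.add(cell.toString().trim());
            cell = null;
        } else if (t == HTML.Tag.TR && row != null) {
            rows.add(row);
            row = null;
        }
    }

    // Parses an HTML string and returns the cell text of each table row.
    static List<List<String>> scrape(String html) {
        TableScraper callback = new TableScraper();
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.rows;
    }
}
```

A row with nine cells is then a candidate result row; header rows and layout tables would have to be filtered out after studying the actual page source.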

You will find in the HTML tables nine columns:

1. Name
2. Birth
3. Death
4. Last Residence
5. Last Benefit
6. SSN
7. Issued
8. Tools
9. Order Record?
Your program should print a semicolon-delimited line of output for each social security number. Produce nine slightly different columns:
1. SSN  (from column 6)
2. Last Name (last string in column 1)
3. Name (the entire original column 1)
4. Birth (DD MMM YYYY from column 2)
5. Death (DD MMM YYYY or MMM YYYY from column 3)
6. Note (either ' ' or 'P' or 'V' from column 3)
7. Last Residence (from column 4)
8. Last Benefit (from column 5)
9. Issued (from column 7)
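The mapping from the nine input columns to the nine output columns can be sketched as follows. The code assumes that the note appears in column 3 as a parenthesized letter after the date, such as "(V)" or "(P)"; check this against the real page source, since the exact form is not stated here. The class name OutputLine is illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OutputLine {
    // Assumed form of the note inside column 3, e.g. "Mar 1975 (P)".
    private static final Pattern NOTE = Pattern.compile("\\(([PV])\\)");

    // c holds the nine scraped columns; c[0] is column 1 (Name).
    static String format(String[] c) {
        String name = c[0].trim();
        // Last whitespace-separated string in column 1.
        String lastName = name.substring(name.lastIndexOf(' ') + 1);

        Matcher m = NOTE.matcher(c[2]);
        String note = m.find() ? m.group(1) : " ";
        String death = NOTE.matcher(c[2]).replaceAll("").trim();

        return String.join(";",
                c[5].trim(),   // 1. SSN
                lastName,      // 2. Last Name
                name,          // 3. Name
                c[1].trim(),   // 4. Birth
                death,         // 5. Death
                note,          // 6. Note
                c[3].trim(),   // 7. Last Residence
                c[4].trim(),   // 8. Last Benefit
                c[6].trim());  // 9. Issued
    }
}
```

Columns 8 and 9 of the input (Tools and Order Record?) are simply dropped.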

For a bigger challenge, have the program take two kinds of input lines (not just social security numbers). If the line has the form of a social security number \d\d\d-\d\d-\d\d\d\d, then act as before; otherwise treat the input line as a last name and issue the appropriate query. A query of a single social security number may result in zero or one hits in the database. A name, on the other hand, may result in multiple hits in the database. Gather all the rows in the resulting HTML table. Moreover, some names may have more than twenty entries. If this is the case, you must issue the CGI request multiple times with "start=21", "start=41", "start=61", etc.
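The two kinds of input lines and the paging can be sketched as below. Two things here are assumptions, not facts about the server: the parameter name "lastname" for a name query is a guess (find the real name in the search form), and the loop stops when a page returns fewer than twenty rows. The class name Paging and the fetchPage callback are illustrative.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

class Paging {
    static final String CGI = "http://ssdi.rootsweb.com/cgi-bin/ssdi.cgi";
    static final int PAGE_SIZE = 20;

    static boolean isSsn(String line) {
        return line.matches("\\d\\d\\d-\\d\\d-\\d\\d\\d\\d");
    }

    // Page 0 has no start parameter; page 1 is start=21, page 2 is start=41, ...
    static String queryUrl(String line, int page) {
        String key = isSsn(line) ? "ssn" : "lastname";  // "lastname" is hypothetical
        try {
            String url = CGI + "?" + key + "=" + URLEncoder.encode(line, "UTF-8");
            return page == 0 ? url : url + "&start=" + (page * PAGE_SIZE + 1);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // every JVM supports UTF-8
        }
    }

    // Gathers rows page by page; fetchPage maps a page index to that page's rows.
    // Assumption: a page with fewer than PAGE_SIZE rows is the last one.
    static List<String> collectAll(IntFunction<List<String>> fetchPage) {
        List<String> all = new ArrayList<>();
        for (int page = 0; ; page++) {
            List<String> rows = fetchPage.apply(page);
            all.addAll(rows);
            if (rows.size() < PAGE_SIZE) {
                return all;
            }
        }
    }
}
```

If a name has exactly 20 or 40 hits, this loop issues one extra request that returns no rows and then stops.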

Reference

Turning it in

Turn in the Java source code for the program using the submission server. The project tag for this assignment is proj10. The name of the file you submit must be Scrape.java.

File to be submitted:

Control code:
Course=cse4051
Project=proj10


Ryan Stansifer <ryan@cs.fit.edu>
Last modified: Tue Mar 11 14:27:28 EDT 2008