Below are links to papers and source code for four experimental network anomaly intrusion detection systems, PHAD, ALAD, LERAD, and NETAD, and software to evaluate them on the 1999 Lincoln Laboratory DARPA Intrusion Detection System Evaluation data set. The software was developed by Matt Mahoney as dissertation research under Philip K. Chan at Florida Tech in 2000-2003. Dissertation Slides.
The EVAL program is generally useful for testing intrusion detection algorithms on the 1999 data. The other programs are made available for reference.
You may use, copy, modify, and redistribute this code under the terms of the GNU General Public License.
All of our programs were evaluated on the 1999 Lincoln Laboratory DARPA Intrusion Detection System Evaluation data set. To repeat our tests, you need to download the inside sniffer traffic (inside.tcpdump.gz) from week 3 (7 files, Mon.-Fri. and extra Mon.-Tues.), week 4 (4 files, Mon., Wed.-Fri.), and week 5 (5 files, Mon.-Fri.). These are big files, about 200 MB each when compressed. We uncompressed them with gzip and renamed them as follows:
    Week 3 (training)  in31 in32 in33 in34 in35 in36 in37
    Week 4 (test)      in41 in43 in44 in45
    Week 5 (test)      in51 in52 in53 in54 in55
All programs produce a list of alarms in the format used by the original 1999 evaluation, which has the form
    iiiiiiii mm/dd/yyyy hh:mm:ss aaa.aaa.aaa.aaa s.ssssss #comments

where iiiiiiii is an 8 digit identifier (ignored, always 0), mm/dd/yyyy is the date of the alarm, hh:mm:ss is the time in EST (week 4) or EDT (week 5), aaa.aaa.aaa.aaa is the target (destination) IP address with leading zeros in each decimal byte, and s.ssssss is the alarm score, ranging from 0.000000 to 9.999999. For example
0 04/06/1999 08:59:16 172.016.112.194 0.631169 # a comment
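As an illustration of the alarm format, here is a minimal C++ sketch that builds one such line. The helper names pad_ip and format_alarm are hypothetical and are not part of the distributed programs. It uses single spaces between fields, as in the example above, and does not attempt the fixed-column layout that the older EVAL3 and EVAL4 programs expect.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: turns "172.16.112.194" into "172.016.112.194".
// Returns an empty string if the input is not a valid dotted quad.
std::string pad_ip(const std::string& dotted) {
    unsigned a, b, c, d;
    int used = 0;
    if (std::sscanf(dotted.c_str(), "%u.%u.%u.%u%n", &a, &b, &c, &d, &used) != 4 ||
        used != static_cast<int>(dotted.size()) ||
        a > 255 || b > 255 || c > 255 || d > 255)
        return "";
    char buf[64];
    std::snprintf(buf, sizeof buf, "%03u.%03u.%03u.%03u", a, b, c, d);
    return buf;
}

// Hypothetical helper: builds one alarm line. The identifier is always 0,
// and the score is clamped to the documented range 0.000000 to 9.999999.
// Returns an empty string if the target address cannot be parsed.
std::string format_alarm(const std::string& date,    // mm/dd/yyyy
                         const std::string& time,    // hh:mm:ss
                         const std::string& target,  // dotted quad
                         double score,
                         const std::string& comment) {
    const std::string ip = pad_ip(target);
    if (ip.empty()) return "";
    if (score < 0.0) score = 0.0;
    if (score > 9.999999) score = 9.999999;
    char num[32];
    std::snprintf(num, sizeof num, "%.6f", score);
    return "0 " + date + " " + time + " " + ip + " " + num + " # " + comment;
}
```

Calling format_alarm("04/06/1999", "08:59:16", "172.16.112.194", 0.631169, "a comment") reproduces the example line above.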
WARNING. This software is provided as reference code to supplement our papers. These are NOT intrusion detection systems. The programs have ONLY been tested with off-line data from the 1999 LL IDS evaluation (inside sniffer) data. They have ONLY been tested on one PC and/or Sun workstation using one version of g++ and Perl. You will probably have to modify them to work on the same data, and almost certainly if you use them on different data. If you do, you are on your own. However, we do know of a few bugs you will have to deal with.
PHAD (Packet Header Anomaly Detector) detects anomalies in Ethernet, IP, TCP, UDP, and ICMP packet headers. It is described in PHAD: Packet Header Anomaly Detection for Identifying Hostile Network Traffic, by Matthew V. Mahoney and Philip K. Chan, 2001, Florida Tech. technical report CS-2001-4 (PDF, 17 pages).
The source code is phad.cpp. To compile and run:

    g++ phad.cpp -O -o phad
    phad 1123200 in3* in4* in5* >phad.sim

The arguments are the training period in seconds (1123200 is 13 days) and the tcpdump files in chronological order.
ALAD (Application Layer Anomaly Detector) detects anomalies in inbound TCP stream connections to well known server ports. It is described in Learning Nonstationary Models of Normal Network Traffic for Detecting Novel Attacks by Matthew V. Mahoney and Philip K. Chan, Proc. Eighth Intl. Conf. Knowledge Discovery and Data Mining, p376-385, 2002. (C) 2002, ACM (PDF, 10 pages). The code has two components.
    te.cpp, a program to extract TCP streams from tcpdump files
    alad.pl, a Perl script that reads te output and generates alarms (rename the .txt extension to .pl after downloading)

To compile and run:
    g++ te.cpp -O -o te
    te in3* > train
    te in45* > test
    perl alad.pl train test > alad.sim
LERAD (Learning Rules for Anomaly Detection) detects TCP stream anomalies like ALAD but uses a learning algorithm to pick good rules from the training set, rather than a fixed set of rules. It is described in Learning Models of Network Traffic for Detecting Novel Attacks by Matthew V. Mahoney and Philip K. Chan, Florida Institute of Technology Technical Report CS-2002-08 (PDF, 48 pages double spaced). It uses the files train and test from ALAD and two programs:
    a2l.pl converts train and test from ALAD to a LERAD database format.
    lerad.cpp outputs alarms.

To compile and run:

    g++ lerad.cpp -O -o lerad
    perl a2l.pl train > train.txt
    perl a2l.pl test > test.txt
    lerad train.txt test.txt 0 > lerad.sim

The third argument is a random number seed (default 0). LERAD uses a randomized algorithm, so each seed gives a slightly different result. LERAD also lists the rules it generates to the file rules.txt.
In Learning Rules for Anomaly Detection of Hostile Network Traffic, Proc. ICDM 2003 (© 2003, IEEE) (Powerpoint slides) and the longer Technical Report TR-CS-2003-16, LERAD-TCP is just LERAD above, and LERAD-PKT is leradp.cpp. To use:
    bcc32 leradp.cpp
    leradp 0 in3tf in45tf > leradp.sim

where 0 is the random number seed (can be any integer), in3tf and in45tf are filtered tcpdump weeks 3-5 (see NETAD below). You can have any number of tcpdump files, filtered or not, but only the first file is used for training.
Time zones come out right when compiled with Borland and your computer is set to Eastern time. DJGPP converts to UT. Haven't tried it in UNIX/g++. Try messing with the print_time() function if it doesn't work.
NETAD (Network Traffic Anomaly Detector) reads packets like PHAD with improvements. It is described in Network Traffic Anomaly Detection Based on Packet Bytes by Matthew V. Mahoney, to appear in Proc. ACM-SAC, Melbourne FL, 2003, (C) 2003, ACM. (PDF, 5 pages). It has two programs.
    tf.cpp, a traffic filtering program.
    netad.cpp, reads filtered traffic and outputs alarms.

To compile and run:

    g++ tf.cpp -O -o tf
    g++ netad.cpp -O -o netad
    tf in3*
    mv tf.out in3tf
    tf in4* in5*
    mv tf.out in45tf
    netad in3tf in45tf > netad.sim

Note: NETAD takes any number of files. The training period is hard coded. NETAD also produces a rules.txt file that you can delete.
To compile and run:
    g++ sad.cpp -O -o sad
    sad 38 in3tf in45tf > sad38.sim    (38 = TTL, 24 detections, 4 false alarms)
    sad 45 in[3-5]* > sad45.sim        (45 = last byte of source address, 71 det, 16 FA)

SAD examines one byte of incoming TCP SYN packets. The first argument is the byte offset to test, including the 16 byte tcpdump header and 14 byte Ethernet header. The other arguments are tcpdump files. The test periods (weeks 2, 4, 5) are hard coded.
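The two offsets in the examples can be checked with a little arithmetic. The sketch below is not taken from sad.cpp; sad_offset is a hypothetical helper, and the field positions within the IP header (TTL at byte 8, source address at bytes 12-15) are the standard IPv4 layout, assumed here rather than stated on this page.

```cpp
#include <cstddef>

// Header sizes as stated in the text above.
constexpr std::size_t kTcpdumpRecordHeader = 16;
constexpr std::size_t kEthernetHeader = 14;

// Byte positions inside the IPv4 header (standard layout, an assumption here).
constexpr std::size_t kIpTtl = 8;
constexpr std::size_t kIpSourceAddress = 12;  // first of 4 bytes

// Hypothetical helper: converts a byte position within the IP header
// into the offset argument that SAD expects.
constexpr std::size_t sad_offset(std::size_t ip_header_byte) {
    return kTcpdumpRecordHeader + kEthernetHeader + ip_header_byte;
}

// The two offsets used in the examples above.
static_assert(sad_offset(kIpTtl) == 38, "TTL");
static_assert(sad_offset(kIpSourceAddress + 3) == 45, "last byte of source address");
```

The same arithmetic gives 44 for the third byte of the source address.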
The merged results described in the paper were created with tm.cpp. Unfortunately we can't publish the real data because of privacy concerns. However if you collect your own, you could use this program to inject it into the DARPA data. To compile and run:
    g++ tm.cpp -O -o tm
    tm in3tf tcpdump_files...
    mv tm.out m3
    tm in45tf tcpdump_files...    (not the same ones)
    mv tm.out m45

Then use m3 and m45 in place of in3tf and in45tf for PHAD, ALAD, LERAD, NETAD, SAD, etc. tm takes 2 or more file name arguments and puts the result in tm.out. The total duration of files 2, 3, 4... should be at least as much as file 1. The time stamps of files 2, 3, 4... are adjusted to match those of file 1 in tm.out, even if there are gaps in both sets.
Labels for week 2 do not include durations, so 1 second was assumed. Categories (probe, DOS, R2L, U2R, Data) are as in weeks 4-5 but no other characteristics are assumed (whether visible in inside or outside traffic, BSM, etc). The data is derived from http://www.ll.mit.edu/IST/ideval/docs/1999/detections_1999.html
    eval.cpp        source code
    truth2eval.pl   |  Scripts to generate tables in eval.cpp.
    hosts2eval.pl   |  You don't need these unless you plan
    labels2eval.pl  |  to generate the tables in eval.cpp from new data.

To compile and run:

    g++ eval.cpp -O -o eval
    eval phad.sim                      (read from a file)
    phad 1123200 in[3-5]* | eval -     (or read from standard input)

EVAL takes 2 options, a reporting level (0-4, default 2) and the threshold in number of false alarms (default 100), e.g.
eval phad.sim 4 1000 (most details, detections at 1000 false alarms)
Level 0 only lists warnings about alarms that contain errors and are therefore ignored, for instance, if the IP address is 0.0.0.0 (below), the score is 0, the date is out of range, or data is missing or badly formatted.
Ignored: host: 0 03/31/1999 11:35:13 000.000.000.000 0.664225 # Ether Src Hi=xC66973 68%
Level 1 also prints a table of detections at 100 false alarms (or a different number if you specify one). All rows except the last are for the 201 attacks in weeks 4-5. The last row is for the 43 attacks in week 2. Each cell lists the number of detections out of the total number of various combinations of in-spec attacks. For instance, out of the 34 probes with evidence in the inside sniffer traffic (IT), 16 were detected at the lowest threshold allowing no more than 100 false alarms.
Detections/Total at 100 false alarms (weeks 4-5 only except last row)

            All     Probe   DOS     R2L     U2R     Data    New     Stealthy
            ------- ------- ------- ------- ------- ------- ------- --------
    W45     41/201  18/37   21/65   2/56    0/37    0/16    9/62    12/36    (Weeks 4-5)
    IT      39/177  16/34   21/60   2/54    0/27    0/7     8/52    10/30    (Inside sniffer evidence)
    OT      26/151  14/32   10/44   2/46    0/26    0/11    4/38    8/23     (Outside sniffer evidence)
    BSM     4/38    1/1     3/12    0/10    0/11    0/6     0/8     0/6      (Solaris BSM evidence)
    NT      3/33    0/3     3/7     0/10    0/12    0/4     3/26    0/0      (NT audit log evidence)
    FS      40/189  18/37   20/62   2/56    0/31    0/11    9/54    12/34    (File system dump evidence)
    pascal  12/55   4/8     7/20    1/12    0/11    0/6     1/11    3/9      (Solaris target)
    hume    4/48    0/7     4/15    0/12    0/13    0/5     3/31    0/2      (NT target)
    zeno    7/22    4/7     3/9     0/3     0/3     0/1     1/2     3/6      (SunOS target)
    marx    9/44    4/6     4/17    1/18    0/2     0/2     1/11    2/10     (Linux target)
    W2      0/43    0/9     0/13    0/6     0/12    0/3     0/0     0/0      (Week 2)

    41 detections, 6260 alarms, 47 true, 101 false, 6112 not evaluated.

Only the highest scoring alarms are evaluated. Evaluation was stopped at 101 false alarms in order to count detections between 100 and 101. There are 47 alarms that detect attacks, but 6 of these detect an attack already detected by another alarm. It is also possible for an alarm to detect more than one attack if the attacks overlap.
Level 2 (the default) also lists all attack types and the comments for the first alarm that detected it, for instance,
    Attack          FA (false alarms before detected)
    ------          ----
    dosnuke            1  # TCP URG Ptr=49 100%
    dosnuke            4  # TCP URG Ptr=49 100%
    dosnuke           16  # TCP URG Ptr=49 100%
    dosnuke           22  # TCP Flg UAPRSF=x39 100%
    insidesniffer     21  # TCP Checksum=xAE5B 37%
    insidesniffer     92  # Ether Src Hi=x00104B 60%
    mscan              7  # Ether Dest Hi=x29CDBA 57%
    pod                4  # IP Frag Ptr=x2000 100%
    pod               18  # IP Frag Ptr=x2000 100%
    ...

There were 4 dosnuke attacks detected at 100 false alarms. The number under FA is how many false alarms had a higher score than the highest scoring alarm to detect it. (There may be other alarms detecting each instance, but only the highest is shown.) For instance, if the threshold were set to allow 16 to 21 false alarms, then 3 instances of dosnuke would have been detected. The text after # is the comment from the alarm file.
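The relationship between the FA column and a false alarm threshold can be sketched in a few lines of C++. This is an illustration of the counting rule described above, not code from eval.cpp; detections_at is a hypothetical name.

```cpp
#include <cstddef>
#include <vector>

// fa holds, for each attack instance, the number of false alarms that scored
// higher than the best alarm detecting it (the FA column). An instance counts
// as detected when that number does not exceed the allowed false alarms.
std::size_t detections_at(const std::vector<int>& fa, int allowed_false_alarms) {
    std::size_t detected = 0;
    for (int f : fa) {
        if (f <= allowed_false_alarms) ++detected;
    }
    return detected;
}
```

With the four dosnuke rows (FA values 1, 4, 16, 22), detections_at returns 3 for any threshold from 16 to 21 and 4 from 22 upward.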
Level 3 also lists each detected attack in descending order of the highest scoring alarm that detected it. The FA column shows the number of false alarms with higher scores. For instance, 2 to 5 attacks would be detected (in the order shown) at thresholds allowing 4 false alarms. EVAL would report 5 detections in this case.
      n   FA  Attacks detected after FA false alarms
    ---  ---- --------------------------------------
      1    1  44.082615 teardrop W45 marx DOS LOG IT FS In
      2    1  44.110000 dosnuke W45 hume DOS New LOG IT FS In
      3    4  51.083800 pod W45 pascal DOS IT OT FS Man
      4    4  51.200037 udpstorm W45 DOS IT OT
      5    4  51.114500 dosnuke W45 hume DOS New LOG IT FS Man In
    ...
     39   92  51.171917 syslogd W45 pascal DOS LOG BSM IT FS
     40   93  51.084334 portsweep W45 Probe Stl OT FS In
     41   99  54.145832 satan W45 marx Probe IT OT FS

The notation after each attack is as follows.
    W2 - Week 2, data available before the 1999 evaluation (43 labeled attacks).
    W45 - Weeks 4 and 5, used in the evaluation (201 attacks).
    pascal (Solaris), hume (NT), zeno (SunOS), marx (Linux) - The 4 main targets. There may be more than one (ipsweep) or none (snmpget, targets the router).
    Probe, DOS, R2L, U2R, Data - Attack category, may be more than one (e.g. R2L-Data or U2R-Data).

The following apply to weeks 4 and 5 only.
    New - Attack type not found in week 2.
    Stl - Stealthy.
    IT - Evidence of attack in inside sniffer traffic (177 of 201).
    OT - Evidence in outside sniffer traffic (151).
    BSM - Evidence in Solaris BSM system call traces (38).
    LOG - Evidence in system/audit logs.
    NT - Evidence in NT audit logs (LOG + hume).
    FS - Evidence in file system dumps.
    DIR - Evidence in directory listings.
    Con - Attack launched from console.
    Man - Attack launched manually (not scripted).
    In - Insider attack (launched from within eyrie.af.mil).
Level 4 also lists all alarms above the threshold (false alarm limit) in their entirety, either as a false alarm or under the attack it detected. The number preceding each alarm is the number of false alarms with higher scores. For instance, the instance of teardrop on week 4, day 4 at 08:26:15 (read from the ID number 44.082615) would have been detected at a threshold allowing 1 false alarm. A second alarm detected it after 63 false alarms. ps was not detected.
    Top 100 false alarms (number, date, time, target, score, #comments)
    ------------------------------------------------------------------
      0 04/06/1999 08:59:16 172.016.112.194 0.748198 ## TCP Checksum=x6F4A 67%
      1 04/00/1999 08:00:28 192.168.001.030 0.664309 ## IP TOS=x20 100%
      2 04/00/1999 11:35:18 172.016.114.050 0.653956 ## Ether Dest Hi=xE78D76
    ...
    100 04/06/1999 05:35:35 192.168.001.010 0.273233 ## UDP Len=136 100%

    Detections listed by attack (preceding FAs, date, time, target, score)
    ---------------------------------------------------------------------
    41.084031 ps W45 pascal U2R Stl DIR LOG BSM IT OT FS
    41.084818 sendmail W45 marx R2L IT OT FS
         25 03/29/1999 08:48:10 172.016.114.050 0.434500 ## IP Src=202.049.244.010
    ...
    44.082615 teardrop W45 marx DOS LOG IT FS In
          1 04/01/1999 08:26:16 172.016.114.050 0.697083 ## IP Frag Ptr=x2000 100%
         63 04/01/1999 08:26:16 172.016.114.050 0.332558 ## IP Length=24 100%
    ...

Only the first 10 alarms detecting an attack are shown, with the total number given if more than 10. An alarm may appear twice under two overlapping attacks if it detects both. There are actually 101 false alarms listed.
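The attack ID numbers in these listings pack the week, day, and start time into one token, so 44.082615 reads as week 4, day 4, 08:26:15. A minimal C++ sketch of that decoding follows. AttackId and parse_attack_id are hypothetical names, not part of eval.cpp, and the sketch assumes the single-digit week and day seen in the examples.

```cpp
#include <cstdio>
#include <string>

// Decoded form of an ID such as "44.082615" (layout wd.hhmmss).
struct AttackId {
    int week;
    int day;     // day within the week, as numbered in the ID
    int hour;
    int minute;
    int second;
    bool ok;     // false if the text did not match the layout
};

// Hypothetical helper: splits the ID into its fields. The whole string
// must be consumed for the result to be marked ok.
AttackId parse_attack_id(const std::string& id) {
    AttackId r = {0, 0, 0, 0, 0, false};
    int used = 0;
    const int fields = std::sscanf(id.c_str(), "%1d%1d.%2d%2d%2d%n",
                                   &r.week, &r.day, &r.hour, &r.minute,
                                   &r.second, &used);
    r.ok = (fields == 5 && used == static_cast<int>(id.size()));
    return r;
}
```

The sketch stops at the fields themselves. Mapping week and day to a calendar date is left out because it depends on the evaluation schedule.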
eval version 12/15/03 adds a row labeled "poor" for the 72 attacks which were poorly detected in the 1999 evaluation. An attack is poorly detected if none of the original participants could detect more than half of the instances at 10 false alarms per day.
Published results for PHAD, ALAD, LERAD, and NETAD are based on two older programs, EVAL3 and EVAL4, and an alarm sorting and filtering program, AFIL.PL. You don't need these unless you want to verify our results, or if you want to evaluate merged results from multiple systems (a feature of EVAL3).
EVAL3 and EVAL4 differ from EVAL in that they allow attacks to be detected by the destination address given in the master detection truth table even if the address is the source rather than the target. EVAL only allows targets of 172.16.x.x or 192.168.x.x. EVAL3 and EVAL4 also count badly formatted alarms as false alarms. These differences mainly affect PHAD. EVAL3 also removes duplicate alarms (within 60 seconds) so it sometimes gives higher scores. Neither program distinguishes in-spec from out-of-spec detections, so out-of-spec detections have to be removed manually. There are some attack naming inconsistencies between the two (dict vs. guesstelnet, insidesniffer vs. illegalsniffer, etc). If an alarm occurs during two overlapping attacks, only one is counted (EVAL counts both).
The input format is more restricted. The alarms must be sorted by descending score and have fixed width fields as shown above with a # in column 55. The alarms output by EVAL at level 4 are compatible with EVAL3 and EVAL4.
    afil.pl filters and sorts alarms by score.
    eval3.cpp counts detections at various false alarm rates.
    eval4.cpp lists each alarm as a detection or false alarm.
    labels.txt, table1.txt, all.atk  Required data files for EVAL3 and EVAL4.

To compile:

    g++ eval3.cpp -O -o eval3
    g++ eval4.cpp -O -o eval4
    phad 1123200 in[3-5]* |perl afil.pl >phad.sim

This is almost equivalent to sorting by score:
phad 1123200 in[3-5]* |sort +0.45 -r >phad.sim
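The sort command above uses the old +pos key syntax, which some current sort implementations may not accept. As an alternative illustration, here is a minimal C++ sketch that orders alarm lines by descending score. It only sorts and does not reproduce whatever filtering afil.pl performs. The names alarm_score and sort_by_score are hypothetical, and the sketch assumes the score is the fifth whitespace-separated field, as in the alarm format described earlier.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Extracts the score (fifth whitespace-separated field) from an alarm line.
// Returns 0.0 if the line has fewer than five fields or a non-numeric score.
double alarm_score(const std::string& line) {
    std::istringstream in(line);
    std::string id, date, time, target;
    double score = 0.0;
    if (!(in >> id >> date >> time >> target >> score)) return 0.0;
    return score;
}

// Returns the alarms ordered by descending score. A stable sort keeps
// equal-scoring alarms in their original relative order.
std::vector<std::string> sort_by_score(std::vector<std::string> alarms) {
    std::stable_sort(alarms.begin(), alarms.end(),
                     [](const std::string& a, const std::string& b) {
                         return alarm_score(a) > alarm_score(b);
                     });
    return alarms;
}
```

Reading the lines from standard input and writing the sorted result back out would give a filter that could be used in a pipeline in the same place as the sort command.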
EVAL3 can take more than one .sim (alarm) file, in which case the alarms are merged by sorting by alarm score and taking equal numbers of top ranking alarms from each and removing duplicates. For instance,
    eval3 phad.sim alad.sim
    00000001 = phad.sim 6260
    00000002 = alad.sim 876
    245 attack instances
    IDS's       5   10   20   50  100  200  500 1000 2000 5000   TP / FA
    00000001    5    9   21   37   54   56   56   56   62   86   89 / 5592
    00000002   10   13   19   42   60   66   72   72   72   72   72 / 476
    00000003    6   14   21   44   73   96  108  110  111  125  128 / 6082
    1 41.084818 sendmail Clr Old R2L LINUX
    1 41.111531 portsweep Stlth Old PROBE CISCO
    1 41.133333 dict Clr Old R2L SOLARIS
    1 41.135830 ftpwrite Clr Old R2L SOLARIS
    ...
    1 CISCO
    1 DATA-llR2L-
    1 DATA-llR2L- sshtrojan
    1 DATA-llR2L--LINUX
    28 DOS
    3 DOS apache2
    3 DOS crashiis
    ...
    1 xlockClr
    1 xtermClr
    1 yagaClr
    73 attacks detected

The first lines assign hex numbers 1, 2, 4, 8, 10, 20... to each .sim file. The next set of lines show the number of attacks detected at 5 to 5000 false alarms for each possible merger. For instance, PHAD detects 54 attacks at 100 false alarms, ALAD detects 60, and PHAD+ALAD (1+2) detects 73. All possible mergers are tested.
The next set lists each detected attack for the merger of all systems at 100 false alarms by ID, name, clear or stealthy, old or new, category (probe, DOS, R2L, U2R), and operating system of the target. The next set counts the number in various groups, for instance, 28 DOS attacks, 3 apache2 attacks, etc. The last set counts by attack name and clear or stealthy.
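The hex numbers assigned to the .sim files act as a bit mask, so each merger is identified by the sum of its members (1+2 = 3 for PHAD+ALAD). The following C++ sketch shows only this numbering scheme, not the merging itself. The function names are hypothetical and are not taken from eval3.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Formats a merger mask the way the listing shows it, e.g. 3 -> "00000003".
std::string merger_id(unsigned mask) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "%08X", mask);
    return buf;
}

// Returns the files that belong to a merger. Bit i of the mask stands for
// the i-th file in the list, so the first file is 1, the second 2, then 4...
std::vector<std::string> merger_members(unsigned mask,
                                        const std::vector<std::string>& files) {
    assert(files.size() <= 32);
    std::vector<std::string> members;
    for (std::size_t i = 0; i < files.size(); ++i) {
        if (mask & (1u << i)) members.push_back(files[i]);
    }
    return members;
}

// Lists every non-empty combination of n files: masks 1 through 2^n - 1.
std::vector<unsigned> all_mergers(std::size_t n) {
    assert(n >= 1 && n < 32);
    std::vector<unsigned> masks;
    for (unsigned m = 1; m < (1u << n); ++m) masks.push_back(m);
    return masks;
}
```

For two files this gives the three rows 00000001, 00000002, and 00000003 seen in the example output.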
    eval4 s=phad f=16
    eval s=phad a=all.atk l=labels.txt t=0 f=16
    04/06 08:59:16 172.016.112.194 0.748198 FP TCP Checksum=x6F4A 67%
    04/01 08:26:16 172.016.114.050 0.697083 TP teardrop IP Frag Ptr=x2000 100%
    04/01 11:00:01 172.016.112.100 0.689305 TP dosnuke TCP URG Ptr=49 100%
    03/31 08:00:28 192.168.001.030 0.664309 FP IP TOS=x20 100%
    03/31 11:35:13 000.000.000.000 0.664225 FP Ether Src Hi=xC66973 68%
    03/31 11:35:18 172.016.114.050 0.653956 FP Ether Dest Hi=xE78D76 57%
    ...
    04/01 11:00:01 172.016.112.100 0.545068 -- dosnuke TCP Flg UAPRSF=x39 100%
    04/06 08:59:17 172.016.112.194 0.543602 FP TCP URG Ptr=42243 92%
    phad 10/17

Each alarm is shown as a detection (TP), false alarm (FP), or a duplicate detection of a previously detected attack, which is not counted (--). Each alarm shows the date, time, target, score, name, and comments. The last line shows the number of TP/FP. The FP count may be higher than the f option by 1, which allows detections between f and f+1 to be counted. In reality, some of these might or might not be counted depending on the threshold.
Duplicate alarms are not removed, so the number of detections may be lower than given by EVAL3. Also, some attack names may differ (e.g. dict vs. guesstelnet, illegalsniffer vs. insidesniffer) because the attack names are taken from labels.txt instead of table1.txt.
labels.txt and table1.txt are derived from the truth files at the LL DARPA evaluation website, except for one attack labeled apache2_err, which is an actual apache2 attack (look at the traffic), but was not included in the original evaluation (or by EVAL). This makes the total 202 attacks, not 201. all.atk is a list of attacks to be counted by EVAL4.
If all goes well, you should detect the following number of attacks, assuming all are in-spec. EVAL results are also shown for 177 inside sniffer attacks. Systems are trained on 7 days of inside sniffer week 3 and tested on 9 days of inside sniffer weeks 4 and 5. Alarms are filtered using AFIL.PL.
              eval3/202  eval4/202  eval/201  eval/177 (In spec: IT-All)
              ---------  ---------  --------  ----------------------
    PHAD          54         50        43        41
    ALAD          60         59        63        63
    LERAD        112        110       115       112   (randomized, may vary)
    NETAD        132        132       132       129
    SAD 44        77         77        81        79   (third byte of source address)
Results updated 4/17/03. Prior results did not include filtering with AFIL.PL except for NETAD.
Matt Mahoney, firstname.lastname@example.org