Network Anomaly Intrusion Detection Research at Florida Tech.

Below are links to papers and source code for four experimental network anomaly intrusion detection systems, PHAD, ALAD, LERAD, and NETAD, and software to evaluate them on the 1999 Lincoln Laboratory DARPA Intrusion Detection System Evaluation data set. The software was developed by Matt Mahoney as dissertation research under Philip K. Chan at Florida Tech in 2000-2003. Dissertation Slides.

The EVAL program is generally useful for testing intrusion detection algorithms on the 1999 data. The other programs are made available for reference.

You may use, copy, modify, and redistribute this code under the terms of the GNU General Public License.

Intrusion Detection Algorithms

All of our programs were evaluated on the 1999 Lincoln Laboratory DARPA Intrusion Detection System Evaluation data set. To repeat our tests, you need to download the inside sniffer traffic (inside.tcpdump.gz) from week 3 (7 files, Mon.-Fri. and extra Mon.-Tues.), week 4 (4 files, Mon., Wed.-Fri.), and week 5 (5 files, Mon.-Fri.). These are big files, about 200 MB compressed each. We uncompressed them with gzip and renamed them as follows:

  Week 3 (training) in31 in32 in33 in34 in35 in36 in37
  Week 4 (test)     in41      in43 in44 in45
  Week 5 (test)     in51 in52 in53 in54 in55

All programs produce a list of alarms in the format used by the original 1999 evaluation, which has the form

iiiiiiii mm/dd/yyyy hh:mm:ss aaa.aaa.aaa.aaa s.ssssss #comments
where iiiiiiii is an 8-digit identifier (ignored, always 0), mm/dd/yyyy is the date of the alarm, hh:mm:ss is the time in EST (week 4) or EDT (week 5), aaa.aaa.aaa.aaa is the target (destination) IP address with leading zeros in each decimal byte, and s.ssssss is the alarm score, ranging from 0.000000 to 9.999999. For example:
       0 04/06/1999 08:59:16 172.016.112.194 0.631169 # a comment
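
As a rough illustration (this is not code from any of the programs below), an alarm line in this format can be produced with fixed-width fields, for example:

  #include <cstdio>

  // Sketch only: write one alarm line in the 1999 evaluation format.
  // The field values are made up for illustration.
  int main() {
      int id = 0;                          // 8-digit identifier (ignored, always 0)
      int mon = 4, day = 6, year = 1999;   // date of the alarm
      int hh = 8, mm = 59, ss = 16;        // time (EST or EDT)
      int ip[4] = {172, 16, 112, 194};     // target (destination) address
      double score = 0.631169;             // 0.000000 to 9.999999
      std::printf("%8d %02d/%02d/%04d %02d:%02d:%02d %03d.%03d.%03d.%03d %.6f # a comment\n",
                  id, mon, day, year, hh, mm, ss, ip[0], ip[1], ip[2], ip[3], score);
      return 0;
  }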

WARNING. This software is provided as reference code to supplement our papers. These are NOT intrusion detection systems. The programs have ONLY been tested with off-line inside sniffer data from the 1999 LL IDS evaluation. They have ONLY been tested on one PC and/or Sun workstation using one version of g++ and Perl. You will probably have to modify them to get them to work even on the same data, and almost certainly if you use them on different data. If you do, you are on your own. However, we do know of a few bugs you will have to deal with.

PHAD

PHAD (Packet Header Anomaly Detector) detects anomalies in Ethernet, IP, TCP, UDP, and ICMP packet headers. It is described in PHAD: Packet Header Anomaly Detection for Identifying Hostile Network Traffic, by Matthew V. Mahoney and Philip K. Chan, 2001, Florida Tech. technical report CS-2001-4 (PDF, 17 pages).

The source code is phad.cpp. To compile and run:

  g++ phad.cpp -O -o phad
  phad 1123200 in3* in4* in5* >phad.sim
The arguments are the training period in seconds (1123200 is 13 days, i.e. 13 * 86400 seconds) and the tcpdump files in chronological order.

ALAD

ALAD (Application Layer Anomaly Detector) detects anomalies in inbound TCP stream connections to well-known server ports. It is described in Learning Nonstationary Models of Normal Network Traffic for Detecting Novel Attacks by Matthew V. Mahoney and Philip K. Chan, Proc. Eighth Intl. Conf. Knowledge Discovery and Data Mining, pp. 376-385, 2002. (C) 2002, ACM (PDF, 10 pages). The code has two components.

  te.cpp, a program to extract TCP streams from tcpdump files
  alad.pl, a Perl script that reads te output and generates alarms
    (rename the .txt extension to .pl after downloading)
To compile and run:
  g++ te.cpp -O -o te
  te in3* > train
  te in4* in5* > test
  perl alad.pl train test > alad.sim

LERAD

LERAD (Learning Rules for Anomaly Detection) detects TCP stream anomalies like ALAD but uses a learning algorithm to pick good rules from the training set, rather than a fixed set of rules. It is described in Learning Models of Network Traffic for Detecting Novel Attacks by Matthew V. Mahoney and Philip K. Chan, Florida Institute of Technology Technical Report CS-2002-08 (PDF, 48 pages double spaced). It uses the files train and test from ALAD and two programs:

  a2l.pl converts train and test from ALAD to a LERAD database format.
  lerad.cpp outputs alarms.
To compile and run:
  g++ lerad.cpp -O -o lerad
  perl a2l.pl train > train.txt
  perl a2l.pl test > test.txt
  lerad train.txt test.txt 0 > lerad.sim
The third argument is a random number seed (default 0). LERAD uses a randomized algorithm, so each seed gives a slightly different result. LERAD also lists the rules it generates to the file rules.txt.

In Learning Rules for Anomaly Detection of Hostile Network Traffic, Proc. ICDM 2003 (© 2003, IEEE) (PowerPoint slides) and the longer Technical Report TR-CS-2003-16, LERAD-TCP is just LERAD as described above, and LERAD-PKT is leradp.cpp. To use:

  bcc32 leradp.cpp
  leradp 0 in3tf in45tf > leradp.sim
where 0 is the random number seed (can be any integer), and in3tf and in45tf are filtered tcpdump traffic for weeks 3 and 4-5 (see NETAD below). You can use any number of tcpdump files, filtered or not, but only the first file is used for training.

Time zones come out right when the code is compiled with Borland and your computer is set to Eastern time. DJGPP converts to UTC. We have not tried it under UNIX with g++. If the time stamps are wrong, try modifying the print_time() function.

NETAD

NETAD (Network Traffic Anomaly Detector) reads packets as PHAD does, but with some improvements. It is described in Network Traffic Anomaly Detection Based on Packet Bytes by Matthew V. Mahoney, to appear in Proc. ACM-SAC, Melbourne FL, 2003, (C) 2003, ACM. (PDF, 5 pages). It has two programs.

  tf.cpp, a traffic filtering program.
  netad.cpp, reads filtered traffic and outputs alarms.
To compile and run:
  g++ tf.cpp -O -o tf
  g++ netad.cpp -O -o netad
  tf in3*
  mv tf.out in3tf
  tf in4* in5*
  mv tf.out in45tf
  netad in3tf in45tf > netad.sim
Note: NETAD takes any number of files. The training period is hard coded. NETAD also produces a rules.txt file that you can delete.

SAD

SAD (Simulation Artifact Detector) is described in An Analysis of the 1999 DARPA/Lincoln Laboratories Evaluation Data for Network Anomaly Detection by Matthew V. Mahoney and Philip K. Chan, TR-CS-2003-02. (See also Proc. RAID, 2003, pp. 220-237). Slides (PowerPoint)

To compile and run:

  g++ sad.cpp -O -o sad
  sad 38 in3tf in45tf > sad38.sim   (38 = TTL, 24 detections, 4 false alarms)
  sad 45 in[3-5]* > sad45.sim       (45 = last byte of source address, 71 det, 16 FA)
SAD examines one byte of incoming TCP SYN packets. The first argument is the byte offset to test, which includes the 16-byte tcpdump record header and the 14-byte Ethernet header. The other arguments are tcpdump files. The test periods (weeks 2, 4, 5) are hard coded.
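
For example, the IP header starts at offset 16 + 14 = 30 in each record; since the TTL is byte 8 of the IP header and the source address occupies bytes 12-15, the offsets used above work out to:

  30 + 8  = 38   (IP time-to-live)
  30 + 14 = 44   (third byte of the source IP address)
  30 + 15 = 45   (last byte of the source IP address)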

The merged results described in the paper were created with tm.cpp. Unfortunately, we can't publish the real data because of privacy concerns. However, if you collect your own, you could use this program to inject it into the DARPA data. To compile and run:

  g++ tm.cpp -O -o tm
  tm in3tf tcpdump_files...
  mv tm.out m3
  tm in45tf tcpdump_files... (not the same ones)
  mv tm.out m45
Then use m3 and m45 in place of in3tf and in45tf for PHAD, ALAD, LERAD, NETAD, SAD, etc. tm takes two or more file name arguments and puts the result in tm.out. The total duration of files 2, 3, 4, ... should be at least as long as that of file 1. The time stamps of files 2, 3, 4, ... are adjusted to match those of file 1 in tm.out, even if there are gaps in both sets.

Evaluation Programs

EVAL is the program recommended for evaluating intrusion detection systems of any kind on the 1999 DARPA/LL data. Published results for PHAD, ALAD, LERAD, and NETAD are based on two older programs, EVAL3 and EVAL4, which are included for reference. All of these programs are based on criteria used in the 1999 evaluation, but there are minor differences between them. Future work will use EVAL.

EVAL

EVAL reads a .sim file (e.g. phad.sim, output by PHAD) and reports the number of in-spec attacks detected at the lowest threshold allowing 100 false alarms (or another level you specify). An attack is considered detected if any alarm identifies the target address (or one target address if there are many) and a time within 60 seconds of any portion of the attack on that target. Any alarm that detects no attacks (in-spec or not) is a false alarm. A target address is any address on the networks 172.16.x.x or 192.168.x.x from the truth table at http://www.ll.mit.edu/IST/ideval/docs/1999/master-listfile-condensed.txt (as of Jan. 2, 2003), with the exception of three attacks (one portsweep and two httptunnel) that have no inside addresses. In those cases, the target address is taken from the ID list at http://www.ll.mit.edu/IST/ideval/docs/1999/master_identifications.list even though this file shows the local address as the source instead of the target. (httptunnel is a backdoor that initiates outgoing traffic as a client. I don't know about the portsweep.)
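
As a rough sketch of this matching rule (the structures and names here are hypothetical, not taken from eval.cpp):

  #include <set>
  #include <string>
  #include <vector>

  // Hypothetical representation of one labeled attack.
  struct Attack {
      std::set<std::string> targets;   // inside target address(es)
      double start, end;               // attack interval in seconds
  };

  // An alarm detects an attack if it names one of the attack's target
  // addresses and its time falls within 60 seconds of any portion of the attack.
  bool detects(const std::string& alarm_addr, double alarm_time, const Attack& a) {
      return a.targets.count(alarm_addr) > 0 &&
             alarm_time >= a.start - 60 && alarm_time <= a.end + 60;
  }

  // An alarm that detects no attack at all (in-spec or not) is a false alarm.
  bool is_false_alarm(const std::string& addr, double t,
                      const std::vector<Attack>& attacks) {
      for (const Attack& a : attacks)
          if (detects(addr, t, a)) return false;
      return true;
  }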

Labels for week 2 do not include durations, so 1 second was assumed. Categories (probe, DOS, R2L, U2R, Data) are as in weeks 4-5 but no other characteristics are assumed (whether visible in inside or outside traffic, BSM, etc). The data is derived from http://www.ll.mit.edu/IST/ideval/docs/1999/detections_1999.html

  eval.cpp source code
    truth2eval.pl     | Scripts to generate tables in eval.cpp
    hosts2eval.pl     | You don't need these unless you plan
    labels2eval.pl    | to generate the tables in eval.cpp from new data
To compile and run:
  g++ eval.cpp -O -o eval
  eval phad.sim                      (read from a file)
  phad 1123200 in[3-5]* | eval -     (or read from standard input)
EVAL takes 2 options, a reporting level (0-4, default 2) and the threshold in number of false alarms (default 100), e.g.
  eval phad.sim 4 1000               (most details, detections at 1000 false alarms)

Level 0 only lists warnings about alarms containing errors, which are ignored; for instance, if the IP address is 0.0.0.0 (as below), the score is 0, the date is out of range, or data is missing or badly formatted.

  Ignored: host:        0 03/31/1999 11:35:13 000.000.000.000 0.664225 # Ether Src Hi=xC66973 68%

Level 1 also prints a table of detections at 100 false alarms (or a different number if you specify one). All rows except the last are for the 201 attacks in weeks 4-5. The last row is for the 43 attacks in week 2. Each cell lists the number of detections out of the total number of in-spec attacks in each category. For instance, out of the 34 probes with evidence in the inside sniffer traffic (IT), 16 were detected at the lowest threshold allowing no more than 100 false alarms.

  Detections/Total at 100 false alarms (weeks 4-5 only except last row)

           All    Probe    DOS     R2L     U2R    Data     New   Stealthy
         ------- ------- ------- ------- ------- ------- ------- --------
  W45     41/201  18/37   21/65    2/56    0/37    0/16    9/62   12/36   (Weeks 4-5)
  IT      39/177  16/34   21/60    2/54    0/27    0/7     8/52   10/30   (Inside sniffer evidence)
  OT      26/151  14/32   10/44    2/46    0/26    0/11    4/38    8/23   (Outside sniffer evidence)
  BSM      4/38    1/1     3/12    0/10    0/11    0/6     0/8     0/6    (Solaris BSM evidence)
  NT       3/33    0/3     3/7     0/10    0/12    0/4     3/26    0/0    (NT audit log evidence)
  FS      40/189  18/37   20/62    2/56    0/31    0/11    9/54   12/34   (File system dump evidence)
  pascal  12/55    4/8     7/20    1/12    0/11    0/6     1/11    3/9    (Solaris target)
  hume     4/48    0/7     4/15    0/12    0/13    0/5     3/31    0/2    (NT target)
  zeno     7/22    4/7     3/9     0/3     0/3     0/1     1/2     3/6    (SunOS target)
  marx     9/44    4/6     4/17    1/18    0/2     0/2     1/11    2/10   (Linux target)
  W2       0/43    0/9     0/13    0/6     0/12    0/3     0/0     0/0    (Week 2)

  41 detections, 6260 alarms, 47 true, 101 false, 6112 not evaluated.
Only the highest scoring alarms are evaluated. Evaluation was stopped at 101 false alarms in order to count detections between 100 and 101. There are 47 alarms that detect attacks, but 6 of these detect an attack already detected by another alarm. It is also possible for an alarm to detect more than one attack if the attacks overlap.

Level 2 (the default) also lists each detected attack and the comments from the first (highest scoring) alarm that detected it, for instance,

  Attack            FA (false alarms before detected)
  ------           ----
  dosnuke             1 # TCP URG Ptr=49 100%
  dosnuke             4 # TCP URG Ptr=49 100%
  dosnuke            16 # TCP URG Ptr=49 100%
  dosnuke            22 # TCP Flg UAPRSF=x39 100%
  insidesniffer      21 # TCP Checksum=xAE5B 37%
  insidesniffer      92 # Ether Src Hi=x00104B 60%
  mscan               7 # Ether Dest Hi=x29CDBA 57%
  pod                 4 # IP Frag Ptr=x2000 100%
  pod                18 # IP Frag Ptr=x2000 100%
  ...
There were 4 dosnuke attacks detected at 100 false alarms. The number under FA is how many false alarms had a higher score than the highest scoring alarm that detected the attack. (There may be other alarms detecting each instance, but only the highest is shown.) For instance, if the threshold were set to allow 16 to 21 false alarms, then 3 instances of dosnuke would have been detected. The text after # is the comment from the alarm file.

Level 3 also lists each detected attack in descending order of the highest scoring alarm that detected it. The FA column shows the number of false alarms with higher scores. For instance, 2 to 5 attacks would be detected (in the order shown) at thresholds allowing 4 false alarms. EVAL would report 5 detections in this case.

  n    FA  Attacks detected after FA false alarms
 ---  ---- --------------------------------------
   1     1 44.082615 teardrop        W45 marx DOS LOG IT FS  In
   2     1 44.110000 dosnuke         W45 hume DOS New LOG IT FS  In
   3     4 51.083800 pod             W45 pascal DOS IT OT FS  Man
   4     4 51.200037 udpstorm        W45 DOS IT OT
   5     4 51.114500 dosnuke         W45 hume DOS New LOG IT FS  Man In
...
  39    92 51.171917 syslogd         W45 pascal DOS LOG BSM IT FS
  40    93 51.084334 portsweep       W45 Probe Stl OT FS  In
  41    99 54.145832 satan           W45 marx Probe IT OT FS
The notation after each attack is as follows.
  W2 -  Week 2, data available before the 1999 evaluation (43 labeled attacks).
  W45 - Weeks 4 and 5, used in the evaluation (201 attacks).
  pascal (Solaris), hume (NT), zeno (SunOS), marx (Linux) - The 4 main
    targets.  There may be more than one (ipsweep) or none (snmpget,
    targets the router).
  Probe, DOS, R2L, U2R, Data - Attack category, may be more than one
    (e.g. R2L-Data or U2R-Data).
The following apply to weeks 4 and 5 only.
  New - Attack type not found in week 2.
  Stl - Stealthy.
  IT  - Evidence of attack in inside sniffer traffic (177 of 201).
  OT  - Evidence in outside sniffer traffic (151).
  BSM - Evidence in Solaris BSM system call traces (38).
  LOG - Evidence in system/audit logs.
  NT  - Evidence in NT audit logs (LOG + hume).
  FS  - Evidence in file system dumps.
  DIR - Evidence in directory listings.
  Con - Attack launched from console.
  Man - Attack launched manually (not scripted).
  In  - Insider attack (launched from within eyrie.af.mil).

Level 4 also lists all alarms above the threshold (false alarm limit) in their entirety, either as false alarms or under the attacks they detected. The number preceding each alarm is the number of false alarms with higher scores. For instance, the instance of teardrop on week 4, day 4 at 08:26:15 (read from the ID number 44.082615) would have been detected at a threshold allowing 1 false alarm. A second alarm detected it after 63 false alarms. ps was not detected.

  Top 100 false alarms (number, date, time, target, score, #comments)
  ------------------------------------------------------------------
         0 04/06/1999 08:59:16 172.016.112.194 0.748198 ## TCP Checksum=x6F4A 67%
         1 03/31/1999 08:00:28 192.168.001.030 0.664309 ## IP TOS=x20 100%
         2 03/31/1999 11:35:18 172.016.114.050 0.653956 ## Ether Dest Hi=xE78D76 57%
  ...
       100 04/06/1999 05:35:35 192.168.001.010 0.273233 ## UDP Len=136 100%

  Detections listed by attack (preceding FAs, date, time, target, score)
  ---------------------------------------------------------------------
  41.084031 ps              W45 pascal U2R Stl DIR LOG BSM IT OT FS
  41.084818 sendmail        W45 marx R2L IT OT FS
        25 03/29/1999 08:48:10 172.016.114.050 0.434500 ## IP Src=202.049.244.010
  ...
  44.082615 teardrop        W45 marx DOS LOG IT FS  In
         1 04/01/1999 08:26:16 172.016.114.050 0.697083 ## IP Frag Ptr=x2000 100%
        63 04/01/1999 08:26:16 172.016.114.050 0.332558 ## IP Length=24 100%
  ...
Only the first 10 alarms detecting an attack are shown, with the total number given if more than 10. An alarm may appear twice under two overlapping attacks if it detects both. There are actually 101 false alarms listed.

eval version 12/15/03 adds a row labeled "poor" for the 72 attacks which were poorly detected in the 1999 evaluation. An attack is poorly detected if none of the original participants could detect more than half of the instances at 10 false alarms per day.

Old Evaluation Code

Published results for PHAD, ALAD, LERAD, and NETAD are based on two older programs, EVAL3 and EVAL4, and an alarm sorting and filtering program, AFIL.PL. You don't need these unless you want to verify our results, or if you want to evaluate merged results from multiple systems (a feature of EVAL3).

EVAL3 and EVAL4 differ from EVAL in that they allow attacks to be detected by the destination address given in the master detection truth table even if that address is the source rather than the target. EVAL only allows targets of 172.16.x.x or 192.168.x.x. EVAL3 and EVAL4 also count badly formatted alarms as false alarms. These differences mainly affect PHAD. EVAL3 also removes duplicate alarms (within 60 seconds) so it sometimes gives higher scores. Neither distinguishes in-spec from out-of-spec detections, so out-of-spec attacks have to be removed manually. There are some attack naming inconsistencies between the two (dict vs. guesstelnet, insidesniffer vs. illegalsniffer, etc.). If an alarm occurs during two overlapping attacks, only one is counted (EVAL counts both).

The input format is more restricted. The alarms must be sorted by descending score and have fixed-width fields as shown above, with a # in column 55. (Counting the fields: columns 1-8 hold the identifier, 10-19 the date, 21-28 the time, 30-44 the target address, and 46-53 the score, with single spaces between them, which puts the # in column 55.) The alarms output by EVAL at level 4 are compatible with EVAL3 and EVAL4.

  afil.pl filters and sorts alarms by score.
  eval3.cpp counts detections at various false alarm rates.
  eval4.cpp lists each alarm as a detection or false alarm.
  labels.txt, table1.txt, all.atk Required data files for EVAL3 and EVAL4.
To compile:
  g++ eval3.cpp -O -o eval3
  g++ eval4.cpp -O -o eval4

AFIL.PL

AFIL.PL is a Perl program that reads an unsorted .sim file from standard input (or a file named on the command line) and sorts it by descending score. It also removes duplicate alarms within the same one minute period, so it is sometimes faster than using sort and gives more detections if there are bursts of alarms (a rough sketch of this filtering appears after the examples below). It should be used to filter the output of PHAD, ALAD, LERAD, and NETAD, e.g.
  phad 1123200 in[3-5]* |perl afil.pl >phad.sim
This is almost equivalent to sorting by score:
  phad 1123200 in[3-5]* |sort +0.45 -r >phad.sim
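
As a rough sketch of the filtering idea, in C++ rather than Perl, and assuming a duplicate means a lower scoring alarm for the same target within 60 seconds (this is how EVAL3 describes its own duplicate removal; afil.pl's exact rule may differ):

  #include <algorithm>
  #include <cmath>
  #include <string>
  #include <vector>

  struct Alarm { double time, score; std::string target, line; };

  // Sort alarms by descending score, then drop any alarm whose target was
  // already reported by a higher scoring alarm within 60 seconds.
  std::vector<Alarm> filter_alarms(std::vector<Alarm> alarms) {
      std::sort(alarms.begin(), alarms.end(),
                [](const Alarm& a, const Alarm& b) { return a.score > b.score; });
      std::vector<Alarm> kept;
      for (const Alarm& a : alarms) {
          bool dup = false;
          for (const Alarm& k : kept)
              if (k.target == a.target && std::fabs(k.time - a.time) < 60) {
                  dup = true;
                  break;
              }
          if (!dup) kept.push_back(a);
      }
      return kept;
  }

Discarding a burst of near-identical alarms frees up the false alarm budget, which is why this kind of filtering can give more detections at the same threshold.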

EVAL3

EVAL3 lists the number of detections at false alarm rates from 5 to 5000. Prior to counting detections, it removes the lower scoring of duplicate alarms that target the same IP address within 60 seconds. This step improves the detection rate slightly, especially for systems that use binary scoring or that merge alarms from multiple systems.

EVAL3 can take more than one .sim (alarm) file, in which case the alarms are merged by sorting by alarm score, taking equal numbers of top ranking alarms from each file, and removing duplicates (a sketch of this merging step appears after the example below). For instance,

  eval3 phad.sim alad.sim

  00000001 = phad.sim 6260
  00000002 = alad.sim 876
  245 attack instances

  IDS's        5   10   20   50  100  200  500 1000 2000 5000   TP /    FA
  00000001     5    9   21   37   54   56   56   56   62   86   89 /  5592
  00000002    10   13   19   42   60   66   72   72   72   72   72 /   476
  00000003     6   14   21   44   73   96  108  110  111  125  128 /  6082
  1 41.084818 sendmail Clr Old R2L LINUX
  1 41.111531 portsweep Stlth Old PROBE CISCO
  1 41.133333 dict Clr Old R2L SOLARIS
  1 41.135830 ftpwrite Clr Old R2L SOLARIS
  ...
   1 CISCO
   1 DATA-llR2L-
   1 DATA-llR2L- sshtrojan
   1 DATA-llR2L--LINUX
  28 DOS
   3 DOS apache2
   3 DOS crashiis
  ...
   1 xlockClr
   1 xtermClr
   1 yagaClr
  73 attacks detected
The first lines assign hex numbers 1, 2, 4, 8, 10, 20, ... to each .sim file. The next set of lines shows the number of attacks detected at 5 to 5000 false alarms for each possible merger. For instance, PHAD detects 54 attacks at 100 false alarms, ALAD detects 60, and PHAD+ALAD (1+2 = 3) detects 73. All possible mergers are tested.

The next set lists each detected attack for the merger of all systems at 100 false alarms by ID, name, clear or stealthy, old or new, category (probe, DOS, R2L, U2R), and operating system of the target. The next set counts the number in various groups, for instance, 28 DOS attacks, 3 apache2 attacks, etc. The last set counts by attack name and clear or stealthy.
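
A rough sketch of the merging step (hypothetical; eval3.cpp's exact ranking and tie handling may differ) is to take the same number of top scoring alarms from each system's sorted list and then remove duplicates as above:

  #include <cstddef>
  #include <string>
  #include <vector>

  struct Alarm { double time, score; std::string target; };

  // Merge several alarm lists by taking the top 'per_system' alarms from
  // each one.  Each list is assumed already sorted by descending score,
  // as produced by afil.pl.  Duplicate removal (same target within 60 s)
  // would follow, as in the AFIL.PL sketch above.
  std::vector<Alarm> merge_top(const std::vector<std::vector<Alarm> >& systems,
                               std::size_t per_system) {
      std::vector<Alarm> merged;
      for (std::size_t s = 0; s < systems.size(); ++s)
          for (std::size_t i = 0; i < per_system && i < systems[s].size(); ++i)
              merged.push_back(systems[s][i]);
      return merged;
  }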

EVAL4

EVAL4 labels each alarm as an attack or false alarm, then prints the total number of attacks and false alarms. It takes only one .sim file (s=phad.sim). It stops after 100 false alarms unless you specify a different number, e.g. f=16. There are other options, but they are not very useful. For example,
  eval4 s=phad f=16
  eval s=phad a=all.atk l=labels.txt t=0 f=16
  04/06 08:59:16 172.016.112.194 0.748198 FP               TCP Checksum=x6F4A 67%
  04/01 08:26:16 172.016.114.050 0.697083 TP teardrop      IP Frag Ptr=x2000 100%
  04/01 11:00:01 172.016.112.100 0.689305 TP dosnuke       TCP URG Ptr=49 100%
  03/31 08:00:28 192.168.001.030 0.664309 FP               IP TOS=x20 100%
  03/31 11:35:13 000.000.000.000 0.664225 FP               Ether Src Hi=xC66973 68%
  03/31 11:35:18 172.016.114.050 0.653956 FP               Ether Dest Hi=xE78D76 57%
  ...
  04/01 11:00:01 172.016.112.100 0.545068 -- dosnuke       TCP Flg UAPRSF=x39 100%
  04/06 08:59:17 172.016.112.194 0.543602 FP               TCP URG Ptr=42243 92%
  phad 10/17
Each alarm is shown as a detection (TP), a false alarm (FP), or a duplicate detection of a previously detected attack, which is not counted (--). Each alarm shows the date, time, target, score, attack name, and comments. The last line shows the number of TP/FP (which may be higher than the f option by 1; this allows detections between f and f+1 to be counted. In reality, some of these might or might not be counted, depending on the threshold).

Duplicate alarms are not removed, so the number of detections may be lower than given by EVAL3. Also, some attack names may differ (e.g. dict vs. guesstelnet, illegalsniffer vs. insidesniffer) because the attack names are taken from labels.txt instead of table1.txt.

labels.txt and table1.txt are derived from the truth files at the LL DARPA evaluation website, except for one attack labeled apache2_err, which is an actual apache2 attack (look at the traffic), but was not included in the original evaluation (or by EVAL). This makes the total 202 attacks, not 201. all.atk is a list of attacks to be counted by EVAL4.

Testing

If all goes well, you should detect the following number of attacks, assuming all are in-spec. EVAL results are also shown for the 177 attacks with inside sniffer evidence. Systems are trained on 7 days of inside sniffer traffic from week 3 and tested on 9 days from weeks 4 and 5. Alarms are filtered using AFIL.PL.

        eval3/202 eval4/202 eval/201 eval/177 (In spec: IT-All)
        --------- --------- -------- ----------------------
  PHAD      54       50       43       41
  ALAD      60       59       63       63
  LERAD    112      110      115      112  (randomized, may vary)
  NETAD    132      132      132      129
  SAD 44    77       77       81       79  (third byte of source address)

Results updated 4/17/03. Prior results did not include filtering with AFIL.PL except for NETAD.

Matt Mahoney, mmahoney@cs.fit.edu