Computer Science Technical Reports
2003
-
CS-2003-01 (8 pages) (pdf)
Title HEAT: Runtime Interception of Win32 Functions Authors Michael M. Andrews Contact Email Address Faculty Sponsor Michael M. Andrews TR number assignment date 14 January 2003 Abstract
When researching in any field, it is always important to build upon existing work. In most research disciplines, the core base is well defined and widely documented. However, the basis of innovative research in computer science, especially at the systems level, can often be proprietary information. This can be a major problem: unless you have source code available, you must be granted access to the proprietary information (usually by signing non-disclosure documents), which comes with barriers on access, ownership, and the ability to publish results.
As an example, Microsoft's Windows operating system (all versions) can be a good base for research, as it is a "real world" environment with rich usage patterns, and any potential advances made would benefit a large number of people. However, access to the Windows operating system is closely guarded via a number of APIs (Application Programming Interfaces), and Microsoft jealously guards its source code. Therefore, the ability to easily tie into and extend the functionality of applications and operating systems where source code is not available is a necessity.
With this in mind, researchers at the Florida Institute of Technology created HEAT - Hostile Environment for Application Testing. The initial goal of this research was to create a tool that would produce hard-to-reproduce environments (like low or corrupt memory, insufficient disk space, or a slow network) to help uncover defects in software. To achieve this, it was clear that HEAT would have to tie into both the operating system and the application under test.
-
CS-2003-02 (20 pages) (pdf)
Title An Analysis of the 1999 DARPA/Lincoln Laboratory Evaluation Data for Network Anomaly Detection Authors Matthew V. Mahoney and Philip K. Chan Contact Email Address Faculty Sponsor Philip K. Chan TR number assignment date 27 January 2003 Revised 7 April 2003, old version (13 pages) here Abstract
We investigate potential simulation artifacts and their effects on the evaluation of network anomaly detection systems in the 1999 DARPA/MIT Lincoln Laboratory off-line intrusion detection evaluation data set. A statistical comparison of the simulated background and training traffic with real traffic collected from a university departmental server suggests the presence of artifacts that could allow a network anomaly detection system to detect some novel intrusions based on idiosyncrasies of the underlying implementation of the simulation, with an artificially low false alarm rate. The evaluation problem can be mitigated by mixing real traffic into the simulation. We compare five anomaly detection algorithms on simulated and mixed traffic. On mixed traffic they detect fewer attacks, but the explanations for these detections are more plausible.
-
CS-2003-03 (7 pages) (pdf)
Title Testing with Hostile Data Streams Author Alan A. Jorgensen Contact Email Address TR number assignment date 30 January 2003 Abstract
This note describes a method of testing software for response to malicious data streams. Systems that process data streams obtained from an external source such as the Internet are vulnerable to security issues if malicious data is not processed correctly. This note describes a testing method that creates malicious data streams, applies them to a software application and checks the appropriateness of the application response.
The note begins with a description of the problem: inadequate testing of software response to malicious data streams. I present a method of testing the response to malicious data streams and introduce the concepts of lexical, syntactic, and semantic data stream deformation. I provide a description of a system that produces and applies such tests, dividing the testing system into components and providing some detail about each component. Applying this system to Adobe® Acrobat® Reader® version 5.0.1 provides a case study. The study applied 141,306 unique test cases and revealed 11 distinct indications of buffer overrun, numerous program lock-ups, and four steganographic possibilities.
Research is on-going in the following areas: generalized buffer overrun exploitation, maliciously testing protocols and testing with encoded or encrypted data streams.
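As a hedged illustration of the three deformation levels named above (lexical, syntactic, and semantic), the following sketch applies each to a toy byte stream; the function names and specific deformations are our own, not those of the tool described in the note.

```python
import random

def lexical_deform(data: bytes, rng: random.Random) -> bytes:
    """Flip one byte: malformed at the token/character level."""
    i = rng.randrange(len(data))
    return data[:i] + bytes([data[i] ^ 0xFF]) + data[i + 1:]

def syntactic_deform(data: bytes) -> bytes:
    """Delete a structural delimiter so the stream no longer parses."""
    for i, b in enumerate(data):
        if b in b"[](){}<>":
            return data[:i] + data[i + 1:]
    return data

def semantic_deform(data: bytes) -> bytes:
    """Keep the syntax valid but make an embedded length field lie."""
    return data.replace(b"Length 6", b"Length 999999")

rng = random.Random(42)
stream = b"<<Length 6>>[hello]"   # a made-up, vaguely PDF-like stream
print(lexical_deform(stream, rng))
print(syntactic_deform(stream))
print(semantic_deform(stream))
```

Each deformed stream would then be fed to the application under test while observing its response, as the testing method describes.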
-
CS-2003-04 (8 pages) (pdf)
Abstract
Naturally forming multi-agent systems (aka swarms) have the ability to grow to enormous sizes without requiring any of the agents to oversee the entire system. The success of these systems comes from the fact that the agents are simple and their interaction with the environment and neighboring agents is local in nature. In this paper we look at abstractions in the field of swarms and study their applicability in the context of coordination systems. In particular, we focus on the problematic issue of the scalability of Linda systems. The purpose of this work is to look at abstractions yielded from observations of swarms and the way they are organized, and demonstrate how these abstractions may be used to implement a scalable tuple distribution mechanism in a Linda system -- to be named SwarmLinda.
-
CS-2003-05 (10 pages) (pdf)
Abstract
In this paper, we investigate machine learning techniques for discovering knowledge that can be used to monitor the operation of devices or systems. Specifically, we study methods for generating models that can detect anomalies in time series data. The normal operation of a device can usually be characterized in different temporal states. To identify these states, we introduce a clustering algorithm called Gecko that can automatically determine a reasonable number of clusters using our proposed "L" method. We then use the RIPPER classification algorithm to describe these states in logical rules. Finally, transitional logic between the states is added to create a finite state automaton. Our empirical results, on data obtained from the NASA shuttle program, indicate that the Gecko algorithm is comparable to a human expert in identifying states and our overall system can track normal behavior and detect anomalies.
-
CS-2003-06 (13 pages) (pdf)
Abstract
Much of the intrusion detection research focuses on signature (misuse) detection, where models are built to recognize known attacks. However, signature detection, by its nature, cannot detect novel attacks. Anomaly detection focuses on modeling the normal behavior and identifying significant deviations, which could be novel attacks. In this paper we explore two machine learning methods that can construct anomaly detection models from past behavior. The first method is a rule learning algorithm that characterizes normal behavior in the absence of labeled attack data. The second method uses a clustering algorithm to identify outliers.
-
CS-2003-07 (5 pages) (pdf)
Title A Representation Scheme for Finite Length Strings Authors Gaurav Tandon, Debasis Mitra Contact Email Address Faculty Sponsor Dr. Debasis Mitra TR number assignment date 1 April 2003 Abstract
This study is an attempt to create a canonical representation scheme for finite length strings to simplify the study of the theory behind different classes of patterns and to ease the understanding of the underlying separability issues. This could then be used to determine what kinds of techniques are suitable for what class of separability (linear, multi-linear, or non-linear). This representation can then be used in intrusion detection, biological sequences, pattern recognition/classification and numerous other applications.
-
CS-2003-08 (106 pages) (pdf)
Title Pair Programming to Facilitate the Training of Newly-Hired Programmers Authors Mark Anthony Poff Contact Email Address Faculty Sponsor David W. Clay TR number assignment date 7 April 2003 Abstract
Pair programming is touted by some software professionals as a style which allows developers to produce superior code in less time, and with fewer defects, than code produced by individuals. However, the acclaim for the pair methodology is not universal; some see it as a waste of effort which produces marginal improvements. Reported experiments that obtain quantitative results have typically been performed in an educational environment and may not reflect actual workplace conditions. This thesis reports on an experiment using pair programming in an industrial setting. Its goal is to determine if this programming style can be used to increase the technical and environmental knowledge of newly-hired programmers, and to verify the claims stated above.
-
CS-2003-09 (+++ pages) (not yet available)
Title Scenario-Based Testing-What, How, and Why? Authors Sowmya Padmanabhan Contact Email Address Faculty Sponsor Dr. Cem Kaner TR number assignment date 9 April 2003 Abstract
+++
-
CS-2003-10 (18 pages) (pdf)
Abstract
This paper discusses the concept of an algorithm designed to locate the optimal solution to a problem in a (presumably) very large solution space. The algorithm attempts to locate the optimal solution to the problem by beginning a search at an arbitrary point in the solution space and then searching in the "local" area around the start point to find better solutions. The algorithm completes either when it locates what it thinks is the optimal solution or when predefined halt conditions have been met. The algorithm is repair-based, that is, it begins with a given solution and attempts to "repair" that solution by changing one or more of the components of the solution to bring the solution closer to the optimal. The algorithm uses the natural principles of gravity that act on a body in motion through space and simulates those principles to take a given solution to the problem at hand and repair it to locate an optimal solution. In this document two versions of the algorithm, called GLSA1 (based on simple gravitational force calculation) and GLSA2 (based on gravitational field calculation), are presented, along with the manner in which an initial evaluation of the algorithm was conducted.
Then, by way of example a particular problem is given that the algorithm can be used to solve, along with a description of how the algorithm would be used to solve that problem. Finally, some conclusions and opportunities for future direction are presented.
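The abstract describes GLSA only at a high level, so the following is a generic repair-based local search sketch in that spirit (begin at an arbitrary solution, repeatedly repair one component toward a better neighbor, halt at a local optimum); it does not implement the gravitational force or field calculations, which are the papers' own contribution.

```python
def local_search(start, neighbours, cost, max_steps=1000):
    """Greedy repair loop: move to the best neighbour until no improvement."""
    current = start
    for _ in range(max_steps):
        best = min(neighbours(current), key=cost, default=current)
        if cost(best) >= cost(current):   # halt condition: local optimum
            return current
        current = best                    # "repair" one component
    return current

# Toy solution space: integer vectors, cost = squared distance from a target.
target = (3, -2)
cost = lambda v: sum((a - b) ** 2 for a, b in zip(v, target))

def neighbours(v):
    """Repair moves: change one component of the solution by +/-1."""
    for i in range(len(v)):
        for d in (-1, 1):
            yield v[:i] + (v[i] + d,) + v[i + 1:]

print(local_search((0, 0), neighbours, cost))  # converges to (3, -2)
```

On this convex toy cost the greedy loop reaches the optimum; on the large, rugged spaces the paper targets, the halt conditions and the repair heuristic matter far more.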
-
CS-2003-11 (87 pages) (pdf)
Title Strategies for Testing Web Applications from the Client Side Authors Lawrence T. Prevatte III Contact Email Address Faculty Sponsor Dr. James Whittaker TR number assignment date 25 April 2003 Abstract
This thesis is presented on the testing of computer software designed to operate over the World Wide Web. A fault model to guide Web software testing is developed, the components of Web software applications and their interfaces to each other are defined, and strategies to test Web software are discussed. The approach taken is to simplify the concepts of testing Web software, aiding a methodical treatment of the techniques required to conduct meaningful tests with a minimum of testing resources; the tests are conducted from the client side of the Web application during the "ready for users" or acceptance phase. This text presents how the industry arrived at the current state of Web application quality, how the industry views Web applications, how to test them, strategies for testing Web applications from the client side, and interesting examples using the proposed methodology.
-
CS-2003-12 (88 pages) (pdf)
Abstract
Software life cycle processes play a key role in the successful development of software products. Software processes continue to evolve as the field of software development progresses toward being a true engineering discipline. This thesis has two objectives:
(1) To apply software engineering knowledge gained during the pursuit of this degree toward the design and development of a complete, Software Engineering Institute (SEI) Capability Maturity Model (CMM) compliant process for the design phase of the software lifecycle.
(2) To evaluate a Waterfall Lifecycle Model vs. an Iterative Lifecycle Model and compare SEI CMM compliant processes to non-SEI CMM compliant processes using a "real world" software development project.
-
CS-2003-13 (160 pages) (pdf)
Abstract
The current approach to detecting novel attacks in network traffic is to model the normal frequency of session IP addresses and server port usage and to signal unusual combinations of these attributes as suspicious. We make four major contributions to the field of network anomaly detection. First, rather than just model user behavior, we also model network protocols from the data link through the application layer in order to detect attacks that exploit vulnerabilities in the implementation of these protocols. Second, we introduce a time-based model suitable for the bursty nature of network traffic: the probability of an event depends on the time since it last occurred rather than just its average frequency. Third, we introduce an algorithm for learning conditional rules from attack free training data that are sensitive to anomalies. Fourth, we extend the model to cases where attack-free training data is not available.
On the 1999 DARPA/Lincoln Laboratory intrusion detection evaluation data set, our best system detects 75% of novel attacks by unauthorized users at 10 false alarms per day after training only on attack-free traffic. However this result is misleading because the background traffic is simulated and our algorithms are sensitive to artifacts. We compare the background traffic to real traffic collected from a university departmental server and conclude that we could realistically expect to detect 30% of these attacks in this environment, or 47% if we are willing to accept 50 false alarms per day.
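A minimal sketch of the time-based idea described above: the score of a novel value grows with the time since the attribute last produced one, rather than depending only on its average frequency. The scoring formula here (time since last novelty, scaled by observations over distinct values) is our own simplified illustration, not necessarily the dissertation's exact model.

```python
class TimeBasedModel:
    """Toy time-based novelty scorer: novel values are more suspicious
    the longer the attribute has gone without producing one."""

    def __init__(self):
        self.seen = set()      # distinct values observed so far
        self.n = 0             # total observations
        self.last_novel = 0    # time the last novel value appeared

    def train(self, value, t):
        if value not in self.seen:
            self.seen.add(value)
            self.last_novel = t
        self.n += 1

    def score(self, value, t):
        """0 for familiar values; (time since last novelty) * n / r else."""
        if value in self.seen:
            self.n += 1
            return 0.0
        s = (t - self.last_novel) * self.n / max(len(self.seen), 1)
        self.train(value, t)
        return s

m = TimeBasedModel()
for t, v in enumerate(["GET", "GET", "POST", "GET"]):
    m.train(v, t)
print(m.score("GET", 10))       # 0.0: value seen before
print(m.score("\x90\x90", 11))  # positive: novel value after a quiet period
```

This captures the bursty-traffic intuition: a novelty after a long quiet period scores higher than one arriving amid constant churn.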
-
CS-2003-14 (165 pages) (pdf)
Abstract
This dissertation considers the implementation of exception handling specifically for functional languages. Some implementations incur overhead for using exception handling even when no exceptions are raised. We show the results of some experiments with the SML of New Jersey and OCAML compilers, two well-known compilers for functional languages. Imperative languages avoid this overhead by using tables, but the approach does not easily transfer to compilers using continuation passing style (CPS). This dissertation proposes an approach that works with CPS compilers like SML of New Jersey.
We first present an experiment where programs in SML are written with and without exception handlers. From these results, we conclude that programs with exception handling incur overhead even when no exceptions are raised. Then, we analyze the source of the exception handling overhead in the SML of New Jersey compiler and present a solution to the problem. The new approach uses two continuations instead of one. One continuation encapsulates the rest of the normal computation, as usual; the second is used for passing the abnormal computation. The second continuation is not passed as an extra argument but as a displacement from the first continuation.
We have implemented a basic CPS compiler for functional languages with exception handling. With it we were able to implement the new approach to exception handling and compare it side-by-side with the approach taken by the SML of New Jersey compiler. We show that the new approach to exception handling adds no overhead to the normal flow of control.
The importance of the new approach to exception handling for CPS compilers proposed in this dissertation is the improved run-time performance in every case in which an exception handler is used.
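The two-continuation scheme can be illustrated outside ML; this hedged Python sketch passes a normal continuation k and an abnormal continuation h explicitly (the dissertation's displacement encoding of the second continuation is not reproduced here, and the function names are our own).

```python
def div_cps(a, b, k, h):
    """CPS division: call k with the result, or h on an abnormal condition."""
    if b == 0:
        return h("Division_by_zero")
    return k(a / b)

def handle(expr_cps, handler, k):
    """Install `handler` as the abnormal continuation around expr_cps.
    The handler's value is fed back into the normal continuation k."""
    return expr_cps(k, lambda exn: k(handler(exn)))

result = handle(lambda k, h: div_cps(10, 0, k, h),
                lambda exn: float("inf"),   # the handler's replacement value
                lambda v: v)
print(result)  # inf
```

Note that the normal path (`b != 0`) never touches h: that is the property the dissertation measures, namely no overhead when no exception is raised.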
-
CS-2003-15 (81 pages) (pdf)
Title An Automated Oracle for Verifying GUI Objects (Master's Thesis) Authors Juichi Takahashi Contact Email Address Faculty Sponsor Dr. James A. Whittaker TR number assignment date 8 June 2003 Abstract
The promise of test automation - the automatic application of software inputs to find bugs - is tempered by the difficulty of automatically verifying behavior. Indeed, the lack of tools for behavior verification is a major factor in keeping automated testing out of the mainstream.
This thesis develops a technique that aids automatic behavior verification for a particularly difficult problem: determining the correctness of screen output. A methodology to capture and compare screen output is presented, and a case study using Microsoft® PowerPoint® is described.
-
CS-2003-16 (11 pages) (pdf)
Abstract
We introduce an algorithm called LERAD that learns rules for finding rare events in nominal time-series data with long range dependencies. We use LERAD to find anomalies in network packets and TCP sessions to detect novel intrusions. LERAD outperforms the original participants in the 1999 DARPA/Lincoln Laboratory intrusion detection evaluation, and detected most attacks that eluded a firewall in a university departmental server environment.
-
CS-2003-17 (8 pages) (pdf)
Abstract
A continuum of general to specific interests of a user, called a user interest hierarchy (UIH), represents a user's interests at different abstraction levels. A UIH can be learned from a set of web pages visited by a user. In this paper, we focus on improving the learning of the UIH by adding phrases. We propose the VPF algorithm, which can find variable-length phrases without any user-defined parameter. To identify meaningful phrases, we examine various correlation functions with respect to well-known properties and other properties that we propose.
-
CS-2003-18 (8 pages) (pdf)
Abstract
We investigate techniques to automatically determine the number of clusters to return from hierarchical clustering and segmentation algorithms. We propose an efficient algorithm, the L Method, that finds the "knee" in a '# of clusters vs. clustering evaluation metric' graph. Using the knee is a well-known but not particularly well-understood method of determining the number of clusters. We explore the feasibility of this method and attempt to determine in which situations it will and will not work.
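A sketch of the L Method idea, under our own simplified reading: for each candidate knee, fit one line to the left portion of the evaluation-metric curve and one to the right, and pick the split that minimizes the length-weighted fitting error. The paper's exact weighting and refinement details may differ.

```python
def _fit_rmse(xs, ys):
    """Least-squares line fit; returns RMSE of the residuals."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = 0.0 if sxx == 0 else sum((x - mx) * (y - my)
                                     for x, y in zip(xs, ys)) / sxx
    b = my - slope * mx
    return (sum((y - (slope * x + b)) ** 2
                for x, y in zip(xs, ys)) / n) ** 0.5

def l_method(xs, ys):
    """Return the x at the knee of the (x, y) evaluation-metric curve."""
    best_c, best_err = None, float("inf")
    for c in range(2, len(xs) - 1):            # need >= 2 points per side
        left = _fit_rmse(xs[:c], ys[:c])
        right = _fit_rmse(xs[c:], ys[c:])
        err = (c * left + (len(xs) - c) * right) / len(xs)
        if err < best_err:
            best_c, best_err = c, err
    return xs[best_c - 1]                      # last point of the left line

# Steep improvement up to 5 clusters, flat afterwards -> knee at x = 5.
xs = list(range(1, 11))
ys = [100, 80, 60, 40, 20, 10, 9, 8, 7, 6]
print(l_method(xs, ys))  # -> 5
```

On this synthetic curve both sides of the true knee are exactly linear, so the split at 5 has zero fitting error and is chosen unambiguously.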
-
CS-2003-19 (8 pages) (pdf)
Abstract
Detecting known vulnerabilities (Signature Detection) is not sufficient for complete security. This has raised recent interest in Anomaly Detection (AD), in which a model is built from normal behavior and significant deviations from this model are flagged as anomalous. However, most AD algorithms assume clean training data, which can be hard to obtain. Our proposed algorithm relaxes this assumption. To this end, we define the notion of a strong outlier: a point that is suspicious at both the local and global levels. Finally, we illustrate the effectiveness of our approach on the DARPA '99 dataset and find that, at 10 false alarms per day, it is on par in number of detections with the best participants in the original evaluation, who employed a hybrid of techniques.
-
CS-2003-20 (13 pages) (pdf)
Abstract
Many approaches have been suggested and various systems have been modeled to detect intrusions from the anomalous behavior of system calls as a result of an attack. Though these techniques have been shown to be quite effective, a key element seems to be missing -- the inclusion and utilization of system call arguments to create a richer, more valuable signature, and the use of this information to model the intrusion detection system more accurately. We put forth the idea of a rule-learning approach that builds rules based upon system calls and models normal traffic using system call arguments and other key attributes. We present variations of our techniques and compare the results with those from some well-known techniques based upon system call sequences. The results show that system call argument information is crucial and helps successfully detect U2R, R2L, and Data attacks while generating fewer false alarms.
-
CS-2003-21 (15 pages) (pdf)
Abstract
Most current anomaly detection methods for network traffic rely on the packet header for studying network traffic behavior. We believe that significant information lies in the payload of the packet, and hence it is important to model the payload as well. Since many protocols exist and new protocols are frequently introduced, parsing the payload based on the protocol specification is time-consuming. Instead of relying on the specification, we propose four different characteristics of streams of bytes, which can help us develop algorithms for parsing the payload into tokens. We feed the tokens extracted from the payload to an anomaly detection algorithm. Our empirical results indicate that our parsing techniques can extract tokens that improve the detection rate.
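The four byte-stream characteristics are not given in the abstract, so this sketch tokenizes a payload with one simple protocol-agnostic heuristic of our own (token boundaries wherever a non-printable or space byte occurs), purely to illustrate the step of feeding payload tokens to a detector.

```python
def tokenize_payload(payload: bytes):
    """Split a raw payload into tokens without knowing its protocol."""
    tokens, current = [], bytearray()
    for b in payload:
        if 33 <= b <= 126:             # printable, non-space: token body
            current.append(b)
        else:                          # space/control/binary: boundary
            if current:
                tokens.append(bytes(current))
                current = bytearray()
    if current:
        tokens.append(bytes(current))
    return tokens

payload = b"GET /index.html HTTP/1.0\r\nHost: example.com\r\n\r\n"
print(tokenize_payload(payload))
# -> [b'GET', b'/index.html', b'HTTP/1.0', b'Host:', b'example.com']
```

The same function works on an unknown binary protocol, which is the point: no per-protocol parser has to be written before anomaly detection can begin.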
-
CS-2003-22 (232 pages) (pdf)
Title A Taxonomy of E-Commerce Risks and Failures (Master's Thesis) Authors Giridharan Vijayaraghavan Contact Email Address Faculty Sponsor Dr. Cem Kaner TR number assignment date 16 July 2003 Abstract
Imagine being asked to test an e-commerce website. If you haven't tested one before, where would you start? What experience would you draw on? Where would you look for more information? Even very experienced testers have blind spots when they try to generate test ideas for an application that they have not tested. This thesis presents a simple outline that will help you generate test ideas and limit your blind spots. The outline is the result of about 18 months of research on classifying e-commerce related failures and risks. The result has 45 top-level categories and 700+ examples of errors (potential issues to test for) under most categories. In many cases, there is also a link to about 300+ examples of e-commerce defects that have been publicized in the press.
Using the list, you could pick a category of interest (such as accessibility or software upgrade), read descriptions of several types of problems that fit within that category, and so identify a few issues that would be appropriate to test for in your application. Based on feedback to the authors of Testing Computer Software (Kaner, et al.), I believe that many testers will be able to use this list to identify potential problems that they would otherwise have missed.
-
CS-2003-23 (100 pages) (pdf)
Title An Algorithm for Clearing Combinatorial Markets Authors Josiane Domgang Nzouonta Contact Email Address Faculty Sponsor Dr. Marius Silaghi TR number assignment date 15 December 2003 Abstract
It was recently shown to be possible to solve single-item auctions without revealing any secret except for the solution. Namely, with vMB-share [4], the seller and the buyer only learn each other's identity and the selling price for a chosen M+1 pricing scheme. No trusted party was necessary. In this thesis, we show how vMB-share can be extended to the clearing of combinatorial negotiations with several items, buyers, and sellers. We first show how the more general problem can be reduced to a virtual form relatively similar to single-item auctions. Then, some modifications are made to the cryptographic techniques of vMB-share so that it can offer a solution to problems in virtual form. While the problem is known to be NP-complete, a secure multiparty computation that avoids leaking secrets through computation-time measurements necessarily incurs an exponential computation cost. Some early experiments show that small realistic negotiations can nevertheless be solved with acceptable effort.
-
CS-2003-24 (143 pages) (pdf)
Title Algorithms for String Searching on a Beowulf Cluster Authors Kishore Ramakrishna Kattamuri Contact Email Address Faculty Sponsor Dr. William D. Shoaff TR number assignment date 18 December 2003 Abstract
String matching is a subject of both theoretical and practical interest in computer science. Theoretically, the interest lies in the time and space complexity of the algorithms; on the practical side, searching for keywords, names, phrases, etc. is common. Searching for patterns in DNA (Deoxyribose Nucleic Acid) is one such application and is the ultimate target area for this study.
This study resulted in an efficient parallel way to search for patterns, with the program operated on a parallel computer. This research provides a basis for further study and helps in selecting search algorithms for broader use.
The algorithms proposed in this research address the problem in both the single-pattern and multiple-pattern search areas. The outcome of this study was very encouraging and paved the path for possible future study. The algorithms were written in C using the MPI model on the Beowulf cluster of Florida Tech (Bluemarlin). Their timings were compared against each other. The test data set was kept uniform throughout the experiments.
Though several ideas were formulated and attempts were made to implement them, only three algorithms (KTV, KTV2 and KPrime) were designed. Of these, KTV and KTV2 showed good results, whereas KPrime took far more time than expected. Investigation of the algorithm revealed a further tweaking, which is left as part of future study.
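The KTV, KTV2, and KPrime algorithms are not specified in the abstract; the sketch below only illustrates the standard data-parallel idea behind such work: split the text into chunks with an overlap of len(pattern) - 1 so matches straddling a chunk boundary are not lost. (The thesis distributes chunks across MPI ranks in C; here each chunk is simply searched in a loop.)

```python
def search_chunked(text: str, pattern: str, n_chunks: int = 4):
    """Find all occurrences of pattern, chunk by chunk with overlap."""
    m = len(pattern)
    size = max(1, len(text) // n_chunks)
    hits = set()
    for start in range(0, len(text), size):
        # extend each chunk by m - 1 bytes so boundary matches are found
        end = min(len(text), start + size + m - 1)
        i = text.find(pattern, start, end)
        while i != -1:
            hits.add(i)
            i = text.find(pattern, i + 1, end)
    return sorted(hits)

text = "acgtacgtacgt"                 # toy DNA-like input
print(search_chunked(text, "gtac"))   # -> [2, 6]
```

Because chunks overlap, the same match can be reported by two workers; the set-based merge step deduplicates, mirroring the result-gathering phase of an MPI implementation.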