Computer Science Technical Reports
2002
-
CS-2002-01 (223 pages) (pdf)
Title: A Framework for Multilingual Information Processing
Author: Steven Edward Atkin
Contact Email Address:
Faculty Sponsor: Ryan Stansifer
TR number assignment date: February 18, 2002
Abstract
Recent and continuing rapid increases in computing power now enable more of humankind's written communication to be represented as digital data. The most recent and obvious changes in multilingual information processing have been the introduction of larger character sets encompassing more writing systems. Yet the very richness of larger collections of characters has made the interpretation and processing of text more difficult.
The many competing motivations (satisfying the needs of linguists, computer scientists, and typographers) for standardizing character sets threaten the purpose of information processing: accurate and facile manipulation of data. Existing character sets are constructed without a consistent strategy or architecture. Complex algorithms and reports are now necessary to understand raw streams of characters representing multilingual text.
We assert that information processing is an architectural problem and not just a character set problem. We analyze several multilingual information processing algorithms (e.g., bidirectional reordering and character normalization) and conclude that they are more dangerous than beneficial. The countless unexpected interactions suggest the lack of a coherent architecture.
We introduce abstractions and novel mechanisms, and take the first steps towards organizing them into a new architecture for multilingual information processing. We propose a multilayered architecture, which we call Metacode, in which character sets appear in lower layers and protocols and algorithms in higher layers. We recast bidirectional reordering and character normalization in the Metacode framework.
-
CS-2002-02 (7 pages) (pdf)
Abstract
Linda is a coordination language capable of solving issues in distributed computing environments that relate to process synchronization, communication, and creation. The expressiveness of Linda in distributed systems is such that researchers are proposing novel applications using Linda as a primary means of coordination. The examples range from peer-to-peer to groupware computing, from simple chat applications to control systems. Surprisingly, Linda has not been used in the field of distributed databases, although it can be helpful in solving coordination issues in a distributed database system. In this paper, we look at the possibility of using Linda in the context of distributed databases.
-
CS-2002-03 (174 pages) (pdf)
Title: Coordination using logical operators in Linda
Author: James Kendall Snyder
Contact Email Address:
Faculty Sponsor: Ronaldo Menezes
TR number assignment date: March 4, 2002
Abstract
During the past 20 years, research in coordination has had success in demonstrating that distributed systems are made of two distinct parts: computation and coordination. One of the models that has contributed to the success of coordination is LINDA. While it is an extremely powerful coordination model, LINDA has limitations in expressing parallel access to tuple spaces. This thesis proposes an extension to LINDA called LOGOP LINDA which combines tuple spaces using logical operators, thus enabling a single primitive to access multiple tuple spaces in parallel. Essentially, this concept replaces serial access to multiple tuple spaces using the same primitive, tuple, or template. Based on a single implementation supporting both LINDA and LOGOP LINDA, empirical results show that LOGOP LINDA is more expressive, scalable, and efficient at both the model and the implementation level.
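To illustrate the idea, the following sketch contrasts plain LINDA-style serial reads over several tuple spaces with a single read over a logically combined space. The API, the `|` operator, and the sequential search are assumptions made for illustration; the thesis's actual LOGOP LINDA primitives and their parallel implementation are not reproduced here.

```python
# Hypothetical sketch (not the thesis's actual API) of the LogOp idea: combining
# tuple spaces with a logical operator so one primitive reaches several spaces.

class TupleSpace:
    def __init__(self, tuples=()):
        self.tuples = list(tuples)

    def rd(self, template):
        """Non-blocking read: return the first tuple matching the template
        (None in the template matches any field)."""
        for tup in self.tuples:
            if len(tup) == len(template) and all(
                    t is None or t == f for f, t in zip(tup, template)):
                return tup
        return None

    def __or__(self, other):
        """LogOp-style 'or': a combined view that one primitive can query."""
        return TupleSpace(self.tuples + other.tuples)

ts1 = TupleSpace([("temp", 21)])
ts2 = TupleSpace([("humidity", 60)])

# Plain LINDA: serial access, one rd per tuple space.
result = ts1.rd(("humidity", None)) or ts2.rd(("humidity", None))

# LogOp LINDA (sketched): a single rd over ts1 | ts2.
result_logop = (ts1 | ts2).rd(("humidity", None))
print(result, result_logop)   # ('humidity', 60) ('humidity', 60)
```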
-
CS-2002-04 (61 pages) (pdf)
Title: Measuring Pointing Times of a Non-visual Haptic Interface
Author: Gary Wayne Hrezo
Contact Email Address:
Faculty Sponsor: William D. Shoaff
TR number assignment date: March 4, 2002
Abstract
An experiment was performed to evaluate the effectiveness of using haptics (force feedback of a manual joystick) in a non-visual computing environment. It was believed that a haptic display would reduce, if not eliminate, the need for the visual sense when attempting to view graphical information. Sight-impaired people could use haptic interfaces to facilitate the navigation of human-computer interfaces, which are, by their nature, graphically intensive. Subjects moved a force feedback joystick a random distance until it was over a vibrating target of random width. An inwards/outwards and a left/right test were administered. Movement times were found to be a linear function of the level of task difficulty as defined by Shannon's formulation of Fitts' law, log2(D/W + 1). Two linear relationships were found: one when the joystick traveled the smallest distance, and another for all other distances. Results indicate that haptics could be used to convey graphical information.
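As a concrete illustration of the movement-time model, the short sketch below computes Shannon's index of difficulty, ID = log2(D/W + 1), and a predicted movement time under the usual linear form MT = a + b * ID. The intercept and slope values are placeholders, not the coefficients fitted in the experiment.

```python
# Illustrative sketch of the movement-time model used in the experiment:
# Shannon's formulation of Fitts' law, ID = log2(D/W + 1), with movement
# time assumed linear in ID (the coefficients below are made up, not the
# report's fitted values).
import math

def index_of_difficulty(distance, width):
    """Shannon formulation: ID in bits for target distance D and width W."""
    return math.log2(distance / width + 1)

def predicted_movement_time(distance, width, a=0.3, b=0.2):
    """MT = a + b * ID, with hypothetical intercept a (s) and slope b (s/bit)."""
    return a + b * index_of_difficulty(distance, width)

for d, w in [(50, 10), (200, 10), (200, 40)]:
    print(f"D={d} W={w}  ID={index_of_difficulty(d, w):.2f} bits  "
          f"MT~{predicted_movement_time(d, w):.2f} s")
```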
-
CS-2002-05 (99 pages) Preface (pdf) Paper (pdf)
Title: Modeling Web User Interest with Implicit Indicators
Author: KiSub Jung
Contact Email Address:
Faculty Sponsor: Phillip Chan
TR number assignment date: March 18, 2002
Abstract
Many research organizations are developing intelligent agents for the World Wide Web that can learn a user's interests and find information based on the user's profile (e.g., Pazzani et al. 1997; Goecks et al. 2001). Some researchers have found relationships between a user's behavior and the user's interest level, while others have used content analyzers to build models that predict a user's interest level in a web page when the user visits it. We consider building predictive models using only the user's behavior together with the user's explicit ratings. Predicting a user's interest level this way raises several problems in recording and analyzing the relationship between interest level and behavior on the World Wide Web. To monitor a user's interest indicators, such as the number of mouse clicks, while the user browses, we need monitoring software that records what the user does in the web browser and how much the user likes the web page. Once we have identified interest indicators general enough to build predictive models, we can build models that predict a user's interest level from the user's behavior, for example using regression analysis and neural networks. The two main questions follow:
- Can we record a user's interest indicators that are general enough to build predictive models?
- Can we build predictive models that predict a user's interest level using only the user's behavior, after the user leaves the web page?
In this study, we show how we collect each user's interest indicators and build predictive models that predict a user's interest level by recording and analyzing only the user's behavior together with the user's explicit ratings on the World Wide Web.
(Editor's Note: This is a slightly modified version of the accepted Master's Thesis. The accepted version is not available in electronic form.)
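As a rough illustration of the regression approach mentioned in the abstract, the sketch below fits an ordinary least squares model that predicts an explicit rating from a few implicit indicators. The indicator names and the numbers are invented for illustration and are not the thesis's data or feature set.

```python
# Minimal sketch of the regression idea described above: predict an explicit
# interest rating from implicit indicators (the indicator names and numbers
# here are made up for illustration, not the thesis's data).
import numpy as np

# Each row: [mouse clicks, seconds on page, scroll events]; target: rating 1-5.
X = np.array([[ 2,  10, 1],
              [ 8,  95, 6],
              [ 1,   5, 0],
              [12, 180, 9],
              [ 5,  60, 4]], dtype=float)
y = np.array([1, 4, 1, 5, 3], dtype=float)

# Ordinary least squares with an intercept column.
A = np.hstack([np.ones((X.shape[0], 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict(clicks, seconds, scrolls):
    return float(coef @ [1, clicks, seconds, scrolls])

print(predict(6, 70, 5))   # predicted interest level for an unseen page visit
```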
-
CS-2002-06 (10 pages) (pdf)
Abstract
Traditional intrusion detection systems (IDS) detect attacks by comparing current behavior to signatures of known attacks. One main drawback is the inability to detect new attacks which do not have known signatures. In this paper we propose a learning algorithm that constructs models of normal behavior from attack-free network traffic. Behavior that deviates from the learned normal model signals possible novel attacks. Our IDS is unique in two respects. First, it is nonstationary, modeling probabilities based on the time since the last event rather than on the average rate; this prevents alarm floods. Second, the IDS learns protocol vocabularies (from the data link layer through the application layer) in order to detect unknown attacks that attempt to exploit implementation errors in poorly tested features of the target software. On the 1999 DARPA IDS evaluation data set [9], we detect 70 out of 180 attacks (with 100 false alarms), about evenly divided between user behavior anomalies (IP addresses and ports, as modeled by most other systems) and protocol anomalies. Because our methods are unconventional, our IDS overlaps little with the original DARPA participants, which implies that the systems could be combined to increase coverage.
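The following sketch illustrates the nonstationary idea in general terms: an event value is scored by the time since it was last observed, so a repeated novel value does not flood the operator with alarms. The score and threshold used here are assumptions for illustration, not the paper's model.

```python
# Sketch of the nonstationary scoring idea described above: an event's value is
# scored by the time since that value was last observed, so a burst of the same
# novel value raises one alarm rather than a flood. Details (the exact score
# and thresholds) are assumed for illustration, not taken from the paper.
import time

class NonstationaryModel:
    def __init__(self, threshold=60.0):
        self.last_seen = {}        # value -> timestamp of last observation
        self.threshold = threshold # seconds of "staleness" that triggers an alarm

    def observe(self, value, now=None):
        now = time.time() if now is None else now
        elapsed = now - self.last_seen.get(value, float("-inf"))
        self.last_seen[value] = now
        return elapsed > self.threshold, elapsed

model = NonstationaryModel()
print(model.observe("TELNET option 0x27", now=0.0))    # (True, inf)  - never seen
print(model.observe("TELNET option 0x27", now=1.0))    # (False, 1.0) - repeat, no flood
print(model.observe("TELNET option 0x27", now=5000.0)) # (True, 4999.0) - stale again
```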
-
CS-2002-07 (22 pages) (pdf)
Abstract
For an electronic business (e-business), customer satisfaction can be the difference between long-term success and short-term failure. Customer satisfaction is highly impacted by web server availability, as customers expect a web site to be available twenty-four hours a day, seven days a week. Unfortunately, unscheduled web server downtime is often beyond the control of the organization. What is needed is an effective means of identifying and recovering from web server downtime in order to minimize the negative impact on the customer. An automated architecture, called WebSpy, has been developed to notify administrators and take immediate action when web server downtime is detected. This paper describes the WebSpy architecture and differentiates it from other popular web monitoring tools. The results of a case study are presented to demonstrate WebSpy's effectiveness in monitoring web server availability.
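The sketch below illustrates the general monitor-and-notify pattern that such a tool embodies; it is not the WebSpy implementation. The URL, polling interval, and notification action are placeholders.

```python
# Not the actual WebSpy implementation -- a minimal sketch of the monitor-and-
# notify pattern the paper describes. The URL, interval, and notification
# action are placeholders.
import time
import urllib.error
import urllib.request

def server_is_up(url, timeout=5):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        return False

def notify_admin(message):
    # A real tool would e-mail administrators and trigger recovery actions here.
    print("ALERT:", message)

def monitor(url, interval=30, checks=3):
    for _ in range(checks):                 # bounded loop for the example
        if not server_is_up(url):
            notify_admin(f"{url} appears to be down at {time.ctime()}")
        time.sleep(interval)

if __name__ == "__main__":
    monitor("http://www.example.com/", interval=5, checks=2)
```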
-
CS-2002-08 (48 pages) (pdf)
Abstract
Network intrusion detection systems often rely on matching patterns that are gleaned from known attacks. While this method is reliable and rarely produces false alarms, it has the obvious disadvantage that it cannot detect novel attacks. An alternative approach is to learn a model of normal traffic and report deviations, but these anomaly models are typically restricted to modeling IP addresses and ports, and do not include the application payload where many attacks occur. We describe a novel approach to anomaly detection. We extract a set of attributes from each event (IP packet or TCP connection), including strings in the payload, and induce a set of conditional rules that have a very low probability of being violated in a nonstationary model of the normal network traffic in the training data. On the 1999 DARPA intrusion detection evaluation data set, we detect about 60% of 190 attacks at a false alarm rate of 10 per day (100 total). We believe that anomaly detection can work because most attacks exploit software or configuration errors that escaped field testing and so are only exposed under unusual conditions.
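To make the rule-based approach concrete, the sketch below checks events against conditional rules of the form "if these attributes have these values, then this attribute must take one of these values." The rules shown are hand-written placeholders, not rules learned by the system described in the paper, and no anomaly score is computed.

```python
# Hedged sketch of the rule-checking idea described above: each rule says that
# when some attributes take given values, another attribute should stay within
# an allowed set. The rules below are hand-written placeholders, not rules
# learned by the paper's algorithm.

RULES = [
    # (antecedent attribute/value pairs, constrained attribute, allowed values)
    ({"dest_port": 80}, "first_word", {"GET", "POST", "HEAD"}),
    ({"proto": "tcp", "dest_port": 25}, "first_word", {"HELO", "EHLO"}),
]

def violations(event):
    """Return the rules that the event violates."""
    bad = []
    for antecedent, attr, allowed in RULES:
        if all(event.get(k) == v for k, v in antecedent.items()):
            if event.get(attr) not in allowed:
                bad.append((antecedent, attr, allowed))
    return bad

normal = {"proto": "tcp", "dest_port": 80, "first_word": "GET"}
weird  = {"proto": "tcp", "dest_port": 80, "first_word": "\x90\x90\x90"}
print(len(violations(normal)), len(violations(weird)))   # 0 1
```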
-
CS-2002-09 (272 pages) (pdf)
Title: Software Design Based on Operational Modes
Author: Alan Albert Jorgensen
Contact Email Address:
Faculty Sponsor: None
TR number assignment date: 23 July 2002
Abstract
The use of software is ubiquitous despite its reputation for low reliability. This dissertation experimentally verifies this reputation and then proposes changes to the development process to prevent certain classes of failures. I begin by presenting numerous examples of software failures from modern, professionally tested software products. The root cause of each of these failures can be traced to incorrect partitioning of internally stored data.
I propose a new design technique based on a recently developed testing concept called "operational modes." Operational modes allow correct decomposition (abstraction) of software states defined by storage constraints and describe the cause of a large class of software failures. Operational mode design is influenced by four constraining software features: input, output, computation, and data storage. From this improved definition of operational modes, four classes of failure are derived: 1) improperly constrained input, 2) improperly constrained output, 3) improperly constrained computation, and 4) improperly constrained internal data. Illustrative examples of these failure classes are presented from a number of published programs.
I propose changes to the software design process to eliminate these four identified categories of defects by proper identification and implementation of system constraints, i.e., operational modes that correctly partition program data. This new theory provides developers with a methodical mechanism to prevent a large class of software faults and gives software testers a road map to the broad class of software behaviors that must be tested.
I demonstrate the application of this design process modification with a small example that, though proven correct in the literature, fails due to a lack of proper constraint checking. The resulting example program no longer contains these defects as a direct result of the improvements to the design process. The process is further verified by redesigning an example program from a modern software development text. Not only does the technique correct a defect in that example, it also results in a function that is clearly specified and eliminates the need to rely on "clever" design to achieve the desired results.
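As a toy illustration (not an example from the dissertation) of the four constraint categories, the sketch below shows a small bounded counter that explicitly checks its input, computation, output, and internally stored data.

```python
# Illustrative sketch (not an example from the dissertation) of checking the
# four kinds of constraints listed above on a tiny bounded counter: input,
# output, computation, and internally stored data.

class BoundedCounter:
    MAX = 100  # storage constraint that partitions the internal data

    def __init__(self):
        self.value = 0                      # internal data: 0 <= value <= MAX

    def add(self, amount):
        if not isinstance(amount, int) or amount < 0:
            raise ValueError("input constraint: amount must be a non-negative int")
        result = self.value + amount
        if result > self.MAX:
            raise OverflowError("computation constraint: sum exceeds MAX")
        self.value = result
        assert 0 <= self.value <= self.MAX, "internal data constraint violated"
        return self.value                   # output constraint: always in 0..MAX

c = BoundedCounter()
print(c.add(40), c.add(60))   # 40 100
try:
    c.add(1)                  # would leave the bounds of the 'full' operational mode
except OverflowError as e:
    print(e)
```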
-
CS-2002-10 (15 pages) (pdf)
Title: Unicode Compress: Does Size Really Matter?
Authors: Steve Atkin and Ryan Stansifer
Contact Email Address:
Faculty Sponsor: Ryan Stansifer
TR number assignment date: 31 July 2002
Abstract
The Unicode standard provides several algorithms, techniques, and strategies for assigning, transmitting, and compressing Unicode characters. These techniques allow Unicode data to be represented in a concise format in several contexts. In this paper we examine several techniques and strategies for compressing Unicode data using the programs gzip and bzip. Unicode compression algorithms known as SCSU and BOCU are also examined. As far as size is concerned, algorithms designed specifically for Unicode may not be necessary.
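A quick way to get a feel for the size question is to compress a Unicode sample in different encodings with general-purpose compressors, as in the sketch below using Python's gzip and bz2 modules. This is only analogous to the paper's experiment: the sample text is arbitrary, and SCSU and BOCU are not included.

```python
# Quick sketch, analogous to the paper's experiment: compare the size of a
# Unicode sample in UTF-8 and UTF-16 before and after general-purpose
# compression. (SCSU and BOCU are not in the standard library, so only gzip
# and bzip2 are shown; the sample text is arbitrary.)
import bz2
import gzip

sample = ("Unicode compression: does size really matter? " * 50 +
          "Пример текста. 例のテキスト。")

for name, data in [("UTF-8", sample.encode("utf-8")),
                   ("UTF-16", sample.encode("utf-16"))]:
    print(f"{name:7s} raw={len(data):5d}  "
          f"gzip={len(gzip.compress(data)):5d}  "
          f"bzip2={len(bz2.compress(data)):5d}")
```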
-
CS-2002-11 (180 pages) (pdf)
Title: A Unique Examination of the Buffer Overflow Condition
Author: Terry B. Gillette
Contact Email Address:
Faculty Sponsor: James Whittaker
TR number assignment date: 27 August 2002
Abstract
Buffer overflows have been the most common form of security vulnerability for the last ten years. Moreover, buffer overflow vulnerabilities enable the type of exploits that dominate remote network penetration. Because our reliance on commercial third-party software is critical in the current computing environment, one must ask how these vulnerabilities are discovered in released proprietary software.
This thesis presents research focused on the fundamental issues surrounding the buffer overflow vulnerability. The objective is to analyze and understand the technical nature of this type of vulnerability and, on that basis, develop an efficient generic method that can improve the detection of this software flaw in released, proprietary software systems. The work is performed from the perspective of a security auditor searching for a single vulnerability in a released program, an approach that differs from the many previous studies focusing on static source code analysis and run-time fault injection. First, for systems that include commercial off-the-shelf software components, we perform a systematic review of buffer overflow exploit data and develop a classification hierarchy.
The goal of this new taxonomy is to provide a tool that assists the auditor in developing the heuristic elements for exploratory testing. Second, we propose that signature analysis of a disassembled binary executable can lead to the discovery of a buffer overflow vulnerability. In support of this argument we demonstrate a methodology that can be used on closed-source proprietary software where only the executable binary image is available. In this case, the key contribution is not rapid automated detection of a buffer overflow vulnerability but the proof of concept that security flaws can be detected by binary scanning techniques.
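The sketch below illustrates the general flavor of a signature scan over a disassembly listing: it flags call sites to library functions commonly involved in buffer overflows. It assumes objdump-style disassembly text as input and is not the thesis's actual methodology or tooling.

```python
# Hedged sketch of a signature scan over a disassembly listing: flag call sites
# to functions commonly involved in buffer overflows. This illustrates the idea
# of binary scanning only; it is not the thesis's actual methodology. The input
# is assumed to be objdump-style output saved to a text file.
import re
import sys

UNSAFE = {"strcpy", "strcat", "sprintf", "gets", "scanf"}
CALL = re.compile(r"^\s*([0-9a-f]+):.*\bcall\w*\s+\S+\s+<(\w+)(?:@plt)?>", re.I)

def scan(disassembly_path):
    hits = []
    with open(disassembly_path) as f:
        for line in f:
            m = CALL.match(line)
            if m and m.group(2) in UNSAFE:
                hits.append((m.group(1), m.group(2)))
    return hits

if __name__ == "__main__":
    for addr, func in scan(sys.argv[1]):       # e.g. a file produced by: objdump -d prog
        print(f"possible overflow site at 0x{addr}: call to {func}")
```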
-
CS-2002-12 (5 pages) (pdf)
Title: Network Traffic Anomaly Detection Based on Packet Bytes
Author: Matthew V. Mahoney
Contact Email Address:
Faculty Sponsor: Philip K. Chan
TR number assignment date: 6 September 2002
Abstract
Hostile network traffic is often "different" from benign traffic in ways that can be distinguished without knowing the nature of the attack. We describe a two-stage anomaly detection system for identifying suspicious traffic. First, we filter traffic to pass only the packets of most interest, e.g., the first few packets of incoming server requests. Second, we model the most common protocols (IP, TCP, telnet, FTP, SMTP, HTTP) at the packet byte level to flag events (byte values) that have not been observed for a long time. This simple system detects 132 of 185 attacks in the 1999 DARPA IDS evaluation data set with 100 false alarms, after training on one week of attack-free traffic.
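The following sketch illustrates the two-stage structure in a simplified form: stage one filters traffic down to packets sent to monitored server ports, and stage two keeps, for each byte position, the set of values seen in attack-free training and flags values never observed there. The ports, packet representation, and scoring are assumptions for illustration, not the paper's model.

```python
# Sketch of the two-stage idea described above, with the details assumed:
# stage 1 filters traffic down to the start of incoming requests, and stage 2
# keeps, for each byte position, the set of values seen in attack-free
# training and flags packets containing values never observed there.

SERVER_PORTS = {21, 23, 25, 80}        # FTP, telnet, SMTP, HTTP (illustrative)
N_BYTES = 48                           # model only the first bytes of a packet

def interesting(pkt):
    """Stage 1: keep only packets addressed to monitored server ports."""
    return pkt["dst_port"] in SERVER_PORTS

def train(packets):
    seen = [set() for _ in range(N_BYTES)]
    for pkt in filter(interesting, packets):
        for i, b in enumerate(pkt["payload"][:N_BYTES]):
            seen[i].add(b)
    return seen

def anomalies(pkt, seen):
    """Stage 2: byte positions whose value never appeared in training."""
    return [(i, b) for i, b in enumerate(pkt["payload"][:N_BYTES])
            if b not in seen[i]]

training = [{"dst_port": 80, "payload": b"GET / HTTP/1.0\r\n"}]
seen = train(training)
probe = {"dst_port": 80, "payload": b"GET /\x90\x90 HTTP/1.0\r\n"}
print(anomalies(probe, seen)[:3])      # novel 0x90 bytes are flagged
```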
-
CS-2002-13 (13 pages) (pdf)
Title: An Algorithm Applicable to Clearing Combinatorial Exchanges
Author: Marius-Calin Silaghi
Contact Email Address:
Faculty Sponsor: None
TR number assignment date: 16 September 2002
Abstract
It is important to approach negotiations in a way that ensures privacy. So far, research has focused on securely solving restricted classes of negotiation problems, mainly (M+1)-st-price auctions. Here we show how these results can be adapted to more general problems.
This paper extends our previous results on how distributed finite discrete problems can be solved securely. Such problems can model larger classes of negotiation problems, e.g., Combinatorial Exchanges [Sil02]. In Finite Discrete Maximization, each tuple in the problem space is associated with an integer value in a predefined interval, and we search for a maximizing input. Values from different subproblems are combined additively. We show that unconstrained distributed Finite Discrete Maximization problems can be solved securely using a scheme that we propose for translating shared secret values into shared differential bids. Differential bid vectors are already used in [AS02, Bra02].
Constrained distributed Finite Discrete Maximization poses additional challenges, due to the loss of additivity of the maximized cost when infeasibility is marked as the lowest finite value. We found two ways of solving this problem: a) using an additional multiplication value; and b) using larger variable domains. While the first alternative imposes a threshold on the privacy level in our current protocol, the second greatly increases the complexity of the computation. The proposed algorithms are only (t/3)-private, where t is the number of participants.
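For readers unfamiliar with the underlying machinery, the sketch below shows plain additive secret sharing, the kind of primitive that allows values from different subproblems to be combined additively without revealing them. It is not the paper's protocol: there are no differential bids, no maximization step, and no claim about the (t/3)-privacy bound.

```python
# Hedged sketch of plain additive secret sharing, the kind of primitive behind
# combining values from different subproblems additively; it is not the
# paper's actual protocol (no differential bids or maximization step here).
import random

P = 2_147_483_647   # a prime modulus larger than any value being shared

def share(value, n_parties):
    """Split value into n random shares that sum to value mod P."""
    shares = [random.randrange(P) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

# Two subproblems each share their secret valuation among 3 participants.
a_shares = share(15, 3)
b_shares = share(27, 3)

# Each participant adds its shares locally; the sums are shares of a + b.
combined = [(x + y) % P for x, y in zip(a_shares, b_shares)]
print(reconstruct(combined))   # 42, without any participant seeing 15 or 27
```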
-
CS-2002-14 (8 pages) (pdf)
Abstract
To provide a more robust context for personalization, we desire to extract a continuum of general (long-term) to specific (short-term) interests of a user. Our proposed approach is to learn a user interest hierarchy (UIH) from a set of web pages visited by a user. We devise a divisive hierarchical clustering (DHC) algorithm to group words (topics) into a hierarchy where more general interests are represented by a larger set of words. Each web page can then be assigned to nodes in the hierarchy for further processing in learning and predicting interests. This approach is analogous to building a subject taxonomy for a library catalog system and assigning books to the taxonomy. Our approach does not need user involvement and learns the UIH "implicitly." Furthermore, it allows the original objects, web pages, to be assigned to multiple topics (nodes in the hierarchy). In this paper, we focus on learning the UIH from a set of visited pages. We propose a few similarity functions and dynamic threshold-finding methods, and evaluate the resulting hierarchies according to their meaningfulness and shape.
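The sketch below conveys the divisive-clustering idea in miniature: all words start in one root cluster and are recursively split into connected components of a "similar enough" graph, with a simple co-occurrence similarity and a fixed threshold schedule standing in for the paper's similarity functions and dynamic threshold-finding methods.

```python
# Sketch of the divisive-clustering idea described above: words start in one
# root cluster and are recursively split using a simple co-occurrence
# similarity and a fixed threshold schedule. The paper's own similarity
# functions and dynamic threshold-finding methods are not reproduced here.
from itertools import combinations

def similarity(w1, w2, pages):
    """Fraction of pages in which the two words co-occur (toy measure)."""
    return sum(1 for p in pages if w1 in p and w2 in p) / len(pages)

def components(words, pages, threshold):
    """Connected components of the 'similar enough' graph (union-find)."""
    parent = {w: w for w in words}
    def find(w):
        while parent[w] != w:
            w = parent[w]
        return w
    for a, b in combinations(words, 2):
        if similarity(a, b, pages) >= threshold:
            parent[find(a)] = find(b)
    groups = {}
    for w in words:
        groups.setdefault(find(w), []).append(w)
    return list(groups.values())

def dhc(words, pages, threshold=0.3, step=0.2):
    """Return a nested dict: each node holds its word set and child nodes."""
    children = components(words, pages, threshold)
    if len(children) <= 1 or threshold >= 1.0:
        return {"words": sorted(words)}
    return {"words": sorted(words),
            "children": [dhc(set(c), pages, threshold + step, step) for c in children]}

pages = [{"python", "code", "ai"}, {"python", "code"}, {"travel", "hotel"},
         {"travel", "flight", "hotel"}]
print(dhc({"python", "code", "ai", "travel", "hotel", "flight"}, pages))
```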
-
CS-2002-15 (33 pages) (pdf)
Abstract
In this work, we first propose a technique for defining the "causes" of inconsistency in an online point-based reasoning constraint network. Second, we introduce an algorithm that proposes to the user a minimal set of relations to remove when inconsistencies are detected. We have developed and implemented a battery of algorithms for this type of reasoning, and we establish several theorems and properties to prove the minimality of the algorithm's output. Finally, we found that the problem we investigate is a polynomially solvable subproblem of the vertex cover problem.
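As background for the kind of network involved, the sketch below detects inconsistency in a point-based network over the relations <, <=, and =: the network is inconsistent when some point is forced to be strictly before itself. Identifying the "causes" and a minimal set of relations to remove, which is the contribution of this report, is not reproduced here.

```python
# Hedged sketch of basic inconsistency detection in a point-based network over
# the relations <, <=, and =: the network is inconsistent when some point is
# forced to be strictly before itself. The report's actual contribution --
# identifying the "causes" and a minimal removal set -- is not reproduced here.
from itertools import product

def inconsistent(points, constraints):
    """constraints: list of (a, rel, b) with rel in {'<', '<=', '='}."""
    reach = {a: {b: False for b in points} for a in points}   # a <= b by some chain
    strict = {a: {b: False for b in points} for a in points}  # ...and the chain has a '<'
    for a, rel, b in constraints:
        for x, y in ([(a, b), (b, a)] if rel == "=" else [(a, b)]):
            reach[x][y] = True
            strict[x][y] = strict[x][y] or rel == "<"
    # Floyd-Warshall-style closure propagating reachability and strictness.
    for k, i, j in product(points, repeat=3):
        if reach[i][k] and reach[k][j]:
            reach[i][j] = True
            strict[i][j] = strict[i][j] or strict[i][k] or strict[k][j]
    return any(strict[p][p] for p in points)

pts = ["x", "y", "z"]
print(inconsistent(pts, [("x", "<", "y"), ("y", "<=", "z"), ("z", "=", "x")]))   # True
print(inconsistent(pts, [("x", "<=", "y"), ("y", "<=", "z"), ("z", "<=", "x")])) # False
```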