My main research interests are in adaptive systems, particularly machine learning (data mining) methods for computer security, web personalization, and biomedical informatics.
Machine Learning (Data Mining): Earlier research studied cost-sensitive learning and skewed distributions for classification tasks, with applications in credit card fraud detection. Recent research focuses on learning algorithms for characterization (or ``one-class learning''). ``Classical'' learning algorithms address classification: they construct models that discriminate among the classes labeled in the training instances. The objective of characterization, in contrast, is to describe instances in the lone ``class,'' without knowledge of the ``other classes.'' We have introduced algorithms that learn comprehensible models and applied them to different domains, which are discussed below.
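As a minimal illustration of characterization (not one of the algorithms mentioned above), the sketch below learns a description from instances of a single class only and then reports which attributes of a new instance fall outside that description. The function names, attribute names, and toy data are all hypothetical.

```python
from collections import defaultdict


def learn_profile(instances):
    """Describe the lone class: for each attribute, the set of values seen in training."""
    profile = defaultdict(set)
    for inst in instances:
        for attr, value in inst.items():
            profile[attr].add(value)
    return dict(profile)


def violations(profile, instance):
    """Return the attributes whose value never appeared in the training instances."""
    return [a for a, v in instance.items() if v not in profile.get(a, set())]


# Toy training data: every instance belongs to the same (only) class.
training = [
    {"color": "red", "size": "small"},
    {"color": "red", "size": "medium"},
    {"color": "orange", "size": "small"},
]
profile = learn_profile(training)

print(violations(profile, {"color": "red", "size": "small"}))   # no attribute deviates
print(violations(profile, {"color": "blue", "size": "small"}))  # "color" deviates
```

No instance of any other class is needed at training time, and the learned description (a set of allowed values per attribute) can be read directly, which is the sense of ``comprehensible'' assumed in this sketch.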
Computer Security: Much of the intrusion detection research focuses on signature (misuse) detection, where models are built to recognize known attacks and therefore cannot detect novel ones. Anomaly detection focuses on modeling normal behavior and identifying significant deviations, which could be novel attacks. This is a characterization task. For learning algorithms to be effective in anomaly detection, they need to construct easily comprehensible models, train and test efficiently, and generate few false alarms. We introduced a machine learning algorithm, called LERAD, that satisfies these properties. We evaluated LERAD on the DARPA 99 dataset as well as on real data collected at our department. Compared to SPADE, an anomaly detection algorithm based on BayesNets and probability distributions, LERAD detects significantly more attacks in both datasets. This project was funded by DARPA.
Monitoring with sensors: Time series data from the sensors of a device can be used to model the normal behavior of the device. Significant deviations from this model can signal potential new problems. This is another characterization task. Our approach first segments a time series into states, then characterizes each state by comprehensible rules, and finally constructs a finite state machine. The project, funded by NASA, aims to identify malfunctioning valves in the shuttle program. Initial findings indicate that our approach is quite effective. We are further investigating learning a model from multiple time series and from multiple sensors. With the advent of sensor networks, monitoring various targets of interest (devices, environments) and detecting abnormal behavior will become increasingly important.
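The three steps (segment, characterize, build a state machine) can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's actual method: segmentation uses a fixed jump threshold `tol`, each state's rule is just a (min, max) value range, and the state machine is a linear chain in which each state leads only to the next. All names and the toy readings are hypothetical.

```python
def segment(series, tol=0.5):
    """Split a non-empty series into runs whose consecutive values differ by at most tol."""
    segments = [[series[0]]]
    for prev, cur in zip(series, series[1:]):
        if abs(cur - prev) <= tol:
            segments[-1].append(cur)
        else:
            segments.append([cur])
    return segments


def learn_chain(series, tol=0.5):
    """One state per segment; each state's rule is the (min, max) range of its values."""
    return [(min(seg), max(seg)) for seg in segment(series, tol)]


def is_normal(rules, series):
    """Walk the chain: stay while the current rule holds, or advance if the next rule holds."""
    state = 0
    for x in series:
        lo, hi = rules[state]
        if lo <= x <= hi:
            continue  # still in the current state
        if state + 1 < len(rules):
            nlo, nhi = rules[state + 1]
            if nlo <= x <= nhi:
                state += 1  # legal transition to the next state
                continue
        return False  # value fits neither the current nor the next state
    return True


# Toy normal trace: low, then high, then low again (assumed values).
normal_trace = [0.0, 0.1, 0.0, 5.0, 5.1, 5.0, 0.1, 0.0]
rules = learn_chain(normal_trace)

print(is_normal(rules, [0.0, 0.1, 5.0, 5.05, 0.05]))  # follows the learned states
print(is_normal(rules, [0.0, 0.1, 2.5, 5.0]))         # 2.5 matches no permitted state
```

Because each state is a readable range rule and each transition is explicit, a flagged deviation can be traced to the state and value that caused it.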
Web Personalization: We propose non-invasive techniques (requiring no direct user involvement) for adaptive and personalized information searching and navigation guidance. Rather than asking users to state their interests, we focus on modeling user interests based on behavior. Our studies found that, among more than ten different keyboard and mouse activities, mouse clicks appear to be more predictive of interest than the others. Beyond using words, we introduced an efficient algorithm for identifying phrases to represent a page. We also developed techniques for generating a hierarchy of topics of interest to the user. More recently, with Prof. Jacob (Tufts), we propose to investigate eye gaze as an indicator of interest and for identifying features to represent a page.
Biomedical Informatics: Most of the algorithms we devise are not specific to a domain. For instance, my thesis work used DNA and protein data for predicting splice junctions and protein secondary structures. Though LERAD was developed for detecting novel computer attacks, preliminary findings suggest that it performs well in detecting medical incidents in patient data simulated by commercial software used for training medical students (MGH and MIT). Recently, with biology colleagues, we have been analyzing microarray data to understand gene expression in E. coli cell cycles.