Space Shuttle Engine Valve Anomaly Detection by Data Compression

Matt Mahoney
Sept. 23, 2003

Abstract

Abnormal sensor data from a set of space shuttle engine solenoid current traces is detected by a decreased improvement in compression ratio when the compressor is trained on normal sensor data, compared to compression without training. Relative to a second normal trace, the decrease is observed in all 8 abnormal traces using GZIP and PAQ3, and in 7 of 8 traces using RK.

Introduction

NASA uses anomaly detection of time series data (e.g. solenoid current in a space shuttle engine valve) to identify component failures. The current system manually specifies the shape of the waveform at on/off transitions using rules coded in SCL, a language for real-time expert systems. NASA wishes to automate the coding of SCL rules by using machine learning techniques to construct models from normal (properly operating) signal traces.

A model is an estimate of the (unknown) probability distribution over a set X of possible values (e.g. traces). A common measure of the goodness of a model is the cross entropy, defined as:

  H_M = -SUM_x P(x) log M(x)

where P(x) is the unknown true probability of x, M(x) is the modeled (estimated) probability, and the summation is over all x in X. H_M is minimized when M = P, in which case it equals the entropy of X.

A lossless data compression program uses a model M to assign a code of length C(x) = -log2 M(x) bits to a message x. Thus, compression measures how well M matches the unknown distribution P(x) of possible messages. Most compressors are adaptive, adjusting M continuously as x is read.

An adaptive compressor can be used for anomaly detection by training it on normal data x, then measuring the improvement in compression of the test data y compared to compressing y by itself. Thus, we define the following measure of the deviation of y from x as an anomaly score:

  score(x,y) = (C(xy) - C(x)) / C(y)

where xy is the concatenation of the training data x and the test data y, and C(x) denotes the size of x after compression. An anomaly score of 1 means that x and y are sufficiently different that training on x has no effect on the compression of y. A good model should produce a score close to 0 when y = x.
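As an illustration, the score can be computed with any off-the-shelf compressor. The following Python sketch uses zlib (a DEFLATE implementation, the same algorithm as GZIP) as a stand-in; it is an example of the technique, not the code used in these experiments.

  import zlib

  def C(data):
      # Compressed size in bytes, using zlib at maximum compression
      # as a stand-in for an adaptive compressor.
      return len(zlib.compress(data, 9))

  def score(x, y):
      # Anomaly score of test data y given training data x: the cost of
      # coding y after x, relative to the cost of coding y alone.
      return (C(x + y) - C(x)) / C(y)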
Experimental Procedure

Three compression programs were evaluated on two normal and 8 abnormal time series traces representing the current in a space shuttle engine valve solenoid as the valve is opened and closed. Each trace consists of 1000 samples at a rate of 1 ms per sample. The current ranges from -3.1 to 7.06, quantized in steps of 0.04. There are 47 distinct values observed in at least one trace. The 10 traces are labeled TEK00000.CSV through TEK00017.CSV, and are denoted t0 through t17 in this paper. They can be viewed graphically at http://cs.fit.edu/~mmahoney/nasa/

The traces t0 and t1 are normal. The others represent various failures. Anomaly detection was evaluated by using either t0 or t1 as training data and the other 9 traces as test data. A test is successful if the normal trace not used in training receives a lower anomaly score than all of the abnormal traces: if we train on t0, then t1 should receive the lowest anomaly score, and if we train on t1, then t0 should score the lowest.

The 10 TEK000??.CSV files were losslessly reduced to 1-byte values to create files of 1000 bytes each. The conversion function of sample x is:

  byte = 57 + 25 * min(4, max(x, -1))

This does not lose any information because in each trace at most one sample exceeds the lower bound of -1, and at most one sample exceeds the upper bound of 4. Thus, each possible data value maps to a distinct byte value in the range 32 to 157. (The bounds are necessary because there would otherwise be more than 256 values.)
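In code, the reduction might look like the following Python sketch. The mapping function follows the formula above; parsing the TEK000??.CSV files into floating-point samples is omitted, since their exact layout is not described here.

  def sample_to_byte(x):
      # Clip to [-1, 4], then scale and shift. The data is quantized in
      # steps of 0.04 and 25 * 0.04 = 1, so each quantized level maps to
      # a distinct byte in the range 32 to 157.
      return 57 + round(25 * min(4.0, max(x, -1.0)))

  # Hypothetical usage, assuming samples is a list of floats parsed from
  # one of the CSV files:
  #   data = bytes(sample_to_byte(x) for x in samples)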
Three compression programs were used: GZIP, PAQ3, and RK.

GZIP ver. 1.2.4 (http://www.gzip.org/) is an LZ77 compressor. It compresses by replacing repeated substrings of the input with pointers to prior occurrences. The pointers are aligned on bit boundaries and occupy less space than the strings they replace.

PAQ3 (http://www.cs.fit.edu/~mmahoney/compression/) is a predictive arithmetic coder. Arithmetic coders allow substrings of the input to have codes of fractional bit length by replacing the entire input string x with the binary representation of a number y in [0, 1) such that M(s < x) < y < M(s <= x), where s has the distribution modeled by M. PAQ3 uses an ad hoc weighted averaging of models: n-grams up to n = 8, a string matcher for contexts longer than 8, a word trigram model, and a model for fixed-length records. It typically compresses text files about 30% smaller than GZIP.

RK ver. 1.04a (http://rksoft.virtualave.net/) is a PPM (n-gram) arithmetic coder with secondary escape modeling. Unlike GZIP and PAQ3, which treat characters as nominal values, RK includes an option for delta coding of analog data (-fd1) that models the differences between successive values. With this option (and -mx3 for maximum compression) it achieves the best compression of the three on the valve data, although its compression ratios on most other data types are similar to PAQ3's.

For each compressor, score(x,y) was computed with x = t0 or t1 and y = t0 through t17. For GZIP, the compressed size C(x) is the size of the output file. For PAQ3 and RK, the size is as reported by the program, excluding headers (a few bytes). C(xy) is computed by concatenating x and y into a 2000-byte file and compressing it.
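The following Python sketch shows the shape of this procedure, reusing the zlib-based C() and score() defined earlier. The .bin file names are hypothetical; the actual experiments ran the external programs GZIP, PAQ3, and RK and used their reported sizes as described above.

  # Hypothetical driver; assumes the 1000-byte reduced traces are stored
  # as t0.bin, t1.bin, etc., and that score() is defined as above.
  traces = ["t0", "t1", "t2", "t3", "t5", "t10", "t11", "t15", "t16", "t17"]
  data = {}
  for t in traces:
      with open(t + ".bin", "rb") as f:
          data[t] = f.read()
  for x in ("t0", "t1"):        # train on each normal trace in turn
      for y in traces:          # test against every trace
          print(x, y, round(score(data[x], data[y]), 3))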
Results are summarized in Table 1. For all three compressors, the scores for the normal data (score(t0, t1) and score(t1, t0)) are less than the average scores for the abnormal traces. For GZIP and PAQ3, the normal traces always have the lowest scores. For RK, one of the 8 abnormal traces (t15) scores lower than the normal trace on either training set, but all the others score higher. For the better compressors (PAQ3 and RK), training on normal data actually makes compression of the abnormal data worse on average than no training at all. The greatest gap between the normal and abnormal scores, about 30%, is achieved by PAQ3.

Table 1. Results summary

                              GZIP    PAQ3    RK
                              -----   -----   -----
  score(t0, t1)               0.716   0.773   0.834
  score(t1, t0)               0.712   0.772   0.822
  avg. score(t0, abnormal)    0.899   1.078   1.007
  avg. score(t1, abnormal)    0.897   1.063   1.013
  avg. abnormal - normal      0.184   0.298   0.182
  score normal < abnormal     18/18   18/18   16/18

Table 2 shows the complete results of each experiment. It is divided into 6 sections, one for each of the two training traces under each of the three compressors. Trace t15 receives the lowest score of all the abnormal traces under all three compressors, and scores slightly lower than the normal traces under RK. Examination of the waveforms shows that t15 has a noise pattern similar to the normal traces, in which successive samples alternate by +0.04 and -0.04; this pattern is absent in the other abnormal traces. t15 differs from the normal traces in the absolute value of the signal in the "on" state, but this information is not modeled by RK when the delta coding option is used.

Table 2. Detailed results

  x = training file (1000 bytes, normal behavior, either t0 or t1)
  y = test file (1000 bytes, all but t0 and t1 are abnormal)
  C(y) = size of compressed file y in bytes
  C(xy) = size of the concatenation of x and y (2000 bytes) after compression
  C'(y) = C(xy) - C(x) = size of C(y) when trained on x
  score(x, y) = C'(y)/C(y) = effect of x on C(y)
    (a low score means y is similar to x under model M)

  GZIP, x = t0 (C(x) = 396)

  y     C(y)   C(xy)   C'(y)   score(x, y)
  ---   ----   -----   -----   -----------
  t0     396     411      15   0.038
  t1     377     666     270   0.716
  t2     524     875     479   0.914
  t3     517     863     467   0.903
  t5     639     995     599   0.937
  t10    439     799     403   0.918
  t11    507     865     469   0.925
  t15    393     696     300   0.763
  t16    553     904     508   0.919
  t17    524     876     480   0.916

  (average untrained compression = 486.9)

  GZIP, x = t1 (C(x) = 377)

  y     C(y)   C(xy)   C'(y)   score(x, y)
  ---   ----   -----   -----   -----------
  t1     377     394      17   0.045
  t0     396     655     278   0.702
  t2     524     855     478   0.912
  t3     517     842     465   0.899
  t5     639     972     595   0.931
  t10    439     778     401   0.913
  t11    507     844     467   0.921
  t15    393     680     303   0.771
  t16    553     881     504   0.911
  t17    524     856     479   0.914

  PAQ3, x = t0 (C(x) = 285)

  y     C(y)   C(xy)   C'(y)   score(x, y)
  ---   ----   -----   -----   -----------
  t0     285     293       8   0.028
  t1     282     503     218   0.773
  t2     320     633     348   1.087
  t3     317     631     346   1.091
  t5     365     694     409   1.121
  t10    266     576     291   1.094
  t11    309     630     345   1.117
  t15    270     520     235   0.870
  t16    342     671     386   1.129
  t17    323     645     360   1.115

  (average untrained compression = 307.9)

  PAQ3, x = t1 (C(x) = 282)

  y     C(y)   C(xy)   C'(y)   score(x, y)
  ---   ----   -----   -----   -----------
  t1     282     289       7   0.025
  t0     285     502     220   0.772
  t2     320     629     347   1.084
  t3     317     624     342   1.079
  t5     365     679     397   1.088
  t10    266     571     289   1.086
  t11    309     627     345   1.117
  t15    270     509     227   0.841
  t16    342     664     382   1.117
  t17    323     636     354   1.096

  RK -mx3 -fd1, x = t0 (C(x) = 219)

  y     C(y)   C(xy)   C'(y)   score(x, y)
  ---   ----   -----   -----   -----------
  t0     219     233      14   0.064
  t1     211     395     176   0.834
  t2     340     578     359   1.056
  t3     335     569     350   1.045
  t5     325     555     336   1.034
  t10    312     540     321   1.029
  t11    330     562     343   1.039
  t15    197     379     160   0.812
  t16    352     573     354   1.006
  t17    358     590     371   1.036

  (average untrained compression = 297.9)

  RK -mx3 -fd1, x = t1 (C(x) = 211)

  y     C(y)   C(xy)   C'(y)   score(x, y)
  ---   ----   -----   -----   -----------
  t1     211     219       8   0.038
  t0     219     391     180   0.822
  t2     340     573     362   1.065
  t3     335     562     351   1.048
  t5     325     548     337   1.037
  t10    312     535     324   1.038
  t11    330     553     342   1.036
  t15    197     367     156   0.792
  t16    352     578     367   1.043
  t17    358     586     375   1.047

Conclusions and Future Work

These experiments suggest a modeling approach to anomaly detection in time series data. The models used by general purpose data compression programs appear to be suitable for this approach, and they are fast enough to use in a real-time system: all of the compressors tested run at 50 times real time or faster on this data on a 750 MHz PC.

One shortcoming of this experiment is the lack of sufficient normal traces for evaluation. We can only state that t0 and t1 resemble each other more closely than they resemble the abnormal traces; we cannot make claims about the actual variability of the normal data with only two traces. They differ from each other mainly in the timing of the on and off transitions. Until more test data is made available, I believe we should proceed under the assumption (suggested but not proven) that better models result in better anomaly detection.

I developed PAQ3 as a general purpose data compression program prior to starting work on the NASA data. It is not designed to compress analog data, and it uses an ad hoc weighting scheme to mix models that are mainly designed to compress text. Future work should include:

-- Investigate adaptive approaches to mixing models.
-- Add models for analog data (a starting point is sketched after this list).
-- Instrument PAQ3 to identify the specific points where anomalies occur.
-- Identify additional data sources for testing.
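One simple analog model is to delta-code the byte stream before compression, so that a general-purpose compressor models the differences between successive samples rather than their absolute values, roughly approximating RK's -fd1 option. As a minimal sketch of the idea (illustrative only; not code from these experiments):

  def delta_encode(data):
      # Keep the first byte, then replace each subsequent byte with its
      # difference from the previous byte, modulo 256. A rough analog of
      # RK's -fd1 delta coding option.
      out = bytearray(data[:1])
      for a, b in zip(data, data[1:]):
          out.append((b - a) & 0xFF)
      return bytes(out)

Applying score() to delta-coded traces, e.g. score(delta_encode(x), delta_encode(y)), would give the compressor a first-order difference view of the analog signal.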