Space Shuttle Engine Valve Anomaly Detection by Data Compression

Matt Mahoney
Sept. 23, 2003

Abstract

Abnormal sensor data from a set of space shuttle engine solenoid current traces is detected by a decreased improvement in compression ratio when the compressor is trained on normal sensor data, compared to compression without training. Relative to a second normal trace, the decrease is observed in all 8 abnormal traces using GZIP and PAQ3, and in 7 of 8 traces using RK.

Introduction

NASA uses anomaly detection of time series data (e.g. solenoid current in a space shuttle engine valve) to identify component failures. The current system manually specifies the shape of the waveform at on/off transitions using rules coded in SCL, a language for real-time expert systems. NASA wishes to automate the coding of SCL rules by using machine learning techniques to construct models from normal (properly operating) signal traces.

A model is an estimate of the (unknown) probability distribution over a set X of possible values (e.g. traces). A common measure of the goodness of a model is the cross entropy, defined as:

  H_M = -SUM_x P(x) log M(x)

where P(x) is the unknown true probability of x, M(x) is the modeled (estimated) probability, and the summation is over all x in X. H_M is minimized when M = P, in which case it equals the entropy of X.

A lossless data compression program uses a model M to assign a code of length C(x) = -log2 M(x) bits to a message x. Thus, compression measures how well M matches the unknown distribution P(x) of possible messages. Most compressors are adaptive, adjusting M continuously as x is read.

An adaptive compressor can be used for anomaly detection by training it on normal data x, then measuring the improvement in compression of the test data y compared to compressing y by itself. Thus, we define the following measure of the deviation of y from x as an anomaly score:

  score(x,y) = (C(xy) - C(x)) / C(y)

where xy is the concatenation of the training data x and the test data y, and C(x) denotes the size of x after compression. An anomaly score of 1 means that x and y are sufficiently different that training on x has no effect on the compression of y. A good model should produce a score close to 0 when y = x.
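As an illustration, the score can be computed with any off-the-shelf compressor. The following Python sketch uses zlib (a DEFLATE implementation, the same algorithm as GZIP) as a stand-in; it is an example of the technique, not the code used in these experiments.

  import zlib

  def C(data):
      # Compressed size in bytes, using zlib at maximum compression
      # as a stand-in for an adaptive compressor.
      return len(zlib.compress(data, 9))

  def score(x, y):
      # Anomaly score of test data y given training data x: the cost of
      # coding y after x, relative to the cost of coding y alone.
      return (C(x + y) - C(x)) / C(y)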
Experimental Procedure

Three compression programs were evaluated on two normal and 8 abnormal time series traces representing the current in a space shuttle engine valve solenoid as the valve is opened and closed. Each trace consists of 1000 samples at a rate of 1 ms per sample. The current ranges from -3.1 to 7.06, quantized in steps of 0.04. There are 47 distinct values observed in at least one trace. The 10 traces are labeled TEK00000.CSV through TEK00017.CSV, and are denoted t0 through t17 in this paper. They can be viewed graphically at http://cs.fit.edu/~mmahoney/nasa/

The traces t0 and t1 are normal. The others represent various failures. Anomaly detection was evaluated by using either t0 or t1 as training data and the other 9 traces as test data. A test is successful if the normal trace not used in training receives a lower anomaly score than all of the abnormal traces: if we train on t0, then t1 should receive the lowest anomaly score, and if we train on t1, then t0 should score the lowest.

The 10 TEK000??.CSV files were losslessly reduced to 1-byte values to create files of 1000 bytes each. The conversion function of sample x is:

  byte = 57 + 25 * min(4, max(x, -1))

This does not lose any information because in each trace at most one sample exceeds the lower bound of -1, and at most one sample exceeds the upper bound of 4. Thus, each possible data value maps to a distinct byte value in the range 32 to 157. (The bounds are necessary because there would otherwise be more than 256 values.)
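In code, the reduction might look like the following Python sketch. The mapping function follows the formula above; parsing the TEK000??.CSV files into floating-point samples is omitted, since their exact layout is not described here.

  def sample_to_byte(x):
      # Clip to [-1, 4], then scale and shift. The data is quantized in
      # steps of 0.04 and 25 * 0.04 = 1, so each quantized level maps to
      # a distinct byte in the range 32 to 157.
      return 57 + round(25 * min(4.0, max(x, -1.0)))

  # Hypothetical usage, assuming samples is a list of floats parsed from
  # one of the CSV files:
  #   data = bytes(sample_to_byte(x) for x in samples)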
Three compression programs were used: GZIP, PAQ3, and RK.

GZIP ver. 1.2.4 (http://www.gzip.org/) is an LZ77 compressor. It compresses by replacing repeated substrings of the input with pointers to prior occurrences. The pointers are aligned on bit boundaries and occupy less space than the strings they replace.

PAQ3 (http://www.cs.fit.edu/~mmahoney/compression/) is a predictive arithmetic coder. Arithmetic coders allow substrings of the input to have codes of fractional bit length by replacing the entire input string x with the binary representation of a number y in [0, 1) such that M(s < x) < y < M(s <= x), where s has the distribution modeled by M. PAQ3 uses an ad hoc weighted averaging of models: n-grams up to n = 8, a string matcher for contexts longer than 8, a word trigram model, and a model for fixed-length records. It typically compresses text files about 30% smaller than GZIP.

RK ver. 1.04a (http://rksoft.virtualave.net/) is a PPM (n-gram) arithmetic coder with secondary escape modeling. Unlike GZIP and PAQ3, which treat characters as nominal values, RK includes an option for delta coding of analog data (-fd1) that models the differences between successive values. With this option (and -mx3 for maximum compression) it achieves the best compression of the three on the valve data, although its compression ratios on most other data types are similar to PAQ3's.

For each compressor, score(x,y) was computed with x = t0 or t1 and y = t0 through t17. For GZIP, the compressed size C(x) is the size of the output file. For PAQ3 and RK, the size is as reported by the program, excluding headers (a few bytes). C(xy) is computed by concatenating x and y into a 2000-byte file and compressing it.
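The following Python sketch shows the shape of this procedure, reusing the zlib-based C() and score() defined earlier. The .bin file names are hypothetical; the actual experiments ran the external programs GZIP, PAQ3, and RK and used their reported sizes as described above.

  # Hypothetical driver; assumes the 1000-byte reduced traces are stored
  # as t0.bin, t1.bin, etc., and that score() is defined as above.
  traces = ["t0", "t1", "t2", "t3", "t5", "t10", "t11", "t15", "t16", "t17"]
  data = {}
  for t in traces:
      with open(t + ".bin", "rb") as f:
          data[t] = f.read()
  for x in ("t0", "t1"):        # train on each normal trace in turn
      for y in traces:          # test against every trace
          print(x, y, round(score(data[x], data[y]), 3))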
Results are summarized in Table 1. For all three compressors, the scores for the normal data (score(t0, t1) and score(t1, t0)) are less than the average scores for the abnormal traces. For GZIP and PAQ3, the normal traces always have the lowest scores. For RK, one of the 8 abnormal traces (t15) scores lower than the normal trace on either training set, but all the others score higher. For the better compressors (PAQ3 and RK), training on normal data actually makes compression of the abnormal data worse on average than no training at all. The greatest gap between the normal and abnormal scores, about 30%, is achieved by PAQ3.

Table 1. Results summary

                              GZIP    PAQ3    RK
                              -----   -----   -----
  score(t0, t1)               0.716   0.773   0.834
  score(t1, t0)               0.712   0.772   0.822
  avg. score(t0, abnormal)    0.899   1.078   1.007
  avg. score(t1, abnormal)    0.897   1.063   1.013
  avg. abnormal - normal      0.184   0.298   0.182
  score normal < abnormal     18/18   18/18   16/18

Table 2 shows the complete results of each experiment. It is divided into 6 sections, one for each of the two training traces under each of the three compressors. Trace t15 receives the lowest score of all the abnormal traces under all three compressors, and scores slightly lower than the normal traces under RK. Examination of the waveforms shows that t15 has a noise pattern similar to the normal traces, in which successive samples alternate by +0.04 and -0.04; this pattern is absent in the other abnormal traces. t15 differs from the normal traces in the absolute value of the signal in the "on" state, but this information is not modeled by RK when the delta coding option is used.

Table 2. Detailed results

  x = training file (1000 bytes, normal behavior, either t0 or t1)
  y = test file (1000 bytes, all but t0 and t1 are abnormal)
  C(y) = size of compressed file y in bytes
  C(xy) = size of the concatenation of x and y (2000 bytes) after compression
  C'(y) = C(xy) - C(x) = size of C(y) when trained on x
  score(x, y) = C'(y)/C(y) = effect of x on C(y)
    (a low score means y is similar to x under model M)

  GZIP, x = t0 (C(x) = 396)

  y     C(y)   C(xy)   C'(y)   score(x, y)
  ---   ----   -----   -----   -----------
  t0     396     411      15   0.038
  t1     377     666     270   0.716
  t2     524     875     479   0.914
  t3     517     863     467   0.903
  t5     639     995     599   0.937
  t10    439     799     403   0.918
  t11    507     865     469   0.925
  t15    393     696     300   0.763
  t16    553     904     508   0.919
  t17    524     876     480   0.916

  (average untrained compression = 486.9)

  GZIP, x = t1 (C(x) = 377)

  y     C(y)   C(xy)   C'(y)   score(x, y)
  ---   ----   -----   -----   -----------
  t1     377     394      17   0.045
  t0     396     655     278   0.702
  t2     524     855     478   0.912
  t3     517     842     465   0.899
  t5     639     972     595   0.931
  t10    439     778     401   0.913
  t11    507     844     467   0.921
  t15    393     680     303   0.771
  t16    553     881     504   0.911
  t17    524     856     479   0.914

  PAQ3, x = t0 (C(x) = 285)

  y     C(y)   C(xy)   C'(y)   score(x, y)
  ---   ----   -----   -----   -----------
  t0     285     293       8   0.028
  t1     282     503     218   0.773
  t2     320     633     348   1.087
  t3     317     631     346   1.091
  t5     365     694     409   1.121
  t10    266     576     291   1.094
  t11    309     630     345   1.117
  t15    270     520     235   0.870
  t16    342     671     386   1.129
  t17    323     645     360   1.115

  (average untrained compression = 307.9)

  PAQ3, x = t1 (C(x) = 282)

  y     C(y)   C(xy)   C'(y)   score(x, y)
  ---   ----   -----   -----   -----------
  t1     282     289       7   0.025
  t0     285     502     220   0.772
  t2     320     629     347   1.084
  t3     317     624     342   1.079
  t5     365     679     397   1.088
  t10    266     571     289   1.086
  t11    309     627     345   1.117
  t15    270     509     227   0.841
  t16    342     664     382   1.117
  t17    323     636     354   1.096

  RK -mx3 -fd1, x = t0 (C(x) = 219)

  y     C(y)   C(xy)   C'(y)   score(x, y)
  ---   ----   -----   -----   -----------
  t0     219     233      14   0.064
  t1     211     395     176   0.834
  t2     340     578     359   1.056
  t3     335     569     350   1.045
  t5     325     555     336   1.034
  t10    312     540     321   1.029
  t11    330     562     343   1.039
  t15    197     379     160   0.812
  t16    352     573     354   1.006
  t17    358     590     371   1.036

  (average untrained compression = 297.9)

  RK -mx3 -fd1, x = t1 (C(x) = 211)

  y     C(y)   C(xy)   C'(y)   score(x, y)
  ---   ----   -----   -----   -----------
  t1     211     219       8   0.038
  t0     219     391     180   0.822
  t2     340     573     362   1.065
  t3     335     562     351   1.048
  t5     325     548     337   1.037
  t10    312     535     324   1.038
  t11    330     553     342   1.036
  t15    197     367     156   0.792
  t16    352     578     367   1.043
  t17    358     586     375   1.047

Conclusions and Future Work

These experiments suggest a modeling approach to anomaly detection in time series data. The models used by general purpose data compression programs appear to be suitable for this approach, and they are fast enough to use in a real-time system: all of the compressors tested run at 50 times real time or faster on this data on a 750 MHz PC.

One shortcoming of this experiment is the lack of sufficient normal traces for evaluation. We can only state that t0 and t1 resemble each other more closely than they resemble the abnormal traces; we cannot make claims about the actual variability of the normal data with only two traces. They differ from each other mainly in the timing of the on and off transitions. Until more test data is made available, I believe we should proceed under the assumption (suggested but not proven) that better models result in better anomaly detection.

I developed PAQ3 as a general purpose data compression program prior to starting work on the NASA data. It is not designed to compress analog data, and it uses an ad hoc weighting scheme to mix models that are mainly designed to compress text. Future work should include:

-- Investigate adaptive approaches to mixing models.
-- Add models for analog data (a starting point is sketched after this list).
-- Instrument PAQ3 to identify the specific points where anomalies occur.
-- Identify additional data sources for testing.
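One simple analog model is to delta-code the byte stream before compression, so that a general-purpose compressor models the differences between successive samples rather than their absolute values, roughly approximating RK's -fd1 option. As a minimal sketch of the idea (illustrative only; not code from these experiments):

  def delta_encode(data):
      # Keep the first byte, then replace each subsequent byte with its
      # difference from the previous byte, modulo 256. A rough analog of
      # RK's -fd1 delta coding option.
      out = bytearray(data[:1])
      for a, b in zip(data, data[1:]):
          out.append((b - a) & 0xFF)
      return bytes(out)

Applying score() to delta-coded traces, e.g. score(delta_encode(x), delta_encode(y)), would give the compressor a first-order difference view of the analog signal.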