Fair Use Agreement Towards Parameter Free Data Mining Outline of Talk Parameters
Horan, Liza, Contributing Writer has reference to this Academic Journal, PHwiki organized this Journal Fair Use Agreement This agreement covers the use of all slides on this CD-Rom, please read carefully. You may freely use these slides as long as teaching, if You send me an email telling me the class number/ university in advance. My name in addition to email address appears on the first slide (if you are using all or most of the slides), or on each slide (if you are just taking a few slides). You may freely use these slides as long as a conference presentation, if You send me an email telling me the conference name in advance. My name appears on each slide you use. You may not use these slides as long as tutorials, or in a published work (tech report/ conference paper/ thesis/ journal etc). If you wish to do this, email me first, it is highly likely I will grant you permission. (c) Eamonn Keogh, email@example.com Towards Parameter Free Data Mining Eamonn Keogh Computer Science & Engineering Department, University of Cali as long as nia Riverside Stefano Lonardi Chotirat Ann Ratanamahatana Outline of Talk Why parameter laden data mining algorithms are troublesome A parameter free/lite approach Empirical evaluation Parameter free/lite clustering Parameter free/lite anomaly detection Parameter free/lite classifcation Conclusions
This Particular University is Related to this Particular Journal
Parameters Anecdotal observations A data mining paper I once reviewed had 14 parameters More than 5 parameters seems the norm as long as most papers Some algorithms have only 4 parameters, but the algorithm is tested on synthetic data, created by a algorithm with 3 or 4 parameters . Is a surfeit of parameters a bad thing Parameter laden algorithms are often hard to implement By definition, parameter laden algorithms tend to be hard to implement. But dont we want people to use our work This fact (combined with general laziness) probably accounts as long as the general dearth of strawman comparisons in most papers. See Keogh & Kasetty SIGKDD02. Parameter laden algorithms make it hard to reproduce others results We re-implement many other papers (63 as long as this paper alone). Yet we are rarely able to reproduce others results, even when using the same datasets. One reason is that we typically dont know how someone set the 5, 6, 7, 8, 9 parameters used in the original paper. Are irreproducible results scientifically meaningful
Parameter laden algorithms make it hard not to overfit Over fitting is a problem in data mining. If we have 5 to 10 parameters to set, avoiding over fitting is nearly impossible. This makes it hard to evaluate the contribution of a paper. So you got 95% accuracy on the bull/bear classification problem. Did you Spend 2 minutes adjusting the parameters (I am somewhat impressed) Spend 2 weeks adjusting the parameters (I am not impressed) Yet often we dont know which! (See Pedro Domingos POE papers) Parameter laden algorithms are nearly useless to the non-specialist If we ask a Cardiologist / COE / Lawyer / School Administrator to use an algorithm with 10 parameters, then the best we can hope as long as is that they give up in frustration! The Central Claim of this Paper A very simple (12 lines of code) parameter free algorithm can outper as long as m many (most/all) other algorithms, as long as many data mining tasks.
Quick Review The Kolmogorov complexity K(x) of a string x is defined as the length of the shortest program capable of producing x on a universal computer such as a Turing machine. The conditional Kolmogorov complexity K(xy) of x to y is defined as the length of the shortest program that computes x when y is given as an auxiliary input to the program. The function K(xy) is the length of the shortest program that outputs y concatenated to x. The Similarity Metric Ming Li et. al. Ming Li, Xin Chen, Xin Li, Bin Ma, Paul M. B. Vitányi: The similarity metric. SODA 2003: 863-872 This is the optimal similarity measure, but intractable This is a cheap approximation C(x) means the size of x compressed by an off-the-shelf compression algorithm like Winzip, Stuffit etc Ming Li The (very) Minor Extension proposed here This is the optimal similarity measure, but intractable This is a cheap approximation This is a cheaper approximation Note we flipped the numerator in addition to denominator to go from similarity to dissimilarity
function dist = CDM(A, B) save A.txt A ASCII % Save variable A as A.txt zip(‘A.zip’, ‘A.txt’); % Compress A.txt A-file = dir(‘A.zip’); % Get file in as long as mation save B.txt B ASCII % Save variable B as B.txt zip(‘B.zip’, ‘B.txt’); % Compress B.txt B-file = dir(‘B.zip’); % Get file in as long as mation A-n-B = [A; B]; % Concatenate A in addition to B save A-n-B.txt A-n-B ASCII % Save A-n-B.txt zip(‘A-n-B.zip’, ‘A-n-B.txt’); % Compress A-n-B.txt A-n-B-file = dir(‘A-n-B.zip’); % Get file in as long as mation % Return CDM dissimilarity dist = A-n-B-file.bytes / (A-file.bytes + B-file.bytes); Compression Based Distance Measure But wait! Isnt the choice of compression algorithm itself a parameter! No! We are trying to approximate the optimal Kolmogorov compression. So the best compression algorithm to use is the one that gives the smallest file size Baboon Barbary Ape Chimpanzee Pygmy Chimpanzee Human Gorilla Orangutan Sumatran Orangutan Gibbon Capuchin Malayan Flying Lemur Ring-Tailed Lemur Oyster pongines prosimians cercopithecoids hominoids greater apes panines anthropoids catarrhines taxonomy level controversial primates The clustering achieved by CDM on 16,300 symbols from the mitochondrial DNA of 12 primates, in addition to one outlier species. Sang-Hee Lee (Noted physical anthropologist) This is the correct clustering as long as these species
Argentina Mexico Spain Brazil Catalan Italy Denmark Norway Sweden Germany USA Argentina Mexico Spain Brazil Catalan Italy Denmark Norway Sweden Germany USA Danish Norwegian German French Dutch Italian Latin English Maori Danish Norwegian German French Dutch Italian Latin English Maori The clustering achieved on the text from various Yahoo portals (Jan-15th 2004). The smallest webpage had 1,615 characters, excluding white spaces The clustering achieved on the text from the first fifty chapters of Genesis But DNA strings in addition to natural language text are inherently discrete. Much of data mining is on real valued data, like images, video in addition to time series Yeah, what about time series 1 2 3 0.13812500000000 0.51250000000000 0.49561523437690 0.04875000000000 0.50000000000000 0.49604248046834 0.10375000000000 0.50000000000000 0.49653076171875 0.17875000000000 0.47562500000000 0.49706481933594 0.24093750000000 0.45125000000000 0.49750732421875 0.29875000000000 0.45125000000000 0.49808715820312 0.37000000000000 0.47656250000000 0.49875854492187 0.48375000000000 0.50000000000000 0.49939941406230 0.55593750000000 0.48281250000000 0.50007080078125 0.64625000000000 0.48468750000000 0.50062011718750 0.70125000000000 0.46937500000000 0.50123046875826 What about Time Series Part I In the example below, 1 in addition to 3 are very similar ECGS
What about Time Series Part II 0 0 20 40 60 80 100 120 b baabccbc SAXIFY! In the SAX representation, every bit has about the same importance 1 2 3 0.13812500000000 0.51250000000000 0.49561523437690 0.04875000000000 0.50000000000000 0.49604248046834 0.10375000000000 0.50000000000000 0.49653076171875 How well does CDM time series clustering work We compared to 51 other measures (we re-implemented every time series measure from SIGKDD, SIGMOD, ICDM, ICDE, VLDB, ICML, SSDM, PKDD in addition to PAKDD conferences in the last decade). We optimized the parameters of the above We tested on diverse problems Homogenous datasets: All ECGs etc Heterogeneous datasets: A mixed bag of 18 diverse datasets We used a as long as mal metric to score the clusterings (but here we will just show a few pictures) CDM does best (by a huge margin) on the heterogeneous data problems
1 16 17 18 19 20 4 7 9 6 2 5 8 11 12 14 15 13 10 3 CDMs clustering CDM does best (by a huge margin) on the homogenous data problems The second best clustering (of 51 approaches) MIT-BIH Noise Stress Test Database (nstdb): record 118e6 Long Term ST Database (ltstdb): record 20021 BIDMC Congestive Heart Failure Database (chfdb): record chf02 BIDMC Congestive Heart Failure Database (chfdb): record chf15 OK, you can cluster stuff, what about other data mining problems, like classification, motif discovery, anomaly detection Yes, what about anomaly detection Parameter-Free Anomaly Detection The connection between the notion of similarity in addition to anomalies is inherent in the English idiom. When confronted with an anomalous object or occurrence, people usually exclaim “I have never seen anything like it!” or “It is like nothing you have ever seen”.
A Parameter-Free Anomaly Detection Algorithm A) Our Approach B) Support Vector Machine (SVM) based approach (6 parameters). C) Immunology (IMM) inspired approach (5 parameters). D) Association Rule (AR) based approach (5 parameters). E) TSA-tree Wavelet based approach (3 parameters). For each experiment we spend one hour of CPU time, in addition to one hour of human time trying to find the best parameters in addition to only reported the best results. A problem from SIGKDD 2003 What happens when we change the anomaly A) Our Approach B) Support Vector Machine (SVM) based approach (6 parameters). C) Immunology (IMM) inspired approach (5 parameters). D) Association Rule (AR) based approach (5 parameters). E) TSA-tree Wavelet based approach (3 parameters).
2000 3000 4000 5000 A small excerpt from dataset 118e06 from the MIT-BIH Noise Stress Test Database. The full dataset is 21,600 data points long. Here, we show only a subsection containing the two most interesting events detected by our algorithm The gray markers are independent annotations by a cardiologist indicating Premature Ventricular Contractions L-1u L-1n L-1t L-1v The results of using our algorithm on various datasets from the Aerospace Corp collection. The bolder the line, the stronger the anomaly. Note that because of the way we plotted these there is a tendency to locate the beginning of the anomaly as opposed to the most anomalous part. Aerospace Corp Problems H in addition to resting at side H in addition to above holster Aiming at target Actor misses holster Briefly swings gun at target, but does not aim Laughing in addition to flailing h in addition to 0 100 200 300 400 500 600 100 150 200 250 300 350 400 450 500 550 The parameter-free algorithm generalizes to multi-dimensional time series, without changing a single line of code!
Conclusions Parameter laden algorithms are bad! They are often hard to implement They make it hard to reproduce others results They make it difficult to judge the significance of a contribution They make it hard not to overfit They are next to useless to the non-specialist Parameter free/lite algorithms are good! They require little coding ef as long as t They make it trivial as long as others to reproduce your results They make it nearly impossible to overfit They have a real chance of being used by the non-specialist You should use a parameter free/lite approach be as long as e doing anything more complex! Questions All datasets in addition to code used in this talk can be found at www.cs.ucr.edu/~eamonn/TSDMA/index.html
Horan, Liza Contributing Writer
Horan, Liza is from United States and they belong to Racquet Sports Industry and they are from Vista, United States got related to this Particular Journal. and Horan, Liza deal with the subjects like Tennis/Racquet Sports
Journal Ratings by Sharon Regional Health System School of Nursing
This Particular Journal got reviewed and rated by Sharon Regional Health System School of Nursing and short form of this particular Institution is PA and gave this Journal an Excellent Rating.