Deterministic Annealing
Indiana University CS Theory Group, January 23, 2012
Geoffrey Fox


Geoffrey Fox, gcf@indiana.edu, http://www.infomall.org, http://www.futuregrid.org
Director, Digital Science Center, Pervasive Technology Institute
Associate Dean for Research and Graduate Studies, School of Informatics and Computing, Indiana University Bloomington

Abstract
We discuss the general theory behind deterministic annealing and illustrate it with applications to mixture models (including GTM and PLSA), clustering, and dimension reduction. We cover cases where the analyzed space has a metric and cases where it does not. We discuss the many open issues and possible further work for methods that appear to outperform the standard approaches but are in practice not used.

References
- Ken Rose, "Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems," Proceedings of the IEEE, 1998, 86: pp. 2210-2239. (References earlier papers, including his Caltech Elec. Eng. PhD, 1990.)
- T. Hofmann and J. M. Buhmann, "Pairwise data clustering by deterministic annealing," IEEE Transactions on Pattern Analysis and Machine Intelligence 19, pp. 1-13, 1997.
- Hansjörg Klock and Joachim M. Buhmann, "Data visualization by multidimensional scaling: a deterministic annealing approach," Pattern Recognition, Volume 33, Issue 4, April 2000, pp. 651-669.
- R. Frühwirth and W. Waltenberger, "Redescending M-estimators and Deterministic Annealing, with Applications to Robust Regression and Tail Index Estimation," Austrian Journal of Statistics 2008, 37(3&4): 301-317. http://www.stat.tugraz.at/AJS/ausg083+4/08306Fruehwirth.pdf
- Review: http://grids.ucs.indiana.edu/ptliupages/publications/pdac24g-fox.pdf
- Recent algorithm work by Seung-Hee Bae and Jong Youl Choi (Indiana CS PhDs): http://grids.ucs.indiana.edu/ptliupages/publications/CetraroWriteupJune11-09.pdf and http://grids.ucs.indiana.edu/ptliupages/publications/hpdc2010-submission-57.pdf


Some Goals
- We are building a library of parallel data mining tools that have the best known (to me) robustness and performance characteristics.
- Big data needs super algorithms.
- A lot of statistics tools (e.g. in R) are not the best algorithm and are not always well parallelized.
- Deterministic annealing (DA) is one of the better approaches to optimization: it tends to remove local optima, addresses overfitting, and is faster than simulated annealing.
- Return to my heritage (physics) with an approach I called Physical Computation (cf. also genetic algorithms): methods based on analogies to nature.
- Physical systems find the true lowest energy state if you anneal, i.e. you equilibrate at each temperature as you cool.

Some Ideas I
- Deterministic annealing is better than many well-used optimization methods.
- It started as the "Elastic Net" by Durbin for the Travelling Salesman Problem (TSP).
- The basic idea behind deterministic annealing is the mean field approximation, which is also used in "Variational Bayes" and many "neural network approaches".
- Markov chain Monte Carlo (MCMC) methods are roughly single-temperature simulated annealing.
- DA is less sensitive to initial conditions, avoids local optima, and is not equivalent to trying random initial starts.

Some non-DA Ideas II
- Dimension reduction gives low-dimensional mappings of data, both to visualize and to apply geometric hashing.
- No-vector problems (where one cannot define a metric space) are O(N²).
- For the no-vector case, one can develop O(N) or O(N log N) methods, as in Fast Multipole and OctTree methods.
- Map high-dimensional data to 3D and use classic methods developed originally to speed up O(N²) 3D particle dynamics problems.

Uses of Deterministic Annealing
- Clustering
  - Vectors: Rose (Gurewitz and Fox)
  - Clusters with fixed sizes and no tails (Proteomics team at Broad)
  - No vectors: Hofmann and Buhmann (just use pairwise distances)
- Dimension reduction for visualization and analysis
  - Vectors: GTM
  - No vectors: MDS (just use pairwise distances)
- Can apply to general mixture models (but less studied)
  - Gaussian Mixture Models
  - Probabilistic Latent Semantic Analysis with Deterministic Annealing (DA-PLSA) as an alternative to Latent Dirichlet Allocation (the typical information retrieval/global inference topic model)

Deterministic Annealing I
- Gibbs distribution at temperature T:
  P(θ) = exp(−H(θ)/T) / ∫ dθ exp(−H(θ)/T)
  or P(θ) = exp(−H(θ)/T + F/T)
- Minimize the free energy, combining the objective function and the entropy:
  F = ⟨H − T S(P)⟩ = ∫ dθ {P(θ) H(θ) + T P(θ) ln P(θ)}
  where θ are (a subset of) the parameters to be minimized.
- Simulated annealing corresponds to doing these integrals by Monte Carlo.
- Deterministic annealing corresponds to doing the integrals analytically (by the mean field approximation) and is naturally much faster than Monte Carlo.
- In each case the temperature is lowered slowly, say by a factor of 0.95 to 0.99 at each iteration.

Deterministic Annealing
(Figure: free energy F({y}, T) as a function of the configuration {y}; the minimum evolves as the temperature decreases.)
- Movement at a fixed temperature goes to a local minimum if not initialized "correctly".
- Solve linear equations at each temperature.
- Nonlinear effects are mitigated by initializing with the solution at the previous, higher temperature.
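To make the cooling schedule concrete, here is a minimal Python sketch (my illustration, not code from the slides) of the outer deterministic-annealing loop: equilibrate the analytic E/M update at each temperature, then lower T by a fixed factor in the 0.95-0.99 range. The arguments `em_update` and `converged` are hypothetical placeholders for the problem-specific steps described on the following slides.

```python
# Minimal deterministic-annealing driver (illustrative sketch).
# em_update(state, T) performs one analytic (mean-field) E/M update at temperature T.
# converged(old, new) decides whether the annealed variables have stopped changing.

def deterministic_annealing(state, em_update, converged,
                            t_start=100.0, t_min=0.01, cooling=0.95):
    T = t_start
    while T > t_min:
        # Equilibrate at this temperature by iterating the E/M update to convergence.
        while True:
            new_state = em_update(state, T)
            done = converged(state, new_state)
            state = new_state
            if done:
                break
        # Cool slowly; the solution at this T initializes the next, lower T.
        T *= cooling
    return state
```

The key design point, per the slides, is that each temperature is solved to convergence before cooling, so the higher-temperature solution acts as a smoothed initialization for the next one.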

Deterministic Annealing II
- For some cases, such as vector clustering and mixture models, one can do the integrals by hand, but usually this is impossible.
- So introduce a Hamiltonian H0(θ, ε) which, by choice of ε, can be made similar to the real Hamiltonian HR(θ) and which has tractable integrals.
- P0(θ) = exp(−H0(θ)/T + F0/T) approximates the Gibbs distribution for HR.
- FR(P0) = ⟨HR − T S0(P0)⟩|0 = ⟨HR − H0⟩|0 + F0(P0), where ⟨…⟩|0 denotes ∫ dθ P0(θ) ….
- It is easy to show for the real free energy (the Gibbs inequality) that FR(PR) ≤ FR(P0) (Kullback-Leibler divergence).
- The expectation step E finds the ε minimizing FR(P0); follow with the M step (of EM) setting θ = ⟨θ⟩|0 = ∫ dθ θ P0(θ) (mean field), and one follows with a traditional minimization of the remaining parameters.
- Note three types of variables: ε (used to approximate the real Hamiltonian), θ (subject to annealing), and the rest (optimized by traditional methods).

Implementation of DA Central Clustering
- The clustering variables are Mi(k) (these are the θ annealed in the general approach), where Mi(k) is the probability that point i belongs to cluster k and Σ_{k=1..K} Mi(k) = 1.
- In central or pairwise (PW) clustering, take H0 = Σ_{i=1..N} Σ_{k=1..K} Mi(k) εi(k).
- The linear form allows the DA integrals to be done analytically.
- Central clustering has εi(k) = (X(i) − Y(k))², with Mi(k) determined by the expectation step.
- HCentral = Σ_{i=1..N} Σ_{k=1..K} Mi(k) (X(i) − Y(k))²; HCentral and H0 are identical.
- ⟨Mi(k)⟩ = exp(−εi(k)/T) / Σ_{k'=1..K} exp(−εi(k')/T)
- Centers Y(k) are determined in the M step.

Implementation of DA-PWC
- The clustering variables are again Mi(k) (these are the θ in the general approach), where Mi(k) is the probability that point i belongs to cluster k.
- The pairwise clustering Hamiltonian is given by the nonlinear form
  HPWC = 0.5 Σ_{i=1..N} Σ_{j=1..N} δ(i, j) Σ_{k=1..K} Mi(k) Mj(k) / C(k)
  where δ(i, j) is the pairwise distance between points i and j and C(k) = Σ_{i=1..N} Mi(k) is the number of points in cluster k.
- Take the same form H0 = Σ_{i=1..N} Σ_{k=1..K} Mi(k) εi(k) as for central clustering.
- εi(k) is determined to minimize FPWC(P0) = ⟨HPWC − T S0(P0)⟩|0, where the integrals can be easily done.
- Now the linear (in Mi(k)) H0 and the quadratic HPWC are different.
- Again ⟨Mi(k)⟩ = exp(−εi(k)/T) / Σ_{k'=1..K} exp(−εi(k')/T).
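The central-clustering E and M steps above translate almost directly into array code. The following NumPy sketch (an illustration of the formulas above, not the authors' parallel implementation) performs one update at temperature T; at high T every membership approaches 1/K, which is why distinct clusters only emerge as T is lowered.

```python
import numpy as np

def da_central_clustering_step(X, Y, T):
    """One E/M update of DA central (vector) clustering at temperature T.

    X: (N, d) data points, Y: (K, d) current cluster centers.
    Returns updated centers and the soft memberships <M_i(k)>.
    """
    # eps_i(k) = (X(i) - Y(k))^2
    eps = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)      # (N, K)
    # E step: <M_i(k)> = exp(-eps_i(k)/T) / sum_k exp(-eps_i(k)/T)
    logits = -eps / T
    logits -= logits.max(axis=1, keepdims=True)                    # numerical stability
    M = np.exp(logits)
    M /= M.sum(axis=1, keepdims=True)
    # M step: centers Y(k) are membership-weighted means of the points
    Y_new = (M.T @ X) / M.sum(axis=0)[:, None]
    return Y_new, M
```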

General Features of DA
- Deterministic annealing (DA) is related to Variational Inference or Variational Bayes methods.
- In many problems, decreasing temperature is classic multiscale, i.e. finer resolution (T is "just" a distance scale).
- We have factors like (X(i) − Y(k))² / T.
- In clustering, one then looks at the second derivative matrix of FR(P0) with respect to the cluster positions; as the temperature is lowered, this develops a negative eigenvalue corresponding to an instability.
  - Or have multiple clusters at each center and perturb them.
- This is a phase transition: one splits the cluster into two and continues the EM iteration.
- One can start with just one cluster.

Rose, K., Gurewitz, E., and Fox, G. C., "Statistical mechanics and phase transitions in clustering," Physical Review Letters, 65(8):945-948, August 1990. (My 6th most cited article: 402 citations, including 15 in 2011.)

(Figure: start at T = "∞" with one cluster; as T decreases, clusters emerge at instabilities.)
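The "multiple clusters at each center and perturb" test can be read concretely as: duplicate a center with a tiny offset, iterate the fixed-temperature EM update, and see whether the copies separate or recollapse. Below is a rough, self-contained sketch for the central-clustering case; the perturbation size and threshold are arbitrary choices of mine, and this is the naive alternative to the eigenvalue analysis mentioned later, not that method itself.

```python
import numpy as np

def split_is_real(X, Y, k, T, n_iter=100, delta=1e-3, seed=0):
    """Perturbation test (illustrative): duplicate center k with a tiny offset,
    run fixed-temperature EM, and report whether the two copies separate
    (a genuine instability / phase transition) or fall back together."""
    rng = np.random.default_rng(seed)
    offset = delta * rng.standard_normal(Y.shape[1])
    Yt = np.vstack([Y, Y[k] + offset])          # extra copy of center k
    Yt[k] = Yt[k] - offset
    initial_gap = np.linalg.norm(Yt[k] - Yt[-1])
    for _ in range(n_iter):
        # fixed-T E step: softmax memberships over squared distances
        eps = ((X[:, None, :] - Yt[None, :, :]) ** 2).sum(axis=2)
        logits = -eps / T
        logits -= logits.max(axis=1, keepdims=True)
        M = np.exp(logits)
        M /= M.sum(axis=1, keepdims=True)
        # fixed-T M step: recompute centers
        Yt = (M.T @ X) / (M.sum(axis=0)[:, None] + 1e-12)
    # The gap grew substantially: the split is real (instability); otherwise it is not yet.
    return np.linalg.norm(Yt[k] - Yt[-1]) > 10 * initial_gap
```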

DA-PWC EM Steps (E steps were shown in red, M steps in black on the original slide); k runs over clusters, i and j over points:
1. A(k) = −0.5 Σ_{i=1..N} Σ_{j=1..N} δ(i, j) ⟨Mi(k)⟩ ⟨Mj(k)⟩ / C(k)²
2. Bi(k) = Σ_{j=1..N} δ(i, j) ⟨Mj(k)⟩ / C(k)
3. εi(k) = Bi(k) + A(k)
4. ⟨Mi(k)⟩ = exp(−εi(k)/T) / Σ_{k'=1..K} exp(−εi(k')/T)
5. C(k) = Σ_{i=1..N} ⟨Mi(k)⟩
Loop to converge these variables; decrease T from ∞.

Parallelize by distributing points across processes:
- Step 1 is a global sum (reduction).
- Steps 1, 2, 5 become local sums if ⟨Mi(k)⟩ is broadcast.
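The five DA-PWC steps map onto a few lines of array code. Below is an illustrative serial NumPy sketch of one sweep over steps 1-5 (the slides describe the parallel version, where step 1 needs a global reduction); the variable names are mine.

```python
import numpy as np

def da_pwc_em_step(D, M, T):
    """One DA-PWC E/M sweep at temperature T (steps 1-5 above, serial form).

    D: (N, N) symmetric matrix of pairwise distances delta(i, j).
    M: (N, K) current soft memberships <M_i(k)>.
    """
    C = M.sum(axis=0)                               # step 5 from the previous sweep: C(k)
    B = (D @ M) / C[None, :]                        # step 2: B_i(k) = sum_j delta(i,j) <M_j(k)> / C(k)
    A = -0.5 * np.einsum('ik,ik->k', M, B) / C      # step 1: A(k) = -0.5 sum_ij delta <M_i><M_j> / C(k)^2
    eps = B + A[None, :]                            # step 3: eps_i(k)
    logits = -eps / T                               # step 4: softmax over clusters, stabilized
    logits -= logits.max(axis=1, keepdims=True)
    M_new = np.exp(logits)
    M_new /= M_new.sum(axis=1, keepdims=True)
    return M_new
```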

Continuous Clustering I
- This is a subtlety introduced by Ken Rose but not clearly known in the community.
- Let's consider the "dynamic appearance" of clusters a little more carefully. Suppose we take a cluster k and split it into two, with centers Y(k)A and Y(k)B and initial values Y(k)A = Y(k)B at the original center Y(k).
- Then typically, if you make this change and perturb Y(k)A and Y(k)B, they will return to the starting position, as F is at a stable minimum. But an instability can develop, and one finds:
(Figure: behavior of F along the splitting direction Y(k)A − Y(k)B.)

Continuous Clustering II
- At the phase transition, when the eigenvalue corresponding to Y(k)A − Y(k)B goes negative, F is a minimum if the two split clusters move together but a maximum if they separate, i.e. two genuine clusters are formed at instability points.
- When you split, A(k), Bi(k), εi(k) are unchanged, and you would hope that the cluster counts C(k) and probabilities would be halved.
- Unfortunately that doesn't work except for one cluster splitting into two, due to the factor Zi = Σ_{k=1..K} exp(−εi(k)/T), with ⟨Mi(k)⟩ = exp(−εi(k)/T) / Zi.
- The naive solution is to examine explicitly the solution with A(k0), Bi(k0), εi(k0) unchanged and C(k0) and the probabilities halved for the split cluster k0 (0 ≤ k0 < K), with Zi = Σ_{k=1..K} w(k) exp(−εi(k)/T), w(k0) = 2, w(k ≠ k0) = 1.
- This works surprisingly well, but much more elegant is continuous clustering.

Continuous Clustering III
- You restate the problem to consider from the start an arbitrary number of cluster centers at each center, with p(k) the density of clusters at site k.
- All the clusters at a given site have the same parameters.
- Zi = Σ_{k=1..K} p(k) exp(−εi(k)/T), with ⟨Mi(k)⟩ = p(k) exp(−εi(k)/T) / Zi and Σ_{k=1..K} p(k) = 1.
- You can then consider p(k) as one of the non-annealed parameters (the centers Y(k) in central clustering were of this type) determined in the final M step. This gives p(k) = C(k) / N, which interestingly weights clusters according to their size, giving all points "equal weight".
- Initial investigation says it is similar in performance to the naive case.
- Now splitting is exact: p(k), C(k) and the probabilities are halved, while A(k), Bi(k), εi(k) are unchanged.
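Continuous clustering makes the split operation trivial bookkeeping: both copies inherit half the density p(k), half the count C(k) and half the probabilities, while A(k), Bi(k) and εi(k) stay untouched. A hypothetical helper along these lines (the names and the tiny symmetry-breaking offset are my own) illustrates this:

```python
import numpy as np

def split_cluster(p, C, Y, M, k, delta=1e-3, rng=None):
    """Exact split of cluster k in the continuous-clustering formulation.

    p: (K,) cluster densities, C: (K,) point counts, Y: (K, d) centers,
    M: (N, K) soft memberships <M_i(k)>. Returns the arrays with K+1 clusters."""
    rng = np.random.default_rng() if rng is None else rng
    offset = delta * rng.standard_normal(Y.shape[1])   # tiny nudge so EM can separate the copies
    p = np.append(p, p[k] / 2)                          # new copy gets half the density p(k)
    p[k] /= 2                                           # ...and so does the original
    C = np.append(C, C[k] / 2)                          # halve the point count C(k)
    C[k] /= 2
    M = np.hstack([M, M[:, [k]] / 2])                   # halve the probabilities <M_i(k)>
    M[:, k] /= 2
    Y = np.vstack([Y, Y[k] + offset])                   # duplicate the center; A, B, eps unchanged
    return p, C, Y, M
```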

DA-PWC EM Steps with p(k) (E steps were shown in red, M steps in black on the original slide); k runs over clusters, i and j over points:
1. A(k) = −0.5 Σ_{i=1..N} Σ_{j=1..N} δ(i, j) ⟨Mi(k)⟩ ⟨Mj(k)⟩ / C(k)²
2. Bj(k) = Σ_{i=1..N} δ(i, j) ⟨Mi(k)⟩ / C(k)
3. εi(k) = Bi(k) + A(k)
4. ⟨Mi(k)⟩ = p(k) exp(−εi(k)/T) / Σ_{k'=1..K} p(k') exp(−εi(k')/T)
5. C(k) = Σ_{i=1..N} ⟨Mi(k)⟩
6. p(k) = C(k) / N
Loop to converge these variables; decrease T from ∞; split centers by halving p(k).
- Step 1 is a global sum (reduction).
- Steps 1, 2, 5 become local sums if ⟨Mi(k)⟩ is broadcast.

Note on Performance
- The algorithms parallelize well, with a typical speedup of 500 on 768 cores; parallelization is very straightforward.
- The calculation of eigenvectors of the second derivative matrix in the pairwise case is roughly 80% of the effort.
  - Need to use the power method to find the leading eigenvector for each cluster.
  - The eigenvector is of length N (the number of points) for the pairwise case; in central clustering it is of length equal to the dimension of the space.
- To do: compare the calculation of eigenvectors with splitting and perturbing each cluster center and seeing if it is stable.
- Note that the eigenvector method tells you the direction of the instability.

Trimmed Clustering
- "Clustering with position-specific constraints on variance: Applying redescending M-estimators to label-free LC-MS data analysis" (Rudolf Frühwirth, D. R. Mani and Saumyadipta Pyne), BMC Bioinformatics 2011, 12:358.
- HTCC = Σ_{k=0..K} Σ_{i=1..N} Mi(k) f(i, k)
  - f(i, k) = (X(i) − Y(k))² / 2σ(k)²  for k > 0
  - f(i, 0) = c² / 2  for k = 0
- The 0'th cluster captures (at zero temperature) all points outside clusters (the background).
- Clusters are trimmed: (X(i) − Y(k))² / 2σ(k)² < c² / 2.
- Another case where H0 is the same as the target Hamiltonian.
(Figure: Proteomics Mass Spectrometry.)
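Reading the trimmed-clustering Hamiltonian as a standard DA cost with one extra, constant-cost background cluster suggests a sketch like the following for the soft memberships (my own illustration, not the implementation from the BMC Bioinformatics paper):

```python
import numpy as np

def trimmed_cluster_responsibilities(X, Y, sigma, c, T):
    """Soft memberships for DA trimmed clustering (illustrative sketch).

    Cluster 0 is the background with constant cost c**2 / 2; clusters k > 0
    have cost (X(i) - Y(k))**2 / (2 * sigma(k)**2). As T -> 0 a point goes to
    the background whenever it lies outside every cluster's trimming radius.
    """
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
    f = d2 / (2.0 * sigma[None, :] ** 2)                       # f(i, k) for k = 1..K
    f0 = np.full((X.shape[0], 1), c * c / 2.0)                 # f(i, 0), background cost
    F = np.hstack([f0, f])                                     # column 0 is the background
    logits = -F / T
    logits -= logits.max(axis=1, keepdims=True)                # numerical stability
    M = np.exp(logits)
    return M / M.sum(axis=1, keepdims=True)
```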

High Performance Dimension Reduction and Visualization
- The need is pervasive: large, high-dimensional data are everywhere (biology, physics, the Internet), and visualization can help data analysis.
- Visualization of large datasets with high performance: map high-dimensional data into low dimensions (2D or 3D).
- Need parallel programming for processing large data sets.
- Developing high performance dimension reduction algorithms:
  - MDS (Multi-Dimensional Scaling)
  - GTM (Generative Topographic Mapping)
  - DA-MDS (Deterministic Annealing MDS)
  - DA-GTM (Deterministic Annealing GTM)
- Interactive visualization tool: PlotViz.

Multidimensional Scaling (MDS)
- Map points in high dimension to lower dimensions.
- There are many such dimension reduction algorithms (PCA, Principal Component Analysis, is the easiest); the simplest but perhaps best at times is MDS.
- Minimize the stress
  σ(X) = Σ_{i<j} weight(i, j) (δ(i, j) − d(Xi, Xj))²
  where δ(i, j) is the given dissimilarity between points i and j and d(Xi, Xj) is the distance between their images Xi, Xj in the low-dimensional embedding.
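For concreteness, here is a small NumPy sketch of the weighted stress objective above (evaluation only, not the parallel DA-MDS optimizer itself):

```python
import numpy as np

def mds_stress(X, delta, weight=None):
    """Weighted MDS stress: sum over i < j of weight(i,j) * (delta(i,j) - d(Xi, Xj))**2.

    X: (N, m) low-dimensional embedding, delta: (N, N) target dissimilarities,
    weight: optional (N, N) weights (defaults to all ones).
    """
    if weight is None:
        weight = np.ones_like(delta)
    # Pairwise Euclidean distances of the embedded points
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=2))
    iu = np.triu_indices(len(X), k=1)              # i < j pairs only
    return float((weight[iu] * (delta[iu] - d[iu]) ** 2).sum())
```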
