WO2000016250A1 - Data decomposition/reduction method for visualizing data clusters/sub-clusters - Google Patents
Data decomposition/reduction method for visualizing data clusters/sub-clusters
- Publication number
- WO2000016250A1 (PCT/US1999/021363)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- level
- clusters
- projection
- visualization
- Prior art date
Links
- 238000000034 method Methods 0.000 title claims description 25
- 230000009467 reduction Effects 0.000 title claims description 15
- 238000000354 decomposition reaction Methods 0.000 title claims description 12
- 238000012800 visualization Methods 0.000 claims abstract description 68
- 238000012545 processing Methods 0.000 claims abstract description 23
- 238000000513 principal component analysis Methods 0.000 claims abstract description 18
- 230000000007 visual effect Effects 0.000 claims abstract description 11
- 239000011159 matrix material Substances 0.000 claims description 13
- 230000008569 process Effects 0.000 claims description 9
- 230000003044 adaptive effect Effects 0.000 claims description 7
- 238000000605 extraction Methods 0.000 claims description 7
- 230000000295 complement effect Effects 0.000 claims description 4
- 238000007621 cluster analysis Methods 0.000 claims 4
- 230000004044 response Effects 0.000 claims 1
- 239000000203 mixture Substances 0.000 abstract description 15
- 238000009826 distribution Methods 0.000 abstract description 8
- 238000007476 Maximum Likelihood Methods 0.000 description 6
- 238000004458 analytical method Methods 0.000 description 4
- 238000013459 approach Methods 0.000 description 4
- 230000000694 effects Effects 0.000 description 4
- 238000013079 data visualisation Methods 0.000 description 3
- 238000010586 diagram Methods 0.000 description 3
- 238000010187 selection method Methods 0.000 description 3
- 238000010276 construction Methods 0.000 description 2
- 238000013506 data mapping Methods 0.000 description 2
- 238000001514 detection method Methods 0.000 description 2
- 238000011156 evaluation Methods 0.000 description 2
- 238000009472 formulation Methods 0.000 description 2
- 238000000926 separation method Methods 0.000 description 2
- 238000000638 solvent extraction Methods 0.000 description 2
- 238000013528 artificial neural network Methods 0.000 description 1
- 230000008901 benefit Effects 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 238000007405 data analysis Methods 0.000 description 1
- 230000006870 function Effects 0.000 description 1
- 238000003384 imaging method Methods 0.000 description 1
- 230000000977 initiatory effect Effects 0.000 description 1
- 238000011835 investigation Methods 0.000 description 1
- 238000013507 mapping Methods 0.000 description 1
- 238000002156 mixing Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000003936 working memory Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/0002—Inspection of images, e.g. flaw detection
- G06T7/0012—Biomedical image inspection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/231—Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/40—Software arrangements specially adapted for pattern recognition, e.g. user interfaces or toolboxes therefor
Definitions
- the present invention relates generically to the field of data analysis and data presentation and, more particularly, to the analysis of data sets having higher dimensionality data points in order to optimally present the data in a lower-dimensional context, i.e., in a hierarchy of two- or three-dimensional visual contexts to reveal data structures within the data set.
- the visualization of data sets having a large number of data points with multiple variables or attributes associated with each data point represents a complex problem.
- it is difficult, a priori, to easily identify groups or subgroups of data points that have relational attributes such that structures and sub-structures existing within the data set can be visualized.
- Various techniques have been developed for processing the data sets to reveal internal structures as an aid to understanding the data.
- a large data set will oftentimes have data points that are multi-variant, that is, a single data point can have a multitude of attributes, including attributes that are completely independent from one another or have some degree of inter-attribute relationship or dependency.
- a single projection of a higher-order data set onto a visualization space may not be able to present all of the structures and substructures within the data set of interest in such a way that the structures or sub-structures can be visually distinguished or discriminated.
- one presentation schema involves hierarchical visualization, by which the data set is first viewed at a highest-level, whole-data-set viewpoint. Thereafter, features within the highest-level projection are identified in accordance with an algorithm(s) or other identification criteria, and those next-highest-level features are further processed to reveal their respective internal structure in another projection(s).
- This hierarchical process can be repeated for successive levels to present successively finer and more detailed views of the data set.
- in a hierarchical visualization scheme, an image tree is provided with the successively lower images of the tree revealing more detail.
- the data set is subjected by Bishop and Tipping to a form of linear latent variable modelling to find a representation of the multidimensional data set in terms of two latent, or "hidden," variables that are determined indirectly from the data set.
- the modelling is similar to principal component analysis, but defines a probability density in the data space.
- a single top-level latent variable model is generated with the posterior mean of each data point plotted in the latent space. Any cluster centers identified in this initial plot are used as the basis for initiating the next-lower-level analysis leading to a mixture of the latent variable models.
- the parameters, including the optimal projections, are determined by maximum likelihood; this criterion need not always lead to the most interesting or interpretable visualization plots.
Disclosure of Invention
- the present invention provides a data decomposition/reduction method for visualizing large sets of multi-variant data, including the processing of the multi-variant data down to two- or three-dimensional space in order to optimally reveal otherwise hidden structures within the data set, including the principal data cluster or clusters at a first or top level of processing and additional sub-clusters within the principal data clusters in successive lower-level visualizations.
- the identification of the morphology of clusters and subclusters and inter-cluster separation and relative positioning within a large data set allows investigation of the underlying drive that created the data set morphology and the intra-data-set features.
- the data set, constituted by a multitude of data points each having a plurality of attributes, is initially processed as a whole using multiple finite normal mixture models and hierarchical visualization spaces to develop the multi-level data visualization and interpretation.
- the top-level model and its projection explain the entire data set revealing the presence of clusters and cluster relationships, while lower-level models and projections display internal structure within individual clusters, such as the presence of subclusters, which might not be apparent in the higher-level models and projections.
- each level is relatively simple, while the complete hierarchy maintains overall flexibility and still conveys considerable structural information.
- the arrangement combines (a) minimax entropy modeling by which the models are determined and various parameters estimated and (b) principal component analysis to optimize structure decomposition and dimensionality reduction.
- the present invention advantageously performs a probabilistic principal component analysis to project the softly partitioned data space down to a desired two-dimensional visualization space, leading to an optimal dimensionality reduction and allowing the best extraction and visualization of local clusters.
- the minimax entropy principle is used to select the model structures and estimate their parameter values, where the soft partitioning of the data set results in a standard finite normal mixture model with minimum conditional bias and variance.
- the present invention treats structure decomposition and dimensionality reduction as two separate but complementary operations, where the criterion used to optimize dimensionality reduction is the separation of clusters rather than the maximum likelihood approach of Bishop and Tipping.
- the resulting projections, in turn, enhance the performance of structure decomposition at the next lower level.
- a model selection procedure is applied to determine the number of subclusters inside each cluster at each level using an information-theoretic criterion based upon the minimum of alternate calculations of the Akaike Information Criterion (AIC) and the minimum description length (MDL) criterion. This determination allows the process of the present invention to automatically determine whether a further split of a subspace should be implemented or whether to terminate further processing, as sketched below.
- AIC Akaike Information Criterion
- MDL minimum description length
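As an illustration of this split/terminate decision, the following is a minimal sketch (not the patent's own formulation) that scores candidate numbers of mixture components with AIC and with MDL, here approximated by BIC, using scikit-learn's GaussianMixture and keeping the K giving the smallest criterion value; the function name `select_num_clusters` is hypothetical:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_num_clusters(X, k_max=6, seed=0):
    """Pick the number of mixture components by the minimum of AIC and MDL.

    Illustrative sketch only: MDL is approximated here by BIC
    (-2 log L + p log N), one common formulation; the patent's exact
    criteria are not reproduced.
    """
    best_k, best_score = 1, np.inf
    for k in range(1, k_max + 1):
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(X)
        aic = gmm.aic(X)          # -2 log L + 2 * n_params
        mdl = gmm.bic(X)          # -2 log L + n_params * log(N)  (BIC as MDL proxy)
        score = min(aic, mdl)
        if score < best_score:
            best_k, best_score = k, score
    return best_k
```

In a hierarchical run, the same routine would be invoked on the data assigned to each cluster to decide whether that cluster should be split further or left as a leaf.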
- a probabilistic adaptive principal component extraction (PAPEX) algorithm is also applied to estimate the desired number of principal axes. When the dimensionality of the raw data is high, this PAPEX approach is computationally very efficient.
- the present invention defines a probability distribution in data space which naturally induces a corresponding distribution in projection space through a Radon transform. This defined probability distribution permits an independent procedure for determining values of the intrinsic model parameters without concurrent estimation of the projection mapping matrix.
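As a hedged illustration (a standard property of linear projections of Gaussian mixtures, not an equation reproduced from the patent): if the data-space density is a finite normal mixture, the induced density in the projection space x = Wᵀt is again a finite normal mixture with unchanged mixing proportions:

$$
p(t)=\sum_{k=1}^{K_0}\pi_k\,\mathcal{N}\!\left(t\mid \mu_{tk},\,C_{tk}\right)
\;\Longrightarrow\;
p(x)=\sum_{k=1}^{K_0}\pi_k\,\mathcal{N}\!\left(x\mid W^{\mathsf T}\mu_{tk},\,W^{\mathsf T}C_{tk}W\right),
\qquad x=W^{\mathsf T}t .
$$

This is consistent with estimating the intrinsic model parameters in data space independently of the projection mapping matrix.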
- the underlying "drive" that gives rise to the data points often causes those points to form clusters because more than one variable may be a function of that same underlying drive.
- the data set (designated herein as the t-space) is projected onto a single x-space (i.e., two-dimensional space), in which a descriptor W is determined from the sample covariance matrix C_t by fitting a single Gaussian model to the data set over t-space.
- a value f(t) is then determined for K_0, in which the values of π_k, z_ik, μ_tk, and C_tk are further refined by maximizing the likelihood over t-space.
- G_k(t) is determined by repeating the above process steps to thus construct multiple x-subspaces at the third level; the hierarchy is completed under the information-theoretic criteria using the AIC and the MDL, and all x-space subspaces are plotted for visual evaluation.
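The construction just outlined — a single Gaussian fit and global projection at the top level, an SFNM of K_0 submodels refined over t-space at the second level, and a repetition of the same step within each subspace at the third — can be sketched in Python. This is a minimal, illustrative sketch assuming scikit-learn's GaussianMixture for the SFNM/EM fit and a direct eigendecomposition of each posterior-weighted covariance for the local axes; the names `top_two_axes` and `hierarchical_projections` are hypothetical, not the patent's:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def top_two_axes(cov):
    # W: eigenvectors of the covariance for the two largest eigenvalues
    vals, vecs = np.linalg.eigh(cov)
    return vecs[:, np.argsort(vals)[::-1][:2]]

def hierarchical_projections(T, k0):
    """Two-level sketch: global projection, then one projection per cluster."""
    # Top level: single Gaussian over t-space, W from the sample covariance C_t
    mu, C_t = T.mean(axis=0), np.cov(T, rowvar=False)
    W = top_two_axes(C_t)
    top_view = (T - mu) @ W

    # Second level: SFNM with k0 submodels fitted over t-space (EM)
    gmm = GaussianMixture(n_components=k0).fit(T)
    z = gmm.predict_proba(T)                      # posteriors z_ik
    sub_views = []
    for k in range(k0):
        w_k = z[:, k][:, None]
        mu_k = (w_k * T).sum(0) / w_k.sum()
        D = T - mu_k
        C_tk = (w_k * D).T @ D / w_k.sum()        # posterior-weighted covariance
        sub_views.append(D @ top_two_axes(C_tk))  # local 2-D subspace
    return top_view, sub_views
```

The third level would repeat the same fit-then-project step on the data soft-assigned to each submodel, with the split/terminate decision made by the AIC/MDL selection described above.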
- the present invention advantageously provides a data decomposition/reduction method for visualizing data clusters/sub-clusters within a large data space that is optimally effective and computationally efficient.
- FIG. 1 is a schematic block diagram of a system for processing a raw multi-variant data set in accordance with the present invention;
- FIG. 2 is a flow diagram of the process flow of the present invention
- FIG. 2A is an alternative visualization of the process flow of the present invention.
- FIG. 3 is an example of the projection of a data set onto a 2-dimensional visualization space after determination of the principal axis;
- FIG. 4A is a 2-dimensional visualization space of one of the clusters of FIG. 3;
- FIG. 4B is a 2-dimensional visualization space of another of the clusters of FIG. 3;
- FIG. 5 is an example of the projection of a data set onto a 2-dimensional visualization space after determination of the principal axis;
- FIG. 6A is a 2-dimensional visualization space of one of the clusters of FIG. 5;
- FIG. 6B is a 2-dimensional visualization space of a second of the clusters of FIG. 5;
- FIG. 6C is a 2-dimensional visualization space of a third of the clusters of FIG. 5.
- a processing system for implementing the dimensionality reduction using probabilistic principal component analysis and structure decomposition using adaptive expectation maximization methods for visualizing data in accordance with the present invention is shown in FIG. 1 and designated generally therein by the reference character 10.
- the system 10 includes a working memory 12 that accepts the raw multi-variant data set, indicated at 14, and which bi-directionally interfaces with a processor 16.
- the processor 16 processes the raw t-space data set 14 as explained in more detail below and presents that data to a graphical user interface (GUI) 18 which presents a two- or three-dimensional visual presentation to the user as also explained below.
- GUI graphical user interface
- a plotter or printer 20 can be provided to generate a printed record of the display output of the graphical user interface (GUI) .
- the processor 16 may take the form of a software- or firmware-programmed CPU, ALU, ASIC, or microprocessor, or a combination thereof.
- the data set is subjected to a global principal component analysis to thereafter effect a top-most projection.
- This step is initiated by determining the value of a variable W for the top-most projection in the hierarchy of projections.
- W is directly found by evaluating the covariance matrix C_t.
- APEX adaptive principal component extraction
- the two-step expectation maximization (EM) algorithm can be applied to allow a standard finite normal mixture model (SFNM), i.e., f(t) = Σ_k π_k g(t | μ_tk, C_tk) for k = 1, ..., K_0, where g(·) denotes a Gaussian component density, to be fitted to the data.
- the standard finite normal mixture (SFNM) modeling solution addresses the estimation of the regional parameters (π_k, μ_tk) and the detection of the structural parameter K_0 in the relationship given above.
- the EM algorithm is implemented as a two-step process, i.e., the E-step and the M-step as follows:
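The E-step and M-step equations themselves do not survive in this text; the following is a minimal sketch of a standard EM update for a finite normal mixture (posterior responsibilities in the E-step, weighted re-estimation of π_k, μ_k, and C_k in the M-step), offered as an illustration rather than the patent's exact recursion:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_sfnm(X, K, n_iter=100, seed=0):
    """Minimal EM sketch for a finite normal mixture (not the patent's exact equations)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)]
    C = np.array([np.cov(X, rowvar=False) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: posterior z_ik that point i belongs to component k
        z = np.column_stack(
            [pi[k] * multivariate_normal.pdf(X, mu[k], C[k]) for k in range(K)])
        z /= z.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing proportions, means, and covariances
        Nk = z.sum(axis=0)
        pi = Nk / N
        mu = (z.T @ X) / Nk[:, None]
        for k in range(K):
            D = X - mu[k]
            C[k] = (z[:, k][:, None] * D).T @ D / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, C, z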
- the values of Akaike's Information Criterion (AIC) and the Minimum Description Length (MDL) are evaluated for each candidate K, with selection of the model in which K corresponds to the minimum of these criteria.
- the values of EQ. 9 are then used as the initial means of the respective submodels. Since the mixing proportions π_k are projection-invariant, a 2 x 2 unit matrix is assigned to the remaining parameters of the covariance matrix C_tk.
- the expectation-maximization (EM) algorithm can be again applied to allow a standard finite normal mixture (SFNM) with K_0 submodels to be fitted to the data over t-space.
- SFNM standard finite normal mixture
- the corresponding EM algorithm can be derived by replacing all x in the E-step and the M-step equations, above, by t.
- C_tk can be directly evaluated to obtain W_k as described above.
- an algorithm termed the probabilistic adaptive principal component extraction (PAPEX) is applied, with updates of the form
- w_k(i+1) = w_k(i) + η [ y_k(i) t_k(i) − y_k²(i) w_k(i) ]
- a_k(i+1) = a_k(i) − η [ y_k(i) y(i) + y_k²(i) a_k(i) ]
- W_k converges to the eigenvector associated with the second largest eigenvalue of the covariance matrix C_k.
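A rough sketch of the PAPEX idea follows, under the assumption that it behaves like the APEX family of adaptive rules cited in the references, with the "probabilistic" element entering through posterior-weighted effective inputs; this is illustrative only, and the patent's exact recursion and step-size schedule are not reproduced:

```python
import numpy as np

def papex_like_axes(T, z_k, mu_k, eta=0.01, epochs=20, seed=0):
    """APEX-style adaptive estimate of two principal axes of a posterior-weighted
    cluster (a sketch of the PAPEX idea, not the patent's exact recursion)."""
    rng = np.random.default_rng(seed)
    d = T.shape[1]
    w1 = rng.standard_normal(d); w1 /= np.linalg.norm(w1)
    w2 = rng.standard_normal(d); w2 /= np.linalg.norm(w2)
    a = 0.0                                      # lateral weight coupling unit 2 to unit 1
    for _ in range(epochs):
        for i in rng.permutation(len(T)):
            t = np.sqrt(z_k[i]) * (T[i] - mu_k)  # effective (posterior-weighted) input
            y1 = w1 @ t
            w1 += eta * (y1 * t - y1**2 * w1)    # Oja/APEX rule: first principal axis
            y2 = w2 @ t + a * y1                 # output of second unit with lateral term
            w2 += eta * (y2 * t - y2**2 * w2)
            a  -= eta * (y1 * y2 + y2**2 * a)    # anti-Hebbian lateral update
    return np.column_stack([w1 / np.linalg.norm(w1), w2 / np.linalg.norm(w2)])
```

Here w1 follows an Oja-type rule toward the first principal axis of the weighted inputs, while the lateral weight a decorrelates the second unit from the first, so that at convergence w2 approximates the second axis without ever forming the full covariance matrix.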
- the determination of the parameters of the models at the third level can again be viewed as a two-step estimation problem, in which a further split of the models at the second level is determined within each of the subspaces over x-space, and then the parameters of the selected models are fine-tuned over t-space.
- the learning of the corresponding mixture over x-space can again be performed using the expectation-maximization (EM) algorithm and the model selection procedures described above.
- the third-level EM algorithm has the same form as the EM algorithm at the second level, except that in the E-step, the posterior probability that a data point x_i belongs to submodel j is given by
- the values of EQ. 19 are then used to initialize the means of the respective submodels, and the expectation maximization (EM) algorithm can be applied to allow a standard finite normal mixture (SFNM) distribution with K_0 submodels to be fitted to the data over t-space.
- the formulation can be derived by simply replacing all x in the second-level M-step by t. With the resulting z_i(k,j) in t-space, the PAPEX algorithm can be applied to estimate W_(k,j), in which the effective input values are expressed by
- t̃_i = √( z_i(k,j) ) ( t_i − μ_(k,j) )   EQ. 20
- the next-level visualization subspace is generated by plotting each data point t_i at the corresponding projected location in the x-space of that submodel.
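A small sketch of how such a next-level plot could be produced, assuming a posterior vector z_k, a submodel mean mu_k, and local axes W_k obtained as above (matplotlib-based and illustrative only; `plot_subspace` is a hypothetical name):

```python
import matplotlib.pyplot as plt

def plot_subspace(T, z_k, mu_k, W_k, ax=None):
    """Scatter the next-level 2-D view of one submodel: each t_i is plotted at
    its projection onto the local axes W_k, shaded by its posterior z_ik."""
    x = (T - mu_k) @ W_k                 # N x 2 coordinates in the local x-space
    ax = ax or plt.gca()
    ax.scatter(x[:, 0], x[:, 1], c=z_k, cmap="viridis", s=10)
    ax.set_xlabel("local axis 1"); ax.set_ylabel("local axis 2")
    return ax
```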
- a first exemplary two-level implementation of the present invention is shown in FIGS. 3, 4A, and 4B, in which the entire data set is present in the top-level projection and two local clusters within that top-level projection are each individually presented in FIGS. 4A and 4B.
- the entire data set is subjected to principal component analysis as described above to obtain the principal axis or axes (axis A_x being representative) for the top-level display. Additionally, the axis (unnumbered) for each of the apparent clusters is displayed. Thereafter, the apparent centers of the two clusters are identified and the data subjected to the aforementioned processing to further reveal the local cluster of FIG. 4A and the local cluster of FIG. 4B.
- a second exemplary two-level implementation of the present invention is shown in FIGS. 5, 6A, 6B, and 6C, in which the entire data set is present in the top-level projection and three local clusters within that top-level projection are each individually presented in FIGS. 6A, 6B, and 6C.
- the entire data set is subjected to principal component analysis as described above to obtain the principal axis (A_x) and the axis (unnumbered) for each of the apparent clusters, as displayed.
- the t-space raw data set arises from a mixture of three Gaussians consisting of 300 data points as presented in FIG. 5.
- two cloud-like clusters are well separated while a third cluster appears spaced in between the two well-separated cloud-like clusters.
- the second-level visual space is generated with a mixture of two local principal component axis subspaces, where the line A_x indicates the global principal axis.
- the plot on the "right" of FIG. 5 shows evidence of a further split.
- a hierarchical model is adopted, which illustrates that there are indeed a total of three clusters within the data set, as shown in FIGS. 6A, 6B, and 6C.
- An alternate visualization of the process flow of the present invention is shown in FIG. 2A.
- the present invention has use in all applications requiring the analysis of data, particularly multi-dimensional data, for the purpose of optimally visualizing various underlying structures and distributions present within the universe of data. Applications include the detection of data clusters and sub-clusters and their relative relationships in areas of medical, industrial, geophysical imaging, and digital library processing, for example.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
- Radiology & Medical Imaging (AREA)
- Health & Medical Sciences (AREA)
- Quality & Reliability (AREA)
- Probability & Statistics with Applications (AREA)
- Human Computer Interaction (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Complex Calculations (AREA)
- Image Analysis (AREA)
Abstract
Description
Claims
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP99946966A EP1032918A1 (en) | 1998-09-17 | 1999-09-17 | Data decomposition/reduction method for visualizing data clusters/sub-clusters |
CA002310333A CA2310333A1 (en) | 1998-09-17 | 1999-09-17 | Data decomposition/reduction method for visualizing data clusters/sub-clusters |
AU59262/99A AU5926299A (en) | 1998-09-17 | 1999-09-17 | Data decomposition/reduction method for visualizing data clusters/sub-clusters |
JP2000570715A JP2002525719A (en) | 1998-09-17 | 1999-09-17 | Data decomposition / reduction method for visualizing data clusters / subclusters |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10062298P | 1998-09-17 | 1998-09-17 | |
US60/100,622 | 1998-09-17 | ||
US39842199A | 1999-09-17 | 1999-09-17 | |
US09/398,421 | 1999-09-17 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2000016250A1 true WO2000016250A1 (en) | 2000-03-23 |
Family
ID=26797375
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US1999/021363 WO2000016250A1 (en) | 1998-09-17 | 1999-09-17 | Data decomposition/reduction method for visualizing data clusters/sub-clusters |
Country Status (5)
Country | Link |
---|---|
EP (1) | EP1032918A1 (en) |
JP (1) | JP2002525719A (en) |
AU (1) | AU5926299A (en) |
CA (1) | CA2310333A1 (en) |
WO (1) | WO2000016250A1 (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7440986B2 (en) | 2003-12-17 | 2008-10-21 | International Business Machines Corporation | Method for estimating storage requirements for a multi-dimensional clustering data configuration |
US9202178B2 (en) | 2014-03-11 | 2015-12-01 | Sas Institute Inc. | Computerized cluster analysis framework for decorrelated cluster identification in datasets |
CN105447001A (en) * | 2014-08-04 | 2016-03-30 | 华为技术有限公司 | Dimensionality reduction method and device for high dimensional data |
US9424337B2 (en) | 2013-07-09 | 2016-08-23 | Sas Institute Inc. | Number of clusters estimation |
US9996543B2 (en) | 2016-01-06 | 2018-06-12 | International Business Machines Corporation | Compression and optimization of a specified schema that performs analytics on data within data systems |
CN110287978A (en) * | 2018-03-19 | 2019-09-27 | 国际商业机器公司 | For having the computer implemented method and computer system of the machine learning of supervision |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4670010B2 (en) * | 2005-10-17 | 2011-04-13 | 株式会社国際電気通信基礎技術研究所 | Mobile object distribution estimation device, mobile object distribution estimation method, and mobile object distribution estimation program |
US8239379B2 (en) * | 2007-07-13 | 2012-08-07 | Xerox Corporation | Semi-supervised visual clustering |
US20090232388A1 (en) * | 2008-03-12 | 2009-09-17 | Harris Corporation | Registration of 3d point cloud data by creation of filtered density images |
JP5332647B2 (en) * | 2009-01-23 | 2013-11-06 | 日本電気株式会社 | Model selection apparatus, model selection apparatus selection method, and program |
JP6586764B2 (en) * | 2015-04-17 | 2019-10-09 | 株式会社Ihi | Data analysis apparatus and data analysis method |
US11847132B2 (en) | 2019-09-03 | 2023-12-19 | International Business Machines Corporation | Visualization and exploration of probabilistic models |
-
1999
- 1999-09-17 CA CA002310333A patent/CA2310333A1/en not_active Abandoned
- 1999-09-17 WO PCT/US1999/021363 patent/WO2000016250A1/en not_active Application Discontinuation
- 1999-09-17 JP JP2000570715A patent/JP2002525719A/en active Pending
- 1999-09-17 EP EP99946966A patent/EP1032918A1/en not_active Withdrawn
- 1999-09-17 AU AU59262/99A patent/AU5926299A/en not_active Abandoned
Non-Patent Citations (8)
Title |
---|
AKAIKE H: "A NEW LOOK AT THE STATISTICAL MODEL IDENTIFICATION", IEEE TRANSACTIONS ON AUTOMATIC CONTROL,US,IEEE INC. NEW YORK, vol. AC-19, no. 6, December 1974 (1974-12-01), pages 716-723, XP000675871, ISSN: 0018-9286 * |
ANONYMOUS: "Data Preprocessing With Clustering Algorithms.", IBM TECHNICAL DISCLOSURE BULLETIN, vol. 33, no. 10B, March 1991 (1991-03-01), New York, US, pages 26 - 27, XP000109861 * |
ANONYMOUS: "Multivariate Statistical Data Reduction Method", IBM TECHNICAL DISCLOSURE BULLETIN, vol. 36, no. 4, April 1993 (1993-04-01), New York, US, pages 181 - 184, XP000364481 * |
BISHOP C M ET AL: "A HIERARCHICAL LATENT VARIABLE MODEL FOR DATA VISUALIZATION", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE,US,IEEE INC. NEW YORK, vol. 20, no. 3, March 1998 (1998-03-01), pages 281-293, XP000767918, ISSN: 0162-8828 * |
CHATTERJEE C ET AL: "ON SELF-ORGANIZING ALGORITHMS AND NETWORKS FOR CLASS-SEPARABILITY FEATURES", IEEE TRANSACTIONS ON NEURAL NETWORKS,US,IEEE INC, NEW YORK, vol. 8, no. 3, May 1997 (1997-05-01), pages 663-678, XP000656917, ISSN: 1045-9227 * |
JIANCHANG MAO ET AL: "ARTIFICIAL NEURAL NETWORKS FOR FEATURE EXTRACTION AND MULTIVARIATE DATA PROJECTION", IEEE TRANSACTIONS ON NEURAL NETWORKS,US,IEEE INC, NEW YORK, vol. 6, no. 2, 2 March 1995 (1995-03-02), pages 296-316, XP000492664, ISSN: 1045-9227 * |
KUNG S Y ET AL: "ADAPTIVE PRINCIPAL COMPONENT EXTRACTION (APEX) AND APPLICATIONS", IEEE TRANSACTIONS ON SIGNAL PROCESSING,US,IEEE, INC. NEW YORK, vol. 42, no. 5, May 1994 (1994-05-01), pages 1202-1216, XP000460366, ISSN: 1053-587X * |
PAO Y -H ET AL: "Visualization of pattern data through learning of non-linear variance-conserving dimension-reduction mapping", PATTERN RECOGNITION,US,PERGAMON PRESS INC. ELMSFORD, N.Y, vol. 30, no. 10, 1 October 1997 (1997-10-01), pages 1705-1717, XP004094254, ISSN: 0031-3203 * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7440986B2 (en) | 2003-12-17 | 2008-10-21 | International Business Machines Corporation | Method for estimating storage requirements for a multi-dimensional clustering data configuration |
US7912798B2 (en) | 2003-12-17 | 2011-03-22 | International Business Machines Corporation | System for estimating storage requirements for a multi-dimensional clustering data configuration |
US9424337B2 (en) | 2013-07-09 | 2016-08-23 | Sas Institute Inc. | Number of clusters estimation |
US9202178B2 (en) | 2014-03-11 | 2015-12-01 | Sas Institute Inc. | Computerized cluster analysis framework for decorrelated cluster identification in datasets |
CN105447001A (en) * | 2014-08-04 | 2016-03-30 | 华为技术有限公司 | Dimensionality reduction method and device for high dimensional data |
US9996543B2 (en) | 2016-01-06 | 2018-06-12 | International Business Machines Corporation | Compression and optimization of a specified schema that performs analytics on data within data systems |
CN110287978A (en) * | 2018-03-19 | 2019-09-27 | 国际商业机器公司 | For having the computer implemented method and computer system of the machine learning of supervision |
CN110287978B (en) * | 2018-03-19 | 2023-04-25 | 国际商业机器公司 | Computer-implemented method and computer system for supervised machine learning |
Also Published As
Publication number | Publication date |
---|---|
EP1032918A1 (en) | 2000-09-06 |
AU5926299A (en) | 2000-04-03 |
JP2002525719A (en) | 2002-08-13 |
CA2310333A1 (en) | 2000-03-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Stanford et al. | Finding curvilinear features in spatial point patterns: principal curve clustering with noise | |
Clausi | K-means Iterative Fisher (KIF) unsupervised clustering algorithm applied to image texture segmentation | |
Tirandaz et al. | A two-phase algorithm based on kurtosis curvelet energy and unsupervised spectral regression for segmentation of SAR images | |
Ugarriza et al. | Automatic image segmentation by dynamic region growth and multiresolution merging | |
Sharma et al. | A review on image segmentation with its clustering techniques | |
Attene et al. | Hierarchical mesh segmentation based on fitting primitives | |
Keuchel et al. | Binary partitioning, perceptual grouping, and restoration with semidefinite programming | |
WO2000016250A1 (en) | Data decomposition/reduction method for visualizing data clusters/sub-clusters | |
Krasnoshchekov et al. | Order-k α-hulls and α-shapes | |
Allassonniere et al. | A stochastic algorithm for probabilistic independent component analysis | |
Cai et al. | A new partitioning process for geometrical product specifications and verification | |
Tsuchie et al. | High-quality vertex clustering for surface mesh segmentation using Student-t mixture model | |
Lavoué et al. | Markov Random Fields for Improving 3D Mesh Analysis and Segmentation. | |
AlZu′ bi et al. | 3D medical volume segmentation using hybrid multiresolution statistical approaches | |
Blanchet et al. | Triplet Markov fields for the classification of complex structure data | |
Vilalta et al. | An efficient approach to external cluster assessment with an application to martian topography | |
Huang et al. | Texture classification by multi-model feature integration using Bayesian networks | |
Kouritzin et al. | A graph theoretic approach to simulation and classification | |
Gehre et al. | Feature Curve Co‐Completion in Noisy Data | |
Marras et al. | 3D geometric split–merge segmentation of brain MRI datasets | |
Guizilini et al. | Iterative continuous convolution for 3d template matching and global localization | |
Li et al. | High resolution radar data fusion based on clustering algorithm | |
Li | Unsupervised texture segmentation using multiresolution Markov random fields | |
Roy et al. | A finite mixture model based on pair-copula construction of multivariate distributions and its application to color image segmentation | |
Dokur et al. | Segmentation of medical images by using wavelet transform and incremental self-organizing map |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG UZ VN YU ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
|
WWE | Wipo information: entry into national phase |
Ref document number: 1999946966 Country of ref document: EP |
|
ENP | Entry into the national phase |
Ref document number: 2310333 Country of ref document: CA Ref country code: CA Ref document number: 2310333 Kind code of ref document: A Format of ref document f/p: F |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
ENP | Entry into the national phase |
Ref country code: JP Ref document number: 2000 570715 Kind code of ref document: A Format of ref document f/p: F |
|
WWE | Wipo information: entry into national phase |
Ref document number: 59262/99 Country of ref document: AU |
|
WWP | Wipo information: published in national office |
Ref document number: 1999946966 Country of ref document: EP |
|
REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
WWW | Wipo information: withdrawn in national office |
Ref document number: 1999946966 Country of ref document: EP |