Geo Analysis - PCA

Permian Thickness Data Analysis - PCA - Part II


Dr. WW is very upset. It seems that he has learned that you have been giving us information which is leading to his removal from the company. Last night we found him trying to insert his floppy into the toaster, muttering something about "new variables for old" and PCA.

Using your Fox data set, reconstruct the dendrogram and the plot of sample localities. Read section 27 in the Data Desk manual. There are several objectives of an analysis of Principal Components. Recall that the eigenvalues and associated eigenvectors of a square and symmetric matrix of similarities are extracted. PCA preserves the sum of the elements of the matrix that lie on the main diagonal (the trace). If eigenvalues are extracted from the variance-covariance matrix, the sum of the variances is preserved. The first eigenvalue is the variance of the first "new" variable. This so-called new variable is formed by a linear combination of the original variables, chosen so that it accounts for the largest possible percentage of the total variance. The second new variable accounts for the second highest percentage of the variance, and so on.
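If you want to check the trace-preservation idea outside Data Desk, here is a minimal sketch in Python with NumPy. The data are synthetic random numbers, made up for illustration and not the Fox data, and the variable names are my own.

```python
import numpy as np

# Synthetic stand-in data: 50 samples of 4 variables with unequal spread.
# These numbers are invented for illustration; they are not the Fox data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4)) * np.array([10.0, 5.0, 2.0, 1.0])

# Variance-covariance matrix (variables are in columns).
S = np.cov(X, rowvar=False)

# eigh is intended for symmetric matrices and returns eigenvalues
# in ascending order, so reverse to put the largest first.
eigenvalues, eigenvectors = np.linalg.eigh(S)
eigenvalues = eigenvalues[::-1]
eigenvectors = eigenvectors[:, ::-1]

total_variance = np.trace(S)                     # sum of the variances
percent = 100.0 * eigenvalues / total_variance   # share per new variable

print("sum of variances  :", total_variance)
print("sum of eigenvalues:", eigenvalues.sum())
print("percent of total variance per component:", np.round(percent, 1))
```

The two sums should agree to within rounding error, and the percentages should decrease from the first new variable to the last.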

If you use the matrix of correlation coefficients, PCA preserves the sum of the elements on the main diagonal, but in this matrix the main diagonal contains 1.0s. For a 5 variable matrix the trace of the correlation coefficient matrix is 5.0. Thus, the sum of the extracted eigenvalues is preserved but redistributed across the new variables.
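The same check can be made for the correlation case. This is a sketch under the same assumptions as before: synthetic random data with five made-up variables, not your data set.

```python
import numpy as np

# Five synthetic variables with very different spreads (illustrative only).
rng = np.random.default_rng(1)
X5 = rng.normal(size=(40, 5)) * np.array([20.0, 8.0, 3.0, 1.0, 0.5])

# Matrix of correlation coefficients: the main diagonal is all 1.0s.
R = np.corrcoef(X5, rowvar=False)

# Eigenvalues of the symmetric matrix R, largest first.
eig_R = np.linalg.eigvalsh(R)[::-1]

print("trace of R        :", np.trace(R))
print("sum of eigenvalues:", eig_R.sum())
print("eigenvalues       :", np.round(eig_R, 3))
```

Both the trace and the sum of the eigenvalues should come out as 5.0 (within rounding), whatever the original variances were.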

The decision as to which measure of similarity to use must be based on how you want to define similarity. If you want the magnitude of the raw data to be preserved (that is, giving highest weight to the variables with the largest variance) then the variance-covariance matrix is appropriate. If, on the other hand, you want to standardize and treat the variables "equally", the matrix of correlation coefficients is appropriate. [What are the units of the elements in these two matrices?]

You can control the choice of similarity measure - Calc>Options>PCA. Start with the variance-covariance matrix since the dendrogram you have was produced from a distance measure.

Principal Components are often said to allow for a reduction in the dimensions of a data set. This is somewhat confusing, as the transformation from the original variables to the new variables (the principal component scores) generally requires all of the original variables (OV), each weighted by an element of an eigenvector (EV) in the transformation equations. That is:

Score1 = EV1*OV1 + EV2*OV2 + ...

Where EV1 and EV2 are the first and second elements of the first eigenvector, and OV1 and OV2 are the first and second original variables.
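The score equation can be sketched in a few lines of Python with NumPy. Two assumptions to note: the data are synthetic, and I subtract each variable's mean before applying the eigenvector, which is a common convention but may not match what Data Desk does internally.

```python
import numpy as np

# Synthetic original variables (OV): 30 samples, 4 variables. Illustrative only.
rng = np.random.default_rng(2)
OV = rng.normal(size=(30, 4)) * np.array([6.0, 3.0, 2.0, 1.0])

# Assumption: center each variable on its mean before transforming.
centered = OV - OV.mean(axis=0)

# Eigen-decomposition of the variance-covariance matrix, largest eigenvalue first.
S = np.cov(OV, rowvar=False)
vals, vecs = np.linalg.eigh(S)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

EV = vecs[:, 0]   # the first eigenvector; EV[0], EV[1], ... are its elements

# Score1 written out term by term: EV1*OV1 + EV2*OV2 + ...
score1_long = sum(EV[j] * centered[:, j] for j in range(OV.shape[1]))

# The same thing as one matrix product, and all scores at once.
score1 = centered @ EV
scores = centered @ vecs

print("variance of Score1:", score1.var(ddof=1))
print("first eigenvalue  :", vals[0])
```

Note that every original variable enters every score. The variance of Score1 should equal the first eigenvalue, which is the sense in which the eigenvalue "is" the variance of the new variable.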

Imagine a matrix in which each variable had equal variance. If there were 4 variables, then a scatter diagram of any two would display 50% of the total variance. Think of a cluster of points in 4-dimensional space (hard, isn't it?). The scatter diagram is a projection from 4-D to 2-D, and in this case 50% of the variation does not lie in the plane defined by the axes of the scatter diagram. Suppose that the first two eigenvalues account for 50% and 25% of the total, so that a scatter diagram of Score1 versus Score2 would capture 75% of the total variation. This scatter diagram can be said to have reduced the dimensionality in the sense that 25% more of the total variance is now displayed in two dimensions.

Compute the total variance for the variables Sandstone, Shale, Carbonate, and Evaporite and compute the percentage of the total variance accounted for by each of the four. This is a good way to explore data. A scatter diagram constructed from the two variables with the largest variances will contain the highest percentage of the total variance that any pair of original variables can display. It also gives you a baseline against which to judge how successful PCA is in reducing dimensionality for display purposes. If two of the original variables account for 80% of the total variance, and the first two new Principal Components account for 83%, then there has been a gain of 3%. For your data, how much gain in capturing variability has been accomplished?
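The bookkeeping for this comparison might look like the following sketch. The four column names come from the exercise, but the numbers are synthetic and invented for illustration, so the percentages printed here say nothing about the Fox data.

```python
import numpy as np

names = ["Sandstone", "Shale", "Carbonate", "Evaporite"]

# Synthetic, mildly correlated thickness-like data (illustrative only).
rng = np.random.default_rng(3)
base = rng.normal(size=(60, 4)) * np.array([12.0, 9.0, 4.0, 2.0])
mix = np.array([[1.0, 0.5, 0.0, 0.0],
                [0.0, 1.0, 0.3, 0.0],
                [0.0, 0.0, 1.0, 0.2],
                [0.0, 0.0, 0.0, 1.0]])
X = base @ mix

S = np.cov(X, rowvar=False)
variances = np.diag(S)
total = variances.sum()

# Percentage of the total variance carried by each original variable.
pct_original = 100.0 * variances / total
for name, pct in zip(names, pct_original):
    print(f"{name:10s} {pct:5.1f}%")

# Best pair of original variables versus the first two principal components.
top_two_original = np.sort(pct_original)[::-1][:2].sum()
eigvals = np.linalg.eigvalsh(S)[::-1]
top_two_pcs = 100.0 * eigvals[:2].sum() / total

gain = top_two_pcs - top_two_original
print(f"two largest original variables: {top_two_original:.1f}%")
print(f"first two principal components: {top_two_pcs:.1f}%")
print(f"gain: {gain:.1f} percentage points")
```

The gain cannot be negative: the first two principal components always capture at least as much variance as any two original variables.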

Imagine a case in which there were the following variables:

A, B, C, A+B+C, and A/B.

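Note that although the matrix of similarities is 5 by 5, the 5 variables are not all independent: one is the sum of the first three and another is the ratio of two of the first three. If you extracted the eigenvalues for this 5 by 5 matrix you would find that at least one is 0.0 (within rounding error), because A+B+C is an exact linear combination of A, B, and C. The ratio A/B is also redundant, but it is a nonlinear function of A and B, and PCA detects only linear dependence, so it will generally not drive a second eigenvalue all the way to 0.0. In this sense PCA recognizes linear redundancy in the original data set. Think about this.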
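One way to test a claim like this is to build the five variables from random numbers and count the eigenvalues that are effectively zero. This is only a sketch: A, B, and C are drawn from an arbitrary uniform range that I chose to keep B away from zero, and the tolerance for "effectively zero" is also my own choice.

```python
import numpy as np

# Three independent synthetic variables (illustrative values only).
rng = np.random.default_rng(4)
A = rng.uniform(1.0, 10.0, size=200)
B = rng.uniform(1.0, 10.0, size=200)
C = rng.uniform(1.0, 10.0, size=200)

# The five variables from the text: A, B, C, A+B+C, and A/B.
data = np.column_stack([A, B, C, A + B + C, A / B])

# Eigenvalues of the 5 by 5 variance-covariance matrix, largest first.
S = np.cov(data, rowvar=False)
eigvals = np.linalg.eigvalsh(S)[::-1]

# Treat anything below a tiny fraction of the largest eigenvalue as zero.
tol = 1e-9 * eigvals[0]
n_zero = int(np.sum(eigvals < tol))

print("eigenvalues:", eigvals)
print("eigenvalues that are effectively zero:", n_zero)
```

Compare the count with what you expected. The exact sum A+B+C is a linear combination of the other columns, whereas the ratio A/B is not, and that difference shows up in the eigenvalues.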

It is important to be able to relate the new variables to the old, and the magnitudes of the elements of the eigenvectors allow you to begin this process. If some of these elements are very large in absolute value, then you can argue that those original variables are strongly influencing the new variable. If the elements are all positive, then the new variable can be viewed as a weighted sum of the original variables. If one is positive and the other is negative, then the new variable is a contrast between those original variables and can be thought of, loosely, as their ratio.

The principal components scores are stored in a folder called U in the results folder. Plot U1 versus U2. Now, with the dendrogram, plot of sample localities, and U1 versus U2 you can begin to assess the effectiveness of the transformation. Think about this.

You should try the analysis using the matrix of correlation coefficients. In general, this measure requires the inclusion of more eigenvalues to capture a given percentage of the total variance.
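To see why the correlation matrix tends to need more eigenvalues, here is a sketch using a deliberately extreme case: synthetic, uncorrelated variables with very unequal variances. The helper `components_needed` and the 80% target are my own inventions for illustration; real data will usually be less one-sided.

```python
import numpy as np

def components_needed(eigenvalues, target_percent):
    """Smallest number of leading eigenvalues whose share reaches target_percent."""
    ordered = np.sort(eigenvalues)[::-1]
    cumulative = 100.0 * np.cumsum(ordered) / ordered.sum()
    # searchsorted gives the first index where cumulative >= target_percent.
    return int(np.searchsorted(cumulative, target_percent) + 1)

# Uncorrelated synthetic variables with very unequal spread (illustrative only).
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 4)) * np.array([20.0, 5.0, 2.0, 1.0])

eig_cov = np.linalg.eigvalsh(np.cov(X, rowvar=False))
eig_corr = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))

n_cov = components_needed(eig_cov, 80.0)
n_corr = components_needed(eig_corr, 80.0)

print("components for 80% (variance-covariance):", n_cov)
print("components for 80% (correlation)        :", n_corr)
```

With the variance-covariance matrix, the one high-variance variable dominates the first component. Standardizing removes that advantage, so the variance is spread more evenly across the components.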

Extract the principal components from a matrix containing the four lithologies plus Total Thickness. What happens?

Put the cluster analysis and PCA together and write a single coherent report on the depositional history of this area. The focus should be on the depositional history, with the data analysis results used as support for the model that you develop. Feel free to use any information from the literature that you can find, but include references.
