Geo Analysis - Cluster Analysis

Permian Thickness Data Analysis - Cluster - Part I



Open the Fox data set and reconstruct your plot of sample location coordinates. It will help if you orient your plot to match the map on the handout.

It is possible that spatially distributed data reflect a more or less continuous variation over the mapped area. Or, it is possible that there are naturally occurring groups of samples within the mapped area.

We understand that Dr. WW is attempting to recognize groups within the set of Permian Sedimentary Rock Thicknesses so that he can further his consulting practice. We also understand that you are a pre-expert in the techniques of cluster analysis as one model for recognizing groups of potentially related samples.

The goal is to see if there are recognizable groups within the set of 31 samples which "make geological sense." The final report should include the work that you do in parts I and II of this exercise. Suggestions will be given to help you get started, but you should do plenty of experimenting using what you have learned about DataDesk in the previous exercises.

I recommend starting with the four variables Sand, Shale, Carbonate and Evaporite. Reconstruct your sample location map so that it is in the same orientation as the map given on the handout. You could begin by preparing a histogram of each of the four variables and checking the spatial distribution of different amounts of each lithology. Do spatially related patterns begin to appear? Are there groups of similar thicknesses of the same lithology? You might also create a variable called Total - which, as the name suggests, is the total thickness at each locality - and follow the same strategy. When dealing with variables for which you know the spatial distribution, this type of analysis should always take place prior to using a more formal model.
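If you want to check the Total variable and the histogram counts outside DataDesk, the idea can be sketched in a few lines of Python. This is only an illustration: the thickness values are made up (they are NOT the Fox data), and the names add_total and histogram_counts and the bin width of 20 are assumptions chosen for the example.

```python
# Hypothetical thicknesses for five localities (NOT the Fox data set).
samples = [
    {"code": 1, "sand": 10.0, "shale": 40.0, "carbonate": 30.0, "evaporite": 5.0},
    {"code": 2, "sand": 12.0, "shale": 42.0, "carbonate": 28.0, "evaporite": 6.0},
    {"code": 3, "sand": 50.0, "shale": 10.0, "carbonate": 5.0, "evaporite": 0.0},
    {"code": 4, "sand": 52.0, "shale": 12.0, "carbonate": 6.0, "evaporite": 1.0},
    {"code": 5, "sand": 30.0, "shale": 30.0, "carbonate": 60.0, "evaporite": 20.0},
]
LITHOLOGIES = ("sand", "shale", "carbonate", "evaporite")


def add_total(rows):
    """Return copies of the rows with a 'total' key: the sum of the four lithologies."""
    return [dict(r, total=sum(r[k] for k in LITHOLOGIES)) for r in rows]


def histogram_counts(values, bin_width):
    """Count values per bin; the bin labelled b covers b <= value < b + bin_width."""
    counts = {}
    for v in values:
        lower = int(v // bin_width) * bin_width
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


with_total = add_total(samples)
for row in with_total:
    print(row["code"], row["total"])

# Histogram counts for one lithology; repeat for the others and for 'total'.
print(histogram_counts([r["sand"] for r in with_total], bin_width=20))
```

The counts only tell you how the thicknesses are distributed. You still need the map to see whether similar thicknesses sit next to each other.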

One question (based on discussions in class) is how many groups of samples to look for. A good place to begin is by computing the matrix of correlation coefficients for the four variables. If two of the four variables are highly correlated then they may be responding to the same underlying factor. In that case there might be three naturally occurring groups. Next, go to Calc > Options > Cluster. I would begin by choosing the Complete option. You can always try the other one as part of your exploration.
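For reference, the matrix of correlation coefficients can be sketched as follows. This assumes the ordinary Pearson coefficient; the column values are invented for illustration (sand and shale are deliberately constructed to be perfectly negatively correlated), and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Nested dict: matrix[a][b] is the correlation between variables a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up values (NOT the Fox data).
columns = {
    "sand": [10, 20, 30, 40, 50],
    "shale": [50, 40, 30, 20, 10],
    "carbonate": [5, 25, 10, 30, 15],
    "evaporite": [0, 5, 20, 5, 0],
}

matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, [round(row[other], 2) for other in columns])
```

A pair of variables with a coefficient near +1 or -1 is the kind of pair that may be responding to the same underlying factor.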

In general, if you are a splitter you could recognize 31 groups - each sample is a group unto itself. Or, if you are a lumper you could recognize 1 group - consisting of all 31 samples. Neither end of the spectrum is particularly satisfying. Before attempting to decide how many groups, explore the dendrogram. Select the "Finger" tool. When you point at a linkage node, the samples that have been clustered at that node are highlighted on the map of all samples. You may begin to recognize potential groups. For your first attempt be consistent. If you find a group that you want to work with, look at the level of clustering. Other groups should cluster at that level or above - not at a lower level. When you do this you may find that there are one or more groups with only one or two samples. Keep these as groups but be prepared to adjust your decision later on. You may find a larger group which looks like it is really two groups. Start with the larger group. Later on you may find a good reason to subdivide it...perhaps recognizing Facies A with subgroups 1 and 2.
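To see what the Complete option and a consistent clustering level mean, here is a minimal sketch of complete-linkage clustering followed by a cut at one level. It assumes Euclidean distance on the raw thicknesses; DataDesk's own distance measure and scaling may differ. The points are invented (NOT the Fox data) and the function names are hypothetical.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def complete_linkage(points):
    """Agglomerative clustering; returns merges as (height, members_a, members_b)."""
    clusters = [frozenset([i]) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: the distance between two clusters is the
                # largest distance between any pair of their members.
                d = max(euclidean(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((d, clusters[i], clusters[j]))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


def cut(points, merges, level):
    """Groups obtained by applying only the merges at or below the given level."""
    groups = [frozenset([i]) for i in range(len(points))]
    for height, a, b in merges:
        if height <= level:
            groups = [g for g in groups if g != a and g != b] + [a | b]
    return sorted(sorted(g) for g in groups)


# Made-up (sand, shale, carbonate, evaporite) thicknesses for five samples.
points = [
    (10, 40, 30, 5),
    (12, 42, 28, 6),
    (50, 10, 5, 0),
    (52, 12, 6, 1),
    (30, 30, 60, 20),
]

merges = complete_linkage(points)
for level in (0, 10, 50, 100):
    print(level, cut(points, merges, level))
```

A cut at level 0 is the splitter's answer (every sample alone) and a cut above the last merge is the lumper's answer (one group). Using one level for every group is the "be consistent" rule for a first attempt.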

Cluster Analysis is not a statistical tool in the sense that there is a null hypothesis that will tell you whether your groups are correct. In fact, different analysts will probably come up with different groupings. The guiding principle should be "does the distribution of groups make sense" based on what you know about geological principles.

Look at section 13. You will probably want to have some way to record which samples are in which groups. One way is to use the Finger tool. Point at one of the clusters - all the samples in the cluster should be highlighted. Assign a color to that cluster (using color should always be considered as a way to tag group membership). Choose Modify > Selection > Record. Record a letter to that cluster. Repeat for all clusters that you want to begin with. This generates a new icon for each cluster. When you open one of the icons you will find a 1 by all of the samples that belong to that cluster.

Another way to designate group membership is to create a new variable called Group - Data > New > Blank Variable. If you use the ? tool you can identify the sample number (Code). Prepare a list showing the group code (you can use 1, 2, 3, ... N or a, b, c, ... Z) for each sample. It is easiest to list the numbers from 1 to 31 and record the group each belongs to. Open the new variable Group and the existing variable Code. Enter the proper group value for each sample. Select Sand (Y) and Group (X), then choose Calc > Summaries > by group. You will have the sandstone thickness summary statistics for all of the groups. Plot a bar graph or pie chart (yes, the dreaded pie chart). Select Modify > Colors > Add > by groups. Now there is a correspondence between the color of the bar chart (or sector of the pie chart) and the color assigned to each group.
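The logic of a summary by group can be sketched like this, assuming a lookup from each sample's Code to its Group letter. The assignments and sand thicknesses are made up (NOT the Fox data), and summarize_by_group is a hypothetical helper, not a DataDesk command.

```python
from statistics import mean

# Hypothetical Code -> Group assignments and sand thicknesses.
group = {1: "a", 2: "a", 3: "b", 4: "b", 5: "c"}
sand = {1: 10.0, 2: 12.0, 3: 50.0, 4: 52.0, 5: 30.0}


def summarize_by_group(values, membership):
    """Collect the values of each group, then report n, mean, min and max."""
    by_group = {}
    for code, label in membership.items():
        by_group.setdefault(label, []).append(values[code])
    return {
        label: {"n": len(v), "mean": mean(v), "min": min(v), "max": max(v)}
        for label, v in sorted(by_group.items())
    }


summary = summarize_by_group(sand, group)
for label, stats in summary.items():
    print(label, stats)
```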

I suggest that you compute the summary statistics, the means in particular, of the four lithologies for each cluster. Begin to see if you can determine what the characteristics of each cluster are. Click on one lithology and shift-click on the group icons. Compute the summary statistics. For each group you will have two values. One is for the members of that group and the other is for all other samples. This can be tedious, but take notes.
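The two values reported for each group (members versus all other samples) can be sketched as below. It assumes the recorded selection is an indicator with 1 for members and 0 for everyone else; the 0 is an assumption, as are the thickness values (NOT the Fox data) and the name in_vs_out.

```python
from statistics import mean


def in_vs_out(values, indicator):
    """Return (mean of members, mean of all other samples) for one cluster."""
    inside = [v for v, flag in zip(values, indicator) if flag == 1]
    outside = [v for v, flag in zip(values, indicator) if flag != 1]
    return mean(inside), mean(outside)


# Made-up thicknesses for five samples, one list per lithology.
lithologies = {
    "sand": [10.0, 12.0, 50.0, 52.0, 30.0],
    "shale": [40.0, 42.0, 10.0, 12.0, 30.0],
    "carbonate": [30.0, 28.0, 5.0, 6.0, 60.0],
    "evaporite": [5.0, 6.0, 0.0, 1.0, 20.0],
}
cluster_a = [1, 1, 0, 0, 0]  # hypothetical indicator for one recorded cluster

for name, values in lithologies.items():
    members, others = in_vs_out(values, cluster_a)
    print(name, round(members, 1), round(others, 1))
```

A lithology whose member mean is far from the mean of the other samples is a candidate for what characterizes that cluster.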

Prepare a map showing the spatial distribution of your groups. Now is an opportunity to see if groups with small numbers of samples can be tentatively assigned to one of the larger groups. Use judgment in doing this. Also, this is a good time to see if subgroups exist. If one of your larger clusters consists of two or more subgroups based on spatial position, then you may want to modify your map.

Feel free to explore other ways of trying to learn the maximum about these 31 samples.

In the next exercise you will compute the principal components from this data set.