Geo Analysis - Chi Squared

Transformations


We have been told that the good Dr. W has sold his services to academia and has undertaken an analysis of the productivity (mental, not biological) of the faculty members in the College of Unnatural Sciences and Counting at Kuoraf University in Nepal. The numbers of publications for each of the 50 faculty members are given below:

60 126 150 35 35 3 30 15 26 63 36 24 78 39 48 63 6 60 20 108 37 10 8 34 7 12 56 70 110 202 18 58 31 47 107 52 302 181 65 80 22 120 21 25 75 38 330 200 72 6

Create a blank variable in Data Desk and enter the publication numbers. Compute the summary statistics.
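If you want to cross-check Data Desk's summary statistics outside the application, the following is a minimal sketch in Python (assuming NumPy is available; the variable name pubs is ours, not Data Desk's):

    import numpy as np

    # Publication counts for the 50 faculty members (typed from the list above).
    pubs = np.array([60, 126, 150, 35, 35, 3, 30, 15, 26, 63,
                     36, 24, 78, 39, 48, 63, 6, 60, 20, 108,
                     37, 10, 8, 34, 7, 12, 56, 70, 110, 202,
                     18, 58, 31, 47, 107, 52, 302, 181, 65, 80,
                     22, 120, 21, 25, 75, 38, 330, 200, 72, 6])

    print("n      :", pubs.size)
    print("mean   :", pubs.mean())
    print("median :", np.median(pubs))
    print("std dev:", pubs.std(ddof=1))   # sample standard deviation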

Dr. W argues that the average faculty member has 68 publications. We know that in skewed distributions the average may not be a good measure of central tendency. Prepare a histogram of Publications and a normal probability plot. What is the correlation between nscores and publications? Recall that the closer this correlation is to 1.0, the more symmetric the distribution. Without performing a test, it appears that the distribution is too highly skewed to be described as a normal distribution.
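If you want to check the nscore correlation outside Data Desk, a rough sketch follows (continuing the Python session above; we assume Blom-type plotting positions for the normal scores, which may differ slightly from the formula Data Desk uses internally):

    import numpy as np
    from scipy import stats

    # pubs: the publication counts from the earlier sketch
    ranks = stats.rankdata(pubs)                                    # ranks 1..50, ties averaged
    nscores = stats.norm.ppf((ranks - 0.375) / (len(pubs) + 0.25))  # Blom plotting positions

    r = np.corrcoef(nscores, pubs)[0, 1]
    print("correlation of nscores with publications:", round(r, 3))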

Dr. W has made certain statements about the faculty based on his analysis of the distribution of publications as normal. When we asked about this assumption, Dr. W replied that "...of course these data are normal...are you implying that my countrymen are abnormal, weird, or something...". Needless to say, we began to worry.

Create a derived variable which contains the logs (to the base 10) of the publications. Prepare a histogram and a normal probability plot.

Create a derived variable which contains the natural logs (to the base e). Repeat the above analysis. What should the relationship be between the means of the two log variables (base 10 and base e)? Confirm.
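One way to confirm the relationship numerically (continuing the Python session above; since ln x = ln(10) * log10 x, the two sets of means should differ only by the constant factor ln(10), about 2.3026):

    import numpy as np

    # pubs: the publication counts from the earlier sketch
    log10_pubs = np.log10(pubs)
    ln_pubs = np.log(pubs)

    print("mean(log10)   :", log10_pubs.mean())
    print("mean(ln)      :", ln_pubs.mean())
    print("ratio of means:", ln_pubs.mean() / log10_pubs.mean())  # should equal ln(10) = 2.3026...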

Someone (probably a sedimentologist) said that it is "good" to transform the values to log to the base 2 (the phi scale). Try this.

Create a derived variable which contains the square roots of the publications. Repeat the above analysis.

Data Desk does not allow construction of the chi-squared test that was discussed in class. However, it can be calculated relatively easily by hand using output from the application. For Publications (raw data), create a new transformed variable containing the Z-scores: Manip > Transform > New Derived Variable > Misc > ZScores. Sort the Z-scores from smallest to largest (Manip > Sort). Open the sorted vector and count the number of observations in each of the following four intervals:

  1. -infinity to -0.67

  2. -0.67 to 0.0

  3. 0.0 to 0.67

  4. 0.67 to +infinity

These are the observed values. Each of the bins should contain 25% of the observations IF the data follow a normal distribution; that is, the null hypothesis is that the observations follow a normal distribution. The expected count per bin is therefore 50 x 0.25 = 12.5. Subtract Observed from Expected for each bin and square the difference; divide each square by 12.5. Sum these 4 values -- this is the computed statistic U squared.
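The same hand calculation can be cross-checked with a short sketch (continuing the Python session above, with pubs as defined earlier):

    import numpy as np

    # Z-scores of the raw publication counts (sample standard deviation)
    z = (pubs - pubs.mean()) / pubs.std(ddof=1)

    # observed counts in the four bins (quartiles of the standard normal)
    observed = np.array([
        np.sum(z < -0.67),
        np.sum((z >= -0.67) & (z < 0.0)),
        np.sum((z >= 0.0) & (z < 0.67)),
        np.sum(z >= 0.67),
    ])

    expected = len(pubs) * 0.25        # 12.5 per bin under the null hypothesis
    u_squared = np.sum((observed - expected) ** 2 / expected)
    print("observed counts:", observed)
    print("U squared      :", u_squared)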

IF U squared is greater than the critical value for chi-squared (page 118) for the appropriate degrees of freedom (1 for this model), then the null hypothesis is rejected at significance level alpha -- use an alpha of 0.10.
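If the chi-squared table on page 118 is not handy, the critical value can also be looked up with SciPy (a sketch only):

    from scipy import stats

    alpha = 0.10
    df = 1
    critical = stats.chi2.ppf(1 - alpha, df)
    print("chi-squared critical value:", round(critical, 3))   # about 2.706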

Repeat for the log transformed values and for the square root transformed values. Should it make a difference which log transformed set you use -- base 10, base e, or base 2? Why?

The KolmogorovTestForNormality is a formal test based on what are termed non-parametric methods. Close your publications work and save. Reopen the publications file in Data Desk and then import the template KolmogorovTestForNormality. Drag the raw publication data icon onto the appropriate space in the template. If you click on the ? you get a brief discussion of the method. The Null Hypothesis is that the distribution is normal; the alternative is that it is not. Rejecting the Null does not tell you what the distribution is, and accepting the Null leaves you open to a beta error. Click on Estimate Parameters and open the plot window. The jagged line is the observed data and the solid line is the "modeled" data assuming a normal distribution. Note that the coefficient of variation is greater than 1.0, whereas normally distributed data usually have a coefficient of variation of less than 0.30. The critical value of the test statistic is about 0.16 for an alpha of 0.10; if the observed statistic is greater than 0.16, the Null is rejected.
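For comparison with the Data Desk template, a rough SciPy equivalent is sketched below (continuing the Python session above; note that estimating the mean and standard deviation from the same data, as the template does, makes the standard Kolmogorov-Smirnov critical values only approximate -- a Lilliefors-type correction would be more exact):

    from scipy import stats

    # pubs: the publication counts from the earlier sketch
    d_stat, p_value = stats.kstest(pubs, 'norm', args=(pubs.mean(), pubs.std(ddof=1)))
    print("K-S statistic      :", round(d_stat, 3))
    print("approximate p-value:", round(p_value, 3))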

Repeat with the log and square root transformed data.

Which of the forms -- raw data, the 3 types of logs, or square roots -- would you select if the goal is to work with a "nearly" symmetrical distribution?

Which distribution "best describes" the publications vector? Why?

It seems to us that something important is missing from this data set. In order to make whatever sense possible from these data, what other information would you like to have? Why?