Monday, March 10, 2008

The Pearson product-moment correlation coefficient

The single most common type of correlation is the Pearson product-moment correlation coefficient, which measures the degree of relationship between two continuous variables. A continuous variable is a variable that can be measured along a line scale. For example, smoking can be measured continuously because the number of cigarettes smoked per week, month, or year can be measured along a line scale from zero to a large number. Gender (male or female) is not considered a continuous variable because if numbers (e.g., 1 or 2) were assigned to the two categories, a person could not be a 1.3 or a 1.7.
For the formula for r, see Statistics: A Gentle Introduction, by Frederick L. Coolidge.

Here are the smoking and longevity data, one pair of scores per line.
Smoking   Longevity
25        63
35        68
10        72
40        62
85        65
75        46
60        51
45        60
50        55

Correlation matrix (Column 1 = smoking, Column 2 = longevity):
            Column 1    Column 2
Column 1    1
Column 2    -0.61106    1
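
The coefficient in the correlation output above can be reproduced directly from the nine pairs of scores. What follows is a minimal Python sketch, not the tool that produced the output above. The function name pearson_r and the variable names are illustrative assumptions, and the deviation-score formula it uses is the standard textbook definition rather than one quoted from this post.

```python
import math

# Smoking and longevity pairs copied from the data listed above.
smoking = [25, 35, 10, 40, 85, 75, 60, 45, 50]
longevity = [63, 68, 72, 62, 65, 46, 51, 60, 55]


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(smoking, longevity)
print(round(r, 5))  # should agree with the -0.61106 in the output above
```

The sign comes from the sum of cross-products: high smoking scores tend to pair with below-average longevity scores, so the products of deviations are mostly negative.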

The output above gives r = -0.611, which indicates a strong negative correlation between smoking and longevity: the higher the number of cigarettes smoked in the past 5 years, the lower the number of years lived, and the lower the number of cigarettes, the higher the number of years lived. Remember, this relationship between the two variables does not mean that heavy smoking causes one to live a shorter life. It may, however, give clues as to further research ideas for experiments.
Testing for the significance of a correlation coefficient.
A correlation coefficient may be tested to determine whether it differs significantly from zero. The value r is obtained from a sample. The value rho (ρ) is the population's correlation coefficient. It is hoped that r closely approximates rho. The null and alternative hypotheses are as follows:
H0: ρ = 0
Ha: ρ ≠ 0
The value of r and the number of pairs of scores are converted through a formula into a value from a distribution called the t distribution. The t formula can only be used to test whether rho is equal to zero; it cannot be used to test whether rho might be equal to some number other than zero. It is also important to note that the t distribution may be used to test other types of inferential statistics.
The t distribution is most commonly used to test whether two means are significantly different, but it may also be used to test the significance of the correlation coefficient.
Interestingly, the t distribution becomes the z distribution as the degrees of freedom approach infinity, and the two are already strikingly similar visually when there are only several hundred numbers in the set of data.
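
The similarity between the t and z distributions can be checked numerically. The sketch below is only an illustration, using the standard density formulas for Student's t and the standard normal. The function names t_pdf and z_pdf, the grid from -4 to 4, and the chosen degrees of freedom are all assumptions made for this example.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    # Work in logs so the gamma functions do not overflow for large df.
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def z_pdf(x):
    """Density of the standard normal (z) distribution."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


# Largest vertical gap between the two curves on a grid from -4 to 4.
grid = [i / 10 for i in range(-40, 41)]
max_gap = {}
for df in (7, 30, 300, 3000):
    max_gap[df] = max(abs(t_pdf(x, df) - z_pdf(x)) for x in grid)
    print(df, max_gap[df])
```

The gap shrinks steadily as df grows, and with a few hundred degrees of freedom the two curves are practically indistinguishable by eye.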
For the t test formula used to test the null hypothesis for a correlation coefficient, see page 165 or other books.
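
For reference, the form of this test statistic found in most statistics textbooks is sketched below. It is assumed, not confirmed, to be the formula on the page cited; it does reproduce the value reported for the smoking example when r = -0.61106 and N = 9 pairs of scores are substituted.

```latex
t = \frac{r\sqrt{N-2}}{\sqrt{1-r^{2}}}
\qquad\text{e.g.}\qquad
t = \frac{-0.61106\,\sqrt{9-2}}{\sqrt{1-(-0.61106)^{2}}} \approx -2.042
```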

In the example on smoking, the research question was whether heavy smoking was related to longevity. To test whether this obtained r significantly differs from zero, the t formula is used.
t=-2.042

Step 1.
Choose a one-tailed or two-tailed test of significance.
The alternative hypothesis establishes whether we will use a one-tailed or two-tailed significance test. If the alternative hypothesis is nondirectional, as is the case in most studies, then a two-tailed test of significance is required.

Step 2.
Choose the level of significance.
The conventional level of significance is p=0.05. Only in rare circumstances would one ever depart from p=0.05 as a starting point.

Step 3.
Determine the degrees of freedom (df).
The df is an advanced statistical concept related to sampling. We will keep things simple: for this t test, the formula is df=N-2, where N is the number of pairs of scores. In this example, there were nine pairs of scores, so df=9-2, or df=7.

Step 4.
Determine whether the t from the formula (called the derived t) exceeds the tabled critical values from the t distribution.
For a two-tailed test of significance at p=0.05 with df=7, the critical values of t are t=+2.365 and t=-2.365. If the derived t is greater than t=+2.365 or less than t=-2.365, then the null hypothesis will be rejected. In this example, the derived t=-2.042 is not less than t=-2.365; therefore, the null hypothesis is not rejected, and it will be concluded that r=-0.61 indicates a nonsignificant relationship. Curiously, although the strength of the relationship was strong (r=-0.61), the test of significance indicated that a relationship this strong could arise by chance more than 5 times out of 100 if there were no correlation in the population.
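
The decision rule in Steps 3 and 4 can be written out in a few lines. This is a minimal sketch that uses only the numbers quoted in the text (N = 9, derived t = -2.042, critical value 2.365); the variable names are illustrative assumptions.

```python
# Values taken from the worked example in the text.
n_pairs = 9
df = n_pairs - 2        # degrees of freedom for this test: N - 2
derived_t = -2.042      # t obtained from the formula
critical_t = 2.365      # tabled two-tailed critical value at p = 0.05, df = 7

# Two-tailed rule: reject H0 only if the derived t lies beyond
# +critical_t or -critical_t, i.e. if its absolute value exceeds it.
reject_null = abs(derived_t) > critical_t

print("df =", df)
print("reject H0" if reject_null else "fail to reject H0")
```

Comparing the absolute value of the derived t against a single positive critical value is equivalent to checking both tails separately.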
In a research paper, the results might be reported as follows:
There was a strong negative correlation found between smoking and longevity, although the correlation was not statistically significant, r(7)=-0.61, p>0.05.
Note that the degrees of freedom appear in the parentheses to the right of r.
To demonstrate the effect of the number of pairs of scores upon the significance of the correlation coefficient, had the number of pairs been 12, the correlation r=-0.61 would have been statistically significant. Thus, it is ironic that we found a strong negative relationship between smoking and longevity, although the relationship is not statistically significant (i.e., a correlation this large would occur by chance more than 5% of the time if there were no relationship in the population). In the world of conservative statisticians, this finding is not good enough to be called statistically significant.
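
The effect of sample size can be illustrated by holding r fixed at -0.61 and changing only the number of pairs. The sketch below rests on two assumptions not stated in the text: the standard textbook formula for converting r to t, and a critical value of 2.228 for df=10, taken from a standard two-tailed t table at p=0.05. The helper name t_from_r is hypothetical.

```python
import math


def t_from_r(r, n_pairs):
    """Convert a correlation r from n_pairs pairs into a t value (df = n_pairs - 2)."""
    return r * math.sqrt(n_pairs - 2) / math.sqrt(1 - r * r)


# Two-tailed critical values at p = 0.05, keyed by df.
# 2.365 is quoted in the text; 2.228 is assumed from a standard t table.
critical = {7: 2.365, 10: 2.228}

results = {}
for n_pairs in (9, 12):
    df = n_pairs - 2
    t = t_from_r(-0.61, n_pairs)
    significant = abs(t) > critical[df]
    results[n_pairs] = (t, significant)
    print(n_pairs, "pairs: t =", round(t, 3), "significant:", significant)
```

Two things change as N grows: the derived t moves further from zero, and the critical value it must exceed shrinks. Both work in favor of rejecting the null hypothesis.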
