Monday, March 10, 2008

Point biserial correlation

The point biserial correlation (rpb) gives an estimate of the degree of relationship between a dichotomous variable and a continuous variable. One typical use of the rpb is correlating a single test item, which is dichotomous (e.g., yes-no), with the overall test score, which is continuous.
For example, a neuropsychologist wanted to determine whether fine motor performance, as measured by speed of finger tapping, was related to gender in an older sample of males and females.

In a research paper, the rpb might be reported as follows:
There was a strong and significant negative correlation between gender and fine motor movement, r(8) = -0.68, p < 0.05; in other words, older males tended to have slower fine motor movements, and older females tended to have faster fine motor movements.
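
The notes above do not give a formula for the rpb. As a hedged sketch, one common textbook form is shown below. The symbols are assumptions introduced here and are not taken from the source: M_1 and M_0 are the means of the continuous variable in the two groups, n_1 and n_0 are the group sizes, n = n_1 + n_0, and s_n is the standard deviation of the continuous variable computed with n in the denominator. It gives the same value as a Pearson r computed with the dichotomous variable coded 0 and 1.

```latex
r_{pb} = \frac{M_1 - M_0}{s_n}\sqrt{\frac{n_1 n_0}{n^2}},
\qquad
s_n = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}
```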

Spearman correlation

The Spearman correlation coefficient determines the degree of relationship between two sets of ranked data. It is also called the rank-order correlation coefficient. Although the Spearman correlation is far less common than the Pearson, variables are occasionally ordered according to rank, or may be subsequently ranked on the basis of a continuous variable. The formula for Spearman is the following:
See page 181.
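
Since the formula itself is not reproduced in these notes, here is the form usually given in introductory texts, offered as a sketch that may differ in notation from the page cited. The symbols are assumptions: d is the difference between the two ranks assigned to the same item, and N is the number of ranked items. This shortcut form assumes there are no tied ranks.

```latex
r_s = 1 - \frac{6\sum d^2}{N\left(N^2 - 1\right)}
```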
Spearman is most typically used in situations where a number of items are all ranked by two independent judges. For example, a psychologist wishes to determine how alike husbands and wives are in their vegetable preferences. The members of each couple were independently asked to rank their preference for seven vegetables from most preferred (#1 rank) to least preferred (#7).
Significance test for Spearman r
Spearman r cannot be tested for significance in the same manner as Pearson r.

The Pearson product-moment correlation coefficient

The single most common type of correlation is the Pearson product-moment correlation coefficient, which measures the degree of relationship between two continuous variables. A continuous variable is a variable that can be measured along a line scale. For example, smoking can be measured continuously because the number of cigarettes smoked per week, month, or year(s) can be measured along a line scale from zero to a large number. Gender (male or female) is not considered a continuous variable because if numbers (e.g., 1 or 2) were assigned to the two categories, a person could not be a 1.3 or a 1.7.
For the formula for r, see Statistics: A Gentle Introduction, by Frederick L. Coolidge.
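
For convenience, the standard deviation-score form of the Pearson r is sketched below. The notation is an assumption and may not match the book's: X and Y are the paired scores, and the bars denote their means.

```latex
r = \frac{\sum \left(X - \bar{X}\right)\left(Y - \bar{Y}\right)}
         {\sqrt{\sum \left(X - \bar{X}\right)^2 \, \sum \left(Y - \bar{Y}\right)^2}}
```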

Here are the smoking and longevity data (smoking in the first column, longevity in years in the second).
25 63
35 68
10 72
40 62
85 65
75 46
60 51
45 60
50 55

          Column 1   Column 2
Column 1  1
Column 2  -0.61106   1

The correlation matrix above gives r = -0.611. It means that there is a strong negative correlation between smoking and longevity. This indicates that the higher the number of cigarettes smoked in the past 5 years, the lower the number of years lived, and the lower the number of cigarettes, the higher the number of years lived. Remember, this relationship between the two variables does not mean that heavy smoking causes one to live a shorter life. It may, however, give clues as to further research ideas for experiments.
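
The value of r can be checked by hand. The following is a minimal sketch in plain Python using the nine pairs listed above; the function name pearson_r and the variable names are my own choices, not anything from the book.

```python
import math

# The nine (smoking, longevity) pairs from the data listed above.
smoking = [25, 35, 10, 40, 85, 75, 60, 45, 50]
longevity = [63, 68, 72, 62, 65, 46, 51, 60, 55]


def pearson_r(x, y):
    """Pearson r computed from sums of deviations around the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of the deviations.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(smoking, longevity)
print(round(r, 5))  # -0.61106
```

The result agrees with the spreadsheet output above.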
Testing for the significance of a correlation coefficient.
A correlation coefficient may be tested to determine whether the coefficient significantly differs from zero. The value r is obtained on a sample. The value rho (ρ) is the population's correlation coefficient. It is hoped that r closely approximates rho. The null and alternative hypotheses are as follows:
H0: ρ = 0
Ha: ρ ≠ 0
The value of r and the number of pairs of scores are converted through a formula into a distribution called the t distribution. The t formula can only be used to test whether r is equal to zero. It cannot be used to test whether r might be equal to some number other than zero. It is also important to note that the t distribution may be used to test other types of inferential statistics.
The t distribution is most commonly used to test whether two means are significantly different, but it may also be used to test the significance of the correlation coefficient.
Interestingly, the t distribution becomes the z distribution as the number of observations approaches infinity, and the two are already strikingly similar visually when the data set contains only several hundred numbers.
The t formula for testing the null hypothesis for a correlation coefficient is the following:
See page 165, or other statistics books.
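
As a sketch of what that reference contains, the usual form of the test is given below. The notation is an assumption, with N the number of pairs of scores; it is consistent with the df = N - 2 and the t value used in the worked example in these notes.

```latex
t = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}}, \qquad df = N - 2
```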

In the example on smoking, the research question was whether heavy smoking was related to longevity. To test whether this obtained r significantly differs from zero, the t formula is used.
t=-2.042
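
The derived t can be reproduced with a few lines of Python. This is a sketch that assumes the standard formula t = r * sqrt(N - 2) / sqrt(1 - r^2), which the notes reference but do not print; the function name t_from_r is hypothetical.

```python
import math


def t_from_r(r, n_pairs):
    """t statistic for testing H0: rho = 0 (assumed standard formula)."""
    df = n_pairs - 2  # degrees of freedom for this test
    return r * math.sqrt(df) / math.sqrt(1 - r ** 2)


# r from the smoking example, with nine pairs of scores.
t = t_from_r(-0.61106, 9)
print(round(t, 3))  # -2.042
```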

Step 1.
Choose a one-tailed or two-tailed test of significance.
The alternative hypothesis establishes whether we will use a one-tailed or two-tailed significance test. If the alternative hypothesis is nondirectional, as is the case in most studies, then a two-tailed test of significance is required.

Step 2.
Choose the level of significance.
The conventional level of significance is p = 0.05. Only in rare circumstances would one ever depart from p = 0.05 as a starting point.

Step 3.
Determine the degrees of freedom (df).
The df is an advanced statistical concept related to sampling. We will keep things simple: the formula for this t test statistic is df = N - 2, where N is the number of pairs of scores. In this example, there were nine pairs of scores, so df = 9 - 2, or df = 7.

Step 4.
Determine whether the t from the formula (called the derived t) exceeds the table critical values from the t distribution.
For a two-tailed test of significance at p = 0.05 with df = 7, the critical values of t are t = +2.365 and t = -2.365. If the derived t is greater than t = +2.365 or less than t = -2.365, then the null hypothesis will be rejected. In this example, the derived t = -2.042 is not less than t = -2.365; therefore, the null hypothesis is not rejected, and it will be concluded that r = -0.61 indicates a nonsignificant relationship. Curiously, although the relationship was strong (r = -0.61), the test of significance indicated that there was a greater than 5 in 100 chance that the obtained relationship was due to chance.
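
The decision rule in Step 4 can be written as a short check. This is only a sketch: the function name reject_null is hypothetical, and the critical value is the tabled figure quoted above rather than something computed here.

```python
def reject_null(derived_t, critical_t):
    """Two-tailed rule: reject H0 when the derived t falls beyond
    either +critical_t or -critical_t."""
    return abs(derived_t) > critical_t


# Values from the text: derived t = -2.042, critical t = 2.365 (df = 7).
decision = reject_null(-2.042, 2.365)
print(decision)  # False, so the null hypothesis is not rejected
```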
In a research paper, the results might be reported as follows:
There was a strong negative correlation found between smoking and longevity, although the correlation was not statistically significant, r(7) = -0.61, p > 0.05. Note that the degrees of freedom appear in the parentheses to the right of r.
To demonstrate the effect of the number of pairs of scores upon the significance of the correlation coefficient: had the number of pairs been 12, the correlation r = -0.61 would have been statistically significant. Thus, it is ironic that we found a strong negative relationship between smoking and longevity, although the relationship is not statistically significant (i.e., there is a greater than 5% probability that the finding could have been due to chance). In the world of conservative statisticians, this finding is not good enough to be called statistically significant.
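
The effect of sample size can be illustrated by holding r fixed and changing only the number of pairs. The sketch below assumes the standard formula t = r * sqrt(N - 2) / sqrt(1 - r^2). The critical value for df = 10 (2.228) does not appear in these notes; it is taken from a standard two-tailed t table at the 0.05 level. All names are my own.

```python
import math

R = -0.61  # correlation from the text, held fixed

# Two-tailed critical t values at the 0.05 level, keyed by df.
# 2.365 (df = 7) is quoted in the text; 2.228 (df = 10) is from a
# standard t table and is an addition for this illustration.
CRITICAL_T = {7: 2.365, 10: 2.228}


def t_from_r(r, n_pairs):
    """t statistic for testing H0: rho = 0 (assumed standard formula)."""
    df = n_pairs - 2
    return r * math.sqrt(df) / math.sqrt(1 - r ** 2)


for n_pairs in (9, 12):
    df = n_pairs - 2
    t = t_from_r(R, n_pairs)
    significant = abs(t) > CRITICAL_T[df]
    print(f"N = {n_pairs}, df = {df}, t = {t:.3f}, significant: {significant}")
```

With nine pairs the derived t falls short of its critical value, while with twelve pairs the same r exceeds it.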

Friday, March 7, 2008

The four common types of correlation

1. Pearson. A measure of the strength of a relationship between two continuous variables.
2. Spearman. A measure of the similarity between two ordinal rankings of a single set of data.
3. Point-biserial. A measure of the strength of a relationship between one continuous variable and one dichotomous variable (a two-level-only variable such as gender).
4. Phi correlation. A measure of the strength of a relationship between two dichotomous variables.

Correlation and prediction

The correlation coefficient may also be used as an indicator of prediction. If a strong positive or negative correlation is obtained, then the relationship between the two variables may be treated as a predictive relationship.

Correlation

From Statistics: A Gentle Introduction, by Frederick L. Coolidge.

Correlation is a statistical method that determines the degree of relationship between two different variables. It is also known as a "bivariate" statistic, with bi meaning two and variate indicating variable. The correlation can only range from -1.00 to +1.00 (note that the typical correlation coefficient is reported to two decimal places).
But just because two variables are correlated, it does not mean that one variable caused the other.