An R Introduction to Statistics

Significance Test for Kendall's Tau-b

A variation of the definition of the Kendall correlation coefficient is necessary in order to deal with data samples with tied ranks. It is known as Kendall’s tau-b coefficient and is more effective in determining whether two non-parametric data samples with ties are correlated.

Formally, Kendall’s tau-b is defined as follows. It replaces the denominator of the original definition with the product of the square roots of the numbers of data pairs not tied in each of the two target features.

$$\tau_B = \frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{\sqrt{N_1} \times \sqrt{N_2}}$$

where N1  =  number of data pairs not tied in a target feature
      N2  =  number of data pairs not tied in another target feature

In the context of our previous example based on the data set survey, N1 would be the number of student pairs with different smoking habits, whereas N2 would be the number of student pairs with different exercise practice levels. We say the two variables Exer and Smoke in the data set are uncorrelated if their correlation coefficient is zero.
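The definition can be checked by brute force on a small sample with ties. The following is only a sketch: the helper name tau_b and the vectors x and y are made up for illustration and are not part of the survey example. It counts concordant and discordant pairs directly, and compares the result with the value reported by cor.

```r
# Hypothetical helper: brute-force Kendall's tau-b over all pairs of observations.
tau_b <- function(x, y) {
    idx <- combn(length(x), 2)              # all index pairs (i, j) with i < j
    dx <- sign(x[idx[1, ]] - x[idx[2, ]])   # 0 when the pair is tied in x
    dy <- sign(y[idx[1, ]] - y[idx[2, ]])   # 0 when the pair is tied in y
    concordant <- sum(dx * dy > 0)
    discordant <- sum(dx * dy < 0)
    n1 <- sum(dx != 0)                      # N1: pairs not tied in x
    n2 <- sum(dy != 0)                      # N2: pairs not tied in y
    (concordant - discordant) / (sqrt(n1) * sqrt(n2))
}

# Illustrative data with tied values (not taken from the survey data set).
x <- c(1, 2, 2, 3, 3, 3, 1, 2)
y <- c(1, 1, 2, 2, 3, 1, 2, 3)

tau_b(x, y)
cor(x, y, method = "kendall")   # should agree with the brute-force value
```

The brute-force version examines every pair of observations, so it is only meant for small samples.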

As before, the function cor shows that the Kendall correlation coefficient between Exer and Smoke is 0.083547.

> library(MASS)                     # provides the survey data set 
> smoke <- as.numeric(factor(survey$Smoke, 
+     levels=c("Never","Occas","Regul","Heavy"))) 
> exer <- as.numeric(factor(survey$Exer, 
+     levels=c("None","Some","Freq"))) 
 
> m <- cbind(exer, smoke) 
> cor(m, method="kendall", use="pairwise") 
          exer    smoke 
exer  1.000000 0.083547 
smoke 0.083547 1.000000

In order to decide whether the variables are uncorrelated, we test the null hypothesis that τB = 0. The alternative hypothesis is that the variables are correlated, and τB is non-zero.

To test the null hypothesis, we apply the function cor.test and find a p-value of 0.1709. Hence we do not reject the null hypothesis that the variables are uncorrelated at the 0.05 significance level.

> cor.test(exer, smoke, method="kendall") 
 
        Kendall’s rank correlation tau 
 
data:  exer and smoke 
z = 1.3694, p-value = 0.1709 
alternative hypothesis: true tau is not equal to 0 
sample estimates: 
     tau 
0.083547
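The decision can also be made programmatically rather than by reading the printed report. The sketch below assumes the MASS package is installed, and the variable names ct, alpha and reject are chosen here for illustration only. It pulls the estimate and the p-value out of the object returned by cor.test and compares the p-value with the significance level.

```r
library(MASS)   # recommended package that ships the survey data set

# Encode the ordered categories as numbers, as in the text.
smoke <- as.numeric(factor(survey$Smoke,
    levels = c("Never", "Occas", "Regul", "Heavy")))
exer <- as.numeric(factor(survey$Exer,
    levels = c("None", "Some", "Freq")))

ct <- cor.test(exer, smoke, method = "kendall")

alpha <- 0.05                   # significance level used in the text
reject <- ct$p.value < alpha    # TRUE would mean rejecting the null hypothesis

ct$estimate                     # the Kendall correlation estimate
ct$p.value                      # the two-sided p-value
reject
```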

The same can be achieved with the latest version of rpudplus. We begin with the function rpucor, which now uses Kendall’s tau-b to compute the correlation coefficient, and comes up with the same correlation coefficient as the function cor.

> library(rpud)                     # load rpudplus 
> rpucor(m, method="kendall", use="pairwise") 
          exer    smoke 
exer  1.000000 0.083547 
smoke 0.083547 1.000000 
attr(,"method") 
[1] "kendall" 
attr(,"use") 
[1] "pairwise.complete.obs"

Alternatively, we can now use the function rpucor.test to extract the Kendall’s tau-b from the estimate component of the result.

> rt <- rpucor.test(m, method="kendall", use="pairwise") 
> rt$estimate 
          exer    smoke 
exer  1.000000 0.083547 
smoke 0.083547 1.000000

And we can find the p-value of the correlation estimate in the p.value component.

> rt$p.value 
         exer   smoke 
exer  0.00000 0.17088 
smoke 0.17088 0.00000

To decide whether to reject the null hypothesis that the variables are uncorrelated, we compare the p-values against 0.05. Since the comparison result at ["exer", "smoke"] is FALSE, we do not reject the null hypothesis that Exer and Smoke are uncorrelated at the 0.05 significance level. The TRUE entries on the diagonal merely reflect that each variable is perfectly correlated with itself.

> rt$p.value < 0.05 
       exer smoke 
exer   TRUE FALSE 
smoke FALSE  TRUE
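Where rpudplus is not available, a comparable matrix of p-values can be assembled in base R by applying cor.test to each pair of columns. This is a minimal sketch, not the rpudplus implementation: the helper name kendall_pvalues and the sample matrix are hypothetical, and setting the diagonal to zero is an assumption made only to mirror the layout shown above.

```r
# Hypothetical helper: matrix of Kendall p-values for every pair of columns.
kendall_pvalues <- function(m) {
    k <- ncol(m)
    # Diagonal is left at 0 (assumption, to mirror the layout in the text).
    p <- matrix(0, k, k, dimnames = list(colnames(m), colnames(m)))
    for (i in seq_len(k - 1)) {
        for (j in (i + 1):k) {
            # exact = FALSE requests the normal approximation, which is
            # what cor.test falls back to when the data contain ties.
            ct <- cor.test(m[, i], m[, j], method = "kendall", exact = FALSE)
            p[i, j] <- p[j, i] <- ct$p.value
        }
    }
    p
}

# Illustrative random data on a few ordered levels (not the survey data).
set.seed(17)
a <- sample(1:4, 60, replace = TRUE)
b <- sample(1:3, 60, replace = TRUE)
m <- cbind(a = a, b = b, c = pmin(a + sample(0:1, 60, replace = TRUE), 4))

p <- kendall_pvalues(m)
p < 0.05    # TRUE marks pairs for which the null hypothesis is rejected
```

The double loop calls cor.test once per pair of columns, so it is practical only for a modest number of features.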

To evaluate performance, we create random sample data drawn from 16 ordered symbols, and find that the function cor takes more than one and a half minutes to compute Kendall’s coefficient for 4000 observations in 24 features on an AMD Phenom II X4 CPU. At the time of writing, the function cor.test does not support multivariate data, so it is left out of the comparison.

> test.data <- function(dim, num, seed=17) { 
+     set.seed(seed) 
+     matrix(sample(1:16, dim*num, replace=TRUE), nrow=num) 
+ } 
> m <- test.data(24, 4000) 

> system.time(cor(m, method="kendall")) 
   user  system elapsed 
 95.610   0.010  95.614

The same task takes less than 0.6 seconds with rpucor on an NVIDIA GTX 460 GPU.

> system.time(rpucor(m, method="kendall")) 
   user  system elapsed 
  0.570   0.000   0.569

And it takes a similarly short amount of time for rpucor.test.

> system.time(rpucor.test(m, method="kendall")) 
   user  system elapsed 
  0.520   0.000   0.516

Exercises

  1. Determine which pairs of variables in the data set USJudgeRatings are correlated based on Kendall’s method at the 0.05 significance level.
  2. Determine which pairs of variables in the data set ToothGrowth are correlated based on Kendall’s method at the 0.05 significance level. (Hint: transform its dichotomous variable supp into numeric type before proceeding.)