An R Introduction to Statistics

Kendall Rank Coefficient

The correlation coefficient is a measurement of association between two random variables. While its numerical calculation is straightforward, it is not readily applicable to non-parametric statistics.

For example, in the data set survey from the MASS package, the exercise level (Exer) and smoking habit (Smoke) are qualitative attributes. To find their correlation coefficient, we would have to assign artificial numeric values to the qualitative data, and the result would depend on that arbitrary choice.
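To illustrate the problem with artificial numeric values, here is a small sketch with made-up data. The vectors and the alternative codes 1, 2, 10 are invented for illustration and are not taken from survey. The ordinary (Pearson) coefficient changes when the codes change, whereas a rank-based coefficient such as Kendall's only uses the ordering and stays the same.

```r
# Hypothetical exercise levels coded 1 (None), 2 (Some), 3 (Freq)
exer.a <- c(1, 2, 3, 2, 1, 3)
# Same ordering of the levels, but with the arbitrary codes 1, 2, 10
exer.b <- c(1, 2, 10)[exer.a]
# Hypothetical smoking levels coded 1 to 4
smoke.toy <- c(1, 1, 2, 3, 1, 4)

# The Pearson coefficient depends on the numeric codes chosen
pearson.a <- cor(exer.a, smoke.toy)
pearson.b <- cor(exer.b, smoke.toy)

# A rank-based coefficient only uses the ordering of the values
kendall.a <- cor(exer.a, smoke.toy, method="kendall")
kendall.b <- cor(exer.b, smoke.toy, method="kendall")

print(c(pearson.a, pearson.b))
print(c(kendall.a, kendall.b))
```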

A more robust approach is to compare the rank orders between the variables. For example, suppose student A exercises more than student B in the data set survey. We can check whether it is also the case that student A smokes more than student B. If so, then A and B have the same relative rank orders, and we say that A and B form a concordant pair with respect to the random variables Exer and Smoke. If student A turns out to smoke less than student B, then we say that A and B form a discordant pair.
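The comparison of two students can be sketched in a few lines of R. The three students and their codes are hypothetical, and pair.type is a helper name introduced here for illustration only. A pair that has equal values in either variable is labeled "tied", a case the discussion above leaves aside.

```r
# Hypothetical ordinal codes for three students A, B and C
exer.toy  <- c(A=3, B=1, C=2)   # exercise level
smoke.toy <- c(A=2, B=1, C=3)   # smoking level

# Classify a pair of students by comparing their relative rank orders
pair.type <- function(i, j) {
    s <- sign(exer.toy[[i]] - exer.toy[[j]]) *
         sign(smoke.toy[[i]] - smoke.toy[[j]])
    if (s > 0) "concordant" else if (s < 0) "discordant" else "tied"
}

print(pair.type("A", "B"))   # A exercises more and smokes more than B
print(pair.type("A", "C"))   # A exercises more but smokes less than C
```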

Intuitively, if the number of concordant pairs is much larger than the number of discordant pairs, then the random variables are positively correlated. If the number of concordant pairs is much smaller than the number of discordant pairs, then the variables are negatively correlated. Finally, if the two numbers are about the same, then the variables are weakly correlated. Accordingly, the Kendall rank correlation coefficient between the two random variables with n observations is defined as:

\[
\tau = \frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{n(n-1)/2}
\]
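As a check on the definition, the following sketch counts the concordant and discordant pairs directly. The data are made up and contain no tied values, and kendall.tau is a hypothetical helper written for this illustration, not a function from any package.

```r
# Hypothetical tie-free data with n = 5 observations
x <- c(1, 2, 3, 4, 5)
y <- c(3, 1, 2, 5, 4)

kendall.tau <- function(x, y) {
    n <- length(x)
    idx <- combn(n, 2)    # all n(n-1)/2 pairs of observations, one per column
    i <- idx[1, ]
    j <- idx[2, ]
    # +1 for a concordant pair, -1 for a discordant pair, 0 for a tie
    s <- sign(x[i] - x[j]) * sign(y[i] - y[j])
    concordant <- sum(s > 0)
    discordant <- sum(s < 0)
    (concordant - discordant) / (n * (n - 1) / 2)
}

tau <- kendall.tau(x, y)
print(tau)
print(cor(x, y, method="kendall"))
```

For these tie-free data the hand count agrees with the built-in function cor. When the data contain tied values, cor adjusts the denominator for the ties, so the two numbers need not agree.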

To find the Kendall coefficient between Exer and Smoke, we first convert both factors into numeric codes that follow the natural order of their levels, and then create a matrix m consisting only of the exer and smoke columns. Then we apply the function cor with the "kendall" option. In order to filter out missing survey values, we set the use option as "pairwise.complete.obs". The computation shows that the Kendall coefficient between Exer and Smoke is 0.083547, which is close to zero. Hence the student exercise level and smoking habit are weakly correlated variables.

> library(MASS)                  # load the MASS package for survey
> smoke <- as.numeric(factor(survey$Smoke,
+     levels=c("Never","Occas","Regul","Heavy")))
> exer <- as.numeric(factor(survey$Exer,
+     levels=c("None","Some","Freq")))

> m <- cbind(exer, smoke)
> cor(m, method="kendall", use="pairwise.complete.obs")
          exer    smoke 
exer  1.000000 0.083547 
smoke 0.083547 1.000000
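The effect of the use option can be seen on a small made-up example. The vectors below are hypothetical. Under the default setting "everything" a single missing value makes the result NA, while "pairwise.complete.obs" computes the coefficient from the observations where both values are present.

```r
# Hypothetical data with one missing value
x <- c(1, 2, 3, 4, NA)
y <- c(2, 1, 4, 3, 5)

# Default use="everything": the missing value propagates to the result
r.all <- cor(x, y, method="kendall")

# Pairwise deletion: only the four complete (x, y) observations are used
r.pair <- cor(x, y, method="kendall", use="pairwise.complete.obs")

print(r.all)
print(r.pair)
```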

Computing the Kendall coefficient is time consuming, since the rank orders of every pair of observations have to be compared. For a random data set with 24 variables and 4,000 observations, it takes about 1.5 minutes to calculate the Kendall correlation coefficients on an AMD Phenom II X4 CPU.
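A rough count of the work involved explains the cost. The sketch below assumes the coefficient is computed directly from the definition, with one comparison per pair of observations for each pair of variables. An actual implementation may organize the work differently, so the count is only an order-of-magnitude estimate.

```r
n.var <- 24      # number of variables in the test data
n.obs <- 4000    # number of observations in the test data

var.pairs <- choose(n.var, 2)    # distinct pairs of variables
obs.pairs <- choose(n.obs, 2)    # pairs of observations per variable pair

# Total pairwise comparisons under the direct-counting assumption
comparisons <- var.pairs * obs.pairs
print(format(comparisons, big.mark=","))
```

Under this assumption the test involves about 2.2 billion comparisons, and the count grows quadratically with the number of observations.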

> test.data <- function(dim, num, seed=17) { 
+     set.seed(seed) 
+     matrix(rnorm(dim * num), nrow=num) 
+ } 

> m <- test.data(dim=24, num=4000) 
> system.time(cor(m, method="kendall")) 
  user  system elapsed 
91.620   0.010  91.624

The computation time is drastically reduced on an NVIDIA GTX 460 GPU. If we load the rpud package with the rpudplus add-on, and compute the same Kendall correlation coefficients using the function rpucor, the execution takes only about 0.56 seconds.

> library(rpud)                  # load rpudplus 
> system.time(rpucor(m, method="kendall")) 
   user  system elapsed 
  0.560   0.000   0.559
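The speedup follows from the two elapsed times reported above. The figures are copied from the timings in this section and apply only to this particular hardware and test data.

```r
cpu.elapsed <- 91.624   # elapsed seconds reported for cor
gpu.elapsed <- 0.559    # elapsed seconds reported for rpucor

# Ratio of CPU time to GPU time
speedup <- cpu.elapsed / gpu.elapsed
print(round(speedup))
```

In this test the GPU computation is roughly 164 times faster than the CPU computation.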

Here is a chart that compares the performance of the AMD Phenom II X4 and the NVIDIA GTX 460 in computing the Kendall rank coefficient.

[Figure: performance chart, AMD Phenom II X4 vs. NVIDIA GTX 460]

Exercises

  1. Run the performance test with more variables and observations.
  2. Find the Kendall correlation coefficients of the data set survey with various options for handling missing values, such as “everything” (default), “all.obs”, “complete.obs”, “na.or.complete”, and “pairwise.complete.obs”.