An R Introduction to Statistics

Support Vector Machine with GPU, Part II

In our last tutorial on SVM training with GPU, we noted that the data must be pre-scaled with rpusvm-scale, and that the scaling must then be reversed on the prediction outcome. This cumbersome procedure is now simplified in the latest RPUSVM.

For example, we can work directly with the cadata data set from the LIBSVM site. Just load it into the R workspace with read.svm.data and apply the function rpusvm right away. The overhead of the implicit data scaling turns out to be negligible.

> library(rpud)                     # load rpudplus 
> cadata <- read.svm.data("cadata", fac=FALSE) 
> x <- cadata$x; y <- cadata$y 
> system.time(cadata.rpusvm <- rpusvm(x, y, type="eps-regression")) 
........**
   user  system elapsed 
  6.510   0.020   6.539

We can inspect the range of each attribute of cadata in the SVM model cadata.rpusvm. In particular, for a data set with N attributes, the x.bound component of cadata.rpusvm is a 2 × N matrix, with each column containing the lower and upper bounds of the corresponding attribute. Likewise, the y.bound component contains the lower and upper bounds of the response variable.

> cadata.rpusvm$x.bound 
        [,1] [,2]  [,3] [,4]  [,5] [,6]  [,7]    [,8] 
[1,]  0.4999    1     2    1     3    1 32.54 -124.35 
[2,] 15.0001   52 39320 6445 35682 6082 41.95 -114.31 
 
> cadata.rpusvm$y.bound 
       [,1] 
[1,]  14999 
[2,] 500001
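
To make the role of these bounds concrete, the following base R sketch maps values linearly onto [-1, 1] and back again, using the bounds printed above. It assumes the conventional min-max formula used by LIBSVM-style scaling tools; whether rpusvm applies exactly this formula internally is an assumption. The helpers scale.to.range, unscale and scale.columns are hypothetical names introduced for illustration and are not part of rpud.

```r
# Bounds copied from the x.bound and y.bound output above.
x.bound <- rbind(c(0.4999, 1, 2, 1, 3, 1, 32.54, -124.35),
                 c(15.0001, 52, 39320, 6445, 35682, 6082, 41.95, -114.31))
y.bound <- c(14999, 500001)

# Hypothetical helper: map v linearly from [lo, hi] onto [-1, 1].
scale.to.range <- function(v, lo, hi) -1 + 2 * (v - lo) / (hi - lo)

# Hypothetical helper: invert the mapping above.
unscale <- function(s, lo, hi) lo + (s + 1) * (hi - lo) / 2

# Hypothetical helper: scale each column of x with its own bounds.
scale.columns <- function(x, bound) {
  lo <- bound[1, ]
  hi <- bound[2, ]
  2 * sweep(sweep(x, 2, lo, "-"), 2, hi - lo, "/") - 1
}

# Response values at the lower bound, the midpoint and the upper bound.
y.demo <- c(14999, 257500, 500001)
y.scaled <- scale.to.range(y.demo, y.bound[1], y.bound[2])
y.back <- unscale(y.scaled, y.bound[1], y.bound[2])

# Attribute rows at the lower bounds, the midpoints and the upper bounds.
x.demo <- rbind(x.bound[1, ],
                (x.bound[1, ] + x.bound[2, ]) / 2,
                x.bound[2, ])
x.scaled <- scale.columns(x.demo, x.bound)
```

Under these assumptions, the three response values map to -1, 0 and 1, and unscale recovers the original values. This is the kind of post-processing a prediction would need if the model did not carry its own bounds.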

Using the residuals component of the SVM model, we can compute the mean squared error:

> cadata.res <- cadata.rpusvm$residuals 
> sum(cadata.res*cadata.res)/length(cadata.res) 
[1] 4.091e+09
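
The same quantity can be written more compactly with mean, and its square root expresses the error in the units of the response. The sketch below uses a small made-up residual vector rather than the cadata residuals, so it runs without a GPU; the variable names are illustrative only.

```r
# Toy residuals standing in for cadata.rpusvm$residuals (made-up values).
res <- c(-2, 1, 3, -1, 0)

# Mean squared error, equivalent to sum(res * res) / length(res).
mse <- mean(res^2)

# Root mean squared error, in the same units as the response variable.
rmse <- sqrt(mse)
```

Applied to the figure above, the square root of 4.091e+09 is roughly 64,000, which is easier to compare against the response range of 14999 to 500001.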

As for prediction, we can apply the function predict to a test data set without further post-processing:

> test.dat <- read.svm.data("test.dat", fac=FALSE) 
> pred <- predict(cadata.rpusvm, test.dat$x) 
> head(pred) 
     1      2      3      4      5      6 
401234 447244 392549 330003 244601 253104

To train an SVM from a terminal instead, we can run the standalone rpusvm-train tool with the extra scale parameters -x and -y:

$ time rpusvm-train -x "-1:1" -y "-1:1" -s 3 cadata cadata.rpusvm 
rpusvm-train 0.1.2 
http://www.r-tutor.com 
Copyright (C) 2011-2012 Chi Yau. All Rights Reserved. 
This software is free for academic use only. There is absolutely NO warranty. 
 
GeForce GTX 460 GPU 
 
.........**
 
Finished optimization in 9583 iterations 
nu = 0.590296 
obj = -2232.72, rho = -0.300248 
nSV = 12221, nBSV = 12160 
Total nSV = 12221 
 
real    0m5.171s 
user    0m5.020s 
sys     0m0.130s

A glance at the SVM model file cadata.rpusvm shows that the scale parameters are gathered in the header section.

$ head -n 20 cadata.rpusvm 
SvmType: smo-epsilon-regression 
KernelType: radial 
Gamma: 0.125 
X-scale: -1 1 
  0: 0.4999 15.0001 
  1: 1 52 
  2: 2 39320 
  3: 1 6445 
  4: 3 35682 
  5: 1 6082 
  6: 32.54 41.95 
  7: -124.35 -114.31 
Y-scale: -1 1 
   14999 500001 
NrClass: 2 
TotalSV: 12221 
Rho: -0.300248
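
Because the header is plain text, the scale parameters can be recovered in R if needed. The following sketch parses the header lines shown above, embedded here as a character vector so that it runs on its own, into the same 2 × N layout as the x.bound component. It relies only on the layout visible in this listing; the full model file format is not documented here, so treat the parsing rules as assumptions.

```r
# Header lines copied from the listing above.
header <- c("SvmType: smo-epsilon-regression",
            "KernelType: radial",
            "Gamma: 0.125",
            "X-scale: -1 1",
            "  0: 0.4999 15.0001",
            "  1: 1 52",
            "  2: 2 39320",
            "  3: 1 6445",
            "  4: 3 35682",
            "  5: 1 6082",
            "  6: 32.54 41.95",
            "  7: -124.35 -114.31",
            "Y-scale: -1 1",
            "   14999 500001",
            "NrClass: 2")

# Locate the two scale sections.
x.start <- grep("^X-scale:", header)
y.start <- grep("^Y-scale:", header)

# Attribute lines sit between the X-scale and Y-scale markers.
x.lines <- header[(x.start + 1):(y.start - 1)]

# Drop the "index:" prefix, then split each line into lower and upper bound.
x.fields <- strsplit(trimws(sub("^ *[0-9]+:", "", x.lines)), " +")
x.bound <- sapply(x.fields, as.numeric)   # 2 x N matrix, one column per attribute

# The response bounds follow the Y-scale marker on a single line.
y.bound <- as.numeric(strsplit(trimws(header[y.start + 1]), " +")[[1]])
```

To apply this to the actual file, header could be obtained with readLines("cadata.rpusvm", n = 20), assuming the file is in the working directory.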

Finally, we can use the model for prediction without pre-scaling the test data or post-scaling the outcome.

$ time rpusvm-predict cadata cadata.rpusvm cadata.out 
rpusvm-predict 0.1.2 
http://www.r-tutor.com 
Copyright (C) 2011-2012 Chi Yau. All Rights Reserved. 
This software is free for academic use only. There is absolutely NO warranty. 
 
GeForce GTX 460 GPU 
 
Mean squared error = 4.09101e+09 
Pearson correlation coefficient = 0.698961 
 
real    0m1.691s 
user    0m1.480s 
sys     0m0.170s
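
The two figures reported by rpusvm-predict can also be computed in R from a vector of predictions and the corresponding observed responses. The sketch below uses made-up toy vectors and a hypothetical helper named eval.regression; it illustrates the standard definitions of the two metrics and makes no claim about how rpusvm-predict computes them internally.

```r
# Hypothetical helper: mean squared error and Pearson correlation
# between predicted and observed responses.
eval.regression <- function(pred, obs) {
  stopifnot(length(pred) == length(obs))
  c(mse = mean((pred - obs)^2),
    pearson = cor(pred, obs, method = "pearson"))
}

# Made-up toy values, not taken from cadata.
obs  <- c(1, 2, 3, 4, 5)
pred <- c(1.1, 1.9, 3.2, 3.8, 5.0)

metrics <- eval.regression(pred, obs)
```

With the R model from earlier, the same helper could be called as eval.regression(predict(cadata.rpusvm, x), y), provided a GPU and the rpud package are available.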