Data preprocessing refers to any transformation of the data done before applying a learning algorithm. This includes, for example, finding and resolving inconsistencies, imputing missing values, identifying, removing or replacing outliers, discretizing numerical data, generating numerical dummy variables for categorical data, transformations such as standardization of predictors or Box-Cox, dimensionality reduction, and feature extraction and/or selection.

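mlr offers several options for data preprocessing. Some of the following simple methods to change a Task() (or data.frame) were already mentioned on the page about learning tasks:

  • capLargeValues(): Convert large/infinite numeric values.
  • createDummyFeatures(): Generate dummy variables for factor features.
  • dropFeatures(): Remove selected features.
  • joinClassLevels(): Only for classification: merge existing classes into new, larger classes.
  • mergeSmallFactorLevels(): Merge infrequent levels of factor features.
  • normalizeFeatures(): Normalize features by different methods, e.g., standardization or scaling to a certain range.
  • removeConstantFeatures(): Remove constant features.
  • subsetTask(): Remove observations and/or features from a Task().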

Moreover, there are tutorial pages devoted to feature selection and to imputation of missing values.

Fusing learners with preprocessing

mlr’s wrapper functionality lets you combine learners with preprocessing steps. This means that the preprocessing “belongs” to the learner and is carried out every time the learner is trained or predictions are made.

This is, on the one hand, very practical. You don’t need to change any data or learning Task()s and it’s quite easy to combine different learners with different preprocessing steps.

On the other hand, this helps to avoid a common mistake in evaluating the performance of a learner with preprocessing: Preprocessing is often seen as completely independent of the learning algorithm applied afterwards. When the performance of a learner is estimated, e.g., by cross-validation, all preprocessing is then done beforehand on the full data set, and only training and prediction are done on the train/test sets. Depending on what exactly the preprocessing does, this can lead to overoptimistic results. For example, if imputation by the mean is done on the whole data set before evaluating the learner, information from the test data is used during training, which can make the estimated performance look better than it really is.

To clarify things, one should distinguish between data-dependent and data-independent preprocessing steps: Data-dependent steps in some way learn from the data and give different results when applied to different data sets. Data-independent steps always lead to the same results. Clearly, correcting errors in the data or removing data columns like Ids that should not be used for learning is data-independent. Imputation of missing values by the mean, as mentioned above, is data-dependent. Imputation by a fixed constant, however, is not.
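A small base R illustration of this distinction, using a made-up vector (the numbers are purely illustrative): the fill value used by mean imputation depends on which observations it is computed from, whereas a constant fill value does not.

```r
# Toy feature with missing values (illustrative data, not from the tutorial)
x <- c(2, 4, NA, 8, 10, NA, 30)
train <- x[1:4]                           # pretend the first four values form a training set

# Data-dependent: the imputed value changes with the data it is computed on
full_mean  <- mean(x, na.rm = TRUE)       # uses all observations
train_mean <- mean(train, na.rm = TRUE)   # uses the training observations only

# Data-independent: a fixed constant is the same regardless of the data
const_fill <- 0

c(full = full_mean, train = train_mean, constant = const_fill)
```

The two means differ, so mean imputation computed on the full data would carry information from outside the training set into training.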

To get an honest estimate of learner performance combined with preprocessing, all data-dependent preprocessing steps must be included in the resampling. This is done automatically when fusing a learner with preprocessing.
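What including preprocessing in the resampling means can be sketched by hand in base R. The following is a simplified illustration with simulated data and a linear model, not mlr's internal implementation: in each fold the imputation value is learned from the training rows only and then reused for the test rows.

```r
set.seed(42)

# Simulated regression data with some missing predictor values (illustrative only)
n <- 60
d <- data.frame(x = rnorm(n))
d$y <- 2 * d$x + rnorm(n, sd = 0.5)
d$x[sample(n, 10)] <- NA

# Random assignment of observations to 3 folds
folds <- sample(rep_len(1:3, n))
mse <- numeric(3)

for (i in 1:3) {
  train <- d[folds != i, ]
  test  <- d[folds == i, ]

  # Data-dependent step: learn the fill value from the training rows only
  fill <- mean(train$x, na.rm = TRUE)
  train$x[is.na(train$x)] <- fill
  test$x[is.na(test$x)]   <- fill   # test rows reuse the training mean

  fit <- lm(y ~ x, data = train)
  mse[i] <- mean((test$y - predict(fit, newdata = test))^2)
}

mean(mse)
```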

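To this end mlr provides two wrappers:

  • makePreprocWrapperCaret() is an interface to the preprocessing options offered by caret’s caret::preProcess() function.
  • makePreprocWrapper() lets you write your own custom preprocessing methods by defining the actions to be taken before training and before prediction.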

As mentioned above, the specified preprocessing steps then “belong” to the wrapped Learner (makeLearner()). In contrast to the preprocessing options listed above, like normalizeFeatures(),

  • the Task() itself remains unchanged,
  • the preprocessing is not done globally, i.e., for the whole data set, but for every pair of training/test data sets in, e.g., resampling,
  • any parameters controlling the preprocessing, such as the percentage of outliers to be removed, can be tuned together with the base learner parameters.

We start with some examples for makePreprocWrapperCaret().

Preprocessing with makePreprocWrapperCaret

makePreprocWrapperCaret() is an interface to caret’s caret::preProcess() function, which provides many different options such as imputation of missing values, data transformations like scaling the features to a certain range or Box-Cox, and dimensionality reduction via Independent or Principal Component Analysis. For all possible options see the help page of function caret::preProcess().

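Note that the usage of makePreprocWrapperCaret() differs slightly from that of caret::preProcess(): makePreprocWrapperCaret() takes (almost) the same formal arguments as caret::preProcess(), but their names are prefixed by ppc. The one exception is the method argument, which makePreprocWrapperCaret() does not have. Instead, every preprocessing option that would be passed to method is given as an individual logical parameter.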

For example the following call to caret::preProcess()

preProcess(x, method = c("knnImpute", "pca"), pcaComp = 10)

with x being a matrix or data.frame would thus translate into

makePreprocWrapperCaret(learner, ppc.knnImpute = TRUE, ppc.pca = TRUE, ppc.pcaComp = 10)

where learner is an mlr Learner (makeLearner()) or the name of a learner class like "classif.lda".
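The naming rule behind this translation can be written down as a small base R helper. Note that caret_to_ppc() is a hypothetical function introduced here only for illustration; it is not part of mlr or caret. It turns each entry of method into a logical ppc. flag and prefixes all remaining arguments with ppc.

```r
# Hypothetical helper (not part of mlr or caret): build ppc.* arguments
# from the arguments one would pass to caret::preProcess().
caret_to_ppc <- function(method = character(0), ...) {
  # Each method entry becomes a logical flag, e.g. "pca" -> ppc.pca = TRUE
  flags <- setNames(as.list(rep(TRUE, length(method))), paste0("ppc.", method))
  # All other arguments keep their value and only get the ppc. prefix
  other <- list(...)
  if (length(other) > 0) names(other) <- paste0("ppc.", names(other))
  c(flags, other)
}

args <- caret_to_ppc(method = c("knnImpute", "pca"), pcaComp = 10)
str(args)

# With mlr loaded, the list could then be passed on, e.g.
# do.call(makePreprocWrapperCaret, c(list("classif.lda"), args))
```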

If you enable multiple preprocessing options (like knn imputation and principal component analysis above) these are executed in a certain order detailed on the help page of function caret::preProcess().

In the following we show an example where principal components analysis (PCA) is used for dimensionality reduction. This should never be applied blindly, but can be beneficial with learners that struggle with high-dimensional data or that can profit from rotating the data.

We consider the sonar.task(), which poses a binary classification problem with 208 observations and 60 features.

sonar.task
## Supervised task: Sonar-example
## Type: classif
## Target: Class
## Observations: 208
## Features:
##    numerics     factors     ordered functionals 
##          60           0           0           0 
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
## Classes: 2
##   M   R 
## 111  97 
## Positive class: M

Below we fuse quadratic discriminant analysis (MASS::qda()) from package MASS with a principal components preprocessing step. The threshold is set to 0.9, i.e., the principal components necessary to explain a cumulative percentage of 90% of the total variance are kept. The data are automatically standardized prior to PCA.

lrn = makePreprocWrapperCaret("classif.qda", ppc.pca = TRUE, ppc.thresh = 0.9)
lrn
## Learner classif.qda.preproc from package MASS
## Type: classif
## Name: ; Short name: 
## Class: PreprocWrapperCaret
## Properties: twoclass,multiclass,numerics,factors,prob
## Predict-Type: response
## Hyperparameters: ppc.BoxCox=FALSE,ppc.YeoJohnson=FALSE,ppc.expoTrans=FALSE,ppc.center=TRUE,ppc.scale=TRUE,ppc.range=FALSE,ppc.knnImpute=FALSE,ppc.bagImpute=FALSE,ppc.medianImpute=FALSE,ppc.pca=TRUE,ppc.ica=FALSE,ppc.spatialSign=FALSE,ppc.corr=FALSE,ppc.zv=FALSE,ppc.nzv=FALSE,ppc.thresh=0.9,ppc.na.remove=TRUE,ppc.k=5,ppc.fudge=0.2,ppc.numUnique=3,ppc.cutoff=0.9,ppc.freqCut=19,ppc.uniqueCut=10

The wrapped learner is trained on the sonar.task(). By inspecting the underlying MASS::qda() model, we see that the first 22 principal components have been used for training.

mod = train(lrn, sonar.task)
mod
## Model for learner.id=classif.qda.preproc; learner.class=PreprocWrapperCaret
## Trained on: task.id = Sonar-example; obs = 208; features = 60
## Hyperparameters: ppc.BoxCox=FALSE,ppc.YeoJohnson=FALSE,ppc.expoTrans=FALSE,ppc.center=TRUE,ppc.scale=TRUE,ppc.range=FALSE,ppc.knnImpute=FALSE,ppc.bagImpute=FALSE,ppc.medianImpute=FALSE,ppc.pca=TRUE,ppc.ica=FALSE,ppc.spatialSign=FALSE,ppc.corr=FALSE,ppc.zv=FALSE,ppc.nzv=FALSE,ppc.thresh=0.9,ppc.na.remove=TRUE,ppc.k=5,ppc.fudge=0.2,ppc.numUnique=3,ppc.cutoff=0.9,ppc.freqCut=19,ppc.uniqueCut=10

getLearnerModel(mod)
## Model for learner.id=classif.qda; learner.class=classif.qda
## Trained on: task.id = Sonar-example; obs = 208; features = 22
## Hyperparameters:

getLearnerModel(mod, more.unwrap = TRUE)
## Call:
## qda(f, data = getTaskData(.task, .subset, recode.target = "drop.levels"))
## 
## Prior probabilities of groups:
##         M         R 
## 0.5336538 0.4663462 
## 
## Group means:
##          PC1        PC2        PC3         PC4         PC5         PC6
## M  0.5976122 -0.8058235  0.9773518  0.03794232 -0.04568166 -0.06721702
## R -0.6838655  0.9221279 -1.1184128 -0.04341853  0.05227489  0.07691845
##          PC7         PC8        PC9       PC10        PC11          PC12
## M  0.2278162 -0.01034406 -0.2530606 -0.1793157 -0.04084466 -0.0004789888
## R -0.2606969  0.01183702  0.2895848  0.2051963  0.04673977  0.0005481212
##          PC13       PC14        PC15        PC16        PC17        PC18
## M -0.06138758 -0.1057137  0.02808048  0.05215865 -0.07453265  0.03869042
## R  0.07024765  0.1209713 -0.03213333 -0.05968671  0.08528994 -0.04427460
##          PC19         PC20        PC21         PC22
## M -0.01192247  0.006098658  0.01263492 -0.001224809
## R  0.01364323 -0.006978877 -0.01445851  0.001401586
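How a variance threshold translates into a number of components can be reproduced outside mlr with stats::prcomp(). The sketch below uses the built-in iris data rather than the Sonar data and assumes that the smallest number of components whose cumulative explained variance reaches the threshold is kept; see the help page of caret::preProcess() for the exact rule caret applies.

```r
# Standardize and rotate the four numeric iris features
x <- iris[, 1:4]
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the leading components
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Smallest number of components whose cumulative variance reaches the threshold
thresh <- 0.9
n_comp <- which(cum_var >= thresh)[1]
n_comp
```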

Below, the performances of MASS::qda() with and without PCA preprocessing are compared in a benchmark experiment. Note that we use stratified resampling to prevent errors in MASS::qda() caused by too few observations from either class.

rin = makeResampleInstance("CV", iters = 3, stratify = TRUE, task = sonar.task)
res = benchmark(list("classif.qda", lrn), sonar.task, rin, show.info = FALSE)
res
##         task.id          learner.id mmce.test.mean
## 1 Sonar-example         classif.qda      0.2932367
## 2 Sonar-example classif.qda.preproc      0.1779848

PCA preprocessing in this case turns out to be really beneficial for the performance of Quadratic Discriminant Analysis.
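The idea behind the stratified resampling used in the benchmark above can be sketched in base R. This is an illustration of the principle, not mlr's implementation: fold ids are drawn separately within each class, so every fold receives (almost) the same share of each class. The class sizes are taken from the Sonar task.

```r
set.seed(1)

# Class labels with the same class sizes as the Sonar task (111 M, 97 R)
y <- factor(rep(c("M", "R"), times = c(111, 97)))
k <- 3

# Assign fold ids separately within each class
fold <- integer(length(y))
for (cl in levels(y)) {
  idx <- which(y == cl)
  fold[idx] <- sample(rep_len(seq_len(k), length(idx)))
}

# Class counts per fold
tab <- table(fold, y)
tab
```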

Joint tuning of preprocessing options and learner parameters

Let’s see if we can optimize this a bit. The threshold value of 0.9 above was chosen arbitrarily and led to 22 out of 60 principal components. But maybe a lower or higher number of principal components should be used. Moreover, qda (MASS::qda()) has several options that control how the class covariance matrices or class probabilities are estimated.

Those preprocessing and learner parameters can be tuned jointly. Before doing this let’s first get an overview of all the parameters of the wrapped learner using function getParamSet().

getParamSet(lrn)
##                      Type len     Def                      Constr Req Tunable
## ppc.BoxCox        logical   -   FALSE                           -   -    TRUE
## ppc.YeoJohnson    logical   -   FALSE                           -   -    TRUE
## ppc.expoTrans     logical   -   FALSE                           -   -    TRUE
## ppc.center        logical   -    TRUE                           -   -    TRUE
## ppc.scale         logical   -    TRUE                           -   -    TRUE
## ppc.range         logical   -   FALSE                           -   -    TRUE
## ppc.knnImpute     logical   -   FALSE                           -   -    TRUE
## ppc.bagImpute     logical   -   FALSE                           -   -    TRUE
## ppc.medianImpute  logical   -   FALSE                           -   -    TRUE
## ppc.pca           logical   -   FALSE                           -   -    TRUE
## ppc.ica           logical   -   FALSE                           -   -    TRUE
## ppc.spatialSign   logical   -   FALSE                           -   -    TRUE
## ppc.corr          logical   -   FALSE                           -   -    TRUE
## ppc.zv            logical   -   FALSE                           -   -    TRUE
## ppc.nzv           logical   -   FALSE                           -   -    TRUE
## ppc.thresh        numeric   -    0.95                    0 to Inf   -    TRUE
## ppc.pcaComp       integer   -       -                    1 to Inf   -    TRUE
## ppc.na.remove     logical   -    TRUE                           -   -    TRUE
## ppc.k             integer   -       5                    1 to Inf   -    TRUE
## ppc.fudge         numeric   -     0.2                    0 to Inf   -    TRUE
## ppc.numUnique     integer   -       3                    1 to Inf   -    TRUE
## ppc.n.comp        integer   -       -                    1 to Inf   -    TRUE
## ppc.cutoff        numeric   -     0.9                      0 to 1   -    TRUE
## ppc.freqCut       numeric   -      19                    1 to Inf   -    TRUE
## ppc.uniqueCut     numeric   -      10                    0 to Inf   -    TRUE
## method           discrete   -  moment            moment,mle,mve,t   -    TRUE
## nu                numeric   -       5                    2 to Inf   Y    TRUE
## predict.method   discrete   - plug-in plug-in,predictive,debiased   -    TRUE
##                  Trafo
## ppc.BoxCox           -
## ppc.YeoJohnson       -
## ppc.expoTrans        -
## ppc.center           -
## ppc.scale            -
## ppc.range            -
## ppc.knnImpute        -
## ppc.bagImpute        -
## ppc.medianImpute     -
## ppc.pca              -
## ppc.ica              -
## ppc.spatialSign      -
## ppc.corr             -
## ppc.zv               -
## ppc.nzv              -
## ppc.thresh           -
## ppc.pcaComp          -
## ppc.na.remove        -
## ppc.k                -
## ppc.fudge            -
## ppc.numUnique        -
## ppc.n.comp           -
## ppc.cutoff           -
## ppc.freqCut          -
## ppc.uniqueCut        -
## method               -
## nu                   -
## predict.method       -

The parameters prefixed by ppc. belong to preprocessing. method, nu and predict.method are MASS::qda() parameters.

Instead of tuning the PCA threshold (ppc.thresh) we tune the number of principal components (ppc.pcaComp) directly. Moreover, for MASS::qda() we try two different ways to estimate the posterior probabilities (parameter predict.method): the usual plug-in estimates and unbiased estimates.

We perform a grid search and set the resolution to 10. This is for demonstration. You might want to use a finer resolution.

ps = makeParamSet(
  makeIntegerParam("ppc.pcaComp", lower = 1, upper = getTaskNFeats(sonar.task)),
  makeDiscreteParam("predict.method", values = c("plug-in", "debiased"))
)
ctrl = makeTuneControlGrid(resolution = 10)
res = tuneParams(lrn, sonar.task, rin, par.set = ps, control = ctrl, show.info = FALSE)
res
## Tune result:
## Op. pars: ppc.pcaComp=21; predict.method=plug-in
## mmce.test.mean=0.1779848

as.data.frame(res$opt.path)[1:3]
##    ppc.pcaComp predict.method mmce.test.mean
## 1            1        plug-in      0.5052450
## 2            8        plug-in      0.2449275
## 3           14        plug-in      0.2021394
## 4           21        plug-in      0.1779848
## 5           27        plug-in      0.2212560
## 6           34        plug-in      0.2452726
## 7           40        plug-in      0.2500345
## 8           47        plug-in      0.2452726
## 9           53        plug-in      0.2549344
## 10          60        plug-in      0.2932367
## 11           1       debiased      0.5000000
## 12           8       debiased      0.2833678
## 13          14       debiased      0.2453416
## 14          21       debiased      0.2837129
## 15          27       debiased      0.2547274
## 16          34       debiased      0.2886128
## 17          40       debiased      0.2741891
## 18          47       debiased      0.3075914
## 19          53       debiased      0.2642512
## 20          60       debiased      0.2830918

There seems to be a preference for a lower number of principal components (<27) for both "plug-in" and "debiased" with "plug-in" achieving slightly lower error rates.
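The ppc.pcaComp values in the optimization path are consistent with 10 evenly spaced points between 1 and 60 that are rounded to integers. The base R sketch below rebuilds such a grid; that mlr constructs its grid exactly this way is an assumption made for illustration.

```r
# Assumed construction: 10 evenly spaced values between 1 and 60, rounded to integers
pca_comp <- unique(round(seq(1, 60, length.out = 10)))

# Full grid: every ppc.pcaComp value combined with every predict.method value
grid <- expand.grid(
  ppc.pcaComp = pca_comp,
  predict.method = c("plug-in", "debiased"),
  stringsAsFactors = FALSE
)

pca_comp
nrow(grid)
```

With two values for predict.method, the grid has 10 x 2 = 20 points, which matches the 20 rows of the optimization path.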

Writing a custom preprocessing wrapper

If the options offered by makePreprocWrapperCaret() are not enough, you can write your own preprocessing wrapper using function makePreprocWrapper().

As described in the tutorial section about wrapped learners, wrappers are implemented using a train and a predict method. In the case of preprocessing wrappers, these methods specify how to transform the data before training and before prediction and are completely user-defined.

Below we show how to create a preprocessing wrapper that centers and scales the data before training/predicting. Some learning methods, such as k nearest neighbors, support vector machines or neural networks, usually require scaled features. Many, but not all, have a built-in scaling option where the training data set is scaled before model fitting and the test data set is scaled accordingly before making predictions, that is, by using the scaling parameters from the training stage. In the following we show how to add a scaling option to a Learner (makeLearner()) by coupling it with function base::scale().

Note that we chose this simple example for demonstration. Centering/scaling the data is also possible with makePreprocWrapperCaret().

Specifying the train function

The train function has to be a function with the following arguments:

  • data is a data.frame with columns for all features and the target variable.
  • target is a string and denotes the name of the target variable in data.
  • args is a list of further arguments and parameters that influence the preprocessing.

It must return a list with elements $data and $control, where $data is the preprocessed data set and $control stores all information required to preprocess the data before prediction.

The train function for the scaling example is given below. It calls base::scale() on the numerical features and returns the scaled training data and the corresponding scaling parameters.

args contains the center and scale arguments of function base::scale() and slot $control stores the scaling parameters to be used in the prediction stage.

Regarding the latter, note that the center and scale arguments of base::scale() can each be either a logical value or a numeric vector whose length equals the number of numeric columns in data. If TRUE was passed in args, we store the column means and the standard deviations/root mean squares in the $center and $scale slots of the returned $control object.

trainfun = function(data, target, args = list(center = TRUE, scale = TRUE)) {
  # Identify numerical features
  cns = colnames(data)
  nums = setdiff(cns[sapply(data, is.numeric)], target)
  # Extract numerical features from the data set and call scale
  x = as.matrix(data[, nums, drop = FALSE])
  x = scale(x, center = args$center, scale = args$scale)
  # Store the scaling parameters in control
  # These are needed to preprocess the data before prediction
  control = args
  if (is.logical(control$center) && control$center)
    control$center = attr(x, "scaled:center")
  if (is.logical(control$scale) && control$scale)
    control$scale = attr(x, "scaled:scale")
  # Recombine the data
  data = data[, setdiff(cns, nums), drop = FALSE]
  data = cbind(data, as.data.frame(x))
  return(list(data = data, control = control))
}
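The attributes that trainfun reads from the result of base::scale() can be inspected in isolation. The following base R sketch uses a tiny made-up matrix and shows the stored column means and standard deviations, the root mean squares used when centering is switched off, and how stored numeric parameters are applied to new data.

```r
# Tiny illustrative matrix with two numeric columns
x <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)

# center = TRUE, scale = TRUE: attributes hold column means and standard deviations
s1 <- scale(x, center = TRUE, scale = TRUE)
attr(s1, "scaled:center")
attr(s1, "scaled:scale")

# center = FALSE, scale = TRUE: columns are divided by their root mean square
s2 <- scale(x, center = FALSE, scale = TRUE)
rms <- sqrt(colSums(x^2) / (nrow(x) - 1))
attr(s2, "scaled:scale")

# Numeric vectors (e.g. stored training parameters) are applied as given
x_new <- matrix(c(4, 40), ncol = 2)
s3 <- scale(x_new,
  center = attr(s1, "scaled:center"),
  scale = attr(s1, "scaled:scale"))
s3
```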

Specifying the predict function

The predict function has the following arguments:

  • data is a data.frame containing only feature values (as for prediction the target values naturally are not known).
  • target is a string indicating the name of the target variable.
  • args are the args that were passed to the train function.
  • control is the object returned by the train function.

It returns the preprocessed data.

In our scaling example the predict function scales the numerical features using the parameters from the training stage stored in control.

predictfun = function(data, target, args, control) {
  # Identify numerical features
  cns = colnames(data)
  nums = cns[sapply(data, is.numeric)]
  # Extract numerical features from the data set and call scale
  x = as.matrix(data[, nums, drop = FALSE])
  x = scale(x, center = control$center, scale = control$scale)
  # Recombine the data
  data = data[, setdiff(cns, nums), drop = FALSE]
  data = cbind(data, as.data.frame(x))
  return(data)
}
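The contract between the two functions can be tried out without mlr by calling them by hand. The sketch below uses a deliberately compact pair that only centers the numeric features; center_train() and center_predict() are hypothetical names chosen for this illustration, and the data are made up. The control object returned at training time is the only link to the prediction stage.

```r
# Compact train function: centers numeric features, returns $data and $control
center_train <- function(data, target, args) {
  nums <- setdiff(names(data)[sapply(data, is.numeric)], target)
  x <- scale(as.matrix(data[, nums, drop = FALSE]), center = TRUE, scale = FALSE)
  data[nums] <- as.data.frame(x)
  list(data = data, control = list(nums = nums, center = attr(x, "scaled:center")))
}

# Compact predict function: reuses the training means stored in control
center_predict <- function(data, target, args, control) {
  x <- scale(as.matrix(data[, control$nums, drop = FALSE]),
    center = control$center, scale = FALSE)
  data[control$nums] <- as.data.frame(x)
  data
}

# Made-up training data (with target y) and new data (features only)
train_df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30), y = c(0, 1, 0))
new_df <- data.frame(a = c(4, 5), b = c(40, 50))

out <- center_train(train_df, target = "y", args = list())
new_pp <- center_predict(new_df, target = "y", args = list(), control = out$control)

out$data
new_pp
```

The new observations are shifted by the training means (2 and 20), not by their own means, which is the behavior a preprocessing wrapper needs at prediction time.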

Creating the preprocessing wrapper

Below we create a preprocessing wrapper with a regression neural network (nnet::nnet()) (which itself does not have a scaling option) as base learner.

The train and predict functions defined above are passed to makePreprocWrapper() via the train and predict arguments. par.vals is a list of parameter values that is relayed to the args argument of the train function.

lrn = makeLearner("regr.nnet", trace = FALSE, decay = 1e-02)
lrn = makePreprocWrapper(lrn, train = trainfun, predict = predictfun,
  par.vals = list(center = TRUE, scale = TRUE))
lrn
## Learner regr.nnet.preproc from package nnet
## Type: regr
## Name: ; Short name: 
## Class: PreprocWrapper
## Properties: numerics,factors,weights
## Predict-Type: response
## Hyperparameters: size=3,trace=FALSE,decay=0.01

Let’s compare the cross-validated mean squared error (mse) on the Boston Housing data set (mlbench::BostonHousing()) with and without scaling.

rdesc = makeResampleDesc("CV", iters = 3)

r = resample(lrn, bh.task, resampling = rdesc, show.info = FALSE)
r

## Resample Result
## Task: BostonHousing-example
## Learner: regr.nnet.preproc
## Aggr perf: mse.test.mean=26.6997095
## Runtime: 0.083282

lrn = makeLearner("regr.nnet", trace = FALSE, decay = 1e-02)
r = resample(lrn, bh.task, resampling = rdesc, show.info = FALSE)
r

## Resample Result
## Task: BostonHousing-example
## Learner: regr.nnet
## Aggr perf: mse.test.mean=56.8645496
## Runtime: 0.0463462

Joint tuning of preprocessing and learner parameters

Often it’s not clear which preprocessing options work best with a certain learning algorithm. As already shown for the number of principal components in makePreprocWrapperCaret(), we can easily tune them together with other hyperparameters of the learner.

In our scaling example we can check whether nnet::nnet() works best with both centering and scaling the data, or whether it’s better to omit one of the two operations or to do no preprocessing at all. In order to tune center and scale we have to add appropriate LearnerParams (ParamHelpers::LearnerParam()) to the parameter set (ParamHelpers::ParamSet()) of the wrapped learner.

As mentioned above base::scale() allows for numeric and logical center and scale arguments. As we want to use the latter option we declare center and scale as logical learner parameters.

lrn = makeLearner("regr.nnet", trace = FALSE)
lrn = makePreprocWrapper(lrn, train = trainfun, predict = predictfun,
  par.set = makeParamSet(
    makeLogicalLearnerParam("center"),
    makeLogicalLearnerParam("scale")
  ),
  par.vals = list(center = TRUE, scale = TRUE))

lrn
## Learner regr.nnet.preproc from package nnet
## Type: regr
## Name: ; Short name: 
## Class: PreprocWrapper
## Properties: numerics,factors,weights
## Predict-Type: response
## Hyperparameters: size=3,trace=FALSE,center=TRUE,scale=TRUE

getParamSet(lrn)
##            Type len    Def      Constr Req Tunable Trafo
## center  logical   -      -           -   -    TRUE     -
## scale   logical   -      -           -   -    TRUE     -
## size    integer   -      3    0 to Inf   -    TRUE     -
## maxit   integer   -    100    1 to Inf   -    TRUE     -
## skip    logical   -  FALSE           -   -    TRUE     -
## rang    numeric   -    0.7 -Inf to Inf   -    TRUE     -
## decay   numeric   -      0    0 to Inf   -    TRUE     -
## Hess    logical   -  FALSE           -   -    TRUE     -
## trace   logical   -   TRUE           -   -   FALSE     -
## MaxNWts integer   -   1000    1 to Inf   -   FALSE     -
## abstol  numeric   - 0.0001 -Inf to Inf   -    TRUE     -
## reltol  numeric   -  1e-08 -Inf to Inf   -    TRUE     -

Now we do a simple grid search for the decay parameter of nnet::nnet() and the center and scale parameters.

rdesc = makeResampleDesc("Holdout")
ps = makeParamSet(
  makeDiscreteParam("decay", c(0, 0.05, 0.1)),
  makeLogicalParam("center"),
  makeLogicalParam("scale")
)
ctrl = makeTuneControlGrid()
res = tuneParams(lrn, bh.task, rdesc, par.set = ps, control = ctrl, show.info = FALSE)

res
## Tune result:
## Op. pars: decay=0.1; center=FALSE; scale=TRUE
## mse.test.mean=14.0485584

df = as.data.frame(res$opt.path)
df[, -ncol(df)]
##    decay center scale mse.test.mean dob eol error.message
## 1      0   TRUE  TRUE      18.90371   1  NA          <NA>
## 2   0.05   TRUE  TRUE      22.41828   2  NA          <NA>
## 3    0.1   TRUE  TRUE      20.75765   3  NA          <NA>
## 4      0  FALSE  TRUE      17.62376   4  NA          <NA>
## 5   0.05  FALSE  TRUE      17.12806   5  NA          <NA>
## 6    0.1  FALSE  TRUE      14.04856   6  NA          <NA>
## 7      0   TRUE FALSE      78.98051   7  NA          <NA>
## 8   0.05   TRUE FALSE      44.28004   8  NA          <NA>
## 9    0.1   TRUE FALSE      56.37346   9  NA          <NA>
## 10     0  FALSE FALSE      90.20231  10  NA          <NA>
## 11  0.05  FALSE FALSE      90.20117  11  NA          <NA>
## 12   0.1  FALSE FALSE      30.18762  12  NA          <NA>

Preprocessing wrapper functions

If you have written a preprocessing wrapper that you might want to use from time to time, it’s a good idea to encapsulate it in its own function as shown below. If you think your preprocessing method is something others might want to use as well and should be integrated into mlr, just contact us.

makePreprocWrapperScale = function(learner, center = TRUE, scale = TRUE) {
  trainfun = function(data, target, args = list(center = center, scale = scale)) {
    cns = colnames(data)
    nums = setdiff(cns[sapply(data, is.numeric)], target)
    x = as.matrix(data[, nums, drop = FALSE])
    x = scale(x, center = args$center, scale = args$scale)
    control = args
    if (is.logical(control$center) && control$center)
      control$center = attr(x, "scaled:center")
    if (is.logical(control$scale) && control$scale)
      control$scale = attr(x, "scaled:scale")
    data = data[, setdiff(cns, nums), drop = FALSE]
    data = cbind(data, as.data.frame(x))
    return(list(data = data, control = control))
  }
  predictfun = function(data, target, args, control) {
    cns = colnames(data)
    nums = cns[sapply(data, is.numeric)]
    x = as.matrix(data[, nums, drop = FALSE])
    x = scale(x, center = control$center, scale = control$scale)
    data = data[, setdiff(cns, nums), drop = FALSE]
    data = cbind(data, as.data.frame(x))
    return(data)
  }
  makePreprocWrapper(
    learner,
    train = trainfun,
    predict = predictfun,
    par.set = makeParamSet(
      makeLogicalLearnerParam("center"),
      makeLogicalLearnerParam("scale")
    ),
    par.vals = list(center = center, scale = scale)
  )
}

lrn = makePreprocWrapperScale("classif.lda")
train(lrn, iris.task)
## Model for learner.id=classif.lda.preproc; learner.class=PreprocWrapper
## Trained on: task.id = iris-example; obs = 150; features = 4
## Hyperparameters: center=TRUE,scale=TRUE