Data preprocessing refers to any transformation of the data done before applying a learning algorithm. This comprises, for example, finding and resolving inconsistencies, imputing missing values, identifying, removing or replacing outliers, discretizing numerical data, generating numerical dummy variables for categorical data, transformations such as standardization of predictors or Box-Cox, dimensionality reduction, and feature extraction and/or selection.
mlr offers several options for data preprocessing. Some of the following simple methods to change a Task() (or data.frame) were already mentioned on the page about learning tasks:
- capLargeValues(): Convert large/infinite numeric values.
- createDummyFeatures(): Generate dummy variables for factor features.
- dropFeatures(): Remove selected features.
- joinClassLevels(): Only for classification: Merge existing classes to new, larger classes.
- mergeSmallFactorLevels(): Merge infrequent levels of factor features.
- normalizeFeatures(): Normalize features by different methods, e.g., standardization or scaling to a certain range.
- removeConstantFeatures(): Remove constant features.
- subsetTask(): Remove observations and/or features from a Task().
Moreover, there are tutorial pages devoted to
mlr’s wrapper functionality lets you combine learners with preprocessing steps. This means that the preprocessing “belongs” to the learner and is done any time the learner is trained or predictions are made.
This is, on the one hand, very practical. You don’t need to change any data or learning Task()s and it’s quite easy to combine different learners with different preprocessing steps.
On the other hand, this helps to avoid a common mistake in evaluating the performance of a learner with preprocessing: preprocessing is often seen as completely independent of the learning algorithm applied later. When estimating the performance of a learner, e.g., by cross-validation, all preprocessing is done beforehand on the full data set, and only training/predicting the learner is done on the train/test sets. Depending on what exactly is done as preprocessing, this can lead to overoptimistic results. For example, if imputation by the mean is done on the whole data set before evaluating the learner performance, you are using information from the test data during training, which can cause overoptimistic performance results.
To clarify things, one should distinguish between data-dependent and data-independent preprocessing steps: data-dependent steps in some way learn from the data and give different results when applied to different data sets. Data-independent steps always lead to the same results. Clearly, correcting errors in the data or removing data columns like Ids that should not be used for learning is data-independent. Imputation of missing values by the mean, as mentioned above, is data-dependent. Imputation by a fixed constant, however, is not.
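The distinction can be made concrete with a small base R sketch. The toy vector, the train/test split and the helper imputeWith() are made up for illustration and are not part of mlr. The mean used for imputation is learned from the training rows only and then reused for the test rows; computing it on all rows would let the test values influence training.

```r
# Toy feature with missing values; rows 1-8 are training data, rows 9-12 test data
# (values and split are hypothetical).
x = c(1, NA, 3, 2, 4, 2, 3, 3, 9, NA, 11, 10)
train.idx = 1:8
test.idx = 9:12

# Hypothetical helper: replace missing values by a given constant.
imputeWith = function(v, value) {
  v[is.na(v)] = value
  v
}

# Data-dependent step: the imputation value is learned on the training rows only
# and then applied unchanged to the test rows.
train.mean = mean(x[train.idx], na.rm = TRUE)
x.train = imputeWith(x[train.idx], train.mean)
x.test = imputeWith(x[test.idx], train.mean)

# For comparison: the mean over all rows also uses the test values (leakage).
full.mean = mean(x, na.rm = TRUE)
print(c(train.mean = train.mean, full.mean = full.mean))
```

In this toy data the two means differ clearly, which is exactly the situation in which imputing before resampling would distort the performance estimate.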
To get an honest estimate of learner performance combined with preprocessing, all data-dependent preprocessing steps must be included in the resampling. This is automatically done when fusing a learner with preprocessing.
To this end, mlr provides two wrappers:
- makePreprocWrapperCaret() is an interface to all preprocessing options offered by caret’s caret::preProcess() function.
- makePreprocWrapper() lets you write your own custom preprocessing methods by defining the actions to be taken before training and before prediction.
As mentioned above, the specified preprocessing steps then “belong” to the wrapped Learner (makeLearner()). In contrast to the preprocessing options listed above, like normalizeFeatures(), the Task() itself remains unchanged.
We start with some examples for makePreprocWrapperCaret().
makePreprocWrapperCaret() is an interface to caret’s caret::preProcess() function that provides many different options like imputation of missing values, data transformations such as scaling the features to a certain range or Box-Cox, and dimensionality reduction via Independent or Principal Component Analysis. For all possible options see the help page of function caret::preProcess().
Note that the usage of makePreprocWrapperCaret() is slightly different from that of caret::preProcess().
- makePreprocWrapperCaret() takes (almost) the same formal arguments as caret::preProcess(), but their names are prefixed by "ppc.".
- makePreprocWrapperCaret() does not have a method argument. Instead, all preprocessing options that would be passed to caret::preProcess()’s method argument are given as individual logical parameters to makePreprocWrapperCaret().
For example, the following call to caret::preProcess()
preProcess(x, method = c("knnImpute", "pca"), pcaComp = 10)
with x being a matrix or data.frame would thus translate into
makePreprocWrapperCaret(learner, ppc.knnImpute = TRUE, ppc.pca = TRUE, ppc.pcaComp = 10)
where learner is an mlr Learner (makeLearner()) or the name of a learner class like "classif.lda".
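The naming rule can be sketched in base R. The helper methodToPpcArgs() below is hypothetical and not part of mlr or caret; it only illustrates how a method vector plus further caret::preProcess() arguments map to ppc.-prefixed names.

```r
# Hypothetical helper (not part of mlr or caret): turn a caret-style method
# vector and further preProcess() arguments into ppc.-prefixed arguments.
methodToPpcArgs = function(method, ...) {
  # Every entry of 'method' becomes a logical flag set to TRUE
  flags = setNames(as.list(rep(TRUE, length(method))), method)
  # Remaining arguments keep their values
  args = c(flags, list(...))
  # All names get the "ppc." prefix
  setNames(args, paste0("ppc.", names(args)))
}

ppc.args = methodToPpcArgs(c("knnImpute", "pca"), pcaComp = 10)
str(ppc.args)
```

Assuming mlr is loaded, such a list could presumably be passed on with do.call(makePreprocWrapperCaret, c(list(learner = "classif.lda"), ppc.args)); writing the arguments out by hand, as in the call above, is usually clearer.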
If you enable multiple preprocessing options (like knn imputation and principal component analysis above), these are executed in a certain order detailed on the help page of function caret::preProcess().
In the following we show an example where principal components analysis (PCA) is used for dimensionality reduction. This should never be applied blindly, but can be beneficial with learners that struggle with high dimensionality or those that can profit from rotating the data.
We consider the sonar.task(), which poses a binary classification problem with 208 observations and 60 features.
sonar.task
## Supervised task: Sonar-example
## Type: classif
## Target: Class
## Observations: 208
## Features:
## numerics factors ordered functionals
## 60 0 0 0
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
## Classes: 2
## M R
## 111 97
## Positive class: M
Below we fuse quadratic discriminant analysis (MASS::qda()) from package MASS with a principal components preprocessing step. The threshold is set to 0.9, i.e., the principal components necessary to explain a cumulative percentage of 90% of the total variance are kept. The data are automatically standardized prior to PCA.
lrn = makePreprocWrapperCaret("classif.qda", ppc.pca = TRUE, ppc.thresh = 0.9)
lrn
## Learner classif.qda.preproc from package MASS
## Type: classif
## Name: ; Short name:
## Class: PreprocWrapperCaret
## Properties: twoclass,multiclass,numerics,factors,prob
## Predict-Type: response
## Hyperparameters: ppc.BoxCox=FALSE,ppc.YeoJohnson=FALSE,ppc.expoTrans=FALSE,ppc.center=TRUE,ppc.scale=TRUE,ppc.range=FALSE,ppc.knnImpute=FALSE,ppc.bagImpute=FALSE,ppc.medianImpute=FALSE,ppc.pca=TRUE,ppc.ica=FALSE,ppc.spatialSign=FALSE,ppc.corr=FALSE,ppc.zv=FALSE,ppc.nzv=FALSE,ppc.thresh=0.9,ppc.na.remove=TRUE,ppc.k=5,ppc.fudge=0.2,ppc.numUnique=3,ppc.cutoff=0.9,ppc.freqCut=19,ppc.uniqueCut=10
The wrapped learner is trained on the sonar.task(). By inspecting the underlying MASS::qda() model, we see that the first 22 principal components have been used for training.
mod = train(lrn, sonar.task)
mod
## Model for learner.id=classif.qda.preproc; learner.class=PreprocWrapperCaret
## Trained on: task.id = Sonar-example; obs = 208; features = 60
## Hyperparameters: ppc.BoxCox=FALSE,ppc.YeoJohnson=FALSE,ppc.expoTrans=FALSE,ppc.center=TRUE,ppc.scale=TRUE,ppc.range=FALSE,ppc.knnImpute=FALSE,ppc.bagImpute=FALSE,ppc.medianImpute=FALSE,ppc.pca=TRUE,ppc.ica=FALSE,ppc.spatialSign=FALSE,ppc.corr=FALSE,ppc.zv=FALSE,ppc.nzv=FALSE,ppc.thresh=0.9,ppc.na.remove=TRUE,ppc.k=5,ppc.fudge=0.2,ppc.numUnique=3,ppc.cutoff=0.9,ppc.freqCut=19,ppc.uniqueCut=10
getLearnerModel(mod)
## Model for learner.id=classif.qda; learner.class=classif.qda
## Trained on: task.id = Sonar-example; obs = 208; features = 22
## Hyperparameters:
getLearnerModel(mod, more.unwrap = TRUE)
## Call:
## qda(f, data = getTaskData(.task, .subset, recode.target = "drop.levels"))
##
## Prior probabilities of groups:
## M R
## 0.5336538 0.4663462
##
## Group means:
## PC1 PC2 PC3 PC4 PC5 PC6
## M 0.5976122 -0.8058235 0.9773518 0.03794232 -0.04568166 -0.06721702
## R -0.6838655 0.9221279 -1.1184128 -0.04341853 0.05227489 0.07691845
## PC7 PC8 PC9 PC10 PC11 PC12
## M 0.2278162 -0.01034406 -0.2530606 -0.1793157 -0.04084466 -0.0004789888
## R -0.2606969 0.01183702 0.2895848 0.2051963 0.04673977 0.0005481212
## PC13 PC14 PC15 PC16 PC17 PC18
## M -0.06138758 -0.1057137 0.02808048 0.05215865 -0.07453265 0.03869042
## R 0.07024765 0.1209713 -0.03213333 -0.05968671 0.08528994 -0.04427460
## PC19 PC20 PC21 PC22
## M -0.01192247 0.006098658 0.01263492 -0.001224809
## R 0.01364323 -0.006978877 -0.01445851 0.001401586
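How a variance threshold translates into a number of components can be sketched with base R's prcomp(). This is only an illustration on the built-in iris data (not the Sonar data), and whether caret selects the components by exactly this rule is an assumption here.

```r
# Numeric features of the built-in iris data (illustration only)
x = as.matrix(iris[, 1:4])

# Standardize the features and compute the principal components
pca = prcomp(x, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component and its cumulative sum
explained = pca$sdev^2 / sum(pca$sdev^2)
cum.explained = cumsum(explained)

# Keep the smallest number of components whose cumulative share reaches the threshold
thresh = 0.9
n.comp = which(cum.explained >= thresh)[1]
reduced = pca$x[, seq_len(n.comp), drop = FALSE]

print(round(cum.explained, 3))
print(n.comp)
```

The matrix reduced then plays the role of the transformed feature set that is handed to the learner.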
Below the performances of MASS::qda() with and without PCA preprocessing are compared in a benchmark experiment. Note that we use stratified resampling to prevent errors in MASS::qda() due to too few observations from either class.
rin = makeResampleInstance("CV", iters = 3, stratify = TRUE, task = sonar.task)
res = benchmark(list("classif.qda", lrn), sonar.task, rin, show.info = FALSE)
res
## task.id learner.id mmce.test.mean
## 1 Sonar-example classif.qda 0.2932367
## 2 Sonar-example classif.qda.preproc 0.1779848
In this case, PCA preprocessing turns out to be clearly beneficial for quadratic discriminant analysis: the mean misclassification error drops from about 0.29 to about 0.18.
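The effect of stratification in the resampling instance used above can be illustrated in base R. The sketch below is a hand-rolled fold assignment for class sizes like those of the Sonar task; it is an assumption for illustration and not mlr's implementation.

```r
set.seed(1)

# Class labels with the same class sizes as the Sonar task (111 M, 97 R)
y = factor(rep(c("M", "R"), times = c(111, 97)))
k = 3

# Assign fold numbers separately within each class so that every fold
# receives roughly the same share of both classes
fold = integer(length(y))
for (lev in levels(y)) {
  idx = which(y == lev)
  fold[idx] = sample(rep_len(seq_len(k), length(idx)))
}

tab = table(fold, y)
print(tab)
```

Every fold contains at least 32 observations of each class, so a model fitted on the remaining two folds always sees enough observations from both classes.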
Let’s see if we can optimize this a bit. The threshold value of 0.9 above was chosen arbitrarily and led to 22 out of 60 principal components. But maybe a lower or higher number of principal components should be used. Moreover, qda (MASS::qda()) has several options that control how the class covariance matrices or class probabilities are estimated.
Those preprocessing and learner parameters can be tuned jointly. Before doing this let’s first get an overview of all the parameters of the wrapped learner using function getParamSet().
getParamSet(lrn)
##                      Type len     Def                      Constr Req Tunable
## ppc.BoxCox        logical   -   FALSE                           -   -    TRUE
## ppc.YeoJohnson    logical   -   FALSE                           -   -    TRUE
## ppc.expoTrans     logical   -   FALSE                           -   -    TRUE
## ppc.center        logical   -    TRUE                           -   -    TRUE
## ppc.scale         logical   -    TRUE                           -   -    TRUE
## ppc.range         logical   -   FALSE                           -   -    TRUE
## ppc.knnImpute     logical   -   FALSE                           -   -    TRUE
## ppc.bagImpute     logical   -   FALSE                           -   -    TRUE
## ppc.medianImpute  logical   -   FALSE                           -   -    TRUE
## ppc.pca           logical   -   FALSE                           -   -    TRUE
## ppc.ica           logical   -   FALSE                           -   -    TRUE
## ppc.spatialSign   logical   -   FALSE                           -   -    TRUE
## ppc.corr          logical   -   FALSE                           -   -    TRUE
## ppc.zv            logical   -   FALSE                           -   -    TRUE
## ppc.nzv           logical   -   FALSE                           -   -    TRUE
## ppc.thresh        numeric   -    0.95                    0 to Inf   -    TRUE
## ppc.pcaComp       integer   -       -                    1 to Inf   -    TRUE
## ppc.na.remove     logical   -    TRUE                           -   -    TRUE
## ppc.k             integer   -       5                    1 to Inf   -    TRUE
## ppc.fudge         numeric   -     0.2                    0 to Inf   -    TRUE
## ppc.numUnique     integer   -       3                    1 to Inf   -    TRUE
## ppc.n.comp        integer   -       -                    1 to Inf   -    TRUE
## ppc.cutoff        numeric   -     0.9                      0 to 1   -    TRUE
## ppc.freqCut       numeric   -      19                    1 to Inf   -    TRUE
## ppc.uniqueCut     numeric   -      10                    0 to Inf   -    TRUE
## method           discrete   -  moment            moment,mle,mve,t   -    TRUE
## nu                numeric   -       5                    2 to Inf   Y    TRUE
## predict.method   discrete   - plug-in plug-in,predictive,debiased   -    TRUE
## Trafo
## ppc.BoxCox -
## ppc.YeoJohnson -
## ppc.expoTrans -
## ppc.center -
## ppc.scale -
## ppc.range -
## ppc.knnImpute -
## ppc.bagImpute -
## ppc.medianImpute -
## ppc.pca -
## ppc.ica -
## ppc.spatialSign -
## ppc.corr -
## ppc.zv -
## ppc.nzv -
## ppc.thresh -
## ppc.pcaComp -
## ppc.na.remove -
## ppc.k -
## ppc.fudge -
## ppc.numUnique -
## ppc.n.comp -
## ppc.cutoff -
## ppc.freqCut -
## ppc.uniqueCut -
## method -
## nu -
## predict.method -
The parameters prefixed by "ppc." belong to preprocessing. method, nu and predict.method are MASS::qda() parameters.
Instead of tuning the PCA threshold (ppc.thresh) we tune the number of principal components (ppc.pcaComp) directly. Moreover, for MASS::qda() we try two different ways to estimate the posterior probabilities (parameter predict.method): the usual plug-in estimates and unbiased estimates.
We perform a grid search and set the resolution to 10. This is for demonstration. You might want to use a finer resolution.
ps = makeParamSet(
  makeIntegerParam("ppc.pcaComp", lower = 1, upper = getTaskNFeats(sonar.task)),
  makeDiscreteParam("predict.method", values = c("plug-in", "debiased"))
)
ctrl = makeTuneControlGrid(resolution = 10)
res = tuneParams(lrn, sonar.task, rin, par.set = ps, control = ctrl, show.info = FALSE)
res
## Tune result:
## Op. pars: ppc.pcaComp=21; predict.method=plug-in
## mmce.test.mean=0.1779848
as.data.frame(res$opt.path)[1:3]
## ppc.pcaComp predict.method mmce.test.mean
## 1 1 plug-in 0.5052450
## 2 8 plug-in 0.2449275
## 3 14 plug-in 0.2021394
## 4 21 plug-in 0.1779848
## 5 27 plug-in 0.2212560
## 6 34 plug-in 0.2452726
## 7 40 plug-in 0.2500345
## 8 47 plug-in 0.2452726
## 9 53 plug-in 0.2549344
## 10 60 plug-in 0.2932367
## 11 1 debiased 0.5000000
## 12 8 debiased 0.2833678
## 13 14 debiased 0.2453416
## 14 21 debiased 0.2837129
## 15 27 debiased 0.2547274
## 16 34 debiased 0.2886128
## 17 40 debiased 0.2741891
## 18 47 debiased 0.3075914
## 19 53 debiased 0.2642512
## 20 60 debiased 0.2830918
There seems to be a preference for a lower number of principal components (<27) for both "plug-in" and "debiased", with "plug-in" achieving slightly lower error rates.
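The ppc.pcaComp values in the optimization path are consistent with ten equally spaced points between the lower and upper bound, rounded to integers. The following base R sketch reproduces them; that the grid is computed internally in exactly this way is an assumption.

```r
# Bounds of ppc.pcaComp and the grid resolution used above
lower = 1
upper = 60
resolution = 10

# Equally spaced points between the bounds, rounded to whole numbers
grid = round(seq(lower, upper, length.out = resolution))
print(grid)
```

A finer resolution adds more candidate values between the same bounds, at the cost of one additional resampling run per grid point and value of predict.method.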
If the options offered by makePreprocWrapperCaret() are not enough, you can write your own preprocessing wrapper using function makePreprocWrapper().
As described in the tutorial section about wrapped learners, wrappers are implemented using a train and a predict method. In case of preprocessing wrappers these methods specify how to transform the data before training and before prediction and are completely user-defined.
Below we show how to create a preprocessing wrapper that centers and scales the data before training/predicting. Some learning methods such as k nearest neighbors, support vector machines or neural networks usually require scaled features. Many, but not all, have a built-in scaling option where the training data set is scaled before model fitting and the test data set is scaled accordingly, that is by using the scaling parameters from the training stage, before making predictions. In the following we show how to add a scaling option to a Learner (makeLearner()) by coupling it with function base::scale().
Note that we chose this simple example for demonstration. Centering/scaling the data is also possible with makePreprocWrapperCaret().
The train function has to be a function with the following arguments:
- data is a data.frame with columns for all features and the target variable.
- target is a string and denotes the name of the target variable in data.
- args is a list of further arguments and parameters that influence the preprocessing.
It must return a list with elements $data and $control, where $data is the preprocessed data set and $control stores all information required to preprocess the data before prediction.
The train function for the scaling example is given below. It calls base::scale() on the numerical features and returns the scaled training data and the corresponding scaling parameters.
args contains the center and scale arguments of function base::scale(), and slot $control stores the scaling parameters to be used in the prediction stage.
Regarding the latter, note that the center and scale arguments of base::scale() can each be either a logical value or a numeric vector of length equal to the number of numeric columns in data. If a logical value was passed to args, we store the column means and standard deviations/root mean squares in the $center and $scale slots of the returned $control object.
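The role of the two attributes can be seen in a small base R sketch. The toy matrices are made up for illustration; the point is that the test data are scaled with the parameters learned from the training data.

```r
# Toy training and test data (hypothetical values)
train.x = matrix(c(1, 2, 3, 4, 10, 20, 30, 40), ncol = 2,
  dimnames = list(NULL, c("a", "b")))
test.x = matrix(c(2.5, 25), ncol = 2, dimnames = list(NULL, c("a", "b")))

# Logical arguments: scale() computes the parameters and attaches them as attributes
scaled.train = scale(train.x, center = TRUE, scale = TRUE)
ctr = attr(scaled.train, "scaled:center")
scl = attr(scaled.train, "scaled:scale")

# Numeric arguments: reuse the training parameters for the test data
scaled.test = scale(test.x, center = ctr, scale = scl)

print(ctr)
print(scaled.test)
```

The single test row equals the training column means, so both of its scaled values are zero.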
trainfun = function(data, target, args = list(center = TRUE, scale = TRUE)) {
  # Identify numerical features
  cns = colnames(data)
  nums = setdiff(cns[sapply(data, is.numeric)], target)
  # Extract numerical features from the data set and call scale
  x = as.matrix(data[, nums, drop = FALSE])
  x = scale(x, center = args$center, scale = args$scale)
  # Store the scaling parameters in control
  # These are needed to preprocess the data before prediction
  control = args
  if (is.logical(control$center) && control$center)
    control$center = attr(x, "scaled:center")
  if (is.logical(control$scale) && control$scale)
    control$scale = attr(x, "scaled:scale")
  # Recombine the data
  data = data[, setdiff(cns, nums), drop = FALSE]
  data = cbind(data, as.data.frame(x))
  return(list(data = data, control = control))
}
The predict function has the following arguments:
- data is a data.frame containing only feature values (as for prediction the target values naturally are not known).
- target is a string indicating the name of the target variable.
- args are the args that were passed to the train function.
- control is the object returned by the train function.
It returns the preprocessed data.
In our scaling example the predict function scales the numerical features using the parameters from the training stage stored in control.
predictfun = function(data, target, args, control) {
  # Identify numerical features
  cns = colnames(data)
  nums = cns[sapply(data, is.numeric)]
  # Extract numerical features from the data set and call scale
  x = as.matrix(data[, nums, drop = FALSE])
  x = scale(x, center = control$center, scale = control$scale)
  # Recombine the data
  data = data[, setdiff(cns, nums), drop = FALSE]
  data = cbind(data, as.data.frame(x))
  return(data)
}
Below we create a preprocessing wrapper with a regression neural network (nnet::nnet()), which itself does not have a scaling option, as base learner.
The train and predict functions defined above are passed to makePreprocWrapper() via the train and predict arguments. par.vals is a list of parameter values that is relayed to the args argument of the train function.
lrn = makeLearner("regr.nnet", trace = FALSE, decay = 1e-02)
lrn = makePreprocWrapper(lrn, train = trainfun, predict = predictfun,
  par.vals = list(center = TRUE, scale = TRUE))
lrn
## Learner regr.nnet.preproc from package nnet
## Type: regr
## Name: ; Short name:
## Class: PreprocWrapper
## Properties: numerics,factors,weights
## Predict-Type: response
## Hyperparameters: size=3,trace=FALSE,decay=0.01
Let’s compare the cross-validated mean squared error (mse) on the Boston Housing data set (mlbench::BostonHousing()) with and without scaling.
rdesc = makeResampleDesc("CV", iters = 3)
r = resample(lrn, bh.task, resampling = rdesc, show.info = FALSE)
r
## Resample Result
## Task: BostonHousing-example
## Learner: regr.nnet.preproc
## Aggr perf: mse.test.mean=26.6997095
## Runtime: 0.083282
lrn = makeLearner("regr.nnet", trace = FALSE, decay = 1e-02)
r = resample(lrn, bh.task, resampling = rdesc, show.info = FALSE)
r
## Resample Result
## Task: BostonHousing-example
## Learner: regr.nnet
## Aggr perf: mse.test.mean=56.8645496
## Runtime: 0.0463462
Often it’s not clear which preprocessing options work best with a certain learning algorithm. As already shown for the number of principal components in makePreprocWrapperCaret(), we can tune them easily together with other hyperparameters of the learner.
In our scaling example we can check whether nnet::nnet() works best with both centering and scaling the data, or whether it’s better to omit one of the two operations or do no preprocessing at all. In order to tune center and scale we have to add appropriate LearnerParams (ParamHelpers::LearnerParam()) to the parameter set (ParamHelpers::ParamSet()) of the wrapped learner.
As mentioned above, base::scale() allows for numeric and logical center and scale arguments. As we want to use the latter option, we declare center and scale as logical learner parameters.
lrn = makeLearner("regr.nnet", trace = FALSE)
lrn = makePreprocWrapper(lrn, train = trainfun, predict = predictfun,
  par.set = makeParamSet(
    makeLogicalLearnerParam("center"),
    makeLogicalLearnerParam("scale")
  ),
  par.vals = list(center = TRUE, scale = TRUE))
lrn
## Learner regr.nnet.preproc from package nnet
## Type: regr
## Name: ; Short name:
## Class: PreprocWrapper
## Properties: numerics,factors,weights
## Predict-Type: response
## Hyperparameters: size=3,trace=FALSE,center=TRUE,scale=TRUE
getParamSet(lrn)
##            Type len    Def      Constr Req Tunable Trafo
## center  logical   -      -           -   -    TRUE     -
## scale   logical   -      -           -   -    TRUE     -
## size    integer   -      3    0 to Inf   -    TRUE     -
## maxit   integer   -    100    1 to Inf   -    TRUE     -
## skip    logical   -  FALSE           -   -    TRUE     -
## rang    numeric   -    0.7 -Inf to Inf   -    TRUE     -
## decay   numeric   -      0    0 to Inf   -    TRUE     -
## Hess    logical   -  FALSE           -   -    TRUE     -
## trace   logical   -   TRUE           -   -   FALSE     -
## MaxNWts integer   -   1000    1 to Inf   -   FALSE     -
## abstol  numeric   - 0.0001 -Inf to Inf   -    TRUE     -
## reltol  numeric   -  1e-08 -Inf to Inf   -    TRUE     -
Now we do a simple grid search for the decay parameter of nnet::nnet() and the center and scale parameters.
rdesc = makeResampleDesc("Holdout")
ps = makeParamSet(
  makeDiscreteParam("decay", c(0, 0.05, 0.1)),
  makeLogicalParam("center"),
  makeLogicalParam("scale")
)
ctrl = makeTuneControlGrid()
res = tuneParams(lrn, bh.task, rdesc, par.set = ps, control = ctrl, show.info = FALSE)
res
## Tune result:
## Op. pars: decay=0.1; center=FALSE; scale=TRUE
## mse.test.mean=14.0485584
df = as.data.frame(res$opt.path)
df[, -ncol(df)]
## decay center scale mse.test.mean dob eol error.message
## 1 0 TRUE TRUE 18.90371 1 NA <NA>
## 2 0.05 TRUE TRUE 22.41828 2 NA <NA>
## 3 0.1 TRUE TRUE 20.75765 3 NA <NA>
## 4 0 FALSE TRUE 17.62376 4 NA <NA>
## 5 0.05 FALSE TRUE 17.12806 5 NA <NA>
## 6 0.1 FALSE TRUE 14.04856 6 NA <NA>
## 7 0 TRUE FALSE 78.98051 7 NA <NA>
## 8 0.05 TRUE FALSE 44.28004 8 NA <NA>
## 9 0.1 TRUE FALSE 56.37346 9 NA <NA>
## 10 0 FALSE FALSE 90.20231 10 NA <NA>
## 11 0.05 FALSE FALSE 90.20117 11 NA <NA>
## 12 0.1 FALSE FALSE 30.18762 12 NA <NA>
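The twelve rows of the optimization path correspond to the full factorial combination of the three decay values with the two logical parameters. A base R sketch with expand.grid() yields the same combinations in the same order; how the grid is generated internally is an assumption here.

```r
# All combinations of the tuned parameter values (3 x 2 x 2 = 12)
grid = expand.grid(
  decay = c(0, 0.05, 0.1),
  center = c(TRUE, FALSE),
  scale = c(TRUE, FALSE)
)
print(nrow(grid))
print(grid)
```

Each added parameter value multiplies the number of models to fit, which is why the grid here is kept deliberately small.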
If you have written a preprocessing wrapper that you might want to use from time to time, it’s a good idea to encapsulate it in its own function as shown below. If you think your preprocessing method is something others might want to use as well and should be integrated into mlr, just contact us.
makePreprocWrapperScale = function(learner, center = TRUE, scale = TRUE) {
  # Train function: scale the numerical features and store the scaling parameters
  trainfun = function(data, target, args = list(center = center, scale = scale)) {
    cns = colnames(data)
    nums = setdiff(cns[sapply(data, is.numeric)], target)
    x = as.matrix(data[, nums, drop = FALSE])
    x = scale(x, center = args$center, scale = args$scale)
    control = args
    if (is.logical(control$center) && control$center)
      control$center = attr(x, "scaled:center")
    if (is.logical(control$scale) && control$scale)
      control$scale = attr(x, "scaled:scale")
    data = data[, setdiff(cns, nums), drop = FALSE]
    data = cbind(data, as.data.frame(x))
    return(list(data = data, control = control))
  }
  # Predict function: scale new data with the parameters from the training stage
  predictfun = function(data, target, args, control) {
    cns = colnames(data)
    nums = cns[sapply(data, is.numeric)]
    x = as.matrix(data[, nums, drop = FALSE])
    x = scale(x, center = control$center, scale = control$scale)
    data = data[, setdiff(cns, nums), drop = FALSE]
    data = cbind(data, as.data.frame(x))
    return(data)
  }
  makePreprocWrapper(
    learner,
    train = trainfun,
    predict = predictfun,
    par.set = makeParamSet(
      makeLogicalLearnerParam("center"),
      makeLogicalLearnerParam("scale")
    ),
    par.vals = list(center = center, scale = scale)
  )
}
lrn = makePreprocWrapperScale("classif.lda")
train(lrn, iris.task)
## Model for learner.id=classif.lda.preproc; learner.class=PreprocWrapper
## Trained on: task.id = iris-example; obs = 150; features = 4
## Hyperparameters: center=TRUE,scale=TRUE