mlr provides several imputation methods, which are listed on the help page imputations(). These include standard techniques such as imputation by a constant value (a fixed constant, the mean, median or mode) and imputation by random numbers (drawn either from the empirical distribution of the feature under consideration or from a certain distribution family). Moreover, missing values in one feature can be replaced based on the other features by predictions from any supervised Learner (makeLearner()) integrated into mlr.
If your favourite option is not implemented in mlr yet, you can easily create your own imputation method.
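For instance, here is a minimal sketch of a custom method that fills NA's with a trimmed mean, assuming the learn/impute interface documented on the makeImputeMethod() help page (the name imputeTrimmedMean is made up for this illustration):
# Custom imputation method: learn() computes the trimmed mean on the
# non-missing values, impute() fills the NA's with that constant.
imputeTrimmedMean = function(trim = 0.1) {
  makeImputeMethod(
    learn = function(data, target, col, trim) list(const = mean(data[[col]], trim = trim, na.rm = TRUE)),
    impute = function(data, target, col, const) {
      x = data[[col]]
      x[is.na(x)] = const
      x
    },
    args = list(trim = trim)
  )
}
Such a method can then be passed to impute() in the same way as the built-in ones, e.g., cols = list(Ozone = imputeTrimmedMean(0.1)).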
Also note that some of the learning algorithms included in mlr can deal with missing values in a sensible way, i.e., other than simply deleting observations with missing values. Those Learners (makeLearner()) have the property "missings" and thus can be identified using listLearners().
# Regression learners that can deal with missing values
listLearners("regr", properties = "missings")[c("class", "package")]
## class package
## 1 regr.bartMachine bartMachine
## 2 regr.cforest party
## 3 regr.ctree party
## 4 regr.cubist Cubist
## 5 regr.featureless mlr
## 6 regr.gbm gbm
## 7 regr.h2o.deeplearning h2o
## 8 regr.h2o.gbm h2o
## 9 regr.h2o.glm h2o
## 10 regr.h2o.randomForest h2o
## 11 regr.randomForestSRC randomForestSRC
## 12 regr.rpart rpart
## 13 regr.xgboost xgboost
See also the list of integrated learners in the Appendix.
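If you are interested in one particular learner you can also query its properties directly; a small sketch using mlr's learner property helpers:
# Check a single learner for the "missings" property
lrn.rpart = makeLearner("regr.rpart")
getLearnerProperties(lrn.rpart)
hasLearnerProperties(lrn.rpart, "missings")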
Imputation can be done by function impute(). You can specify an imputation method for each feature individually or for classes of features like numerics or factors. Moreover, you can generate dummy variables that indicate which values are missing, again either for classes of features or for individual features. These dummies allow you to identify the patterns and reasons for missing data and permit treating imputed and observed values differently in a subsequent analysis.
Let’s have a look at the airquality (datasets::airquality()) data set.
data(airquality)
summary(airquality)
## Ozone Solar.R Wind Temp
## Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
## 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
## Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
## Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
## 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
## NA's :37 NA's :7
## Month Day
## Min. :5.000 Min. : 1.0
## 1st Qu.:6.000 1st Qu.: 8.0
## Median :7.000 Median :16.0
## Mean :6.993 Mean :15.8
## 3rd Qu.:8.000 3rd Qu.:23.0
## Max. :9.000 Max. :31.0
##
There are 37 NA's in variable Ozone (ozone pollution) and 7 NA's in variable Solar.R (solar radiation). For demonstration purposes we insert artificial NA's in column Wind (wind speed) and coerce it into a factor.
airq = airquality
ind = sample(nrow(airq), 10)
airq$Wind[ind] = NA
airq$Wind = cut(airq$Wind, c(0,8,16,24))
summary(airq)
## Ozone Solar.R Wind Temp Month
## Min. : 1.00 Min. : 7.0 (0,8] :51 Min. :56.00 Min. :5.000
## 1st Qu.: 18.00 1st Qu.:115.8 (8,16] :85 1st Qu.:72.00 1st Qu.:6.000
## Median : 31.50 Median :205.0 (16,24]: 7 Median :79.00 Median :7.000
## Mean : 42.13 Mean :185.9 NA's :10 Mean :77.88 Mean :6.993
## 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:85.00 3rd Qu.:8.000
## Max. :168.00 Max. :334.0 Max. :97.00 Max. :9.000
## NA's :37 NA's :7
## Day
## Min. : 1.0
## 1st Qu.: 8.0
## Median :16.0
## Mean :15.8
## 3rd Qu.:23.0
## Max. :31.0
##
If you want to impute NA's in all integer features (these include Ozone and Solar.R) by the mean, in all factor features (Wind) by the mode and additionally generate dummy variables for all integer features, you can do this as follows:
imp = impute(airq, classes = list(integer = imputeMean(), factor = imputeMode()),
dummy.classes = "integer")
impute() returns a list where slot $data contains the imputed data set. By default, the dummy variables are factors with levels "TRUE" and "FALSE". It is also possible to create numeric zero-one indicator variables, as sketched after the output below.
head(imp$data, 10)
## Ozone Solar.R Wind Temp Month Day Ozone.dummy Solar.R.dummy
## 1 41.00000 190.0000 (0,8] 67 5 1 FALSE FALSE
## 2 36.00000 118.0000 (0,8] 72 5 2 FALSE FALSE
## 3 12.00000 149.0000 (8,16] 74 5 3 FALSE FALSE
## 4 18.00000 313.0000 (8,16] 62 5 4 FALSE FALSE
## 5 42.12931 185.9315 (8,16] 56 5 5 TRUE TRUE
## 6 28.00000 185.9315 (8,16] 66 5 6 FALSE TRUE
## 7 23.00000 299.0000 (8,16] 65 5 7 FALSE FALSE
## 8 19.00000 99.0000 (8,16] 59 5 8 FALSE FALSE
## 9 8.00000 19.0000 (16,24] 61 5 9 FALSE FALSE
## 10 42.12931 194.0000 (8,16] 69 5 10 TRUE FALSE
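If you prefer numeric zero-one indicators instead of factor dummies, this can be requested via the dummy.type argument of impute() (see its help page); a minimal sketch, not run here:
# Same imputation as above, but with numeric 0/1 dummy variables
imp.num = impute(airq, classes = list(integer = imputeMean(), factor = imputeMode()),
  dummy.classes = "integer", dummy.type = "numeric")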
Slot $desc is an ImputationDesc (impute()) object that stores all relevant information about the imputation. For the current example this includes the means and the mode computed on the non-missing data.
imp$desc
## Imputation description
## Target:
## Features: 6; Imputed: 6
## impute.new.levels: TRUE
## recode.factor.levels: TRUE
## dummy.type: factor
The imputation description shows the name of the target variable (not present), the number of features and the number of imputed features. Note that the latter number refers to the features for which an imputation method was specified (five integers plus one factor) and not to the features actually containing NA's. dummy.type indicates that the dummy variables are factors. For details on impute.new.levels and recode.factor.levels see the help page of function impute().
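Both are ordinary arguments of impute() and can be set explicitly; the following sketch simply spells out the defaults reported in the description above:
# Both options control how factor levels are treated when the description
# is later applied to new data via reimpute()
imp.lvl = impute(airq, classes = list(integer = imputeMean(), factor = imputeMode()),
  dummy.classes = "integer", impute.new.levels = TRUE, recode.factor.levels = TRUE)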
Let’s have a look at another example involving a target variable. A possible learning task associated with the airquality (datasets::airquality()) data is to predict the ozone pollution based on the meteorological features. Since we do not want to use columns Day and Month we remove them.
airq = subset(airq, select = 1:4)
The first 100 observations are used as the training data set.
airq.train = airq[1:100,]
airq.test = airq[-c(1:100),]
In case of a supervised learning problem you need to pass the name of the target variable to impute(). This prevents imputation and creation of a dummy variable for the target variable itself and makes sure that the target variable is not used to impute the features.
In contrast to the example above we specify imputation methods for individual features instead of classes of features.
Missing values in Solar.R are imputed by random numbers drawn from the empirical distribution of the non-missing observations.
Function imputeLearner (imputations()) allows you to use all supervised learning algorithms integrated into mlr for imputation. The type of the Learner (makeLearner()) (regr, classif) must correspond to the class of the feature to be imputed. The missing values in Wind are replaced by the predictions of a classification tree (rpart::rpart()). By default, all available columns in airq.train except the target variable (Ozone) and the variable to be imputed (Wind) are used as features in the classification tree, here Solar.R and Temp. You can also select manually which columns to use; a brief sketch of this follows after the output below. Note that rpart::rpart() can deal with missing feature values, therefore the NA's in column Solar.R do not pose a problem.
imp = impute(airq.train, target = "Ozone", cols = list(Solar.R = imputeHist(),
Wind = imputeLearner("classif.rpart")), dummy.cols = c("Solar.R", "Wind"))
summary(imp$data)
## Ozone Solar.R Wind Temp Solar.R.dummy
## Min. : 1.00 Min. : 7.0 (0,8] :35 Min. :56.00 FALSE:93
## 1st Qu.: 16.00 1st Qu.:100.5 (8,16] :59 1st Qu.:69.00 TRUE : 7
## Median : 34.00 Median :223.0 (16,24]: 6 Median :79.50
## Mean : 41.59 Mean :192.0 Mean :76.87
## 3rd Qu.: 63.00 3rd Qu.:273.2 3rd Qu.:84.00
## Max. :135.00 Max. :334.0 Max. :93.00
## NA's :31
## Wind.dummy
## FALSE:94
## TRUE : 6
##
##
##
##
##
imp$desc
## Imputation description
## Target: Ozone
## Features: 3; Imputed: 2
## impute.new.levels: TRUE
## recode.factor.levels: TRUE
## dummy.type: factor
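As mentioned above, the columns used by imputeLearner can also be selected manually; a brief sketch using its features argument (here only Temp is offered to the classification tree):
# Restrict the classification tree for Wind to the single feature Temp
imp2 = impute(airq.train, target = "Ozone",
  cols = list(Solar.R = imputeHist(),
    Wind = imputeLearner("classif.rpart", features = "Temp")),
  dummy.cols = c("Solar.R", "Wind"))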
The ImputationDesc (impute()) object can be used by function reimpute() to impute the test data set the same way as the training data.
airq.test.imp = reimpute(airq.test, imp$desc)
head(airq.test.imp)
## Ozone Solar.R Wind Temp Solar.R.dummy Wind.dummy
## 1 110 207 (0,8] 90 FALSE FALSE
## 2 NA 222 (8,16] 92 FALSE FALSE
## 3 NA 137 (8,16] 86 FALSE FALSE
## 4 44 192 (8,16] 86 FALSE FALSE
## 5 28 273 (8,16] 82 FALSE FALSE
## 6 65 157 (8,16] 80 FALSE FALSE
Especially when evaluating a machine learning method by some resampling technique you might want impute()/reimpute() to be called automatically each time before training/prediction. This can be achieved by creating an imputation wrapper.
You can couple a Learner (makeLearner()) with imputation by function makeImputeWrapper(), which basically has the same formal arguments as impute(). Like in the example above we impute Solar.R by random numbers from its empirical distribution, Wind by the predictions of a classification tree, and generate dummy variables for both features.
lrn = makeImputeWrapper("regr.lm", cols = list(Solar.R = imputeHist(),
Wind = imputeLearner("classif.rpart")), dummy.cols = c("Solar.R", "Wind"))
lrn
## Learner regr.lm.imputed from package stats
## Type: regr
## Name: ; Short name:
## Class: ImputeWrapper
## Properties: numerics,factors,se,weights,missings
## Predict-Type: response
## Hyperparameters:
Before training the resulting Learner (makeLearner()), impute() is applied to the training set. Before prediction, reimpute() is called on the test set with the ImputationDesc (impute()) object from the training stage.
We again aim to predict the ozone pollution from the meteorological variables. In order to create the Task() we need to delete observations with missing values in the target variable.
airq = subset(airq, subset = !is.na(airq$Ozone))
task = makeRegrTask(data = airq, target = "Ozone")
In the following, the 3-fold cross-validated mean squared error is calculated.
rdesc = makeResampleDesc("CV", iters = 3)
r = resample(lrn, task, resampling = rdesc, show.info = FALSE, models = TRUE)
r$aggr
## mse.test.mean
## 483.2621
lapply(r$models, getLearnerModel, more.unwrap = TRUE)
## [[1]]
##
## Call:
## stats::lm(formula = f, data = d)
##
## Coefficients:
## (Intercept) Solar.R Wind(8,16] Wind(16,24]
## -72.62255 0.05914 -27.65865 -27.50254
## Temp Solar.R.dummyTRUE Wind.dummyTRUE
## 1.58889 -9.24041 1.45173
##
##
## [[2]]
##
## Call:
## stats::lm(formula = f, data = d)
##
## Coefficients:
## (Intercept) Solar.R Wind(8,16] Wind(16,24]
## -118.37366 0.07683 -18.92091 -1.82817
## Temp Solar.R.dummyTRUE Wind.dummyTRUE
## 2.01424 -7.18644 -14.57578
##
##
## [[3]]
##
## Call:
## stats::lm(formula = f, data = d)
##
## Coefficients:
## (Intercept) Solar.R Wind(8,16] Wind(16,24]
## -93.94976 0.06823 -20.38373 -17.85132
## Temp Solar.R.dummyTRUE Wind.dummyTRUE
## 1.75427 -17.09540 1.37333
A second possibility to fuse a learner with imputation is provided by makePreprocWrapperCaret(), which is an interface to caret's caret::preProcess() function. caret::preProcess() only works for numeric features and offers imputation by k-nearest neighbors, by bagged trees, and by the median.
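A rough sketch of such a fused learner, assuming the caret preprocessing options are exposed as the ppc.-prefixed arguments described on the makePreprocWrapperCaret() help page:
# Fuse a linear model with caret's k-nearest-neighbor imputation;
# ppc.bagImpute or ppc.medianImpute would select the other two methods
lrn.caret = makePreprocWrapperCaret("regr.lm", ppc.knnImpute = TRUE)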