The function makeImputeMethod() lets you create your own imputation method. To do so, you need to specify a learn function that extracts the necessary information and an impute function that does the actual imputation. Both the learn and the impute function have at least the following formal arguments:

• data is a base::data.frame() with missing values in some features.
• col indicates the feature to be imputed.
• target indicates the target variable(s) in a supervised learning task.

# Example: Imputation using the mean

Let’s have a look at the function imputeMean() (see imputations()).

imputeMean = function() {
  makeImputeMethod(
    learn = function(data, target, col) mean(data[[col]], na.rm = TRUE),
    impute = simpleImpute
  )
}

imputeMean() calls the unexported mlr function simpleImpute(), which is defined as follows.

simpleImpute = function(data, target, col, const) {
  if (is.na(const)) {
    stopf("Error imputing column '%s'. Maybe all input data was missing?", col)
  }
  x = data[[col]]
  if (is.logical(x) && !is.logical(const)) {
    x = as.factor(x)
  }
  if (is.factor(x) && const %nin% levels(x)) {
    levels(x) = c(levels(x), as.character(const))
  }
  replace(x, is.na(x), const)
}

The learn function calculates the mean of the non-missing observations in column col. This mean is passed via the argument const to the impute function, which uses it to replace all missing values in feature col.
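The hand-off between the two steps can be traced by hand in base R. The following is only a minimal sketch, not mlr's internal code: the names learn_mean, impute_const and the toy data frame df are hypothetical and chosen for illustration.

```r
# Toy data (made up for illustration); column x has two missing values.
df = data.frame(x = c(1, NA, 3, NA, 8))

# Learn step: extract the information needed later, here the column mean.
learn_mean = function(data, target, col) {
  mean(data[[col]], na.rm = TRUE)
}

# Impute step: replace every NA in the column by the learned constant.
impute_const = function(data, target, col, const) {
  x = data[[col]]
  replace(x, is.na(x), const)
}

# Run both steps manually; target is unused in this sketch.
const = learn_mean(df, target = character(0), col = "x")
x_imputed = impute_const(df, target = character(0), col = "x", const = const)

print(const)
print(x_imputed)
```

Keeping the two steps separate means that whatever is learned on one data set can later be applied to another without recomputing it.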

# Writing your own imputation method

Now let’s write a new imputation method. A frequently used simple technique for longitudinal data is last observation carried forward (LOCF): missing values are replaced by the most recently observed value.

In the R code below, the learn function determines the last observed value before each run of consecutive NAs (values) as well as the length of that run (times). The impute function generates a vector by replicating the entries in values according to times and uses it to replace the NAs in feature col.

imputeLOCF = function() {
  makeImputeMethod(
    learn = function(data, target, col) {
      x = data[[col]]
      ind = is.na(x)
      dind = diff(ind)
      lastValue = which(dind == 1)  # position of the last observed value previous to NA
      lastNA = which(dind == -1)    # position of the last of potentially several consecutive NA's
      values = x[lastValue]         # last observed value previous to NA
      times = lastNA - lastValue    # number of consecutive NA's
      return(list(values = values, times = times))
    },
    impute = function(data, target, col, values, times) {
      x = data[[col]]
      replace(x, is.na(x), rep(values, times))
    }
  )
}
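To see what the intermediate quantities look like, the same steps can be traced on a small vector in base R. This is only an illustrative sketch; the vector x is made up and deliberately starts and ends with an observed value.

```r
# Toy series (made up): one run of two NAs and one run of a single NA.
x = c(5, NA, NA, 7, NA, 9)

ind = is.na(x)                # FALSE TRUE TRUE FALSE TRUE FALSE
dind = diff(ind)              # +1 marks observed -> NA, -1 marks NA -> observed

lastValue = which(dind == 1)  # positions of the last observed value before each run
lastNA = which(dind == -1)    # positions of the last NA of each run
values = x[lastValue]         # values to carry forward
times = lastNA - lastValue    # length of each run of NAs

# Repeat each carried value once per NA in its run and fill the gaps.
filled = replace(x, ind, rep(values, times))

print(list(values = values, times = times))
print(filled)
```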

Note that this function is for demonstration only and lacks some checks needed for real-world usage (for example, ‘What should happen if the first value in x is already missing?’). Below it is used to impute the missing values in the features Ozone and Solar.R of the airquality data set (datasets::airquality).

data(airquality)
imp = impute(airquality, cols = list(Ozone = imputeLOCF(), Solar.R = imputeLOCF()),
  dummy.cols = c("Ozone", "Solar.R"))
head(imp$data, 10)
##    Ozone Solar.R Wind Temp Month Day Ozone.dummy Solar.R.dummy
## 1     41     190  7.4   67     5   1       FALSE         FALSE
## 2     36     118  8.0   72     5   2       FALSE         FALSE
## 3     12     149 12.6   74     5   3       FALSE         FALSE
## 4     18     313 11.5   62     5   4       FALSE         FALSE
## 5     18     313 14.3   56     5   5        TRUE          TRUE
## 6     28     313 14.9   66     5   6       FALSE          TRUE
## 7     23     299  8.6   65     5   7       FALSE         FALSE
## 8     19      99 13.8   59     5   8       FALSE         FALSE
## 9      8      19 20.1   61     5   9       FALSE         FALSE
## 10     8     194  8.6   69     5  10        TRUE         FALSE
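As noted above, imputeLOCF() does not cover edge cases such as a series that starts with a missing value. One possible way to handle them is sketched below in base R. The helper locf() is hypothetical and not part of mlr, and leaving leading NAs untouched is an assumption; other policies (for example, filling them with the first observed value) may suit a given application better.

```r
# Hypothetical helper: carry the last observation forward.
# Leading NAs have no previous observation and are left as NA (an assumption).
locf = function(x) {
  obs = !is.na(x)
  pos = which(obs)          # positions of the observed values
  idx = cumsum(obs)         # number of observed values seen so far (0 = none yet)
  has_prev = idx > 0        # elements that have an observation at or before them
  out = x
  out[has_prev] = x[pos[idx[has_prev]]]
  out
}

# Made-up series with a leading NA and a trailing NA.
res = locf(c(NA, 1, NA, NA, 4, NA))
print(res)
```

Because this version indexes by the running count of observed values, it needs neither diff() nor matching start and end positions of NA runs, so trailing NAs are filled as well.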