Function makeImputeMethod() permits to create your own imputation method. For this purpose you need to specify a learn function that extracts the necessary information and an impute function that does the actual imputation. The learn and impute functions both have at least the following formal arguments:

• data is a base::data.frame() with missing values in some features.
• col indicates the feature to be imputed.
• target indicates the target variable(s) in a supervised learning task.

# Example: Imputation using the mean

Let’s have a look at function imputeMean (imputations()).

imputeMean = function ()
{
makeImputeMethod(learn = function(data, target, col) mean(data[[col]],
na.rm = TRUE), impute = simpleImpute)
}

imputeMean (imputations()) calls the unexported mlr function simpleImpute which is defined as follows.

simpleImpute = function (data, target, col, const)
{
if (is.na(const)) {
stopf("Error imputing column '%s'. Maybe all input data was missing?",
col)
}
x = data[[col]]
if (is.logical(x) && !is.logical(const)) {
x = as.factor(x)
}
if (is.factor(x) && const %nin% levels(x)) {
levels(x) = c(levels(x), as.character(const))
}
replace(x, is.na(x), const)
}

The learn function calculates the mean of the non-missing observations in column col. The mean is passed via argument const to the impute function that replaces all missing values in feature col.

# Writing your own imputation method

Now let’s write a new imputation method: A frequently used simple technique for longitudinal data is last observation carried forward (LOCF). Missing values are replaced by the most recent observed value.

In the R code below the learn function determines the last observed value previous to each NA (values) as well as the corresponding number of consecutive NA's (times). The impute function generates a vector by replicating the entries in values according to times and replaces the NA's in feature col.

imputeLOCF = function() {
makeImputeMethod(
learn = function(data, target, col) {
x = data[[col]]
ind = is.na(x)
dind = diff(ind)
lastValue = which(dind == 1)  # position of the last observed value previous to NA
lastNA = which(dind == -1)    # position of the last of potentially several consecutive NA's
values = x[lastValue]         # last observed value previous to NA
times = lastNA - lastValue    # number of consecutive NA's
return(list(values = values, times = times))
},
impute = function(data, target, col, values, times) {
x = data[[col]]
replace(x, is.na(x), rep(values, times))
}
)
}

Note that this function is just for demonstration and is lacking some checks for real-world usage (for example ‘What should happen if the first value in x is already missing?’). Below it is used to impute the missing values in features Ozone and Solar.R in the airquality (datasets::airquality()) data set.

data(airquality)
imp = impute(airquality, cols = list(Ozone = imputeLOCF(), Solar.R = imputeLOCF()),
dummy.cols = c("Ozone", "Solar.R"))
## 10     8     194  8.6   69     5  10        TRUE         FALSE