vignettes/tutorial/create_imputation.Rmd
The function makeImputeMethod() lets you create your own imputation method. To do so, you need to specify a learn function that extracts the necessary information and an impute function that does the actual imputation. The learn and impute functions both have at least the following formal arguments:
data is a base::data.frame() with missing values in some features.
col indicates the feature to be imputed.
target indicates the target variable(s) in a supervised learning task.
Let’s have a look at the function imputeMean (imputations()).
imputeMean = function ()
{
    makeImputeMethod(learn = function(data, target, col) mean(data[[col]],
        na.rm = TRUE), impute = simpleImpute)
}
imputeMean (imputations()) calls the unexported mlr function simpleImpute, which is defined as follows.
simpleImpute = function (data, target, col, const)
{
    if (is.na(const)) {
        stopf("Error imputing column '%s'. Maybe all input data was missing?",
            col)
    }
    x = data[[col]]
    if (is.logical(x) && !is.logical(const)) {
        x = as.factor(x)
    }
    if (is.factor(x) && const %nin% levels(x)) {
        levels(x) = c(levels(x), as.character(const))
    }
    replace(x, is.na(x), const)
}
The learn function calculates the mean of the non-missing observations in column col. The mean is passed via the argument const to the impute function, which replaces all missing values in feature col with it.
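The hand-off between the learn and impute functions described above can be traced by hand in base R. The following is only a minimal sketch: the toy data frame df and the helper fill_const are hypothetical names introduced for illustration, and fill_const is a simplified stand-in for simpleImpute (numeric columns only, no error checks). It does not show how mlr invokes the two functions internally.

```r
# Toy data frame (hypothetical) with one missing value in feature "x"
df = data.frame(x = c(1, NA, 3), y = c("a", "b", "c"))

# learn step: same body as the learn function used by imputeMean
learn = function(data, target, col) mean(data[[col]], na.rm = TRUE)

# impute step: simplified, hypothetical stand-in for simpleImpute
fill_const = function(data, target, col, const) {
  x = data[[col]]
  replace(x, is.na(x), const)
}

# The learned value is handed to the impute step as `const`
const = learn(df, target = "y", col = "x")
imputed = fill_const(df, target = "y", col = "x", const = const)

print(const)    # 2
print(imputed)  # 1 2 3
```

Keeping the two steps separate means that the value learned on one data set can later be applied unchanged to another one.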
Now let’s write a new imputation method: A frequently used simple technique for longitudinal data is last observation carried forward (LOCF). Missing values are replaced by the most recent observed value.
In the R code below, the learn function determines the last observed value before each run of NAs (values) as well as the corresponding number of consecutive NAs (times). The impute function generates a vector by replicating the entries in values according to times and uses it to replace the NAs in feature col.
imputeLOCF = function() {
  makeImputeMethod(
    learn = function(data, target, col) {
      x = data[[col]]
      ind = is.na(x)
      dind = diff(ind)
      lastValue = which(dind == 1) # position of the last observed value previous to NA
      lastNA = which(dind == -1) # position of the last of potentially several consecutive NA's
      values = x[lastValue] # last observed value previous to NA
      times = lastNA - lastValue # number of consecutive NA's
      return(list(values = values, times = times))
    },
    impute = function(data, target, col, values, times) {
      x = data[[col]]
      replace(x, is.na(x), rep(values, times))
    }
  )
}
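To see what values and times look like, the same steps can be run by hand on a short vector. This is a small illustrative trace in base R; the vector x is made up for the example and, like the code above, assumes that the first and last entries are observed.

```r
# Hypothetical example vector: two runs of NAs, observed at both ends
x = c(5, NA, NA, 7, NA, 9)

ind = is.na(x)     # FALSE TRUE TRUE FALSE TRUE FALSE
dind = diff(ind)   # 1 0 -1 1 -1

lastValue = which(dind == 1)   # 1 4: positions right before each run of NAs
lastNA = which(dind == -1)     # 3 5: positions of the last NA in each run

values = x[lastValue]          # 5 7
times = lastNA - lastValue     # 2 1

# Replicate each carried-forward value once per NA in its run
filled = replace(x, ind, rep(values, times))
print(filled)                  # 5 5 5 7 7 9
```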
Note that imputeLOCF is just for demonstration and lacks some checks needed for real-world usage (for example, what should happen if the first value in x is already missing?). Below it is used to impute the missing values in the features Ozone and Solar.R in the airquality (datasets::airquality()) data set.
data(airquality)
imp = impute(airquality, cols = list(Ozone = imputeLOCF(), Solar.R = imputeLOCF()),
dummy.cols = c("Ozone", "Solar.R"))
head(imp$data, 10)
##    Ozone Solar.R Wind Temp Month Day Ozone.dummy Solar.R.dummy
## 1     41     190  7.4   67     5   1       FALSE         FALSE
## 2     36     118  8.0   72     5   2       FALSE         FALSE
## 3     12     149 12.6   74     5   3       FALSE         FALSE
## 4     18     313 11.5   62     5   4       FALSE         FALSE
## 5     18     313 14.3   56     5   5        TRUE          TRUE
## 6     28     313 14.9   66     5   6       FALSE          TRUE
## 7     23     299  8.6   65     5   7       FALSE         FALSE
## 8     19      99 13.8   59     5   8       FALSE         FALSE
## 9      8      19 20.1   61     5   9       FALSE         FALSE
## 10     8     194  8.6   69     5  10        TRUE         FALSE
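The demonstration code does not cover vectors that start or end with missing values. One possible way to handle these cases is sketched below in base R. The helper locf_fill is a hypothetical name and not part of mlr; it is a different formulation of LOCF based on cummax, and it makes the assumption that leading NAs, which have no earlier observation to carry forward, should simply stay missing.

```r
# Hypothetical helper: LOCF that tolerates leading and trailing NAs
locf_fill = function(x) {
  obs = !is.na(x)
  # For each position, the index of the most recent observed value (0 if none yet)
  idx = cummax(seq_along(x) * obs)
  out = x
  # Positions with idx == 0 precede the first observation and are left as NA
  out[idx > 0] = x[idx[idx > 0]]
  out
}

print(locf_fill(c(NA, 2, NA, NA, 5, NA)))  # NA 2 2 2 5 5

# On the Ozone column this agrees with the first ten imputed values shown above
data(airquality)
print(head(locf_fill(airquality$Ozone), 10))  # 41 36 12 18 18 28 23 19 8 8
```

Inside an imputation method, logic like this would still need to be split into a learn part and an impute part; the sketch only illustrates the edge-case handling.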