vignettes/tutorial/create_learner.Rmd
In order to integrate a learning algorithm into mlr
some interface code has to be written. Three functions are mandatory for each learner.
makeLearner()
.)mlr
’s train()
function).mlr
’s predict.WrappedModel()
function).Technically, integrating a learning method means introducing a new S3 class
and implementing the corresponding methods for the generic functions RLearner()
, trainLearner()
, and predictLearner()
. Therefore we start with a quick overview of the involved classes and constructor functions.
As you already know makeLearner()
generates an object of class Learner (makeLearner()
).
class(makeLearner(cl = "classif.lda"))
## [1] "classif.lda" "RLearnerClassif" "RLearner" "Learner"
class(makeLearner(cl = "regr.lm"))
## [1] "regr.lm" "RLearnerRegr" "RLearner" "Learner"
class(makeLearner(cl = "surv.coxph"))
## [1] "surv.coxph" "RLearnerSurv" "RLearner" "Learner"
class(makeLearner(cl = "cluster.kmeans"))
## [1] "cluster.kmeans" "RLearnerCluster" "RLearner" "Learner"
class(makeLearner(cl = "multilabel.rFerns"))
## [1] "multilabel.rFerns" "RLearnerMultilabel" "RLearner"
## [4] "Learner"
The first element of each class attribute vector is the name of the learner class passed to the cl
argument of makeLearner()
. Obviously, this adheres to the naming conventions
"classif.<R_method_name>"
for classification,"multilabel.<R_method_name>"
for multilabel classification,"regr.<R_method_name>"
for regression,"surv.<R_method_name>"
for survival analysis, and"cluster.<R_method_name>"
for clustering.Additionally, there exist intermediate classes that reflect the type of learning problem, i.e., all classification learners inherit from RLearnerClassif
(RLearner()
), all regression learners from RLearnerRegr
(RLearner()
) and so on. Their superclasses are RLearner()
and finally Learner
(makeLearner()
). For all these (sub)classes there exist constructor functions makeRLearner
(RLearner()
), makeRLearnerClassif
(RLearner()
), makeRLearneRegr
(RLearner()
) etc. that are called internally by makeLearner()
.
A short side remark: As you might have noticed there does not exist a special learner class for cost-sensitive classification (costsens) with example-specific costs. This type of learning task is currently exclusively handled through wrappers like makeCostSensWeightedPairsWrapper()
.
In the following we show how to integrate learners for the five types of learning tasks mentioned above. Defining a completely new type of learner that has special properties and does not fit into one of the existing schemes is of course possible, but much more advanced and not covered here.
We use a classification example to explain some general principles (so even if you are interested in integrating a learner for another type of learning task you might want to read the following section). Examples for other types of learning tasks are shown later on.
We show how the Linear Discriminant Analysis (MASS::lda()
) from package MASS
has been integrated into the classification learner classif.lda
in mlr
as an example.
The minimal information required to define a learner is the mlr
name of the learner, its package, the parameter set, and the set of properties of your learner. In addition, you may provide a human-readable name, a short name and a note with information relevant to users of the learner.
First, name your learner. According to the naming conventions above the name starts with classif.
and we choose classif.lda
.
Second, we need to define the parameters of the learner. These are any options that can be set when running it to change how it learns, how input is interpreted, how and what output is generated, and so on. mlr
provides a number of functions to define parameters, a complete list can be found in the documentation of LearnerParam
(ParamHelpers::LearnerParam()
) of the ParamHelpers
package.
In our example, we have discrete and numeric parameters, so we use makeDiscreteLearnerParam
(ParamHelpers::LearnerParam()
) and makeNumericLearnerParam
(ParamHelpers::LearnerParam()
) to incorporate the complete description of the parameters. We include all possible values for discrete parameters and lower and upper bounds for numeric parameters. Strictly speaking it is not necessary to provide bounds for all parameters and if this information is not available they can be estimated, but providing accurate and specific information here makes it possible to tune the learner much better (see the section on tuning).
Next, we add information on the properties of the learner (see also the section on learners). Which types of features are supported (numerics, factors)? Are case weights supported? Are class weights supported? Can the method deal with missing values in the features and deal with NA’s in a meaningful way (not na.omit
)? Are one-class, two-class, multi-class problems supported? Can the learner predict posterior probabilities?
If the learner supports class weights the name of the relevant learner parameter can be specified via argument class.weights.param
.
Sometimes parameters allow certain values that don’t “fit” the formal parameter types. Often, these are also the default values. For example: A numeric parameter that accepts values as NA
, NULL
, "auto"
, and so on. For these there is the option special.vals
, which allows to list these extra values as feasible. An example is the missing
argument in xgboost
. It accepts numeric values (that’s why it’s set as a numeric parameter) and NULL
but the default is NA
.
makeNumericLearnerParam(id = "missing", default = NA, tunable = FALSE, when = "both",
special.vals = list(NA, NA_real_, NULL))
If a parameter depends on another one, the requires
attribute is added as an argument. Dependent parameters with a requires
field must use quote()
and not expression()
to define it.
Below is the complete code for the definition of the LDA learner. It has one discrete parameter, method
, and two continuous ones, nu
and tol
. It supports classification problems with two or more classes and can deal with numeric and factor explanatory variables. It can predict posterior probabilities.
makeRLearner.classif.lda = function() {
makeRLearnerClassif(
cl = "classif.lda",
package = "MASS",
par.set = makeParamSet(
makeDiscreteLearnerParam(id = "method", default = "moment",
values = c("moment", "mle", "mve", "t")),
makeNumericLearnerParam(id = "nu", lower = 2,
requires = quote(method == "t")),
makeNumericLearnerParam(id = "tol", default = 1e-4, lower = 0),
makeDiscreteLearnerParam(id = "predict.method",
values = c("plug-in", "predictive", "debiased"),
default = "plug-in", when = "predict"),
makeLogicalLearnerParam(id = "CV", default = FALSE, tunable = FALSE)
),
properties = c("twoclass", "multiclass", "numerics", "factors", "prob"),
name = "Linear Discriminant Analysis",
short.name = "lda",
note = "Learner param 'predict.method' maps to 'method' in predict.lda."
)
}
Once the learner has been defined, we need to tell mlr
how to call it to train a model. The name of the function has to start with trainLearner.
, followed by the mlr
name of the learner as defined above (classif.lda
here). The prototype of the function looks as follows.
This function must fit a model on the data of the task .task
with regard to the subset defined in the integer vector .subset
and the parameters passed in the ...
arguments. Usually, the data should be extracted from the task using getTaskData()
. This will take care of any subsetting as well. It must return the fitted model. mlr
assumes no special data type for the return value – it will be passed to the predict function we are going to define below, so any special code the learner may need can be encapsulated there.
For our example, the definition of the function looks like this. In addition to the data of the task, we also need the formula that describes what to predict. We use the function getTaskFormula()
to extract this from the task.
trainLearner.classif.lda = function (.learner, .task, .subset, .weights = NULL, ...)
{
f = getTaskFormula(.task)
MASS::lda(f, data = getTaskData(.task, .subset), ...)
}
Finally, the prediction function needs to be defined. The name of this function starts with predictLearner.
, followed again by the mlr
name of the learner. The prototype of the function is as follows.
It must predict for the new observations in the data.frame
.newdata
with the wrapped model .model
, which is returned from the training function. The actual model the learner built is stored in the $learner.model
member and can be accessed simply through .model$learner.model
.
For classification, you have to return a factor of predicted classes if .learner$predict.type
is "response"
, or a matrix of predicted probabilities if .learner$predict.type
is "prob"
and this type of prediction is supported by the learner. In the latter case the matrix must have the same number of columns as there are classes in the task and the columns have to be named by the class names.
The definition for LDA looks like this. It is pretty much just a straight pass-through of the arguments to the base::predict()
function and some extraction of prediction data depending on the type of prediction requested.
The main difference for regression is that the type of predictions are different (numeric instead of labels or probabilities) and that not all of the properties are relevant. In particular, whether one-, two-, or multi-class problems and posterior probabilities are supported is not applicable.
Apart from this, everything explained above applies. Below is the definition for the earth::earth()
learner.
makeRLearner.regr.earth = function() {
makeRLearnerRegr(
cl = "regr.earth",
package = "earth",
par.set = makeParamSet(
makeLogicalLearnerParam(id = "keepxy", default = FALSE, tunable = FALSE),
makeNumericLearnerParam(id = "trace", default = 0, upper = 10, tunable = FALSE),
makeIntegerLearnerParam(id = "degree", default = 1L, lower = 1L),
makeNumericLearnerParam(id = "penalty"),
makeIntegerLearnerParam(id = "nk", lower = 0L),
makeNumericLearnerParam(id = "thres", default = 0.001),
makeIntegerLearnerParam(id = "minspan", default = 0L),
makeIntegerLearnerParam(id = "endspan", default = 0L),
makeNumericLearnerParam(id = "newvar.penalty", default = 0),
makeIntegerLearnerParam(id = "fast.k", default = 20L, lower = 0L),
makeNumericLearnerParam(id = "fast.beta", default = 1),
makeDiscreteLearnerParam(id = "pmethod", default = "backward",
values = c("backward", "none", "exhaustive", "forward", "seqrep", "cv")),
makeIntegerLearnerParam(id = "nprune")
),
properties = c("numerics", "factors"),
name = "Multivariate Adaptive Regression Splines",
short.name = "earth",
note = ""
)
}
trainLearner.regr.earth = function (.learner, .task, .subset, .weights = NULL, ...)
{
f = getTaskFormula(.task)
earth::earth(f, data = getTaskData(.task, .subset), ...)
}
predictLearner.regr.earth = function (.learner, .model, .newdata, ...)
{
predict(.model$learner.model, newdata = .newdata)[, 1L]
}
Again most of the data is passed straight through to/from the train/predict functions of the learner.
For survival analysis, you have to return so-called linear predictors in order to compute the default measure for this task type, the cindex (for .learner$predict.type
== "response"
). For .learner$predict.type
== "prob"
, there is no substantially meaningful measure (yet). You may either ignore this case or return something like predicted survival curves (cf. example below).
There are three properties that are specific to survival learners: “rcens”, “lcens” and “icens”, defining the type(s) of censoring a learner can handle – right, left and/or interval censored.
Let’s have a look at how the Cox Proportional Hazard Model (survival::coxph()
) from package survival
has been integrated into the survival learner surv.coxph
in mlr
as an example:
makeRLearner.surv.coxph = function() {
makeRLearnerSurv(
cl = "surv.coxph",
package = "survival",
par.set = makeParamSet(
makeDiscreteLearnerParam(id = "ties", default = "efron",
values = c("efron", "breslow", "exact")),
makeLogicalLearnerParam(id = "singular.ok", default = TRUE),
makeNumericLearnerParam(id = "eps", default = 1e-09, lower = 0),
makeNumericLearnerParam(id = "toler.chol",
default = .Machine$double.eps^0.75, lower = 0),
makeIntegerLearnerParam(id = "iter.max", default = 20L, lower = 1L),
makeNumericLearnerParam(id = "toler.inf",
default = sqrt(.Machine$double.eps^0.75), lower = 0),
makeIntegerLearnerParam(id = "outer.max", default = 10L, lower = 1L),
makeLogicalLearnerParam(id = "model", default = FALSE, tunable = FALSE),
makeLogicalLearnerParam(id = "x", default = FALSE, tunable = FALSE),
makeLogicalLearnerParam(id = "y", default = TRUE, tunable = FALSE)
),
properties = c("missings", "numerics", "factors", "weights", "prob", "rcens"),
name = "Cox Proportional Hazard Model",
short.name = "coxph",
note = ""
)
}
trainLearner.surv.coxph = function (.learner, .task, .subset, .weights = NULL, ...)
{
f = getTaskFormula(.task)
data = getTaskData(.task, subset = .subset)
if (is.null(.weights)) {
survival::coxph(formula = f, data = data, ...)
}
else {
survival::coxph(formula = f, data = data, weights = .weights,
...)
}
}
predictLearner.surv.coxph = function (.learner, .model, .newdata, ...)
{
predict(.model$learner.model, newdata = .newdata, type = "lp",
...)
}
For clustering, you have to return a numeric vector with the IDs of the clusters that the respective datum has been assigned to. The numbering should start at 1.
Below is the definition for the FarthestFirst
(RWeka::FarthestFirst()
) learner from the RWeka
package. Weka starts the IDs of the clusters at 0, so we add 1 to the predicted clusters. RWeka has a different way of setting learner parameters; we use the special Weka_control
function to do this.
makeRLearner.cluster.FarthestFirst = function() {
makeRLearnerCluster(
cl = "cluster.FarthestFirst",
package = "RWeka",
par.set = makeParamSet(
makeIntegerLearnerParam(id = "N", default = 2L, lower = 1L),
makeIntegerLearnerParam(id = "S", default = 1L, lower = 1L),
makeLogicalLearnerParam(id = "output-debug-info", default = FALSE,
tunable = FALSE)
),
properties = c("numerics"),
name = "FarthestFirst Clustering Algorithm",
short.name = "farthestfirst"
)
}
trainLearner.cluster.FarthestFirst = function (.learner, .task, .subset, .weights = NULL, ...)
{
ctrl = RWeka::Weka_control(...)
RWeka::FarthestFirst(getTaskData(.task, .subset), control = ctrl)
}
predictLearner.cluster.FarthestFirst = function (.learner, .model, .newdata, ...)
{
as.integer(predict(.model$learner.model, .newdata, ...)) +
1L
}
As stated in the multilabel section, multilabel classification methods can be divided into problem transformation methods and algorithm adaptation methods.
At this moment the only problem transformation method implemented in mlr
is the binary relevance method (makeMultilabelBinaryRelevanceWrapper()
). Integrating more of these methods requires good knowledge of the architecture of the mlr
package.
The integration of an algorithm adaptation multilabel classification learner is easier and works very similar to the normal multiclass-classification. In contrast to the multiclass case, not all of the learner properties are relevant. In particular, whether one-, two-, or multi-class problems are supported is not applicable. Furthermore the prediction function output must be a matrix with each prediction of a label in one column and the names of the labels as column names. If .learner$predict.type
is "response"
the predictions must be logical. If .learner$predict.type
is "prob"
and this type of prediction is supported by the learner, the matrix must consist of predicted probabilities.
Below is the definition of the rFerns::rFerns()
learner from the rFerns
package, which does not support probability predictions.
makeRLearner.multilabel.rFerns = function() {
makeRLearnerMultilabel(
cl = "multilabel.rFerns",
package = "rFerns",
par.set = makeParamSet(
makeIntegerLearnerParam(id = "depth", default = 5L),
makeIntegerLearnerParam(id = "ferns", default = 1000L)
),
properties = c("numerics", "factors", "ordered"),
name = "Random ferns",
short.name = "rFerns",
note = ""
)
}
trainLearner.multilabel.rFerns = function (.learner, .task, .subset, .weights = NULL, ...)
{
d = getTaskData(.task, .subset, target.extra = TRUE)
rFerns::rFerns(x = d$data, y = as.matrix(d$target), ...)
}
predictLearner.multilabel.rFerns = function (.learner, .model, .newdata, ...)
{
as.matrix(predict(.model$learner.model, .newdata, ...))
}
Some learners, for example decision trees and random forests, can calculate feature importance values, which can be extracted from a fitted model (makeWrappedModel()
) using function getFeatureImportance()
.
If your newly integrated learner supports this you need to
"featimp"
to the learner properties andgetFeatureImportanceLearner()
(which later is called internally by getFeatureImportance()
) in order to make this work.This method takes the Learner()
.learner
, the WrappedModel (makeWrappedModel()
) .model
and potential further arguments and calculates or extracts the feature importance. It must return a named vector of importance values.
Below are two simple examples. In case of "classif.rpart"
the feature importance values can be easily extracted from the fitted model.
getFeatureImportanceLearner.classif.rpart = function (.learner, .model, ...)
{
mod = getLearnerModel(.model, more.unwrap = TRUE)
mod$variable.importance
}
For the randomForestSRC::rfsrc()
from package randomForestSRC
function randomForestSRC::vimp()
is called.
getFeatureImportanceLearner.classif.randomForestSRC = function (.learner, .model, ...)
{
mod = getLearnerModel(.model, more.unwrap = TRUE)
randomForestSRC::vimp(mod, ...)$importance[, "all"]
}
Many ensemble learners generate out-of-bag predictions (OOB predictions) automatically. mlr
provides the function getOOBPreds()
to access these predictions in the mlr
framework.
If your newly integrated learner is able to calculate OOB predictions and you want to be able to access them in mlr
via getOOBPreds()
you need to
"oobpreds"
to the learner properties andgetOOBPredsLearner()
(which later is called internally by getOOBPreds()
).This method takes the Learner (makeLearner()
) .learner
and the WrappedModel (makeWrappedModel()
) .model
and extracts the OOB predictions. It must return the predictions in the same format as the predictLearner()
function.
getOOBPredsLearner.classif.randomForest = function (.learner, .model)
{
if (.learner$predict.type == "response") {
m = getLearnerModel(.model, more.unwrap = TRUE)
unname(m$predicted)
}
else {
getLearnerModel(.model, more.unwrap = TRUE)$votes
}
}
If your interface code to a new learning algorithm exists only locally, i.e., it is not (yet) merged into mlr
or does not live in an extra package with a proper namespace you might want to register the new S3 methods to make sure that these are found by, e.g., listLearners()
. You can do this as follows:
registerS3method("makeRLearner", "<awesome_new_learner_class>",
makeRLearner.<awesome_new_learner_class>)
registerS3method("trainLearner", "<awesome_new_learner_class>",
trainLearner.<awesome_new_learner_class>)
registerS3method("predictLearner", "<awesome_new_learner_class>",
predictLearner.<awesome_new_learner_class>)
If you have written more methods, for example in order to extract feature importance values or out-of-bag predictions these also need to be registered in the same manner, for example:
registerS3method("getFeatureImportanceLearner", "<awesome_new_learner_class>",
getFeatureImportanceLearner.<awesome_new_learner_class>)
For the new learner to work with parallelization, you may have to export the new methods explicitly:
If you haven’t written a learner interface for private use only, but intend to send a pull request to have it included in the mlr
package there are a few things to take care of, most importantly unit testing!
For general information about contributing to the package, unit testing, version control setup and the like please also read the coding guidelines in the mlr Wiki.
The R file containing the interface code should adhere to the naming convention RLearner_<type>_<learner_name>.R
, e.g., RLearner_classif_lda.R
, see for example https://github.com/mlr-org/mlr/blob/master/R/RLearner_classif_lda.R and contain the necessary roxygen @export
tags to register the S3 methods in the NAMESPACE.
The learner interfaces should work out of the box without requiring any parameters to be set, e.g., train("classif.lda", iris.task)
should run. Sometimes, this makes it necessary to change or set some additional defaults as explained above and – very important – informing the user about this in the note
.
The parameter set of the learner should be as complete as possible.
Every learner interface must be unit tested.
The tests make sure that we get the same results when the learner is invoked through the mlr
interface and when using the original functions. If you are not familiar or want to learn more about unit testing and package testthat
have a look at the Testing chapter in Hadley Wickham’s R packages.
In mlr
all unit tests are in the following directory: https://github.com/mlr-org/mlr/tree/master/tests/testthat. For each learner interface there is an individual file whose name follows the scheme test_<type>_<learner_name>.R
, for example https://github.com/mlr-org/mlr/blob/master/tests/testthat/test_classif_lda.R.
Below is a snippet from the tests of the lda interface https://github.com/mlr-org/mlr/blob/master/tests/testthat/test_classif_lda.R.
test_that("classif_lda", {
requirePackagesOrSkip("MASS", default.method = "load")
set.seed(getOption("mlr.debug.seed"))
m = MASS::lda(formula = multiclass.formula, data = multiclass.train)
set.seed(getOption("mlr.debug.seed"))
p = predict(m, newdata = multiclass.test)
testSimple("classif.lda", multiclass.df, multiclass.target, multiclass.train.inds, p$class)
testProb("classif.lda", multiclass.df, multiclass.target, multiclass.train.inds, p$posterior)
})
The tests make use of numerous helper objects and helper functions. All of these are defined in the helper_
files in https://github.com/mlr-org/mlr/blob/master/tests/testthat/.
In the above code the first line just loads package MASS
or skips the test if the package is not available. The objects multiclass.formula
, multiclass.train
, multiclass.test
etc. are defined in https://github.com/mlr-org/mlr/blob/master/tests/testthat/helper_objects.R. We tried to choose fairly self-explanatory names: For example multiclass
indicates a multi-class classification problem, multiclass.train
contains data for training, multiclass.formula
a formula
object etc.
The test fits an lda model on the training set and makes predictions on the test set using the original functions MASS::lda()
and MASS:predict.lda()
. The helper functions testSimple
and testProb
perform training and prediction on the same data using the mlr
interface – testSimple
for predict.type = "response
and testProbs
for predict.type = "prob"
– and check if the predicted class labels and probabilities coincide with the outcomes p$class
and p$posterior
.
In order to get reproducible results seeding is required for many learners. The "mlr.debug.seed"
works as follows: When invoking the tests the option "mlr.debug.seed"
is set (see https://github.com/mlr-org/mlr/blob/master/tests/testthat/helper_zzz.R), and set.seed(getOption("mlr.debug.seed"))
is used to specify the seed. Internally, mlr
’s train and predict.WrappedModel functions check if the "mlr.debug.seed"
option is set and if yes, also specify the seed.
Note that the option "mlr.debug.seed"
is only set for testing, so no seeding happens in normal usage of mlr
.
Let’s look at a second example. Many learners have parameters that are commonly changed or tuned and it is important to make sure that these are passed through correctly. Below is a snippet from https://github.com/mlr-org/mlr/blob/master/tests/testthat/test_regr_randomForest.R.
test_that("regr_randomForest", {
requirePackagesOrSkip("randomForest", default.method = "load")
parset.list = list(
list(),
list(ntree = 5, mtry = 2),
list(ntree = 5, mtry = 4),
list(proximity = TRUE, oob.prox = TRUE),
list(nPerm = 3)
)
old.predicts.list = list()
for (i in 1:length(parset.list)) {
parset = parset.list[[i]]
pars = list(formula = regr.formula, data = regr.train)
pars = c(pars, parset)
set.seed(getOption("mlr.debug.seed"))
m = do.call(randomForest::randomForest, pars)
set.seed(getOption("mlr.debug.seed"))
p = predict(m, newdata = regr.test, type = "response")
old.predicts.list[[i]] = p
}
testSimpleParsets("regr.randomForest", regr.df, regr.target,
regr.train.inds, old.predicts.list, parset.list)
})
All tested parameter configurations are collected in the parset.list
. In order to make sure that the default parameter configuration is tested the first element of the parset.list
is an empty list
(base::list()
). Then we simply loop over all parameter settings and store the resulting predictions in old.predicts.list
. Again the helper function testSimpleParsets
does the same using the mlr
interface and compares the outcomes.
Additional to tests for individual learners we also have general tests that loop through all integrated learners and make for example sure that learners have the correct properties (e.g. a learner with property "factors"
can cope with factor
(base::factor()
) features, a learner with property "weights"
takes observation weights into account properly etc.). For example https://github.com/mlr-org/mlr/blob/master/tests/testthat/test_learners_all_classif.R runs through all classification learners. Similar tests exist for all types of learning methods like regression, cluster and survival analysis as well as multilabel classification.
In order to run all tests for, e.g., classification learners on your machine you can invoke the tests from within R by
or from the command line using Michel’s rt tool
rtest --filter=classif