Integrating Another Learner

In order to integrate a learning algorithm into mlr some interface code has to be written. Three functions are mandatory for each learner.

First, define a new learner class with a name, description, capabilities, parameters, and a few other things. (An object of this class can then be generated by makeLearner().)
Second, you need to provide a function that calls the learner function and builds the model given data (which makes it possible to invoke training by calling mlr’s train() function).
Finally, a prediction function that returns predicted values given new data is required (which enables invoking prediction by calling mlr’s predict.WrappedModel() function).

Technically, integrating a learning method means introducing a new S3 class and implementing the corresponding methods for the generic functions makeRLearner(), trainLearner(), and predictLearner(). Therefore we start with a quick overview of the involved classes and constructor functions.
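As a reminder of how S3 dispatch works, here is a minimal base-R sketch. The generic fitToy and the class names "toy.mean" and "ToyLearner" are hypothetical and not part of mlr; they only mirror the pattern of a generic plus one method per learner class.

```r
# Hypothetical generic, playing the role that trainLearner() plays in mlr.
fitToy = function(.learner, ...) {
  UseMethod("fitToy")
}

# Method for the made-up learner class "toy.mean".
fitToy.toy.mean = function(.learner, y, ...) {
  mean(y)
}

# An object whose class vector starts with the learner name.
lrn = structure(list(id = "toy.mean"), class = c("toy.mean", "ToyLearner"))

# Dispatch selects fitToy.toy.mean because "toy.mean" is the first class.
res = fitToy(lrn, y = c(1, 2, 3))
print(res)
```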

Classes, constructors, and naming schemes

As you already know, makeLearner() generates an object of class Learner.

class(makeLearner(cl = "classif.lda"))
## [1] "classif.lda"     "RLearnerClassif" "RLearner"        "Learner"

class(makeLearner(cl = "regr.lm"))
## [1] "regr.lm"      "RLearnerRegr" "RLearner"     "Learner"

class(makeLearner(cl = "surv.coxph"))
## [1] "surv.coxph"   "RLearnerSurv" "RLearner"     "Learner"

class(makeLearner(cl = "cluster.kmeans"))
## [1] "cluster.kmeans"  "RLearnerCluster" "RLearner"        "Learner"

class(makeLearner(cl = "multilabel.rFerns"))
## [1] "multilabel.rFerns"  "RLearnerMultilabel" "RLearner"          
## [4] "Learner"

The first element of each class attribute vector is the name of the learner class passed to the cl argument of makeLearner(). Obviously, this adheres to the naming conventions

"classif.<R_method_name>" for classification,
"multilabel.<R_method_name>" for multilabel classification,
"regr.<R_method_name>" for regression,
"surv.<R_method_name>" for survival analysis, and
"cluster.<R_method_name>" for clustering.

Additionally, there exist intermediate classes that reflect the type of learning problem, i.e., all classification learners inherit from RLearnerClassif, all regression learners from RLearnerRegr, and so on. Their superclasses are RLearner and finally Learner. For all these (sub)classes there exist constructor functions makeRLearner(), makeRLearnerClassif(), makeRLearnerRegr() etc. that are called internally by makeLearner().

A short side remark: As you might have noticed there does not exist a special learner class for cost-sensitive classification (costsens) with example-specific costs. This type of learning task is currently exclusively handled through wrappers like makeCostSensWeightedPairsWrapper().

In the following we show how to integrate learners for the five types of learning tasks mentioned above. Defining a completely new type of learner that has special properties and does not fit into one of the existing schemes is of course possible, but much more advanced and not covered here.

We use a classification example to explain some general principles (so even if you are interested in integrating a learner for another type of learning task you might want to read the following section). Examples for other types of learning tasks are shown later on.

Classification

We show how the Linear Discriminant Analysis (MASS::lda()) from package MASS has been integrated into the classification learner classif.lda in mlr as an example.

Definition of the learner

The minimal information required to define a learner is the mlr name of the learner, its package, the parameter set, and the set of properties of your learner. In addition, you may provide a human-readable name, a short name and a note with information relevant to users of the learner.

First, name your learner. According to the naming conventions above the name starts with classif. and we choose classif.lda.

Second, we need to define the parameters of the learner. These are any options that can be set when running it to change how it learns, how input is interpreted, how and what output is generated, and so on. mlr provides a number of functions to define parameters. A complete list can be found in the documentation of LearnerParam (ParamHelpers::LearnerParam()) in the ParamHelpers package.

In our example, we have discrete and numeric parameters, so we use makeDiscreteLearnerParam() and makeNumericLearnerParam() to incorporate the complete description of the parameters. We include all possible values for discrete parameters and lower and upper bounds for numeric parameters. Strictly speaking it is not necessary to provide bounds for all parameters, and if this information is not available they can be estimated, but providing accurate and specific information here makes it possible to tune the learner much better (see the section on tuning).

Next, we add information on the properties of the learner (see also the section on learners). Which types of features are supported (numerics, factors)? Are case weights supported? Are class weights supported? Can the method deal with missing values in the features and deal with NA’s in a meaningful way (not na.omit)? Are one-class, two-class, multi-class problems supported? Can the learner predict posterior probabilities?

If the learner supports class weights the name of the relevant learner parameter can be specified via argument class.weights.param.
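To see which properties an existing learner declares, the following sketch queries the already integrated classif.lda learner. It assumes that the packages mlr and MASS are installed.

```r
library(mlr)

lrn = makeLearner("classif.lda")

# Character vector of all properties declared in the learner definition.
props = getLearnerProperties(lrn)
print(props)

# Element-wise check for individual properties.
print(hasLearnerProperties(lrn, c("prob", "weights")))
```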

Sometimes parameters allow certain values that don’t “fit” the formal parameter types. Often, these are also the default values. For example, a numeric parameter may also accept values such as NA, NULL, or "auto". For these there is the option special.vals, which lets you list these extra values as feasible. An example is the missing argument in xgboost. It accepts numeric values (that’s why it’s set as a numeric parameter) and NULL, but the default is NA.

makeNumericLearnerParam(id = "missing", default = NA, tunable = FALSE, when = "both",
  special.vals = list(NA, NA_real_, NULL))
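The effect of special.vals can be checked with isFeasible() from ParamHelpers. The parameter "toy" below is hypothetical and only serves as an illustration; the sketch assumes ParamHelpers is installed.

```r
library(ParamHelpers)

# Hypothetical non-negative numeric parameter that additionally accepts NA.
p = makeNumericLearnerParam(id = "toy", lower = 0, special.vals = list(NA))

ok.numeric = isFeasible(p, 1.5)  # ordinary value within the bounds
ok.special = isFeasible(p, NA)   # listed in special.vals
ok.negative = isFeasible(p, -1)  # violates the lower bound
print(c(ok.numeric, ok.special, ok.negative))
```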

If a parameter depends on another one, the requires attribute is added as an argument. Dependent parameters with a requires field must use quote() and not expression() to define it.
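The difference between quote() and expression() can be seen in base R. quote() returns an unevaluated call that can later be evaluated against a concrete parameter setting, whereas expression() returns an expression vector. The sketch below only illustrates this distinction and makes no claim about how ParamHelpers evaluates requirements internally.

```r
# A requirement as an unevaluated call.
req = quote(method == "t")
print(is.call(req))

# Evaluate the call against two candidate parameter settings.
met = eval(req, list(method = "t"))
not.met = eval(req, list(method = "mle"))
print(c(met, not.met))

# expression() yields a different kind of object.
ex = expression(method == "t")
print(is.call(ex))
print(is.expression(ex))
```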

Below is the complete code for the definition of the LDA learner. For training it has one discrete parameter, method, and two continuous ones, nu and tol. The definition also lists predict.method, which is only used at prediction time, and the logical parameter CV. The learner supports classification problems with two or more classes and can deal with numeric and factor explanatory variables. It can predict posterior probabilities.

makeRLearner.classif.lda = function() {
  makeRLearnerClassif(
    cl = "classif.lda",
    package = "MASS",
    par.set = makeParamSet(
      makeDiscreteLearnerParam(id = "method", default = "moment",
        values = c("moment", "mle", "mve", "t")),
      makeNumericLearnerParam(id = "nu", lower = 2,
        requires = quote(method == "t")),
      makeNumericLearnerParam(id = "tol", default = 1e-4, lower = 0),
      makeDiscreteLearnerParam(id = "predict.method",
        values = c("plug-in", "predictive", "debiased"),
        default = "plug-in", when = "predict"),
      makeLogicalLearnerParam(id = "CV", default = FALSE, tunable = FALSE)
    ),
    properties = c("twoclass", "multiclass", "numerics", "factors", "prob"),
    name = "Linear Discriminant Analysis",
    short.name = "lda",
    note = "Learner param 'predict.method' maps to 'method' in predict.lda."
  )
}
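Once such a definition exists, the parameter set is what makeLearner() uses to validate hyperparameter settings. A short usage sketch, assuming mlr and MASS are installed and using the classif.lda learner that ships with mlr:

```r
library(mlr)

# nu is only meaningful for method = "t", so both are set together.
lrn = makeLearner("classif.lda", method = "t", nu = 10)

# Hyperparameters that were explicitly set.
hp = getHyperPars(lrn)
print(hp)

# The complete parameter set from the learner definition.
print(getParamSet(lrn))
```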

Creating the training function of the learner

Once the learner has been defined, we need to tell mlr how to call it to train a model. The name of the function has to start with trainLearner., followed by the mlr name of the learner as defined above (classif.lda here). The prototype of the function looks as follows.

function(.learner, .task, .subset, .weights = NULL, ...) { }

This function must fit a model on the data of the task .task with regard to the subset defined in the integer vector .subset and the parameters passed in the ... arguments. Usually, the data should be extracted from the task using getTaskData(). This will take care of any subsetting as well. It must return the fitted model. mlr assumes no special data type for the return value – it will be passed to the predict function we are going to define below, so any special code the learner may need can be encapsulated there.

For our example, the definition of the function looks like this. In addition to the data of the task, we also need the formula that describes what to predict. We use the function getTaskFormula() to extract this from the task.

trainLearner.classif.lda = function (.learner, .task, .subset, .weights = NULL, ...) 
{
    f = getTaskFormula(.task)
    MASS::lda(f, data = getTaskData(.task, .subset), ...)
}
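From the user's side this function is reached through train(). The sketch below assumes mlr and MASS are installed and uses the iris.task example task from mlr; the choice of every second observation as training subset is arbitrary.

```r
library(mlr)

lrn = makeLearner("classif.lda")
train.inds = seq(1, 150, by = 2)

# train() calls trainLearner.classif.lda() internally.
mod = train(lrn, iris.task, subset = train.inds)

# The object returned by the training function is stored in the wrapped model.
fit = getLearnerModel(mod)
print(class(fit))
```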

Creating the prediction method

Finally, the prediction function needs to be defined. The name of this function starts with predictLearner., followed again by the mlr name of the learner. The prototype of the function is as follows.

function(.learner, .model, .newdata, ...) { }

It must predict for the new observations in the data.frame .newdata with the wrapped model .model, which is returned from the training function. The actual model the learner built is stored in the $learner.model member and can be accessed simply through .model$learner.model.

For classification, you have to return a factor of predicted classes if .learner$predict.type is "response", or a matrix of predicted probabilities if .learner$predict.type is "prob" and this type of prediction is supported by the learner. In the latter case the matrix must have the same number of columns as there are classes in the task and the columns have to be named by the class names.

The definition for LDA looks like this. It is pretty much just a straight pass-through of the arguments to the predict() function and some extraction of prediction data depending on the type of prediction requested.

predictLearner.classif.lda = function (.learner, .model, .newdata, predict.method = "plug-in", 
    ...) 
{
    p = predict(.model$learner.model, newdata = .newdata, method = predict.method, 
        ...)
    if (.learner$predict.type == "response") {
        return(p$class)
    }
    else {
        return(p$posterior)
    }
}
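The required format for probability predictions can be inspected on the existing learner. This is a usage sketch that assumes mlr and MASS are installed; the three selected rows of iris are arbitrary.

```r
library(mlr)

lrn = makeLearner("classif.lda", predict.type = "prob")
mod = train(lrn, iris.task)

# Predict three observations, one from each species.
pred = predict(mod, newdata = iris[c(1, 51, 101), ])

# One column per class, named by the class labels.
probs = getPredictionProbabilities(pred)
print(probs)
```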

Regression

The main difference for regression is that the type of prediction is different (numeric instead of labels or probabilities) and that not all of the properties are relevant. In particular, whether one-, two-, or multi-class problems and posterior probabilities are supported is not applicable.

Apart from this, everything explained above applies. Below is the definition for the earth::earth() learner.

makeRLearner.regr.earth = function() {
  makeRLearnerRegr(
    cl = "regr.earth",
    package = "earth",
    par.set = makeParamSet(
      makeLogicalLearnerParam(id = "keepxy", default = FALSE, tunable = FALSE),
      makeNumericLearnerParam(id = "trace", default = 0, upper = 10, tunable = FALSE),
      makeIntegerLearnerParam(id = "degree", default = 1L, lower = 1L),
      makeNumericLearnerParam(id = "penalty"),
      makeIntegerLearnerParam(id = "nk", lower = 0L),
      makeNumericLearnerParam(id = "thres", default = 0.001),
      makeIntegerLearnerParam(id = "minspan", default = 0L),
      makeIntegerLearnerParam(id = "endspan", default = 0L),
      makeNumericLearnerParam(id = "newvar.penalty", default = 0),
      makeIntegerLearnerParam(id = "fast.k", default = 20L, lower = 0L),
      makeNumericLearnerParam(id = "fast.beta", default = 1),
      makeDiscreteLearnerParam(id = "pmethod", default = "backward",
        values = c("backward", "none", "exhaustive", "forward", "seqrep", "cv")),
      makeIntegerLearnerParam(id = "nprune")
    ),
    properties = c("numerics", "factors"),
    name = "Multivariate Adaptive Regression Splines",
    short.name = "earth",
    note = ""
  )
}

trainLearner.regr.earth = function (.learner, .task, .subset, .weights = NULL, ...) 
{
    f = getTaskFormula(.task)
    earth::earth(f, data = getTaskData(.task, .subset), ...)
}

predictLearner.regr.earth = function (.learner, .model, .newdata, ...) 
{
    predict(.model$learner.model, newdata = .newdata)[, 1L]
}

Again most of the data is passed straight through to/from the train/predict functions of the learner.

Survival analysis

For survival analysis, you have to return so-called linear predictors in order to compute the default measure for this task type, the cindex (for .learner$predict.type == "response"). For .learner$predict.type == "prob", there is no substantially meaningful measure (yet). You may either ignore this case or return something like predicted survival curves (cf. example below).

There are three properties that are specific to survival learners: “rcens”, “lcens” and “icens”, defining the type(s) of censoring a learner can handle – right, left and/or interval censored.

Let’s have a look at how the Cox Proportional Hazard Model (survival::coxph()) from package survival has been integrated into the survival learner surv.coxph in mlr as an example:

makeRLearner.surv.coxph = function() {
  makeRLearnerSurv(
    cl = "surv.coxph",
    package = "survival",
    par.set = makeParamSet(
      makeDiscreteLearnerParam(id = "ties", default = "efron",
        values = c("efron", "breslow", "exact")),
      makeLogicalLearnerParam(id = "singular.ok", default = TRUE),
      makeNumericLearnerParam(id = "eps", default = 1e-09, lower = 0),
      makeNumericLearnerParam(id = "toler.chol",
        default = .Machine$double.eps^0.75, lower = 0),
      makeIntegerLearnerParam(id = "iter.max", default = 20L, lower = 1L),
      makeNumericLearnerParam(id = "toler.inf",
        default = sqrt(.Machine$double.eps^0.75), lower = 0),
      makeIntegerLearnerParam(id = "outer.max", default = 10L, lower = 1L),
      makeLogicalLearnerParam(id = "model", default = FALSE, tunable = FALSE),
      makeLogicalLearnerParam(id = "x", default = FALSE, tunable = FALSE),
      makeLogicalLearnerParam(id = "y", default = TRUE, tunable = FALSE)
    ),
    properties = c("missings", "numerics", "factors", "weights", "prob", "rcens"),
    name = "Cox Proportional Hazard Model",
    short.name = "coxph",
    note = ""
  )
}

trainLearner.surv.coxph = function (.learner, .task, .subset, .weights = NULL, ...) 
{
    f = getTaskFormula(.task)
    data = getTaskData(.task, subset = .subset)
    if (is.null(.weights)) {
        survival::coxph(formula = f, data = data, ...)
    }
    else {
        survival::coxph(formula = f, data = data, weights = .weights, 
            ...)
    }
}

predictLearner.surv.coxph = function (.learner, .model, .newdata, ...) 
{
    predict(.model$learner.model, newdata = .newdata, type = "lp", 
        ...)
}
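To see what the prediction function above returns, the following sketch calls the survival package directly, without mlr. It assumes the survival package is installed and uses its lung data set; the model formula is chosen only for illustration.

```r
library(survival)

# Fit a Cox model on two covariates without missing values.
fit = coxph(Surv(time, status) ~ age + sex, data = lung)

# type = "lp" returns one linear predictor per row of newdata.
lp = predict(fit, newdata = lung[1:5, ], type = "lp")
print(lp)
```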

Clustering

For clustering, you have to return a numeric vector with the IDs of the clusters that the respective datum has been assigned to. The numbering should start at 1.

Below is the definition for the FarthestFirst (RWeka::FarthestFirst()) learner from the RWeka package. Weka starts the IDs of the clusters at 0, so we add 1 to the predicted clusters. RWeka has a different way of setting learner parameters; we use the special Weka_control function to do this.

makeRLearner.cluster.FarthestFirst = function() {
  makeRLearnerCluster(
    cl = "cluster.FarthestFirst",
    package = "RWeka",
    par.set = makeParamSet(
      makeIntegerLearnerParam(id = "N", default = 2L, lower = 1L),
      makeIntegerLearnerParam(id = "S", default = 1L, lower = 1L),
      makeLogicalLearnerParam(id = "output-debug-info", default = FALSE,
        tunable = FALSE)
    ),
    properties = c("numerics"),
    name = "FarthestFirst Clustering Algorithm",
    short.name = "farthestfirst"
  )
}

trainLearner.cluster.FarthestFirst = function (.learner, .task, .subset, .weights = NULL, ...) 
{
    ctrl = RWeka::Weka_control(...)
    RWeka::FarthestFirst(getTaskData(.task, .subset), control = ctrl)
}

predictLearner.cluster.FarthestFirst = function (.learner, .model, .newdata, ...) 
{
    as.integer(predict(.model$learner.model, .newdata, ...)) + 
        1L
}

Multilabel classification

As stated in the multilabel section, multilabel classification methods can be divided into problem transformation methods and algorithm adaptation methods.

At this moment the only problem transformation method implemented in mlr is the binary relevance method (makeMultilabelBinaryRelevanceWrapper()). Integrating more of these methods requires good knowledge of the architecture of the mlr package.

The integration of an algorithm adaptation multilabel classification learner is easier and works very similarly to normal multiclass classification. In contrast to the multiclass case, not all of the learner properties are relevant. In particular, whether one-, two-, or multi-class problems are supported is not applicable. Furthermore, the prediction function output must be a matrix with each prediction of a label in one column and the names of the labels as column names. If .learner$predict.type is "response" the predictions must be logical. If .learner$predict.type is "prob" and this type of prediction is supported by the learner, the matrix must consist of predicted probabilities.
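The two output formats described above can be sketched with plain matrices. The label names and values are made up for illustration.

```r
# Hypothetical label names of a multilabel task.
labels = c("label1", "label2")

# predict.type = "response": logical matrix, one column per label.
resp = matrix(c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE),
  ncol = length(labels), dimnames = list(NULL, labels))

# predict.type = "prob": numeric matrix of the same shape.
prob = matrix(c(0.9, 0.2, 0.7, 0.1, 0.4, 0.8),
  ncol = length(labels), dimnames = list(NULL, labels))

print(resp)
print(prob)
```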

Below is the definition of the rFerns::rFerns() learner from the rFerns package, which does not support probability predictions.

makeRLearner.multilabel.rFerns = function() {
  makeRLearnerMultilabel(
    cl = "multilabel.rFerns",
    package = "rFerns",
    par.set = makeParamSet(
      makeIntegerLearnerParam(id = "depth", default = 5L),
      makeIntegerLearnerParam(id = "ferns", default = 1000L)
    ),
    properties = c("numerics", "factors", "ordered"),
    name = "Random ferns",
    short.name = "rFerns",
    note = ""
  )
}

trainLearner.multilabel.rFerns = function (.learner, .task, .subset, .weights = NULL, ...) 
{
    d = getTaskData(.task, .subset, target.extra = TRUE)
    rFerns::rFerns(x = d$data, y = as.matrix(d$target), ...)
}

predictLearner.multilabel.rFerns = function (.learner, .model, .newdata, ...) 
{
    as.matrix(predict(.model$learner.model, .newdata, ...))
}

Creating a new method for extracting feature importance values

Some learners, for example decision trees and random forests, can calculate feature importance values, which can be extracted from a fitted WrappedModel using the function getFeatureImportance().

If your newly integrated learner supports this you need to

add "featimp" to the learner properties and
implement a new S3 method for function getFeatureImportanceLearner() (which later is called internally by getFeatureImportance()) in order to make this work.

This method takes the Learner .learner, the WrappedModel .model and potential further arguments and calculates or extracts the feature importance. It must return a named vector of importance values.

Below is a simple example. In case of "classif.rpart" the feature importance values can be easily extracted from the fitted model.

getFeatureImportanceLearner.classif.rpart = function (.learner, .model, ...) 
{
    mod = getLearnerModel(.model, more.unwrap = TRUE)
    mod$variable.importance
}
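The element that this method extracts can be inspected on a model fitted with rpart directly. The sketch assumes the rpart package is installed and does not go through mlr.

```r
library(rpart)

fit = rpart(Species ~ ., data = iris)

# Named numeric vector: one importance value per feature used by the tree.
imp = fit$variable.importance
print(imp)
```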

Creating a new method for extracting out-of-bag predictions

Many ensemble learners generate out-of-bag predictions (OOB predictions) automatically. mlr provides the function getOOBPreds() to access these predictions in the mlr framework.

If your newly integrated learner is able to calculate OOB predictions and you want to be able to access them in mlr via getOOBPreds() you need to

add "oobpreds" to the learner properties and
implement a new S3 method for function getOOBPredsLearner() (which later is called internally by getOOBPreds()).

This method takes the Learner .learner and the WrappedModel .model and extracts the OOB predictions. It must return the predictions in the same format as the predictLearner() function.

getOOBPredsLearner.classif.randomForest = function (.learner, .model) 
{
    if (.learner$predict.type == "response") {
        m = getLearnerModel(.model, more.unwrap = TRUE)
        unname(m$predicted)
    }
    else {
        getLearnerModel(.model, more.unwrap = TRUE)$votes
    }
}
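From the user's side, the method above is reached via getOOBPreds(). A usage sketch, assuming mlr and randomForest are installed; ntree = 50 is an arbitrary small value to keep the example fast.

```r
library(mlr)

lrn = makeLearner("classif.randomForest", ntree = 50L)
mod = train(lrn, iris.task)

# Returns a Prediction object built from the OOB predictions.
oob = getOOBPreds(mod, iris.task)
oob.df = as.data.frame(oob)
print(head(oob.df))
```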

Registering your learner

If your interface code to a new learning algorithm exists only locally, i.e., it is not (yet) merged into mlr or does not live in an extra package with a proper namespace, you might want to register the new S3 methods to make sure that these are found by, e.g., listLearners(). You can do this as follows:

registerS3method("makeRLearner", "<awesome_new_learner_class>",
  makeRLearner.<awesome_new_learner_class>)
registerS3method("trainLearner", "<awesome_new_learner_class>",
  trainLearner.<awesome_new_learner_class>)
registerS3method("predictLearner", "<awesome_new_learner_class>",
  predictLearner.<awesome_new_learner_class>)

If you have written more methods, for example in order to extract feature importance values or out-of-bag predictions these also need to be registered in the same manner, for example:

registerS3method("getFeatureImportanceLearner", "<awesome_new_learner_class>",
  getFeatureImportanceLearner.<awesome_new_learner_class>)

For the new learner to work with parallelization, you may have to export the new methods explicitly:

parallelExport("trainLearner.<awesome_new_learner_class>",
  "predictLearner.<awesome_new_learner_class>")
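Putting the pieces together, here is a minimal end-to-end sketch of a locally defined learner. The learner "regr.toymean" and its parameter trim are hypothetical and exist only for this illustration: the learner ignores all features and predicts the (optionally trimmed) mean of the training targets. The sketch assumes mlr is installed and uses the bh.task example task.

```r
library(mlr)

# 1. Learner definition (hypothetical toy learner).
makeRLearner.regr.toymean = function() {
  makeRLearnerRegr(
    cl = "regr.toymean",
    package = "stats",
    par.set = makeParamSet(
      makeNumericLearnerParam(id = "trim", default = 0, lower = 0, upper = 0.5)
    ),
    properties = c("numerics", "factors"),
    name = "Toy mean predictor",
    short.name = "toymean"
  )
}

# 2. Training function: the "model" is just the mean of the targets.
trainLearner.regr.toymean = function(.learner, .task, .subset, .weights = NULL,
  trim = 0, ...) {
  d = getTaskData(.task, .subset, target.extra = TRUE)
  list(center = mean(d$target, trim = trim))
}

# 3. Prediction function: repeat the stored mean for every new observation.
predictLearner.regr.toymean = function(.learner, .model, .newdata, ...) {
  rep(.model$learner.model$center, nrow(.newdata))
}

registerS3method("makeRLearner", "regr.toymean", makeRLearner.regr.toymean)
registerS3method("trainLearner", "regr.toymean", trainLearner.regr.toymean)
registerS3method("predictLearner", "regr.toymean", predictLearner.regr.toymean)

lrn = makeLearner("regr.toymean")
mod = train(lrn, bh.task)
pred = predict(mod, bh.task)
print(head(getPredictionResponse(pred)))
```

The fitted model is a plain list here, which works because mlr makes no assumption about the type of the object returned by the training function.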

Further information for developers

If you intend to send a pull request to have your learner interface included in the mlr package, rather than keeping it for private use only, there are a few things to take care of, most importantly unit testing!

For general information about contributing to the package, unit testing, version control setup and the like please also read the coding guidelines in the mlr Wiki.

The R file containing the interface code should adhere to the naming convention RLearner_<type>_<learner_name>.R, e.g., RLearner_classif_lda.R (see for example https://github.com/mlr-org/mlr/blob/main/R/RLearner_classif_lda.R), and contain the necessary roxygen @export tags to register the S3 methods in the NAMESPACE.
The learner interfaces should work out of the box without requiring any parameters to be set, e.g., train("classif.lda", iris.task) should run. Sometimes, this makes it necessary to change or set some additional defaults and, very importantly, to inform the user about this in the note.
The parameter set of the learner should be as complete as possible.
Every learner interface must be unit tested.

Unit testing

The tests make sure that we get the same results when the learner is invoked through the mlr interface and when using the original functions. If you are not familiar with unit testing and the package testthat, or want to learn more about them, have a look at the Testing chapter in Hadley Wickham’s R packages.

In mlr all unit tests are in the following directory: https://github.com/mlr-org/mlr/tree/main/tests/testthat. For each learner interface there is an individual file whose name follows the scheme test_<type>_<learner_name>.R, for example https://github.com/mlr-org/mlr/blob/main/tests/testthat/test_classif_lda.R.

Below is a snippet from the tests of the lda interface https://github.com/mlr-org/mlr/blob/main/tests/testthat/test_classif_lda.R.

test_that("classif_lda", {
  requirePackagesOrSkip("MASS", default.method = "load")

  set.seed(getOption("mlr.debug.seed"))
  m = MASS::lda(formula = multiclass.formula, data = multiclass.train)
  set.seed(getOption("mlr.debug.seed"))
  p = predict(m, newdata = multiclass.test)

  testSimple("classif.lda", multiclass.df, multiclass.target, multiclass.train.inds, p$class)
  testProb("classif.lda", multiclass.df, multiclass.target, multiclass.train.inds, p$posterior)
})

The tests make use of numerous helper objects and helper functions. All of these are defined in the helper_ files in https://github.com/mlr-org/mlr/blob/main/tests/testthat/.

In the above code the first line just loads package MASS or skips the test if the package is not available. The objects multiclass.formula, multiclass.train, multiclass.test etc. are defined in https://github.com/mlr-org/mlr/blob/main/tests/testthat/helper_objects.R. We tried to choose fairly self-explanatory names: For example multiclass indicates a multi-class classification problem, multiclass.train contains data for training, multiclass.formula a formula object etc.

The test fits an lda model on the training set and makes predictions on the test set using the original functions MASS::lda() and predict() (which dispatches to the predict.lda method of MASS). The helper functions testSimple and testProb perform training and prediction on the same data using the mlr interface (testSimple for predict.type = "response" and testProb for predict.type = "prob") and check if the predicted class labels and probabilities coincide with the outcomes p$class and p$posterior.
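The idea behind such a comparison can be sketched without the test helpers. The code below is not the implementation of testSimple; it is an approximation under the assumption that mlr and MASS are installed, with an arbitrary train/test split of iris.

```r
library(mlr)

train.inds = seq(1, 150, by = 2)
test.inds = setdiff(seq_len(150), train.inds)

# Reference result: call the original functions directly.
m = MASS::lda(Species ~ ., data = iris[train.inds, ])
p = predict(m, newdata = iris[test.inds, ])

# Same data through the mlr interface.
mod = train(makeLearner("classif.lda"), iris.task, subset = train.inds)
pred = predict(mod, task = iris.task, subset = test.inds)

# Both routes should yield the same class labels.
same = identical(as.character(getPredictionResponse(pred)),
  as.character(p$class))
print(same)
```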

In order to get reproducible results seeding is required for many learners. The "mlr.debug.seed" works as follows: When invoking the tests the option "mlr.debug.seed" is set (see https://github.com/mlr-org/mlr/blob/main/tests/testthat/helper_zzz.R), and set.seed(getOption("mlr.debug.seed")) is used to specify the seed. Internally, mlr’s train and predict.WrappedModel functions check if the "mlr.debug.seed" option is set and if yes, also specify the seed.

Note that the option "mlr.debug.seed" is only set for testing, so no seeding happens in normal usage of mlr.
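The seeding pattern used in the tests can be reproduced in base R. The seed value 123 below is an assumption for illustration; the option is restored afterwards so that normal mlr usage is not affected.

```r
# Temporarily set the option (the value is an arbitrary choice here).
old = options(mlr.debug.seed = 123L)

# Re-seeding before each random operation makes the draws identical.
set.seed(getOption("mlr.debug.seed"))
a = runif(3)
set.seed(getOption("mlr.debug.seed"))
b = runif(3)
print(identical(a, b))

# Restore the previous state of the option.
options(old)
```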

Let’s look at a second example. Many learners have parameters that are commonly changed or tuned and it is important to make sure that these are passed through correctly. Below is a snippet from https://github.com/mlr-org/mlr/blob/main/tests/testthat/test_regr_randomForest.R.

test_that("regr_randomForest", {
  requirePackagesOrSkip("randomForest", default.method = "load")

  parset.list = list(
    list(),
    list(ntree = 5, mtry = 2),
    list(ntree = 5, mtry = 4),
    list(proximity = TRUE, oob.prox = TRUE),
    list(nPerm = 3)
  )

  old.predicts.list = list()

  for (i in 1:length(parset.list)) {
    parset = parset.list[[i]]
    pars = list(formula = regr.formula, data = regr.train)
    pars = c(pars, parset)
    set.seed(getOption("mlr.debug.seed"))
    m = do.call(randomForest::randomForest, pars)
    set.seed(getOption("mlr.debug.seed"))
    p = predict(m, newdata = regr.test, type = "response")
    old.predicts.list[[i]] = p
  }

  testSimpleParsets("regr.randomForest", regr.df, regr.target,
    regr.train.inds, old.predicts.list, parset.list)
})

All tested parameter configurations are collected in the parset.list. In order to make sure that the default parameter configuration is tested, the first element of the parset.list is an empty list. Then we simply loop over all parameter settings and store the resulting predictions in old.predicts.list. Again the helper function testSimpleParsets does the same using the mlr interface and compares the outcomes.
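The pattern of combining fixed arguments with one parameter configuration per iteration and calling the function via do.call() can be illustrated in base R. Here mean() merely stands in for a learner function, and the data are made up.

```r
x = c(1, 2, 3, 100)

# First element is empty, so the defaults are tested as well.
parset.list = list(
  list(),
  list(trim = 0.25)
)

results = list()
for (i in seq_along(parset.list)) {
  # Fixed arguments plus the current parameter configuration.
  pars = c(list(x = x), parset.list[[i]])
  results[[i]] = do.call(mean, pars)
}
print(results)
```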

In addition to tests for individual learners we also have general tests that loop through all integrated learners and make sure, for example, that learners have the correct properties (e.g. a learner with property "factors" can cope with factor features, a learner with property "weights" takes observation weights into account properly etc.). For example https://github.com/mlr-org/mlr/blob/main/tests/testthat/test_learners_all_classif.R runs through all classification learners. Similar tests exist for all types of learning methods like regression, cluster and survival analysis as well as multilabel classification.

In order to run all tests for, e.g., classification learners on your machine you can invoke the tests from within R by

devtools::test("mlr", filter = "classif")

or from the command line using Michel’s rt tool

rtest --filter=classif