Learning tasks encapsulate the data set and further relevant information about a machine learning problem, for example the name of the target variable for supervised problems.
Task types and creation
The tasks are organized in a hierarchy, with the generic Task() at the top. The following tasks can be instantiated and all inherit from the virtual superclass Task():
- RegrTask() for regression problems,
- ClassifTask() for binary and multi-class classification problems (cost-sensitive classification with class-dependent costs can be handled as well),
- SurvTask() for survival analysis,
- ClusterTask() for cluster analysis,
- MultilabelTask() for multilabel classification problems,
- CostSensTask() for general cost-sensitive classification (with example-specific costs).
To create a task, call make<TaskType>, e.g., makeClassifTask(). All tasks take an identifier (argument id) and a base::data.frame() (argument data). If no ID is provided, it is generated automatically from the variable name of the data. The ID is later used to name results, for example of benchmark experiments, and to annotate plots. Depending on the nature of the learning problem, additional arguments may be required; these are discussed in the following sections.
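The fallback for a missing ID can be checked with a small sketch. The variable name cars.df and the ID "stopping-distance" are chosen here purely for illustration; the data come from datasets::cars.

```r
library(mlr)

# Illustrative data frame; its variable name is what mlr picks up as the ID.
cars.df = datasets::cars

# No id given: the task ID falls back to the name of the data variable.
auto.task = makeRegrTask(data = cars.df, target = "dist")
auto.id = getTaskId(auto.task)

# Explicit id: this label is used for results and plots instead.
named.task = makeRegrTask(id = "stopping-distance", data = cars.df, target = "dist")
named.id = getTaskId(named.task)

print(c(auto.id, named.id))
```

Passing an explicit id is usually the safer choice when the data are created inline or stored under a generic name such as df.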
Regression
For supervised learning like regression (as well as classification and survival analysis) we have to specify the name of the target variable in addition to data.
data(BostonHousing, package = "mlbench")
regr.task = makeRegrTask(id = "bh", data = BostonHousing, target = "medv")
regr.task
## Supervised task: bh
## Type: regr
## Target: medv
## Observations: 506
## Features:
## numerics factors ordered functionals
## 12 1 0 0
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
As you can see, the Task() records the type of the learning problem and basic information about the data set, e.g., the types of the features (base::numeric() vectors, base::factor() or ordered factors), the number of observations, or whether missing values are present.
Creating tasks for classification and survival analysis follows the same scheme; only the data type of the target variables included in data differs. For each of these learning problems some specifics are described below.
Classification
For classification the target column has to be a factor.
In the following example we define a classification task for the mlbench::BreastCancer() data set and exclude the variable Id from all further model fitting and evaluation.
data(BreastCancer, package = "mlbench")
df = BreastCancer
df$Id = NULL
classif.task = makeClassifTask(id = "BreastCancer", data = df, target = "Class")
classif.task
## Supervised task: BreastCancer
## Type: classif
## Target: Class
## Observations: 699
## Features:
## numerics factors ordered functionals
## 0 4 5 0
## Missings: TRUE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
## Classes: 2
## benign malignant
## 458 241
## Positive class: benign
In binary classification the two classes are usually referred to as positive and negative class, with the positive class being the category of greater interest. This is relevant for many performance measures like the true positive rate or ROC analysis. Moreover, mlr, where possible, lets you set options (like setThreshold() or makeWeightedClassesWrapper()) and returns and plots results (like class posterior probabilities) for the positive class only.
makeClassifTask() by default selects the first factor level of the target variable as the positive class, in the above example benign. Class malignant can be manually selected as follows:
classif.task = makeClassifTask(id = "BreastCancer", data = df, target = "Class", positive = "malignant")
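The effect of the positive argument can be inspected through the task description. The following is a minimal sketch on a made-up data frame (toy, with hypothetical labels "no" and "yes"), not on the BreastCancer data.

```r
library(mlr)

# Made-up binary data; the target must be a factor.
toy = data.frame(
  x = c(1.2, 0.7, 3.1, 2.8, 0.3, 2.2),
  y = factor(c("no", "no", "yes", "yes", "no", "yes"))
)

# Default: the first factor level ("no") becomes the positive class.
task.default = makeClassifTask(id = "toy", data = toy, target = "y")
default.positive = getTaskDesc(task.default)$positive

# Explicit choice of the positive class.
task.yes = makeClassifTask(id = "toy", data = toy, target = "y", positive = "yes")
chosen.positive = getTaskDesc(task.yes)$positive
chosen.negative = getTaskDesc(task.yes)$negative

print(c(default.positive, chosen.positive, chosen.negative))
```

Since factor levels are sorted alphabetically by default, relying on the first level can silently pick the less interesting class, so setting positive explicitly is often worthwhile.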
Survival analysis
Survival tasks use two target columns. For left and right censored problems these consist of the survival time and a binary event indicator. For interval censored data the two target columns must be specified in the "interval2" format (see survival::Surv()).
data(cancer, package = "survival")
lung$status = (lung$status == 2) # convert to logical
surv.task = makeSurvTask(data = lung, target = c("time", "status"))
surv.task
## Supervised task: lung
## Type: surv
## Target: time,status
## Events: 165
## Observations: 228
## Features:
## numerics factors ordered functionals
## 8 0 0 0
## Missings: TRUE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
The type of censoring can be specified via the argument censoring, which defaults to "rcens" for right censored data.
Multilabel classification
In multilabel classification each object can belong to more than one category at the same time.
The data are expected to contain as many target columns as there are class labels. The target columns should be logical vectors that indicate which class labels are present. The names of the target columns are taken as class labels and need to be passed to the target argument of makeMultilabelTask().
In the following example we get the data of the yeast data set, extract the label names, and pass them to the target argument in makeMultilabelTask().
yeast = getTaskData(yeast.task)
labels = colnames(yeast)[1:14]
yeast.task = makeMultilabelTask(id = "multi", data = yeast, target = labels)
yeast.task
## Supervised task: multi
## Type: multilabel
## Target: label1,label2,label3,label4,label5,label6,label7,label8,label9,label10,label11,label12,label13,label14
## Observations: 2417
## Features:
## numerics factors ordered functionals
## 103 0 0 0
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
## Classes: 14
## label1 label2 label3 label4 label5 label6 label7 label8 label9 label10
## 762 1038 983 862 722 597 428 480 178 253
## label11 label12 label13 label14
## 289 1816 1799 34
See also the tutorial page multilabel.
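The yeast example starts from an existing task. To illustrate the expected data layout from scratch, here is a minimal sketch with a made-up data frame; the column names x1, x2, sports and politics are hypothetical.

```r
library(mlr)

# Two numeric features plus one logical column per label.
toy = data.frame(
  x1 = c(0.1, 0.4, 0.35, 0.8, 0.9, 0.6),
  x2 = c(1.0, 0.2, 0.7, 0.1, 0.5, 0.3),
  sports = c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE),
  politics = c(FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)
)

# The names of the logical columns are passed as targets and act as class labels.
toy.task = makeMultilabelTask(id = "toy", data = toy, target = c("sports", "politics"))
label.names = getTaskTargetNames(toy.task)
label.values = getTaskTargets(toy.task)

print(label.names)
print(head(label.values))
```

Row 3 has both labels set to TRUE and row 4 has neither, which is exactly what distinguishes multilabel from multi-class classification.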
Cluster analysis
As cluster analysis is unsupervised, the only mandatory argument to construct a cluster analysis task is the data. Below we create a learning task from the data set datasets::mtcars().
data(mtcars, package = "datasets")
cluster.task = makeClusterTask(data = mtcars)
cluster.task
## Unsupervised task: mtcars
## Type: cluster
## Observations: 32
## Features:
## numerics factors ordered functionals
## 11 0 0 0
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
Cost-sensitive classification
The standard objective in classification is to obtain a high prediction accuracy, i.e., to minimize the number of errors. All types of misclassification errors are thereby deemed equally severe. However, in many applications different kinds of errors cause different costs.
In case of class-dependent costs, which solely depend on the actual and predicted class labels, it is sufficient to create an ordinary ClassifTask().
In order to handle example-specific costs it is necessary to generate a CostSensTask(). In this scenario, each example \((x, y)\) is associated with an individual cost vector of length \(K\) with \(K\) denoting the number of classes. The \(k\)-th component indicates the cost of assigning \(x\) to class \(k\). Naturally, it is assumed that the cost of the intended class label \(y\) is minimal.
As the cost vector contains all relevant information about the intended class \(y\), only the feature values \(x\) and a cost matrix, which contains the cost vectors for all examples in the data set, are required to create the CostSensTask().
In the following example we use the datasets::iris() data and an artificial cost matrix (which is generated as proposed by Beygelzimer et al., 2005):
df = iris
cost = matrix(runif(150 * 3, 0, 2000), 150) * (1 - diag(3))[df$Species, ]
df$Species = NULL
costsens.task = makeCostSensTask(data = df, cost = cost)
costsens.task
## Supervised task: df
## Type: costsens
## Observations: 150
## Features:
## numerics factors ordered functionals
## 4 0 0 0
## Missings: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
## Classes: 3
## y1, y2, y3
For more details see the page on cost sensitive classification.
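The one-line construction of the cost matrix is compact, so it may help to unpack it. The sketch below uses base R only; the seed and the intermediate names (mask, raw, cheapest) are introduced here for illustration. It relies on the fact that indexing by a factor uses its integer codes.

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible

species = iris$Species
n = length(species)   # number of examples
k = nlevels(species)  # number of classes

# 1 - diag(k) is a k x k matrix with zeros on the diagonal and ones elsewhere.
# Indexing its rows by the factor picks, for every example, the row of its true
# class, giving an n x k mask with a zero in the true-class column.
mask = (1 - diag(k))[species, ]

# Random costs, zeroed out for the true class of each example.
raw = matrix(runif(n * k, 0, 2000), n)
cost = raw * mask

# The cheapest class in every row is the true class.
true.class = as.integer(species)
cheapest = apply(cost, 1, which.min)
print(table(cheapest == true.class))
```

This matches the assumption stated above that the cost of the intended class label is minimal: it is zero, while every other entry is strictly positive.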
Further settings
The Task() help page also lists several other arguments to describe further details of the learning problem.
For example, we could include a blocking factor in the task. This would indicate that some observations “belong together” and should not be separated when splitting the data into training and test sets for resampling.
Another option is to assign weights to observations. These can simply indicate observation frequencies or result from the sampling scheme used to collect the data. Note that you should use this option only if the weights really belong to the task. If you plan to train some learning algorithms with different weights on the same Task(), mlr offers several other ways to set observation or class weights (for supervised classification). See for example the tutorial page about training or function makeWeightedClassesWrapper().
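As a minimal sketch of both settings, the following creates a regression task with a blocking factor and observation weights. The data frame and the names patient and freq are made up for illustration.

```r
library(mlr)

# Made-up data: two measurements per (hypothetical) patient.
toy = data.frame(
  x = c(1.0, 1.1, 2.0, 2.1, 3.0, 3.1),
  y = c(2.1, 2.0, 3.9, 4.2, 6.1, 5.8)
)

# Observations with the same level stay together during resampling.
patient = factor(c("a", "a", "b", "b", "c", "c"))

# Observation weights, e.g. observation frequencies.
freq = c(1, 1, 2, 2, 1, 3)

task = makeRegrTask(id = "toy", data = toy, target = "y",
  weights = freq, blocking = patient)

# The task description records that both settings are present.
desc = getTaskDesc(task)
print(c(weights = desc$has.weights, blocking = desc$has.blocking))
```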
Accessing a learning task
We provide many operators to access the elements stored in a Task(). The most important ones are listed in the documentation of Task() and getTaskData().
To access the TaskDesc() that contains basic information about the task you can use:
getTaskDesc(classif.task)
## $id
## [1] "BreastCancer"
##
## $type
## [1] "classif"
##
## $target
## [1] "Class"
##
## $size
## [1] 699
##
## $n.feat
## numerics factors ordered functionals
## 0 4 5 0
##
## $has.missings
## [1] TRUE
##
## $has.weights
## [1] FALSE
##
## $has.blocking
## [1] FALSE
##
## $has.coordinates
## [1] FALSE
##
## $class.levels
## [1] "benign" "malignant"
##
## $positive
## [1] "malignant"
##
## $negative
## [1] "benign"
##
## $class.distribution
##
## benign malignant
## 458 241
##
## attr(,"class")
## [1] "ClassifTaskDesc" "SupervisedTaskDesc" "TaskDesc"
Note that TaskDesc() objects have slightly different elements for different types of Task(). Frequently required elements can also be accessed directly.
# Get the ID
getTaskId(classif.task)
## [1] "BreastCancer"
# Get the type of task
getTaskType(classif.task)
## [1] "classif"
# Get the names of the target columns
getTaskTargetNames(classif.task)
## [1] "Class"
# Get the number of observations
getTaskSize(classif.task)
## [1] 699
# Get the number of input variables
getTaskNFeats(classif.task)
## [1] 9
# Get the class levels in classif.task
getTaskClassLevels(classif.task)
## [1] "benign" "malignant"
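To see how task descriptions differ between task types, one can compare their element names. This sketch assumes the predefined example tasks iris.task (classification) and bh.task (regression) that ship with mlr.

```r
library(mlr)

# Task descriptions of a classification and a regression example task.
classif.desc = getTaskDesc(iris.task)
regr.desc = getTaskDesc(bh.task)

# Elements that only the classification description provides.
only.classif = setdiff(names(classif.desc), names(regr.desc))
print(only.classif)
```

Class-related elements such as class.levels only make sense for classification, which is why code that works across task types is better off using accessors like getTaskType() before reaching into the description.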
Moreover, mlr provides several functions to extract data from a Task().
# Accessing the data set in classif.task
str(getTaskData(classif.task))
## 'data.frame': 699 obs. of 10 variables:
## $ Cl.thickness : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 5 5 3 6 4 8 1 2 2 4 ...
## $ Cell.size : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 1 1 2 ...
## $ Cell.shape : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 2 1 1 ...
## $ Marg.adhesion : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 5 1 1 3 8 1 1 1 1 ...
## $ Epith.c.size : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 2 7 2 3 2 7 2 2 2 2 ...
## $ Bare.nuclei : Factor w/ 10 levels "1","2","3","4",..: 1 10 2 4 1 10 10 1 1 1 ...
## $ Bl.cromatin : Factor w/ 10 levels "1","2","3","4",..: 3 3 3 3 3 9 3 3 1 2 ...
## $ Normal.nucleoli: Factor w/ 10 levels "1","2","3","4",..: 1 2 1 7 1 7 1 1 1 1 ...
## $ Mitoses : Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 5 1 ...
## $ Class : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
# Get the names of the input variables in cluster.task
getTaskFeatureNames(cluster.task)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
# Get the values of the target variables in surv.task
head(getTaskTargets(surv.task))
## time status
## 1 306 TRUE
## 2 455 TRUE
## 3 1010 FALSE
## 4 210 TRUE
## 5 883 TRUE
## 6 1022 FALSE
# Get the cost matrix in costsens.task
head(getTaskCosts(costsens.task))
## y1 y2 y3
## [1,] 0 1694.9063 1569.15053
## [2,] 0 995.0545 18.85981
## [3,] 0 775.8181 1558.13177
## [4,] 0 492.8980 1458.78130
## [5,] 0 222.1929 1260.26371
## [6,] 0 779.9889 961.82166
Note that getTaskData() offers many options for converting the data set into a convenient format. This especially comes in handy when you integrate a new learner from another R package into mlr. The function getTaskFormula() is also useful in this regard.
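A few of these options are sketched below on the predefined iris.task (an assumption; any classification task would do). Only the arguments target.extra and features are shown; see the help page of getTaskData() for the rest.

```r
library(mlr)

# Split features and target, as many modelling functions expect x and y separately.
xy = getTaskData(iris.task, target.extra = TRUE)
str(xy$data)
print(head(xy$target))

# Restrict the returned data to selected features (the target column is kept).
petal = getTaskData(iris.task, features = c("Petal.Length", "Petal.Width"))
print(colnames(petal))

# A formula describing the task, for learners with a formula interface.
f = getTaskFormula(iris.task)
print(f)
```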
Modifying a learning task
mlr provides several functions to alter an existing Task(), which is often more convenient than creating a new Task() from scratch. Here are some examples.
# Select observations and/or features
cluster.task = subsetTask(cluster.task, subset = 4:17)
# It may happen, especially after selecting observations, that features are constant.
# These should be removed.
removeConstantFeatures(cluster.task)
## Removing 1 columns: am
## Unsupervised task: mtcars
## Type: cluster
## Observations: 14
## Features:
## numerics factors ordered functionals
## 10 0 0 0
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
# Remove selected features
dropFeatures(surv.task, c("meal.cal", "wt.loss"))
## Supervised task: lung
## Type: surv
## Target: time,status
## Events: 165
## Observations: 228
## Features:
## numerics factors ordered functionals
## 6 0 0 0
## Missings: TRUE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
# Rescale numerical features to the range [0, 1]
task = normalizeFeatures(cluster.task, method = "range")
summary(getTaskData(task))
## mpg cyl disp hp
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.3161 1st Qu.:0.5000 1st Qu.:0.1242 1st Qu.:0.2801
## Median :0.5107 Median :1.0000 Median :0.4076 Median :0.6311
## Mean :0.4872 Mean :0.7143 Mean :0.4430 Mean :0.5308
## 3rd Qu.:0.6196 3rd Qu.:1.0000 3rd Qu.:0.6618 3rd Qu.:0.7473
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## drat wt qsec vs
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.2672 1st Qu.:0.1275 1st Qu.:0.2302 1st Qu.:0.0000
## Median :0.3060 Median :0.1605 Median :0.3045 Median :0.0000
## Mean :0.4544 Mean :0.3268 Mean :0.3752 Mean :0.4286
## 3rd Qu.:0.7026 3rd Qu.:0.3727 3rd Qu.:0.4908 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## am gear carb
## Min. :0.5 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.5 1st Qu.:0.0000 1st Qu.:0.3333
## Median :0.5 Median :0.0000 Median :0.6667
## Mean :0.5 Mean :0.2857 Mean :0.6429
## 3rd Qu.:0.5 3rd Qu.:0.7500 3rd Qu.:1.0000
## Max. :0.5 Max. :1.0000 Max. :1.0000
For more functions and more detailed explanations have a look at the data preprocessing page.
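The modifying functions return a new task and leave the original untouched, so they can be chained. The following sketch assumes the predefined iris.task and uses method = "standardize" (centering and scaling) instead of the "range" method shown above.

```r
library(mlr)

# Keep the first 100 observations and two of the four features.
small = subsetTask(iris.task, subset = 1:100,
  features = c("Sepal.Length", "Petal.Length"))

# Center and scale the remaining numeric features.
scaled = normalizeFeatures(small, method = "standardize")
means = colMeans(getTaskData(scaled, target.extra = TRUE)$data)

# The original task is unchanged.
print(c(original = getTaskSize(iris.task), subset = getTaskSize(small)))
print(round(means, 10))
```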
Example tasks and convenience functions
For your convenience mlr provides pre-defined Task()s for each type of learning problem. These are also used throughout this tutorial in order to get shorter and more readable code. A list of all Task()s can be found in the Appendix.
Moreover, mlr’s function convertMLBenchObjToTask() can generate Task()s from the data sets and data generating functions in package mlbench.
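A short sketch of both uses follows, assuming the mlbench package is installed. The data set name "Ionosphere" and the generator "mlbench.spirals" are just two examples from mlbench; the arguments n and sd are handed on to the generator.

```r
library(mlr)

# From a data set in mlbench, referenced by name.
iono.task = convertMLBenchObjToTask("Ionosphere")

# From a data generating function; n and sd are passed on to mlbench.spirals.
spirals.task = convertMLBenchObjToTask("mlbench.spirals", n = 100, sd = 0.1)

print(getTaskType(iono.task))
print(getTaskSize(spirals.task))
```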