`vignettes/tutorial/over_and_undersampling.Rmd`

`over_and_undersampling.Rmd`

In case of *binary classification* strongly imbalanced classes often lead to unsatisfactory results regarding the prediction of new observations, especially for the small class. In this context *imbalanced classes* simply means that the number of observations of one class (usu. positive or majority class) by far exceeds the number of observations of the other class (usu. negative or minority class). This setting can be observed fairly often in practice and in various disciplines like credit scoring, fraud detection, medical diagnostics or churn management.

Most classification methods work best when the number of observations per class are roughly equal. The problem with *imbalanced classes* is that because of the dominance of the majority class classifiers tend to ignore cases of the minority class as noise and therefore predict the majority class far more often. In order to lay more weight on the cases of the minority class, there are numerous correction methods which tackle the *imbalanced classification problem*. These methods can generally be divided into *cost- and sampling-based approaches*. Below all methods supported by `mlr`

are introduced.

The basic idea of *sampling methods* is to simply adjust the proportion of the classes in order to increase the weight of the minority class observations within the model.

The *sampling-based approaches* can be divided further into three different categories:

**Undersampling methods**: Elimination of randomly chosen cases of the majority class to decrease their effect on the classifier. All cases of the minority class are kept.**Oversampling methods**: Generation of additional cases (copies, artificial observations) of the minority class to increase their effect on the classifier. All cases of the majority class are kept.**Hybrid methods**: Mixture of under- and oversampling strategies.

All these methods directly access the underlying data and “rearrange” it. In this way the sampling is done as part of the *preprocesssing* and can therefore be combined with every appropriate classifier.

`mlr`

currently supports the first two approaches.

As mentioned above *undersampling* always refers to the majority class, while *oversampling* affects the minority class. By the use of *undersampling*, randomly chosen observations of the majority class are eliminated. Through (simple) *oversampling* all observations of the minority class are considered at least once when fitting the model. In addition, exact copies of minority class cases are created by random sampling with repetitions.

First, let’s take a look at the effect for a classification task. Based on a simulated `ClassifTask`

(`Task()`

) with imbalanced classes two new tasks (`task.over`

, `task.under`

) are created via `mlr`

functions `oversample()`

and `undersample()`

, respectively.

data.imbal.train = rbind( data.frame(x = rnorm(100, mean = 1), class = "A"), data.frame(x = rnorm(5000, mean = 2), class = "B") ) task = makeClassifTask(data = data.imbal.train, target = "class") task.over = oversample(task, rate = 8) task.under = undersample(task, rate = 1/8) table(getTaskTargets(task)) ## ## A B ## 100 5000 table(getTaskTargets(task.over)) ## ## A B ## 800 5000 table(getTaskTargets(task.under)) ## ## A B ## 100 625

Please note that the *undersampling rate* has to be between 0 and 1, where 1 means no undersampling and 0.5 implies a reduction of the majority class size to 50 percent. Correspondingly, the *oversampling rate* must be greater or equal to 1, where 1 means no oversampling and 2 would result in doubling the minority class size.

As a result the performance should improve if the model is applied to new data.

lrn = makeLearner("classif.rpart", predict.type = "prob") mod = train(lrn, task) mod.over = train(lrn, task.over) mod.under = train(lrn, task.under) data.imbal.test = rbind( data.frame(x = rnorm(10, mean = 1), class = "A"), data.frame(x = rnorm(500, mean = 2), class = "B") ) performance(predict(mod, newdata = data.imbal.test), measures = list(mmce, ber, auc)) ## mmce ber auc ## 0.01960784 0.50000000 0.50000000 performance(predict(mod.over, newdata = data.imbal.test), measures = list(mmce, ber, auc)) ## mmce ber auc ## 0.04705882 0.41600000 0.69080000 performance(predict(mod.under, newdata = data.imbal.test), measures = list(mmce, ber, auc)) ## mmce ber auc ## 0.0372549 0.5090000 0.7265000

In this case the *performance measure* has to be considered very carefully. As the *misclassification rate* (mmce) evaluates the overall accuracy of the predictions, the *balanced error rate* (ber) and *area under the ROC Curve* (auc) might be more suitable here, as the misclassifications within each class are separately taken into account.

Alternatively, `mlr`

also offers the integration of over- and undersampling via a wrapper approach. This way over- and undersampling can be applied to already existing learners to extend their functionality.

The example given above is repeated once again, but this time with extended learners instead of modified tasks (see `makeOversampleWrapper()`

and `makeUndersampleWrapper()`

). Just like before the *undersampling rate* has to be between 0 and 1, while the *oversampling rate* has a lower boundary of 1.

lrn.over = makeOversampleWrapper(lrn, osw.rate = 8) lrn.under = makeUndersampleWrapper(lrn, usw.rate = 1/8) mod = train(lrn, task) mod.over = train(lrn.over, task) mod.under = train(lrn.under, task) performance(predict(mod, newdata = data.imbal.test), measures = list(mmce, ber, auc)) ## mmce ber auc ## 0.01960784 0.50000000 0.50000000 performance(predict(mod.over, newdata = data.imbal.test), measures = list(mmce, ber, auc)) ## mmce ber auc ## 0.02941176 0.45600000 0.76880000 performance(predict(mod.under, newdata = data.imbal.test), measures = list(mmce, ber, auc)) ## mmce ber auc ## 0.03529412 0.31200000 0.79960000

Two extensions to (simple) oversampling are available in `mlr`

.

As the duplicating of the minority class observations can lead to overfitting, within *SMOTE* the “new cases” are constructed in a different way. For each new observation, one randomly chosen minority class observation as well as one of its *randomly chosen next neighbours* are interpolated, so that finally a new *artificial observation* of the minority class is created. The `smote()`

function in `mlr`

handles numeric as well as factor features, as the gower distance is used for nearest neighbour calculation. The factor level of the new artificial case is sampled from the given levels of the two input observations.

Analogous to oversampling, *SMOTE preprocessing* is possible via modification of the task.

task.smote = smote(task, rate = 8, nn = 5) table(getTaskTargets(task)) ## ## A B ## 100 5000 table(getTaskTargets(task.smote)) ## ## A B ## 800 5000

Alternatively, a new wrapped learner can be created via `makeSMOTEWrapper()`

.

lrn.smote = makeSMOTEWrapper(lrn, sw.rate = 8, sw.nn = 5) mod.smote = train(lrn.smote, task) performance(predict(mod.smote, newdata = data.imbal.test), measures = list(mmce, ber, auc)) ## mmce ber auc ## 0.03137255 0.50600000 0.70930000

By default the number of nearest neighbours considered within the algorithm is set to 5.

Another extension of oversampling consists in the combination of sampling with the bagging approach. For each iteration of the bagging process, minority class observations are oversampled with a given rate in `obw.rate`

. The majority class cases can either all be taken into account for each iteration (`obw.maxcl = "all"`

) or bootstrapped with replacement to increase variability between training data sets during iterations (`obw.maxcl = "boot"`

).

The construction of the **Overbagging Wrapper** works similar to `makeBaggingWrapper()`

. First an existing `mlr`

learner has to be passed to `makeOverBaggingWrapper()`

. The number of iterations or fitted models can be set via `obw.iters`

.

lrn = makeLearner("classif.rpart", predict.type = "response") obw.lrn = makeOverBaggingWrapper(lrn, obw.rate = 8, obw.iters = 3)

For *binary classification* the prediction is based on majority voting to create a discrete label. Corresponding probabilities are predicted by considering the proportions of all the predicted labels. Please note that the benefit of the sampling process is *highly dependent* on the specific learner as shown in the following example.

First, let’s take a look at the tree learner with and without overbagging:

lrn = setPredictType(lrn, "prob") rdesc = makeResampleDesc("CV", iters = 5) r1 = resample(learner = lrn, task = task, resampling = rdesc, show.info = FALSE, measures = list(mmce, ber, auc)) r1$aggr ## mmce.test.mean ber.test.mean auc.test.mean ## 0.01960784 0.50000000 0.50000000 obw.lrn = setPredictType(obw.lrn, "prob") r2 = resample(learner = obw.lrn, task = task, resampling = rdesc, show.info = FALSE, measures = list(mmce, ber, auc)) r2$aggr ## mmce.test.mean ber.test.mean auc.test.mean ## 0.0327451 0.4781435 0.5361176

Now let’s consider a *random forest* as initial learner:

lrn = makeLearner("classif.ranger") obw.lrn = makeOverBaggingWrapper(lrn, obw.rate = 8, obw.iters = 3) lrn = setPredictType(lrn, "prob") r1 = resample(learner = lrn, task = task, resampling = rdesc, show.info = FALSE, measures = list(mmce, ber, auc)) r1$aggr ## mmce.test.mean ber.test.mean auc.test.mean ## 0.03196078 0.49333337 0.58766687 obw.lrn = setPredictType(obw.lrn, "prob") r2 = resample(learner = obw.lrn, task = task, resampling = rdesc, show.info = FALSE, measures = list(mmce, ber, auc)) r2$aggr ## mmce.test.mean ber.test.mean auc.test.mean ## 0.04117647 0.49834049 0.49953403

While *overbagging* slighty improves the performance of the *decision tree*, the AUC decreases in the second example when additional overbagging is applied. As RF itself is already a strong learner (and a bagged one as well), a further bagging step isn’t very helpful here and usually won’t improve the model.

In binary classification, the default probability value at which a prediction is either classified as “1” or “0” is 0.50. This means with an estimate of >= 0.50 the observation is put into class “1” while lower values get assigned class “0”. To reach a better performance in binary classification, it can be helpful to also optimize the the probability threshold at which this split is made. This can be especially helpful if the response is unbalanced. To enable this, argument `tune.threshold`

needs to be set to `TRUE`

in the chosen `makeTuneControl*`

function.

```
## Resampling: cross-validation
## Measures: mmce
## [Resample] iter 1: 0.0541069
## [Resample] iter 2: 0.0606654
## [Resample] iter 3: 0.0475880
##
## Aggregated Result: mmce.test.mean=0.0541201
##
```

lrn = makeLearner("classif.gbm", predict.type = "prob", distribution = "bernoulli") ps = makeParamSet( makeIntegerParam("interaction.depth", lower = 1, upper = 5) ) ctrl = makeTuneControlRandom(maxit = 2, tune.threshold = TRUE) lrn = makeTuneWrapper(lrn, par.set = ps, control = ctrl, resampling = cv2) r = resample(lrn, spam.task, cv3, extract = getTuneResult) ## Resampling: cross-validation ## Measures: mmce ## [Tune] Started tuning learner classif.gbm for parameter set: ## Type len Def Constr Req Tunable Trafo ## interaction.depth integer - - 1 to 5 - TRUE - ## With control class: TuneControlRandom ## Imputation value: 1 ## [Tune-x] 1: interaction.depth=5 ## [Tune-y] 1: mmce.test.mean=0.0599739; time: 0.1 min ## [Tune-x] 2: interaction.depth=3 ## [Tune-y] 2: mmce.test.mean=0.0625815; time: 0.0 min ## [Tune] Result: interaction.depth=5 : mmce.test.mean=0.0599739 ## [Resample] iter 1: 0.0482714 ## [Tune] Started tuning learner classif.gbm for parameter set: ## Type len Def Constr Req Tunable Trafo ## interaction.depth integer - - 1 to 5 - TRUE - ## With control class: TuneControlRandom ## Imputation value: 1 ## [Tune-x] 1: interaction.depth=1 ## [Tune-y] 1: mmce.test.mean=0.0652075; time: 0.0 min ## [Tune-x] 2: interaction.depth=3 ## [Tune-y] 2: mmce.test.mean=0.0586882; time: 0.0 min ## [Tune] Result: interaction.depth=3 : mmce.test.mean=0.0586882 ## [Resample] iter 2: 0.0560626 ## [Tune] Started tuning learner classif.gbm for parameter set: ## Type len Def Constr Req Tunable Trafo ## interaction.depth integer - - 1 to 5 - TRUE - ## With control class: TuneControlRandom ## Imputation value: 1 ## [Tune-x] 1: interaction.depth=3 ## [Tune-y] 1: mmce.test.mean=0.0577155; time: 0.0 min ## [Tune-x] 2: interaction.depth=2 ## [Tune-y] 2: mmce.test.mean=0.0606496; time: 0.0 min ## [Tune] Result: interaction.depth=3 : mmce.test.mean=0.0577155 ## [Resample] iter 3: 0.0625815 ## ## Aggregated Result: mmce.test.mean=0.0556385 ## print(r$extract) ## [[1]] ## Tune result: ## Op. pars: interaction.depth=5 ## Threshold: 0.48 ## mmce.test.mean=0.0599739 ## ## [[2]] ## Tune result: ## Op. pars: interaction.depth=3 ## Threshold: 0.48 ## mmce.test.mean=0.0586882 ## ## [[3]] ## Tune result: ## Op. pars: interaction.depth=3 ## Threshold: 0.52 ## mmce.test.mean=0.0577155

In the above script the tuning is (of course) nested. What happens is: The tuner evaluates a certain learner configuration via (inner) two-fold CV. On these predictions then the optimal threshold is selected, for this learner config (by calling `tuneThreshold()`

on the `ResamplePrediction`

object, which was generated in the configuration evaluation). For the optimal learner config, at the end of tuning, we also know its selected threshold. The model is then trained on the complete outer training data set with the threshold set from the tuning and the prediction is made on the outer test set.

In contrast to sampling, *cost-based approaches* usually require particular learners, which can deal with different *class-dependent costs* Cost-Sensitive Classification.

Another approach independent of the underlying classifier is to assign the costs as *class weights*, so that each observation receives a weight, depending on the class it belongs to. Similar to the sampling-based approaches, the effect of the minority class observations is thereby increased simply by a higher weight of these instances and vice versa for majority class observations.

In this way every learner which supports weights can be extended through the wrapper approach. If the learner does not have a direct parameter for class weights, but supports observation weights, the weights depending on the class are internally set in the wrapper.

lrn = makeLearner("classif.logreg") wcw.lrn = makeWeightedClassesWrapper(lrn, wcw.weight = 0.01)

For binary classification, the single number passed to the classifier corresponds to the weight of the positive / majority class, while the negative / minority class receives a weight of 1. So actually, no real costs are used within this approach, but the cost ratio is taken into account.

If the underlying learner already has a parameter for class weighting (e.g., `class.weights`

in `"classif.ksvm"`

), the `wcw.weight`

is basically passed to the specific class weighting parameter.

lrn = makeLearner("classif.ksvm") wcw.lrn = makeWeightedClassesWrapper(lrn, wcw.weight = 0.01)