Summarizes a data.frame column by column, returning one row of statistics per column rather than the per-column table produced by R's base summary() function. The function is mainly useful as a basic EDA tool on data.frames before they are converted to tasks, but it can be used on tasks as well.
Columns can be of type numeric, integer, logical, factor, or character. Characters and logicals will be treated as factors.
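As a rough base-R illustration of what "treated as factors" means, the sketch below converts a logical and a character column with as.factor() and counts the resulting levels. The data.frame df and the use of as.factor() are assumptions made for illustration; the package's internal conversion may differ in detail.

```r
# Hypothetical toy data with one logical and one character column
df <- data.frame(
  flag = c(TRUE, FALSE, TRUE, NA),
  city = c("Oslo", "Rome", "Oslo", "Oslo"),
  stringsAsFactors = FALSE
)

# Assumption: categorical handling behaves like as.factor() on each column
as_cat <- lapply(df, as.factor)

# Number of levels per column (NA is not a level)
n_levels <- vapply(as_cat, nlevels, integer(1))

# Category sizes per column; table() drops NA by default
sizes <- lapply(as_cat, table)

print(n_levels)
print(sizes)
```

Under this assumption, each converted column would be summarized by its category sizes and number of levels rather than by mean or median.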
Arguments
- obj
(data.frame | Task)
Input data.
Value
(data.frame). One row per input column, with the following columns:
- name
Name of column.
- type
Data type of column.
- na
Number of NAs in column.
- disp
Measure of dispersion: the standard deviation (sd) for numeric and integer columns, the qualitative variation for categorical columns.
- mean
Mean value of column, NA for categorical columns.
- median
Median value of column, NA for categorical columns.
- mad
MAD of column, NA for categorical columns.
- min
Minimum value of column; for categorical columns, the size of the smallest category.
- max
Maximum value of column; for categorical columns, the size of the largest category.
- nlevs
For categorical columns, the number of factor levels; 0 otherwise (as in the example output for the numeric iris columns).
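The statistics listed above can be approximated with base R alone. The sketch below recomputes them for one numeric and one categorical column of iris. It is not the package's implementation: in particular, the formula used here for the qualitative variation (one minus the relative frequency of the most common category) is an assumption. It matches the value 0.6666667 reported for Species in the Examples, but other indices give the same result when all categories are equally frequent.

```r
# Numeric column: statistics from base R
x <- iris$Sepal.Length
num_stats <- c(
  na     = sum(is.na(x)),
  mean   = mean(x),
  disp   = sd(x),      # standard deviation as dispersion
  median = median(x),
  mad    = mad(x),     # default scaling constant 1.4826
  min    = min(x),
  max    = max(x)
)

# Categorical column: statistics are based on category sizes
f <- iris$Species
counts <- table(f)

# Assumption: qualitative variation = 1 - share of the modal category
variation_ratio <- 1 - max(counts) / sum(counts)

cat_stats <- c(
  na    = sum(is.na(f)),
  disp  = variation_ratio,
  min   = min(counts),  # size of the smallest category
  max   = max(counts),  # size of the largest category
  nlevs = nlevels(f)
)

print(num_stats)
print(cat_stats)
```

The values in num_stats and cat_stats can be compared directly with the Sepal.Length and Species rows of the example output.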
See also
Other eda_and_preprocess: capLargeValues(), createDummyFeatures(), dropFeatures(), mergeSmallFactorLevels(), normalizeFeatures(), removeConstantFeatures(), summarizeLevels()
Examples
summarizeColumns(iris)
#>           name    type na     mean      disp median     mad  min  max nlevs
#> 1 Sepal.Length numeric  0 5.843333 0.8280661   5.80 1.03782  4.3  7.9     0
#> 2  Sepal.Width numeric  0 3.057333 0.4358663   3.00 0.44478  2.0  4.4     0
#> 3 Petal.Length numeric  0 3.758000 1.7652982   4.35 1.85325  1.0  6.9     0
#> 4  Petal.Width numeric  0 1.199333 0.7622377   1.30 1.03782  0.1  2.5     0
#> 5      Species  factor  0       NA 0.6666667     NA      NA 50.0 50.0     3