General outlier detection for univariate datasets
Source:vignettes/generaloutlier.Rmd
Introduction to general data outlier detection
Outliers are a general concern even in data that are not intended for species distribution modelling. In such data too, an outlier detection method is often selected ad hoc to detect and remove outliers, which increases subjectivity. We therefore extend the specleanr principle of ensembling multiple outlier detection methods to identify absolute outliers in a dataset, which can later be removed.
The same process is followed, but no data extraction or model performance evaluation is required.
Below is a detailed workflow for objectively detecting and removing outliers in the iris dataset, which ships with the datasets package in R.
library(specleanr)

irisdata1 <- iris
# introduce outlier data and NAs
rowsOutNA1 <- data.frame(x  = c(344, NA, NA, NA),
                         x2 = c(34, 45, 544, NA),
                         x3 = c(584, 5, 554, NA),
                         x4 = c(575, 4554, 474, NA),
                         x5 = c("setosa", "setosa", "setosa", "setosa"))
# give the new rows the same column names as iris so they can be appended
colnames(rowsOutNA1) <- colnames(irisdata1)
dfinal <- rbind(irisdata1, rowsOutNA1)
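Before running any detection, it can help to confirm what was appended. The sketch below uses base R only. It rebuilds the augmented data under illustrative names (`extra_rows` and `check_df` are not part of the vignette) so that it runs on its own, then counts rows and missing values per column.

```r
# Rebuild the augmented data under illustrative names so the check is self-contained
extra_rows <- data.frame(c(344, NA, NA, NA),
                         c(34, 45, 544, NA),
                         c(584, 5, 554, NA),
                         c(575, 4554, 474, NA),
                         rep("setosa", 4))
colnames(extra_rows) <- colnames(iris)
check_df <- rbind(iris, extra_rows)

# Number of rows: 150 original records plus the 4 appended ones
n_rows <- nrow(check_df)

# Missing values per column
na_counts <- colSums(is.na(check_df))

# Share of missing Sepal.Width values among setosa records, in percent
setosa_width <- check_df$Sepal.Width[check_df$Species == "setosa"]
na_percent <- round(100 * mean(is.na(setosa_width)), 2)

print(n_rows)
print(na_counts)
print(na_percent)
```

With these inputs, Sepal.Width has one missing value among the 54 setosa records, which is about 1.85%.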
Detecting outliers in the modified iris dataset
We can either use univariate methods to detect outliers in a single variable, such
as Sepal.Length, or exclude the species column and use
multivariate methods such as isolation forest, Mahalanobis outlier
detection, or one-class support vector machines. To list the
methods available in this package, run
**extractMethods()**
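To illustrate why several univariate methods are combined rather than relying on one, here is a minimal base R sketch. It applies textbook versions of three rules (Tukey's IQR fences, a z-score cut-off of 3, and a Hampel-style cut-off of 3 scaled MADs) to setosa Sepal.Width with three injected extreme values, then keeps the records flagged by at least half of the rules. These are not specleanr's implementations, and `vote_threshold` is a hypothetical name chosen for this example.

```r
# setosa Sepal.Width plus three injected extreme values
x <- c(iris$Sepal.Width[iris$Species == "setosa"], 34, 45, 544)

# Rule 1: Tukey's fences, 1.5 * IQR beyond the quartiles
q <- quantile(x, c(0.25, 0.75), names = FALSE)
fence <- 1.5 * (q[2] - q[1])
iqr_flag <- x < q[1] - fence | x > q[2] + fence

# Rule 2: absolute z-score above 3
z_flag <- abs((x - mean(x)) / sd(x)) > 3

# Rule 3: Hampel-style rule, more than 3 scaled MADs from the median
hampel_flag <- abs(x - median(x)) > 3 * mad(x)

# Ensemble: share of rules that flag each record
flags <- cbind(iqr = iqr_flag, zscore = z_flag, hampel = hampel_flag)
vote_share <- rowMeans(flags)

# Hypothetical threshold: flagged by at least half of the rules
vote_threshold <- 0.5
absolute_outliers <- x[vote_share >= vote_threshold]

print(colSums(flags))
print(absolute_outliers)
```

In this example the z-score rule flags only 544, because the extreme values inflate the mean and standard deviation and mask 34 and 45. The two median- and quartile-based rules still flag them, so all three injected values pass the vote.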
NOTE
- Because this is a general univariate analysis rather than species distribution modelling, the parameter sdm is set to FALSE.
- Multivariate outlier detection methods are not needed for univariate datasets. The function extractMethods() lists the methods available.
- Set na.inform to TRUE to report how NAs are handled in the dataset. If the proportion of NAs in a column is greater than the missingness parameter, that column is removed. Otherwise, the rows with NAs are removed using na.omit so that the outlier detection methods do not fail. In summary, increasing missingness may lead to the loss of many rows, especially if any column has many missing values.
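The missing-value rule described in the note can be sketched in base R as follows. This is only an illustration of the described behaviour, assuming the threshold is compared against the per-column proportion of NAs. `handle_missing()` and the `toy` data are hypothetical and are not specleanr code.

```r
# Hypothetical helper illustrating the rule described above; not specleanr code
handle_missing <- function(df, missingness = 0.1) {
  # proportion of NAs in each column
  na_share <- colMeans(is.na(df))
  # columns above the threshold are dropped entirely
  keep <- na_share <= missingness
  dropped <- names(df)[!keep]
  # remaining rows with any NA are dropped with na.omit
  cleaned <- na.omit(df[, keep, drop = FALSE])
  list(data = cleaned, dropped_columns = dropped, na_share = na_share)
}

# Toy data: column a has 1 NA in 12 values, column b has 3 NAs in 12 values
toy <- data.frame(a = c(1, NA, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
                  b = c(NA, NA, NA, 4, 5, 6, 7, 8, 9, 10, 11, 12))

strict <- handle_missing(toy, missingness = 0.1)  # b exceeds 0.1 and is dropped
loose  <- handle_missing(toy, missingness = 0.3)  # both columns are kept

print(strict$dropped_columns)
print(nrow(strict$data))
print(nrow(loose$data))
```

With the lower threshold, column b is dropped and only one row is lost. With the higher threshold, b is retained and three rows are lost instead, which is the trade-off the note describes.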
Keep only the setosa data before outlier detection
setosadf <- dfinal[dfinal$Species %in% "setosa", c("Sepal.Width", "Species")]
setosa_outlier_detection <- multidetect(
  data = setosadf,
  var = "Sepal.Width",
  multiple = FALSE,
  methods = c("adjbox", "iqr", "hampel", "jknife",
              "seqfences", "mixediqr",
              "distboxplot", "semiqr",
              "zscore", "logboxplot", "medianrule"),
  silence_true_errors = FALSE,
  missingness = 0.1,
  sdm = FALSE,
  na.inform = TRUE
)
#> 1 (1.85%) NAs removed for parameter Sepal.Width.
#extractMethods()
Obtaining a quality-controlled dataset using the loess method or data labelling
setosa_qc_loess <- extract_clean_data(refdata = setosadf,
                                      outliers = setosa_outlier_detection,
                                      loess = TRUE)
#clean dataset
nrow(setosa_qc_loess)
#> [1] 51
#reference data
nrow(setosadf)
#> [1] 54
setosa_qc_labeled <- classify_data(refdata = setosadf, outliers = setosa_outlier_detection)
Visualise the labelled, quality-controlled dataset
ggenvironmentalspace(setosa_qc_labeled,
                     type = '1D',
                     ggxangle = 45,
                     scalecolor = 'viridis',
                     xhjust = 1,
                     legend_position = 'blank',
                     ylab = "Number of records",
                     xlab = "Outlier labels")
For multiple species
NOTE
- For multiple groups, set the parameter multiple to TRUE and provide the grouping column in var_col, as demonstrated below.
multspp_outlier_detection <- multidetect(
  data = dfinal,
  var = "Sepal.Width",
  multiple = TRUE,
  var_col = "Species",
  methods = c("adjbox", "iqr", "hampel", "jknife",
              "seqfences", "mixediqr",
              "distboxplot", "semiqr",
              "zscore", "logboxplot", "medianrule"),
  silence_true_errors = FALSE,
  missingness = 0.1,
  sdm = FALSE,
  na.inform = TRUE
)
#> 1 (1.85%) NAs removed for parameter Sepal.Width.
#> 0 (0%) NAs removed for parameter Sepal.Width.
#> 0 (0%) NAs removed for parameter Sepal.Width.
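The three messages, one per species, suggest that each group is processed separately. The idea of per-group detection can be sketched in base R with a single textbook rule (Tukey's IQR fences) applied within each species. This is an illustration only, not how specleanr implements grouping. `flag_iqr()`, `extra`, and `grouped_df` are hypothetical names.

```r
# Sepal.Width and Species, plus three injected extreme setosa values
extra <- data.frame(Sepal.Width = c(34, 45, 544), Species = "setosa")
grouped_df <- rbind(iris[, c("Sepal.Width", "Species")], extra)

# Hypothetical helper: TRUE where a value lies beyond Tukey's fences
flag_iqr <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE, na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# ave() applies the rule within each species and returns results in row order;
# it returns 0/1 here, so convert back to logical
grouped_df$flagged <- as.logical(
  ave(grouped_df$Sepal.Width, grouped_df$Species, FUN = flag_iqr)
)

# Number of flagged records per species
print(tapply(grouped_df$flagged, grouped_df$Species, sum))
```

Because the fences are computed within each species, a value is judged against its own group rather than against the pooled data.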
Obtaining a quality-controlled dataset using the loess method or data labelling
multsp_qc_loess <- extract_clean_data(refdata = dfinal,
                                      outliers = multspp_outlier_detection,
                                      var_col = 'Species',
                                      loess = TRUE)
#clean dataset
nrow(multsp_qc_loess)
#> [1] 151
#reference data
nrow(dfinal)
#> [1] 154
multi_qc_labeled <- classify_data(refdata = dfinal,
                                  outliers = multspp_outlier_detection,
                                  var_col = 'Species')
Visualise the labelled, quality-controlled dataset
ggenvironmentalspace(multi_qc_labeled,
                     type = '1D',
                     ggxangle = 45,
                     scalecolor = 'viridis',
                     xhjust = 1,
                     legend_position = 'blank',
                     ylab = "Number of records",
                     xlab = "Outlier labels")
The package is undergoing peer review for publication.