This function ensembles multiple outlier detection methods so that the outliers flagged by each method can be compared.

Usage

multidetect(
  data,
  var,
  select = NULL,
  output = "outlier",
  exclude = NULL,
  multiple,
  var_col = NULL,
  optpar = list(optdf = NULL, ecoparam = NULL, optspcol = NULL, direction = NULL, maxcol
    = NULL, mincol = NULL, maxval = NULL, minval = NULL, checkfishbase = FALSE, mode =
    NULL, lat = NULL, lon = NULL, pct = 80, warn = FALSE),
  kmpar = list(k = 6, method = "silhouette", mode = "soft"),
  ifpar = list(cutoff = 0.5, size = 0.7),
  mahalpar = list(mode = "soft"),
  jkpar = list(mode = "soft"),
  zpar = list(type = "mild", mode = "soft"),
  gloshpar = list(k = 3, metric = "manhattan", mode = "soft"),
  knnpar = list(metric = "manhattan", mode = "soft"),
  lofpar = list(metric = "manhattan", mode = "soft", minPts = 10),
  methods,
  bootSettings = list(run = FALSE, nb = 5, maxrecords = 30, seed = 1135, th = 0.6),
  pc = list(exec = FALSE, npc = 2, q = T, pcvar = "PC1"),
  verbose = FALSE,
  spname = NULL,
  warn = FALSE,
  missingness = 0.1,
  silence_true_errors = TRUE,
  sdm = TRUE,
  na.inform = FALSE
)

Arguments

data

dataframe or list. Data sets for multiple or single species after extraction of environmental predictors.

var

character. The variable to check for outliers, ideally one that directly affects species distribution, such as maximum temperature of the coldest month for bioclimatic variables (IUCN Standards and Petitions Committee, 2022) or stream power index for hydromorphological parameters (Logez et al., 2012). This parameter is required for univariate outlier detection methods such as the z-score.

select

vector. The columns to use in outlier detection. Only numeric columns are accepted.

output

character. Either clean, to return a data set with no outliers, or outlier, to return a dataframe with the outliers. Default outlier.

exclude

vector. Variables to exclude when fitting the one-class model, for example the x and y columns, latitude/longitude, or any other column that the user does not want to consider.

multiple

logical. Set to TRUE if multiple species are considered and to FALSE for a single species.

var_col

string. The column with species names if the species dataset is a dataframe rather than a list. See pred_extract for extracting environmental data.

optpar

list. Parameters for species optimal ranges, such as temperature ranges. For details, check ecological_ranges.

kmpar

list. Parameters for k-means clustering like method and number of clusters for tuning. For details, check xkmeans.

ifpar

list. Isolation forest parameter settings. The required parameters are the cutoff used to denote outliers, which ranges from 0 to 1 (default 0.5), and the size of the data partition used for training. For more details, check Liu et al. (2008).

mahalpar

list. Parameters for the Mahalanobis distance, including the mode of output. For details, check mahal.

jkpar

list. Parameters for reverse jackknifing, mainly the mode used. For details, check jknife.

zpar

list. Parameters for the z-score, such as mode and x parameter. For details, check zscore.

gloshpar

list. Parameters for the global-local outlier score from hierarchies (GLOSH), such as the distance metric used. For details, check xglosh.

knnpar

list. Parameters for varying the distance metric, such as Euclidean or Manhattan distance. For details, check xknn.

lofpar

list. Parameters for the local outlier factor, such as the distance metric and the mode of method implementation (robust or soft). For details, check xlof.

methods

vector. Outlier detection methods considered. Use extractMethods to get outlier detection methods implemented in this package.

bootSettings

list. Parameters for bootstrapping, mostly for species with fewer than 30 records. For details, check boots.

pc

list. Parameters for principal component analysis for dimension reduction. For details, check pca.

verbose

logical. Whether to return messages or not. Default FALSE.

spname

string. The name of the species being handled.

warn

logical. Whether to return warnings or not. Default FALSE.

missingness

numeric. The proportion of missing values allowed in a column, which determines whether individual columns or rows are removed from the data sets. Default 0.1: if a column has more than 10% missing values, the column is removed from the dataset rather than the rows.

silence_true_errors

logical. Controls whether execution errors are shown. When they are shown, the run for multiple species will break if one of the methods fails to execute. Default TRUE.

sdm

logical. If TRUE, strict data checks are done, including removing all non-numeric columns from the datasets before outliers are identified. If FALSE, non-numeric columns are left in the data, but the variable of concern is checked to confirm that it is numeric, and only univariate methods are allowed. Check broad_classify for the broad categories of the methods allowed.

na.inform

logical. Whether to report the NAs removed when processing general datasets. Default FALSE.
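
The column-versus-row rule described for the missingness argument can be sketched in base R. This is only an illustration of the rule as documented, not the package's internal code; the toy data frame and the names na_share, kept and cleaned are made up for the example.

```r
# Toy data with 20 records (values are made up for illustration).
n <- 20
df <- data.frame(a = as.numeric(1:n), b = as.numeric(1:n), c = as.numeric(1:n))
df$a[3] <- NA      # column a: 1 of 20 values missing (5%)
df$b[1:6] <- NA    # column b: 6 of 20 values missing (30%)

missingness <- 0.1

# Share of missing values in each column.
na_share <- colMeans(is.na(df))

# Columns above the threshold are dropped as a whole ...
kept <- df[, na_share <= missingness, drop = FALSE]

# ... and the remaining NAs are handled by dropping the affected rows.
cleaned <- kept[stats::complete.cases(kept), , drop = FALSE]

names(kept)
nrow(cleaned)
```

Here column b is removed because 30% of its values are missing, while the single NA in column a costs only one row.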

Value

A list of outliers or a clean dataset of class datacleaner. The following attributes are associated with the datacleaner class returned by multidetect.

  • result: dataframe or list of dataframes with the outliers flagged by each method.

  • mode: logical. Indicates whether multiple was set to TRUE or FALSE.

  • varused: character. Indicating the variable used for the univariate outlier detection methods.

  • out: character. Whether the user requested the outliers or the data with no outliers.

  • methodsused: vector. The different methods used in the outlier detection process.

  • dfname: character. The dataset name for the species records.

  • exclude: vector. The columns which were excluded during outlier detection, if any.

Details

This function runs different outlier detection methods, including univariate, multivariate and species ecological range methods, to enable seamless comparison of the outliers detected by each method and of the similarities between them. This can be done for a single species or for multiple species, supplied as a dataframe or a list of dataframes. The outliers can then be extracted using the extract_clean_data function.
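
The idea of comparing the records flagged by several methods can be sketched in base R. This is a minimal illustration, not how multidetect is implemented: the data are made up, the three univariate rules are textbook versions, and the z-score cutoff of 2 and the 0.6 agreement threshold are assumptions chosen for the example.

```r
# Toy variable with two obvious extremes (records 9 and 10); values are made up.
x <- c(10.1, 9.8, 10.4, 10.0, 9.9, 10.2, 10.3, 9.7, 25.0, -4.0)

# Rule 1: z-score, flagging |z| > 2 (cutoff is an assumption).
z_flag <- abs((x - mean(x)) / sd(x)) > 2

# Rule 2: Tukey's fences, 1.5 * IQR beyond the quartiles.
q <- unname(quantile(x, c(0.25, 0.75)))
iqr_flag <- x < q[1] - 1.5 * (q[2] - q[1]) | x > q[2] + 1.5 * (q[2] - q[1])

# Rule 3: Hampel identifier, more than 3 scaled MADs from the median.
hampel_flag <- abs(x - median(x)) > 3 * mad(x)

# One column per method, one row per record.
flags <- cbind(zscore = z_flag, iqr = iqr_flag, hampel = hampel_flag)

# Share of methods that flag each record.
share <- rowMeans(flags)

# Records flagged by at least 60% of the methods (threshold is an assumption).
consensus <- which(share >= 0.6)
consensus
```

Looking at the share of methods that agree, rather than at a single rule, makes it easier to separate records that every method finds suspicious from those flagged by one method only.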

References

  1. IUCN Standards and Petitions Committee. (2022). The IUCN Red List of Threatened Species™: Guidelines for Using the IUCN Red List Categories and Criteria. Prepared by the Standards and Petitions Committee of the IUCN Species Survival Commission. https://www.iucnredlist.org/documents/RedListGuidelines.pdf.

  2. Liu, F. T., Ting, K. M., & Zhou, Z. H. (2008, December). Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining (pp. 413-422). IEEE.

Examples


if (FALSE) { # \dontrun{

data("efidata")
data("jdsdata")

matchdata <- match_datasets(datasets = list(jds = jdsdata, efi=efidata),
                            lats = 'lat',
                            lons = 'lon',
                            species = c('speciesname','scientificName'),
                            date = c('Date', 'sampling_date'),
                            country = c('JDS4_site_ID'))


datacheck <- check_names(matchdata, colsp = 'species', pct = 90, merge =TRUE)


danube <- system.file('extdata/danube.shp.zip', package='specleanr')

db <- sf::st_read(danube, quiet=TRUE)


worldclim <- terra::rast(system.file('extdata/worldclim.tiff', package='specleanr'))

rdata <- pred_extract(data = datacheck,
                     raster= worldclim ,
                     lat = 'decimalLatitude',
                     lon= 'decimalLongitude',
                     colsp = 'speciescheck',
                     bbox = db,
                     minpts = 10,
                     list=TRUE,
                     merge = FALSE)


out_df <- multidetect(data = rdata, multiple = TRUE,
                     var = 'bio6',
                     output = 'outlier',
                     exclude = c('x','y'),
                     methods = c('zscore', 'adjbox','iqr', 'semiqr','hampel', 'kmeans',
                                'logboxplot', 'lof','iforest', 'mahal', 'seqfences'))


# optimal ranges for multidetect (values are made up)

optdata <- data.frame(species= c("Salmo trutta", "Abramis brama"),
                      mintemp = c(6, 1.6),maxtemp = c(20, 21),
                       meantemp = c(8.5, 10.4), #ecoparam
                      direction = c('greater', 'greater'))
#species record

salmoabramis <- rdata["Salmo trutta"]

# even for a single species, set multiple to TRUE, since the data come from the pred_extract function as a list

out_df <- multidetect(data = salmoabramis, multiple = TRUE,
                      var = 'bio1',
                      output = 'outlier',
                      exclude = c('x','y'),
                      methods = c('zscore', 'adjbox','iqr', 'semiqr','hampel', 'kmeans',
                                  'logboxplot', 'lof','iforest', 'mahal', 'seqfences', 'optimal'),
                      optpar = list(optdf=optdata, optspcol = 'species',
                                    mincol = "mintemp", maxcol = "maxtemp"))
#plot the number of outliers

#ggoutliers(out_df, 1)

} # }
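
The ecological-range check behind the optimal ranges example can be sketched in base R. This is a hedged illustration only, not the package's ecological_ranges implementation: the records data frame is hypothetical, and the rule shown is simply "flag values outside the supplied minimum and maximum".

```r
# Made-up optimal temperature ranges, mirroring the optdata table above.
optdata <- data.frame(species = c("Salmo trutta", "Abramis brama"),
                      mintemp = c(6, 1.6),
                      maxtemp = c(20, 21))

# Hypothetical species records with an annual mean temperature column.
records <- data.frame(species = c("Salmo trutta", "Salmo trutta",
                                  "Abramis brama", "Abramis brama"),
                      bio1 = c(8.5, 23.1, 0.9, 10.4))

# Attach each species' range to its records.
merged <- merge(records, optdata, by = "species")

# Flag records that fall outside the species' optimal range.
merged$outlier <- merged$bio1 < merged$mintemp | merged$bio1 > merged$maxtemp

merged[merged$outlier, c("species", "bio1")]
```

In this toy case one record per species falls outside its range: 23.1 for Salmo trutta and 0.9 for Abramis brama.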