K-fold cross validation for estimating model performance

Usage

cross_validation(
  project,
  mod.name,
  zone.dat,
  groups,
  k = NULL,
  time_var = NULL,
  use.scalers = FALSE,
  scaler.func = NULL
)

Arguments

project

Name of project

mod.name

Name of the saved model to use. Leave mod.name empty to use the saved "best" model. If more than one model is saved, mod.name should be the numeric indicator of which model to use. Use table_view("modelChosen", project) to view a table of saved models.

zone.dat

Variable in main data table that identifies the individual zones or areas.

groups

Determines how the dataset is split into groups for training and testing, e.g., 'Observations' or 'Years'.

k

Integer; required if groups = 'Observations' to set the number of groups for splitting the data into training and testing datasets. The value of k should be chosen to balance bias and variance; k = 5 and k = 10 are standard choices in the literature. Note that higher values of k increase the runtime and computational cost of cross_validation. Leave-one-out cross validation is a special case of k-fold cross validation in which k equals the number of observations n, which can be useful for small datasets.

time_var

Name of column for time variable. Required if groups = 'Years'.

use.scalers

Input for create_model_input(). Logical, should data be normalized? Defaults to FALSE. Rescaling factors are the mean of the numeric vector unless specified with scaler.func.

scaler.func

Input for create_model_input(). Function to calculate rescaling factors.
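For example, a custom rescaling function can be supplied through scaler.func. The sketch below assumes the function receives a numeric vector and returns a single rescaling factor (consistent with the default behavior described above), and swaps the median in for the default mean:

# Illustrative custom rescaling function: column median instead of the default mean
med_scaler <- function(x) median(x, na.rm = TRUE)

# Passed to cross_validation() as: use.scalers = TRUE, scaler.func = med_scaler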

Details

K-fold cross validation is a resampling procedure for evaluating the predictive performance of a model. First, the data are split into k groups, either randomly across observations (e.g., 5-fold cross validation, where each observation is randomly assigned to one of five groups) or based on a particular variable (e.g., groups defined by gear type). Each group takes a turn as the 'hold-out' or 'test' dataset, while the remaining groups form the training dataset on which model parameters are estimated. Finally, the predictive performance of each iteration is calculated as the percent absolute prediction error.
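The following is a minimal, generic sketch of this procedure in base R, using a linear model as a stand-in for a saved model; the data, model formula, and error formula are illustrative and not the package's internal implementation:

set.seed(42)
dat <- data.frame(y = rnorm(100), x = rnorm(100))
k <- 5
folds <- sample(rep(seq_len(k), length.out = nrow(dat)))  # random fold assignment

pape <- numeric(k)  # percent absolute prediction error per fold
for (i in seq_len(k)) {
  train <- dat[folds != i, ]              # training groups
  test  <- dat[folds == i, ]              # hold-out group
  fit   <- lm(y ~ x, data = train)        # estimate parameters on training data
  pred  <- predict(fit, newdata = test)   # predict on the hold-out group
  pape[i] <- 100 * sum(abs(test$y - pred)) / sum(abs(test$y))
}
mean(pape)  # average predictive performance across the k folds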

Examples

if (FALSE) {

cross_validation(
  project = "scallop",
  mod.name = "scallopModName",
  zone.dat = "ZoneID",  # illustrative zone identifier column
  groups = "Observations",
  k = 5
)

}
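
A split by year could look like the following sketch; the time_var column name is an assumption about the data:

if (FALSE) {

cross_validation(
  project = "scallop",
  mod.name = "scallopModName",
  zone.dat = "ZoneID",    # illustrative zone identifier column
  groups = "Years",
  time_var = "DATE_TRIP"  # illustrative date column name
)

}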