K-fold cross validation for estimating model performance
Usage
cross_validation(
  project,
  mod.name,
  zone.dat,
  groups,
  k = NULL,
  time_var = NULL,
  use.scalers = FALSE,
  scaler.func = NULL
)
Arguments
- project
Name of project.
- mod.name
Name of saved model to use. Argument can be the name of the model or can pull the name of the saved "best" model. Leave mod.name empty to use the saved "best" model. If more than one model is saved, mod.name should be the numeric indicator of which model to use. Use table_view("modelChosen", project) to view a table of saved models.
- zone.dat
Variable in main data table that identifies the individual zones or areas.
- groups
Determines how to subset the dataset into groups for training and testing. Use 'Observations' to split randomly across observations or 'Years' to split by a time variable (see the example calls after this list).
- k
Integer, required if groups = 'Observations'; determines the number of groups for splitting data into training and testing datasets. The value of k should be chosen to balance bias and variance, and values of k = 5 or 10 have been found to be efficient standard values in the literature. Note that higher k values will increase the runtime and computational cost of cross_validation. Leave-one-out cross validation is a type of k-fold cross validation in which k = n, the number of observations; it can be useful for small datasets.
- time_var
Name of column for time variable. Required if groups = 'Years'.
- use.scalers
Input for create_model_input(). Logical, should data be normalized? Defaults to FALSE. Rescaling factors are the mean of the numeric vector unless specified with scaler.func.
- scaler.func
Input for create_model_input(). Function to calculate rescaling factors.
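The calls below show how the arguments combine under each grouping mode. They are illustrative sketches: the project name, model name, and column names ("myProject", "myModel", "ZoneID", "DATE_TRIP") are hypothetical placeholders, not values supplied by the package.

# Hypothetical: random 5-fold split across observations, with rescaled data
cross_validation(
  project = "myProject",
  mod.name = "myModel",
  zone.dat = "ZoneID",
  groups = "Observations",
  k = 5,
  use.scalers = TRUE,
  scaler.func = function(x) mean(x, na.rm = TRUE)  # assumed custom rescaling function
)

# Hypothetical: groups defined by year rather than random assignment
cross_validation(
  project = "myProject",
  mod.name = "myModel",
  zone.dat = "ZoneID",
  groups = "Years",
  time_var = "DATE_TRIP"
)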
Details
K-fold cross validation is a resampling procedure for evaluating the predictive performance of a model. First, the data are split into k groups, either randomly across observations (e.g., 5-fold cross validation where each observation is randomly assigned to one of five groups) or based on a particular variable (e.g., groups defined by gear type). Each group takes a turn as the 'hold-out' or 'test' dataset, while the remaining groups form the training dataset on which model parameters are estimated. Finally, the predictive performance of each iteration is calculated as the percent absolute prediction error.
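As a rough illustration of the resampling logic described above (not this package's internal implementation), the base-R sketch below randomly assigns observations to k folds, fits a stand-in linear model to each training set, and scores each hold-out fold with one common form of percent absolute prediction error; the mtcars data and formula are placeholders for the fitted model.

# Illustrative k-fold loop; mtcars and lm() stand in for the saved model
set.seed(1)
k <- 5
dat <- mtcars
folds <- sample(rep(seq_len(k), length.out = nrow(dat)))  # random fold labels

pct_err <- vapply(seq_len(k), function(i) {
  train <- dat[folds != i, ]                 # k - 1 folds for estimation
  test  <- dat[folds == i, ]                 # held-out fold for prediction
  fit   <- lm(mpg ~ wt + hp, data = train)
  pred  <- predict(fit, newdata = test)
  100 * sum(abs(test$mpg - pred)) / sum(test$mpg)  # percent absolute prediction error
}, numeric(1))

mean(pct_err)  # average predictive error across the k hold-out folds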
Examples
if (FALSE) {
# "ZoneID" is a hypothetical placeholder for the zone column in the project's data
cross_validation("scallop", "scallopModName", zone.dat = "ZoneID", groups = "Observations", k = 5)
}