Determining optimal parameters of the Self Referent Encoding Task: A large-scale examination of self-referent cognition and depression

This file is one of a series of supplemental explanatory documents for the study “Determining optimal parameters of the Self Referent Encoding Task: A large-scale examination of self-referent cognition and depression”. Data and code are available at doi: 10.18738/T8/XK5PXX, and rendered R Markdown walkthroughs can be browsed on the paper’s GitHub Pages website.

Data description

These models require substantial computing resources because of their size and the repeated cross-validation procedures. As such, the code used to fit them is presented here, and the fitted models are explored separately in a different file. The models were run as .sh batch scripts on The University of Texas at Austin’s Texas Advanced Computing Center (TACC). To take advantage of distributed computing, they were run as three separate scripts; the scripts are presented together here.

If you are viewing this as an HTML file, and wish to see the full code, please download the R Markdown file from the Texas Data Repository.

Code employed

This code demonstrates how the data were split into training and test sets and how the best-subsets function is called. The caret and beset packages were used.

Install and load packages

install.packages("caret", dependencies = TRUE)
install.packages("devtools", dependencies = TRUE)
devtools::install_github("jashu/beset")

library(caret); library(beset)

Load data

load("utstudents.Rdata"); load("adolescents.Rdata"); load("mturkers.Rdata")

Subset data

These data are already split into training and test datasets, but the following code was used to split them after all NAs were removed.

utstudents.trainIndex <- caret::createDataPartition(observed.utstudents$dep, 
                                                    times = 1, p = .8, list = FALSE)
utstudents.train <- observed.utstudents[ utstudents.trainIndex, ]
utstudents.test <- observed.utstudents[ -utstudents.trainIndex, ]

mturkers.trainIndex <- caret::createDataPartition(observed.mturkers$dep, 
                                                  times = 1, p = .8, list = FALSE)
mturkers.train <- observed.mturkers[ mturkers.trainIndex, ]
mturkers.test <- observed.mturkers[ -mturkers.trainIndex, ]

adolescents.trainIndex <- caret::createDataPartition(observed.adolescents.full$dep, 
                                                     times = 1, p = .8, list = FALSE)
adolescents.train <- observed.adolescents.full[ adolescents.trainIndex, ]
adolescents.test <- observed.adolescents.full[ -adolescents.trainIndex, ]
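
The NA-removal step mentioned above is not shown in the original code. As a minimal sketch, assuming a hypothetical toy data frame (toy, with made-up columns dep, x1, and x2) in place of the study data, the full sequence might look like the following. The set.seed() call and its value are additions for illustration only; the source does not state whether a seed was set.

```r
library(caret)

# Hypothetical toy data standing in for one of the study samples
set.seed(2024)  # arbitrary seed, added here so the split can be reproduced
toy <- data.frame(
  dep = rnbinom(200, size = 2, mu = 8),  # simulated count outcome
  x1  = rnorm(200),
  x2  = rnorm(200)
)
toy$x1[c(3, 17, 42)] <- NA  # inject a few missing values

# Drop every row that contains at least one NA before partitioning
toy_complete <- na.omit(toy)

# Same call pattern as above: one partition, roughly 80% of rows for training
train_index <- caret::createDataPartition(toy_complete$dep,
                                          times = 1, p = .8, list = FALSE)
toy_train <- toy_complete[ train_index, ]
toy_test  <- toy_complete[ -train_index, ]

# Every complete row ends up in exactly one of the two sets
nrow(toy_train) + nrow(toy_test) == nrow(toy_complete)
```

Because createDataPartition samples within groups of the outcome, the training share is approximately, not exactly, 80%.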

Run the models using beset

Refer to the GitHub page for beset or the help file (help(beset)) to learn more about the functions contained therein. In essence, the 80% partition (training data) is passed to data and the 20% partition to test_data. A negative binomial generalized linear model is employed (beset automatically estimates the theta parameter). n_cores is set to 20 given the computing capabilities of TACC; on a personal computer, the number of available cores can be determined with parallel::detectCores(). All other arguments take their default values: 10 folds, a maximum of 10 predictors in any model, and 10 repetitions of cross-validation.
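
On a personal computer, a value for n_cores could be derived from parallel::detectCores() rather than hard-coded. The following is only a sketch: the cap of 20 mirrors the TACC setting, while leaving one core free and falling back to a single core are assumptions made here, not part of the original scripts.

```r
library(parallel)

# Number of cores reported by the operating system (may be NA on some systems)
available <- parallel::detectCores()
if (is.na(available)) available <- 1L

# Assumption: leave one core free for the rest of the system,
# never exceed the 20 used on TACC, and never drop below 1
n_cores <- max(1L, min(20L, available - 1L))
n_cores
```

The resulting n_cores could then be passed to beset_glm in place of the literal 20.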

library(beset)

ut_negbin_model <- 
  beset::beset_glm(dep ~ ., data = utstudents.train, 
                   test_data = utstudents.test, family = "negbin",
                   n_cores = 20)
mturkers_negbin_model <- 
  beset::beset_glm(dep ~ ., data = mturkers.train,
                   test_data = mturkers.test, family = "negbin",
                   n_cores = 20)
adolescents_negbin_model <- 
  beset::beset_glm(dep ~ ., data = adolescents.train, 
                   test_data = adolescents.test, family = "negbin",
                   n_cores = 20)
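
For readers unfamiliar with theta: in a negative binomial model with mean mu, the variance is mu + mu^2 / theta, so a smaller theta implies more overdispersion relative to a Poisson model. The sketch below uses MASS::glm.nb on simulated data purely to illustrate what estimating theta means. It is not a description of how beset estimates theta internally, and the names sim, x1, and true_theta are hypothetical.

```r
library(MASS)

set.seed(1)  # arbitrary seed for a reproducible simulation
n <- 1000
true_theta <- 2

# Simulated predictor and overdispersed count outcome (log link)
sim <- data.frame(x1 = rnorm(n))
sim$dep <- rnbinom(n, size = true_theta, mu = exp(1 + 0.5 * sim$x1))

# glm.nb estimates the regression coefficients and theta jointly
fit <- MASS::glm.nb(dep ~ x1, data = sim)

fit$theta   # estimated dispersion parameter
coef(fit)   # intercept and slope on the log scale
```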

Save models and data

save(ut_negbin_model, utstudents.train, utstudents.test, file="utmodel-all.Rdata")
save(mturkers_negbin_model, mturkers.train, mturkers.test, file="mtmodel-all.Rdata")
save(adolescents_negbin_model, adolescents.train, adolescents.test, file="adomodel-all.Rdata")
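
Files written with save() restore their objects under the original names when passed to load(). As a small sketch, using a temporary file and hypothetical placeholder objects (toy_model, toy_train, toy_test) rather than the real models, loading into a separate environment avoids overwriting anything already in the workspace:

```r
# Hypothetical placeholders standing in for a fitted model and its data
toy_model <- list(note = "placeholder standing in for a fitted model")
toy_train <- data.frame(dep = c(0, 3, 5))
toy_test  <- data.frame(dep = c(1, 2))

# Save several objects into one file, as in the calls above
tmp_file <- tempfile(fileext = ".Rdata")
save(toy_model, toy_train, toy_test, file = tmp_file)

# Load into a fresh environment; load() returns the names it restored
restored <- new.env()
loaded_names <- load(tmp_file, envir = restored)
loaded_names

# Objects are then accessed through the environment
restored$toy_train
```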

Please note that these models take over 30 minutes to run on TACC.