Early Career Lightning Talks - Managing Randomness to Enable Reproducible Machine Learning
Event Type
Workshop
Online Only
Career Development
Diversity Equity Inclusion (DEI)
Education and Training and Outreach
HPC Community Collaboration
Time: Sunday, 14 November 2021, 3:50pm - 3:55pm CST
Location: Online
Description: The National Science Foundation defines reproducibility as "the ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator". Reproducibility in machine learning refers to the ability to regenerate a model exactly, with identical accuracy and transparency. While a model may offer reproducible inference, reproducing the model itself is frequently problematic because pseudo-random numbers are used during model generation. Managing the random numbers generated in the production of a machine learning model is a necessary step toward ensuring that the model is reproducible.
This project establishes examples of the impact of randomness on model accuracy and offers a preliminary investigation into the ways in which random number generation can be controlled to make ML models more reproducible.
In this study, random number generation was controlled by regulating the random number seed and algorithm to ensure that the same generation process is used each time. We designed C++ intercepts for std::rand() and std::srand() that store the seed used for model generation. The seed is either set deliberately by the caller or generated by the intercept, so that it is always a known value.
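The seed-recording idea can be illustrated with a minimal sketch. It assumes a simple wrapper approach: the names `repro::srand`, `repro::rand`, and `repro::seed_used` are hypothetical, and the abstract does not say how the actual intercepts are installed or what interface they expose.

```cpp
#include <cassert>
#include <cstdlib>
#include <random>

namespace repro {

// Hypothetical bookkeeping: the last seed handed to the generator, if any.
struct SeedState {
    bool known = false;
    unsigned value = 0;
};

inline SeedState& state() {
    static SeedState s;
    return s;
}

// Stand-in for an intercepted std::srand: record the seed, then forward it.
inline void srand(unsigned seed) {
    state().known = true;
    state().value = seed;
    std::srand(seed);
}

// Stand-in for an intercepted std::rand: if the caller never seeded,
// generate a seed, record it, and seed the generator before the first draw.
inline int rand() {
    if (!state().known) {
        repro::srand(std::random_device{}());
    }
    return std::rand();
}

// Returns the seed that reproduces the current random sequence.
// Precondition: repro::srand or repro::rand has been called at least once.
inline unsigned seed_used() {
    assert(state().known);
    return state().value;
}

}  // namespace repro
```

In this sketch, a training run would call `repro::rand()` wherever it previously called `std::rand()`, then log `repro::seed_used()` with the model. Passing that logged value back to `repro::srand()` replays the same sequence on the same platform and standard library.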
This intercept is used in a series of experiments with various machine learning algorithms to explore the relationship between random number generation and model accuracy on public, commonly used data sets. For each experiment, we train and test 100 models on each of two different data sets for each of three algorithms for a total of six sets. The selected algorithms are Neural Network (NN), K-Means clustering, and Naive Bayes classifier models. The six different data sets are from the UCI Machine Learning Repository: Heart Disease (NN), Wine (NN), Iris (K-Means), Breast Tissue (K-Means), Wisconsin Breast Cancer (Naive Bayes), and Somerville Happiness (Naive Bayes).
In these experiments, we primarily focus on controlling or varying each model's 1) random seed, 2) train/test ratio, 3) training data set, and 4) testing data set. We explore three permutations of these variables: 1) varying the seed, controlling the train/test ratio, and controlling the train/test data sets, 2) varying the seed, controlling the train/test ratio, and varying the train/test data sets, and 3) controlling the seed, varying the train/test ratio, and varying the train/test data sets.
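Controlling or varying the train/test data sets amounts to deciding whether the split itself is driven by a fixed or a changing seed. The following is a minimal sketch under stated assumptions: `split_indices` and `Split` are hypothetical names, and `std::mt19937` with `std::shuffle` is used here for illustration rather than the `std::rand()` path described in the abstract.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

namespace repro {

// Row indices assigned to each side of a train/test split.
struct Split {
    std::vector<std::size_t> train;
    std::vector<std::size_t> test;
};

// Shuffle the indices 0..n-1 with a generator seeded by `seed`, then cut
// the shuffled list at train_ratio. Reusing the seed reproduces the split
// (on the same standard library); changing the seed varies it.
inline Split split_indices(std::size_t n, double train_ratio, unsigned seed) {
    assert(train_ratio >= 0.0 && train_ratio <= 1.0);

    std::vector<std::size_t> idx(n);
    std::iota(idx.begin(), idx.end(), std::size_t{0});

    std::mt19937 gen(seed);
    std::shuffle(idx.begin(), idx.end(), gen);

    // Number of training rows, truncated toward zero.
    const auto n_train = static_cast<std::ptrdiff_t>(
        static_cast<double>(n) * train_ratio);

    Split s;
    s.train.assign(idx.begin(), idx.begin() + n_train);
    s.test.assign(idx.begin() + n_train, idx.end());
    return s;
}

}  // namespace repro
```

Under this sketch, the first configuration would hold the split seed and ratio fixed while changing only the training seed, the second would also change the split seed per model, and the third would hold the training seed fixed while changing the ratio and split seed.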
Overall, our experiments show a wide variety and degree of unpredictability in final model quality when random number generators are used. From these experiments, we concluded that the random number seed used in the training of a model greatly impacts the model's quality and overall performance on data, with model quality varying by as much as 60 points.
Future experiments will explore additional permutations of the four variables studied in this project. Additional future work will include running identical experiments using other ML algorithms, different types of data sets, and a GPU, and applying these same techniques to parallel ML algorithms.