Machine learning-based techniques to handle missing data in meta-regression
Date
2024
Publisher
University of Delaware
Abstract
Meta-analysts in the social sciences and education face challenges when they encounter missing data, particularly missing covariates in meta-regression, which can skew statistical inferences. In this dissertation, I investigate the effectiveness of model-based machine learning approaches, specifically Random Forest (RF) and LightGBM (LG), for handling missing data, and compare them with standard multiple imputation methods, such as Predictive Mean Matching (PMM), and ad hoc methods, including Complete Case Analysis (CCA) and Shifting Case Analysis (SCA). Through two simulation studies, I assess the performance of these methods by measuring bias and precision under varying degrees of missingness (5%, 15%, 30%) and different missing data mechanisms (MCAR, MAR, MNAR). The findings reveal that while multiple imputation methods can provide accurate estimates in meta-regression, their efficacy varies at higher rates of missingness and when missingness is correlated with effect sizes. LightGBM demonstrated the most consistent performance, showing minimal bias and stable error ratios across all missing data scenarios, which makes it particularly effective for meta-analyses involving binary moderators and complex moderator structures. Random Forest was generally effective but sensitive to the missingness mechanism and level: it performed well at higher missingness rates (15% and 30%) under MNAR conditions but gave variable results under the other mechanisms, suggesting conditional robustness. In contrast, Predictive Mean Matching showed increased bias and decreased precision under MCAR at a 30% missingness rate. Similarly, Complete Case Analysis showed increased bias under MAR conditions at 15% and 30% missingness rates.
The results underscore the superiority of LightGBM and Random Forest over traditional imputation methods in meta-analytic contexts, highlighting their potential to enhance the accuracy and reliability of systematic reviews plagued by missing data. The application of these advanced machine learning techniques in meta-analysis marks a significant methodological advancement, offering a more robust framework for researchers confronting missing data challenges in systematic reviews.
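As a rough illustration of the three missingness mechanisms compared above, the following Python sketch deletes values of a binary moderator under MCAR, MAR, and MNAR and reports the bias of a complete-case slope estimate. Everything in it is an assumption made for illustration, not the dissertation's simulation design: the function names (`simulate_studies`, `missing_mask`, `complete_case_slope`, `bias_by_condition`), the true coefficients, the unweighted difference-in-means estimator (a real meta-regression would typically use inverse-variance weights), and the specific rules tying missingness to the effect size (MAR) or to the moderator itself (MNAR).

```python
import random
import statistics


def simulate_studies(k, b0=0.2, b1=0.5, sd=0.3, rng=None):
    """Simulate k studies: binary moderator x and effect size y = b0 + b1*x + noise."""
    if rng is None:
        rng = random.Random(0)
    x = [rng.randint(0, 1) for _ in range(k)]
    y = [b0 + b1 * xi + rng.gauss(0, sd) for xi in x]
    return x, y


def missing_mask(x, y, mechanism, rate, rng):
    """Return a list of booleans; True means the moderator value is deleted.

    Illustrative rules (assumptions), each targeting an overall rate near `rate`:
      MCAR: every study has the same deletion probability.
      MAR:  only studies with above-median effect size (observed) can be deleted.
      MNAR: only studies with x == 1 (the value being deleted) can be deleted;
            this assumes roughly half of the studies have x == 1.
    """
    if not 0 <= rate < 0.5:
        raise ValueError("rate must be in [0, 0.5) for this sketch")
    if mechanism == "MCAR":
        probs = [rate] * len(x)
    elif mechanism == "MAR":
        cut = statistics.median(y)
        probs = [2 * rate if yi > cut else 0.0 for yi in y]
    elif mechanism == "MNAR":
        probs = [2 * rate if xi == 1 else 0.0 for xi in x]
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    return [rng.random() < p for p in probs]


def complete_case_slope(x, y, mask):
    """Unweighted slope for a binary moderator using only studies with observed x."""
    g1 = [yi for xi, yi, m in zip(x, y, mask) if not m and xi == 1]
    g0 = [yi for xi, yi, m in zip(x, y, mask) if not m and xi == 0]
    return statistics.fmean(g1) - statistics.fmean(g0)


def bias_by_condition(k=2000, b1=0.5, seed=0):
    """Bias of the complete-case slope for each mechanism and missingness rate."""
    rng = random.Random(seed)
    x, y = simulate_studies(k, b1=b1, rng=rng)
    bias = {}
    for mechanism in ("MCAR", "MAR", "MNAR"):
        for rate in (0.05, 0.15, 0.30):
            mask = missing_mask(x, y, mechanism, rate, rng)
            bias[(mechanism, rate)] = complete_case_slope(x, y, mask) - b1
    return bias
```

Calling `bias_by_condition()` returns a dictionary keyed by `(mechanism, rate)`. Because this is a single replication, the values are noisy; a simulation study would average bias over many replications and would swap `complete_case_slope` for each imputation method under comparison.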
Keywords
LightGBM, Machine learning, Meta analysis, Meta regression, Missing data, Random Forest