Filling the Gaps in Health Data: Using a Machine Learning Approach to Augment Partially Observed Variables Such as Smoking in Claims Data.
Missing information is common in real-world claims data, particularly for behavioral confounders such as smoking. Often one category of the variable ("yes") is partially observed while the other ("no") is completely missing, a pattern we call missing with truncation. A common way to handle this is to naïvely treat missing values as absence of the risk factor, which may lead to substantial misclassification. Standard multiple imputation is not possible because only one level of the variable is observed.
A case study was conducted using data from the NOVELTY study (NCT02760329), which included 12 224 people with physician-diagnosed asthma and/or COPD. From this cohort, 9733 patients with complete information were included. The dataset was split into two parts: the first was used to train an imputation model, and the second was used to evaluate imputations from that model. The evaluation compared (1) using the model to impute a truncated and amputated smoking variable against (2) naïvely classifying missing values as "no", while varying q, the percentage of smokers retained.
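The evaluation design described above can be illustrated with a minimal Python sketch on synthetic data. This is not the study's code or model: the function names, the prevalence of 0.6, and the stand-in "model" (which is simply correct with a fixed probability of 0.8) are all assumptions chosen for illustration.

```python
import random

def truncate(truth, q, rng):
    """Keep each true smoker's 'yes' with probability q; all else becomes missing (None)."""
    return [True if (is_smoker and rng.random() < q) else None for is_smoker in truth]

def naive_fill(observed):
    """Naive approach: treat every missing value as 'no'."""
    return [o is True for o in observed]

def model_fill(observed, predicted):
    """Keep recorded 'yes' values; fill missing values with a model's prediction."""
    return [True if o is True else p for o, p in zip(observed, predicted)]

def accuracy(filled, truth):
    """Share of records whose filled value matches the true value."""
    return sum(f == t for f, t in zip(filled, truth)) / len(truth)

def compare(q, prevalence=0.6, model_accuracy=0.8, n=5000, seed=0):
    """Return (naive accuracy, model-based accuracy) for one value of q.

    prevalence and model_accuracy are illustrative assumptions, not study values.
    """
    rng = random.Random(seed)
    truth = [rng.random() < prevalence for _ in range(n)]
    # Hypothetical stand-in for a trained classifier: right with probability model_accuracy.
    predicted = [t if rng.random() < model_accuracy else (not t) for t in truth]
    observed = truncate(truth, q, rng)
    return (accuracy(naive_fill(observed), truth),
            accuracy(model_fill(observed, predicted), truth))

if __name__ == "__main__":
    for q in (0.1, 0.5, 0.9):
        naive_acc, model_acc = compare(q)
        print(f"q={q:.0%}  naive={naive_acc:.2f}  model={model_acc:.2f}")
```

In a real application, `predicted` would come from a classifier trained on a separate dataset in which smoking is fully observed, rather than from simulated label flips.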
The accuracy of approaches (1) and (2) was 0.79 and 0.43, respectively; for q = 90%, the accuracy of approaches (1) and (2) was 0.89 and 0.94, respectively. Transfer learning showed better accuracy than the naïve approach when fewer than 80% of true smokers were recorded as smokers.
The added value of transfer learning was greatest when low proportions of true ever-smokers were recorded, with its advantage depending on both the true prevalence of smokers and the predictive model's performance.
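One way to see why the advantage depends on prevalence and model performance is a short derivation under simplifying assumptions that are not stated in the abstract: every recorded "yes" is correct, true smokers are retained at random with probability q, and the model is applied only to missing records with sensitivity Se and specificity Sp. The symbols p (true prevalence of smokers), Se, and Sp are introduced here for illustration.

```latex
\begin{align}
\mathrm{Acc}_{\text{naive}} &= (1 - p) + p\,q = 1 - p\,(1 - q), \\
\mathrm{Acc}_{\text{model}} &= p\,q + p\,(1 - q)\,\mathrm{Se} + (1 - p)\,\mathrm{Sp}, \\
\mathrm{Acc}_{\text{model}} > \mathrm{Acc}_{\text{naive}}
  &\iff p\,(1 - q)\,\mathrm{Se} > (1 - p)\,(1 - \mathrm{Sp}).
\end{align}
```

Under these assumptions, the left-hand side of the last condition grows as q falls or as p rises, so model-based imputation gains most when few true smokers are recorded and smoking is common, while the right-hand side shrinks as specificity improves.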
Authors
Franzen, Chandakas, Hillman, Rhodes, Nordon