Comparison of the Efficiency of Multiple Imputation in Multilevel Data: A Simulation Study

Nuanrat Chimsud, Prapasiri Ratchaprapapornkul, Siwachot Srisuttiyakorn

Abstract

This study compared the efficiency of six multiple imputation methods for multilevel data with missing values: Multiple Imputation Fully Conditional Specification (FCS), Random Forest (RF), and four Optimal Impute (Opt.impute) methods, namely Opt.knn, Opt.tree, Opt.svm, and Opt.cv. The simulation was based on real-world educational data with a multilevel structure, modeled with a random coefficients model. Performance was examined under the following conditions: 1) six types of missingness: missing completely at random (MCAR), missing at random (MAR), missing not at random (MNAR), and three pairwise mixed types, MCAR-MAR, MCAR-MNAR, and MAR-MNAR; 2) level-1 sample sizes of 1,000, 2,000, and 3,000 units and level-2 sample sizes of 40, 50, and 60 units; and 3) level-1 missing rates of 30%, 40%, and 50%. Broken down by missingness type, Opt.cv had the highest average efficiency under MCAR; Opt.svm had the highest average efficiency under MAR, MCAR-MAR, and MAR-MNAR; and RF had the highest average efficiency under MNAR and MCAR-MNAR. Overall, the Opt.impute methods tended to provide the highest average efficiency, followed by RF and then FCS.
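The three pure missingness mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' simulation design: the single predictor, the fixed effects (1.0 and 0.5), the variance components, the restriction of missingness to the outcome, and the rank-based deletion rule are all assumptions chosen for illustration. It draws two-level data from a random coefficients model, y_ij = (g00 + u0j) + (g10 + u1j) x_ij + e_ij, and then deletes a fixed share of outcome values under MCAR, MAR, or MNAR.

```python
import numpy as np

def simulate_two_level(n_groups=40, n_total=1000, seed=0):
    """Draw y_ij = (g00 + u0j) + (g10 + u1j) * x_ij + e_ij (illustrative values)."""
    rng = np.random.default_rng(seed)
    group = np.arange(n_total) % n_groups        # near-balanced level-2 membership
    x = rng.normal(size=n_total)                 # level-1 predictor
    u0 = rng.normal(scale=0.5, size=n_groups)    # random intercepts (assumed SD)
    u1 = rng.normal(scale=0.3, size=n_groups)    # random slopes (assumed SD)
    e = rng.normal(scale=1.0, size=n_total)      # level-1 residuals
    y = (1.0 + u0[group]) + (0.5 + u1[group]) * x + e
    return group, x, y

def missing_mask(x, y, mechanism, rate, seed=0):
    """Return a boolean mask marking which outcome values to delete."""
    if not 0.0 < rate < 1.0:
        raise ValueError("rate must lie strictly between 0 and 1")
    rng = np.random.default_rng(seed)
    n = y.size
    k = int(round(rate * n))                     # number of values to delete
    if mechanism == "MCAR":
        score = rng.random(n)                    # unrelated to x and y
    elif mechanism == "MAR":
        score = x + rng.normal(scale=0.5, size=n)  # depends on the observed x
    elif mechanism == "MNAR":
        score = y + rng.normal(scale=0.5, size=n)  # depends on y itself
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    mask = np.zeros(n, dtype=bool)
    mask[np.argsort(score)[-k:]] = True          # delete the k highest scores
    return mask

group, x, y = simulate_two_level(n_groups=40, n_total=1000)
for mech in ("MCAR", "MAR", "MNAR"):
    m = missing_mask(x, y, mech, rate=0.30)
    y_obs = np.where(m, np.nan, y)               # outcome with deleted values
    print(mech, m.mean(), np.nanmean(y_obs) - y.mean())
```

The printed difference between the observed and full-data means gives a rough sense of how much each mechanism distorts the outcome before any imputation is applied. The mixed conditions (for example MCAR-MAR) are not covered here because the abstract does not say how they were constructed.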

DOI: 10.14416/j.kmutnb.2024.03.009

ISSN: 2985-2145

The Journal of King Mongkut's University of Technology North Bangkok