[1] WANG Ran, ZHANG Qian, HU Zhen. Prediction Model for Clostridium Difficile Infection Under Imbalanced Data[J]. Journal of Huaqiao University (Natural Science), 2025, 46(6): 694-702. [doi:10.11830/ISSN.1000-5013.202504008]

Prediction Model for Clostridium Difficile Infection Under Imbalanced Data

Journal of Huaqiao University (Natural Science) [ISSN:1000-5013/CN:35-1079/N]

Volume:
Vol. 46
Issue:
No. 6, 2025
Pages:
694-702
Column:
Publication date:
2025-11-20

Article Info

Title:
Prediction Model for Clostridium Difficile Infection Under Imbalanced Data
Article number:
1000-5013(2025)06-0694-09
Author(s):
WANG Ran1, ZHANG Qian2, HU Zhen1
1. School of Mathematics, Hohai University, Nanjing 211100, China; 2. Department of Pharmacy, Nanjing First Hospital, Nanjing 210006, China
Keywords:
Clostridium difficile infection; risk prediction model; imbalanced data; SMOTE algorithm; machine learning
CLC number:
R516;TP181
DOI:
10.11830/ISSN.1000-5013.202504008
Document code:
A
Abstract:
Using data on critically ill patients from the MIMIC critical care database, this study predicts the risk of Clostridium difficile infection. To address the class imbalance in the dataset, a risk prediction method based on an improved SMOTE algorithm and machine learning is proposed. First, the SMOTE algorithm is improved to produce a balanced dataset: feature weights are computed from the odds ratio, a parameter commonly used in epidemiology, and these weights refine the rule for selecting nearest-neighbor samples. In addition, to keep the structure of discrete features in newly synthesized samples from being destroyed, the sample synthesis step of SMOTE is modified. Second, several machine learning algorithms are each used to build a risk prediction model for Clostridium difficile infection in critically ill patients. The experimental results show that the model built with the improved SMOTE algorithm and the CatBoost algorithm predicts Clostridium difficile infection better. On the test set it achieves an area under the curve (AUC) of 0.75 and a recall of 0.69; on the validation set it achieves an AUC of 0.73 and a recall of 0.57.
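The two SMOTE modifications described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything specific here is an assumption made for illustration: the weight of a binary feature is taken as |log(odds ratio)| with a 0.5 continuity correction, non-binary features get a neutral weight of 1.0, neighbors are ranked by a weighted Euclidean distance, and discrete features are copied from one of the two parent samples instead of being interpolated. The function names (`odds_ratio_weights`, `weighted_distance`, `smote_weighted`) are hypothetical and are not the authors' implementation.

```python
import math
import random


def odds_ratio_weights(X, y, binary_cols):
    """Return one weight per feature.

    Assumption: binary features are weighted by |log(odds ratio)| against
    the label, with +0.5 added to every cell of the 2x2 table so the ratio
    is always defined. All other features keep a neutral weight of 1.0.
    """
    weights = [1.0] * len(X[0])
    for j in binary_cols:
        a = sum(1 for x, t in zip(X, y) if x[j] == 1 and t == 1) + 0.5
        b = sum(1 for x, t in zip(X, y) if x[j] == 1 and t == 0) + 0.5
        c = sum(1 for x, t in zip(X, y) if x[j] == 0 and t == 1) + 0.5
        d = sum(1 for x, t in zip(X, y) if x[j] == 0 and t == 0) + 0.5
        weights[j] = abs(math.log((a * d) / (b * c)))
    return weights


def weighted_distance(u, v, w):
    """Euclidean distance with a per-feature weight on each squared difference."""
    return math.sqrt(sum(wi * (ui - vi) ** 2 for ui, vi, wi in zip(u, v, w)))


def smote_weighted(X, y, binary_cols, k=3, n_new=10, seed=0):
    """Synthesize n_new minority-class (label 1) samples.

    Needs at least two minority samples. Neighbors are chosen among
    minority samples using the weighted distance; continuous features are
    interpolated, discrete features are copied from a parent.
    """
    rng = random.Random(seed)
    w = odds_ratio_weights(X, y, binary_cols)
    minority = [x for x, t in zip(X, y) if t == 1]
    discrete = set(binary_cols)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        # k nearest minority neighbors under the weighted metric
        neighbours = sorted(others, key=lambda m: weighted_distance(base, m, w))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        new = []
        for j, (bj, mj) in enumerate(zip(base, mate)):
            if j in discrete:
                # copy a real category value so the feature stays valid
                new.append(bj if rng.random() < 0.5 else mj)
            else:
                new.append(bj + gap * (mj - bj))
        synthetic.append(new)
    return synthetic


# Toy data: column 0 is a binary risk factor, column 1 is continuous.
X = [[1, 0.2], [1, 0.4], [0, 0.3], [1, 0.9], [0, 1.5],
     [0, 2.0], [1, 1.8], [0, 2.2], [0, 1.1], [0, 1.7]]
y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
new_samples = smote_weighted(X, y, binary_cols=[0], k=2, n_new=6, seed=42)
```

In this sketch a feature with a large odds ratio stretches the distance along its axis, so minority samples that differ on a strong risk factor are less likely to be paired. Copying discrete values keeps every synthetic sample inside the original category set, which is the property the abstract says plain interpolation would break.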


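Memo

Received: 2025-04-16
Corresponding author: HU Zhen (1982-), male, associate professor, Ph.D.; research interests: machine learning and data analysis. E-mail: huzhen@hhu.edu.cn.
Funding: National Key Research and Development Program of China (2024YFE0206600)
https://hdxb.hqu.edu.cn/
Last Update: 2025-11-20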