The ScispaCy Clinical Term Extraction and Snomed CT Synonymy Elimination From Clinical Data For Clustering: A Novel Study

E.K. Jasila; N. Saleena; K. A. Abdul Nazeer

doi:10.26713/cma.v15i2.2939

Authors

E.K. Jasila Department of Computer Science and Engineering, National Institute of Technology Calicut, Kozhikode, Kerala, India https://orcid.org/0000-0002-3783-3925
N. Saleena Department of Computer Science and Engineering, National Institute of Technology Calicut, Kozhikode, Kerala, India https://orcid.org/0000-0002-2449-7758
K. A. Abdul Nazeer Department of Computer Science and Engineering, National Institute of Technology Calicut, Kozhikode, Kerala, India https://orcid.org/0000-0002-0527-9026

Keywords:

Clinical document, SNOMED CT ontology, Red-black tree

Abstract

A clinical document is a written or electronic record containing details of a patient’s medical procedure, clinical trial, or test outcomes. Standard information mining approaches struggle to cluster clinical documents because of their unstructured nature. This work introduces a new approach for clustering clinical documents that addresses synonymy, abbreviation expansion, and the extraction of key features. The clinical document collection for coronary artery disease consists of 1304 records obtained from 296 patients. These records are first preprocessed to remove irregularities. A simple letter-matching algorithm then identifies and expands abbreviations, after which the scispaCy model extracts the relevant clinical terms. The extracted features are checked against the SNOMED CT ontology to eliminate medical terms that have similar meanings. The TF-IDF method converts the resulting features into vectors. BERT word embeddings were also evaluated as a feature representation; however, the TF-IDF model outperformed the BERT model. Clustering uses an enhanced k-means algorithm that incorporates the Red-Black Tree data structure. The proposed method was compared with several existing clustering algorithms and produced clusters with higher Normalised Mutual Information (NMI) and accuracy scores. These results suggest that the model can identify patients with similar diseases and thereby assist healthcare professionals.
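
As a rough illustration of the synonym-elimination and TF-IDF steps described in the abstract, the following minimal Python sketch maps extracted terms to a single preferred form and then weights them with a plain tf × log(N/df) scheme. It is not the authors' implementation: the synonym table is a hypothetical stand-in for a SNOMED CT lookup, the term lists are made-up examples of what an entity extractor might return, and the function names are illustrative only.

```python
import math
from collections import Counter

# Hypothetical toy synonym table standing in for a SNOMED CT lookup.
# These mappings are illustrative; they are not taken from the paper.
SYNONYM_TO_PREFERRED = {
    "heart attack": "myocardial infarction",
    "mi": "myocardial infarction",
    "high blood pressure": "hypertension",
}


def normalise_terms(terms):
    """Map each extracted term to a preferred form so that synonyms collapse."""
    return [SYNONYM_TO_PREFERRED.get(t.lower(), t.lower()) for t in terms]


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using a simple tf * log(N / df) weighting."""
    n_docs = len(docs)
    vocab = sorted({term for doc in docs for term in doc})
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        vectors.append(
            [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


# Made-up term lists, as they might come out of an entity extractor.
extracted = [
    ["heart attack", "hypertension"],
    ["MI", "high blood pressure"],
    ["angina", "myocardial infarction"],
]
docs = [normalise_terms(terms) for terms in extracted]
vocab, vectors = tfidf_vectors(docs)
```

In this toy input the first two records use different surface forms for the same concepts, so after normalisation they receive identical vectors; without the synonym step they would share no terms at all. A clustering algorithm such as k-means would then operate on these vectors.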
