| Title: | Weakly-supervised multilingual medical NER for symptom extraction for low-resource languages |
|---|
| Authors: | Sallauka, Rigon (Author); Arioz, Umut (Author); Rojc, Matej (Author); Mlakar, Izidor (Author) |
| Files: | applsci-15-05585-v2.pdf (338.94 KB) MD5: 9E3606C205F09FCCA4B26DDF5C379DCF |
|---|
| Language: | English |
|---|
| Work type: | Article |
|---|
| Typology: | 1.01 - Original Scientific Article |
|---|
| Organization: | FERI - Faculty of Electrical Engineering and Computer Science |
|---|
| Abstract: | Patient-reported health data, especially patient-reported outcome measures, are vital for improving clinical care but are often limited by memory bias, cognitive load, and inflexible questionnaires. Patients prefer conversational symptom reporting, highlighting the need for robust methods in symptom extraction and conversational intelligence. This study presents a weakly-supervised pipeline for training and evaluating medical Named Entity Recognition (NER) models across eight languages, with a focus on low-resource settings. A merged English medical corpus, annotated using the Stanza i2b2 model, was translated into German, Greek, Spanish, Italian, Portuguese, Polish, and Slovenian, preserving the entity annotations for medical problems, diagnostic tests, and treatments. Data augmentation addressed the class imbalance, and the fine-tuned BERT-based models outperformed baselines consistently. The English model achieved the highest F1 score (80.07%), followed by German (78.70%), Spanish (77.61%), Portuguese (77.21%), Slovenian (75.72%), Italian (75.60%), Polish (75.56%), and Greek (69.10%). Compared to the existing baselines, our models demonstrated notable performance gains, particularly in English, Spanish, and Italian. This research underscores the feasibility and effectiveness of weakly-supervised multilingual approaches for medical entity extraction, contributing to improved information access in clinical narratives, especially in under-resourced languages. |
|---|
| Keywords: | low-resource languages, machine translation, medical entity extraction, NER, NLP, patient-reported outcomes, weakly-supervised learning |
|---|
| Publication status: | Published |
|---|
| Publication version: | Version of Record |
|---|
| Submitted for review: | 01.05.2025 |
|---|
| Article acceptance date: | 13.05.2025 |
|---|
| Publication date: | 16.05.2025 |
|---|
| Publisher: | MDPI |
|---|
| Year of publishing: | 2025 |
|---|
| Number of pages: | 18 pp. |
|---|
| Numbering: | Vol. 15, iss. 10, [article no.] 5585 |
|---|
| PID: | 20.500.12556/DKUM-92857  |
|---|
| UDC: | 004.8:61 |
|---|
| ISSN on article: | 2076-3417 |
|---|
| COBISS.SI-ID: | 236281347  |
|---|
| DOI: | 10.3390/app15105585  |
|---|
| Copyright: | © 2025 by the authors |
|---|
| Publication date in DKUM: | 19.05.2025 |
|---|
| Views: | 0 |
|---|
| Downloads: | 4 |
|---|
| Categories: | Misc. |
|---|