Natural language processing and information retrieval

Scientific area

Every day, vast amounts of text are generated through online applications such as messaging apps, social media, blogs, and online publishing platforms. Additionally, substantial volumes of text are available through traditional channels, including public policies (laws and regulations), academic publications, technical documentation (manuals), and clinical records.

Area leaders:
Fabio Crestani (USI)
Fabio Rinaldi (SUPSI)

Research groups

Natural Language Processing

Traditionally, NLP represented words as discrete, static units of meaning, which made it technically difficult to capture the fact that some words are related by similar meaning and to exploit that information in a computational system. Distributed word representations help to overcome this problem by representing each word as a numerical vector, so that words can be viewed as points in a multi-dimensional semantic space where related words lie close together. A similar technique had long been used in IR to represent documents, which probably contributed to the early successes of IR relative to the slower progress of NLP.
Deep learning uses multi-layered neural networks to process the information provided by words and sentences (represented as word vectors).
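To make the idea of words as points in a semantic space concrete, the following is a minimal sketch that compares toy word vectors with cosine similarity. The three-dimensional vectors and the helper names (`cosine_similarity`, `most_similar`) are assumptions made up for illustration; real word vectors are learned from large corpora and typically have hundreds of dimensions.

```python
import math

# Toy 3-dimensional word vectors. The values are invented for illustration
# and are not taken from any trained model.
vectors = {
    "doctor": [0.9, 0.8, 0.1],
    "physician": [0.85, 0.75, 0.2],
    "banana": [0.1, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors):
    """Return the other word whose vector is closest to `word` by cosine."""
    return max(
        (w for w in vectors if w != word),
        key=lambda w: cosine_similarity(vectors[word], vectors[w]),
    )


if __name__ == "__main__":
    print(most_similar("doctor", vectors))
```

With these toy values, "doctor" is closer to "physician" than to "banana", which is the kind of relatedness that discrete word symbols cannot express. The same comparison applies to document vectors in IR.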
The NLP research group at IDSIA specializes in applications of these recent techniques to practical problems such as extracting medical knowledge from scientific literature and clinical records, or analysing social media streams to detect fake news.

The group recently obtained two SNF projects in the area of “NLP for health”.

Information Retrieval

The IR research group works on advanced text analysis and term weighting techniques for detecting and tracking mental health disorders in social media. More specifically, the group developed a test collection, an evaluation methodology and several effectiveness metrics for the temporal tracking of the onset of such disorders, which are currently used by tens of research groups worldwide in the context of CLEF (Conference and Labs of the Evaluation Forum, formerly the Cross-Language Evaluation Forum). The group also studies how to model the language of users affected by these disorders, for instance through the automatic generation of text showing symptoms of a mental health disorder.
A parallel line of research concerns Mobile IR, an area in which the IR group has been active for many years through several past projects (Crestani, 2017). The group is currently working on Conversational IR as a way to enhance Mobile IR. In this context, it is exploring new deep learning models for generating clarifying questions, which allow a conversational search system to interact with a mobile user over multiple turns (Aliannejadi, 2019; Sekulic, 2021).
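As an illustration of what term weighting means in general, here is a minimal sketch of TF-IDF, a standard weighting scheme. It is not necessarily the technique used by the group. The function name `tf_idf` and the example posts are assumptions invented for this sketch.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF weights for a list of tokenized documents.

    `documents` is a list of token lists. The result is a list of
    {term: weight} dictionaries, one per document.
    """
    n_docs = len(documents)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            # Relative term frequency times inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Invented toy posts, already tokenized and lower-cased.
posts = [
    ["i", "feel", "tired"],
    ["i", "feel", "fine"],
    ["i", "went", "running"],
]
post_weights = tf_idf(posts)
```

In this toy collection, a term that occurs in every post ("i") receives weight zero, while a term that occurs in a single post ("tired") is weighted above one shared by two posts ("feel"). This is the basic intuition behind weighting terms by how well they discriminate between texts.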

Leading Projects

Medical NLP

Curation of the biomedical literature

Digital Humanities

Mini-Muse is a preliminary study that aims to combine Natural Language Processing algorithms and data visualization to enhance access to and engagement with scientific publications in the domain of historical research.
https://mini-muse.github.io/project/

Leading Publications

Joseph Cornelius, Oscar Lithgow-Serrano, Sandra Mitrovic, Ljiljana Dolamic, and Fabio Rinaldi. 2024. BUST: Benchmark for the evaluation of detectors of LLM-Generated Text. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8029–8057, Mexico City, Mexico. Association for Computational Linguistics.

doi: 10.18653/v1/2024.naacl-long.444

Anastassia Shaitarova, Jamil Zaghir, Alberto Lavelli, Michael Krauthammer, Fabio Rinaldi. 2023. Exploring the Latest Highlights in Medical Natural Language Processing across Multiple Languages: A Survey. Yearbook of Medical Informatics 32(01):230-243.

doi: 10.1055/s-0043-1768726

Sedlakova J, Daniore P, Horn Wintsch A, Wolf M, Stanikic M, Haag C, Sieber C, Schneider G, Staub K, Alois Ettlin D, Grübner O, Rinaldi F, von Wyl V; University of Zurich Digital Society Initiative (UZH-DSI) Health Community. Challenges and best practices for digital unstructured data enrichment in health research: A systematic narrative review. PLOS Digit Health. 2023 Oct 11;2(10):e0000347.

doi: 10.1371/journal.pdig.0000347

Vani Kanjirangat, Tanja Samardžić, Ljiljana Dolamic, Fabio Rinaldi (2023). Optimizing the Size of Subword Vocabularies in Dialect Classification. In Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023) (pp. 14-30).

doi: 10.18653/v1/2023.vardial-1.2

Vani Kanjirangat, Tanja Samardžić, Fabio Rinaldi, Ljiljana Dolamic (2022). Early Guessing for Dialect Identification. In Findings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022) (pp. 6417-6426).

https://aclanthology.org/2022.findings-emnlp.479/

Lenz Furrer, Joseph Cornelius, Fabio Rinaldi. Parallel sequence tagging for concept recognition. BMC Bioinformatics, volume 22, Article number: 623 (2021).

doi: 10.1186/s12859-021-04511-y

Roberto Zanoli, Alberto Lavelli, Theresa Löffler, Nicolas Andres Perez Gonzalez, Fabio Rinaldi. An annotated dataset for extracting gene-melanoma relations from scientific literature. Journal of Biomedical Semantics, volume 13, Article number: 2 (2022).

doi: 10.1186/s13326-021-00251-3

Gaspar F, Lutters M, Beeler PE, Lang PO, Burnand B, Rinaldi F, Lovis C, Csajka C, Le Pogam M; SwissMADE study. Automatic Detection of Adverse Drug Events in Geriatric Care: Study Proposal. JMIR Res Protoc. 2022;11(11):e40456.

doi: 10.2196/40456
