Interactive Malayalam Question Answering System in the Health Domain Using Neural Word Embedding and Hybrid Large Language Models
Abstract
Human-computer interaction has been a goal for mankind since the emergence of Artificial Intelligence (AI) and Machine Learning (ML) techniques. Information Retrieval (IR) and Question Answering (QA) currently stand out as two of the most active research areas. A vast volume of data is scattered across the web, with a large number of users simultaneously seeking information, so computers must comprehend and analyze natural languages for accurate and effective communication. Natural Language Processing (NLP) and Natural Language Understanding (NLU) techniques address the challenges of man-machine interaction. Both IR and QA systems act as interfaces for communication with computers: in Information Retrieval, a set of documents is fetched from an extensive document collection, whereas Question Answering aims to provide the most relevant and appropriate answers to users' queries.
This study aims to create an interactive Malayalam Question Answering System (QAS) in the health domain that can effectively answer users' queries related to their health problems and wellness. To do this, it employs advanced NLP techniques such as Word Embedding, hybrid transformer-based models, and Large Language Models (LLMs). In the first phase, Word Embedding was used to capture the semantic relationships between words, with Continuous Bag of Words (CBOW) and Skip-Gram models; different Bidirectional Encoder Representations from Transformers (BERT) models were then fine-tuned with health data sets for the Question Answering process. Secondly, a new hybrid BERT model was introduced by combining a Reader-Retriever pipeline architecture with different BERT models. Finally, a new model, Bio Question Answering BERT (BioQA-BERT), was proposed; it uses the Low Rank Adaptation (LoRA) fine-tuning algorithm and the latest Large Language Model, Large Language Meta AI (LLaMA), together with the optimized BERT model RoBERTa.
In the initial phase, a Word Embedding-based Question Answering model was introduced, built on an annotated Malayalam Question Answering Dataset (MQuAD) in SQuAD (Stanford Question Answering Dataset) format. Two Word Embedding algorithms were employed, CBOW and Skip-Gram. Notably, the Skip-Gram algorithm exhibited the highest accuracy when applied to the MQuAD corpus. The primary objective of this system is to bridge the gap between human language understanding and machine reasoning. By exploiting Word Embedding, the system provides precise and relevant answers to users' queries. One pivotal advancement that significantly contributed to the effectiveness of the Question Answering system is the incorporation of Word Embedding techniques, demonstrated by the superior performance of the Skip-Gram algorithm on the MQuAD corpus.
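As an illustration only, the sketch below shows how CBOW and Skip-Gram embeddings of the kind described above could be trained on tokenized MQuAD text with the gensim library. The file name, tokenization, and hyperparameters are hypothetical placeholders, not the exact setup used in this work.

```python
# Hypothetical sketch: training CBOW and Skip-Gram embeddings on MQuAD text.
# File name, tokenization, and hyperparameters are illustrative assumptions.
from gensim.models import Word2Vec

# Assume each line of the (hypothetical) file holds one Malayalam sentence,
# already whitespace-tokenized; real Malayalam text needs proper tokenization.
with open("mquad_passages.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f if line.strip()]

# sg=0 selects CBOW, sg=1 selects Skip-Gram; the study found Skip-Gram
# to give the highest accuracy on the MQuAD corpus.
cbow = Word2Vec(sentences, vector_size=300, window=5, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=300, window=5, min_count=1, sg=1)

# Nearest neighbours in the embedding space give semantically related terms,
# which a QA model can use to match question words against passage words.
print(skipgram.wv.most_similar(sentences[0][0], topn=5))
```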
A detailed analysis of various BERT models for Question Answering was then conducted. Different pre-trained BERT models, namely A Lite Bidirectional Encoder Representations from Transformers (ALBERT), Distil Bidirectional Encoder Representations from Transformers (DistilBERT), Struct Bidirectional Encoder Representations from Transformers (StructBERT), and Robustly Optimized Bidirectional Encoder Representations from Transformers (RoBERTa), were utilized and fine-tuned using the MQuAD corpus.
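As a hedged illustration of this fine-tuning step, the fragment below loads a pre-trained encoder with the Hugging Face transformers library and prepares it for extractive QA on a SQuAD-format corpus; the checkpoint name, file paths, and training settings are assumptions for the sketch, not the configuration used in the thesis.

```python
# Illustrative sketch only: fine-tuning a pre-trained encoder for extractive QA
# on a SQuAD-format dataset such as MQuAD. Names and paths are assumptions.
from transformers import (AutoTokenizer, AutoModelForQuestionAnswering,
                          TrainingArguments)
from datasets import load_dataset

checkpoint = "xlm-roberta-base"  # any of the BERT variants above could be used
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)

# Hypothetical local SQuAD-format files; real preprocessing must also map the
# character-level answer spans onto token start/end positions.
raw = load_dataset("json",
                   data_files={"train": "mquad_train.json",
                               "validation": "mquad_dev.json"},
                   field="data")

args = TrainingArguments(output_dir="qa-finetune", num_train_epochs=3,
                         per_device_train_batch_size=8, learning_rate=3e-5)
# A Trainer would then be built from `model`, `args`, and a tokenized version
# of `raw`, and run with trainer.train(); span preprocessing is omitted here.
```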
A hybrid Reader-Retriever pipeline architecture was then incorporated with the different BERT models. Finally, the Question Answering model BioQA-BERT was implemented by integrating the Low Rank Adaptation (LoRA) algorithm and LLaMA with the optimized BERT model RoBERTa. Experimental results indicated a notable increase in the F1 score after introducing LoRA and LLaMA with the BERT models. Overall, while this study provides significant insights into the state and challenges of Malayalam Question Answering, especially in the health domain, it is constrained by the inherent complexities of the language and the limited precedent in the field.
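To make the Reader-Retriever idea concrete, the following minimal sketch pairs a simple TF-IDF retriever over MQuAD passages with a transformer reader via the Hugging Face question-answering pipeline. The passage list, the question, and the reader checkpoint name are placeholder assumptions, not the components used in the thesis.

```python
# Minimal Reader-Retriever sketch (passages, question, and model name are assumptions).
# Retriever: rank passages by TF-IDF similarity to the question.
# Reader: extract an answer span from the top-ranked passage.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline

passages = ["...health passage 1...", "...health passage 2..."]  # MQuAD contexts
question = "..."                                                  # user query

vec = TfidfVectorizer()
doc_matrix = vec.fit_transform(passages)
scores = cosine_similarity(vec.transform([question]), doc_matrix)[0]
best_passage = passages[scores.argmax()]

# Reader: any fine-tuned extractive QA checkpoint; the name here is a placeholder.
reader = pipeline("question-answering", model="deepset/xlm-roberta-base-squad2")
print(reader(question=question, context=best_passage))
```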
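The idea of attaching LoRA adapters to an optimized encoder, as in BioQA-BERT, could be prototyped with the peft library roughly as shown below. The rank, scaling, and target module names are illustrative assumptions, and the LLaMA integration described above is not shown.

```python
# Hedged sketch of LoRA fine-tuning for a RoBERTa-style QA model using peft.
# r, lora_alpha, and target_modules are illustrative values, not thesis settings.
from transformers import AutoModelForQuestionAnswering
from peft import LoraConfig, get_peft_model, TaskType

base = AutoModelForQuestionAnswering.from_pretrained("xlm-roberta-base")

lora_cfg = LoraConfig(
    task_type=TaskType.QUESTION_ANS,       # extractive QA head
    r=8, lora_alpha=16, lora_dropout=0.1,  # low-rank update hyperparameters
    target_modules=["query", "value"],     # attention projections to adapt
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the small LoRA matrices are trainable
```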
In conclusion, this research constitutes a substantial contribution to understanding the current state and challenges within the domain of Malayalam Question Answering, particularly in the context of healthcare. The comprehensive exploration of these challenges illuminates the intricacies intrinsic to the Malayalam language, offering a nuanced perspective on the difficulties encountered in developing robust Question Answering systems for this linguistic environment. The intricate nature of the Malayalam language poses obstacles to achieving optimal performance, necessitating sophisticated solutions attuned to its linguistic idiosyncrasies. Additionally, the limited precedent in the field of Malayalam Question Answering introduces an extra layer of complexity, highlighting the pioneering nature of this research. As the landscape evolves, future research endeavors will benefit from addressing these challenges, building upon the foundational insights derived from this study and contributing to the continual advancement of Malayalam Question Answering systems, particularly in the health domain.