Interactive Malayalam Question Answering System in the Health Domain Using Neural Word Embedding and Hybrid Large Language Models
Abstract
Human-computer interaction has been a goal for mankind since the emergence of Artificial Intelligence (AI) and Machine Learning (ML) techniques. Information Retrieval (IR) and Question Answering (QA) currently stand out as two of the most active research areas. A vast volume of data is scattered across the web, with a large number of users simultaneously seeking information, so computers must comprehend and analyze natural languages for accurate and effective communication. Natural Language Processing (NLP) and Natural Language Understanding (NLU) techniques address the challenges of man-machine interaction. Both IR and QA systems act as interfaces for communication with computers: in Information Retrieval, a set of documents is fetched from an extensive document collection, whereas Question Answering aims to provide the most relevant and appropriate answers to users' queries.
This study aims to create an interactive Malayalam Question Answering System (QAS) in the health domain that can effectively answer users' queries related to their health problems and wellness. To do this, it employs advanced NLP techniques such as Word Embedding, hybrid transformer-based models, and Large Language Models (LLMs). In the first phase, Word Embedding was used to capture the semantic relationships between words, with Continuous Bag of Words (CBOW) and Skip-Gram models; different Bidirectional Encoder Representations from Transformers (BERT) models were then fine-tuned with health data sets for the Question Answering process. Secondly, a new hybrid BERT model was introduced by combining a Reader-Retriever pipeline architecture with different BERT models. Finally, a new model, Bio Question Answering BERT (BioQA-BERT), was proposed; it uses the Low Rank Adaptation (LoRA) fine-tuning algorithm and the latest Large Language Model, Large Language Meta AI (LLaMA), together with the optimized BERT model RoBERTa.
In the initial phase, a Word Embedding-based Question Answering model was introduced, built on an annotated Malayalam Question Answering Dataset (MQuAD) in SQuAD (Stanford Question Answering Dataset) format. Two Word Embedding algorithms were employed, CBOW and Skip-Gram. Notably, the Skip-Gram algorithm exhibited the highest accuracy when applied to the MQuAD corpus. The primary objective of this system is to bridge the gap between human language understanding and machine reasoning. By exploiting Word Embedding, the system provides precise and relevant answers to users' queries. One pivotal advancement that significantly contributed to the effectiveness of the Question Answering system is the incorporation of Word Embedding techniques, demonstrated by the superior performance of the Skip-Gram algorithm on the MQuAD corpus.
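As an illustration only, the sketch below shows how CBOW and Skip-Gram embeddings of the kind described above could be trained on tokenized MQuAD text with the gensim library. The file name, tokenization, and hyperparameters are hypothetical placeholders, not the exact setup used in this work.

```python
# Hypothetical sketch: training CBOW and Skip-Gram embeddings on MQuAD text.
# File name, tokenization, and hyperparameters are illustrative assumptions.
from gensim.models import Word2Vec

# Assume each line of the (hypothetical) file holds one Malayalam sentence,
# already whitespace-tokenized; real Malayalam text needs proper tokenization.
with open("mquad_passages.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f if line.strip()]

# sg=0 selects CBOW, sg=1 selects Skip-Gram; the study found Skip-Gram
# to give the highest accuracy on the MQuAD corpus.
cbow = Word2Vec(sentences, vector_size=300, window=5, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=300, window=5, min_count=1, sg=1)

# Nearest neighbours in the embedding space give semantically related terms,
# which a QA model can use to match question words against passage words.
print(skipgram.wv.most_similar(sentences[0][0], topn=5))
```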
A detailed analysis of various BERT models for Question Answering was then conducted. Different pre-trained BERT models, namely A Lite Bidirectional Encoder Representations from Transformers (ALBERT), Distil Bidirectional Encoder Representations from Transformers (DistilBERT), Struct Bidirectional Encoder Representations from Transformers (StructBERT), and Robustly Optimized Bidirectional Encoder Representations from Transformers (RoBERTa), were utilized and fine-tuned using the MQuAD corpus.
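As a hedged illustration of this fine-tuning step, the fragment below loads a pre-trained encoder with the Hugging Face transformers library and prepares it for extractive QA on a SQuAD-format corpus; the checkpoint name, file paths, and training settings are assumptions for the sketch, not the configuration used in the thesis.

```python
# Illustrative sketch only: fine-tuning a pre-trained encoder for extractive QA
# on a SQuAD-format dataset such as MQuAD. Names and paths are assumptions.
from transformers import (AutoTokenizer, AutoModelForQuestionAnswering,
                          TrainingArguments)
from datasets import load_dataset

checkpoint = "xlm-roberta-base"  # any of the BERT variants above could be used
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)

# Hypothetical local SQuAD-format files; real preprocessing must also map the
# character-level answer spans onto token start/end positions.
raw = load_dataset("json",
                   data_files={"train": "mquad_train.json",
                               "validation": "mquad_dev.json"},
                   field="data")

args = TrainingArguments(output_dir="qa-finetune", num_train_epochs=3,
                         per_device_train_batch_size=8, learning_rate=3e-5)
# A Trainer would then be built from `model`, `args`, and a tokenized version
# of `raw`, and run with trainer.train(); span preprocessing is omitted here.
```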
A hybrid Reader-Retriever pipeline architecture was then incorporated with the different BERT models. Finally, the Question Answering model BioQA-BERT was implemented by integrating the Low Rank Adaptation (LoRA) algorithm and LLaMA with the optimized BERT model RoBERTa. Experimental results indicated a notable increase in the F1 score after introducing LoRA and LLaMA with the BERT models. Overall, while this study provides significant insights into the state and challenges of Malayalam Question Answering, especially in the health domain, it is constrained by the inherent complexities of the language and the limited precedent in the field.
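To make the Reader-Retriever idea concrete, the following minimal sketch pairs a simple TF-IDF retriever over MQuAD passages with a transformer reader via the Hugging Face question-answering pipeline. The passage list, the question, and the reader checkpoint name are placeholder assumptions, not the components used in the thesis.

```python
# Minimal Reader-Retriever sketch (passages, question, and model name are assumptions).
# Retriever: rank passages by TF-IDF similarity to the question.
# Reader: extract an answer span from the top-ranked passage.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline

passages = ["...health passage 1...", "...health passage 2..."]  # MQuAD contexts
question = "..."                                                  # user query

vec = TfidfVectorizer()
doc_matrix = vec.fit_transform(passages)
scores = cosine_similarity(vec.transform([question]), doc_matrix)[0]
best_passage = passages[scores.argmax()]

# Reader: any fine-tuned extractive QA checkpoint; the name here is a placeholder.
reader = pipeline("question-answering", model="deepset/xlm-roberta-base-squad2")
print(reader(question=question, context=best_passage))
```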
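The idea of attaching LoRA adapters to an optimized encoder, as in BioQA-BERT, could be prototyped with the peft library roughly as shown below. The rank, scaling, and target module names are illustrative assumptions, and the LLaMA integration described above is not shown.

```python
# Hedged sketch of LoRA fine-tuning for a RoBERTa-style QA model using peft.
# r, lora_alpha, and target_modules are illustrative values, not thesis settings.
from transformers import AutoModelForQuestionAnswering
from peft import LoraConfig, get_peft_model, TaskType

base = AutoModelForQuestionAnswering.from_pretrained("xlm-roberta-base")

lora_cfg = LoraConfig(
    task_type=TaskType.QUESTION_ANS,       # extractive QA head
    r=8, lora_alpha=16, lora_dropout=0.1,  # low-rank update hyperparameters
    target_modules=["query", "value"],     # attention projections to adapt
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the small LoRA matrices are trainable
```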
In conclusion, this research constitutes a substantial contribution to understanding the current state and challenges within the domain of Malayalam Question Answering, particularly in the context of healthcare. The comprehensive exploration of these challenges illuminates the intricacies intrinsic to the Malayalam language, offering a nuanced perspective on the difficulties encountered in developing robust Question Answering systems for this linguistic environment. The intricate nature of the Malayalam language poses obstacles to achieving optimal performance, necessitating sophisticated solutions attuned to its linguistic idiosyncrasies. Additionally, the limited precedent in the field of Malayalam Question Answering introduces an extra layer of complexity, highlighting the pioneering nature of this research. As the landscape evolves, future research endeavors will benefit from addressing these challenges, building upon the foundational insights derived from this study and contributing to the continual advancement of Malayalam Question Answering systems, particularly in the health domain.