Ontology-Driven Annotation of Dialogue Corpora: A Case Study in Indian Languages
DOI:
https://doi.org/10.47392/IRJAEM.2026.0182
Keywords:
ontology driven annotation, dialogue corpora, Indian languages, dialogue acts, semantic ontology, code mixing, honorifics, cross lingual analysis, low resource NLP, conversational AI
Abstract
Developing high-quality annotated dialogue corpora for Indian languages remains an open problem because of linguistic complexity, cultural factors, and the lack of standardized annotation schemes across tasks. This paper proposes an ontology-based annotation framework that provides consistent, machine-readable annotation of multi-turn dialogues in Hindi, Tamil, and Telugu. We develop a formal ontology that models dialogue acts (such as Request, Confirm, and Apology), roles, domain entities (such as train, ticket, and account), and discourse relations, and extend it with categories specific to Indian languages, such as honorific expressions, levels of politeness, and code-mixed expressions. We use the ontology to annotate a 15,000-turn corpus of customer service and task-oriented dialogues, obtaining inter-annotator agreement of κ=0.76 for dialogue acts and κ=0.71 for semantic roles. Our experimental results show that ontology-based annotation outperforms surface-level annotation schemes by 12-15% on dialogue act classification (F1=0.87), intent identification, and semantic search tasks. Cross-lingual transfer experiments further show that aggregating ontology-aligned data from multiple languages yields a 12-18% improvement. The paper offers a scalable approach to dialogue corpus development in low-resource settings and provides the ontology and annotation guidelines for future research on Indian language conversational AI and sociolinguistic analysis.
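The abstract reports inter-annotator agreement as κ values without naming the statistic. As a minimal sketch, assuming two annotators and Cohen's kappa (an assumption, since the paper may use a different coefficient such as Fleiss' kappa), agreement on dialogue act labels could be computed as follows. The function name `cohen_kappa` and the toy label sequences are illustrative and are not taken from the paper's corpus.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same turns.

    Assumption: exactly two annotators; the paper does not state
    which kappa variant it reports.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of turns given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal label rates.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy annotations using dialogue act names from the abstract.
annotator_1 = ["Request", "Confirm", "Apology", "Request", "Confirm", "Request"]
annotator_2 = ["Request", "Confirm", "Request", "Request", "Confirm", "Apology"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Here the two annotators agree on four of six turns, but kappa is lower than the raw agreement rate because it discounts the agreement expected by chance from each annotator's label distribution.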
License
Copyright (c) 2026 International Research Journal on Advanced Engineering and Management (IRJAEM)

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.