Ontology-Driven Annotation of Dialogue Corpora: A Case Study in Indian Languages

Authors

  • Ilakkiya J, Assistant Professor, Computer Science, Yenepoya University, Bangalore, Karnataka, India.
  • Akhil Khureshi, UG - Computer Science, Yenepoya University, Bangalore, Karnataka, India.
  • Nicemon Dominic, UG - Computer Science, Yenepoya University, Bangalore, Karnataka, India.
  • Vishal V Nair, UG - Computer Science, Yenepoya University, Bangalore, Karnataka, India.
  • P Mohammed Anas, UG - Computer Science, Yenepoya University, Bangalore, Karnataka, India.

DOI:

https://doi.org/10.47392/IRJAEM.2026.0182

Keywords:

ontology-driven annotation, dialogue corpora, Indian languages, dialogue acts, semantic ontology, code-mixing, honorifics, cross-lingual analysis, low-resource NLP, conversational AI

Abstract

The development of high-quality annotated dialogue corpora for Indian languages remains an open problem because of linguistic complexity, cultural factors, and the lack of standardization across tasks. This paper proposes an ontology-based annotation framework that addresses these issues and provides consistent, machine-readable annotation of multi-turn dialogues in Hindi, Tamil, and Telugu. We develop a formal ontology that formalizes dialogue acts (such as Request, Confirm, and Apology), roles, domain entities (such as train, ticket, and account), and discourse relations, and extend it with Indian language-specific categories such as honorific expressions, levels of politeness, and code-mixed expressions. We use the ontology to annotate a 15,000-turn corpus of customer service and task-oriented dialogues and obtain inter-annotator agreement of κ=0.76 for dialogue acts and κ=0.71 for semantic roles. Experimental results show that ontology-based annotation outperforms surface-level annotation schemes by 12-15% on dialogue act classification (F1=0.87), intent identification, and semantic search tasks. Cross-lingual transfer experiments further show that aggregating ontology-aligned data from multiple languages yields a 12-18% improvement. The framework offers a scalable approach to dialogue corpus development in low-resource settings, and the ontology and annotation guidelines are provided for future research on Indian language conversational AI and sociolinguistic analysis.
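To make the reported agreement figures concrete, the sketch below computes a chance-corrected agreement score for two annotators. It assumes the κ in the abstract is Cohen's kappa over a pair of annotators, which the abstract does not state; the function name `cohen_kappa` and the six toy labels are hypothetical and are not taken from the corpus. Only the dialogue act names (Request, Confirm, Apology) come from the text.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same turns.

    Assumption: the abstract does not say which kappa variant was used;
    this is the standard two-annotator form.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")

    n = len(labels_a)

    # Observed agreement: fraction of turns with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each annotator's label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label for every turn.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy annotations of six dialogue turns (not real corpus data).
annotator_1 = ["Request", "Confirm", "Apology", "Request", "Confirm", "Request"]
annotator_2 = ["Request", "Confirm", "Request", "Request", "Confirm", "Apology"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

In this toy case the annotators agree on four of six turns (observed agreement 0.667) while chance agreement is 0.389, so the score is well below the raw agreement rate. With more than two annotators, a multi-rater statistic such as Fleiss' kappa would be the usual choice instead.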

Published

2026-05-05