Enhancing Traffic Scene Understanding Through Image Captioning and Audio

Authors

  • Sejal Pawar, Student, Department of Information Technology, Bharati Vidyapeeth's College of Engineering for Women, Pune, India.
  • Shruti Mulay, Student, Department of Information Technology, Bharati Vidyapeeth's College of Engineering for Women, Pune, India.
  • Jivani Suryawanshi, Student, Department of Information Technology, Bharati Vidyapeeth's College of Engineering for Women, Pune, India.
  • Vaishnavi Walgude, Student, Department of Information Technology, Bharati Vidyapeeth's College of Engineering for Women, Pune, India.
  • Prof. K. V. Patil, Professor, Department of Information Technology, Bharati Vidyapeeth's College of Engineering for Women, Pune, India.

DOI:

https://doi.org/10.47392/IRJAEM.2024.0349

Keywords:

Text-to-Speech (TTS), Advanced Driver Assistance System (ADAS), You Only Look Once (YOLO), Convolutional Neural Networks (CNN)

Abstract

Navigating large collections of traffic images on the web is a significant challenge, particularly for users searching for specific information. Many images lack captions, making it difficult to locate relevant content. Our project addresses this problem by developing an automated labelling service that generates object-based descriptions and provides auditory cues about object distances, using a combination of computer vision and audio description techniques. Automation in motor vehicles is one such area that is gaining increasing significance and recognition around the world. Our approach leverages a state-of-the-art object detection technique, the YOLO (You Only Look Once) model, to identify and label objects within traffic images. By understanding the content of each image, our service generates labels for the detected objects and provides additional information about their distances from the viewer. To enhance accessibility, the service includes a Text-to-Speech (TTS) engine that voices the descriptions, catering both to users with visual impairments and to those who prefer auditory information. Our research aims to automate the annotation of traffic images, reducing the reliance on human intervention, especially for large databases. We employ a deep neural network architecture comprising the YOLO model for object detection and additional components for distance estimation. By bridging the gap between image content and textual/audio descriptions, our system offers a promising solution for efficiently accessing information within traffic scenes.
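To make the pipeline described above concrete (YOLO detection, per-object distance estimation, and spoken output), the following minimal Python sketch illustrates one way the stages could fit together. It assumes the ultralytics YOLOv8 weights, the pyttsx3 TTS engine, and a simple pinhole-camera distance approximation with hypothetical focal-length and object-height constants; none of these specifics are stated in the abstract, and the paper's actual distance-estimation component may differ.

from ultralytics import YOLO
import pyttsx3

FOCAL_LENGTH_PX = 700          # assumed camera focal length in pixels (hypothetical)
KNOWN_HEIGHTS_M = {            # assumed real-world object heights in metres (hypothetical)
    "person": 1.7,
    "car": 1.5,
    "truck": 3.0,
    "bus": 3.2,
}

def estimate_distance(class_name, bbox_height_px):
    """Pinhole-camera approximation: distance = (real height * focal length) / pixel height."""
    real_height = KNOWN_HEIGHTS_M.get(class_name)
    if real_height is None or bbox_height_px <= 0:
        return None
    return (real_height * FOCAL_LENGTH_PX) / bbox_height_px

def describe_traffic_image(image_path):
    # Detect objects with pretrained YOLOv8 weights (COCO classes).
    model = YOLO("yolov8n.pt")
    result = model(image_path)[0]

    # Build one phrase per detected object, with a distance cue where possible.
    phrases = []
    for box in result.boxes:
        name = model.names[int(box.cls[0])]
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        dist = estimate_distance(name, y2 - y1)
        if dist is not None:
            phrases.append(f"{name} about {dist:.0f} metres ahead")
        else:
            phrases.append(name)
    caption = "Detected: " + ", ".join(phrases) if phrases else "No objects detected."

    # Voice the generated description for auditory accessibility.
    engine = pyttsx3.init()
    engine.say(caption)
    engine.runAndWait()
    return caption

if __name__ == "__main__":
    print(describe_traffic_image("traffic_scene.jpg"))

In this sketch the bounding-box pixel height stands in for apparent object size; a production system would instead calibrate the camera or use stereo/depth sensing, but the overall flow (detect, label, estimate distance, speak) matches the pipeline the abstract outlines.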


Published

2024-07-27