Exploration of a ViT-based multimodal approach to Vehicle Accident Detection

Authors:
Ríos Pérez, Jesús David
Resource type:
Publication date:
2024
Institution:
Universidad del Magdalena
Repository:
Repositorio Unimagdalena
Language:
eng
OAI Identifier:
oai:repositorio.unimagdalena.edu.co:123456789/21215
Online access:
https://repositorio.unimagdalena.edu.co/handle/123456789/21215
Keywords:
Multimodal, Machine Learning, Data Fusion, Deep Learning.
Rights
openAccess
License
Open Access
Description
Summary: Multimodal Deep Learning (MMDL) has emerged as a potent framework for synthesizing information from diverse data sources, enhancing the capability of models to understand and predict complex phenomena. In particular, Vision Transformers (ViT) have shown promising results in processing visual data alongside other modalities for comprehensive analysis. This study investigates the integration of MMDL and ViT for traffic accident detection, addressing the critical need for advanced predictive models in this domain. Through a literature review, we assess the current landscape of MMDL applications and highlight the evolution and challenges of multimodal learning. Building on these insights, we propose a novel MMDL architecture designed to leverage video, audio, and metadata for accurate and timely accident detection. Our methodology combines a structured review of recent MMDL research with a theoretical approach to architecture design, emphasizing the fusion of multimodal data through ViT. The review adheres to established guidelines for systematic reviews and focuses on advancements from 2019 to 2023, while the architecture design is grounded in a thorough analysis of the modalities relevant to traffic incidents. The main contributions are a taxonomy of MMDL methods and a ViT-based architecture for enhancing traffic safety systems. Integrating multimodal data through advanced deep learning models can improve the prediction accuracy of traffic accident detection. This research underscores the potential of MMDL and ViT for developing robust, real-time monitoring systems, marking a step forward in the application of artificial intelligence to public safety and smart city initiatives.
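
The summary describes fusing video, audio, and metadata through a ViT but does not specify the fusion mechanism. As a rough illustration only, the sketch below shows one generic option, token-level fusion: each modality is projected to a shared width, the resulting tokens attend to one another, and a pooled vector feeds a binary accident score. Every name, dimension, and weight here (`fuse_and_score`, `D`, the random matrices, the feature sizes) is a hypothetical placeholder and not the architecture proposed in the thesis; the weights are untrained, so the score is meaningless beyond showing the data flow.

```python
import math
import random


def linear(vec, weights):
    """Project a feature vector with a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Simplification: queries, keys, and values are the tokens themselves
    (identity projections), which a real transformer layer would learn.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out


def random_matrix(rows, cols, rng):
    """Small random weights standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def fuse_and_score(video_patches, audio_feat, meta_feat, params):
    """Token-level fusion: one token per video patch, plus audio and metadata tokens."""
    tokens = [linear(p, params["video"]) for p in video_patches]
    tokens.append(linear(audio_feat, params["audio"]))
    tokens.append(linear(meta_feat, params["meta"]))
    attended = self_attention(tokens)
    d = len(attended[0])
    # Mean-pool all tokens into a single fused representation.
    pooled = [sum(t[i] for t in attended) / len(attended) for i in range(d)]
    logit = sum(w * x for w, x in zip(params["head"], pooled))
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid -> value in (0, 1)


rng = random.Random(0)
D = 8  # shared token width (assumed)
params = {
    "video": random_matrix(D, 16, rng),  # 16 = flattened patch size (assumed)
    "audio": random_matrix(D, 6, rng),   # 6 audio features (assumed)
    "meta": random_matrix(D, 3, rng),    # 3 metadata fields (assumed)
    "head": [rng.uniform(-0.1, 0.1) for _ in range(D)],
}
video_patches = [[rng.random() for _ in range(16)] for _ in range(4)]
audio_feat = [rng.random() for _ in range(6)]
meta_feat = [rng.random() for _ in range(3)]

score = fuse_and_score(video_patches, audio_feat, meta_feat, params)
print(f"untrained, illustrative score: {score:.3f}")
```

Concatenating modality tokens before attention lets every video patch attend to the audio and metadata tokens and vice versa. An alternative would be late fusion, where each modality is scored separately and the scores are combined; which of these the thesis adopts is not stated in this record.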