Exploration of a ViT-based multimodal approach to Vehicle Accident Detection