Contextual information for instance-level video analysis