Multi-GPU distribution of single-batch, time-dependent linear products


Authors:
Margffoy Tuay, Edgar Andrés
Resource type:
Publication date:
2020
Institution:
Universidad de los Andes
Repository:
Séneca: repositorio Uniandes
Language:
eng
OAI Identifier:
oai:repositorio.uniandes.edu.co:1992/48619
Online access:
http://hdl.handle.net/1992/48619
Keywords:
Graphics processing units
Machine learning (Artificial intelligence)
Engineering
Rights
openAccess
License
http://creativecommons.org/licenses/by-nc-nd/4.0/
Description
Summary: Modern approaches to distributed deep learning focus on using more GPU nodes to process more data in parallel, updating the model weights with a distributed gradient update rule across all nodes. The main limitation of this paradigm is that it assumes at least one sample of data fits in a single node. However, that assumption does not hold for large inputs, or when the GPU infrastructure lacks sufficient memory. In this paper, we propose a new operator-level distribution approach tailored to these cases, in which we distribute a single data input across multiple GPU nodes, taking into account the operators involved in a given model. By distributing the original input, we reduce the space complexity of each node, enabling multiple GPUs to process inputs that could not fit on a single node. We validate our approach by distributing dot-product attention, a fundamental operation in modern sequence-to-sequence architectures.
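The abstract does not spell out the paper's exact distribution scheme, but the core idea it describes, splitting a single input so each node holds only part of the attention computation, can be sketched as follows. Because each row of the softmax in dot-product attention depends only on one query vector plus the full keys and values, the query sequence can be partitioned across devices and each shard computed independently. This is a minimal single-process NumPy illustration of that property (the function names `attention` and `sharded_attention` are ours, not from the paper), not the authors' multi-GPU implementation:

```python
import numpy as np

def attention(Q, K, V):
    """Plain dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Numerically stable row-wise softmax.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def sharded_attention(Q, K, V, n_shards):
    """Split Q along the sequence axis; each shard's rows are independent,
    so each (hypothetical) node only materializes a slice of the score
    matrix, reducing its memory footprint."""
    parts = [attention(Q_shard, K, V) for Q_shard in np.array_split(Q, n_shards)]
    return np.concatenate(parts, axis=0)

rng = np.random.default_rng(0)
Q = rng.normal(size=(8, 4))
K = rng.normal(size=(8, 4))
V = rng.normal(size=(8, 4))

# Sharded result matches the single-node computation.
assert np.allclose(attention(Q, K, V), sharded_attention(Q, K, V, n_shards=4))
```

In an actual multi-GPU setting, each shard would live on a separate device and K and V would be replicated or streamed to it; the sketch above only demonstrates why the per-shard results concatenate to the exact full result.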