Multi-GPU distribution of single-batch, time-dependent linear products

Modern approaches to distributed deep learning focus on using more GPU nodes to process more data in parallel, updating the model weights using a distributed gradient update rule across all nodes. The main limitation of this paradigm is that it assumes that at least one sample of data can fit in a s...