Deep Learning with Multiple GPUs on Rescale: TensorFlow Tutorial
Blog post from Rescale
This article explores multi-GPU training with TensorFlow, focusing on data-parallel training in both single-node and multi-node configurations on Rescale's infrastructure. It begins with dataset preparation, converting images into the TFRecords format, first with the smaller Flowers dataset and then scaling up to ImageNet. Training uses the Inception v3 deep neural network architecture on Rescale's MPI-configured clusters, which coordinate the distributed training processes: GPU-based model training, CPU-based model evaluation, and visualization with TensorBoard. The article then walks through creating Rescale jobs for single-node and multi-node configurations, using MPI launch scripts and TensorFlow's distributed training capabilities to manage GPU resources efficiently across multiple nodes. It closes by previewing a follow-up discussion of the performance implications of distributed training across various server configurations.
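The multi-node setup summarized above can be sketched in a few lines. The following is a minimal, hypothetical example, not Rescale's actual configuration: the hostnames, port numbers, and the choice to run the parameter server on the first host are all assumptions. It shows how the list of hosts from an MPI machinefile might be mapped to the cluster dictionary that TensorFlow's `tf.train.ClusterSpec` expects:

```python
# Hypothetical sketch: derive a TensorFlow cluster definition from the
# hostnames in an MPI machinefile (one hostname per line). The ports and
# the ps/worker split are illustrative assumptions, not Rescale specifics.

def machinefile_to_cluster(hostnames, ps_port=2222, worker_port=2223):
    """Map a list of MPI hostnames to a dict suitable for
    tf.train.ClusterSpec: the first host also runs the parameter
    server, and every host runs one worker task."""
    if not hostnames:
        raise ValueError("machinefile contained no hosts")
    return {
        "ps": ["%s:%d" % (hostnames[0], ps_port)],
        "worker": ["%s:%d" % (h, worker_port) for h in hostnames],
    }

if __name__ == "__main__":
    # Stand-in for hosts read from the MPI machinefile.
    hosts = ["node0", "node1", "node2", "node3"]
    cluster = machinefile_to_cluster(hosts)
    print(cluster["ps"])           # one parameter server on node0
    print(len(cluster["worker"]))  # one worker per node
```

In TensorFlow's distributed runtime, a dict like this is passed to `tf.train.ClusterSpec`, and each process then starts a server for its own job name ("ps" or "worker") and task index, so that training can be sharded across the GPUs on every node.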