Deep Learning with Multiple GPUs on Rescale: TensorFlow Tutorial
Blog post from Rescale
This article explores multi-GPU training with TensorFlow, focusing on data-parallel training in both single-node and multi-node configurations on Rescale's infrastructure. It begins with dataset preparation, converting images into the TFRecords format, first with the smaller Flowers dataset and then scaling up to ImageNet. Training uses the Inception v3 deep neural network architecture on Rescale's MPI-configured clusters, which coordinate the distributed training processes: GPU-based model training, CPU-based model evaluation, and visualization with TensorBoard. The article then walks through creating Rescale jobs for single-node and multi-node configurations, using MPI launch scripts and TensorFlow's distributed training capabilities to manage GPU resources efficiently across multiple nodes. It closes by previewing a follow-up discussion of the performance implications of distributed training across various server configurations.
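The multi-node setup summarized above can be sketched in a few lines. The following is a minimal, hypothetical example, not Rescale's actual configuration: the hostnames, port numbers, and the choice to run the parameter server on the first host are all assumptions. It shows how the list of hosts from an MPI machinefile might be mapped to the cluster dictionary that TensorFlow's `tf.train.ClusterSpec` expects:

```python
# Hypothetical sketch: derive a TensorFlow cluster definition from the
# hostnames in an MPI machinefile (one hostname per line). The ports and
# the ps/worker split are illustrative assumptions, not Rescale specifics.

def machinefile_to_cluster(hostnames, ps_port=2222, worker_port=2223):
    """Map a list of MPI hostnames to a dict suitable for
    tf.train.ClusterSpec: the first host also runs the parameter
    server, and every host runs one worker task."""
    if not hostnames:
        raise ValueError("machinefile contained no hosts")
    return {
        "ps": ["%s:%d" % (hostnames[0], ps_port)],
        "worker": ["%s:%d" % (h, worker_port) for h in hostnames],
    }

if __name__ == "__main__":
    # Stand-in for hosts read from the MPI machinefile.
    hosts = ["node0", "node1", "node2", "node3"]
    cluster = machinefile_to_cluster(hosts)
    print(cluster["ps"])           # one parameter server on node0
    print(len(cluster["worker"]))  # one worker per node
```

In TensorFlow's distributed runtime, a dict like this is passed to `tf.train.ClusterSpec`, and each process then starts a server for its own job name ("ps" or "worker") and task index, so that training can be sharded across the GPUs on every node.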