
Machine Learning Papers

Dustin Anderson edited this page Oct 6, 2016 · 1 revision

Papers on distributed machine learning

Large Scale Distributed Deep Networks (Dean et al, 2012):

  • Introduces Downpour SGD (implemented in our code) and Sandblaster L-BFGS, two algorithms for distributed training
  • Describes Google's DistBelief framework, which combines model parallelism within each model replica with data parallelism across replicas
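
The Downpour-style parameter-server pattern can be sketched as a serial simulation: each worker fetches the current parameters, computes a gradient on its own data shard, and pushes an update back to a central server. Everything below (the toy least-squares problem, names, and hyperparameters) is illustrative, not taken from our code or the paper.

```python
import numpy as np

# Serial simulation of the Downpour-style parameter-server pattern:
# each "worker" fetches the current parameters, computes a gradient on
# its own data shard, and pushes the update to a central server. Here
# the server is just a numpy array and the workers run round-robin.

rng = np.random.default_rng(0)

# toy least-squares problem: find w minimizing ||Xw - y||^2
X = rng.normal(size=(100, 5))
true_w = np.arange(5, dtype=float)
y = X @ true_w

params = np.zeros(5)                        # the "parameter server"
shards = np.array_split(np.arange(100), 4)  # one data shard per worker
lr = 0.05

for step in range(200):
    for shard in shards:
        w = params.copy()                   # worker fetches parameters
        g = 2 * X[shard].T @ (X[shard] @ w - y[shard]) / len(shard)
        params -= lr * g                    # worker pushes its update

print(bool(np.allclose(params, true_w, atol=1e-3)))
```

In the real algorithm the fetch and push happen asynchronously over the network, so workers often compute gradients against slightly stale parameters; Dean et al. report that training tolerates this in practice.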

Deep Learning with Elastic Averaging SGD (Zhang et al, 2015):

  • Elastic Averaging SGD algorithm for distributed neural network training (implemented in our code)
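
The elastic-averaging update can be sketched on a toy problem: each worker keeps its own parameter copy and is pulled toward a shared "center" variable by an elastic term, while the center drifts toward the workers. The problem, `alpha`, and learning rate below are illustrative choices, not values from the paper or our code.

```python
import numpy as np

# Sketch of Elastic Averaging SGD: workers take local gradient steps
# plus an elastic pull toward a shared center variable; the center in
# turn moves a little toward each worker after its update.

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
true_w = np.linspace(-1.0, 1.0, 5)
y = X @ true_w

n_workers, lr, alpha = 4, 0.05, 0.1
shards = np.array_split(np.arange(100), n_workers)
workers = [np.zeros(5) for _ in range(n_workers)]
center = np.zeros(5)                        # shared center variable

for step in range(500):
    for i, shard in enumerate(shards):
        g = 2 * X[shard].T @ (X[shard] @ workers[i] - y[shard]) / len(shard)
        workers[i] -= lr * g + alpha * (workers[i] - center)  # local step + elastic pull
        center += alpha * (workers[i] - center)               # center moves toward worker

print(bool(np.allclose(center, true_w, atol=1e-2)))
```

The elastic term lets workers explore away from the center rather than forcing them into lockstep, which the paper argues helps exploration while still keeping the replicas loosely synchronized.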

Faster Asynchronous SGD (Odena, 2016):

  • Modifies asynchronous SGD in order to mitigate the problem of stale gradient updates (could be implemented in our code)
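
One simple staleness-aware rule, shown below, is to damp each update in proportion to how stale its gradient is; this is a generic mitigation for illustration, not necessarily the exact method proposed in the paper. The queueing setup and all names are made up.

```python
import numpy as np

# Serial simulation of stale updates in asynchronous SGD: gradients sit
# in a queue before being applied, so each one was computed against
# parameters that are several versions old. The mitigation shown -
# dividing the step size by (1 + staleness) - is a generic damping
# rule, not necessarily the method of the paper.

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

params = np.zeros(3)
lr = 0.05
version = 0        # number of updates applied so far
pending = []       # queue of (gradient, version it was computed at)

for step in range(600):
    w = params.copy()                        # worker snapshots parameters
    g = 2 * X.T @ (X @ w - y) / len(y)
    pending.append((g, version))
    if len(pending) > 3:                     # updates arrive with a delay
        g, v = pending.pop(0)
        staleness = version - v              # how old is this gradient?
        params -= lr / (1 + staleness) * g   # damp stale updates
        version += 1

print(bool(np.allclose(params, true_w, atol=1e-2)))
```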

Revisiting Distributed Synchronous SGD (Chen et al, 2016):

  • Shows that synchronous SGD can run much faster if, at each step, you wait for only ~95% of worker updates and drop the stragglers (done in our code)
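
The drop-the-stragglers idea can be sketched as follows. Worker speed is simulated with random delays, and the 6-of-8 split, toy problem, and names are illustrative assumptions.

```python
import numpy as np

# Sketch of synchronous SGD with backup workers: at each step, only the
# gradients of the k fastest workers are averaged; the stragglers are
# discarded. Worker compute time is simulated with random delays.

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 4))
true_w = np.array([2.0, -1.0, 0.0, 1.5])
y = X @ true_w

n_workers, k, lr = 8, 6, 0.1                  # wait for the fastest 6 of 8
shards = np.array_split(np.arange(80), n_workers)
params = np.zeros(4)

for step in range(400):
    delays = rng.exponential(size=n_workers)  # simulated compute times
    fastest = np.argsort(delays)[:k]          # stragglers get dropped
    grads = [2 * X[shards[i]].T @ (X[shards[i]] @ params - y[shards[i]])
             / len(shards[i]) for i in fastest]
    params -= lr * np.mean(grads, axis=0)     # one synchronous step

print(bool(np.allclose(params, true_w, atol=1e-3)))
```

Because every applied gradient was computed against the current parameters, this avoids staleness entirely; the cost is throwing away a small fraction of the work each step.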

Theano-MPI: A Theano-Based Distributed Training Framework (Ma et al, 2016):

  • Framework using MPI with Theano
  • Performs fast synchronous training via direct GPU-to-GPU data transfers

Hogwild! A Lock-Free Approach to Parallelizing Stochastic Gradient Descent (Niu et al, 2011):

  • For ML problems that are sparse (as defined in the paper), running asynchronous SGD without locks can provide significant speedups
  • Our code does not operate in lock-free mode but this could be interesting to investigate in the future
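
A toy lock-free run in the Hogwild style: several threads apply sparse SGD updates to a shared weight vector with no locking, and because each example touches only a few coordinates, collisions are rare and the resulting races are tolerable. Everything here (problem, sizes, thread count) is illustrative, and Python threads only approximate true shared-memory parallelism.

```python
import threading
import numpy as np

# Hogwild!-style demo: 4 threads run SGD on a shared vector w with no
# locks. Each example is sparse (3 active features out of 20), so two
# threads rarely write the same coordinate at the same time.

rng = np.random.default_rng(4)
dim, n = 20, 400
rows = [rng.choice(dim, size=3, replace=False) for _ in range(n)]  # active feature indices
vals = [rng.normal(size=3) for _ in range(n)]                      # feature values
true_w = rng.normal(size=dim)
targets = [v @ true_w[r] for r, v in zip(rows, vals)]

w = np.zeros(dim)                      # shared weights, updated without locks
lr = 0.05

def worker(seed):
    local_rng = np.random.default_rng(seed)
    for _ in range(20000):
        i = local_rng.integers(n)
        r, v = rows[i], vals[i]
        err = v @ w[r] - targets[i]    # read possibly-stale coordinates
        w[r] -= lr * err * v           # sparse, lock-free write

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(float(np.mean((w - true_w) ** 2)) < 1e-2)
```

Occasional lost updates from races only slow convergence slightly; the paper shows that for sufficiently sparse problems the lock-free version still converges at essentially the serial rate.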

Dogwild! — Distributed Hogwild for CPU and GPU (Noel and Osindero, 2014):

  • Describes a scheme for running asynchronous distributed SGD in "Hogwild" mode, and its implementation in the Caffe framework

Other ML papers

Distilling the Knowledge in a Neural Network (Hinton et al, 2015):

  • After training a large network, its knowledge can be 'distilled' into a smaller network by training the smaller one on the large network's softened output distribution (its logits passed through a high-temperature softmax)
  • Can train 'specialist' models to distinguish easily-confused categories when training a classifier on a large number of categories
  • These ideas could be relevant if we train a large model and want to use it to make predictions fast (e.g. at the HLT where the time per event is limited)
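
The softened-softmax objective can be written down directly: the student is trained to match the teacher's class probabilities after the logits are divided by a temperature T. The functions and the logits below are made up for illustration.

```python
import numpy as np

# Distillation objective sketch: cross-entropy between the teacher's
# and student's softened (temperature-scaled) softmax distributions.

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T    # divide logits by temperature
    z -= z.max()                          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=4.0):
    # Hinton et al. also scale this term by T^2 when mixing it with the
    # usual hard-label loss, so the two stay comparable as T changes.
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return -np.sum(p_teacher * np.log(p_student))

teacher = [10.0, 5.0, 1.0]                # confident (made-up) teacher logits
matched = distillation_loss(teacher, teacher)
mismatched = distillation_loss([1.0, 2.0, 3.0], teacher)
print(matched <= mismatched)
```

A high temperature spreads probability mass onto the wrong-but-plausible classes, which is exactly the "dark knowledge" the small network is meant to absorb.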

ADADELTA: An Adaptive Learning Rate Method (Zeiler, 2012):

  • This documents the Adadelta algorithm for parameter learning rate adaptation
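
Per parameter, Adadelta keeps decaying averages of squared gradients and squared updates and scales each step by the ratio RMS[Δx] / RMS[g], so no global learning rate is needed. A minimal sketch on a toy quadratic, with the paper's decay rate rho and conditioning constant eps; the function name is illustrative.

```python
import numpy as np

# Adadelta update sketch (Zeiler, 2012): per-parameter step sizes are
# derived from running averages of squared gradients and squared
# updates, with no hand-tuned global learning rate.

def adadelta_minimize(grad_fn, x0, rho=0.95, eps=1e-6, steps=5000):
    x = np.asarray(x0, dtype=float)
    eg2 = np.zeros_like(x)     # running average of squared gradients
    edx2 = np.zeros_like(x)    # running average of squared updates
    for _ in range(steps):
        g = grad_fn(x)
        eg2 = rho * eg2 + (1 - rho) * g * g
        dx = -np.sqrt(edx2 + eps) / np.sqrt(eg2 + eps) * g
        edx2 = rho * edx2 + (1 - rho) * dx * dx
        x += dx
    return x

# minimize f(x) = sum((x - 3)^2); the gradient is 2 * (x - 3)
x_min = adadelta_minimize(lambda x: 2 * (x - 3.0), np.zeros(4))
print(bool(np.allclose(x_min, 3.0, atol=0.1)))
```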

ImageNet Classification with Deep Convolutional Neural Networks (Krizhevsky, Sutskever, and Hinton, 2012):

  • This paper covers several useful concepts in convnet construction, including dropout, ReLU activation units, local response normalization (a form of lateral inhibition between nodes), and training-dataset augmentation
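
Two of those ingredients can be shown in isolation: the ReLU activation and dropout. The sketch below uses "inverted" dropout (rescaling at training time), a common later variant; Krizhevsky et al. instead scaled activations at test time. Plain numpy, illustrative only.

```python
import numpy as np

# ReLU and (inverted) dropout as standalone numpy functions.

rng = np.random.default_rng(5)

def relu(x):
    return np.maximum(x, 0.0)           # max(0, x), elementwise

def dropout(x, p_drop=0.5, train=True):
    if not train:
        return x                        # no-op at test time
    mask = rng.random(x.shape) >= p_drop
    return x * mask / (1.0 - p_drop)    # rescale so the expected output equals x

h = relu(np.array([-1.0, 0.5, 2.0]))
print(h.tolist())
```

Dropout randomly silences units during training, which discourages co-adaptation; ReLU avoids the vanishing gradients of saturating activations and is cheap to compute.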

Training Deep Neural Networks with Low Precision Multiplications (Courbariaux et al, 2015):

  • Demonstrates that neural network training can be sped up by reducing the precision of each operand before every multiplication during forward- and back-propagation
  • Could be another interesting optimization to investigate
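
The core idea can be sketched as quantizing values to a fixed-point format with a given number of fractional bits before each multiply. The function names and the 10-bit choice below are illustrative, not from the paper.

```python
import numpy as np

# Low-precision multiplication sketch: round both operands to a
# fixed-point grid (here, 10 fractional bits) before the matmul, and
# compare against the full-precision result.

def to_fixed_point(x, frac_bits=10):
    scale = 2.0 ** frac_bits
    return np.round(x * scale) / scale   # round-to-nearest quantization

def low_precision_matmul(a, b, frac_bits=10):
    return to_fixed_point(a, frac_bits) @ to_fixed_point(b, frac_bits)

rng = np.random.default_rng(6)
a, b = rng.normal(size=(4, 8)), rng.normal(size=(8, 4))
exact = a @ b
approx = low_precision_matmul(a, b)
max_err = float(np.max(np.abs(exact - approx)))
print(max_err < 0.05)
```

On real low-precision hardware the multiply itself is cheaper; this sketch only reproduces the rounding error, which is the quantity the paper shows training can tolerate.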
