Machine Learning Papers
Large Scale Distributed Deep Networks (Dean et al, 2012):
- Downpour SGD (implemented in our code) and Sandblaster L-BFGS algorithms for distributed training
- Discusses Google's DistBelief framework for model parallelization
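A minimal single-process sketch of the Downpour SGD idea: workers fetch parameters from a central server, compute gradients on their own shard, and push updates back asynchronously, so gradients are often computed on stale parameters. The toy objective f(x) = (x - 3)^2 and all constants are illustrative, not taken from our code or the paper:

```python
# Downpour-style asynchronous SGD, simulated in one process.
# Staleness is modelled by letting workers re-fetch parameters
# only every few steps instead of before every gradient.

def grad(x):
    return 2.0 * (x - 3.0)           # gradient of the toy objective (x - 3)^2

server_x = 0.0                       # parameter held by the "server"
lr = 0.02
fetch_every = 5                      # steps between parameter fetches

workers = [{"x": server_x} for _ in range(2)]
for step in range(100):
    for w in workers:
        if step % fetch_every == 0:  # periodic fetch -> stale in between
            w["x"] = server_x
        g = grad(w["x"])             # gradient at possibly stale parameters
        server_x -= lr * g           # asynchronous push to the server

print(round(server_x, 3))
```

Despite the staleness, the server parameter still converges close to the optimum for this well-behaved objective, which is the behaviour the paper exploits.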
Deep Learning with Elastic Averaging SGD (Zhang et al, 2015):
- Elastic Averaging SGD algorithm for distributed neural network training (implemented in our code)
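A sketch of the EASGD update rule: each worker does local SGD on its own objective while an elastic penalty rho * (x_i - center) pulls the workers and a shared center variable toward each other. The per-worker quadratic objectives below are illustrative stand-ins for different data shards:

```python
# Elastic Averaging SGD on a toy problem: two workers see different
# "data" (different quadratic targets); the center variable converges
# to a consensus between them.

def grad(x, target):
    return 2.0 * (x - target)         # gradient of (x - target)^2

lr, rho = 0.05, 0.5
center = 0.0                          # the shared center variable
workers = [0.0, 0.0]
targets = [2.0, 4.0]                  # each worker sees different data

for _ in range(500):
    for i, x in enumerate(workers):
        # local SGD step plus elastic pull toward the center
        workers[i] = x - lr * (grad(x, targets[i]) + rho * (x - center))
    # symmetric update moves the center toward the workers
    center += lr * rho * sum(x - center for x in workers)

print(round(center, 2))
```

At the fixed point the center sits at the average of the workers' pulls (here 3.0), while each worker stays slightly displaced toward its own data; that controlled exploration is the point of the elastic penalty.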
Faster Asynchronous SGD (Odena, 2016):
- Modifies asynchronous SGD to mitigate the problem of stale gradient updates (could be implemented in our code)
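One common staleness mitigation (not necessarily the paper's exact scheme, which should be checked against the text) is to damp each asynchronous update by how many steps old the worker's parameters are. A toy sketch, with an illustrative objective:

```python
# Staleness-aware asynchronous SGD: updates computed from older
# parameters get a proportionally smaller step size.

def grad(x):
    return 2.0 * (x - 3.0)            # gradient of the toy objective (x - 3)^2

server_x, lr = 0.0, 0.2
fetch_every = 5
worker = {"x": 0.0, "fetched_at": 0}

for clock in range(200):
    if clock % fetch_every == 0:
        worker["x"] = server_x        # worker refreshes its parameters
        worker["fetched_at"] = clock
    staleness = clock - worker["fetched_at"]
    g = grad(worker["x"])
    server_x -= lr / (1 + staleness) * g   # damp stale updates

print(round(server_x, 2))
```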
Revisiting Distributed Synchronous SGD (Chen et al, 2016):
- Shows that synchronous SGD can run much faster if the server waits for updates from only the fastest 95% of workers at each step, discarding the stragglers' gradients (done in our code)
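The backup-worker idea can be sketched as follows; worker finish times are synthetic, and in real training each worker would also hold a different minibatch (here they share one toy objective for simplicity):

```python
# Synchronous SGD with backup workers: average gradients from the
# first k of n workers to finish and drop the stragglers.
import random

random.seed(0)
n, k = 20, 19                        # wait for 95% of 20 workers
x, lr = 0.0, 0.1

def grad(x):
    return 2.0 * (x - 3.0)           # gradient of (x - 3)^2

for step in range(200):
    # simulate per-worker completion times for this step
    times = [random.expovariate(1.0) for _ in range(n)]
    fastest = sorted(range(n), key=lambda i: times[i])[:k]
    # average gradients from the k fastest workers only
    g = sum(grad(x) for i in fastest) / k
    x -= lr * g

print(round(x, 3))
```

The step time becomes the k-th order statistic of the worker times rather than the maximum, which is where the speedup comes from.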
Theano-MPI: A Theano-Based Distributed Training Framework (Ma et al, 2016):
- Framework using MPI with Theano
- Performs fast synchronous training via direct GPU-to-GPU data transfers
Hogwild! A Lock-Free Approach to Parallelizing Stochastic Gradient Descent (Niu et al, 2011):
- For ML problems that are sparse (as defined in the paper), running asynchronous SGD without locks can provide significant speedups
- Our code does not operate in lock-free mode but this could be interesting to investigate in the future
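A toy illustration of the Hogwild! setting: several threads update a shared parameter vector with no locking at all. Each update here touches only a couple of coordinates, which is the sparsity assumption the paper relies on to make the races harmless (the coordinates and targets are illustrative):

```python
# Hogwild!-style lock-free SGD: threads do racy read-modify-write
# updates on a shared list. Because each thread's updates are sparse
# (few coordinates), collisions are rare and convergence survives.
import threading

params = [0.0, 0.0, 0.0, 0.0]        # shared, updated without locks
targets = [1.0, 2.0, 3.0, 4.0]
lr = 0.1

def worker(coords):
    for _ in range(200):
        for j in coords:             # sparse: touch only a few coordinates
            g = 2.0 * (params[j] - targets[j])
            params[j] -= lr * g      # no lock around this update

threads = [threading.Thread(target=worker, args=([0, 1],)),
           threading.Thread(target=worker, args=([2, 3],))]
for t in threads: t.start()
for t in threads: t.join()

print([round(p, 2) for p in params])
```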
Dogwild! — Distributed Hogwild for CPU and GPU (Noel and Osindero, 2014):
- Describes a scheme for running asynchronous distributed SGD in "Hogwild" mode, and its implementation in the Caffe framework
Distilling the Knowledge in a Neural Network (Hinton et al, 2015):
- After training a large network, its knowledge can be 'distilled' into a smaller network by training the smaller one to match the large network's softened output distribution (the softmax over its logits at a raised temperature)
- Can train 'specialist' models to distinguish easily-confused categories when training a classifier on a large number of categories
- These ideas could be relevant if we train a large model and want to use it to make predictions fast (e.g. at the HLT where the time per event is limited)
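The core mechanism is the temperature-softened softmax that produces the 'soft targets' the student trains on; raising the temperature exposes the teacher's view of which wrong classes are plausible. The logit values below are made up for illustration:

```python
# Temperature-softened softmax, as used for distillation targets.
import math

def softmax(logits, T=1.0):
    # softmax with temperature: higher T gives softer probabilities
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

teacher_logits = [10.0, 5.0, 1.0]      # illustrative teacher outputs

hard = softmax(teacher_logits, T=1.0)  # nearly one-hot
soft = softmax(teacher_logits, T=4.0)  # soft targets for the student
print([round(p, 3) for p in hard])
print([round(p, 3) for p in soft])
```

At T=1 almost all the mass sits on the top class; at T=4 the runner-up classes receive visible probability, which is the extra signal the student learns from.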
ADADELTA: An Adaptive Learning Rate Method (Zeiler, 2012):
- Documents the Adadelta algorithm, which adapts a per-parameter learning rate from running averages of past gradients and updates
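A sketch of the Adadelta update on a toy quadratic: exponential moving averages of squared gradients and squared updates set the step size, so no global learning rate has to be tuned (the objective and iteration count are illustrative):

```python
# Adadelta: step size is the ratio of RMS(past updates) to
# RMS(past gradients), accumulated with decay rho.
import math

rho, eps = 0.95, 1e-6
Eg2, Edx2 = 0.0, 0.0                 # running averages
x = 0.0

def grad(x):
    return 2.0 * (x - 3.0)           # gradient of (x - 3)^2

for _ in range(2000):
    g = grad(x)
    Eg2 = rho * Eg2 + (1 - rho) * g * g
    # update scaled by the ratio of the two RMS values
    dx = -math.sqrt(Edx2 + eps) / math.sqrt(Eg2 + eps) * g
    Edx2 = rho * Edx2 + (1 - rho) * dx * dx
    x += dx

print(round(x, 2))
```

Note the characteristic slow start: with Edx2 initialized to zero the first steps are tiny, then the step size builds up on its own.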
ImageNet Classification with Deep Convolutional Neural Networks (Krizhevsky, Sutskever, and Hinton, 2012):
- Covers several useful concepts in convnet construction, including dropout, ReLU activation units, local response normalization (lateral inhibition between nodes), and training dataset augmentation
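Two of those ingredients are easy to show in isolation: the ReLU activation and (inverted) dropout applied at training time. Vector sizes and the dropout rate below are illustrative:

```python
# ReLU and inverted dropout on a toy activation vector.
import random

random.seed(1)

def relu(v):
    return [max(0.0, x) for x in v]

def dropout(v, p=0.5):
    # inverted dropout: zero each unit with prob p, rescale survivors
    # by 1/(1-p) so the expected activation is unchanged
    return [0.0 if random.random() < p else x / (1 - p) for x in v]

h = relu([-1.0, 0.5, 2.0, -0.3])
print(h)                              # negatives clipped to zero
print(dropout(h, p=0.5))              # roughly half zeroed, rest doubled
```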
Training Deep Neural Networks with Low Precision Multiplications (Courbariaux et al, 2015):
- Demonstrates that neural network training can be sped up by decreasing the precision of each parameter prior to each multiplication during back- and forward-propagation
- Could be another interesting optimization to investigate
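A sketch of the low-precision idea on a one-weight linear model: the weight is rounded to a fixed-point value before each multiplication in the forward and backward pass, while a full-precision master copy accumulates the updates. The bit-width, model, and data are illustrative:

```python
# Low-precision multiplications with a full-precision accumulator.

def to_fixed(x, frac_bits=8):
    # round to the nearest multiple of 2**-frac_bits
    scale = 2 ** frac_bits
    return round(x * scale) / scale

w = 0.0                               # full-precision master weight
lr = 0.1
data = [(1.0, 3.0), (2.0, 6.0)]       # samples from y = 3x

for _ in range(200):
    for x, y in data:
        wq = to_fixed(w)              # low-precision copy for the math
        err = wq * x - y              # forward pass uses quantized weight
        g = err * x                   # backward pass
        w -= lr * g                   # update the full-precision copy

print(round(w, 2))
```

The learned weight lands within the quantization resolution of the true value, illustrating why the precision of the multiplications can be cut without wrecking training.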