Born Again Neural Networks

Abstract

- revisit knowledge distillation (KD) with a different objective: improve the student's accuracy rather than compress the model
- train students that are parameterized identically to their teachers
- state-of-the-art performance on CIFAR-100: validation error of 15.5% (single model), 14.9% (ensemble)
- also investigate KD between architectures that differ but have comparable capacity (between DenseNets and ResNets)

Introduction
- initialize a new student and train it with the dual goals of predicting the correct labels and matching the output distribution of the teacher, which leads the students toward better local minima
- call these students Born Again Networks (BANs) and show that, applied to DenseNets, they reach lower validation error than their teachers
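
The dual goal above (predict the correct label and match the teacher's output distribution) can be sketched as a sum of two cross-entropy terms. This is a minimal illustration, not the paper's exact objective: the function names and the `label_weight` / `teacher_weight` parameters are assumptions introduced here, and logits are plain Python lists for a single example.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy H(target, predicted) for one example."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, predicted_probs))


def born_again_loss(student_logits, teacher_logits, label,
                    label_weight=1.0, teacher_weight=1.0):
    """Hypothetical combined loss: label term plus teacher-matching term.

    The two weights are assumptions for illustration; the notes do not say
    how (or whether) the two goals are weighted.
    """
    student_probs = softmax(student_logits)
    teacher_probs = softmax(teacher_logits)
    one_hot = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    label_term = cross_entropy(one_hot, student_probs)          # predict the correct label
    teacher_term = cross_entropy(teacher_probs, student_probs)  # match the teacher's distribution
    return label_weight * label_term + teacher_weight * teacher_term
```

Setting `label_weight=0.0` gives a student trained on the teacher's outputs alone, which is one way to separate the two sources of supervision in a toy experiment.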

Method
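- based on the empirical finding (from KD) that the solution \theta_1^* learned from label supervision alone can be improved upon: the teacher's output distribution f(x, \theta_1^*) gives a richer training signal, leading to a second solution \theta_2^* that generalizes better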
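- each trained student serves as the teacher for the next generation
- Sequence of Teaching Selves / Born Again Networks Ensemble
  - sequence: the k-th model is trained with knowledge transferred from the (k-1)-th student
  - ensemble (BANE): average the predictions of multiple generations of BANs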
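
The sequence and ensemble steps can be sketched as follows. This is a rough outline under stated assumptions: `train_fn` is a hypothetical callable that takes the previous model (or `None` for the first generation, which sees labels only) and returns a trained model, and each model is assumed to be a callable mapping an input to a list of class probabilities.

```python
def train_born_again_sequence(train_fn, num_models):
    """Train num_models generations; each one is taught by its predecessor.

    train_fn(teacher) is a hypothetical training routine: teacher is None
    for the first generation and the previous model afterwards.
    """
    models = []
    teacher = None
    for _ in range(num_models):
        student = train_fn(teacher)  # k-th model learns from the (k-1)-th
        models.append(student)
        teacher = student            # the student becomes the next teacher
    return models


def ensemble_predict(models, x):
    """Average the class probabilities predicted by the given models."""
    per_model = [model(x) for model in models]
    num_classes = len(per_model[0])
    return [sum(p[c] for p in per_model) / len(per_model)
            for c in range(num_classes)]
```

Which generations go into the ensemble is left to the caller: `ensemble_predict` simply averages whatever list of models it receives.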

Experiments
- DenseNets Born Again as ResNets
- Baseline: Wide-ResNet and bottleneck-ResNet, built to match the output shape of DenseNet-90-60 at each block
- BAN-ResNet: the student shares the first and last layers with its teacher (DenseNet-90-60)
- Result