Speaker: Ziv Goldfeld
Abstract: This talk will discuss the flow of information and the evolution of internal representations during deep neural network (DNN) training, aiming to demystify and mathematically explain the `compression' aspect of the Information Bottleneck theory. The theory suggests that DNN training comprises a rapid fitting phase followed by a slower compression phase, in which the mutual information I(X;T) between the input X and internal representations T decreases. Several papers observe compression of estimated mutual information on different DNN models, but the true I(X;T) over these networks is provably either constant (discrete X) or infinite (continuous X). We will explain this discrepancy between theory and experiments, and clarify what the estimated mutual information curves from past works were actually tracking.
To this end, an auxiliary (noisy) DNN framework will be introduced, in which I(X;T) is a meaningful quantity that depends on the network's parameters. This noisy framework is a good proxy for the original (deterministic) system both in terms of performance and the learned representations. To accurately track I(X;T) over noisy DNNs, a differential entropy estimator tailored to exploit the DNN's layered structure will be proposed and theoretical guarantees on its performance will be provided. Specifically, the estimator will be shown to be minimax rate optimal for the considered high-dimensional functional estimation problem, with sharp dependence of the convergence rate both on the number of samples and the problem's dimension. Using this estimator along with a certain analogy to an information-theoretic communication problem, we will unveil the geometric mechanism that drives compression of I(X;T) in noisy DNNs. Based on these findings, we will circle back to deterministic networks and demonstrate that the past observations of compression were in fact tracking the same geometric phenomenon. This finding will be leveraged to provide evidence for refuting the Information Bottleneck theory claim that more compression leads to better generalization. Future research directions inspired by this study aiming to design improved deep learning mechanism and facilitate a comprehensive information-theoretic understanding of the topic will also be discussed.