1 Introduction
Following an influential paper by Zhang et al. (2017), the basic question of why neural networks generalise
is generally regarded as open. The authors demonstrate a surprising fact: deep learning generalises even when the neural network is expressive enough to represent functions that do not generalise. In turn, this implies that the theory of
Vapnik & Chervonenkis (1971)—based on uniform convergence of train error to population error—does not explain generalisation in neural networks. Several hypotheses have since been proposed to fill this theoretical vacuum. One prominent hypothesis states that, while most neural network solutions do not generalise well, there is an implicit bias in the kinds of functions returned by gradient descent (Soudry et al., 2018). In sharp contrast, a second hypothesis states that the solution space of a neural network is dominated by simple functions, while the complex kinds of functions that overfit are relatively rare (Valle-Pérez et al., 2019).
This latter hypothesis dovetails with a particular "PAC-Bayesian" theorem of McAllester (1998), which bounds the average population error of all classifiers consistent with a training sample. If this bound is small, then the complex functions that overfit must indeed be rare. In more technical terms, the PAC-Bayesian theorem can provide a meaningful certificate of generalisation even for machine learning models with an infinite Vapnik–Chervonenkis dimension, or an arbitrarily large number of parameters, by properly accounting for the measure of those very complex functions.
This paper takes a more nuanced position between these two hypotheses. While the average population error of all neural networks that fit a training sample is found to be good (and certifiably good by the PAC-Bayesian theorem), it is still possible for certain networks with special properties to perform substantially better than average (and likewise substantially better than the PAC-Bayes bound). Moreover, gradient descent may be used to specifically target these special networks. This subtly counters a position put forward by Mingard et al. (2021), which suggests that gradient descent may be well modelled as sampling randomly from a particular Bayesian posterior distribution.
To support these claims, a careful study of the behaviour of both infinite width and finite width neural networks is conducted. In particular, this paper makes the following technical contributions:


Section 3 (Implicit Bias of Architecture): A purely analytical PAC-Bayes bound (Theorem 2) on the population error of the neural network–Gaussian process (NNGP) posterior for binary classification is derived. The bound furnishes an interpretable measure of model complexity that depends on both the architecture and the training data. This exact analytical bound improves upon an approximate computational approach due to Valle-Pérez et al. (2019).

Section 4 (Testing the Bound): The new bound is found to be both non-vacuous and correlated with the test error of finite width multilayer perceptrons trained by gradient descent. This provides supporting evidence for the important role of architecture in generalisation, as put forward by Valle-Pérez et al. (2019). Still, a gap exists with the performance of gradient descent.
Section 5 (Implicit Bias of Gradient Descent): Going further beyond the work of Valle-Pérez et al. (2019), a new theoretical tool (Theorem 3) is developed to enable consistent estimation of the average error of the NNGP posterior on a given holdout set. The average is found to be significantly worse than the holdout performance of Gaussian process draws with large margin. The experiment is repeated for finite width neural networks trained by gradient descent, and the same qualitative phenomenon persists. This finding demonstrates the ability of gradient descent to select large margin functions with vanishing posterior probability that nonetheless generalise significantly better than the posterior average.
2 Related Work
Margin-based generalisation theory
A rich body of work explores the connection between margin and generalisation. For instance, Bartlett et al. (2017) propose a margin-based complexity measure for neural networks derived via Rademacher complexity analysis, and Neyshabur et al. (2018) derive a similar result via PAC-Bayes analysis. Neyshabur et al. (2019) test these bounds experimentally, finding them to be vacuous and to scale poorly with network width. Biggs & Guedj (2021) further develop this style of bound. Margin-based PAC-Bayes bounds go back at least to the work of Herbrich (2001), who derived such a bound for linear classification. The standard idea is that a solution with large margin implies the existence of nearby solutions with the same training error (but perhaps smaller margin), facilitating a more spread-out PAC-Bayes posterior. This style of margin-based PAC-Bayes bound does not provide guidance on whether the original large margin classifier should generalise better or worse than the additional nearby classifiers included in the posterior.
Non-vacuous bounds for neural networks
Aside from PAC-Bayes, many styles of generalisation bound for neural networks are vacuous (Dziugaite & Roy, 2017). Even many PAC-Bayes bounds are vacuous (Achille & Soatto, 2018; Foret et al., 2021) due to their choice of PAC-Bayes prior. Dziugaite & Roy (2017) construct a non-vacuous weight space PAC-Bayes bound by iteratively optimising both the PAC-Bayes prior and posterior by gradient descent. Wu et al. (2020) extend this approach to involve Hessians. Meanwhile, Valle-Pérez et al. (2019) instantiate a non-vacuous PAC-Bayes bound in the function space of neural networks via the NNGP correspondence (Neal, 1994; Lee et al., 2018). The evaluation of this bound involves an iterative and statistically inconsistent expectation-propagation approximation. Unlike this paper's purely analytical bound in Theorem 2, these non-vacuous bounds are all evaluated by iterative computational procedures.
Implicit bias of gradient descent
Much research has gone into the role that gradient descent plays in neural network generalisation. For instance, Zhang et al. (2017) suggest that gradient descent may converge to solutions with special generalisation properties, while Wilson et al. (2017) suggest that gradient descent may have a more favourable implicit bias than the Adam optimiser. Azizan et al. (2021) make a related study in the context of mirror descent. Keskar et al. (2017) point to an implicit bias present in minibatch stochastic gradient descent, but Geiping et al. (2021) argue against the importance of stochasticity. Meanwhile, Soudry et al. (2018) observe how gradient descent combined with certain loss functions converges to max-margin solutions; Wei et al. (2018) make a similar observation. On a different tack, Mingard et al. (2021), Valle-Pérez & Louis (2020) and Valle-Pérez et al. (2019) argue that the implicit bias of neural architecture is more important than that of gradient descent for understanding generalisation, and that so long as gradient descent does not select solutions pathologically, generalisation should occur.
3 Implicit Bias of Architecture
This section derives an analytical bound on the population error of an infinitely wide neural network averaged over all weight configurations that attain 0% train error—in other words, the average population error of the NNGP posterior. Since this quantity assesses the performance of all solutions rather than just those returned by gradient descent, it is intended to measure the implicit bias of architecture. In Section 4, the bound is found to be non-vacuous for multilayer perceptrons. The source of a substantial gap with the results of gradient descent training is investigated in Section 5, and ultimately attributed to an additional implicit bias of gradient descent: namely, margin.
3.1 Preliminary Notation
Consider a training dataset of input–label pairs: the inputs are embedded in Euclidean space, while to simplify the analysis the paper shall restrict to binary labels . A classifier
shall depend on a weight vector
, where denotes the weight space. In particular, for an input the prediction is given by . One is interested in the relationship between the train error of a classifier on a dataset and the population error on a data distribution . For a weight vector , these are defined as:
(1) 
The solution space denotes those classifiers that attain zero train error.
3.2 Geometry of Solutions in Function Space
The function space is defined to be the set of outputs that the classifier can realise on dataset :
(2) 
In weight space the geometry of solutions can be arbitrarily complicated, but in function space the solutions occupy the single orthant picked out by the binary training labels.
To measure the volume of the solution space, one can define a measure on weight space. It is convenient to enforce so that
is a probability measure. In the absence of a better alternative,
is usually set to a multivariate Gaussian or uniform distribution. The measure
on weight space induces a measure on function space via the relation:
(3) 
The volume of solutions is denoted , and can be computed either in weight or function space. In function space, is an orthant probability. The situation is visualised in Figure 1.
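As a toy illustration of this volume computation (a hypothetical sketch, not the paper's experimental setup), the volume of solutions can be estimated directly in weight space: sample weights from a Gaussian prior and count the fraction of sampled networks whose outputs land in the orthant picked out by the training labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny binary dataset: 3 points in R^2 with labels in {-1, +1}.
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
y = np.array([1, 1, -1])

def mlp(w, X):
    """One-hidden-layer tanh network with 4 hidden units; w is a flat weight vector."""
    W1 = w[:8].reshape(2, 4)
    W2 = w[8:]
    return np.tanh(X @ W1) @ W2

# Weight-space estimate of the volume of solutions: the probability, under a
# Gaussian prior on weights, that the sampled network attains zero train error.
n_samples = 20000
hits = 0
for _ in range(n_samples):
    w = rng.standard_normal(12)
    f = mlp(w, X)
    if np.all(np.sign(f) == y):  # outputs land in the orthant picked out by the labels
        hits += 1

volume = hits / n_samples
print(f"estimated volume of solutions: {volume:.3f}")
```

The same quantity could equivalently be estimated in function space by sampling network outputs and checking the orthant condition, which is the view taken in the remainder of the section.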
3.3 PAC-Bayes Theory
PAC-Bayes relates the volume of solutions to their average population error. The following result was derived by Valle-Pérez et al. (2019) as a corollary of a theorem due to Langford & Seeger (2001). The result is similar in form to a theorem of McAllester (1998).
Theorem 1 (A PAC-Bayesian theorem).
First, fix a probability measure over the weight space of a classifier. Let denote a training set of datapoints sampled iid from the data distribution and let denote the corresponding solution space. Consider the population error of weight setting , and its average over the solution space . Then, for a proportion of draws of the training set ,
(4) 
For large , the term is negligible and the result says that the population error averaged over solutions is less than the ratio of the negative log volume of solutions to the number of training points.
3.4 Volume of Solutions via Gaussian Orthant Probabilities
Since infinitely wide neural networks induce a Gaussian measure on function space, computing the volume of the solution space amounts to computing a Gaussian orthant probability.
In more detail, let denote a measure on the weight space of an infinitely wide neural network satisfying the conditions of the NNGP correspondence—see Theorem 4 in Appendix A.3 for an example. Then for a weight vector , the network outputs on a training set are distributed:
(5) 
with covariance .
Therefore, under the NNGP correspondence, the volume of solutions computed in function space is just the Gaussian probability of the orthant picked out by the binary training labels:
(6) 
To facilitate estimating and bounding this probability, this paper derives the following lemma.
Lemma 1 (Gaussian orthant probability).
For a covariance matrix , and a binary vector , let denote the corresponding Gaussian orthant probability:
(7) 
Letting denote the identity matrix, the elementwise product, and the elementwise absolute value, may be equivalently expressed as:
(8) 
and may be bounded as follows:
(9) 
The proof is given in Appendix A.2. Equation 8 yields an unbiased Monte Carlo estimator of Gaussian orthant probabilities, and Inequality 9 yields a lower bound. The complexity measures and are defined for later use.
To gain intuition about the lemma, observe that is the orthant probability for an isotropic Gaussian. Depending on the degree of anisotropy inherent in the covariance matrix , Equation 8 captures how the orthant probability may either be exponentially amplified or suppressed compared to .
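For intuition, the isotropic baseline can be checked by brute force: draw functions from the Gaussian and count the fraction landing in the labelled orthant. The sketch below is illustrative code, distinct from the reweighted estimator of Equation 8, and confirms that the isotropic orthant probability is 2 to the power of minus n.

```python
import numpy as np

def orthant_probability(K, y, n_samples=100_000, seed=0):
    """Brute-force Monte Carlo estimate of the probability that a draw
    f ~ N(0, K) lands in the orthant with sign pattern y,
    i.e. sign(f_i) = y_i for every i."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(K + 1e-10 * np.eye(len(y)))  # jitter for stability
    f = rng.standard_normal((n_samples, len(y))) @ L.T  # samples from N(0, K)
    return np.mean(np.all(np.sign(f) == y, axis=1))

# Isotropic case: all 2^n orthants are equally likely.
n = 4
y = np.array([1, -1, 1, 1])
p = orthant_probability(np.eye(n), y)
print(p)  # close to 2^-4 = 0.0625
```

Note that this naive estimator needs on the order of 2^n samples to register even a single hit, which is precisely why an unbiased reweighted estimator and an analytical lower bound are valuable at realistic training set sizes.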
As an aside, using Inequality 9 to lower bound Equation 6 has a Bayesian interpretation. Since the volume of solutions may be written , it may be interpreted as the Bayesian evidence for the network architecture under the likelihood function . A Bayesian would then refer to Inequality 9 as an evidence lower bound.
Theorem 2 (Upper bound on the average population error of an infinitely wide neural network).
First, fix a probability measure over the weight space of an infinitely wide neural network. Let denote a training set of datapoints sampled iid from the data distribution , let denote the binary vector of training labels, and let denote the corresponding solution space. Consider the population error of weight setting , and its average over the solution space . Let denote the NNGP covariance matrix (Equation 5). Then, for a proportion of draws of the training set ,
(10) 
where the complexity measures and are defined in Lemma 1.
Since the complexity measure is an analytical function of the NNGP covariance matrix and the binary vector of training labels , Theorem 2 is an analytical generalisation bound for the NNGP posterior. For large , the result simplifies to: . So the bound depends on the ratio of complexity measures and to the number of datapoints .
To gain further intuition about Theorem 2, consider two special cases. First, suppose that the neural architecture induces no correlation between any pair of distinct data points, such that . Then and the bound is worse than chance. This corresponds to pure memorisation of the training labels. Next, suppose that the neural architecture induces strong intraclass correlations and strong interclass anticorrelations, such that . Although this is singular, it may be seen directly that . Then by Theorem 1, which is much better than chance for large . This corresponds to pure generalisation
from the training labels. Interpolating between these two extremes would suggest that a good neural architecture would impose a prior on functions with strong intraclass and weak interclass correlations.
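The two extremes can also be checked numerically. In this hypothetical sketch, the singular covariance of the generalisation extreme is regularised slightly to keep it positive definite; the uncorrelated covariance yields an orthant probability of 2^-n, while the strongly correlated covariance yields roughly one half.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
y = rng.choice([-1.0, 1.0], size=n)

def orthant_prob(K, y, n_samples=200_000, seed=1):
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(K + 1e-9 * np.eye(len(y)))
    f = rng.standard_normal((n_samples, len(y))) @ L.T
    return np.mean(np.all(np.sign(f) == y, axis=1))

# Memorisation extreme: no correlation between data points, so every one of
# the 2^n orthants is equally likely.
p_memorise = orthant_prob(np.eye(n), y)

# Generalisation extreme: perfect intra-class correlation and inter-class
# anti-correlation (regularised slightly to stay positive definite). A single
# shared Gaussian draw then sets all n signs at once.
p_generalise = orthant_prob(np.outer(y, y) + 1e-4 * np.eye(n), y)

print(p_memorise)    # close to 2^-10, i.e. about 0.001
print(p_generalise)  # close to 1/2
```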
4 Testing the Bound
This section compares the generalisation bound for infinite width networks (Theorem 2) to the performance of finite width multilayer perceptrons (MLPs) trained by gradient descent. The bound is found to be non-vacuous and correlated with the effects of varying depth and dataset complexity. Still, there is a substantial gap between the bound and gradient descent, which is investigated in Section 5.
Three modified versions of the MNIST handwritten digit dataset
(LeCun et al., 1998) of varying “hardness” were used in the experiments, as detailed in Figure 2 (top left). MLPs were trained with layers and hidden units per layer. Specifically, each input image was flattened to lie in , and normalised to satisfy . The networks consisted of an input layer in , layers in , and an output layer in . The nonlinearity was set to . For this architecture, the infinite width kernel is the compositional arccosine kernel described in Theorem 4 in Appendix A.3. For the finite width networks, the training loss was set to square loss using the binary training labels as regression targets, and the networks were trained for 100 epochs using the Nero optimiser (Liu et al., 2021) with an initial learning rate of 0.01 decayed by a factor of 0.9 every epoch, and a minibatch size of data points. The final train error was 0% in all reported experiments. All bounds were computed with a failure probability of , and was estimated using Monte Carlo samples. All experiments were run on one NVIDIA Titan RTX GPU.

The generalisation bound of Theorem 2 was first compared across three datasets of varying complexity. The network architecture was set to a depth MLP, and the bound was computed via Monte Carlo estimation of . The results are shown in Figure 2 (top right). The bound was found to reflect the relative hardness of the datasets. For random labels, the bound was vacuous, as desired. Next, the generalisation bound was compared against the empirical performance of a depth MLP trained on the decimal digits dataset. The results are shown in Figure 2 (bottom left). While loose compared to the holdout error of the finite width network, the bound is still non-vacuous. Finally, the effect of varying network depth was investigated on the decimal digits dataset. Two depths were compared: and . The results are shown in Figure 2 (bottom right). The bounds (computed via Monte Carlo estimation of ) appear to predict the relative holdout performance of the two architectures at finite width. These results corroborate those of Valle-Pérez & Louis (2020), but without the use of the statistically inconsistent expectation-propagation approximation.
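For reference, the compositional arccosine kernel of a deep ReLU network (Cho & Saul, 2009) can be sketched as follows. This assumes He-style weight variance 2/fan_in and no bias terms, and is an illustration rather than a reproduction of the kernel in Theorem 4.

```python
import numpy as np

def nngp_relu_kernel(X1, X2, depth):
    """NNGP covariance for a deep ReLU MLP: the degree-1 arccosine kernel
    (Cho & Saul, 2009) composed once per hidden layer. Assumes weight
    variance 2 / fan_in and no biases, under which the diagonal of the
    kernel is preserved from layer to layer."""
    K12 = X1 @ X2.T
    k1 = np.sum(X1**2, axis=1)  # self-covariances of the first input set
    k2 = np.sum(X2**2, axis=1)  # self-covariances of the second input set
    norms = np.sqrt(np.outer(k1, k2))
    for _ in range(depth):
        cos = np.clip(K12 / norms, -1.0, 1.0)
        theta = np.arccos(cos)
        K12 = norms / np.pi * (np.sin(theta) + (np.pi - theta) * cos)
    return K12

# Two orthogonal unit-norm inputs: after one ReLU layer their covariance is 1/pi.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
K = nngp_relu_kernel(X, X, depth=1)
print(K)
```

With such a kernel matrix in hand, the complexity measures of Lemma 1 and hence the Theorem 2 bound can be evaluated directly on the labelled training set.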
5 Implicit Bias of Gradient Descent
In Section 3, a bound was derived on the average population error of all infinitely wide neural networks that fit a certain training set. Since the bounded quantity measures the performance of all solutions rather than just those returned by gradient descent, it is intended to assess the implicit bias of architecture. In Section 4, this bound was tested and found to be non-vacuous. Still, a substantial gap was found between the bound and the performance of finite width networks trained by gradient descent. This gap could arise for several reasons:

(a) slackness in the bounding technique;
(b) a difference between infinite width and finite width neural networks;
(c) an additional implicit bias of gradient descent.
This section tests these various possibilities, ultimately concluding that gradient descent does have an important additional implicit bias: the ability to control the margin of the returned network.
5.1 An Exact Formula for the Average Holdout Error of Solutions
To investigate whether slackness in the bounding technique is responsible for the gap, an additional theoretical tool was developed: a formula for the holdout error of a binary classifier averaged over the solution space. In contrast to Theorem 2, which gives an upper bound on population error, Theorem 3 is an equality and therefore does not suffer from slackness. The proof is given in Appendix A.1.
Theorem 3 (Average holdout error of a binary classifier).
First, fix a probability measure over the weight space of a binary classifier. Let denote a training set and let denote the solution space. For a holdout set of datapoints, consider the holdout error of weight setting , and its average over the solution space . Then:
(11) 
In words, the average holdout error over solutions equals the reduction in the volume of solutions when the training set is augmented with a negated holdout point, averaged over the holdout set. For an infinitely wide neural network, Equation 11 involves computing a sum of ratios of Gaussian orthant probabilities. These probabilities can be consistently estimated by Equation 8 in Lemma 1.
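As an illustration of this idea (hypothetical code, using a generic RBF kernel in place of the NNGP kernel, with each orthant volume estimated by naive Monte Carlo rather than the paper's estimator), the average holdout error can be computed as an average of ratios of orthant volumes.

```python
import numpy as np

def orthant_prob(K, y, n_samples=200_000, seed=0):
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(K + 1e-9 * np.eye(len(y)))
    f = rng.standard_normal((n_samples, len(y))) @ L.T
    return np.mean(np.all(np.sign(f) == y, axis=1))

def rbf(X1, X2):
    d2 = (np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :]
          - 2 * X1 @ X2.T)
    return np.exp(-d2 / 2)

def average_holdout_error(kernel, X_train, y_train, X_hold, y_hold):
    """A zero-train-error GP draw errs on holdout point j exactly when it
    also fits the training set augmented with the negated label (x_j, -y_j),
    so the average holdout error is an average of volume ratios."""
    p_train = orthant_prob(kernel(X_train, X_train), y_train)
    errs = []
    for j in range(len(y_hold)):
        X_aug = np.vstack([X_train, X_hold[j:j + 1]])
        y_aug = np.append(y_train, -y_hold[j])  # negated holdout label
        p_aug = orthant_prob(kernel(X_aug, X_aug), y_aug)
        errs.append(p_aug / p_train)
    return float(np.mean(errs))

X_train = np.array([[0.0], [2.0]]); y_train = np.array([1.0, -1.0])
X_hold = np.array([[0.1], [1.9]]); y_hold = np.array([1.0, -1.0])
err = average_holdout_error(rbf, X_train, y_train, X_hold, y_hold)
print(err)  # small: each holdout point sits close to a same-labelled train point
```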
5.2 Three-Way Comparison of Holdout Error
Armed with Theorem 3, this subsection makes a threeway comparison between the holdout error of infinite width networks averaged over solutions, the holdout error of finite width networks averaged over solutions, and the holdout error of finite width networks trained by gradient descent.
To compute the holdout error of finite width networks averaged over solutions, weight vectors were randomly sampled and all non-solutions were discarded. To make this process computationally tractable, a very small training set was used, consisting of only 5 samples from the binary digits dataset. A holdout set of 50 datapoints was used, and the network was set to a 7-layer MLP. The results were:
average holdout error at infinite width:  
average holdout error at width 10,000:  
gradient descent holdout error at width 10,000: 
Based on these results, three comments are in order. First, the close agreement between the average holdout error at finite and infinite width suggests that the infinite width limit may not be responsible for the significant gap observed in Section 4. Second, the significant gap between the gradient descent holdout error and the holdout error averaged over solutions suggests slackness in the bounding technique in Theorem 2 may also not be the main culprit. Third, the significant gap between the gradient descent holdout error and the holdout error averaged over solutions suggests that gradient descent does have an extra implicit bias. The next subsection attempts to diagnose this implicit bias.
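The rejection sampling procedure used above can be sketched as follows. This is a toy stand-in with an assumed one-hidden-layer ReLU network and synthetic data, not the paper's 7-layer MLP or its binary digits dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: tiny synthetic binary task (NOT the paper's binary digits data).
X_train = np.array([[1.0, 0.5], [-1.0, 0.3], [0.2, -1.0]])
y_train = np.array([1, -1, -1])
X_hold = np.array([[0.9, 0.4], [-0.8, 0.2]])
y_hold = np.array([1, -1])

width = 64

def sample_net():
    """Draw a one-hidden-layer ReLU network from a Gaussian prior."""
    W1 = rng.standard_normal((2, width))
    W2 = rng.standard_normal(width) * np.sqrt(2.0 / width)
    return lambda X: np.maximum(X @ W1, 0.0) @ W2

# Rejection sampling: keep only the sampled networks that are solutions
# (zero train error), then average their holdout error.
hold_errs = []
for _ in range(20_000):
    f = sample_net()
    if np.all(np.sign(f(X_train)) == y_train):
        hold_errs.append(np.mean(np.sign(f(X_hold)) != y_hold))

print(f"solutions kept: {len(hold_errs)}")
print(f"average holdout error over solutions: {np.mean(hold_errs):.3f}")
```

The rejection rate illustrates why this procedure only scales to very small training sets: the fraction of prior draws that fit the data shrinks roughly exponentially with the number of training points.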
Top left: Schematic diagram illustrating the label ray in function space. While all functions in the grey shaded orthant attain zero train error, the darker shaded solutions have small margin and thus—under a Gaussian process model—holdout predictions are expected to be driven by random fluctuations.
Top right: Holdout error along the label ray of the small learning task described in Section 5.2. Results are shown for both networks trained by gradient descent and NNGP posterior samples. The label scale is defined in Section 5.3 and measures the distance of the function along the label ray. NNGP posterior samples were generated by sampling from a Gaussian distribution with mean and covariance given by Equations 12 and 13. Networks trained by gradient descent were found by training with the loss function given in Equation 14. Shading shows the standard deviation over 100 random initialisations or posterior samples. Deep into function space (large ) the holdout error is significantly better than for functions close to the origin (small ). The average holdout error over the orthant, as computed in Section 5.2, is also plotted. Large margin networks and NNGP posterior samples both significantly outperform the orthant average. Finally, a gap is visible between large margin NNGP posterior samples and large margin networks trained by gradient descent—this may be due to an additional and undiagnosed implicit bias of gradient descent.
Bottom: Repeating experiments from Section 4 with a smaller label scale to sanity check the findings. For a depth 2 and a depth 7 MLP on the decimal digits dataset, the experiment from Section 4 was repeated with the loss function set to as defined in Equation 14. The infinite width curve shows the Theorem 2 bound estimated via . The other two curves show the results of networks trained by gradient descent with and . Despite all networks attaining 0% train error, holdout error was significantly worse for networks trained using . Also, the networks trained using exhibit a substantially smaller gap with the Theorem 2 upper bound.
5.3 Diagnosis: Margin
This subsection finds that gradient descent has an important implicit bias in determining the margin of the returned network. This conclusion is reached by studying the generalisation error of functions along the label ray in function space—depicted in Figure 3 (top left). For a training set with a vector of binary labels , a function far along the label ray refers to the point .
For the case of a zero mean NNGP, the predictive distribution on a holdout set conditioned on training labels far along the label ray is given by Bishop (2006, Chapter 2.3.1):
posterior mean  (12)  
posterior covariance  (13) 
where is the covariance between holdout and train inputs, and , and are defined analogously. For large , predictions are driven by the posterior mean, whereas for small , they are driven by random fluctuations. So one expects that letting should improve holdout error.
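These conditioning formulas can be written out directly. The sketch below uses the standard zero-mean Gaussian process posterior equations with a placeholder RBF kernel and synthetic data, and verifies that the posterior mean scales linearly with the label scale c while the posterior covariance is independent of c.

```python
import numpy as np

def label_ray_posterior(K_train, K_cross, K_hold, y, c):
    """Zero-mean GP posterior on holdout inputs, conditioned on training
    outputs c * y along the label ray (standard Gaussian conditioning;
    see Bishop, 2006, Section 2.3.1)."""
    K_inv = np.linalg.inv(K_train + 1e-9 * np.eye(len(y)))
    mean = K_cross.T @ K_inv @ (c * y)           # scales linearly with c
    cov = K_hold - K_cross.T @ K_inv @ K_cross   # independent of c
    return mean, cov

def rbf(X1, X2):
    d2 = (np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :]
          - 2 * X1 @ X2.T)
    return np.exp(-d2 / 2)

X_tr = np.array([[0.0], [2.0]]); y_tr = np.array([1.0, -1.0])
X_ho = np.array([[0.5], [1.5]])
m1, C1 = label_ray_posterior(rbf(X_tr, X_tr), rbf(X_tr, X_ho),
                             rbf(X_ho, X_ho), y_tr, c=1.0)
m10, C10 = label_ray_posterior(rbf(X_tr, X_tr), rbf(X_tr, X_ho),
                               rbf(X_ho, X_ho), y_tr, c=10.0)

# For large c the predicted signs are driven by the mean, whose scale grows
# with c, rather than by the c-independent fluctuations.
print(m10 / m1)              # equals [10, 10]
print(np.allclose(C1, C10))  # True
```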
For the case of gradient descent training, functions far along the label ray can be returned by minimising the following loss function:
(14) 
A subtle but important point is that for gradient descent training with Nero (Liu et al., 2021) the norms of the weight matrices are constrained, thus controls a properly normalised notion of margin.
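Equation 14 is not reproduced in this excerpt. Assuming it takes the form of a square loss against targets scaled a distance c along the label ray (consistent with the square-loss regression setup of Section 4), gradient descent training towards large margin can be sketched on a toy linear model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear model f(x; w) = <w, x> trained on a tiny synthetic task.
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -0.5]])
y = np.array([1.0, 1.0, -1.0])
c = 10.0  # label scale: distance of the regression target along the label ray

def label_ray_loss(f_out, y, c):
    """Assumed form of the label-ray loss: square loss against targets c * y."""
    return np.mean((f_out - c * y) ** 2)

w = rng.standard_normal(2) * 0.1
for _ in range(2000):
    grad = 2 * X.T @ (X @ w - c * y) / len(y)  # gradient of the square loss
    w -= 0.05 * grad

out = X @ w
print(out)  # all signs match y, with magnitudes that scale with c
print(label_ray_loss(out, y, c))
```

The point of the sketch is that the scale c of the regression targets directly sets how far along the label ray the fitted function lands; for the paper's networks, a norm-constrained optimiser is what turns this scale into a properly normalised margin.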
An experiment was performed to measure the holdout error of networks far along the label ray—for both NNGP draws and neural networks selected by Nero—for ranging from to . As can be seen in Figure 3 (top right), varying appears to directly control the holdout error, despite all solutions attaining 0% train error. Moreover, large solutions significantly outperform the average holdout error over the solution space. This suggests that gradient descent possesses a significant implicit bias in its ability to control margin—going beyond the implicit bias of architecture.
6 Discussion and Conclusion
This paper has explored the separate implicit biases of architecture and gradient descent. Section 3 derived an analytical generalisation bound on the NNGP posterior, Section 4 found this bound to be non-vacuous, and Section 5 showed that large margin functions substantially outperform the bound.
The findings in this paper support the importance of the implicit bias of architecture: reproducing the findings of Valle-Pérez & Louis (2020) but with improved technical tools, the average generalisation performance of the architecture was already found to be good. But the implicit bias of gradient descent was also found to be important. In particular, in contrast to an assumption made by Valle-Pérez & Louis (2020) that gradient descent “samples the zero-error region close to uniformly”, it was found that gradient descent can be used to target zero-error functions with large margin. This also subtly counters a proposal by Mingard et al. (2021) that gradient descent acts like a “Bayesian sampler”. In particular, gradient descent can target functions with margin for which the Bayesian posterior probability . And indeed, these minimum a posteriori functions seem to generalise best.
One direction of future work suggested by these results is an improvement to the function space PAC-Bayes theory to account for margin. While margin-based PAC-Bayes bounds do already exist (Herbrich, 2001; Langford & Shawe-Taylor, 2003; Neyshabur et al., 2018), these bounds operate in weight space and, further, do not seem to imply an advantage to selecting the max-margin classifier over the other classifiers included in the PAC-Bayes posterior.
Ultimately, a generalisation theory that properly accounts for the various implicit biases of deep learning could provide a more principled basis both for neural architecture design and for the design of new regularisation schemes. It is hoped that the new analytical results and experimental insights in this paper contribute a step towards that goal.
References
 Achille & Soatto (2018) Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations. Journal of Machine Learning Research, 2018.
 Azizan et al. (2021) Navid Azizan, Sahin Lale, and Babak Hassibi. Stochastic mirror descent on overparameterized nonlinear models. IEEE Transactions on Neural Networks and Learning Systems, 2021.
 Bartlett et al. (2017) Peter L. Bartlett, Dylan J. Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. In Neural Information Processing Systems, 2017.
 Biggs & Guedj (2021) Felix Biggs and Benjamin Guedj. On margins and derandomisation in PAC-Bayes. arXiv:2107.03955, 2021.
 Bishop (2006) Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag, 2006.
 Cho & Saul (2009) Youngmin Cho and Lawrence Saul. Kernel methods for deep learning. In Neural Information Processing Systems, 2009.

 Dziugaite & Roy (2017) Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Uncertainty in Artificial Intelligence, 2017.
 Foret et al. (2021) Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations, 2021.
 Geiping et al. (2021) Jonas Geiping, Micah Goldblum, Phillip E. Pope, Michael Moeller, and Tom Goldstein. Stochastic training is not necessary for generalization. arXiv:2109.14119, 2021.
 Herbrich (2001) Ralf Herbrich. Learning Kernel Classifiers: Theory and Algorithms. MIT Press, 2001.
 Keskar et al. (2017) Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2017.
 Langford & Seeger (2001) John Langford and Matthias Seeger. Bounds for averaging classifiers. Technical report, Carnegie Mellon University, 2001.
 Langford & Shawe-Taylor (2003) John Langford and John Shawe-Taylor. PAC-Bayes & margins. In Neural Information Processing Systems, 2003.
 LeCun et al. (1998) Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. MNIST handwritten digit database, 1998.
 Lee et al. (2018) Jaehoon Lee, Jascha Sohl-Dickstein, Jeffrey Pennington, Roman Novak, Sam Schoenholz, and Yasaman Bahri. Deep neural networks as Gaussian processes. In International Conference on Learning Representations, 2018.
 Liu et al. (2021) Yang Liu, Jeremy Bernstein, Markus Meister, and Yisong Yue. Learning by turning: Neural architecture aware optimisation. In International Conference on Machine Learning, 2021.

 McAllester (1998) David A. McAllester. Some PAC-Bayesian theorems. In Conference on Computational Learning Theory, 1998.
 Mingard et al. (2021) Chris Mingard, Guillermo Valle-Pérez, Joar Skalse, and Ard A. Louis. Is SGD a Bayesian sampler? Well, almost. Journal of Machine Learning Research, 2021.
 Neal (1994) Radford M. Neal. Bayesian Learning for Neural Networks. Ph.D. thesis, Department of Computer Science, University of Toronto, 1994.
 Neyshabur et al. (2018) Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations, 2018.
 Neyshabur et al. (2019) Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. The role of overparametrization in generalization of neural networks. In International Conference on Learning Representations, 2019.
 Soudry et al. (2018) Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, 2018.
 Valle-Pérez & Louis (2020) Guillermo Valle-Pérez and Ard A. Louis. Generalization bounds for deep learning. arXiv:2012.04115, 2020.
 Valle-Pérez et al. (2019) Guillermo Valle-Pérez, Chico Q. Camargo, and Ard A. Louis. Deep learning generalizes because the parameter–function map is biased towards simple functions. In International Conference on Learning Representations, 2019.
 van der Vaart (1998) Aad W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.
 Vapnik & Chervonenkis (1971) Vladimir N. Vapnik and Alexey Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 1971.
 Wei et al. (2018) Colin Wei, J. Lee, Qiang Liu, and Tengyu Ma. On the margin theory of feedforward neural networks. arXiv:1810.05369, 2018.
 Wilson et al. (2017) Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Neural Information Processing Systems, 2017.
 Wu et al. (2020) Yikai Wu, Xingyu Zhu, Chenwei Wu, Annie Wang, and Rong Ge. Dissecting Hessian: Understanding common structure of Hessian in neural networks. arXiv:2010.04261, 2020.
 Zhang et al. (2017) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.
Appendix A Proofs
A.1 Function Space Orthant Arithmetic
Proof.
The first inequality is a basic property of logarithms. The second inequality follows from Theorem 3 of Langford & Seeger (2001), by setting the prior measure to and the posterior measure to the conditional . Under these settings, the average training error rate over the posterior is zero and . ∎
Proof of Theorem 3.
The result follows by interchanging the order of expectations, rewriting the expectation of an indicator variable as a probability and finally applying the definition of conditional probability:
∎
A.2 Gaussian Orthant Probabilities
See 1
Proof.
The orthant probability may first be expressed using the probability density function of the multivariate Normal distribution as follows:
By a change of variables, the orthant probability may be expressed as:
where the second equality follows by symmetry, and the third equality follows by inserting a convenient factor into the integrand.
Next, by Jensen’s inequality,
The third equality follows by noting , while for , . ∎
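Orthant probabilities of the kind analysed here are straightforward to estimate by Monte Carlo, which provides a useful sanity check on such bounds. The dimension, mean, and covariance below are illustrative choices, not those of the theorem:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
mu = 0.5 * np.ones(d)
# An equicorrelated covariance: unit variances, pairwise correlation 1/2.
Sigma = 0.5 * np.eye(d) + 0.5 * np.ones((d, d))
# Monte Carlo estimate of the orthant probability P(z >= 0 componentwise)
# for z ~ N(mu, Sigma).
z = rng.multivariate_normal(mu, Sigma, size=200_000)
p_orthant = np.mean(np.all(z >= 0, axis=1))
```

Positive correlations raise the orthant probability well above the product of marginals, which is exactly the regime in which a naive independence approximation would be loose.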
A.3 Neural Networks as Gaussian Processes
The essence of the following lemma is due to Neal (1994). The lemma will be used in the proof of Theorem 4.
Lemma 2 (NNGP correspondence).
For the neural network layer given by Equation 18, consider randomly sampling the weight matrix W^l. If the following hold:

(i) for every input x, the incoming activations φ(z_j^{l−1}(x)) are iid over the units j, with finite first and second moment;

(ii) the weights W_ij^l are drawn iid with zero mean and finite variance;

(iii) for any random variable z with finite first and second moment, φ(z) also has finite first and second moment;

then, in the limit that the fan-in d_{l−1} → ∞, the following also hold:

(1) for every input x, the activations φ(z_i^l(x)) are iid over the units i, with finite first and second moment;

(2) for any collection of inputs x⁽¹⁾, …, x⁽ᵏ⁾, the distribution of the ith preactivations (z_i^l(x⁽¹⁾), …, z_i^l(x⁽ᵏ⁾)) is jointly Normal.
While condition (i) may seem nontrivial, notice that the lemma propagates this condition to the next layer via entailment (1). This means that, provided condition (i) holds at the first layer, recursive application of the lemma implies that the network's preactivations are jointly Normal at all layers via entailment (2).
Proof of Lemma 2.
To establish entailment (1), consider the d_l-dimensional vector z^l(x). Observe that z^l(x) satisfies:

(15) z^l(x) = (1/√d_{l−1}) · Σ_{j=1}^{d_{l−1}} W_{:j}^l · φ(z_j^{l−1}(x)),

where W_{:j}^l denotes the jth column of the weight matrix.
By conditions (i) and (ii), the summands in Equation 15
are iid random vectors with zero mean, and any two distinct components of the same vector summand have the same variance and zero covariance. Then by the multivariate central limit theorem
(van der Vaart, 1998, p. 16), in the limit that d_{l−1} → ∞, the components of z^l(x) are Gaussian with a covariance equal to a scaled identity matrix. In particular, the components of z^l(x) are iid with finite first and second moment. Applying condition (iii) then implies that the same holds for the activations φ(z^l(x)). This establishes entailment (1).

To establish entailment (2), consider instead the k-dimensional vector z_i^l := (z_i^l(x⁽¹⁾), …, z_i^l(x⁽ᵏ⁾)). Observe that z_i^l satisfies:

(16) z_i^l = (1/√d_{l−1}) · Σ_{j=1}^{d_{l−1}} W_ij^l · (φ(z_j^{l−1}(x⁽¹⁾)), …, φ(z_j^{l−1}(x⁽ᵏ⁾))).
Again by combining conditions (i) and (ii), the summands in Equation 16 are iid random vectors with finite mean and finite covariance. Then as , the distribution of is jointly Normal—again by the multivariate central limit theorem. This establishes entailment (2). ∎
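The central limit step can be illustrated numerically: for a wide random layer, the preactivation covariance across units concentrates on the scaled inner product of the incoming activations, and the marginals are close to Gaussian even for non-Gaussian weights. The widths, the sign-valued (Rademacher) weights, and the uniform stand-ins for the incoming activations are assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 2_000, 5_000
# Stand-ins for the previous layer's activations at two inputs.
a1 = rng.uniform(0.0, 1.0, fan_in)
a2 = rng.uniform(0.0, 1.0, fan_in)
# Rademacher weights: iid, zero mean, unit variance, not Gaussian --
# the CLT supplies the Gaussianity of the preactivations.
W = rng.integers(0, 2, size=(fan_out, fan_in)) * 2.0 - 1.0
z1 = W @ a1 / np.sqrt(fan_in)
z2 = W @ a2 / np.sqrt(fan_in)
# CLT predictions: Cov(z1_i, z2_i) = a1.a2 / fan_in, and each z1_i is
# approximately Gaussian (fourth standardised moment near 3).
emp_cov = np.mean(z1 * z2)
pred_cov = a1 @ a2 / fan_in
kurtosis = np.mean((z1 / z1.std()) ** 4)
```

Swapping in any other zero-mean, unit-variance weight distribution leaves the result unchanged, which is the content of condition (ii).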
The essence of the following theorem appears in a paper by Lee et al. (2018), building on the work of Cho & Saul (2009). The theorem and its proof are included for completeness.
Theorem 4 (NNGP for relu networks).
Consider an L-layer MLP defined recursively via:

(17) z^1(x) := W^1 x / √d₀;

(18) z^l(x) := W^l φ(z^{l−1}(x)) / √d_{l−1}, for l = 2, …, L;

where x ∈ ℝ^{d₀} denotes an input, z^l(x) ∈ ℝ^{d_l} denotes the preactivations at the lth layer, W^l ∈ ℝ^{d_l × d_{l−1}} denotes the weight matrix at the lth layer, and φ denotes the nonlinearity.
Set the output dimension d_L = 1
and set the nonlinearity to a scaled relu φ(z) := √2 · max(0, z), applied elementwise.
Suppose that the weight matrices W^1, …, W^L have entries drawn iid N(0, 1), and consider any collection of inputs x⁽¹⁾, …, x⁽ᵏ⁾ each with Euclidean norm √d₀. If d₁, …, d_{L−1} → ∞, the distribution of outputs z^L(x⁽¹⁾), …, z^L(x⁽ᵏ⁾) induced by random sampling of the weights is jointly Normal with mean zero and covariance:

(19) E[z^L(x⁽ᵃ⁾) · z^L(x⁽ᵇ⁾)] = Σ^L(x⁽ᵃ⁾, x⁽ᵇ⁾), defined recursively by Σ^1(x, x′) := xᵀx′/d₀ and Σ^l(x, x′) := h(Σ^{l−1}(x, x′)) for l = 2, …, L,

where h(t) := (√(1 − t²) + (π − arccos t) · t) / π.
Proof.
Condition (ii) of Lemma 2 holds at all layers for iid standard Normal weights, and condition (iii) holds trivially for the scaled relu nonlinearity. Provided one can establish condition (i) for the first layer activations φ(z^1(x)), then condition (i) will hold at all layers by recursive application of Lemma 2, thus establishing joint Normality of the preactivations at all layers (including the network outputs). But condition (i) does hold at the first layer, since it is quick to check by Equation 17 that for any x satisfying ‖x‖₂ = √d₀, the preactivations z_i^1(x) are iid N(0, 1), and φ preserves both iidness and finiteness of the first and second moment.
Since the preactivations at any layer are jointly Normal, all that remains is to compute their first and second moments. For the ith hidden unit in the lth layer, the first moment E[z_i^l(x)] = 0. This can be seen by taking the expectation of Equation 18 and using the fact that the weights W^l are independent of the previous layer's activations and have zero mean.
Since the preactivations z_i^l(x) and z_i^l(x′) are jointly Normal with mean zero, their distribution is completely described by their covariance, defined by:

Σ^l(x, x′) := E[z_i^l(x) · z_i^l(x′)],

where the hidden unit index i is unimportant since hidden units in the same layer are identically distributed.
The theorem statement will follow from an effort to express Σ^l in terms of Σ^{l−1}, and then recursing back through the network. By Equation 18 and independence of the weights W_ij^l, the covariance may be expressed as:

(20) Σ^l(x, x′) = E[φ(z_j^{l−1}(x)) · φ(z_j^{l−1}(x′))],

where j indexes an arbitrary hidden unit in the (l−1)th layer. To make progress, it helps to first evaluate:

E[φ(z_j^{l−1}(x))²] = 2 · E[max(0, z_j^{l−1}(x))²] = E[z_j^{l−1}(x)²] = Σ^{l−1}(x, x),

which follows by the definition of φ and symmetry of the Gaussian expectation around zero. Then, by recursion:

Σ^l(x, x) = Σ^{l−1}(x, x) = ⋯ = Σ^1(x, x) = 1,

where the final equality holds because the first layer preactivations are iid N(0, 1) by Equation 17. Therefore, the covariance matrix of the pair (z_j^{l−1}(x), z_j^{l−1}(x′)) is just:

[ 1, Σ^{l−1}(x, x′) ; Σ^{l−1}(x, x′), 1 ].
Equation 20 may now be used to express Σ^l(x, x′) in terms of Σ^{l−1}(x, x′). Dropping the layer indexing for brevity, and writing ρ := Σ^{l−1}(x, x′):

Σ^l(x, x′) = 2 ∫ max(0, u) · max(0, v) · 𝒩(u, v; 0, [1, ρ; ρ, 1]) du dv.

By a change of variables, this integral becomes equivalent to the degree-one arc-cosine kernel as expressed in Equation 15 of Cho & Saul (2009). Substituting in the evaluation of this integral (Cho & Saul, 2009, Equation 6), one obtains:

(21) Σ^l(x, x′) = (√(1 − ρ²) + (π − θ) · ρ) / π,

where θ := arccos ρ = arccos Σ^{l−1}(x, x′).
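This Gaussian expectation is easy to verify by simulation. The sketch below assumes the scaled relu φ(z) = √2 · max(0, z) and checks the closed form (√(1 − ρ²) + (π − arccos ρ) · ρ)/π at one illustrative correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.3
n = 1_000_000
# (u, v): standard bivariate Normal with correlation rho.
u = rng.standard_normal(n)
v = rho * u + np.sqrt(1 - rho**2) * rng.standard_normal(n)
phi = lambda t: np.sqrt(2.0) * np.maximum(0.0, t)
# Monte Carlo estimate of E[phi(u) phi(v)] versus the arc-cosine
# kernel closed form of Cho & Saul (2009).
mc = np.mean(phi(u) * phi(v))
closed_form = (np.sqrt(1 - rho**2) + (np.pi - np.arccos(rho)) * rho) / np.pi
```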
All that remains is to evaluate the base case Σ^1(x, x′). Since z^1(x) = W^1 x / √d₀ by Equation 17, this is given by:

Σ^1(x, x′) = E[z_i^1(x) · z_i^1(x′)] = xᵀx′ / d₀.
The proof is completed by combining this expression with the recurrence relation in Equation 21. ∎
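The covariance recursion derived above can also be sanity-checked end to end by simulating wide networks of the stated form. The depth, width, inputs, and number of weight draws below are illustrative, and many final-layer units are averaged in place of a single output; at finite width the agreement is only approximate:

```python
import numpy as np

rng = np.random.default_rng(0)
d0, width, L = 4, 300, 3                  # input dim, hidden width, depth
x1 = np.array([1.0, 1.0, 1.0, 1.0])       # both inputs have norm sqrt(d0)
x2 = np.array([1.0, 1.0, 1.0, -1.0])
phi = lambda z: np.sqrt(2.0) * np.maximum(0.0, z)
h = lambda t: (np.sqrt(1 - t**2) + (np.pi - np.arccos(t)) * t) / np.pi

# Kernel prediction: apply h (L - 1) times to the base case x1.x2 / d0.
k = x1 @ x2 / d0
for _ in range(L - 1):
    k = h(k)

# Empirical output covariance over independent draws of the weights,
# with iid N(0, 1) entries and 1/sqrt(fan-in) normalisation per layer.
prods = []
for _ in range(200):
    a1, a2, d_in = x1, x2, d0
    for _ in range(L):
        W = rng.standard_normal((width, d_in))
        z1, z2 = W @ a1 / np.sqrt(d_in), W @ a2 / np.sqrt(d_in)
        a1, a2, d_in = phi(z1), phi(z2), width
    prods.append(np.mean(z1 * z2))      # final-layer preactivations
emp = np.mean(prods)
```

At width 300 the empirical covariance typically matches the kernel prediction to within a few hundredths, and the agreement tightens as the width grows.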