ReLU deep networks can approximate arbitrary functions

A while back I read an answer to this question by Eric Jang, an engineer at Google Brain, and I wanted to share it with you! I've also recently noticed that many experts like to share DL-related knowledge on their blogs, so I personally find that reading relevant blog posts is great for building and deepening your own foundations. I'd also like to thank all the DL & ML experts who quietly share their knowledge. Let's learn together! Well, that's enough talk; let's get to the topic.

Many people ask: **Why can ReLU deep networks approximate arbitrary functions?**

It's an insightful question, and here he explains it simply, with minimal mathematics. ReLU is piecewise linear, so one might question whether, for a fixed-size neural network, a ReLU network is less expressive than one with a smoother, bounded activation function (e.g. tanh).

Because they learn non-smooth functions, ReLU networks should be interpreted as separating the data in a piecewise linear fashion, rather than as "true" function approximators. In machine learning, one often learns from datasets with a finite number of discrete data points (e.g., 100K images), and in these cases it is sufficient to learn a separation of those data points. Consider the two-dimensional modulus operator, i.e.

```
vec2 p = vec2(x, y);  // x, y are floats

vec2 mod(vec2 p, float m) {
    return vec2(p.x % m, p.y % m);
}
```

The mod function folds/collapses all of 2D space onto the unit square. The map is piecewise linear, yet highly nonlinear (because it has an infinite number of linear pieces).
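To make the folding concrete, here is a minimal Python sketch of the same 2D mod operator (the function name `mod2` is mine): points that are far apart in the plane land on the same spot inside the unit square, and on each unit cell the map is just the identity shifted by a constant, i.e. linear.

```python
# Sketch of the 2D mod operator: fold every point of the plane
# onto the unit square [0, 1) x [0, 1).

def mod2(p, m=1.0):
    """Fold a 2D point (x, y) onto the square [0, m) x [0, m)."""
    x, y = p
    return (x % m, y % m)

# Distant points collapse onto the same spot in the unit square.
# (Python's % is floored, so results are non-negative for m > 0.)
print(mod2((2.5, -0.75)))  # (0.5, 0.25)
print(mod2((7.5, 3.25)))   # (0.5, 0.25)
```

There are infinitely many unit cells, hence infinitely many linear pieces, which is exactly why the map is nonlinear as a whole.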

Deep neural networks with ReLU activations work similarly - they split/fold the input space into a collection of different linear regions, like a really complex piece of origami.

This is clearly shown in the third figure of the article "On the number of linear regions of Deep Neural Networks".

In Figure 2 of the article, they show how the number of linear regions grows exponentially with the depth/number of layers of the network.
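You can also observe the linear regions numerically. The sketch below (my own illustration, not the paper's construction) counts how many linear regions a small random ReLU net carves a 1D input segment into: each distinct on/off pattern of the ReLUs corresponds to one linear piece, and a single hidden layer of width w can produce at most w + 1 pieces along a line.

```python
# Count the linear regions of a random deep ReLU net along a 1D segment
# by tracking where the ReLU on/off pattern changes. Pure stdlib; the
# widths and sampling grid are arbitrary illustration choices.
import random

random.seed(0)

def make_layer(n_in, n_out):
    W = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
    b = [random.gauss(0, 1) for _ in range(n_out)]
    return W, b

def relu_pattern(layers, x):
    """Return the on/off pattern of every ReLU for scalar input x."""
    h = [x]
    pattern = []
    for W, b in layers:
        h = [max(0.0, sum(w * v for w, v in zip(row, h)) + bi)
             for row, bi in zip(W, b)]
        pattern.extend(1 if v > 0 else 0 for v in h)
    return tuple(pattern)

def count_regions(depth, width=8, samples=2000):
    layers = [make_layer(1 if i == 0 else width, width)
              for i in range(depth)]
    xs = [-3 + 6 * i / samples for i in range(samples + 1)]
    patterns = [relu_pattern(layers, x) for x in xs]
    # Each change of pattern along the segment starts a new linear region.
    return 1 + sum(p != q for p, q in zip(patterns, patterns[1:]))

for d in (1, 2, 3):
    print(f"depth {d}: ~{count_regions(d)} linear regions on [-3, 3]")
```

This grid-sampling count is only a lower bound (regions narrower than the grid spacing are missed), but it is enough to see region counts climb as layers are added.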

It turns out that with enough pieces you can approximate any smooth function to any degree of accuracy. Also, if you add a smooth activation function to the last layer, you get a smooth approximation of the function.
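As a quick sanity check of that claim, here is a small Python sketch (my own example, using plain piecewise-linear interpolation rather than a trained network): the worst-case error of a piecewise-linear approximation of sin(x) on [0, π] shrinks as the number of linear pieces grows.

```python
# Approximate sin(x) on [0, pi] with an n-piece linear interpolant and
# measure the worst-case error on a fine test grid. More pieces -> the
# piecewise linear function hugs the smooth curve more tightly.
import math

def pw_linear_error(n_pieces, n_test=10_000):
    """Max error of the n-piece linear interpolant of sin on [0, pi]."""
    knots = [math.pi * i / n_pieces for i in range(n_pieces + 1)]
    worst = 0.0
    for j in range(n_test + 1):
        x = math.pi * j / n_test
        i = min(int(x / math.pi * n_pieces), n_pieces - 1)
        x0, x1 = knots[i], knots[i + 1]
        t = (x - x0) / (x1 - x0)
        approx = (1 - t) * math.sin(x0) + t * math.sin(x1)
        worst = max(worst, abs(approx - math.sin(x)))
    return worst

for n in (4, 16, 64):
    print(f"{n:3d} pieces -> max error {pw_linear_error(n):.5f}")
```

The classical bound for linear interpolation, max error ≤ h²·max|f''|/8 with h = π/n, predicts roughly a 16x error drop each time the piece count quadruples, which is what the printed numbers show.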

In general, we do not want a function approximation that matches every data point exactly and over-fits the dataset; we want one that learns a generalizable representation and performs well on the test set. By learning separators instead, we get better generalization, so in this sense ReLU networks are better self-regularizing.

For details: A Comparison of the Computational Power of Sigmoid and Boolean Threshold Circuits