Neural networks as non-leaky mathematical abstraction

https://www.lesswrong.com/posts/qRtbjHsJiwYggYhP4/neural-networks-as-non-leaky-mathematical-abstraction


Mathematical abstractions are very leaky

In case the term is unfamiliar to you, a leaky abstraction is an abstraction that leaks the details it’s supposed to abstract away. Take something like sigma summation (∑): at first glance it looks like an abstraction, but it’s too leaky to be a good one. Why is it leaky? Because all the concepts it "abstracts" over need to be understood in order to use the sigma summation. You need to "understand" operations, operators, numbers, ranges, limits and series. For another example take integrals (∫), where the argument for leakiness can be succinctly stated: to master the abstraction one needs to understand all of calculus and all that is foundational to calculus (also known as "all of math" until 60 years ago or so). A leaky abstraction is the rule rather than the exception in mathematics. The only popular counterexamples that come to mind are integral transforms (think Laplace and Fourier transforms).

Indeed, any mathematician in the audience might scoff at me for using the word "abstraction" for what they think of as shorthand notations. Why? Arguably because mathematics has no such thing as an "observation" to abstract over. Science has the advantage of making observations: it sees that X correlates with Y in some way, then it tries to find a causal relationship between the two via theory (which is essentially an abstraction). Darwin’s observations about finches were not by any means "wrong". Heck, most of what Jabir ibn Hayyan found out about the world is probably still correct. What is incorrect or insufficient are the theoretical frameworks they used to explain them. Alchemy might have described the chemical properties and interactions of certain types of matter quite well. We’ve replaced it with the Bohr model of chemistry simply because it abstracts away more observations than alchemy. Thinking with competition, sexual reproduction, DNA, RNA and mutations explains Darwin’s observations.
This doesn’t make his original "survival of the fittest" evolutionary framework "wrong". It just makes it a tool that outlived its value.

So mathematics ends up being "leaky" because it has no such thing as an "observation". The fact that 2 * 3 = 6 is not observed, it’s simply "known". Or… is it? The statement 95233345745213 * 4353614555235239 = 414609280180109235394973160907 is just as true as 2 * 3 = 6. Tell a well-educated ancient Greek: "95233345745213 times 4353614555235239 is always equal to 414609280180109235394973160907, this must be fundamentally true within our system of mathematics"… and he will look at you like you’re a complete nut. Tell that same ancient Greek: "2 times 3 is always equal to 6, this must be fundamentally true within our system of mathematics"… and he might think you are a bit pompous, but overall he will nod and agree with your rather banal statement. To a modern person there is little difference in how "obviously true" the two statements seem, because a modern person has a calculator. But before the advent of more modern techniques for working with numbers, such calculations were beyond the reach of most if not all. There would be no "obvious" way of checking the truth of that first statement… people would be short 414609280180109235394973160897 fingers.

So really, there is something that serves a similar role to observations in mathematics: that which most can agree on as being "obviously true". Obviously there are cases where this concept breaks down a bit (e.g. Ramanujan summation), but these are the exception rather than the rule. Computers basically allow us to raise the bar for "obviously true" really high. So high that "this is obviously false" brute force approaches can be used to disprove sophisticated conjectures. As long as we are willing to trust the implementation of the software and hardware, we can quickly validate any mathematical tool over a very large finite domain.
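To make this concrete, here is a quick sketch of my own (not from the post) in Python, whose arbitrary-precision integers check the "hard" product above as easily as 2 * 3 = 6, and where a brute-force loop disproves a plausible-looking conjecture over a finite domain (the conjecture chosen here, about Euler’s prime-generating polynomial n² + n + 41, is my example, not the post’s):

```python
# Python integers have arbitrary precision, so the "hard" product from the
# post can be checked as easily as 2 * 3 = 6.
assert 2 * 3 == 6
assert 95233345745213 * 4353614555235239 == 414609280180109235394973160907

# Brute force can also *disprove* a plausible-sounding conjecture over a
# finite domain. Example: "n^2 + n + 41 is prime for every natural n"
# looks true for a long while, but fails at n = 40.
def is_prime(n):
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n**0.5) + 1))

counterexamples = [n for n in range(100) if not is_prime(n * n + n + 41)]
print(counterexamples[0])  # 40, since 40^2 + 40 + 41 = 41 * 41
```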
Computers also allow us to build functional mathematical abstractions, because suddenly we can "use" those abstractions without understanding them. Being able to use a function and understanding what a function does are different things in the modern age, but this is a very recent development. For the most part, computers have been used to run leaky mathematical abstractions. Leaky by design, made for a world where one had to "build up" their knowledge to use them from the most obvious of truths. However, I think that non-leaky abstractions are slowly showing up to the party, and I think they have a lot of potential. In my view, the best specimen of such an abstraction is the neural network.

Neural Networks as mathematical abstractions

As far as dimensionality reduction (DR) methods go, an autoencoder (AE) is basically one of the easiest ones for me to explain, implement and understand, despite being quite sophisticated compared to most other nonlinear DR methods. The way I’d explain it (maybe a bit more rigorously than necessary) is: a neural network trained to reproduce its own input at its output, with a narrow "bottleneck" layer in the middle; once trained, the activations of that bottleneck layer are the lower-dimensional representation of the input.
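As a sketch of that idea (my own illustration, not code from the post; all names and constants are invented for the example), here is a minimal autoencoder in plain Python: 2-D points lying on the line y = 2x are squeezed through a one-number bottleneck and reconstructed, with encoder and decoder weights trained by gradient descent on the squared reconstruction error:

```python
# 2-D points that lie on the line y = 2x: one number should suffice to
# describe each of them, which is exactly what the bottleneck learns.
data = [(t, 2 * t) for t in (-1.0, -0.5, -0.2, 0.3, 0.6, 1.0)]

w1, w2 = 0.5, 0.5   # encoder: z = w1*a + w2*b   (the 1-number bottleneck)
v1, v2 = 0.5, 0.5   # decoder: (a_hat, b_hat) = (v1*z, v2*z)
lr = 0.02

def loss():
    total = 0.0
    for a, b in data:
        z = w1 * a + w2 * b
        total += (v1 * z - a) ** 2 + (v2 * z - b) ** 2
    return total / len(data)

for _ in range(3000):
    for a, b in data:
        z = w1 * a + w2 * b
        e1, e2 = v1 * z - a, v2 * z - b      # reconstruction errors
        dz = 2 * (e1 * v1 + e2 * v2)         # backprop through the decoder
        v1 -= lr * 2 * e1 * z
        v2 -= lr * 2 * e2 * z
        w1 -= lr * dz * a
        w2 -= lr * dz * b

print(round(loss(), 4))  # close to 0: one number per point is enough here
```

Note that you can *use* this without knowing why gradient descent works, which is the non-leakiness the post is about.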

So why is this important?

Well, it’s important because non-leaky mathematical abstractions are rather rare. Especially ones that have such a low barrier to entry and are used so widely. I wouldn’t compare neural networks to integral transforms just yet, but I think we are getting there.

I’d also argue it’s important because it explains why neural networks have not only taken over the field of "hard" ML problems, but are now making their way into all facets of ML where an SVM or DT or GB classifier might have worked just fine. It’s not necessarily because they are "better", but because people have more confidence in using them as an abstraction.

Lastly, it’s important because it’s a way to conceptualize why neural networks are in a way better than classical ML algorithms. This lack of leaks means that anyone can play around with them without breaking the whole thing and being thrown one level down. Want to change the shape? Sure. Want to change the activation function? Sure. Want to add a state function to certain elements? Go ahead. Want to add random connections between various elements? Don’t see why not… etc. They have a lot of tweakable hyperparameters, and they are modifiable not just in principle but in practice. Of course, every ML algorithm has tweakable parameters, but as soon as you start changing the kernel function of your SVM you realize that for the tweaks to be useful, the abstraction must break down and you need to learn the concepts underneath (and so on, and so on).

It’s rare for me to argue that a relatively popular and hyped-up thing is "even better" than people think. But in the case of neural networks, I truly think they are among the first of a "new" type of mathematical abstraction. They allow people who don’t have a dozen-plus years’ background of learning applied mathematics to do applied mathematics.
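A toy sketch of that tweakability (mine, not the author’s; the helper make_net and its parameters are invented for illustration): the shape and the activation function are just arguments, and swapping them requires no understanding of what happens underneath:

```python
import math
import random

random.seed(0)

def make_net(shape, activation):
    """Random-weight MLP: `shape` like [2, 4, 1]; `activation` any 1-arg fn."""
    layers = [[[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
              for n_in, n_out in zip(shape, shape[1:])]
    def net(x):
        for layer in layers[:-1]:
            x = [activation(sum(w * xi for w, xi in zip(row, x))) for row in layer]
        # last layer is linear (no activation), a common convention
        return [sum(w * xi for w, xi in zip(row, x)) for row in layers[-1]]
    return net

relu = lambda z: max(0.0, z)

# Want to change the shape? Sure. The activation function? Sure.
# Same two lines, no level-down knowledge required.
print(make_net([2, 4, 1], relu)([0.5, -0.3]))
print(make_net([2, 8, 8, 1], math.tanh)([0.5, -0.3]))
```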

Comment

https://www.lesswrong.com/posts/qRtbjHsJiwYggYhP4/neural-networks-as-non-leaky-mathematical-abstraction?commentId=QTfKKQnvsB5psnbxQ

> For another example take integrals (∫), where the argument for leakiness can be succinctly state. To master the abstraction one needs to understand all of calculus and all that is foundational to calculus (also known as "all of math" until 60 years ago or so).

I’m not sure I follow. I certainly did a lot of integration before I knew how to formalize the concept, and I think the formal details only rarely leak. Certainly, I got through an entire four-year math degree without learning most of the formalisms listed there.

Perhaps this is not "mastering" integrals, but… if integrals are above the bar for leakiness, I’d be surprised if neural nets are below it (though I’m less comfortable with those than I am with integrals).

Comment

https://www.lesswrong.com/posts/qRtbjHsJiwYggYhP4/neural-networks-as-non-leaky-mathematical-abstraction?commentId=73dMWihz8LDz36Bn7

I mean, the kinds of integrals one solves in school are rather trivial, essentially edge cases that never come up IRL (e.g. ∫x^2 or ∫e^x kind of thing). But even in that case, you still have to correctly define what you want to integrate as a function, you can’t just draw a random geometric shape and integrate it, and you have to correctly "use" the integral operator. Given an arbitrary function it’s not at all obvious what the integral of that function will look like, and it sometimes requires a lot of skill to deduce. Given an arbitrary problem where integrals are needed, I find that it’s often non-obvious how to pick the bounds, especially when we get into double and triple integrals (or however you call them, I’m referring to ∫∫ and ∫∫∫).
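To illustrate the point about bounds (my own sketch, not the commenter’s): even for the trivial double integral of f(x, y) = 1 over the triangle 0 ≤ y ≤ x ≤ 1, the inner bound depends on x, and that dependency has to be encoded by hand:

```python
# Midpoint Riemann sum for the double integral of f over the triangular
# region 0 <= y <= x <= 1. The answer should be the triangle's area, 1/2.
def double_integral(f, n=400):
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        # The inner bound y in [0, x] depends on x -- this is precisely the
        # part the comment says is non-obvious to get right.
        m = max(1, int(x / h))
        hy = x / m
        for j in range(m):
            y = (j + 0.5) * hy
            total += f(x, y) * h * hy
    return total

print(round(double_integral(lambda x, y: 1.0), 3))  # ≈ 0.5
```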

https://www.lesswrong.com/posts/qRtbjHsJiwYggYhP4/neural-networks-as-non-leaky-mathematical-abstraction?commentId=xCbjWtT2vdR2P9Zk5

Interesting perspective, thanks for crossposting!

https://www.lesswrong.com/posts/qRtbjHsJiwYggYhP4/neural-networks-as-non-leaky-mathematical-abstraction?commentId=r7zZtZcW4JxnmejxF

So is this an argument for the end-to-end principle?

Comment

https://www.lesswrong.com/posts/qRtbjHsJiwYggYhP4/neural-networks-as-non-leaky-mathematical-abstraction?commentId=fMEJ554pnhmjEQXor

My read was that it’s less an argument for the end-to-end principle and more an argument for modular, composable building blocks whose internals you don’t need to understand (I’m not the author, though). (Note that my experience of trying new combinations of deep learning components hasn’t really matched this. E.g., I’ve spent a lot of time and effort trying to get new loss functions to work with various deep learning architectures, often with very limited success, and often could not get away with not understanding what was going on "under the hood".)

Comment

> My read was that it’s less an argument for the end-to-end principle and more an argument for modular, composable building blocks of which understanding of internals is not required (not the author though).

If it could be construed as me arguing ‘for’ something then yes, this is what I was arguing for. I’m not seeing how the end-to-end principle applies here (as in, the one used in networking), but maybe it’s a different usage of the term I’m unfamiliar with.

https://www.lesswrong.com/posts/qRtbjHsJiwYggYhP4/neural-networks-as-non-leaky-mathematical-abstraction?commentId=o9T6hb7BH3EZ4DQ2g

> Take something like sigma summation (∑), at first glance it look like an abstraction, but it’s too leaky to be a good one. Why is it leaky ? Because all the concept it "abstracts" over need to be understood in order to use the sigma summation. You need to "understand" operations, operators, numbers, ranges, limits and series.

It’s just a for loop.

> As long as we are willing to trust the implementation of the software and hardware, we can quickly validate any mathematical tool over a very large finite domain.

There’s a smaller domain where we can validate the proposed counterexamples. To borrow from your example: "this is obviously false" 27^5+84^5+110^5+133^5=144^5.

> Obviously there’s cases where this concept break down a bit (e.g. Ramanujan summation), but these are the exception rather than the rule.

How is this a break down? (I don’t know what you’re building.)

> They have a lot of tweakable hyperparameters and they are no modifiable just in principle.

That wording at the end suggests a typo.

Comment

https://www.lesswrong.com/posts/qRtbjHsJiwYggYhP4/neural-networks-as-non-leaky-mathematical-abstraction?commentId=FsaCZN9m48aK7phDb

> It’s just a for loop.

It’s not a for loop; for loops don’t deal with infinity as far as I know.

> How is this a break down? (I don’t know what you’re building.)

As in, the result 1 + 2 + 3 + 4 … = −1/12 is "obviously false" yet mathematically true. So the pattern "if something is true in a very intuitive way then it must be mathematically true" doesn’t hold in those kinds of cases (as opposed to the 2 * 3 = 6 case, where mathematics correctly describes what we intuit to be true by saying the statement is correct), at least if you think of mathematics as a "language" in the "programming but running on wetware" sense.
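A sketch of the distinction (my illustration, not either commenter’s): a for loop only ever computes partial sums. For a convergent series those approach the limit; for 1 + 2 + 3 + … they just grow, and the −1/12 value comes from a different tool entirely (analytic continuation / Ramanujan summation), not from the loop:

```python
# A for loop computes *partial* sums -- the finite reading of sigma notation.
def partial_sum(term, n):
    total = 0.0
    for k in range(1, n + 1):
        total += term(k)
    return total

print(partial_sum(lambda k: 1 / 2**k, 50))   # ≈ 1.0 (convergent geometric series)
print(partial_sum(lambda k: k, 1000))        # 500500.0, nowhere near -1/12
```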

Comment

> yet mathematically true.

I wouldn’t say it’s "true". Unless you think 1 = 0. Proof:

[1] x = 1 + 1 + 1 + …
Subtract 1 from both sides.
[2] x − 1 = 1 + 1 + 1 + …
Substitute using [1].
[3] x − 1 = x
Subtract x from both sides.
[4] −1 = 0
Multiply both sides by negative 1.
[5] 1 = 0

Comment

I’m pretty sure appending a single number to an infinite series is not the same as appending a number to each of the terms (e.g. combining two infinite series as per my example). But even if what you wrote were "correct" by the same token that the sum of the divergent series I mentioned is, it doesn’t have much to do with my point in that paragraph, which was to say that these kinds of statements make no intuitive sense but yet have some correctness to them.

Comment

They are correct if you accept a strange premise like "infinity = 0" or ignore mistakes, like the one I made in the proof above.