Tuesday, April 7, 2020

The Emperor is naked -- deep learning was a good idea but gradient descent is not

Jan 20, 2020

Listening to Eric Weinstein talk about the DISC with his brother Bret made me think about my story.  It is a good conspiracy theory in the making.

Back in the 1990s, AI took a turn towards physics. I credit the Hebrew University program started by Daniel Amit and Haim Sompolinsky (both physicists) together with Hana Parnas & A? Abelas (biology) and Tali Tishby (Computer Science -- but really math/physics). They created the Center for Neural Computation, an interdisciplinary center for the study of the brain, of which I was a student from its inauguration.

Clearly I am biased: I am more an engineer than a theoretical thinker, and partial differential equations were never my thing.  And I am sure it started innocently enough: a bunch of great minds get together with a solid new idea and apply the tools they have.  Perhaps one of the flags was when Charlie Rosenberg was left out of the circle and forced to find a career elsewhere.  Charlie was more a software engineer than a theoretical mathematician.  Yet, beyond NetTalk, which propelled Neural Nets into the public eye (and created funding opportunities), I credit Charlie with introducing me to the idea that the interdisciplinary study of Neural Networks provides a common language for a diverse set of people to cooperate in ways that would not be possible otherwise.  A very powerful idea that was lost when the language spoken became partial differential equations.

So here we are many years later, and Neural Networks are now taught in university as a gradient descent solution provided by partial differential equations.  Deep learning has taken off and is driven by an industry and university complex that feeds itself.  All this has made NVIDIA very happy; they are leading the field in parallel processing for neural nets (a funny thing when you think that my first project with Haim Sompolinsky was programming the ETANN, Intel's neural net chip).  In addition, the professors retain their position of research (and power) as the oracles of partial differential equations, the heart of the deep learning revolution.

However, learning as a field has been narrowed down to minimizing a loss function.  Why does that matter, when everything works so well?

Well, in truth there are some chinks in the armour.  Melanie Mitchell just wrote a book on the topic.  Gary Marcus just held the #AIDebate with Yoshua Bengio.  But even before that, Geoffrey Hinton had put out there that perhaps the field has gone astray following the deep learning path.  Frustratingly, he then proposed a model that incorporates 'capsules' that are, of course, trained with gradient descent...

My position has always been that learning is unsupervised; classification is supervised.  This occurred to me when I first met back-propagation in Judith Dayhoff's book and felt drawn to Kohonen's feature maps.  My master's was focused on how to learn dynamically, which led directly to my doctoral thesis.  Not only is learning unsupervised, it happens continually, not in batches.

So what has become apparent to me recently is that I can now better describe why Deep Learning with Gradient Descent is broken.

1.  All supervised learning is a method for overfitting and biasing the learning.
2.  Even just choosing the training data is a form of weak supervision and hence bias (a toy example follows this list).
3.  Embeddings are numeric representations that enshrine these biases into the system.
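To make point 2 concrete, here is a toy sketch in Python (my own illustration, not anything from the original argument): every label is correct, yet simply choosing which slice of the world supplies the training examples already bends the model toward that slice.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample(n, shift):
    # two Gaussian classes separated along the first axis, both shifted by `shift`
    x0 = rng.normal(loc=[-2.0 + shift, 0.0], scale=1.0, size=(n, 2))
    x1 = rng.normal(loc=[+2.0 + shift, 0.0], scale=1.0, size=(n, 2))
    return np.vstack([x0, x1]), np.array([0] * n + [1] * n)

X_train, y_train = sample(500, shift=3.0)   # a "curated" slice of the world
X_test, y_test = sample(500, shift=0.0)     # the world we actually meet later

clf = LogisticRegression().fit(X_train, y_train)
print("accuracy on the curated slice:", clf.score(X_train, y_train))
print("accuracy on the wider world:  ", clf.score(X_test, y_test))

The labels were never wrong; the selection of training data alone is what biased the decision boundary.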

The current approach to deep learning cannot envision a world without a loss function associated with a labeled training set.

Some of this is now changing: 'few shot' and 'one/zero shot' learning are forcing the Deep Learning community to think a little more about what learning is.  Yet they are still falling into the trap of either 'weak' or 'strong' supervision and gradient descent loss functions (everything must be numeric).

The solution requires us to:
1. Correctly define learning (creation of an efficient model -- not memorization, not classification)
2. Correctly measure the success of learning (it cannot be a loss function associated with a target!)
3. Leverage symbolic/categorical learning methods to communicate information between levels
4. Develop a multi-step learning approach (a rough sketch follows this list):
4a.  learn independent observations (fully streamable) at multiple scales
4b.  learn dependent observations (weak supervision)
4c.  classify based on a & b (strong supervision)
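Here is a rough sketch of how steps 4a-4c could be kept separate in code (my own illustration; the class names and interfaces are hypothetical, and this is not the method behind the video below).  The point is only the structure: unlabeled streaming statistics first, co-occurrence second, labels last.

from collections import defaultdict

class ScaleModel:
    # 4a: unsupervised and fully streamable -- accumulate statistics of
    # independent observations at one scale; no labels, no batches
    def __init__(self):
        self.counts = defaultdict(int)
    def observe(self, symbol):
        self.counts[symbol] += 1

class ContextModel:
    # 4b: dependent observations (weak supervision) -- which symbols co-occur
    def __init__(self):
        self.cooccur = defaultdict(int)
    def observe(self, symbol_a, symbol_b):
        self.cooccur[frozenset((symbol_a, symbol_b))] += 1

class LabelLayer:
    # 4c: only here do labels (strong supervision) enter, on top of a & b
    def __init__(self):
        self.label_for = {}
    def attach(self, symbol, label):
        self.label_for[symbol] = label
    def classify(self, symbol):
        return self.label_for.get(symbol, "unknown")

fine, coarse = ScaleModel(), ScaleModel()   # the same stream seen at two scales
context, labels = ContextModel(), LabelLayer()
for sym in ["edge", "edge", "corner", "edge"]:
    fine.observe(sym)
coarse.observe("square")
context.observe("corner", "square")
labels.attach("square", "building")
print(labels.classify("square"))            # -> "building"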

Here is my attempt at solving this for ImageNet - https://www.youtube.com/watch?v=HwmqbUVF26g

---
I now understand something else better as well.  I previously wrote:
https://ashlag-cause-and-kook-affect.blogspot.com/2018_07_09_archive.html

which ends with:
Your choice drives the identity of the elements that you will meet. Hence, when I meet a book and you meet the same book, they are not the same, even though they may contain the same content, but because each book is infused with a second identity, through the second-order relationships, they are different books.  Simply said, the words in the book have different meaning to me than to you due to the different social constructs that we live in, each providing a different context to the words and hence a different meaning.

Now I understand that there are two stages: in the first stage I am completely free and independent; I construct my own personal identity and world view.  In the second stage I limit my field of view to the distribution of friends I have chosen, the reviewer list.  This is the 'weak' supervision coming into play.  I am biasing myself, but within the limited view of the world that has been imposed upon me.

Now I can at any time (theoretically) move and find new friends; then my new community will be the element of weak supervision in my life.  The beauty is that I do not need to reshape my personal identity; that can stay fixed all along, preserving my core free will (unsupervised development).

-- Here is an interesting article pointed out to me by Tuvia Kutscher:
Sparse Algorithms Are Not Stable: A No-Free-Lunch Theorem, by Huan Xu, Constantine Caramanis, and Shie Mannor.

I am thinking that, intuitively, their statement is only true for learning algorithms that minimize a global loss function (and are not sensitive to scale).  However, it seems to me that a local (hierarchical) learning methodology can be both sparse and stable.

Thus, given a specific scale, a local algorithm is stable.  (The system may *appear* to be less sparse at a different scale.)

few shot learning...

A metric is the key to learning.  But how to define a metric in a non-numeric space?

The challenge in symbolic learning is measuring the relationship between categories.  The current style in the deep learning community is to convert the categories to numbers and then train models on the numbers.
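There is, however, at least one simple way to define a distance directly on symbolic objects, without ever mapping them to numbers: for example, the Jaccard distance between attribute sets, which is a true metric.  (This is my illustration of the general point, not a proposal from this post.)

def jaccard_distance(a: set, b: set) -> float:
    # 1 - |A intersect B| / |A union B|, defined purely on the symbols themselves
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

cat = {"fur", "tail", "whiskers", "meows"}
dog = {"fur", "tail", "barks"}
stone = {"grey", "hard"}
print(jaccard_distance(cat, dog))    # 0.6 -- they share fur and tail
print(jaccard_distance(cat, stone))  # 1.0 -- nothing in common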

So the latest and greatest approach in the deep learning community is to create an embedding.  The embedding maps the categories to a numeric space.  Ideally the numeric space is constructed in a manner that makes full use of the space.  Imagine a square where all the data falls into the top right corner; that is not an efficient use of the space.  Similarly, the embedding tries to create a space that is utilized efficiently, spreading the data across the space.

Well, this means that the embedding creates a space relative to the data set it is trained on.  Hence, the distribution of the training data is critical in determining the space.  When data from a new distribution appears, it will not map well into the embedding.

Hence, in the few-shot learning world, where very little is available for the training phase and it is assumed new distributions will arrive, it is not enough to just train a classifier or search in the embedded space; it is necessary to recreate the embedding with the new data, otherwise the classifier will be trapped looking only in one corner of the space.
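A small numeric demonstration of that trap, using PCA as a stand-in for a learned embedding (my own illustration; the argument does not depend on PCA specifically).  The embedding keeps only the directions that mattered for its training distribution, so new classes that differ along a discarded direction collapse onto each other.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

def cluster(center, n=200):
    return np.asarray(center, dtype=float) + 0.1 * rng.standard_normal((n, 3))

# training classes A, B, C differ only along the first two dimensions
A, B, C = cluster([1, 0, 0]), cluster([0, 1, 0]), cluster([-1, -1, 0])
embedding = PCA(n_components=2).fit(np.vstack([A, B, C]))

# new classes X, Y differ only along the third dimension -- a direction
# the embedding never needed for A, B, C, so it was thrown away
X, Y = cluster([0, 0, 1]), cluster([0, 0, -1])
zX, zY = embedding.transform(X), embedding.transform(Y)

print("X-Y centroid distance in the embedding:",
      np.linalg.norm(zX.mean(axis=0) - zY.mean(axis=0)))   # ~0, indistinguishable
print("X-Y centroid distance in the original space:",
      np.linalg.norm(X.mean(axis=0) - Y.mean(axis=0)))     # ~2, clearly separated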

So now I understand what Bengio is saying: you learn the embedding in a broad space, and then you use attention to focus only on the part of that space relevant to the specific problem.

Now, if the embedding is broad enough, it didn't learn anything.  If the embedding is narrow, it learnt the training distribution.  I get that.
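A minimal sketch of how I read that idea (my interpretation, not Bengio's actual formulation): keep one broad embedding, and let a task-specific attention vector decide which of its dimensions count when measuring distances for a given problem.  The 'texture'/'shape' split below is an invented example.

import numpy as np

def attended_distance(z1, z2, attention):
    # distance in the broad embedding, but only along the attended dimensions
    w = attention / attention.sum()
    return float(np.sqrt(np.sum(w * (z1 - z2) ** 2)))

rng = np.random.default_rng(0)
z_cat, z_dog = rng.standard_normal(8), rng.standard_normal(8)

texture_attention = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)  # one task
shape_attention   = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)  # another task

print(attended_distance(z_cat, z_dog, texture_attention))
print(attended_distance(z_cat, z_dog, shape_attention))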


Do they do that?

http://papers.nips.cc/paper/6996-prototypical-networks-for-few-shot-learning.pdf


So, it looks like they create the embedding with the entire train dataset?  I could not figure out if the embedding included data that was later defined as validation or test data, or was built only upon the training data...
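For reference, the core mechanism of the paper as I understand it: each class is represented by the mean of its embedded support examples (its prototype), and a query is assigned to the nearest prototype.  The `embed` below is just a placeholder for whatever embedding network was trained beforehand, which is exactly where my question about the training split comes in.

import numpy as np

def embed(x):
    # placeholder: in the paper this is a trained CNN; here just the identity
    return np.asarray(x, dtype=float)

def prototypes(support_x, support_y):
    # one prototype per class: the mean of its embedded support examples
    protos = {}
    for label in set(support_y):
        members = [embed(x) for x, y in zip(support_x, support_y) if y == label]
        protos[label] = np.mean(members, axis=0)
    return protos

def classify(query_x, protos):
    z = embed(query_x)
    return min(protos, key=lambda label: np.linalg.norm(z - protos[label]))

# a few-shot episode with two novel categories, two support examples each
support_x = [[0.9, 0.1], [1.1, 0.0], [0.0, 1.0], [0.1, 0.9]]
support_y = ["X", "X", "Y", "Y"]
protos = prototypes(support_x, support_y)
print(classify([0.8, 0.2], protos))   # -> "X"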

—————
I was trying to understand: when you create the embeddings, do you train the networks on the entire train set, or on the entire dataset?

And later, when you define a train/test set for the prototypical networks, is the train/test similarly separated?

So for example, say the embedding is created for categories A, B & C and then you build the prototypical network on A, B & C.  Then, when categories X & Y come along, do you utilize the existing embedding network, map X & Y to the embedded space, and then import them into the prototypical network?  Or perhaps A, B, C & X, Y are all utilized to create the embedding, but only A, B & C are employed to create the prototypical network?
—————-

What is the analogy in the Ashlag/Kook world?