Listening to Eric Weinstein talk about the DISC with his brother Bret made me think about my story. It is a good conspiracy theory in the making.
Back in the 1990s, AI took a turn towards physics. I credit the Hebrew University program started by Daniel Amit and Haim Sompolinsky (both physicists) together with Hana Parnas and A? Abeles (biology) and Tali Tishby (Computer Science -- but really math/physics). They created the Center for Neural Computation, an interdisciplinary center for the study of the brain, of which I was a student from its inauguration.
Clearly I am biased: I am more an engineer than a theoretical thinker, and partial differential equations were never my thing. And I am sure it started innocently enough, a bunch of great minds getting together with a solid new idea and applying the tools they had. Perhaps one of the warning flags was when Charlie Rosenberg was left out of the circle and forced to find a career elsewhere. Charlie was more a software engineer than a theoretical mathematician. Yet, beyond NETtalk, which propelled neural nets into the public eye (and created funding opportunities), I credit Charlie for introducing me to the idea that the interdisciplinary study of neural networks provides a common language for a diverse set of people to cooperate in ways that would not be possible otherwise. A very powerful idea, and one that was lost when the language spoken became partial differential equations.
So here we are many years later, and neural networks are now taught in university as a gradient descent solution provided by partial differential equations. Deep learning has taken off and is driven by an industry and university complex that feeds itself. All this has made NVIDIA very happy; they are leading the field in parallel processing of neural units (funny when you think that my first project with Haim Sompolinsky was programming the ETANN, Intel's neural net chip). In addition, the professors retain their positions of research (and power) as the oracles of partial differential equations, the heart of the deep learning revolution.
However, learning as a field has been narrowed down to minimizing a loss function. Why does that matter, when everything works so well?
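To make that narrowing concrete, here is the standard recipe written out (a generic textbook formulation, not any particular paper's): fix a labeled training set, define a loss against the labels, and follow the gradient.

```latex
% The standard supervised recipe: labeled pairs (x_i, y_i), a parametric model f_\theta,
% a loss \ell measured against the labels, and gradient descent on the empirical average:
\min_{\theta}\; L(\theta) \;=\; \frac{1}{N}\sum_{i=1}^{N} \ell\big(f_{\theta}(x_i),\, y_i\big),
\qquad
\theta_{t+1} \;=\; \theta_t \;-\; \eta\, \nabla_{\theta} L(\theta_t).
% Everything -- data, labels, and the measure of success -- must be numeric
% so that the gradient exists.
```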
Well, in truth there are some chinks in the armour. Melanie Mitchell just wrote a book on the topic. Gary Marcus just held the #AIDebate with Yoshua Bengio. But even before that, Geoffrey Hinton had put out there that perhaps the field has gone astray following the deep learning path. Frustratingly, he then proposed a model that incorporates 'capsules' that are, of course, trained with gradient descent...
My position has always been that learning is unsupervised; classification is supervised. This occurred to me when I first met back-propagation in Judith Dayhoff's book and felt drawn to Kohonen's feature maps. My master's thesis focused on how to learn dynamically, which led directly to my doctoral thesis. Not only is learning unsupervised, it happens continually, not in batches.
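As an illustration of the contrast, here is a minimal sketch (a toy of my own, not code from the thesis) of a Kohonen-style feature map that updates from one unlabeled observation at a time, with no labels, no batches, and no loss target:

```python
import numpy as np

def train_som(stream, grid_shape=(10, 10), dim=3, lr0=0.5, sigma0=3.0, decay=0.001, seed=0):
    """Online Kohonen feature map: one unlabeled sample at a time, no batches."""
    rng = np.random.default_rng(seed)
    weights = rng.random(grid_shape + (dim,))          # one prototype per grid node
    grid = np.stack(np.meshgrid(*[np.arange(n) for n in grid_shape], indexing="ij"), axis=-1)
    for t, x in enumerate(stream):
        lr = lr0 * np.exp(-decay * t)                  # learning rate decays over time
        sigma = sigma0 * np.exp(-decay * t)            # neighbourhood shrinks over time
        dists = np.linalg.norm(weights - x, axis=-1)   # distance of x to every prototype
        bmu = np.unravel_index(np.argmin(dists), grid_shape)   # best matching unit
        grid_dist2 = np.sum((grid - np.array(bmu)) ** 2, axis=-1)
        h = np.exp(-grid_dist2 / (2 * sigma ** 2))     # neighbourhood function around the BMU
        weights += lr * h[..., None] * (x - weights)   # pull the neighbourhood toward x
    return weights

# usage: feed an open-ended stream of unlabeled samples, one at a time
rng = np.random.default_rng(1)
som = train_som(rng.random(3) for _ in range(5000))
```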
So what has become apparent to me recently is that I can now better describe why Deep Learning with Gradient Descent is broken.
1. All supervised learning is a method for overfitting and biasing the learning
2. Even just choosing the training data is a method of weak supervision and hence bias
3. Embeddings are numeric representations that enshrine these biases into the system (a small sketch after this list illustrates the point)
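A minimal sketch of point 3 (a toy example of my own, not a benchmark): the same unlabeled observations, run through the same gradient-descent recipe, yield different learned representations depending only on which labels the supervisor happened to assign.

```python
import numpy as np

def fit_logreg(X, y, lr=0.1, steps=2000):
    """Plain gradient descent on the cross-entropy loss; the learned weight
    vector is the 1-D 'embedding' direction the labels impose on the data."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))        # sigmoid predictions
        w -= lr * X.T @ (p - y) / len(y)        # gradient of the mean cross-entropy
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                   # the same unlabeled observations

# Supervisor A and supervisor B slice the very same data differently.
y_a = (X[:, 0] > 0).astype(float)               # labels chosen by the first coordinate
y_b = (X[:, 1] > 0).astype(float)               # labels chosen by the second coordinate

print(fit_logreg(X, y_a))   # points (roughly) along axis 0
print(fit_logreg(X, y_b))   # points (roughly) along axis 1
# Same data, same algorithm: the representation learned is dictated by the labels.
```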
The current approach to deep learning cannot envision a world without a loss function associated with a labeled training set.
Some of this is now changing: 'few-shot' and 'one/zero-shot' learning is forcing the deep learning community to think a little more about what learning is. Yet they are still falling into the trap of either 'weak' or 'strong' supervision and gradient-descent loss functions (everything must be numeric).
The solution requires us to:
1. Correctly define learning (creation of an efficient model -- not memorization, not classification)
2. Correctly measure success of learning (it cannot be a loss function associated with a target!)
3. Leverage symbolic/categorical learning methods to communicate information between levels
4. Develop a multi-step learning approach (a minimal sketch follows this list):
4a. learn independent observations (fully streamable) at multiple scales
4b. learn dependent observations (weak supervision)
4c. classify based on a & b (strong supervision)
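Here is a minimal sketch of the shape such a pipeline might take (the names and interfaces are mine, invented for illustration; this is not the system in the video below): stage (a) builds unsupervised, streamable models at several scales and emits symbols, stage (b) only records which symbols co-occur (weak supervision), and stage (c) is the only place a label ever appears.

```python
from collections import Counter, defaultdict

class ScaleModel:
    """Stage (a): an unsupervised, fully streamable model at one scale.
    Here it is just an online quantizer that buckets each value at a given resolution."""
    def __init__(self, resolution):
        self.resolution = resolution
        self.counts = Counter()                      # symbol -> how often it was seen
    def observe(self, value):
        symbol = (self.resolution, round(value / self.resolution))   # symbolic, not numeric
        self.counts[symbol] += 1
        return symbol

class CoOccurrence:
    """Stage (b): weak supervision -- only records which symbols appear together."""
    def __init__(self):
        self.links = defaultdict(Counter)
    def observe(self, symbols):
        for s in symbols:
            for t in symbols:
                if s != t:
                    self.links[s][t] += 1

class Classifier:
    """Stage (c): the only stage that ever sees a label (strong supervision)."""
    def __init__(self):
        self.votes = defaultdict(Counter)
    def train(self, symbols, label):
        for s in symbols:
            self.votes[s][label] += 1
    def predict(self, symbols):
        tally = Counter()
        for s in symbols:
            tally.update(self.votes[s])
        return tally.most_common(1)[0][0] if tally else None

# usage: two scales over a stream; labels appear only at the final stage
scales = [ScaleModel(0.1), ScaleModel(1.0)]
links, clf = CoOccurrence(), Classifier()
for value, label in [(0.12, "low"), (0.95, "high"), (0.14, "low"), (1.10, "high")]:
    symbols = [m.observe(value) for m in scales]   # stage (a): unsupervised, per scale
    links.observe(symbols)                         # stage (b): co-occurrence only
    clf.train(symbols, label)                      # stage (c): strong supervision
print(clf.predict([m.observe(0.13) for m in scales]))   # -> 'low'
```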
Here is my attempt at solving this for ImageNet - https://www.youtube.com/watch?v=HwmqbUVF26g
---
I now understand something else better as well. I previously wrote:
https://ashlag-cause-and-kook-affect.blogspot.com/2018_07_09_archive.html
which ends with:
Your choice drives the identity of the elements that you will meet. Hence, when I meet a book and you meet the same book, they are not the same, even though they may contain the same content, but because each book is infused with a second identity, through the second-order relationships, they are different books. Simply said, the words in the book have different meaning to me than to you due to the different social constructs that we live in, each providing a different context to the words and hence a different meaning.
Now I understand that there are two stages. In the first stage I am completely free and independent; I construct my own personal identity and world view. In the second stage I limit my field of view to the distribution of friends I have chosen, the reviewer list. This is the 'weak' supervision coming into play. I am biasing myself, but within the limited view of the world that has been imposed upon me.
Now I can at any time (theoretically) move and find new friends, and then my new community will become the element of weak supervision in my life. The beauty is that I do not need to reshape my personal identity; that can stay fixed all along, preserving my core free will (unsupervised development).
-- here is an interesting article pointed out to me by Tuvia Kutscher
Huan Xu, Constantine Caramanis, and Shie Mannor, "Sparse Algorithms are not Stable: A No-free-lunch Theorem"
Thus, given a specific scale, a local algorithm is stable. (The system may *appear* to be less sparse at a different scale.)
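For context, my rough paraphrase of the theorem's flavour (see the paper for the precise definitions and statement): an algorithm cannot simultaneously be sparse, in the sense of selecting only a strict subset of the features, and uniformly stable, in the sense that removing one training sample changes its loss by a vanishing amount.

```latex
% Uniform stability (Bousquet & Elisseeff): an algorithm A is \beta_n-stable if removing
% any single training point changes its loss on any sample z by at most \beta_n:
\left| \ell\big(A_{S},\, z\big) - \ell\big(A_{S \setminus i},\, z\big) \right| \;\le\; \beta_n
\qquad \text{for all } S \text{ with } |S| = n,\; \text{all } i,\; \text{all } z.
% Rough statement of the trade-off in Xu, Caramanis & Mannor (their paper has the precise
% version): if A is sparse -- it can discard a strict subset of the features -- then
% \beta_n does not vanish as n \to \infty, i.e. A cannot be uniformly stable.
```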