[Bmi] Deep Learning, Convolution, and Error Back-Propagation
Juyang Weng
weng at cse.msu.edu
Sun Mar 22 12:41:00 EDT 2015
Dear colleagues,
This is a discussion about well-known techniques, not specifically about
anyone's particular work. We have had many papers about neural networks,
but we have not had a sufficiently honest discussion of well-known
techniques. At least I hesitated very much to discuss such a subject,
because Profs. X, Y, Z used such techniques. This lack of honesty has
caused a great waste of resources, including time (of our professors,
researchers, postdocs, and graduate students) and money (of governments,
private foundations, and companies). Still, I am afraid that the
following paragraphs will make some well-known researchers angry. For
that reason, the discussion below identifies me (J. Weng) as one who
should be blamed for using some of the well-known techniques. I also
made mistakes. Please accept my apology.
Please reply with your comments.
---- some new paragraphs in the Brain Principles Manifesto ----
Industrial and academic interests have been keen on a combination of two
things: easily understandable tests (e.g., G. Hinton et al. NIPS 2012,
congratulations!) and the involvement of major companies (e.g., Google,
thanks!). We have read statements like “our results can be improved
simply by waiting for faster GPUs and bigger datasets to become
available” (G. Hinton et al. NIPS 2012). However, the newly discovered
brain principles tell us that the way such tests are conducted (e.g.,
ImageNet) will give only vanishing gains that do not lead to a
human-like zero error rate, regardless of how long Moore's Law continues
and how many more static images are added to the training set.
Why? All such tests use static images in which objects are mixed with
the background. Such tests therefore prevent participating groups from
seriously considering autonomous object segmentation (free of
handcrafted object models). Through synapse maintenance (Y. Wang et al.
ICBM 2012), neurons in a human brain automatically cut off inputs from
background pixels if those pixels match poorly compared with attended
object pixels. Our babies spend much more time in the dynamic physical
world than looking at static photos.
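A minimal Python sketch may make the trimming idea concrete. It is an
illustration of the mechanism described above, not the algorithm of
Y. Wang et al.; the running per-synapse match statistic and the keep
ratio are assumptions made for the example.

import numpy as np

def synapse_maintenance(weights, match_trace, keep_ratio=0.6):
    # Cut off the synapses whose inputs have matched the neuron's
    # weights poorly over time; keep only the well-matching fraction.
    n_keep = max(1, int(keep_ratio * weights.size))
    keep = np.argsort(match_trace)[::-1][:n_keep]  # best-matching first
    mask = np.zeros(weights.size, dtype=bool)
    mask[keep] = True
    return np.where(mask, weights, 0.0), mask

# Toy case: synapses 0-5 cover the attended object (high running match),
# synapses 6-9 cover background pixels (low running match).
w = np.random.default_rng(0).random(10)
trace = np.array([0.9] * 6 + [0.2] * 4)
trimmed, mask = synapse_maintenance(w, trace)
print(mask)  # the four background synapses are cut off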
Our industry should learn more powerful brain mechanisms that go
beyond conventional, well-known, well-tested techniques. The following
are some examples:
(1) Deep Learning Networks (e.g., J. Weng et al. IJCNN 1992, Y. LeCun et
al. Proceedings of the IEEE 1998, G. Hinton et al. NIPS 2012) are not
only biologically implausible but also functionally weak. The brain uses
a rich network of processing areas (e.g., Felleman & Van Essen, Cerebral
Cortex 1991) whose connections are almost always two-way (J. Weng,
Natural and Artificial Intelligence, 2012), not a cascade of modules as
in the Deep Learning Networks. Such a Deep Learning Network is not able
to conduct top-down attention in a cluttered scene (e.g., attention to
location or type in J. Weng, Natural and Artificial Intelligence, 2012,
or attention to more complex object shapes as reported in L. B. Smith et
al. Developmental Science 2005).
(2) Convolution (e.g., J. Weng et al. IJCNN 1992, Y. LeCun et al.
Proceedings of the IEEE 1998, G. Hinton et al. NIPS 2012) is not only
biologically implausible but also computationally weak. Why? All
feature neurons in the brain carry not only sensory information but also
motor information (e.g., Felleman & Van Essen, Cerebral Cortex 1991),
so that later-processing neurons become less concrete and more abstract,
which is impossible to accomplish using shift-invariant convolution.
Namely, convolution is always location-concrete (even with max-pooling)
and never location-abstract (see the first sketch after this list).
(3) Error back-propagation in neural networks (e.g., G. Hinton et al.
NIPS 2012) is not only biologically implausible (e.g., a baby does not
receive explicit error signals at its motors) but also damaging to
long-term memory, because it lacks match-based competition for error
causality (such as the competition in SOM, LISSOM, and LCA, an optimal
SOM; see the second sketch after this list). Even though the gradient
vector identifies a neuron that can reduce the current error, the
current error is not that neuron's business at all; it should keep its
own long-term memory unchanged. That is why error back-propagation is
well known to be bad for incremental learning and requires research
assistants to try many guesses of the initial weights (i.e., using the
test set as the training set!). Let us not be blinded by artificially
low error rates.
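As a concrete illustration of the location-concreteness claimed in (2),
the following Python sketch shows that convolution plus max-pooling is
shift-equivariant: when the object moves, the whole response pattern
moves with it, so position is never abstracted away by these operations
alone. The 1-D kernel, signal length, and pooling width are arbitrary
assumptions made for the demonstration.

import numpy as np

def conv1d_valid(x, k):
    # Plain 1-D "valid" convolution (correlation form, as in conv nets).
    n = x.size - k.size + 1
    return np.array([np.dot(x[i:i + k.size], k) for i in range(n)])

def maxpool1d(y, p=2):
    # Non-overlapping max-pooling with window width p.
    m = y.size // p
    return y[:m * p].reshape(m, p).max(axis=1)

kernel = np.array([1.0, -1.0])       # a toy edge-detecting kernel
x = np.zeros(16); x[4] = 1.0         # an "object" at position 4
x_shifted = np.roll(x, 6)            # the same object at position 10

print(maxpool1d(conv1d_valid(x, kernel)))
print(maxpool1d(conv1d_valid(x_shifted, kernel)))
# The same response value appears at a shifted position: the feature
# map encodes "edge at location i", never "edge somewhere"; it is
# location-concrete, not location-abstract.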
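And for the memory-overwriting point in (3), a second sketch contrasts
a match-based, winner-take-all Hebbian-style update (a simplification in
the spirit of SOM and LCA, not the published LCA algorithm) with one
error back-propagation step on the same input. The gradient step moves
every neuron's weights; the match-based step moves only the winner's.

import numpy as np

rng = np.random.default_rng(0)
W = rng.random((5, 8))                      # 5 feature neurons, 8-D input
W /= np.linalg.norm(W, axis=1, keepdims=True)
x = rng.random(8); x /= np.linalg.norm(x)
lr = 0.1

# Match-based competition: only the best-matching neuron fires and
# updates; every other neuron's long-term memory stays untouched.
winner = int(np.argmax(W @ x))              # inner-product match
W_match = W.copy()
W_match[winner] += lr * (x - W_match[winner])

# Error back-propagation flavor: one gradient step on the squared error
# 0.5*||W x - t||^2 moves *all* weights to serve the current error,
# whether or not a neuron "owns" that memory.
t = np.zeros(5); t[2] = 1.0                 # an arbitrary target
err = W @ x - t
W_grad = W - lr * np.outer(err, x)

print("neurons changed (match-based):",
      np.flatnonzero((W_match != W).any(axis=1)))   # only the winner
print("neurons changed (gradient)   :",
      np.flatnonzero((W_grad != W).any(axis=1)))    # all five neurons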
Do our industry and public need another 20 years?
---- end of the new paragraphs -----
Full text:
The Brain Principles Manifesto
(Draft Version 4.5)
March 21, 2015
Historically, public acceptance of science was slow. For example,
Charles Darwin waited about 20 years (from the 1830s to 1858) to publish
his theory of evolution for fear of public reaction. About 20 years
later (by the 1870s) the scientific community and much of the general
public had accepted evolution as a fact. Of course, the debate on
evolution still goes on today.
Is the public acceptance of science faster in modern days? Not
necessarily, even though we now have better and faster means of
communication. The primary reason is still the same but much more
severe: the remaining open scientific problems are more complex, and the
required knowledge goes beyond that of any single person.
For instance, network-like brain computation, that is, connectionist
computation (e.g., J. McClelland and D. Rumelhart, Parallel Distributed
Processing, 1986), has long been doubted and ignored by industry. Deep
convolutional networks appeared by at least 1980 (K. Fukushima). The
max-pooling technique for deep convolutional networks was published by
1992 (J. Weng et al.). However, Apple, Baidu, Google, Microsoft,
Samsung, and other major related companies did not show considerable
interest until after 2012. That is a delay of about 20 years. The two
techniques above are not very difficult to understand. However, these
two suddenly hot techniques have already been rendered obsolete by the
discovery of more fundamental and effective principles of the brain,
six of which are intuitively explained below.
[The new paragraphs quoted above appear here in the full text.]
On the other hand, neuroscience and neuropsychology have made many
advances by providing experimental data (e.g., Felleman & Van Essen,
Cerebral Cortex 1991). However, it has been well recognized that these
disciplines are data-rich and theory-poor. The phenomena of brain
circuits and brain behavior are extremely rich. Many researchers in
these areas use only local tools (e.g., attractors that can only fall
into local extrema) and consequently have been overwhelmed by the
richness of brain phenomena. A fundamental reason is that they miss
the guidance of the global automata theory of computer science,
although traditional automata do not emerge. For example, X.-J. Wang et
al. Nature 2013 stated correctly that neurons of mixed selectivity were
rarely analyzed but have been widely observed. However, mixed
selectivity has already been well explained, as a special case, by the
new Emergent Turing Machine in Developmental Networks in a theoretically
complete way. The traditional Universal Turing Machine is a theoretical
model for modern-day computers (how computers work), but it does not
emerge. The mixed selectivity of neurons in such a new kind of Turing
Machine is caused by emergent and beautiful brain circuits, but each
neuron still uses a simple inner-product similarity in its
high-dimensional and dynamic input space (see the sketch below).
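Here is a small sketch of such an inner-product match, with the
bottom-up (sensory) part and the top-down (motor/context) part of the
input concatenated as the text describes. The function name and the
dimensions are illustrative assumptions, not the Developmental Network's
exact area function.

import numpy as np

def neuron_preresponse(x_bottom_up, z_top_down, w):
    # Pre-response: the inner product between the neuron's normalized
    # input and its normalized weight vector.  Because the input
    # concatenates a bottom-up part and a top-down part, the match
    # depends on both at once, one simple way "mixed selectivity" arises.
    v = np.concatenate([x_bottom_up, z_top_down])
    return float(np.dot(v / np.linalg.norm(v), w / np.linalg.norm(w)))

rng = np.random.default_rng(0)
w = rng.random(12)                 # learned weights: 8 sensory + 4 motor
x = rng.random(8)                  # one sensory input
print(neuron_preresponse(x, np.array([1.0, 0, 0, 0]), w))  # goal A
print(neuron_preresponse(x, np.array([0, 1.0, 0, 0]), w))  # goal B:
# same sensory input, different top-down context, different response.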
In October 2011, a highly respected multi-disciplinary professor kindly
wrote: “I tell these students that they can work on brains and do good
science, or work on robots and do good engineering. But if they try to
do both at once, the result will be neither good science nor good
engineering.” How long will it take for industry and the public to
accept that this pessimistic view of the brain was no longer true even then?
The brain principles that have already been discovered could bring
fundamental changes in the way humans live, the way countries and
societies are organized, our industry, our economy, and the way humans
treat one another.
The known brain principles tell us that anybody's brain, regardless of
education and experience, is fundamentally shortsighted, in both space
and time. Prof. Jonathan Haidt documented such shortsightedness well in
his book “The Righteous Mind: Why Good People Are Divided by Politics
and Religion”, although not in terms of brain computation.
In terms of brain computation, the circuits in your brain self-wire
beautifully and precisely according to your real-time experience (the
genome only regulates), and the various invariance properties they
require for abstraction also largely depend on experience. Serotonin
(triggered by, e.g., threats), dopamine (triggered by, e.g., praise),
and other neurotransmitters quickly bias these circuits so that neurons
for more long-term thoughts lose the competition to fire. Furthermore,
such bias has a long-term effect. Therefore, you make long-term mistakes
but still feel you are right. Everybody is like that. Depending on
experience, shortsightedness varies in subject matter.
Traditionally, many domain experts think that computers and brains use
very different principles. The naturally emergent Turing Machine in
Developmental Networks, which has been mathematically proved (see J.
Weng, Brain as an Emergent Finite Automaton: A Theory and Three
Theorems, IJIS, 2015), should change our intuition.
The new result proposes the following six brain principles:
1. The developmental program (genome-like, task-nonspecific)
regulates the development (i.e., lifetime learning) of a
task-nonspecific “brain-like” network, the Developmental Network. The
Developmental Network is general-purpose: in principle, it can learn any
task the body is capable of, not only pattern recognition.
2. The brain's images are naturally sensed images of cluttered scenes
where many objects mix. In typical machine training (e.g., Krizhevsky
et al. NIPS 2012), each training image has a bounding box drawn around
each object to be learned, which is not the case for a human baby.
Neurons in the Developmental Network automatically learn object
segmentation through synapse maintenance (sketched above).
3. The brain's muscle areas have multiple subareas, where each subarea
represents either declarative knowledge (e.g., abstract concepts such as
location, type, scale, etc.) or non-declarative knowledge (e.g., driving
a car or riding a bicycle), not just discrete class labels in a global
classification.
4. Each brain in the physical world is at least a Super Turing
Machine in a Developmental Network. Every area in the network emerges
(it does not statically exist; see M. Sur et al. Nature 2000 and P.
Voss, Frontiers in Psychology 2013) using a unified area function whose
feature development is nonlinear but free of local minima, contrary to
engineering intuition: not convolution, not error back-propagation.
5. The brain's Developmental Network learns incrementally, taking
one pair of sensory pattern and motor pattern at a time to update the
“brain” and discarding the pair immediately after (see the sketch after
this list). Namely, a real brain has only one pair of stereoscopic
retinas, which cannot store more than one pair of images. Batch learning
(i.e., learn before testing) is not scalable: without making a mistake
in an early test, a student cannot learn how to correct the mistake later.
6. The brain's Developmental Network is always optimal: each network
update in real time computes the maximum likelihood estimate of the
“brain”, conditioned on the limited computational resources and the
limited learning experience in its “life” so far. One should not use
the test set as a training set by trying many networks on the test set
and reporting only the best one.
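For principles 5 and 6, a minimal sketch of the incremental,
discard-after-use update may help. The running mean below is the
maximum-likelihood estimate of the inputs a neuron has matched so far
(under an i.i.d. Gaussian assumption); it is a simplification that omits
LCA's competition and amnesic weighting, and the class name is an
illustrative assumption.

import numpy as np

class IncrementalNeuron:
    def __init__(self, dim):
        self.w = np.zeros(dim)   # long-term memory (synaptic vector)
        self.n = 0               # number of inputs matched so far

    def update(self, x):
        # Incremental mean: w_n = ((n-1)/n) * w_{n-1} + (1/n) * x_n.
        # After each update the input can be discarded; only w and n
        # are kept, yet w equals the maximum-likelihood (batch-mean)
        # estimate of all inputs seen so far.
        self.n += 1
        self.w += (x - self.w) / self.n

neuron = IncrementalNeuron(4)
stream = np.random.default_rng(1).random((1000, 4))  # a "lifetime" stream
for x in stream:
    neuron.update(x)          # one pattern at a time, never stored
print(np.allclose(neuron.w, stream.mean(axis=0)))    # True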
The logic completeness of a brain is (partially, not fully) understood
through a Universal Turing Machine in a Developmental Network. This
emergent automaton brain model proposes that each brain is an automaton,
but one very different from all traditional symbolic automata because it
programs itself: it is emergent. No traditional Turing Machine can
program itself, but a brain Turing Machine does.
The automaton brain model has predicted that brain circuits dynamically
and precisely record the statistics of experience, roughly consistent
with neural anatomy (e.g., Felleman & Van Essen, Cerebral Cortex,
1991). In particular, the model predicted that “shifting attention
between `humans’ and `vehicles’ dramatically changes brain
representation of all categories” (J. Gallant et al. Nature
Neuroscience, 2013) and that humans “can regulate the activity
of their neurons in the medial temporal lobe” (C. Koch et al. Nature,
2010). The “place cells” work behind the 2014 Nobel Prize in Physiology
or Medicine implies that neurons encode exclusively bottom-up
information (place). The automaton brain model challenges such a view:
neurons represent a combination of bottom-up information (e.g., place)
and top-down context (e.g., goal), as reported by Koch et al. and
Gallant et al.
Unfortunately, the automaton brain model implies that all
neuroscientists and neural network researchers are unable to understand
the brains they study without rigorous training in automata theory.
For example, traditional models of nervous systems and neural networks
focus on pattern recognition and lack the capabilities of a grounded
symbol system (e.g., “rulefully combining and recombining,” Stevan
Harnad, Physica D, 1990). Automata theory deals with such capabilities.
Does this new knowledge stun our students and researchers, or guide them
so that their time is better spent?
Brain automata would enable us to see answers to a wide variety of
important questions, some of which are raised below. The automaton brain
model predicts that there is no absolute right or wrong in any brain;
rather, its environmental experiences wire and rewire the brain. We do
not provide yes/no answers here; we only raise questions.
How can our industry and public understand that the door to
understanding brains has opened for them? How can they see the
economic prospects that this opportunity leads them to?
How should our educational system reform to prepare our many bright
minds for the new brain age? Has our government been prompt in
properly responding to this modern call from nature?
How should our young generation act on the new opportunity that is
unfolding before their eyes? Is a narrowly defined academic degree, as
currently offered, sufficient for their careers?
How can everybody take advantage of the new knowledge about their own
brain so that they are more successful in their careers, including
statesmen, officials, educators, attorneys, entrepreneurs, doctors,
technicians, artists, workers, drivers, and other mental and manual
workers?
Regardless of where we are and what we do, we are all governed by the
same set of brain principles. Everybody's brain automatically programs itself.
---- end of the manifesto ----
-John
--
Juyang (John) Weng, Professor
Department of Computer Science and Engineering
MSU Cognitive Science Program and MSU Neuroscience Program
428 S Shaw Ln Rm 3115
Michigan State University
East Lansing, MI 48824 USA
Tel: 517-353-4388
Fax: 517-432-1061
Email: weng at cse.msu.edu
URL: http://www.cse.msu.edu/~weng/
----------------------------------------------