[Bmi] Deep Learning, Convolution, and Error Back-Propagation
Juyang Weng
weng at cse.msu.edu
Sun Mar 22 12:41:00 EDT 2015
Dear colleagues,
This is a discussion about well-known techniques, not specifically about
anyone's particular work. We have had many papers about neural networks,
but we have not had a sufficiently honest discussion of well-known
techniques. At least I hesitated very much to discuss such a subject,
because Profs. X, Y, Z used such techniques. This lack of honesty has
caused a great waste of resources, including time (of our professors,
researchers, postdocs, and graduate students) and money (of governments,
private foundations, and companies). Still, I am afraid that the
following paragraphs will make some well-known researchers angry. For
that reason, the discussion below identifies me (J. Weng) as one who
should be blamed for using some of the well-known techniques. I also
made mistakes. Please accept my apology.
Please reply with your comments.
---- some new paragraphs in the Brain Principles Manifesto ----
Industrial and academic interests have been keen on a combination of two
things: easily understandable tests (e.g., G. Hinton et al. NIPS 2012,
congratulations!) and the involvement of major companies (e.g., Google,
thanks!). We have read statements like “our results can be improved
simply by waiting for faster GPUs and bigger datasets to become
available” (G. Hinton et al. NIPS 2012). However, the newly discovered
brain principles tell us that the way such tests are conducted (e.g.,
ImageNet) will give only vanishing gains that do not lead to a
human-like zero error rate, regardless of how long Moore's Law continues
and how many more static images are added to the training set.
Why? All such tests use static images in which objects are mixed with
the background. Such tests therefore prevent participating groups from
seriously considering autonomous object segmentation (free of
handcrafted object models). Through synapse maintenance (Y. Wang et al.
ICBM 2012), neurons in a human brain automatically cut off inputs from
background pixels if those pixels match poorly compared with attended
object pixels. Our babies spend much more time in the dynamic physical
world than looking at static photos.
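A minimal Python sketch may make the trimming idea concrete. It is an
illustration of the mechanism described above, not the algorithm of
Y. Wang et al.; the running per-synapse match statistic and the keep
ratio are assumptions made for the example.

import numpy as np

def synapse_maintenance(weights, match_trace, keep_ratio=0.6):
    # Cut off the synapses whose inputs have matched the neuron's
    # weights poorly over time; keep only the well-matching fraction.
    n_keep = max(1, int(keep_ratio * weights.size))
    keep = np.argsort(match_trace)[::-1][:n_keep]  # best-matching first
    mask = np.zeros(weights.size, dtype=bool)
    mask[keep] = True
    return np.where(mask, weights, 0.0), mask

# Toy case: synapses 0-5 cover the attended object (high running match),
# synapses 6-9 cover background pixels (low running match).
w = np.random.default_rng(0).random(10)
trace = np.array([0.9] * 6 + [0.2] * 4)
trimmed, mask = synapse_maintenance(w, trace)
print(mask)  # the four background synapses are cut off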
Our industry should learn more powerful brain mechanisms that go
beyond conventional, well-known, well-tested techniques. The following
are some examples:
(1) Deep Learning Networks (e.g., J. Weng et al. IJCNN 1992, Y. LeCun et
al. Proceedings of the IEEE 1998, G. Hinton et al. NIPS 2012) are not
only biologically implausible but also functionally weak. The brain uses
a rich network of processing areas (e.g., Felleman & Van Essen, Cerebral
Cortex 1991) whose connections are almost always two-way (J. Weng,
Natural and Artificial Intelligence, 2012), not a cascade of modules as
in the Deep Learning Networks. Such a Deep Learning Network is not able
to conduct top-down attention in a cluttered scene (e.g., attention to
location or type in J. Weng, Natural and Artificial Intelligence, 2012,
or attention to more complex object shapes as reported in L. B. Smith et
al. Developmental Science 2005).
(2) Convolution (e.g., J. Weng et al. IJCNN 1992, Y. LeCun et al.
Proceedings of the IEEE 1998, G. Hinton et al. NIPS 2012) is not only
biologically implausible but also computationally weak. Why? All
feature neurons in the brain carry not only sensory information but also
motor information (e.g., Felleman & Van Essen, Cerebral Cortex 1991),
so that later-processing neurons become less concrete and more abstract,
which is impossible to accomplish using shift-invariant convolution.
Namely, convolution is always location-concrete (even with max-pooling)
and never location-abstract (see the first sketch after this list).
(3) Error back-propagation in neural networks (e.g., G. Hinton et al.
NIPS 2012) is not only biologically implausible (e.g., a baby does not
receive explicit error signals at its motors) but also damaging to
long-term memory, because it lacks match-based competition for error
causality (such as the competition in SOM, LISSOM, and LCA, an optimal
SOM; see the second sketch after this list). Even though the gradient
vector identifies a neuron that can reduce the current error, the
current error is not that neuron's business at all; it should keep its
own long-term memory unchanged. That is why error back-propagation is
well known to be bad for incremental learning and requires research
assistants to try many guesses of the initial weights (i.e., using the
test set as the training set!). Let us not be blinded by artificially
low error rates.
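As a concrete illustration of the location-concreteness claimed in (2),
the following Python sketch shows that convolution plus max-pooling is
shift-equivariant: when the object moves, the whole response pattern
moves with it, so position is never abstracted away by these operations
alone. The 1-D kernel, signal length, and pooling width are arbitrary
assumptions made for the demonstration.

import numpy as np

def conv1d_valid(x, k):
    # Plain 1-D "valid" convolution (correlation form, as in conv nets).
    n = x.size - k.size + 1
    return np.array([np.dot(x[i:i + k.size], k) for i in range(n)])

def maxpool1d(y, p=2):
    # Non-overlapping max-pooling with window width p.
    m = y.size // p
    return y[:m * p].reshape(m, p).max(axis=1)

kernel = np.array([1.0, -1.0])       # a toy edge-detecting kernel
x = np.zeros(16); x[4] = 1.0         # an "object" at position 4
x_shifted = np.roll(x, 6)            # the same object at position 10

print(maxpool1d(conv1d_valid(x, kernel)))
print(maxpool1d(conv1d_valid(x_shifted, kernel)))
# The same response value appears at a shifted position: the feature
# map encodes "edge at location i", never "edge somewhere"; it is
# location-concrete, not location-abstract.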
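And for the memory-overwriting point in (3), a second sketch contrasts
a match-based, winner-take-all Hebbian-style update (a simplification in
the spirit of SOM and LCA, not the published LCA algorithm) with one
error back-propagation step on the same input. The gradient step moves
every neuron's weights; the match-based step moves only the winner's.

import numpy as np

rng = np.random.default_rng(0)
W = rng.random((5, 8))                      # 5 feature neurons, 8-D input
W /= np.linalg.norm(W, axis=1, keepdims=True)
x = rng.random(8); x /= np.linalg.norm(x)
lr = 0.1

# Match-based competition: only the best-matching neuron fires and
# updates; every other neuron's long-term memory stays untouched.
winner = int(np.argmax(W @ x))              # inner-product match
W_match = W.copy()
W_match[winner] += lr * (x - W_match[winner])

# Error back-propagation flavor: one gradient step on the squared error
# 0.5*||W x - t||^2 moves *all* weights to serve the current error,
# whether or not a neuron "owns" that memory.
t = np.zeros(5); t[2] = 1.0                 # an arbitrary target
err = W @ x - t
W_grad = W - lr * np.outer(err, x)

print("neurons changed (match-based):",
      np.flatnonzero((W_match != W).any(axis=1)))   # only the winner
print("neurons changed (gradient)   :",
      np.flatnonzero((W_grad != W).any(axis=1)))    # all five neurons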
Do our industry and public need another 20 years?
---- end of the new paragraphs -----
Full text:
The Brain Principles Manifesto
(Draft Version 4.5)
March 21, 2015
Historically, public acceptance of science was slow. For example,
Charles Darwin waited about 20 years (from the 1830s to 1858) to publish
his theory of evolution for fear of public reaction. About 20 years
later (by the 1870s) the scientific community and much of the general
public had accepted evolution as a fact. Of course, the debate on
evolution still goes on today.
Is the public acceptance of science faster in modern days? Not
necessarily, even though we now have better and faster means of
communication. The primary reason is still the same but much more
severe: the remaining open scientific problems are more complex, and the
required knowledge goes beyond that of any single person.
For instance, network-like brain computation, that is, connectionist
computation (e.g., J. McClelland and D. Rumelhart, Parallel Distributed
Processing, 1986), has long been doubted and ignored by industry. Deep
convolutional networks appeared by at least 1980 (K. Fukushima). The
max-pooling technique for deep convolutional networks was published by
1992 (J. Weng et al.). However, Apple, Baidu, Google, Microsoft,
Samsung, and other major related companies did not show considerable
interest until after 2012. That is a delay of about 20 years. The two
techniques above are not very difficult to understand. However, these
two suddenly hot techniques have already been rendered obsolete by the
discovery of more fundamental and effective principles of the brain,
six of which are intuitively explained below.
[The new paragraphs quoted above appear here in the full text.]
On the other hand, neuroscience and neuropsychology have made many
advances by providing experimental data (e.g., Felleman & Van Essen,
Cerebral Cortex 1991). However, it has been well recognized that these
disciplines are data-rich and theory-poor. The phenomena of brain
circuits and brain behavior are extremely rich. Many researchers in
these areas use only local tools (e.g., attractors that can only fall
into local extrema) and consequently have been overwhelmed by the
richness of brain phenomena. A fundamental reason is that they miss
the guidance of the global automata theory of computer science,
although traditional automata do not emerge. For example, X.-J. Wang et
al. Nature 2013 stated correctly that neurons of mixed selectivity were
rarely analyzed but have been widely observed. However, mixed
selectivity has already been well explained, as a special case, by the
new Emergent Turing Machine in Developmental Networks in a theoretically
complete way. The traditional Universal Turing Machine is a theoretical
model for modern-day computers (how computers work), but it does not
emerge. The mixed selectivity of neurons in such a new kind of Turing
Machine is caused by emergent and beautiful brain circuits, but each
neuron still uses a simple inner-product similarity in its
high-dimensional and dynamic input space (see the sketch below).
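Here is a small sketch of such an inner-product match, with the
bottom-up (sensory) part and the top-down (motor/context) part of the
input concatenated as the text describes. The function name and the
dimensions are illustrative assumptions, not the Developmental Network's
exact area function.

import numpy as np

def neuron_preresponse(x_bottom_up, z_top_down, w):
    # Pre-response: the inner product between the neuron's normalized
    # input and its normalized weight vector.  Because the input
    # concatenates a bottom-up part and a top-down part, the match
    # depends on both at once, one simple way "mixed selectivity" arises.
    v = np.concatenate([x_bottom_up, z_top_down])
    return float(np.dot(v / np.linalg.norm(v), w / np.linalg.norm(w)))

rng = np.random.default_rng(0)
w = rng.random(12)                 # learned weights: 8 sensory + 4 motor
x = rng.random(8)                  # one sensory input
print(neuron_preresponse(x, np.array([1.0, 0, 0, 0]), w))  # goal A
print(neuron_preresponse(x, np.array([0, 1.0, 0, 0]), w))  # goal B:
# same sensory input, different top-down context, different response.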
In October 2011, a highly respected multi-disciplinary professor kindly
wrote: “I tell these students that they can work on brains and do good
science, or work on robots and do good engineering. But if they try to
do both at once, the result will be neither good science nor good
engineering.” How long will it take for industry and the public to
accept that this pessimistic view of the brain was no longer true even then?
The brain principles that have already been discovered could bring
fundamental changes in the way humans live, the way countries and
societies are organized, our industry, our economy, and the way humans
treat one another.
The known brain principles tell us that anybody's brain, regardless of
education and experience, is fundamentally shortsighted, in both space
and time. Prof. Jonathan Haidt documented such shortsightedness well in
his book “The Righteous Mind: Why Good People Are Divided by Politics
and Religion”, although not in terms of brain computation.
In terms of brain computation, the circuits in your brain self-wire
beautifully and precisely according to your real-time experience (the
genome only regulates), and the various invariance properties they
require for abstraction also largely depend on experience. Serotonin
(triggered by, e.g., threats), dopamine (triggered by, e.g., praise),
and other neurotransmitters quickly bias these circuits so that neurons
for more long-term thoughts lose the competition to fire. Furthermore,
such bias has a long-term effect. Therefore, you make long-term mistakes
but still feel you are right. Everybody is like that. Depending on
experience, shortsightedness varies in subject matter.
Traditionally, many domain experts think that computers and brains use
very different principles. The naturally emergent Turing Machine in
Developmental Networks, which has been mathematically proved (see J.
Weng, Brain as an Emergent Finite Automaton: A Theory and Three
Theorems, IJIS, 2015), should change our intuition.
The new result proposes the following six brain principles:
1. The developmental program (genome-like, task-nonspecific)
regulates the development (i.e., lifetime learning) of a
task-nonspecific “brain-like” network, the Developmental Network. The
Developmental Network is general-purpose: in principle, it can learn any
task the body is capable of, not only pattern recognition.
2. The brain's images are naturally sensed images of cluttered scenes
where many objects mix. In typical machine training (e.g., Krizhevsky
et al. NIPS 2012), each training image has a bounding box drawn around
each object to be learned, which is not the case for a human baby.
Neurons in the Developmental Network automatically learn object
segmentation through synapse maintenance (sketched above).
3. The brain's muscle areas have multiple subareas, where each subarea
represents either declarative knowledge (e.g., abstract concepts such as
location, type, scale, etc.) or non-declarative knowledge (e.g., driving
a car or riding a bicycle), not just discrete class labels in a global
classification.
4. Each brain in the physical world is at least a Super Turing
Machine in a Developmental Network. Every area in the network emerges
(it does not statically exist; see M. Sur et al. Nature 2000 and P.
Voss, Frontiers in Psychology 2013) using a unified area function whose
feature development is nonlinear but free of local minima, contrary to
engineering intuition: not convolution, not error back-propagation.
5. The brain's Developmental Network learns incrementally, taking
one pair of sensory pattern and motor pattern at a time to update the
“brain” and discarding the pair immediately after (see the sketch after
this list). Namely, a real brain has only one pair of stereoscopic
retinas, which cannot store more than one pair of images. Batch learning
(i.e., learn before testing) is not scalable: without making a mistake
in an early test, a student cannot learn how to correct the mistake later.
6. The brain's Developmental Network is always optimal: each network
update in real time computes the maximum likelihood estimate of the
“brain”, conditioned on the limited computational resources and the
limited learning experience in its “life” so far. One should not use
the test set as a training set by trying many networks on the test set
and reporting only the best one.
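For principles 5 and 6, a minimal sketch of the incremental,
discard-after-use update may help. The running mean below is the
maximum-likelihood estimate of the inputs a neuron has matched so far
(under an i.i.d. Gaussian assumption); it is a simplification that omits
LCA's competition and amnesic weighting, and the class name is an
illustrative assumption.

import numpy as np

class IncrementalNeuron:
    def __init__(self, dim):
        self.w = np.zeros(dim)   # long-term memory (synaptic vector)
        self.n = 0               # number of inputs matched so far

    def update(self, x):
        # Incremental mean: w_n = ((n-1)/n) * w_{n-1} + (1/n) * x_n.
        # After each update the input can be discarded; only w and n
        # are kept, yet w equals the maximum-likelihood (batch-mean)
        # estimate of all inputs seen so far.
        self.n += 1
        self.w += (x - self.w) / self.n

neuron = IncrementalNeuron(4)
stream = np.random.default_rng(1).random((1000, 4))  # a "lifetime" stream
for x in stream:
    neuron.update(x)          # one pattern at a time, never stored
print(np.allclose(neuron.w, stream.mean(axis=0)))    # True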
The logic completeness of a brain is (partially, not fully) understood
through a Universal Turing Machine in a Developmental Network. This
emergent automaton brain model proposes that each brain is an automaton,
but one very different from all traditional symbolic automata because it
programs itself: it is emergent. No traditional Turing Machine can
program itself, but a brain Turing Machine does.
The automaton brain model has predicted that brain circuits dynamically
and precisely record the statistics of experience, roughly consistent
with neural anatomy (e.g., Felleman & Van Essen, Cerebral Cortex,
1991). In particular, the model predicted that “shifting attention
between `humans’ and `vehicles’ dramatically changes brain
representation of all categories” (J. Gallant et al. Nature
Neuroscience, 2013) and that humans “can regulate the activity
of their neurons in the medial temporal lobe” (C. Koch et al. Nature,
2010). The “place cells” work behind the 2014 Nobel Prize in Physiology
or Medicine implies that neurons encode exclusively bottom-up
information (place). The automaton brain model challenges such a view:
neurons represent a combination of bottom-up information (e.g., place)
and top-down context (e.g., goal), as reported by Koch et al. and
Gallant et al.
Unfortunately, the automaton brain model implies that all
neuroscientists and neural network researchers are unable to understand
the brains they study without rigorous training in automata theory.
For example, traditional models of nervous systems and neural networks
focus on pattern recognition and lack the capabilities of a grounded
symbol system (e.g., “rulefully combining and recombining,” Stevan
Harnad, Physica D, 1990). Automata theory deals with such capabilities.
Does this new knowledge stun our students and researchers, or guide them
so that their time is better spent?
Brain automata would enable us to see answers to a wide variety of
important questions, some of which are raised below. The automaton brain
model predicts that there is no absolute right or wrong in any brain;
rather, its environmental experiences wire and rewire the brain. We do
not provide yes/no answers here; we only raise questions.
How can our industry and public understand that the door to
understanding brains has opened for them? How can they see the
economic prospects that this opportunity leads them to?
How should our educational system reform to prepare our many bright
minds for the new brain age? Has our government been prompt in
properly responding to this modern call from nature?
How should our young generation act on the new opportunity that is
unfolding before their eyes? Is a narrowly defined academic degree, as
currently offered, sufficient for their careers?
How can everybody take advantage of the new knowledge about their own
brain so that they are more successful in their careers, including
statesmen, officials, educators, attorneys, entrepreneurs, doctors,
technicians, artists, workers, drivers, and other mental and manual
workers?
Regardless of where we are and what we do, we are all governed by the
same set of brain principles. Everybody's brain automatically programs itself.
---- end of the manifesto ----
-John
--
Juyang (John) Weng, Professor
Department of Computer Science and Engineering
MSU Cognitive Science Program and MSU Neuroscience Program
428 S Shaw Ln Rm 3115
Michigan State University
East Lansing, MI 48824 USA
Tel: 517-353-4388
Fax: 517-432-1061
Email: weng at cse.msu.edu
URL: http://www.cse.msu.edu/~weng/
----------------------------------------------