Intro Mind Notes, Week 8: Vision
(HMW, Ch. 4, especially pp. 211-214, 242-261, and 268-284)
A. Why the Problem of Vision is Hard
- People think vision is easy. All that has to be done, they think, is
for the nervous system to get a picture to the brain. Then the brain just
sees what is going on. But that idea buys into the fallacy of the homunculus.
Images from the external world are projected (upside down) onto the
retina at the back of the eyeball. This retinal image can be considered
to be a vector or array of values, one value for each rod and cone on the
retina. This representation is a far cry from what cognition needs to navigate
the world. To be of use, the brain needs a representation of a three-dimensional
world filled with objects.
- The representation cognition needs would allow us to distinguish one
object from another, to appreciate their positions, motions, sizes, shapes,
and textures, despite the fact that lighting in the environment is variable,
that we ourselves may be moving, and that we must recognize objects from
many different points of view.
- The problem of vision is to explain the mechanism that transforms the
retinal array into an object-level representation that can be stored in
memory and processed by other cognitive systems. Pinker assumes this representation
is symbolic, that is, written in mentalese.
- The problem of recovering the three-dimensional scene from retinal
arrays is not solvable in general: many different 3D scenes can project
the same retinal image. This is why the eye can be fooled by illusions.
But the vision system manages to do a reasonably good job nonetheless by
depending on basic assumptions about the world of 3D objects. For example,
very few objects can increase or decrease their sizes like balloons. Since
so few objects that humans encounter do this, the visual system has come
to depend on the assumption that objects stay pretty much the same size.
This allows it to predict that if an object's image becomes larger in the
visual field, the reason is that it is moving closer, and not actually
increasing in size.
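The size-constancy inference can be sketched in a few lines of Python (our illustration, not Pinker's; the sizes and distances are made-up numbers):

```python
# Illustrative sketch of the size-constancy assumption: image size on
# the retina is proportional to object size divided by distance, so if
# the object's real size is assumed fixed, a growing image implies a
# shrinking distance rather than a growing object.

def inferred_distance(assumed_object_size, image_size):
    """Invert the projection: distance = object size / image size."""
    return assumed_object_size / image_size

# An object assumed to be 2 units tall whose image doubles in size
# is inferred to have halved its distance, not doubled its size.
d1 = inferred_distance(2.0, 0.01)   # image size 0.01 -> distance 200
d2 = inferred_distance(2.0, 0.02)   # image doubles   -> distance 100
print(d1, d2)  # prints 200.0 100.0
```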
B. The Nature of Low-Level Visual Processing
- A fundamental question about vision is the extent to which higher cognitive
processes such as goals, expectations, attention, reasoning, and conceptual
structures, influence the transformation from retinal to cognitive level
representations. Although these so-called top-down factors are clearly
important, a lot of visual processing can be safely studied from the bottom
up, leaving these top-down considerations aside.
- Anatomical study of the visual system gives important clues as to how
it works. Visual processing is massively parallel and local. Local means
that processing at one point of the image depends for the most part only
on activity of nearby points.
- We also know that visual processing is modular. There are different
maps in the visual system, each specially designed to solve a different part
of the problem. Examples: edge detection, motion, depth. Each of these
may be further subdivided. For example, for depth we have
several different systems based on different sources of depth information:
stereopsis, the differences between the images in the two eyes; motion,
since an object's image grows or shrinks as the object advances or recedes; and
overlap, where depth is indicated by whether one object blocks the view
of another.
C. David Marr's Theory of Vision
- David Marr, in his book Vision, presented in the early 1980s one of
the most influential theories of vision. His theory has been an inspiration
for the computational theory of mind. His main idea is that the function
of the visual system is to convert images projected onto our retina into
representations of the world written in mentalese. The process starts with
the retinal images from two eyes and proceeds through a number of different
levels of representation.
- Grey-Level Representation. This is just the raw output from
the rods and cones on the retina. It is a vector of values indicating the
activity of each retinal neuron.
- Zero Crossing Map. On Marr's theory, the first job the vision
system has to do is detect edges or object boundaries in the grey level
representation. Usually an object boundary corresponds to a quick change
in intensity of light. Intensity changes can be computed by taking differences
between adjoining pixels in the image. (For math mavens, this is the first
derivative.) If we look at changes in those changes (the second derivative),
points where intensity changed most sharply in the original will correspond
to points where the new image crosses zero (zero crossings). The new image will
look something like an outline version of the old one. Since neurons calculate
weighted sums, and weights can be negative, arrays of neurons can easily
compute differences in activity between neighboring neurons. Marr worked
out neural nets that can compute the zero-crossing image, and so find
some of the information the visual system needs to locate object boundaries.
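A one-dimensional toy version of the zero-crossing computation (our simplified sketch, not Marr's neural model):

```python
# Minimal 1-D sketch of zero-crossing edge detection. An "edge" is a
# steep intensity change; the second difference crosses zero exactly
# where the first difference peaks, i.e. at the edge.

intensity = [10, 10, 10, 70, 70, 70]  # one sharp dark-to-light edge

# First differences (the "first derivative" between adjoining pixels).
first = [intensity[i + 1] - intensity[i] for i in range(len(intensity) - 1)]

# Second differences (changes in the changes).
second = [first[i + 1] - first[i] for i in range(len(first) - 1)]

# A zero crossing is a sign change in the second difference; here it
# lands at the boundary between the dark and light pixels.
crossings = [i for i in range(len(second) - 1)
             if second[i] > 0 > second[i + 1] or second[i] < 0 < second[i + 1]]
print(crossings)  # prints [1]
```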
- Primal Sketch. Knowing where edges are is not enough. The orientation
of these edges must be computed, and junction points between edges such
as Ts and Ls must be found, for these provide important clues about overlap.
The primal sketch represents these important features along with the edges.
- 2.5 D Sketch. But the primal sketch is not enough, because edges
can arise from changes in lighting due to the angle at which we see a surface.
The problem is compounded by the fact that lighting need not be uniform
across the surface, and by the fact that the surface of an object can be
"painted" in different colors. (See the diagram on p. 242.) In
pages 242-255 Pinker does a beautiful job of explaining how separate systems
(demons) computing how light coming to the eye is affected by these factors
can cooperate to compute the nature of the object in the real world. The
2.5 D sketch keeps track of all information about edges in the visual field,
distinguishing between changes in color on the surface of the object, changes
due to the angle of the surface, and changes due to lighting. (See
the diagram on p. 260).
- Frame Neutral Sketch. But the 2.5 D Sketch is not enough. Our
bodies and eyes are constantly on the move, so the image on our eyeballs
is constantly changing. Eye movements (saccades) constantly flit
from one spot to another in the scene and are essential to effective vision.
The detection of the motion of the visual scene across the retinal array
must be suppressed during a saccade. Think how much jitter there would
be if you were making a video and you were to move your camera the way
your eyes move. The brain needs to distinguish what is moving in the real
world from the changes in the images it receives that are due to body and
eye motion. The frame neutral sketch represents only the changes in the
world, separating those from changes due to eye and body motion.
- 3D Sketch. The 3D sketch is the final representation computed
by the visual system. It provides a three dimensional representation of
the objects in the world, allowing us to recognize what those objects are
even though they look very different from different points of view. On
Marr's theory this is done by representing the object as a nested set of
more and more complex cylinders: Human: Head, Body; Body: Trunk, Arms, Legs;
Arm: Upper-Arm, Forearm, Hand; Hand: Palm, Fingers.
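The nested-cylinder idea can be sketched as a simple part hierarchy (the dictionary format is our illustration; the part names follow Marr's example):

```python
# A sketch of Marr's part hierarchy as a nested data structure.
# Each part contains its subparts, down to the smallest components.

human = {
    "Human": {
        "Head": {},
        "Body": {
            "Trunk": {},
            "Arms": {
                "Upper-Arm": {},
                "Forearm": {},
                "Hand": {"Palm": {}, "Fingers": {}},
            },
            "Legs": {},
        },
    }
}

def parts(model):
    """List every named part, at any level of the hierarchy."""
    names = []
    for name, subparts in model.items():
        names.append(name)
        names.extend(parts(subparts))
    return names

print(parts(human))
```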
D. Extensions of Marr's Theory
- On Biederman's theory, a more complex set of basic objects (called
geons) is used for object recognition: cones, cylinders, cubical
shapes, and distortions of these that result from changing the length to
width ratio and the shape of the center line. (See the picture on p. 270.)
Biederman believes that representations of objects we can identify are
stored in the form of something like sentences, listing the components
that form the object along with their attachment points. In short, the
recognition of objects is like the recognition of a sentence, for the object
is composed of geons the way a sentence is composed of words. But how are
the components recognized? By the boundaries between them, which are typically
concave and rather sharply sloped inwards. (Consider the "joints"
of the Michelin man, for an exaggerated version of the idea.)
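The idea that an object is stored as a "sentence" listing geons plus attachment points can be sketched like this (the data format is our illustration; the mug/bucket contrast, where the same two geons attached at different points describe different objects, is Biederman's own example):

```python
# A sketch of a Biederman-style structural description: an object is
# stored as a list of geons together with how they attach.

mug = [
    ("cylinder", "body"),
    ("curved-cylinder", "handle", {"attached-to": "body", "at": "side"}),
]

bucket = [
    ("cylinder", "body"),
    ("curved-cylinder", "handle", {"attached-to": "body", "at": "top"}),
]

# The same two geons with different attachment points describe
# different objects, just as the same words in a different order
# make a different sentence.
print(mug == bucket)  # prints False
```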
- But geons can't entirely explain our ability to recognize objects from
many different viewpoints. To do that, it would seem that the brain would
have to store a representation of each object as seen from every
possible viewpoint. True, the up-down axis is used as a major default
assumption about how objects are aligned, and this may simplify the process
of object identification. Violations of this alignment cause errors in
identification (as NASA designers know well). However, we can still identify
objects when they are upside down or sideways.
- There is good evidence that the human visual system also has an ability
to mentally rotate 3-D representations to help in object identification.
This would vastly reduce the number of representations needed for an object
to be recognized. Pinker discusses some of his own work on this topic on
pp. 279-284.
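A toy sketch of rotate-to-match recognition (our illustration; the step count stands in for reaction time, which in the experiments grows with the angle of misalignment):

```python
import math

# Hypothetical sketch of mental-rotation matching: rotate one shape
# step by step until it lines up with the other, counting the steps.

def rotate(points, degrees):
    """Rotate a list of (x, y) points about the origin."""
    r = math.radians(degrees)
    return [(round(x * math.cos(r) - y * math.sin(r), 6),
             round(x * math.sin(r) + y * math.cos(r), 6))
            for x, y in points]

def steps_to_match(shape, target, step=10):
    """Rotate `shape` in `step`-degree increments until it matches `target`."""
    for n in range(360 // step):
        if rotate(shape, n * step) == target:
            return n
    return None

shape = [(1.0, 0.0), (0.0, 2.0)]
target = rotate(shape, 90)            # the same shape, 90 degrees away
print(steps_to_match(shape, target))  # prints 9: more misalignment, more steps
```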
E. Face Recognition
- Geon theory cannot be the whole story for object recognition. It is
likely that we have other visual systems for recognizing natural objects
like trees and mountains which cannot easily be represented as combinations
of geons.
- One case where we have clear evidence of this is in the recognition
of faces. The evidence for a special module for face recognition comes
from brain injury patients who are (pretty much) normal in all other visual
recognition tasks but who simply cannot recognize faces. Other patients
can recognize faces but lack the ability to recognize other objects.
F. Treisman's Theory of Attention (See HMW pp. 140-142)
- Treisman's thesis is that there are basic visual processes that are
computed in parallel that feed information to higher level processes responsible
for binding features together. This second stage is carried out serially
by an attention mechanism.
- We can develop evidence for this theory by presenting images with target
shapes surrounded by distractors. If we measure the reaction time for identifying
the targets and discover that it is fast and does not depend on the number
of distractors, then we assume it is a basic parallel process. If the reaction
time grows with the number of distractors then we assume the process is
serial and involves attending to one thing after another in the scene.
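A toy model of the predicted reaction times (our illustrative numbers, not Treisman's data):

```python
# Parallel "pop-out" search is flat in the number of distractors,
# while serial search adds a fixed cost per item inspected.

def reaction_time(n_distractors, serial, base=400, per_item=50):
    """Predicted reaction time in ms for a display with n distractors."""
    if serial:
        # Attention visits items one by one; on average half the
        # items are checked before the target is found.
        return base + per_item * (n_distractors + 1) / 2
    return base  # parallel: all items are processed at once

# Pop-out search stays flat; serial search grows with set size.
print([reaction_time(n, serial=False) for n in (4, 16, 32)])  # [400, 400, 400]
print([reaction_time(n, serial=True) for n in (4, 16, 32)])   # [525.0, 825.0, 1225.0]
```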
- For example, the letters L and T have the same elements in the same
orientation, and differ only in how the elements are conjoined to each
other. Recognition of these targets depends on attention. The differences
between them do not simply "pop out". However, if you
examine a field of |s and /s, where the only difference is the orientation,
the difference is immediately and easily apparent.
- Basic features include orientation, brightness, and curvature. A discrimination
that requires conjoining features (white triangles and black squares, vs.
black triangles and white squares) is extremely difficult
and takes tedious one-by-one inspection.
G. Top-Down vs. Bottom-Up
- Marr and many other researchers have tried to create theories of vision
where the processing from retina to brain does not require higher-level
information to identify the object. (For example, concepts like animals
have 4 legs, or that the sky is above us and is blue or grey, etc.)
- Clearly there are instances where higher level information is required
to resolve the ambiguities in a scene. For example, the same shape can
be read upright as the letter N and rotated on its side as the letter Z.
- But to what extent does vision rely on top-down processing? Consider
the Kanizsa Triangle. (It is in HMW, p. 259.) Here we see an image
of a triangle hovering above the scene, but there is no luminance difference
on the two sides of its edges to allow us to pick out the boundary. Why
do we perceive the edge? Perhaps conceptual knowledge about how objects
occlude one another helps us. However, there is some evidence
that this phenomenon is very low level. For example, we have evidence from
monkey studies that the "edges" are already processed early in
visual processing. So maybe a bottom-up explanation of the Kanizsa triangle
effect is more likely.
H. Imagination and Imagery
- Some cognitive scientists have championed the view that imagination
is a separate cognitive ability that provides an alternative to the symbolic
processing story. Brains might contain a special graphic processor along
with (or instead of) a symbolic processor. What advantages would this graphic
processor bring?
- One important idea is that visual imagery carries much more information
than symbolic representations can. A picture is worth a thousand
words. Consulting an image makes things obvious that we would otherwise
have to think out. There are a number of skills such as finding things,
planning errands, trying out ways of building things such as bridges, explaining
continental drift etc., where an ability to imagine the various objects,
actions and likely outcomes is extremely helpful. We can literally see
in our mind's eye the things we should avoid doing when we imagine a course
of events. Imagination gives us foresight. It also allows us to adapt ahead
of time. For example, just by imagining a task, the athlete can train herself
to improve.
- There is excellent evidence that language understanding and reasoning
are based on metaphors which are in turn founded on visual imagery. For
example, top and up mean better or stronger (top of his game), while down
and bottom mean worse or weaker (in the pits). If I imagine A is to the left of
B and B to the left of C, I instantly "see" that A is to the
left of C.
- There is also evidence that imagination of mental pictures rather than
symbolic representations is crucial to certain cognitive abilities. Kosslyn
had people imagine moving attention from one point to another on a map.
The time to move attention was proportional to the distance on the map
suggesting that attention actually "moves" from point to point
in an imagined space. In another experiment, subjects asked whether two
images matched apparently used a mental rotation technique to solve the
task, for the time it took for solution depended on the angle through which
the image would have to be rotated to align the two.
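The scanning result can be sketched as a constant-speed model (our illustrative numbers, not Kosslyn's data):

```python
import math

# If attention moves across the mental image at a constant speed, the
# time to shift between two map locations is proportional to their
# distance, which is what Kosslyn's subjects showed.

def scan_time(p1, p2, speed=2.0):
    """Time to scan from p1 to p2 at `speed` map-units per second."""
    distance = math.dist(p1, p2)
    return distance / speed

# Doubling the map distance doubles the predicted scanning time.
print(scan_time((0, 0), (3, 4)))   # distance 5  -> 2.5 s
print(scan_time((0, 0), (6, 8)))   # distance 10 -> 5.0 s
```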
- Experiments with PET and rCBF scanning show that visual imagination
and other cognitive tasks differ in that the former involve activation
of visual areas of the brain. Research with brain-damaged patients has
shown that brain damage can selectively impair abilities at mental rotation.