Bayesian model of dynamic image stabilization in the visual system
source PNAS
author
Yoram Burak, Uri Rokni, Markus Meister, Haim Sompolinsky
purpose
To propose a model that explains how the human visual system can stabilize a drifting retinal image during fixation.
pipeline
stage 1 As the paper argues, when performing high-acuity visual tasks, the brain must take the drift of the image across the retina into account.
stage 2 Derive a decoding strategy to interpret the spikes emitted by the retina. Difficulty: the number of possible stimuli is exponentially large.
stage 3 A network implementation involves two populations of cells: one tracks the position of the image, the other represents a stabilized estimate of the image itself.
the spikes from the retina are dynamically routed to the two populations and are interpreted in a probabilistic manner.
The paper also considers the architecture of the neural circuitry and evaluates the decoder's performance under the measured statistics of human fixational eye motion.
prediction In high-acuity tasks, fixed features within the visual scene are beneficial because they provide information about the drifting position of the image.
preceding work
the brain infers its surroundings by forming hypotheses
when the input is noisy, the interpretation becomes ambiguous, and those hypotheses compete.
How could the brain estimate the image of a 2D scene from retinal spike trains?
sources of ambiguity: 1. noise in neural circuitry 2. random movements of the eye that lead to image jitter on the retina
An ideal Bayesian decoder in the brain would take both into account, but the number of variables that must be considered is too large.
Prior work on Bayesian inference focused on simplified conditions in which the subject estimates only a single, typically static, sensory variable.
model
simulated cells (retina) The researchers model the fovea as a homogeneous array of retinal ganglion cells of a single type, arranged on a rectangular grid.
input image The image consists of black-and-white pixels on the same grid, whose intensities are drawn independently from a binary distribution.
firing process (output) The firing of each cell is an inhomogeneous Poisson process whose rate depends on the image pixel in its receptive field
a simple version pixel on → cell firing rate = \(\lambda_1\); pixel off → cell firing rate = \(\lambda_0\)
a realistic version later, the firing rate depends on the past light intensity within the retina's integration time.
about fixational movements the fixational movements of the image over the retina are modeled as a discrete random walk
task visualization
Use the letter 'E' as input and apply fixational movements to the image. Fig. 1C shows the spikes generated by the model retina, with \(\lambda_0 = 10\,\text{Hz}\), \(\lambda_1 = 100\,\text{Hz}\). [figure]
Indeed, images of a Snellen letter derived from simple spike accumulation in each pixel look almost random; without some knowledge of the image trajectory, such a reconstruction is impossible.
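To make the generative model concrete, here is a minimal sketch in Python (grid size, pixel size, time step, and periodic boundaries are my assumptions; \(\lambda_0\), \(\lambda_1\), and \(D\) follow the values in these notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameters: lam0/lam1 and D follow the notes; the rest are assumptions.
N = 20          # image / retina grid is N x N pixels
a = 0.5         # pixel size (arcmin)
dt = 1e-4       # time step (s)
T = 0.3         # total duration (s)
lam0, lam1 = 10.0, 100.0   # firing rate (Hz) for pixel off / on
D = 100.0       # diffusion coefficient of fixational drift (arcmin^2/s)

image = rng.integers(0, 2, size=(N, N))   # random binary image

# Discrete random walk matched to D: per axis and step, a +/-1 pixel move
# with probability p_step gives variance p_step * a^2 = 2 * D * dt.
p_step = 2 * D * dt / a**2

x = np.zeros(2, dtype=int)   # current image displacement (pixels)
spikes = []                  # (time, cell) pairs

for step in range(int(T / dt)):
    t = step * dt
    for ax in range(2):
        if rng.random() < p_step:
            x[ax] += rng.choice((-1, 1))
    # image as seen on the retina (periodic shift for simplicity)
    seen = np.roll(np.roll(image, x[0], axis=0), x[1], axis=1)
    rates = np.where(seen == 1, lam1, lam0)
    # inhomogeneous Poisson spiking: P(spike in dt) ~= rate * dt
    fired = rng.random((N, N)) < rates * dt
    for cell in zip(*np.where(fired)):
        spikes.append((t, cell))

# Naive spike accumulation per retinal cell: blurred by the drift,
# which is why this kind of reconstruction looks almost random.
counts = np.zeros((N, N))
for _, (i, j) in spikes:
    counts[i, j] += 1
```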
math theory -- factorized Bayesian decoder
definition
- \(s\): a probabilistic estimate of the image
- \(x\): retinal position
- \(p(x,t)\): probability distribution over positions
- \(p_i(s_i, t)\): probability distributions for individual pixels in the stabilized coordinates of the image
- \(s_i\): whether pixel \(i\) is on or off (1/0)
- \(m_i(t) = p_i(1,t) = 1 - p_i(0,t)\)
- \(D \simeq 100\ \text{arcmin}^2/\text{s}\): diffusion coefficient of fixational drift (a measured human value)
The factorized Bayesian estimate takes the form \(p(s, x, t) \approx p(x, t) \prod_i p_i(s_i, t)\).
This form ignores any correlations between the values of different pixels or between the image and its position.
Meaning that each pixel is treated separately, and that the relation between the position and the content of the picture is ignored.
update between spikes
Between spikes the decoder evolves piecewise: \(p(x,t)\) spreads by diffusion with coefficient \(D\), and the pixel estimates \(m_i(t)\) decay. Furthermore:
- \(m_i(t)\) decays toward zero in the absence of spikes, with a rate proportional to \(\Delta\lambda = \lambda_1 - \lambda_0\)
- if \(m_i\) is either 0 or 1, such that the decoder is certain about the value of pixel \(i\), \(m_i\) remains unchanged
update due to a spike Suppose that at time \(t\), ganglion cell \(k\) fires a spike.
→ The change in \(m_i\) is proportional to the estimated probability that the image is at position \(k-i\).
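A sketch of the two update rules as I reconstruct them from the description above; the exact no-spike coupling between \(m\) and \(p(x)\), the periodic boundaries, and the normalization are assumptions, not the paper's verbatim equations:

```python
import numpy as np

def flip_shift(arr, k):
    """Return q with q[i] = arr[(k - i) mod N] on an N x N periodic grid."""
    N = arr.shape[0]
    return np.roll(np.roll(arr[::-1, ::-1], (k[0] + 1) % N, axis=0),
                   (k[1] + 1) % N, axis=1)

def between_spikes(m, p, dt, D, a, dlam):
    """Decoder dynamics in the absence of spikes.
    m: (N, N) per-pixel on-probabilities m_i; p: (N, N) position distribution.
    """
    # m_i decays toward 0 at a rate proportional to dlam = lambda_1 - lambda_0;
    # the m*(1-m) factor keeps certain pixels (m = 0 or 1) unchanged.
    m = m - dt * dlam * m * (1.0 - m)
    # p(x) spreads by diffusion with coefficient D (discrete Laplacian);
    # any no-spike evidence term on p is omitted in this sketch.
    lap = (np.roll(p, 1, 0) + np.roll(p, -1, 0) +
           np.roll(p, 1, 1) + np.roll(p, -1, 1) - 4.0 * p)
    p = p + dt * (D / a**2) * lap
    return m, p / p.sum()

def on_spike(m, p, k, lam0, lam1):
    """Update when ganglion cell k = (k0, k1) fires a spike."""
    dlam = lam1 - lam0
    # q[i] = p(x = k - i): probability that cell k currently views pixel i.
    q = flip_shift(p, k)
    # Expected firing rate of cell k under the current estimate.
    lam_bar = lam0 + dlam * np.sum(q * m)
    # Pixel update: the change in m_i is proportional to q[i] = p(k - i).
    m = m + m * (1.0 - m) * dlam * q / lam_bar
    # Position update: reweight p(x) by the spike likelihood lam0 + dlam*m[k-x].
    p = p * (lam0 + dlam * flip_shift(m, k))
    return m, p / p.sum()
```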
network implementation
RGC: retinal ganglion cell. This result suggests a network architecture with two divergent projections from retinal ganglion cells to the what cells and the where cells, along with reciprocal recurrent connections between both of these populations. [figure 1-D]
performance
Comparison between the factorized decoder and a static decoder.
The response of the factorized decoder to a sample stimulus is illustrated in [Figure 2-A]
The estimate of the image itself, represented by activity in the what population, gradually improves with time. In this example almost all of the pixels are estimated correctly by 300 ms.
[Figure 2-B]
When tested with many random images, the factorized decoder routinely reconstructed 90% of the pixels correctly in just 100 ms (Fig. 2B). By comparison, a static decoder that ignores eye movements and simply accumulates spikes performed very poorly: Shortly after stimulus onset it reached a maximum of nearly 60% correctly estimated pixels, but then the blurring from retinal motion took its toll.
trend
Performance improves with slower eye movements, higher firing rates, and larger image size
When D is small, the decoder easily tracks the position of the image, and performance is limited only by the stochasticity of the ganglion cell response. As D increases, the performance degrades due to uncertainty about the position (Fig. 3A). The convergence time increases sharply above a critical value of D. This value is proportional to the RGC firing rates, as can be deduced from dimensional analysis. With a larger image, more information is available about the trajectory, and the decoder’s performance improves markedly (Fig. 3B). Further analysis shows that increasing the number of pixels by a factor f acts roughly like a reduction of D by a factor \(\sqrt{f}\). This sensitivity to image size should be observable in psychophysical experiments.
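A back-of-envelope version of that dimensional analysis, with \(a\) the pixel size (the dimensionless prefactor, and whether \(\lambda_1\) or \(\Delta\lambda\) is the relevant rate, are not fixed by this argument):
\[
[D] = \frac{\text{arcmin}^2}{\text{s}}, \quad [\lambda] = \frac{1}{\text{s}}, \quad [a] = \text{arcmin} \;\Rightarrow\; D_c \sim \lambda\, a^2 .
\]
For \(\lambda_1 = 100\ \text{Hz}\) and \(a = 0.5\ \text{arcmin}\) this gives \(D_c \sim 25\ \text{arcmin}^2/\text{s}\), within an order of magnitude of the measured human drift.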
[figure 3-A]
[figure 3-B]
simulate human
With D set to 100 \(arcmin^2 /s\), corresponding to the measured statistics of human fixational drift (11–13), the factorized decoder performs well on images that cover at least 40 × 40 pixels (20 × 20 arcmin) (Fig. 3B). Reconstruction improves dramatically if one is satisfied with a lower resolution. For example, if the pixel size is increased from 0.5 to 1 arcmin, then the eye drift changes the pixel contents less rapidly, and four ganglion cells are available to report each pixel. Under these conditions, small 5 × 5 arcmin images can be decoded rapidly to high accuracy (Fig. 3B).
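The resolution effect can be checked by expressing the drift in pixel units, where the effective diffusion constant is \(D/a^2\):
\[
a = 0.5\ \text{arcmin}: \; \frac{D}{a^2} = \frac{100}{0.25} = 400\ \frac{\text{pixel}^2}{\text{s}}, \qquad
a = 1\ \text{arcmin}: \; \frac{D}{a^2} = \frac{100}{1} = 100\ \frac{\text{pixel}^2}{\text{s}} .
\]
So doubling the pixel size slows the drift fourfold in pixel units, while each 1-arcmin pixel is now reported by \(2 \times 2 = 4\) ganglion cells.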
in more realistic scenes
a smaller picture could be reconstructed better
but the advantage of the model becomes more salient as the number of pixels grows
discrimination task
The possible images represent the letters A-Z. Spikes are generated by a model retina with a biphasic temporal filter and diffusion coefficient \(D = 100\ \text{arcmin}^2/\text{s}\), and fed into the decoder.
biphasic temporal filter: gives the cells a transient response, so they are driven mainly by changes in light intensity and respond only weakly to static stimuli
The decoder achieves a 90% success rate after ~300 ms (about the length of a human fixation), comparable with human performance on this task.
static decoder: ~50%
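The paper's exact biphasic kernel is not recorded in these notes; a common illustrative choice (a difference of two alpha functions with made-up time constants) shows the key property, a weak response to static stimuli:

```python
import numpy as np

def biphasic_kernel(t, tau1=0.02, tau2=0.04, w=0.8):
    """Illustrative biphasic temporal filter: a fast positive lobe minus a
    slower negative lobe. Each lobe integrates to 1, so the kernel's net
    area is 1 - w: sustained input is mostly cancelled."""
    pos = (t / tau1**2) * np.exp(-t / tau1)
    neg = (t / tau2**2) * np.exp(-t / tau2)
    return pos - w * neg

dt = 1e-3
t = np.arange(0.0, 0.5, dt)
kernel = biphasic_kernel(t)

# Drive from a static stimulus decays to a small sustained level (~1 - w),
# so a model RGC with this filter needs image motion to fire strongly.
stimulus = np.ones_like(t)
drive = np.convolve(stimulus, kernel)[:len(t)] * dt
print(f"peak drive {drive.max():.2f}, sustained drive {drive[-1]:.2f}")
```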
Discussion
alternative approaches
By stabilizing the retinal image, as proposed here, fixational image motion is dealt with once and for all by dedicated neural circuitry that performs the same computation regardless of the image content.
This division of labor is functionally attractive, but one can imagine an alternative scenario in which the visual system deals with fixational motion separately whenever it analyzes the foveal image for a specific visual task.
The strategy, a piecewise static decoder (see the sketch after this block): 1. in each short time window, generate a position-invariant likelihood that each of the possible letters is in the image, using the static decoder; 2. sum these log-likelihoods across windows to accumulate evidence over time, while ignoring the continuity of the trajectory across adjoining windows.
can work well in the letter discrimination task
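A sketch of that piecewise strategy (window length, periodic shifts, and a uniform position prior are my assumptions):

```python
import numpy as np

def window_loglik(counts, templates, lam0, lam1, T_w):
    """Position-invariant log-likelihood of each letter in one time window.
    counts: (N, N) spike counts per cell; templates: (26, N, N) binary letters.
    Marginalizes over all (periodic) shifts of each template.
    """
    n_letters, N, _ = templates.shape
    out = np.empty(n_letters)
    for ell in range(n_letters):
        shift_ll = np.empty((N, N))
        for dx in range(N):
            for dy in range(N):
                s = np.roll(np.roll(templates[ell], dx, 0), dy, 1)
                rate = np.where(s == 1, lam1, lam0) * T_w
                # Poisson log-likelihood, dropping letter-independent terms
                shift_ll[dx, dy] = np.sum(counts * np.log(rate) - rate)
        # uniform prior over position: marginalize the shift via logsumexp
        out[ell] = np.logaddexp.reduce(shift_ll.ravel())
    return out

# Evidence accumulation: sum per-window log-likelihoods over windows,
# ignoring trajectory continuity between adjoining windows.
# total = sum(window_loglik(c, templates, 10.0, 100.0, 0.05)
#             for c in counts_per_window)
# guess = total.argmax()
```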
drawbacks: 1. seems complicated, because intricate neural circuitry must be set up for each possible pattern and every kind of visual task. 2. the two eyes jitter independently, producing relative motion between the two retinal images 3. not consistent with RGC response properties: - When the temporal response properties of RGCs are taken into account, eye motion has two competing effects within the model. On one hand, it introduces ambiguity in the interpretation of retinal spikes. On the other hand, it helps drive the RGCs, whose response to completely static stimuli is weak. - Previous analysis of ideal discrimination between two small stimuli at the limit of visual acuity suggested that a small drift would be beneficial, but the actual eye movements of human subjects are much larger and on balance deleterious
Indeed, certain types of retinal ganglion cells appear designed to ignore global image motion entirely and respond only when an object moves relative to the background scene.
A broader question is how the brain forms a stable scene representation across saccades (large gaze shifts); the computational principles presented here may not apply there, and the brain might use different neural circuitry for those conditions.
implementation in the brain
The factorized decoder mentioned above is based on the hypothesis that image pixels are the fundamental units.
But if the computation is performed in the visual cortex, the decoder may represent probabilities for the presence of more complex features.
The implementation of the factorized decoding strategy has several salient features. 1. divergent afferents from the retina, with the where population much smaller than the what population 2. the signal from the retina to the where and what populations requires multiplicative gating, controlled in a reciprocal fashion by the signals in those populations. 3. In the where population, local excitatory connections are required to implement the diffusive update between spikes, together with a global divisive normalization mechanism. 4. The updates involve local nonlinearities.
Location
So where should these circuits be in the human visual system?
fixational drift is independent in the two eyes → the circuit should be within the monocular part of the visual pathway
if the circuits were implemented in the retina itself → not accurate enough to perform such a task
Maybe in the foveal region of V1.
How to combine the images from the two eyes? hypothesis: two monocular populations of where neurons that control the inputs to a single population of what neurons
Such a binocular representation of the stabilized image may appear in disparity-selective neurons in V1 or downstream of V1, for example in a binocular population in V2 that receives monocular inputs
further experiment: record from cortical neurons that represent the primate fovea, whose receptive-field structure is fine enough to resolve patterns close to the animal's acuity.