Towards Laws of Visual Attention


Author: Dario Zanca
Date: March, 2019
Topics: Computational modeling of visual attention; Computer vision; Machine Learning.


Visual attention is a crucial process for humans and, more generally, for foveated animals. The ability to select relevant locations in the visual field greatly simplifies the problem of vision: it allows a parsimonious management of computational resources while catching and tracking coherences within the observed temporal phenomenon. Understanding the mechanisms of attention can reveal a lot about human intelligence. At the same time, it seems increasingly important for building intelligent artificial agents that aim to approach human performance in real-world visual tasks. For these reasons, in the past three decades, many studies have been conducted to create computational models of human attention. However, these have often been carried out as the mere prediction of saliency, i.e. a topographic map that represents the conspicuousness of scene locations. Although of great importance and usefulness in many applications, this type of study does not provide an exhaustive description of the attention mechanism, since it fails to describe its temporal component.

In this thesis, we propose three models of scanpaths, i.e. trajectories of free visual exploration. These models share a fundamental idea: the evolution of the mechanisms of visual attention has been guided by fundamental functional principles. Scanpath models emerge as laws of nature, in the framework of mechanics. The approaches are mainly data-driven (bottom-up): they are defined on video streams, and visual properties completely determine the forces that guide eye movements.

The first proposal (EYMOL) is a theory of free visual exploration based on the general Principle of Least Action. In the framework of analytic mechanics, a scanpath emerges in accordance with three basic functional principles: boundedness of the retina, curiosity for visual details, and invariance of the brightness along the trajectory. These principles are given a mathematical formulation that defines a potential energy. The resulting (differential) laws of motion are very effective in predicting saliency. Due to the very local nature of these laws (the computation at each time step involves only a single pixel and its close surround), the approach is suitable for real-time applications.
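To make the idea concrete, the following toy sketch integrates a second-order law of motion driven by a local "curiosity" potential (the brightness-gradient energy in a small patch around the gaze), with damping and a bounded retina. It is an illustrative approximation, not the actual EYMOL equations; the function name and all parameters are hypothetical.

```python
import numpy as np

def eymol_step(x, v, frame, dt=1.0, k=1.0, damping=0.2, r=2):
    """One toy integration step of a potential-driven scanpath.

    x, v   : 2D gaze position and velocity (pixels, pixels/step)
    frame  : grayscale image as a 2D float array
    The force climbs the local gradient of a 'detail' field computed
    only on a small patch around the gaze (local computation).
    """
    h, w = frame.shape
    i = int(np.clip(x[0], r + 1, h - r - 2))
    j = int(np.clip(x[1], r + 1, w - r - 2))
    patch = frame[i - r - 1:i + r + 2, j - r - 1:j + r + 2].astype(float)
    gy, gx = np.gradient(patch)                 # local brightness gradients
    detail = gx ** 2 + gy ** 2                  # 'curiosity' potential (negated)
    dy, dx = np.gradient(detail)                # force pulls towards detail
    c = r + 1                                   # patch centre index
    force = k * np.array([dy[c, c], dx[c, c]]) - damping * v
    v = v + dt * force
    x = np.clip(x + dt * v, 0, [h - 1, w - 1])  # bounded retina
    return x, v
```

The damping term is an ad-hoc stand-in for the dissipative effects that keep the trajectory stable; in the thesis the dynamics follow from the stationarity of an action functional rather than from hand-tuned forces.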



The second proposal (CF-EYMOL) extends the first model with information coming from the internal state of a pre-trained deep fully convolutional neural network. A visualization technique is presented to effectively extract convolutional feature (CF) activations. This information is then used to modify the potential field so as to favour the exploration of pixels that are more likely to belong to an object. This yields incremental improvements in saliency prediction and, at the same time, suggests how preferences can be introduced into the visual exploration process through an external (top-down) signal.
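As an illustration of this kind of top-down modulation, the sketch below upsamples a coarse feature-activation map, normalises it, and adds it to a low-level potential field. The function and its parameters are hypothetical and do not reproduce the CF-EYMOL formulation; they only show the mechanism of biasing exploration towards object-like regions.

```python
import numpy as np
from math import ceil

def objectness_potential(detail, activations, weight=0.5):
    """Bias a low-level detail field towards object-like regions.

    detail      : full-resolution field (e.g. brightness-gradient energy)
    activations : coarse feature map, e.g. one channel of a pre-trained
                  fully convolutional network (illustrative assumption)
    """
    h, w = detail.shape
    ry = ceil(h / activations.shape[0])
    rx = ceil(w / activations.shape[1])
    # nearest-neighbour upsampling of the coarse map to frame resolution
    up = np.repeat(np.repeat(activations, ry, axis=0), rx, axis=1)[:h, :w]
    up = (up - up.min()) / (np.ptp(up) + 1e-8)   # normalise to [0, 1]
    return detail + weight * up                   # additive top-down bias
```

An additive bias is the simplest choice; multiplicative gating of the potential would be an equally plausible way to inject the same top-down signal.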



The third proposal (G-EYMOL) can be seen as a generalisation of the previous works. It is developed entirely in the framework of gravitational (G) physics. No special rule defines the direction of exploration: the features themselves act as masses that attract the focus of attention. The features are given as input; we assume they come from an external computation. In principle, they can derive from a convolutional neural network, as in the previous proposal, or they can simply be raw brightness values. In our experiments, we use only two basic features: the spatial gradient of the brightness and the optical flow. This choice, loosely inspired by the raw information available at the earliest stage (V1) of human vision, is particularly effective in the experiments of scanpath prediction. The model also includes a dynamic process of inhibition of return, defined within the same framework, which is crucial to provide the extra energy needed for the exploration process. The laws of motion that are derived are integro-differential, as they include sums over the entire retina. Despite this, the system is still well suited for real-time applications, since only a single-step computation is needed to calculate the next gaze position.
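The gravitational intuition can be sketched as follows: every pixel attracts the gaze with a force proportional to its feature "mass", the sum runs over the whole retina (the integral part of the dynamics), and a decaying inhibition map removes mass from recently visited locations. This is a toy illustration under simplifying assumptions (softened Newtonian force, ad-hoc velocity damping), not the actual G-EYMOL laws; all names and constants are hypothetical.

```python
import numpy as np

def gravity_step(x, v, masses, visited, dt=1.0, ior=0.05, decay=0.99):
    """One toy step of a gravitational scanpath with inhibition of return.

    x, v    : 2D gaze position and velocity
    masses  : feature field acting as attracting mass (e.g. gradient energy)
    visited : inhibition-of-return map in [0, 1], updated in place
    """
    h, w = masses.shape
    ii, jj = np.mgrid[0:h, 0:w].astype(float)
    d = np.stack([ii - x[0], jj - x[1]])          # displacement to every pixel
    r2 = d[0] ** 2 + d[1] ** 2 + 1.0              # softened squared distance
    m = masses * (1.0 - visited)                  # inhibited mass field
    force = (m * d / r2 ** 1.5).sum(axis=(1, 2))  # sum over the entire retina
    v = 0.9 * v + dt * force                      # ad-hoc damping
    x = np.clip(x + dt * v, 0, [h - 1, w - 1])
    # deposit inhibition at the new gaze position, let old inhibition decay
    i, j = int(x[0]), int(x[1])
    visited *= decay
    visited[i, j] = min(1.0, visited[i, j] + ior)
    return x, v, visited
```

Note how inhibition of return does the exploratory work: once a salient region has been fixated, its effective mass shrinks and the remaining masses pull the gaze elsewhere, without any explicit rule for choosing the next target.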



Category: PhD Theses