Learning in Visual Environments

This project aims at developing intelligent agents with visual skills that operate in a given environment. A continuous stream of data (a video signal) is presented to the agent, which is expected to learn from the processed information, progressively developing its skills in making predictions over the “pixels” of the observed data stream. While today's Computer Vision systems are usually built “offline”, using large-scale, fully supervised datasets, in this project we consider the most natural setting, in which the agent, while “living” in the observed environment, receives a small amount of information (supervisions) from the user and can also ask for supervisions when needed. The principle of Least Cognitive Action, which parallels the laws of mechanics, is exploited to devise the life-long online learning laws that drive the behaviour of the agent. Motion invariance allows the agent to develop robust features; the idea is further extended to invariance to some categories of eye movements, where the notion of focus of attention is introduced to reduce the information overload that is typical of commonly observed scenes. Human language is processed to create a dialogue-based interaction with the agent.

From Alessandro Betti, Stefano Melacci, Marco Gori. “Motion Invariance in Visual Environments.” 28th International Joint Conference on Artificial Intelligence (IJCAI), 2019 (to appear):

“The puzzle of computer vision might find new challenging solutions when we realize that most successful methods are working at image level, which is remarkably more difficult than processing directly visual streams, just as it happens in nature. The processing of a stream of frames naturally leads to formulate the motion invariance principle, which enables the construction of a new theory of visual learning based on convolutional features. The theory addresses a number of intriguing questions that arise in natural vision, and offers a well-posed computational scheme for the discovery of convolutional filters over the retina. They are driven by the Euler-Lagrange differential equations derived from the principle of least cognitive action, that parallels the laws of mechanics. Unlike traditional convolutional networks, which need massive supervision, the proposed theory offers a truly new scenario in which feature learning takes place by unsupervised processing of video signals.”
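The motion invariance principle mentioned in the abstract can be illustrated numerically. What follows is only a minimal sketch, assuming that invariance means a feature value stays constant along the motion trajectory of a pixel, i.e. that the total derivative ∂φ/∂t + v·∇φ vanishes. The function name `motion_invariance_residual`, the finite-difference discretization, and the assumption that a velocity field is already available are all hypothetical choices made for illustration; none of them is taken from the repository.

```python
import numpy as np


def motion_invariance_residual(phi_t0, phi_t1, velocity, dt=1.0):
    """Finite-difference residual of d(phi)/dt = phi_t + v . grad(phi).

    phi_t0, phi_t1: 2-D arrays (H, W) holding one feature map at two
        consecutive frames (hypothetical inputs).
    velocity: array (H, W, 2) with the per-pixel motion (vy, vx),
        in pixels per unit of time (assumed to be given).
    A residual close to zero means the feature is invariant to the motion.
    """
    phi_t = (phi_t1 - phi_t0) / dt            # temporal derivative
    grad_y, grad_x = np.gradient(phi_t0)      # spatial derivatives (rows, cols)
    return phi_t + velocity[..., 0] * grad_y + velocity[..., 1] * grad_x


if __name__ == "__main__":
    # A linear ramp translated by (vy, vx) = (0.5, -1.0) pixels per frame.
    ys, xs = np.mgrid[0:16, 0:16].astype(float)
    phi_t0 = 2.0 * xs + 3.0 * ys
    phi_t1 = 2.0 * (xs + 1.0) + 3.0 * (ys - 0.5)
    velocity = np.zeros((16, 16, 2))
    velocity[..., 0] = 0.5
    velocity[..., 1] = -1.0
    residual = motion_invariance_residual(phi_t0, phi_t1, velocity)
    print("max |residual|:", np.abs(residual).max())
```

For a feature that simply moves with the scene, as in the ramp above, the residual is zero up to floating-point error. A learning scheme could penalize the squared residual to favour features that track moving content, although how the project actually enforces the constraint is described in the paper, not here.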


Main Publications


Related Work



The code that implements a feature extractor based on the principle of Least Cognitive Action is still very much a work in progress: we are currently working on making it more accessible and better documented. It can be downloaded by cloning the following GitLab repository (the default branch is 2019):

git clone -b 2019 https://gitlab.com/mela64/motioninvariance


  • The instructions on how to set up the code and run a sample experiment can be found here.
  • A visual description of the organization of the software is also included.



Video showing the feature extraction process, exploiting our visualization tools:


Videos showing our approach to the problem of estimating the focus of attention. We model the scanpath as a dynamic process that can be interpreted as a variational law analogous to those of mechanics, in which the focus of attention is subject to a gravitational field. See also this link.

Focus of attention on static scenes
Focus of attention on dynamic scenes
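
The idea of a focus of attention moving in a gravitational field can be sketched in a few lines. The example below is an illustration under stated assumptions, not the model used in the videos: it assumes that the attracting "masses" are the brightness gradient magnitudes of a single grayscale frame, that the attraction follows a softened inverse-distance law, and that the motion is damped by a viscous term. The function names, the parameters `softening`, `damping` and `dt`, and the explicit Euler-style integration are all hypothetical.

```python
import numpy as np


def attention_masses(frame):
    """Assumed mass distribution: normalized brightness gradient magnitude."""
    grad_y, grad_x = np.gradient(frame.astype(float))
    masses = np.hypot(grad_x, grad_y)
    total = masses.sum()
    return masses / total if total > 0 else masses


def gravitational_field(position, masses, softening=1.0):
    """Net attraction on the focus at `position` = (y, x).

    Each pixel pulls with strength m * d / (|d|^2 + softening), a softened
    inverse-distance law chosen here only for illustration.
    """
    height, width = masses.shape
    ys, xs = np.mgrid[0:height, 0:width]
    dy = ys - position[0]
    dx = xs - position[1]
    dist2 = dy ** 2 + dx ** 2 + softening
    return np.array([np.sum(masses * dy / dist2), np.sum(masses * dx / dist2)])


def simulate_scanpath(frame, start, steps=200, dt=0.1, damping=1.0):
    """Integrate a'' = field(a) - damping * a' and return the (y, x) path."""
    masses = attention_masses(frame)
    upper = np.array(frame.shape, dtype=float) - 1.0
    position = np.asarray(start, dtype=float)
    velocity = np.zeros(2)
    path = [position.copy()]
    for _ in range(steps):
        acceleration = gravitational_field(position, masses) - damping * velocity
        velocity = velocity + dt * acceleration
        # Keep the focus inside the frame.
        position = np.clip(position + dt * velocity, 0.0, upper)
        path.append(position.copy())
    return np.array(path)


if __name__ == "__main__":
    # Toy frame: a bright square on a dark background.
    frame = np.zeros((32, 32))
    frame[20:24, 20:24] = 1.0
    path = simulate_scanpath(frame, start=(5.0, 5.0), steps=500)
    print("start:", path[0], "end:", path[-1])
```

In this toy setting the focus drifts from its starting point toward the edges of the bright square, since that is where all the mass is concentrated. Handling dynamic scenes or preventing the focus from settling on a single location would require additional terms that this sketch does not attempt to model.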