From birth, babies receive visual and auditory stimuli that are essential for learning something fundamental in their lives: language. Between six and nine months, they begin to produce their first words, associating sounds with real-world objects and concepts. By the age of two, they typically have a vocabulary of around 300 words. But how does this learning process take place? A team of researchers from New York University studied recordings of a child’s daily life to find the answer. The experiment not only confirmed the link between visual and linguistic representation – that is, between what is seen and the word that corresponds to it – but also contributed to the development of an artificial intelligence (AI) model that learned to recognize different objects in much the same way children do.
“Large AI systems are trained and fed with an astronomical amount of data. We are talking about billions of words to be able to develop a linguistic system,” explains Wai Keen Vong, a doctor of psychology and computer science who coordinated the study, published this Thursday in the journal Science. “However, humans only need a few thousand words to achieve an effective communication system,” he adds. From this contrast came the interest in investigating whether an AI could learn to talk the same way children do: by observing its environment, listening to the people around it, and making the connection between what it sees and what it hears.
Early language acquisition is a widely debated topic for which several hypotheses have been proposed. Traditionally, these types of studies have been conducted in controlled laboratories, resulting in findings that often cannot be effectively extrapolated to more dynamic and varied real-world settings. “The novelty of this analysis lies in the fact that we were able to work with first-hand data, from a real learning situation,” emphasizes Vong.
To this end, Vong’s team analyzed 61 hours of the life of Sam, an Australian boy who for a year and a half – from six to 25 months – wore a helmet with a camera that recorded his daily interactions with his parents and grandparents. In fact, the camera captured only about 1% of his waking time over the course of the experiment. Even so, it produced hundreds of images reproducing exactly what the child saw, accompanied by the linguistic expressions of those close to him, which described the objects around him. “For example, during a meal, the camera on his head recorded the image of a spoon at the same time that his mother asked him something related to this utensil. And so on, with dozens of everyday objects,” explains Vong.
The link between these two channels is almost never obvious. In fact, the researcher acknowledges that part of the challenge for babies is working out exactly which word goes with the object they are interacting with. “Most of the time, parents don’t label every item. For every ball that Sam looked at, his parents didn’t tell him ‘that’s a ball’ or ‘look at the ball’. He heard the words in a natural context, and the difficulty lies in determining precisely, within a longer or shorter sentence, which word corresponds to the round object he was playing with,” Vong points out.
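This ambiguity is the classic word-referent mapping problem. One simple way to see how it can be resolved across many scenes – a toy illustration only, not the study’s actual method – is to count how often each heard word co-occurs with each visible object: over enough scenes, a word ends up co-occurring most with its true referent. The scene data below is invented for the example.

```python
from collections import defaultdict

# Toy cross-situational learning: each scene pairs the objects the child
# sees with the full sentence heard, so no utterance labels anything directly.
scenes = [
    ({"ball", "spoon"}, "look at the ball"),
    ({"ball", "dog"}, "throw the ball to the dog"),
    ({"spoon", "bowl"}, "eat with your spoon"),
    ({"dog", "ball"}, "the dog wants the ball"),
]

# Count how often each word co-occurs with each visible object.
cooc = defaultdict(lambda: defaultdict(int))
for objects, sentence in scenes:
    for word in sentence.split():
        for obj in objects:
            cooc[word][obj] += 1

def best_referent(word):
    """Guess a word's referent: the object it co-occurs with most often."""
    counts = cooc[word]
    return max(counts, key=counts.get)

print(best_referent("ball"))  # the word "ball" co-occurs most with the ball
```

Ambiguous words (like “the”, which co-occurs with everything) stay unresolved under this scheme, which mirrors the difficulty Vong describes.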
Train an AI like a baby
After observing the child’s behavior, the researchers were able to confirm that he had learned the meaning of words by connecting the visual stimulus – that is, the image in front of him – with the responses of his family members, who repeated the corresponding word. With these results, they moved on to the second phase of the experiment: testing whether an AI would be able to learn to recognize objects in the same way as Sam.
The artificial intelligence model, called CVCL (Child’s View for Contrastive Learning), was trained on 64 visual categories – utensils, toys, and animals, among others – together with the transcription of what Sam heard while looking at these objects. Once this database was created, the researchers began testing whether the AI was capable of identifying the images. According to Vong, the model, with limited sensory information and relatively generic learning mechanisms, provides a computational basis for studying how children acquire their first words and how those words can connect to the visual world.
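The contrastive objective behind a model like CVCL can be sketched in a few lines. The idea is to embed each video frame and its accompanying utterance in a shared space, then pull matched (frame, utterance) pairs together while pushing mismatched pairs apart. The sketch below uses random vectors in place of real encoders and a standard symmetric InfoNCE loss; all details (dimensions, temperature) are illustrative assumptions, not the paper’s actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    """Project embeddings onto the unit sphere (cosine similarity)."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Stand-ins for encoder outputs: a batch of 4 aligned (frame, utterance) pairs.
image_emb = normalize(rng.normal(size=(4, 8)))
text_emb = normalize(rng.normal(size=(4, 8)))

def contrastive_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE: row i of img should match row i of txt."""
    logits = img @ txt.T / temperature  # pairwise similarity matrix
    idx = np.arange(len(img))
    # Log-softmax over rows (image -> text) and columns (text -> image).
    log_sm_rows = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_sm_cols = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    loss_i2t = -log_sm_rows[idx, idx].mean()
    loss_t2i = -log_sm_cols[idx, idx].mean()
    return (loss_i2t + loss_t2i) / 2

loss = contrastive_loss(image_emb, text_emb)
print(f"contrastive loss on a random batch: {loss:.3f}")
```

Training would minimize this loss over many such batches, so that a spoon’s image and the word “spoon” end up near each other in the shared embedding space.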
“We found that CVCL can learn to make connections between images and text from limited fragments of a single child’s experience,” the authors point out in the study. In some cases, the objects appeared against a white background, while in others they were in environments with more stimuli. The model’s classification accuracy reached 61.6%, and it remained high even when the system was fed images outside Sam’s recordings, on which the AI had not been trained. “The results confirm our hypothesis that with just two inputs – what the child sees and what he hears – it is possible to achieve and accelerate this type of learning,” emphasizes Vong.
Study how speech is born
Antonio Rodríguez Fornells, researcher at the Institute of Neurosciences of the University of Barcelona, highlights the innovative aspect of the study, which opens the way to understanding, through computer simulations, the minimal learning mechanisms children use to face the challenge of learning a language: “Previous studies on babies in developmental psychology provide key information through very novel experiments, but the lack of neuroscience or neuroimaging studies on them (due to the difficulty of applying these techniques to babies) does not allow the same progress in clarifying the brain mechanisms that support these language acquisition processes,” explains this neuroscientist.
Additionally, he acknowledges that the simulations proposed in the article support some previously proposed theories of language. “Among them, that simple associative learning mechanisms (which allow images and words to be linked) are sufficient in a natural learning environment (like the one children experience from birth and in the first months of their life) to learn these relationships and generalize the content of meaning,” adds Rodríguez Fornells.
However, the study has certain limitations. The CVCL model was trained with recordings from a single child’s head-mounted camera and learned from voice transcriptions rather than direct speech, which omits important nuances such as intonation and emphasis. “It should also be remembered that the model’s learning was passive, based on recordings, without active interaction with the environment, which differs from the way children learn in real settings,” acknowledge the authors of the research.