Microsoft Technology Generates Impressive Descriptions of Photos

By now, you probably know that companies like Google and Facebook are investing heavily in image recognition. Current technologies can already identify faces and describe objects found in pictures, for example. Microsoft, another giant taking the matter seriously, is already at the stage of worrying about context: the company has created a system of neural networks that tries to describe what is happening in a picture the way we would.
Image recognition is more important than it seems. Services such as Google Photos and Flickr have technologies that try to spare us the task of organizing photos manually or tagging them for easy searching. On Facebook, as you may know, face recognition helps us tag the contacts who appear in our photos.
But Mark Zuckerberg's company did not stop there. Earlier this month, it announced a technology able to analyze photos and create captions for them automatically. The aim is to describe the images in the news feed for visually impaired users. No wonder: the company estimates that 39 million members of the social network are blind or have severe difficulty seeing.
A noble idea, no? But there is a hitch: at least for now, the system can describe objects in an image but is not very good with context. It might, for example, describe a picture as "the image appears to show three people smiling outdoors", but it does not indicate whether those people are on the street, in a park, and so on.
Members of Microsoft Research have teamed up with researchers from several US universities, and even an artificial intelligence expert from Facebook, to overcome this kind of limitation: the idea is to make the algorithm describe the "story" that the image tells. In fact, the researchers prefer that term, "story", rather than "description" or "caption", to explain what the technology does.
Frank Ferraro, a researcher at Johns Hopkins University and one of the project's authors, gives an example: a photo album shows people drinking at a bar; in one of the later images, someone is simply lying on a couch. A conventional automatic captioning system would probably describe that scene as something like "there is a person lying on the couch." The Microsoft Research system, however, can take the earlier pictures (the drinking session) into account and label the image as "this person is probably drunk."
That is hard to do. We have an incredible ability to identify or imagine a photo's context because we rely on a number of cues: facial expressions, experience in similar environments, memories of places and so on. Even so, there are situations where we cannot understand exactly what is going on because the information that would give us context is insufficient.
That is why the Microsoft Research system works best with groups of images. Instead of generating an isolated result for each photo, the algorithm considers what has been identified in the other images to reinforce its basic parameters. Thus, an object that appears in only some photos may help explain what happens in others.
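To make the idea concrete, here is a minimal sketch in Python of that album-level reasoning. It is not the actual Microsoft Research model, and the labels and rules are invented for illustration: each photo is reduced to a set of detected labels, the labels are pooled across the whole album, and the pooled context is allowed to influence how an individual photo is described.

```python
from collections import Counter

# Toy album: each photo is represented by labels a recognizer might detect.
album = [
    {"bar", "people", "drinks", "smiling"},
    {"bar", "people", "drinks"},
    {"person", "couch"},          # viewed alone, this photo is ambiguous
]

def album_context(photos):
    """Pool labels across the whole album so each photo can borrow
    evidence from the others."""
    counts = Counter()
    for labels in photos:
        counts.update(labels)
    return counts

def describe(photo_labels, context):
    """Very rough stand-in for a caption generator: it mentions the
    photo's own labels, plus album-level evidence the photo lacks."""
    caption = "photo shows: " + ", ".join(sorted(photo_labels))
    if "drinks" in context and "couch" in photo_labels:
        caption += " (earlier photos suggest a night of drinking)"
    return caption

ctx = album_context(album)
for labels in album:
    print(describe(labels, ctx))
```

The point of the sketch is only the flow of information: the couch photo on its own says nothing about drinking, but the pooled album context does.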
As an example, the picture at the beginning of the post was described as follows: "it was the baby's first birthday." See how the automatic captions for other images from the same album turned out (in translation):
As you may have guessed, this is only part of the job. The researchers had to train the system, which consists of deep neural networks (the kind that has been used for speech recognition and text translation), with image sequences extracted from Flickr. Workers on Amazon's Mechanical Turk crowdsourcing site were hired to write individual and sequential captions, building the dataset used to train the algorithm.
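A hypothetical sketch of what one record in such a training set might look like is shown below. The field names and captions are assumptions for illustration only; the key idea from the article is that each Flickr sequence carries both captions written for each photo in isolation and captions written with the whole sequence in mind.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AnnotatedSequence:
    photo_urls: List[str]         # ordered images from one album
    isolated_captions: List[str]  # one caption per photo, viewed alone
    story_captions: List[str]     # captions written with the sequence in mind

example = AnnotatedSequence(
    photo_urls=["photo1.jpg", "photo2.jpg", "photo3.jpg"],
    isolated_captions=[
        "people at a bar",
        "a group raising glasses",
        "a person lying on a couch",
    ],
    story_captions=[
        "friends met at the bar",
        "they kept drinking all night",
        "this person is probably drunk",
    ],
)
```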
In the next phase, the system was fed new images and had to describe them based on the knowledge gained in training. To validate the descriptions, the researchers compared them with captions written by people for the same pictures. The end result was quite convincing.
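The article does not say which metric was used for that comparison, but the step itself is simple to picture: score each generated caption against a human-written reference. The toy word-overlap function below is only an illustration of that idea, not the researchers' evaluation method.

```python
def overlap_score(generated: str, reference: str) -> float:
    """Toy score: fraction of the reference caption's words that also
    appear in the generated caption (illustration only)."""
    gen = set(generated.lower().split())
    ref = set(reference.lower().split())
    return len(gen & ref) / max(len(ref), 1)

print(overlap_score("this person is probably drunk",
                    "he is drunk after the party"))  # ~0.33
```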
The project is still at an early stage, however: more training and refinement are required before it becomes really useful. Even when it matures, the system will probably never be completely accurate about the context of images (then again, neither are we). But combine it with algorithms that recognize places, for example, and imagine how far automatic image-description tools may go.