Google's ‘Show and Tell’ AI can tell you exactly what's in a photo (almost): System generates captions with nearly 94% accuracy

  • Latest version of the system is faster to train and far more accurate
  • Picture captioning AI can now generate descriptions with 94% accuracy
  • Firm has now released the open-source code to let developers take part

Artificial intelligence systems have recently begun to try their hand at writing picture captions, often producing hilarious, and even offensive, failures.

But Google's Show and Tell algorithm has almost perfected the craft.

According to the firm, the AI can now describe images with nearly 94 percent accuracy and may even ‘understand’ the context and deeper meaning of a scene.

Google has released the open-source code for its image captioning system, allowing developers to take part, the firm revealed on its research blog.

The AI was first trained in 2014, and has steadily improved in the time since.

Now, the researchers say it is faster to train, and produces more detailed, accurate descriptions.

The most recent version of the system uses the Inception V3 image classification model, and undergoes a fine-tuning phase in which its vision and language components are trained on human-generated captions.
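The encoder-decoder idea behind this design can be sketched in a few lines of Python. The snippet below is an illustrative simplification, not Google's released code: it assumes the tf.keras API, and the vocabulary size, embedding width, caption length and the use of the image features as the LSTM's initial state are placeholder choices rather than details confirmed by the firm.

import tensorflow as tf

VOCAB_SIZE = 12000   # assumed caption vocabulary size
EMBED_DIM = 512      # assumed embedding / LSTM width
MAX_WORDS = 20       # assumed maximum caption length

# Image encoder: Inception V3 without its classification head, so it returns
# a single pooled feature vector per image instead of class probabilities.
encoder = tf.keras.applications.InceptionV3(
    include_top=False, pooling="avg", weights=None)

image_in = tf.keras.Input(shape=(299, 299, 3), name="image")
words_in = tf.keras.Input(shape=(MAX_WORDS,), dtype="int32", name="caption_prefix")

# Project the image features and use them to initialise the language model,
# one common way of conditioning a caption decoder on an image.
features = encoder(image_in)
h0 = tf.keras.layers.Dense(EMBED_DIM, activation="relu")(features)
c0 = tf.keras.layers.Dense(EMBED_DIM, activation="relu")(features)

word_vectors = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(words_in)
lstm_out = tf.keras.layers.LSTM(EMBED_DIM)(word_vectors, initial_state=[h0, c0])

# Logits over the vocabulary for the next word of the caption.
next_word_logits = tf.keras.layers.Dense(VOCAB_SIZE)(lstm_out)
captioner = tf.keras.Model([image_in, words_in], next_word_logits)

In other words, the image network and the language network are wired into a single model, which is what makes the joint fine-tuning described below possible.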

HOW IT WORKS

The AI can describe exactly what's in a scene

The system uses the Inception V3 image classification model as the basis for the image encoder, allowing for 93.9 percent classification accuracy.

These encodings help the system to recognise various objects in an image.

Then the image model is fine-tuned, allowing the system to describe the objects rather than simply classifying them.

So, it can identify the colours in an image, and determine how objects in the image relate to each other.

In this phase, the system's vision and language components are jointly trained on human-generated captions.
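A rough sketch of those two stages, continuing the hypothetical captioner and encoder objects from the earlier snippet, might look like the following; the optimiser, learning rates and the commented-out dataset names are assumptions, not details from Google's blog post.

# Stage 1: train the caption decoder while the Inception V3 encoder stays frozen.
encoder.trainable = False
captioner.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# captioner.fit(pretraining_dataset, epochs=10)   # (image, caption prefix) -> next word

# Stage 2: unfreeze the encoder so the vision and language components are
# fine-tuned jointly on human-generated captions, with a smaller learning rate.
encoder.trainable = True
captioner.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# captioner.fit(finetuning_dataset, epochs=5)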


Examples of its capabilities show the AI can describe exactly what is in a scene, including ‘A person on a beach flying a kite,’ and ‘a blue and yellow train traveling down train tracks.’

As the system learns on a training set of human captions, it sometimes will reuse these captions for a similar scene.

This, the researchers say, may prompt some questions about its true capabilities, but while it does ‘regurgitate’ captions when applicable, this is not always the case.

‘So does it really understand the objects and their interactions in each image? Or does it always regurgitate descriptions from the training data?’ the researchers wrote.


‘Excitingly, our model does indeed develop the ability to generate accurate new captions when presented with completely new scenes, indicating a deeper understanding of the objects and context in the images.’

An example shared in the blog post shows how the components of separate images come together to generate new captions.

Three separate images of dogs in various situations can thus lead to the accurate description of a photo later on: ‘A dog is sitting on the beach next to a dog.’

‘Moreover,’ the researchers explain, ‘it learns how to express that knowledge in natural-sounding English phrases despite receiving no additional language training other than reading the human captions.’

MICROSOFT'S CAPTION BOT GETS IT HILARIOUSLY WRONG

Microsoft's CaptionBot, which analyses pictures in order to formulate captions, has been spot on with some results, but horridly wrong for others: it thought the First Lady Michelle Obama was a cell phone.

When it was released to the public earlier this year, the program seemed to be accurate with almost all of the images it received.

It also thought ‘the dress’ was actually a cat wearing a tie.


But recently, it mistook an elbow for a woman brushing her teeth, and a close-up of a human eye for a close-up of a doughnut near a cup.

‘It's early days for image captioning,’ a Microsoft spokesperson told Dailymail in April.

‘Like any artificial intelligence system, we use feedback from users of CaptionBot to improve our results and make it more accurate.’

