Friday, October 25, 2013

Flickr Tags are Useless

Recently I began to code up an image classifier, and my hope was to use Flickr images as a source of real data.  The Flickr API allows you to easily obtain the most recent 100 images that have a given tag.  However, I found the correspondence between tags and content to be marginal.  About 75% of the images I downloaded tagged with "chicken" were pictures of a ferret (presumably named Chicken?), and more than half of the images tagged with "kitten" were of puppies.  Images tagged with "sushi" included dresses, kids playing the violin, ducks, and an assortment of other objects with absolutely no relation to sushi.

Tagged with "sushi"

The problem lies with both the API method and human behavior.  Many people batch-tag photos, so that everything in a collection gets the same label.  If I go on a road trip in a Corvette and take a number of pictures, many of them are bound not to include a Corvette, yet they will all be tagged with "corvette".  The API retrieves the most recent images, so if someone has just uploaded a large batch of mis-tagged images, those are what the API returns.  A much better approach is to obtain only a small sample from each account, if possible.
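
The per-account cap suggested above can be sketched in a few lines of Python.  This is only an illustration, not the code I used: it assumes the search results have already been fetched and parsed into a list of dictionaries, each carrying an "owner" key identifying the uploading account.  The function name cap_per_owner, the max_per_owner parameter, and the sample records are all hypothetical.

```python
from collections import defaultdict


def cap_per_owner(photos, max_per_owner=2):
    """Keep at most max_per_owner photos from each account, preserving order.

    Assumes each photo is a dict with an "owner" key (an assumption about
    how the search results were parsed, not a guaranteed format).
    """
    kept = []
    counts = defaultdict(int)  # number of photos kept so far, per owner
    for photo in photos:
        owner = photo["owner"]
        if counts[owner] < max_per_owner:
            kept.append(photo)
            counts[owner] += 1
    return kept


# Hypothetical results for one tag: "alice" batch-uploaded most of them.
recent = [
    {"id": "1", "owner": "alice"},
    {"id": "2", "owner": "alice"},
    {"id": "3", "owner": "alice"},
    {"id": "4", "owner": "bob"},
    {"id": "5", "owner": "alice"},
    {"id": "6", "owner": "carol"},
]

sample = cap_per_owner(recent, max_per_owner=2)
print([p["id"] for p in sample])
```

Capping each account does not fix mis-tagging, but it keeps one person's batch-tagged upload from dominating the sample.  To still end up with 100 usable images, you would probably need to request more than 100 results and filter them down.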

One application of real-time image classification would be a system that automatically prunes superfluous tags, or at least warns the user that a photo may be mis-tagged.