Project Goal
Our client wanted to find images using natural-language queries, for example "images of me and my wife drinking wine in Italy". This is a hard challenge because you need to identify not only "wine" in the image but also understand the action "drinking". Without that, finding one specific image normally feels like searching for a needle in a haystack.
Solution
Object Detection
We studied and tested the bleeding-edge object detection models, including Swin Transformers, EfficientNet with FPN, and many others. But given the high complexity of this problem, we went deeper and found VinVL, a model created specifically to provide tags to a captioning system. It produced many more tags, which is exactly what this project needed.
Image Captioning
Now that we know what is in an image (e.g. wine, people, ...), we need to know what the people are doing. In this case it might seem obvious, but there are a lot of cases where it is NOT! For example, on a mountain we can either climb or hike. This is where image captioning does its magic, by providing an understanding of the action and identifying more image elements.
Our Solution
We provided all the necessary information to the NLP team. We not only used a state-of-the-art object detection system but also went the extra mile with the captioning system to ensure that actions were accurately captured, since this was one of our client's main priorities.
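To make the hand-off concrete, here is a minimal sketch of how detector tags and a caption could be combined into one searchable record. It is an illustration only, not the production pipeline: the `ImageRecord` class, the `searchable_terms` and `score` helpers, the sample image IDs, tags, and captions are all hypothetical, and the naive word-overlap scoring stands in for whatever matching the NLP team actually used.

```python
from dataclasses import dataclass, field


@dataclass
class ImageRecord:
    """Hypothetical container for what the vision models extract from one image."""
    image_id: str
    tags: list[str] = field(default_factory=list)  # assumed output of the object detector
    caption: str = ""                               # assumed output of the captioning model


def _words(text: str) -> set[str]:
    """Split text into lowercased words, dropping surrounding punctuation."""
    cleaned = {w.strip(".,!?").lower() for w in text.split()}
    cleaned.discard("")
    return cleaned


def searchable_terms(record: ImageRecord) -> set[str]:
    """Union of detector tags and caption words for one image."""
    return {t.lower() for t in record.tags} | _words(record.caption)


def score(query: str, record: ImageRecord) -> int:
    """Naive relevance: number of query words found in the record's terms."""
    return len(_words(query) & searchable_terms(record))


# Made-up sample data: both images contain the tag "wine",
# but only the caption of the first one carries the action "drinking".
records = [
    ImageRecord("img_001", ["wine", "glass", "person"], "a couple drinking wine at a table"),
    ImageRecord("img_002", ["wine", "bottle", "shelf"], "bottles of wine on a shelf"),
]

query = "drinking wine"
best = max(records, key=lambda r: score(query, r))
print(best.image_id)
```

In this toy data the tags alone cannot separate the two images, since both contain "wine". The action word only appears in the caption, which is why the captioning step matters for queries about what people are doing.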