Automatic Creative Selection with Cross-Modal Matching


Application developers advertise their Apps by creating product pages with App images and bidding on search terms. It is therefore crucial for App images to be highly relevant to the search terms. Solutions to this problem require an image-text matching model that predicts the quality of the match between the chosen image and the search terms. In this work, we present a novel approach to matching an App image to search terms, based on fine-tuning a pre-trained LXMERT model. We show that, compared to the CLIP model and to a baseline that uses a Transformer model for search terms and a ResNet model for images, we significantly improve the matching accuracy. We evaluate our approach using two sets of labels: advertiser-associated (image, search term) pairs for a given application, and human ratings of the relevance between (image, search term) pairs. Our approach achieves a 0.96 AUC score on the advertiser-associated ground truth, outperforming the Transformer+ResNet baseline and the fine-tuned CLIP model by 8% and 14%, respectively. On the human-labeled ground truth, our approach achieves a 0.95 AUC score, outperforming the Transformer+ResNet baseline and the fine-tuned CLIP model by 16% and 17%, respectively.
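For readers unfamiliar with the AUC metric quoted above, the following is a minimal sketch of how it can be computed from a matching model's scores. It is not the authors' evaluation code: the function name `auc_score`, the labels, and the scores are all hypothetical, and the pairwise formulation is just one standard way to define AUC.

```python
def auc_score(labels, scores):
    """AUC as the probability that a randomly chosen relevant pair
    receives a higher score than a randomly chosen irrelevant pair.
    Ties count as half a win. This is the pairwise (Mann-Whitney) form."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = the (image, search term) pair is labeled relevant,
# 0 = not relevant; scores are made-up outputs of some matching model.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
auc = auc_score(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 1.0 would mean every relevant pair is scored above every irrelevant pair, while 0.5 corresponds to random ranking.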
