Paper summary: Large-scale web-crawled datasets are fundamental to the success of pre-training vision-language models such as CLIP. However, the inherent noise and potential irrelevance of web-crawled AltTexts pose challenges for achieving precise image-text alignment. Existing methods that use large language models (LLMs) for caption rewriting have shown promise on small, curated datasets like CC3M and CC12M. This study introduces a scalable pipeline for rewriting noisy captions. Unlike recent LLM rewriting techniques, we emphasize the incorporation of visual concepts into captions, termed Visual-enriched Captions (VeCap). To ensure data diversity, we propose a novel mixed training scheme that optimizes the use of AltTexts alongside the newly generated VeCap. We showcase the adaptation of this method for training CLIP on large-scale web-crawled datasets, termed VeCLIP. Using this cost-effective pipeline, we effortlessly scale our dataset up to 300 million samples, named the VeCap dataset. Our results show significant advantages in image-text alignment and overall model performance. For example, VeCLIP achieves up to a +25.2% gain on COCO and Flickr30k retrieval tasks under the 12M setting. For data efficiency, VeCLIP achieves a +3% gain while using only 14% of the data employed in vanilla CLIP and 11% of that in ALIGN. We also observe that the VeCap data is complementary to other well-curated datasets suited for zero-shot classification tasks. When combining VeCap and DFN, our model achieves strong performance on both image-text retrieval and zero-shot classification tasks, e.g., 83.1% accuracy@1 on ImageNet zero-shot for an H/14 model.
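The abstract describes a mixed training scheme that draws on both the original AltTexts and the LLM-rewritten VeCap captions to preserve data diversity. Below is a minimal, hypothetical Python sketch of one way such per-sample caption mixing could look at data-loading time; the 50/50 sampling ratio, the function name, and the example strings are illustrative assumptions, not the paper's exact recipe.

```python
import random

def pick_caption(alt_text: str, vecap_caption: str, p_vecap: float = 0.5) -> str:
    """Randomly select either the raw AltText or the visual-enriched (VeCap)
    rewrite for a training sample. The per-sample coin flip and the default
    0.5 probability are assumptions for illustration only."""
    return vecap_caption if random.random() < p_vecap else alt_text

# Toy batch with made-up captions: each step sees either the noisy AltText
# or its enriched rewrite, so both caption sources contribute over training.
batch = [
    {"alt": "IMG_0042.jpg", "vecap": "A brown dog running along a sandy beach at sunset."},
    {"alt": "product photo", "vecap": "A red ceramic mug on a wooden desk next to a laptop."},
]
captions = [pick_caption(sample["alt"], sample["vecap"]) for sample in batch]
print(captions)
```

In this kind of setup, the sampling probability would be a tunable hyperparameter; the paper's actual mixing strategy and ratio should be taken from the VeCLIP publication itself.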