This paper has been accepted to the UniReps Workshop at NeurIPS 2023.
Contrastive language-image pretraining (CLIP) has become the standard approach for training vision-language models. Despite the utility of CLIP visual features as global image representations, they fall short on tasks involving object localization, pixel-level image understanding, or 3D perception. Multi-task training is a popular solution to this problem, but collecting a large-scale annotated multi-task dataset incurs significant cost. Moreover, training on separate task-specific datasets is also challenging from an optimization and training perspective, because it requires aligning gradients and data coming from different input distributions and tasks. To overcome these shortcomings, we study pseudo-labeling with task-specific experts to improve CLIP features on more challenging downstream tasks. In our approach, we leverage multiple existing open-source pretrained models and use these experts to pseudo-label an uncurated web-scale image-caption dataset. We then train CLIP with a contrastive loss and task-specific losses on the pseudo-labels, computed through lightweight heads that we attach to the vision backbone.
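The training recipe described above can be sketched as follows. This is a hypothetical minimal illustration, not the paper's actual code: the backbone is a stand-in linear layer (a real setup would use a CLIP vision encoder), and the class names, dimensions, and the single classification-style task head are assumptions for demonstration. It shows the key idea of combining a symmetric contrastive loss over image-caption pairs with a task loss computed on expert pseudo-labels through a lightweight head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionBackboneWithHeads(nn.Module):
    """Hypothetical sketch: a vision backbone with a contrastive projection
    plus a lightweight task head trained on expert pseudo-labels."""

    def __init__(self, in_dim=128, feat_dim=64, embed_dim=32, num_classes=10):
        super().__init__()
        # Stand-in backbone; in practice this would be the CLIP vision encoder.
        self.backbone = nn.Linear(in_dim, feat_dim)
        self.proj = nn.Linear(feat_dim, embed_dim)        # contrastive projection
        self.task_head = nn.Linear(feat_dim, num_classes)  # lightweight task head

    def forward(self, x):
        feats = self.backbone(x)
        return self.proj(feats), self.task_head(feats)

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over image->text and text->image directions."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0))  # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# One combined training step on a toy batch.
torch.manual_seed(0)
model = VisionBackboneWithHeads()
x = torch.randn(8, 128)                      # batch of image inputs
txt_emb = torch.randn(8, 32)                 # paired caption embeddings
pseudo_labels = torch.randint(0, 10, (8,))   # labels from a pretrained expert

img_emb, task_logits = model(x)
loss = (clip_contrastive_loss(img_emb, txt_emb)
        + F.cross_entropy(task_logits, pseudo_labels))
loss.backward()  # gradients flow into both heads and the shared backbone
```

In the full method, one such task loss is added per expert (e.g. for segmentation or depth), each through its own lightweight head, so the shared backbone receives gradients from the contrastive objective and all pseudo-label objectives at once.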