Abstract: The pretrained vision-language models (VLMs) have achieved outstanding performance in various visual tasks, primarily due to the knowledge they have acquired from massive image-text pairs.
Some results have been hidden because they may be inaccessible to you
Show inaccessible results