Abstract: The pretrained vision-language models (VLMs) have achieved outstanding performance in various visual tasks, primarily due to the knowledge they have acquired from massive image-text pairs.