This paper was accepted at the workshop I Can't Believe It's Not Better! (ICBINB) at NeurIPS 2023.
Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, similar to language modeling. However, these methods have yet to leverage pre-trained language models, despite their adaptability to various downstream tasks. In this work, we explore this gap and find that pre-trained language models offer limited help in auto-regressive text-to-image generation. We provide a two-fold explanation by analyzing tokens from each modality. First, we demonstrate that image tokens possess significantly different semantics compared to text tokens, rendering pre-trained language models no more effective in modeling them than randomly initialized ones. Second, the text tokens in image-text datasets are too simple compared to typical language model pre-training data, which lets small randomly initialized language models achieve the same perplexity as larger pre-trained ones and causes catastrophic degradation of the language models' capability.
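To make the setup concrete, below is a minimal sketch (not the paper's implementation) of the auto-regressive text-to-image formulation it studies: a VQ-VAE-style codebook turns an image into discrete tokens, which are appended to the caption tokens and modeled jointly with next-token prediction. All module names, vocabulary sizes, and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

TEXT_VOCAB = 32000   # assumed text vocabulary size
IMAGE_VOCAB = 8192   # assumed VQ-VAE codebook size

class ToyVQEncoder(nn.Module):
    """Stand-in for a VQ-VAE encoder: maps an image to codebook indices."""
    def __init__(self, codebook_size=IMAGE_VOCAB, dim=64):
        super().__init__()
        # 256x256 image -> 16x16 grid of latents -> 256 discrete tokens.
        self.conv = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, images):
        z = self.conv(images)                 # (B, dim, 16, 16)
        z = z.flatten(2).transpose(1, 2)      # (B, 256, dim)
        # Nearest-neighbor lookup against the codebook -> token indices.
        dists = torch.cdist(z, self.codebook.weight.unsqueeze(0))
        return dists.argmin(-1)               # (B, 256) image token ids

class JointARModel(nn.Module):
    """Decoder-only transformer over the concatenated [text; image] sequence.

    The paper's finding is that initializing such a model from a
    pre-trained LM helps little, since image tokens carry very
    different semantics from text tokens.
    """
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        # Shared vocabulary: text ids in [0, TEXT_VOCAB), image ids offset above.
        self.embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, dim)
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, TEXT_VOCAB + IMAGE_VOCAB)

    def forward(self, token_ids):
        x = self.embed(token_ids)
        # Causal mask so each position only attends to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.head(self.blocks(x, mask=mask))

# Toy forward pass: caption tokens followed by image tokens.
text_ids = torch.randint(0, TEXT_VOCAB, (2, 16))
image_ids = ToyVQEncoder()(torch.randn(2, 3, 256, 256)) + TEXT_VOCAB
seq = torch.cat([text_ids, image_ids], dim=1)   # (2, 16 + 256)
logits = JointARModel()(seq[:, :-1])            # predict each next token
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), seq[:, 1:].reshape(-1)
)
```

Under this formulation, the question the paper asks reduces to whether initializing `JointARModel` from pre-trained LM weights lowers the loss on the image-token portion of the sequence; the abstract reports that it does not.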