Although Convolutional Neural Networks (CNNs) have dominated the field of computer vision for years, newer Vision Transformer (ViT) models have shown remarkable capabilities, achieving comparable and even better performance than CNNs on many computer vision tasks. The success of ViTs raises a number of questions: How do ViTs solve these image-based tasks? Do they act like convolutions, learning the same inductive biases from scratch? Or do they develop novel task representations? And what role does scale play in learning these representations?
To find the answers, a new Google Brain article explores the internal representation structures of ViTs and CNNs on image classification tasks, providing surprising insight into the differences between ViTs and CNNs.
The team summarizes its main contributions as follows:
- We study the internal representational structure of ViTs and CNNs, finding striking differences between the two models, such as ViTs having more uniform representations, with greater similarity between the lower and upper layers.
- By analyzing how local/global spatial information is used, we find that ViTs incorporate more global information than ResNets in their lower layers, which leads to quantitatively different features.
- Nonetheless, we find that incorporating local information in the lower layers remains vital, with large-scale pre-training data helping the early attention layers learn to do so.
- We study the uniform internal structure of ViTs, finding that skip connections in ViTs are even more influential than in ResNets, with important effects on performance and representational similarity.
- Motivated by potential future uses in object detection, we examine the extent to which input spatial information is preserved, finding links between spatial localization and classification methods.
- We investigate the effects of dataset scale on transfer learning, with a study of linear probes revealing its importance for high-quality intermediate representations.
Analyzing the layer representations of a neural network is a challenge because the features are distributed over a large number of neurons. This distribution makes it even more difficult to compare representations between different neural networks. To address these problems, previous studies have proposed Centered Kernel Alignment (CKA) to allow quantitative comparisons of representations within and between networks.
In this work, the Google Brain team uses the CKA approach to study each model's internal representational structure, taking every pair of layers within a model and computing their CKA similarity. The results show a clear difference between ViTs and CNNs in this regard. The researchers also make cross-model comparisons across all ViT and ResNet layers.
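As a rough illustration of the technique, the widely used linear variant of CKA between two layers' activation matrices can be sketched as follows (a minimal NumPy sketch, not the paper's implementation; the per-layer activations here are random stand-ins for features extracted from a trained model):

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA similarity between two activation matrices of shape
    (n_examples, n_features); higher means more similar representations."""
    x = x - x.mean(axis=0)  # centre each feature
    y = y - y.mean(axis=0)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    return cross / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro"))

# Toy stand-ins for activations from four layers; a real study would use
# activations recorded from every layer of the model being analyzed.
rng = np.random.default_rng(0)
layers = [rng.normal(size=(128, 64)) for _ in range(4)]

# All-pairs similarity heatmap, the structure the study visualizes
heatmap = np.array([[linear_cka(a, b) for b in layers] for a in layers])
print(np.round(heatmap, 2))  # diagonal entries are 1.0
```

Plotting such a heatmap for every layer pair is what makes the block-like similarity structure of ViTs, versus the staged structure of ResNets, visible at a glance.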
The study reveals that ViTs' lower layers compute representations differently from ResNets' lower layers; that ViTs propagate representations more strongly between their lower and upper layers; and that, compared with ResNets, the highest ViT layers produce very different visual representations.
The team then explores local and global information in layer representations, finding that early use of local information for image tasks (as hard-coded into CNN architectures) is important for good performance. They also find that access to more global information yields features that are quantitatively different from those computed by the local receptive fields in ResNets' lower layers. Further findings are that the effective lower-layer receptive fields of ViTs are larger than those of ResNets, and that ViT receptive fields become much more global midway through the network, while effective ResNet receptive fields grow gradually.
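An effective receptive field of the kind compared here can be probed empirically by back-propagating from a single centre output unit to the input: pixels that receive nonzero gradient lie inside the field. The sketch below (assuming PyTorch and a toy convolutional stack, not the paper's models) shows the idea:

```python
import torch
import torch.nn as nn

# Toy stack of four 3x3 convolutions: theoretical receptive field is 9x9.
torch.manual_seed(0)
net = nn.Sequential(*[nn.Conv2d(1, 1, 3, padding=1) for _ in range(4)])

x = torch.randn(1, 1, 33, 33, requires_grad=True)
net(x)[0, 0, 16, 16].backward()  # gradient of the centre output unit
grad = x.grad.abs()[0, 0]

# Pixels with nonzero gradient form the receptive field; pixels beyond
# the 9x9 theoretical field around (16, 16) get exactly zero gradient.
inside = (grad > 0).nonzero()
print(inside.min().item(), inside.max().item())  # bounding box of the field
```

Averaging such gradient maps over many inputs gives the "effective" receptive field, which is typically much smaller than the theoretical one for deep CNNs but, per the study, grows much faster with depth in ViTs.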
The study also shows that skip connections in ViTs are more influential in representation propagation than those in ResNets and can have a large effect on performance and representational similarity. Regarding the spatial information and localization properties of the two approaches, the researchers find that ViTs with CLS tokens strongly preserve spatial information, suggesting promising potential for future uses in object detection.
Finally, tests of scale effects in transfer learning reveal that larger ViT models develop significantly stronger intermediate representations thanks to larger pre-training datasets.
Overall, the article provides a lot of valuable information on the differences between ViTs and CNNs in computer vision, as well as detailed descriptions of how ViTs solve image classification tasks.
The paper Do Vision Transformers See Like Convolutional Neural Networks? is on arXiv.
Author: Hecate He | Editor: Michael Sarazen, Chain Zhang
We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.