The search for a transformer-based backbone model suitable for vision tasks has been going on for a while now. Multiple architectures have been proposed to solve the different challenges in adopting transformers for computer vision.
A critical challenge that impedes the performance of transformers on vision tasks is the difference in scale variation between image and text data. Modern images are acquired at high resolution, which makes the variation in scale across visual entities even more pronounced.
Convolutional Neural Networks (CNNs) have been the go-to neural networks for solving computer vision problems. CNNs learn through convolutional kernels that slide over the image, producing feature maps that capture the underlying visual patterns.
CNNs handle scale variation using techniques such as downsampling through max-pooling layers and feature pyramids. Feature pyramids let the network process the input at multiple scales, which enables the detection of fine-grained features such as small objects in object detection or pixel-level patterns in image segmentation.
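As a rough illustration (a minimal PyTorch sketch with made-up layer sizes, not tied to any particular architecture), stacking convolution and max-pooling stages yields feature maps at progressively coarser scales, which is the basic idea behind feature-pyramid-style processing:

```python
import torch
import torch.nn as nn

# Toy backbone: each conv + max-pool stage halves the spatial resolution,
# producing feature maps at multiple scales (a crude feature pyramid).
class TinyPyramidBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.stage2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.stage3 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))

    def forward(self, x):
        c1 = self.stage1(x)   # 1/2 resolution
        c2 = self.stage2(c1)  # 1/4 resolution
        c3 = self.stage3(c2)  # 1/8 resolution
        return [c1, c2, c3]   # multi-scale features for detection/segmentation heads

x = torch.randn(1, 3, 224, 224)
for f in TinyPyramidBackbone()(x):
    print(f.shape)  # (1, 16, 112, 112), (1, 32, 56, 56), (1, 64, 28, 28)
```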
When it comes to NLP tasks, the variation in scale across the words of an input sequence is far smaller than in vision tasks. For example, the longest commonly cited English word is 45 letters and the shortest is a single letter.
In computer vision, the size of visual entities can vary from a few pixels to a thousand or more. Clearly, for transformers to solve computer vision tasks efficiently, they must be able to handle this range of scales, something the original transformer was never designed for. This is where the SWIN transformer comes in, forming a bridge between the two domains.
Motivation Behind Shifted Window-Based Transformer
Transformers have changed the landscape of machine learning in the past few years. For NLP, transformers are now the default family of models for tasks ranging from text classification and sentiment analysis to generating answers when prompted with questions.
The advantage of transformers has been their ability to contextualize very long sequences of text with the help of a mechanism called self-attention. The idea is to treat the text as an input sequence and divide it into smaller tokens.
Every token in this sequence is then compared to every other token to find the relationships among them. The model learns to assign higher weights to the tokens that are contextually most relevant to the current token. This is why transformers are so good at capturing global context.
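A minimal sketch of this self-attention computation (illustrative only; the projection matrices and dimensions here are arbitrary): every token's query is compared against every other token's key, and the resulting weights decide how much each token contributes to the current token's new representation.

```python
import torch
import torch.nn.functional as F

def self_attention(tokens, Wq, Wk, Wv):
    """tokens: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_head) projection matrices."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    # Compare every token with every other token: (seq_len, seq_len) score matrix.
    scores = q @ k.T / (k.shape[-1] ** 0.5)
    weights = F.softmax(scores, dim=-1)   # contextual weights per token
    return weights @ v                    # weighted mix of all tokens

d_model, d_head, seq_len = 32, 16, 10
tokens = torch.randn(seq_len, d_model)
Wq, Wk, Wv = (torch.randn(d_model, d_head) for _ in range(3))
print(self_attention(tokens, Wq, Wk, Wv).shape)  # torch.Size([10, 16])
```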
For computer vision tasks, CNNs have been the state-of-the-art models. They can capture deep visual patterns imperceptible even to the human eye; these patterns, known as features, are aggregated by CNNs to solve vision-related tasks.
CNNs rely on convolutional kernels as the feature extractor. These kernels process a local patch of the image at a time, as seen in Figure 1. Because each kernel only sees a small receptive field of the entire image at a time, a CNN's ability to capture the global context of an image is limited.
Transformers, on the other hand, are great at gathering global context and have a much larger learning capacity than traditional CNNs. Because of these strengths, researchers have been exploring ways to bridge the gap between transformers and vision tasks. The Vision Transformer (ViT) is the pioneering work toward this goal.
The vision transformer processes an image by dividing the input image into smaller patches, as seen in Figure 2. These patches are treated the same way as tokens in a text sequence and are called visual tokens.
The patches are fed to the vision transformer as a sequence, and the weights are adjusted by finding the relationships between the patches. Since each patch is compared to every other patch, the computational complexity of the vision transformer grows quadratically with image size, which makes it impractical for high-resolution images. Furthermore, the patches are computed at a single fixed scale, whereas visual entities in an image vary widely in scale. This makes it difficult to apply vision transformers to tasks dealing with small visual entities, such as object detection, or pixel-level patterns, such as image segmentation.
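A quick back-of-the-envelope sketch (plain Python, assuming a hypothetical 16×16 patch size) shows how quickly the all-pairs comparison blows up as resolution grows:

```python
# Token count and attention-matrix size for a ViT-style model with 16x16 patches.
def vit_attention_cost(h, w, patch=16):
    n_tokens = (h // patch) * (w // patch)
    return n_tokens, n_tokens ** 2  # attention matrix has n_tokens^2 entries

for size in (224, 448, 896):
    n, pairs = vit_attention_cost(size, size)
    print(f"{size}x{size}: {n} tokens, {pairs:,} token pairs per attention head")
# Prints 196 / 38,416 for 224x224, 784 / 614,656 for 448x448,
# and 3,136 / 9,834,496 for 896x896: roughly quadratic growth in image area.
```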
The SWIN transformer addresses both of these challenges by processing image tokens hierarchically using shifted windows, with computational complexity that is linear in image size. It processes image tokens at multiple scales while keeping the number of patches within each window fixed.
Layer Architecture
The key contribution of the SWIN transformer is the shifted window mechanism. The solution is two-pronged: it helps the model extract features at variable scales, and it restricts the computational complexity to grow only linearly with image size. The shifted window mechanism divides the image into windows, and each window is further divided into patches.
Each window is divided into patches and fed to the model in the same way the vision transformer processes the entire input image. The self-attention block then computes the query-key attention weights only among the patches within each window.
This helps the model emphasize smaller-scale features and keeps the computation per window bounded, but it raises another challenge: since the relationships between patches are only computed within a window, the self-attention mechanism cannot capture the global context that is a key strength of transformers. This is where the shifting of the windows comes in.
In the first layer, the image is divided into windows, denoted in the figure by red boxes. Each window is further divided into patches, denoted by gray boxes.
In the second layer, the window grid is shifted as depicted, so that the new windows overlap the windows of the previous layer. Note that the number of patches per window remains fixed, which keeps the computational complexity linear in image size.
Moreover, the overlap ensures that patches that could not attend to each other in the previous layer, because they fell in different windows, can do so in the current layer. Global context is thus captured through these connections bridged across layers.
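A minimal sketch of the window-partitioning step (our own helper, loosely modeled on common SWIN implementations): the feature map is cut into non-overlapping M×M windows, and self-attention is then computed independently inside each window.

```python
import torch

def window_partition(x, window_size):
    """Split a feature map into non-overlapping windows.
    x: (B, H, W, C) -> (num_windows * B, window_size, window_size, C)"""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

feat = torch.randn(1, 8, 8, 96)          # an 8x8 feature map with C = 96 channels
print(window_partition(feat, 4).shape)   # (4, 4, 4, 96): a 2x2 grid of 4x4 windows
```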
First, the input RGB image is split into non-overlapping patches. Each patch is treated as a token whose feature set is the concatenation of its raw pixel values; with a patch size of 4×4, the feature dimension is 4×4×3 = 48. A linear embedding layer then projects each patch to an embedding token of dimension C. These tokens are fed through several SWIN transformer blocks, which maintain the number of tokens at H/4 × W/4. This first set of blocks is called stage 1.
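A minimal patch-embedding sketch of this step (a simplified stand-in; real implementations typically fuse the split and projection into a single strided convolution):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into 4x4 patches and linearly embed each one to dimension C."""
    def __init__(self, patch_size=4, in_chans=3, embed_dim=96):
        super().__init__()
        self.patch_size = patch_size
        # 4*4*3 = 48 raw pixel values per patch -> C-dimensional token
        self.proj = nn.Linear(patch_size * patch_size * in_chans, embed_dim)

    def forward(self, x):                      # x: (B, 3, H, W)
        B, C, H, W = x.shape
        p = self.patch_size
        x = x.view(B, C, H // p, p, W // p, p)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(B, (H // p) * (W // p), p * p * C)
        return self.proj(x)                    # (B, H/4 * W/4, embed_dim)

img = torch.randn(1, 3, 224, 224)
print(PatchEmbed(embed_dim=96)(img).shape)     # torch.Size([1, 3136, 96]) = (1, 56*56, C)
```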
Shifted Window Multi-Head Self Attention
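For an h×w feature map of tokens with embedding dimension C and window size M, the original paper gives the computational complexity of a global multi-head self-attention (MSA) module and of a window-based module (W-MSA) as:

Ω(MSA) = 4hwC² + 2(hw)²C
Ω(W-MSA) = 4hwC² + 2M²hwC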
It can be seen from these expressions that multi-head self-attention (MSA) grows quadratically with the spatial size h×w, whereas window-based multi-head self-attention (W-MSA) grows only linearly with h×w when the window size M is held constant. This makes the SWIN transformer far more suitable for training on high-resolution images.
The downside of window-based self-attention is the lack of attention connections outside the local window, which limits the model's ability to capture global context. SWIN tackles this problem by shifting the windows with respect to each other in successive blocks. Two window-partitioning configurations are alternated between consecutive blocks of a stage.
The first block partitions the feature map regularly starting from the top-left pixel; for example, an 8×8 feature map is divided into a 2×2 grid of windows of size 4×4 (M = 4). In the next block, the windows are shifted by (M/2, M/2) pixels with respect to the windows of the previous block. This shifting introduces cross-window connections, which help capture the global relationships between patches across windows. The processing performed by two consecutive blocks can be summarized by the following equations.
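Following the notation of the original paper, where the first block uses regular windows and the second uses shifted windows:

ẑ^l = W-MSA(LN(z^(l−1))) + z^(l−1)
z^l = MLP(LN(ẑ^l)) + ẑ^l
ẑ^(l+1) = SW-MSA(LN(z^l)) + z^l
z^(l+1) = MLP(LN(ẑ^(l+1))) + ẑ^(l+1)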
Here ẑ^l and z^l denote the output features of the (S)W-MSA module and the MLP module of block l, respectively, and z^l and z^(l+1) are the outputs of two consecutive SWIN transformer blocks. One issue with shifted window partitioning is visible in Figure 4: the second configuration produces more windows than the first, and some of those windows are smaller than M×M.
The SWIN transformer solves this problem with a cyclic shift, in which the feature map is rolled toward the top-left so that the leftover fringe regions wrap around and are batched together into full-size windows, with a mask preventing non-adjacent regions from attending to one another. This cyclic shift is depicted in figure ___
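A minimal sketch of the cyclic shift (using torch.roll, as common SWIN implementations do; the attention mask mentioned above is omitted here for brevity):

```python
import torch

def cyclic_shift(x, shift_size):
    """Roll the feature map toward the top-left so that the leftover border
    regions wrap around and can be grouped into full-size windows.
    x: (B, H, W, C)"""
    return torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))

def reverse_cyclic_shift(x, shift_size):
    """Undo the shift after windowed attention has been computed."""
    return torch.roll(x, shifts=(shift_size, shift_size), dims=(1, 2))

feat = torch.randn(1, 8, 8, 96)
shifted = cyclic_shift(feat, shift_size=2)        # M=4 windows shifted by M/2 = 2
print(torch.equal(reverse_cyclic_shift(shifted, 2), feat))  # True
```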
Results and Experiments
The performance of the SWIN transformer has been analyzed and compared against state-of-the-art CNNs and vision transformers across the mainstream computer vision tasks of image classification, object detection, and image segmentation.
There are four variants of the SWIN transformer architecture, which differ in the number of layers and in the embedding dimension C of the token sequence after linear projection. The variants are summarized below.
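As specified in the original paper (C is the embedding dimension of stage 1, and the braces give the number of SWIN blocks in each of the four stages):

- SWIN-T: C = 96, layers {2, 2, 6, 2}
- SWIN-S: C = 96, layers {2, 2, 18, 2}
- SWIN-B: C = 128, layers {2, 2, 18, 2}
- SWIN-L: C = 192, layers {2, 2, 18, 2}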
For image classification, the SWIN transformer has been evaluated against state-of-the-art CNNs such as RegNet and EfficientNet, and against transformers such as ViT and DeIT. The benchmark used for this comparison is the widely accepted ImageNet-1K dataset, which has 1.28 million training images and 50k validation images across 1,000 classes. SWIN-T outperforms the DeIT-S architecture by about 1.5% top-1 accuracy at 224×224 input resolution.
For object detection, the COCO 2017 benchmark is used to judge the performance of SWIN against other state-of-the-art backbones such as ConvNeXt and DeIT.
The framework used to predict bounding boxes with these backbones is Cascade Mask R-CNN. SWIN-B achieves a box average precision of 51.9 AP, a 3.6-point gain over ResNeXt101-64x4d, while SWIN-T outperforms the similarly sized DeIT-S by 2.5 box AP points.
For image segmentation, the ADE20K benchmark is used, which covers 150 semantic classes across 20k training images, 2k validation images, and 3k testing images. SWIN is compared with DeIT and ResNet-101 backbones. SWIN-S scores 5.3 mIoU points higher than DeIT-S at a similar computation cost, and it also outperforms ResNet-101 by 4.4 mIoU points.
Conclusion: Vision Transformers for Cyber Defense
Vision transformers are an emerging approach to computer vision tasks, motivated by the immense success of transformers in the NLP domain. However, they are impeded by some fundamental issues arising from the differences between text data and image data.
Scale variation across visual entities is much higher in image data. Vision transformers also use the same fixed-scale token strategy as transformers in the NLP domain, which is unsuitable for extracting feature patterns of small-scale visual entities. In addition, their computational complexity grows quadratically with image size, making them impractical for high-resolution image data.
SWIN transformers solve these problems by breaking images into windows, which are further broken down into smaller patches. The patches are fed as an input sequence, and self-attention connections are computed only locally within each window. Connections to patches in other windows are achieved by shifting the windows in the next block, allowing information to flow across windows from one block to the next.
Between stages, neighboring patches are merged to reduce the number of tokens, creating hierarchical feature maps. Since the number of patches per window stays fixed, the computational complexity grows only linearly with image size, making SWIN more suitable than a traditional vision transformer.
To learn more about Bolster’s Research Team, and to see how generative AI can help protect your business from cyber attacks, speak with the Bolster team today!