A complete implementation of the Vision Transformer (ViT) architecture in pure PyTorch, trained on CIFAR-10.
The Vision Transformer (ViT) revolutionized computer vision by applying the standard Transformer architecture directly to images, with the fewest possible modifications.
In this project, I built a ViT from the ground up in PyTorch without relying on pre-built high-level libraries. Key components implemented include Patch Embeddings, the CLS Token, and Multi-Head Self-Attention.
Instead of processing individual pixels, we split the image into fixed-size patches and linearly project each patch to an embedding vector. Both steps are efficiently implemented with a single Conv2d layer whose kernel size and stride equal the patch size.
```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, in_channels=3, patch_size=4, embed_dim=128):
        super().__init__()
        self.patch_size = patch_size
        # kernel_size == stride == patch_size splits the image into
        # non-overlapping patches and projects each one to embed_dim.
        self.conv = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size,
                              stride=patch_size)

    def forward(self, x):
        # x: (B, C, H, W) -> (B, embed_dim, Grid_H, Grid_W)
        x = self.conv(x)
        # Flatten the spatial grid -> (B, embed_dim, N_patches)
        x = x.flatten(2)
        # Transpose -> (B, N_patches, embed_dim)
        return x.transpose(1, 2)
```
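As a quick sanity check (a minimal sketch assuming CIFAR-10's 32×32 inputs and the default patch size of 4), the module should produce 8 × 8 = 64 patch tokens:

```python
patcher = PatchEmbedding(in_channels=3, patch_size=4, embed_dim=128)
dummy = torch.randn(2, 3, 32, 32)   # a batch of two CIFAR-10-sized images
tokens = patcher(dummy)
print(tokens.shape)                  # torch.Size([2, 64, 128]) -> (B, N_patches, embed_dim)
```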
I also experimented with a CNN-ViT hybrid stem, which introduces a convolutional inductive bias for local texture recognition before the global transformer layers; a sketch of this idea follows below.
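The sketch below illustrates what such a hybrid stem could look like; the depth and channel counts here are illustrative assumptions, not the exact configuration from the notebook. A small stack of convolutions extracts local features, and the patch projection is then applied to the resulting feature map instead of the raw pixels.

```python
class HybridStem(nn.Module):
    """Small convolutional stem followed by the patch projection (illustrative sketch)."""
    def __init__(self, in_channels=3, stem_channels=64, patch_size=4, embed_dim=128):
        super().__init__()
        # Local feature extractor: conv -> norm -> ReLU, repeated twice (assumed depth).
        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, stem_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(stem_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(stem_channels, stem_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(stem_channels),
            nn.ReLU(inplace=True),
        )
        # Patchify the conv feature map rather than the raw image.
        self.patch_embed = PatchEmbedding(in_channels=stem_channels,
                                          patch_size=patch_size,
                                          embed_dim=embed_dim)

    def forward(self, x):
        return self.patch_embed(self.stem(x))   # (B, N_patches, embed_dim)
```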
A learnable [CLS] token is prepended to the patch sequence, and since self-attention is permutation-invariant, learnable positional embeddings are added to all tokens to retain spatial information.
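A minimal sketch of how these pieces could fit together (the module name and the `num_patches` parameter are illustrative assumptions): the [CLS] token is prepended to the patch tokens, and a learnable positional embedding covering all positions, including the [CLS] slot, is added.

```python
class ViTEmbedding(nn.Module):
    """Patch tokens + [CLS] token + learnable positional embeddings (illustrative sketch)."""
    def __init__(self, num_patches=64, embed_dim=128):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        nn.init.trunc_normal_(self.cls_token, std=0.02)
        nn.init.trunc_normal_(self.pos_embed, std=0.02)

    def forward(self, patch_tokens):
        # patch_tokens: (B, N_patches, embed_dim)
        B = patch_tokens.size(0)
        cls = self.cls_token.expand(B, -1, -1)       # (B, 1, embed_dim)
        x = torch.cat([cls, patch_tokens], dim=1)    # (B, N_patches + 1, embed_dim)
        return x + self.pos_embed                    # inject positional information
```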
*Note:* Training on small datasets like CIFAR-10 is challenging for pure ViTs without large-scale pre-training, since they lack the built-in inductive biases of CNNs.
The full implementation is available as a Jupyter Notebook.
View on Colab