Technical Deep Dive

VectorVibe

When ML Meets Color Science: Building an AI-Powered Photo Curator

🎯 The Problem Space

Imagine having 1000+ photos on your phone. Every day, you want to display a small collection (3-9 photos) that share a cohesive “vibe” — similar colors, themes, and visual harmony.

Challenge: How do you define “vibe” mathematically? How do you search through thousands of images efficiently?

Answer: Multi-modal similarity search combining perceptual color science with deep learning embeddings.

πŸ—οΈ System Architecture

┌──────────────────────────────────────────────────────────────────┐
│                     INPUT: Photo Collection                      │
└───────────────────────────────┬──────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│                     FEATURE EXTRACTION LAYER                     │
│  ┌────────────────────┐            ┌────────────────────┐        │
│  │   Color Analysis   │            │  CLIP Embeddings   │        │
│  │  (CIELAB k-means)  │            │     (ViT-B/32)     │        │
│  │   5 colors/image   │            │   512-dim vector   │        │
│  └────────────────────┘            └────────────────────┘        │
└───────────────────────────────┬──────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│                      SIMILARITY COMPUTATION                      │
│       ┌────────────────────────────────────────────┐             │
│       │ Fused Distance = α·Dcolor + (1-α)·Dclip    │             │
│       │ Where α = 0.6 (60% color, 40% semantic)    │             │
│       └────────────────────────────────────────────┘             │
└───────────────────────────────┬──────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│                      VECTOR SEARCH (FAISS)                       │
│                sub-linear approximate k-NN search                │
└───────────────────────────────┬──────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│                         2D LAYOUT (UMAP)                         │
│         Dimensionality reduction for spatial positioning         │
└───────────────────────────────┬──────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│                     OUTPUT: Curated Gallery                      │
│                3-9 images with unified aesthetic                 │
└──────────────────────────────────────────────────────────────────┘

🔬 Technical Deep Dive

1. Content-Addressable Storage

image_id = SHA1(file_contents)

Why? Same image = same hash, so duplicates are detected for free. SHA-1's collision resistance is no longer cryptographically sound (the 2017 SHAttered attack produced deliberate collisions), but accidental collisions remain astronomically unlikely, which is all content addressing needs. The same scheme is used by Git, Docker, and IPFS.

Time Complexity: O(n) where n = file size
Space: 160-bit (20-byte) identifier
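The whole scheme fits in a few lines of Python's standard hashlib; the function name here is illustrative, not the project's actual API:

```python
import hashlib

def image_id(file_bytes: bytes) -> str:
    """Content-addressed ID: identical bytes always yield the same
    40-hex-character (160-bit) identifier."""
    return hashlib.sha1(file_bytes).hexdigest()

a = image_id(b"fake image bytes")
b = image_id(b"fake image bytes")   # same content -> same ID
c = image_id(b"fake image bytez")   # one byte differs -> unrelated ID
```

Because the ID is derived from content rather than filename, re-importing the same photo under a new name is a no-op.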

2. Color Feature Extraction (CIELAB + K-Means)

RGB is NOT perceptually uniform! A distance of 10 units might be noticeable in one part of RGB space but invisible in another.

Enter CIELAB: A perceptually uniform color space where Euclidean distance ≈ human-perceived color difference.

RGB Color Space          CIELAB Color Space
(Device-dependent)       (Perceptually uniform)

    G                          a* (green ← → red)
    │                               ↑
    │                               │
    │                               │
    └──── R                        └────→ b* (blue ← → yellow)
   /                               /
  B                               L* (lightness)

❌ Distance ≠ Perception      ✅ Distance = Perception

K-Means Algorithm:

  1. Initialize k=5 random centers in LAB space
  2. Assign each pixel to nearest center (Euclidean)
  3. Recompute centers as mean of assigned pixels
  4. Repeat until convergence (~10-20 iterations)

Complexity: O(n · k · i · d)
n = 262,144 pixels (512×512)
k = 5 clusters
i = ~15 iterations
d = 3 dimensions (L, a, b)
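The four steps map almost line-for-line onto NumPy. This is a toy with random init and a fixed iteration cap, standing in for the project's scikit-learn call, and it assumes the pixels have already been converted to LAB triples:

```python
import numpy as np

def kmeans_palette(pixels: np.ndarray, k: int = 5, iters: int = 15,
                   seed: int = 0) -> np.ndarray:
    """Cluster (n, 3) LAB pixels into a k-color palette."""
    rng = np.random.default_rng(seed)
    # 1. Initialize k centers by sampling k distinct pixels
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)]
    for _ in range(iters):
        # 2. Assign each pixel to its nearest center (Euclidean in LAB)
        dists = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each center as the mean of its assigned pixels
        for j in range(k):
            if (labels == j).any():
                centers[j] = pixels[labels == j].mean(axis=0)
        # 4. Loop caps at `iters` instead of testing convergence
    return centers

# Toy "image": two tight LAB clusters, clustered with k=2
pix = np.vstack([np.random.default_rng(1).normal([50, 0, 0], 1, (100, 3)),
                 np.random.default_rng(2).normal([80, 20, -20], 1, (100, 3))])
palette = kmeans_palette(pix, k=2)
```

In the real pipeline the cluster sizes also matter (dominant colors weigh more); that bookkeeping is omitted here.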

3. Semantic Embeddings (CLIP ViT-B/32)

CLIP (Contrastive Language-Image Pre-training) understands both images AND text. Trained on 400M image-text pairs.

Architecture: Vision Transformer (ViT) that splits the image into 32×32-pixel patches

Input Image (resized to 224×224)
         │
         ├─→ Split into 49 patches of 32×32 pixels
         │
         ├─→ Linear projection to 768-dim
         │
         ├─→ 12 Transformer layers
         │
         └─→ Output: 512-dim embedding vector

This vector captures SEMANTIC meaning:
• "dog" and "puppy" are close
• "beach" and "ocean" are close
• "sunset" and "sunrise" are close

Model Size: ~150M parameters
Inference: ~50ms on CPU per image
Output: 512-dimensional unit vector (L2-normalized)
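Because the embeddings are L2-normalized, cosine similarity reduces to a plain dot product. A small NumPy illustration, with random vectors standing in for real CLIP outputs:

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

rng = np.random.default_rng(42)
# Stand-ins for CLIP outputs: "puppy" is "dog" plus a small perturbation;
# "unrelated" is an independent direction in the 512-dim space.
dog = l2_normalize(rng.normal(size=512))
puppy = l2_normalize(dog + 0.1 * rng.normal(size=512))
unrelated = l2_normalize(rng.normal(size=512))

# On unit vectors, cosine similarity == dot product.
sim_close = float(dog @ puppy)
sim_far = float(dog @ unrelated)
```

This is also why inner-product vector search works directly on the embeddings with no extra normalization step at query time.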

4. Fused Similarity Metric

The magic happens here! We combine TWO distance metrics:

Dcolor = CIEDE2000(palette₁, palette₂)

Dsemantic = 1 - cos_similarity(embed₁, embed₂)

Dfused = α · Dcolor + (1-α) · Dsemantic

Where α = 0.6 (tunable: 60% color, 40% meaning)

CIEDE2000: Industry-standard color difference formula. Not just Euclidean distance — accounts for human perception quirks!
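A sketch of the fusion, with two loudly flagged simplifications: palette distance here is plain Euclidean LAB distance (CIE76) over nearest-color matches rather than true CIEDE2000 (which the project gets from colormath), and the color term is divided by 100 so both distances live on a roughly comparable scale (the scale normalization is an assumption, not documented project behavior):

```python
import numpy as np

def palette_distance(p1: np.ndarray, p2: np.ndarray) -> float:
    """Mean nearest-color distance between two (k, 3) LAB palettes.
    Plain Euclidean (CIE76) stands in for CIEDE2000 here."""
    d = np.linalg.norm(p1[:, None, :] - p2[None, :, :], axis=2)
    # Symmetric: average each palette's nearest match in the other
    return float((d.min(axis=1).mean() + d.min(axis=0).mean()) / 2)

def fused_distance(p1, p2, e1, e2, alpha: float = 0.6) -> float:
    d_color = palette_distance(p1, p2) / 100.0  # crude squash: delta-E roughly spans [0, 100]
    d_semantic = 1.0 - float(e1 @ e2)           # embeddings assumed L2-normalized
    return alpha * d_color + (1 - alpha) * d_semantic

# Identical palettes + identical embeddings -> distance 0
p = np.array([[50.0, 0.0, 0.0], [70.0, 10.0, -10.0]])
e = np.ones(4) / 2.0                            # toy unit-norm embedding
d_same = fused_distance(p, p, e, e)
d_diff = fused_distance(p, p + 20.0, e, e)      # shifted palette -> larger distance
```

Raising α pushes curation toward color harmony; lowering it favors subject-matter similarity.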

5. Vector Search with FAISS

FAISS (Facebook AI Similarity Search) — the secret sauce for fast nearest-neighbor search.

Naive Search:          FAISS (IVF):
   O(n·d)                 scans only nprobe of nlist partitions

An IVF index clusters the vectors into nlist partitions; at query time
only the nprobe partitions nearest the query are scanned:

For n=1M images (nlist=1000, nprobe=10):
• Naive: 1,000,000 vectors scanned
• IVF:   ~10,000 vectors scanned (~100x speedup!)

Index Type: Flat (exact search) for <10k images
Fallback: IVF (Inverted File Index) for >10k
Query Time: sub-linear with IVF, O(n) with Flat (both take only
milliseconds at this project's scale)
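The Flat index is exactly this exhaustive scan; a NumPy sketch of exact k-NN over unit vectors makes the O(n) baseline concrete (FAISS's IndexFlatIP returns the same neighbors, just faster and in C++):

```python
import numpy as np

def knn_flat(index_vecs: np.ndarray, query: np.ndarray, k: int):
    """Exact k-NN over L2-normalized vectors: score every row (O(n*d)),
    then take the top-k by inner product (== cosine, since unit-norm)."""
    scores = index_vecs @ query            # n inner products
    top = np.argsort(-scores)[:k]          # indices of the k best scores
    return top, scores[top]

rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 512))
db /= np.linalg.norm(db, axis=1, keepdims=True)
q = db[123]                                # query with a known exact match
ids, sims = knn_flat(db, q, k=5)
```

Since the query is row 123 itself, that row scores 1.0 and ranks first; an IVF index would reach the same answer while scoring only a fraction of the rows.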

6. 2D Layout with UMAP

UMAP (Uniform Manifold Approximation and Projection) — think t-SNE but faster and better at preserving global structure.

Input: 512-dim CLIP embedding + 15-dim color feature (5 LAB colors × 3 channels) = 527-dim

Output: 2D coordinates (x, y) ∈ [0, 1]²

Goal: Preserve local neighborhoods in high-dim space

Why? Similar images cluster together spatially. Creates visually pleasing grid layouts!
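umap-learn's raw 2D output lives on an arbitrary scale, so producing the promised [0, 1]² coordinates takes a final min-max rescale. A sketch of just that post-processing step (the UMAP call itself is omitted; `coords` stands in for its output):

```python
import numpy as np

def to_unit_square(coords: np.ndarray, pad: float = 0.0) -> np.ndarray:
    """Min-max rescale raw (n, 2) UMAP output into [pad, 1-pad]^2."""
    lo, hi = coords.min(axis=0), coords.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against a degenerate axis
    unit = (coords - lo) / span
    return pad + unit * (1 - 2 * pad)

raw = np.array([[-3.2, 10.0], [4.8, -2.5], [0.0, 4.0]])  # fake UMAP output
xy = to_unit_square(raw)
```

The rescale is monotone per axis, so the neighborhood structure UMAP found is preserved exactly; only the frame changes.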

💻 Tech Stack Breakdown

🐍
FastAPI
Modern async Python API
🗄️
SQLModel
Type-safe ORM + SQLite
🧠
PyTorch
Deep learning framework
🔍
FAISS
Vector similarity search
🎨
colormath
CIEDE2000 calculations
📊
scikit-learn
K-means clustering
⚛️
Next.js
React framework (App Router)
🎭
Framer Motion
Smooth animations
🔄
TanStack Query
Server state management

⚡ The Pipeline Flow

📁
Ingest
Hash + Extract
→
🎯
Select
FAISS Search
→
📐
Layout
UMAP 2D
→
📤
Publish
JSON Output
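Stubbed end to end, the four stages chain like this. Every name below is hypothetical (the real entry points live in app.cli); the skeleton only shows what each stage hands to the next:

```python
# Hypothetical skeleton mirroring the four stages above -- names and
# signatures are illustrative, not the project's actual API.

def ingest(paths):           # Hash + Extract: id -> features
    return {f"sha1:{i}": ("palette", "embedding") for i, _ in enumerate(paths)}

def select(features, k=9):   # FAISS search would rank by fused distance
    return list(features)[:k]

def layout(picks):           # UMAP would place each pick in [0, 1]^2
    return {p: (0.5, 0.5) for p in picks}

def publish(picks, coords):  # JSON consumed by the Next.js frontend
    return {"images": [{"id": p, "x": coords[p][0], "y": coords[p][1]}
                       for p in picks]}

picks = select(ingest(["a.jpg", "b.jpg"]))
gallery = publish(picks, layout(picks))
```

Each stage is pure with respect to its inputs, which is what makes the daily run idempotent and easy to cache.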

CLI Commands

# Run full pipeline
python -m app.cli run-daily

# Ingest new images
python -m app.cli ingest

# Compare two images (debug)
python -m app.cli explain <id1> <id2>

# Database statistics
python -m app.cli stats

📊 By The Numbers

Lines of Code:    ~3,500
Backend Files:    18 Python files
Frontend Files:   14 TypeScript files
Dependencies:     13 (backend) + 12 (frontend)

Model Stats:
├─ CLIP ViT-B/32:       ~150M parameters
├─ Embedding dim:       512
├─ Color palette:       5 colors × 3 channels (LAB)
├─ Feature vector:      527 dimensions total
└─ Search complexity:   sub-linear with FAISS (IVF)

Inference Time (CPU):
├─ CLIP embedding:      ~50ms per image
├─ Color extraction:    ~30ms per image
├─ FAISS k-NN query:    ~2ms for k=50
└─ Total pipeline:      ~100-200ms per image

🎓 CS Concepts In Action

  • ✅ Cryptographic Hashing — SHA-1 for content addressing
  • ✅ Unsupervised Learning — K-means clustering
  • ✅ Transfer Learning — Pre-trained CLIP model
  • ✅ Attention Mechanism — Vision Transformer architecture
  • ✅ Vector Similarity — Cosine distance in embedding space
  • ✅ Approximate Search — FAISS indexing strategies
  • ✅ Dimensionality Reduction — UMAP manifold learning
  • ✅ Perceptual Metrics — CIELAB + CIEDE2000
  • ✅ RESTful APIs — FastAPI async endpoints
  • ✅ Type Safety — Pydantic schemas + TypeScript
  • ✅ Reactive UI — React Query + Framer Motion

🤔 Why This Matters

This project demonstrates:

  1. Multi-modal AI: Combining color science (classical CV) with deep learning (CLIP)
  2. Production ML: Efficient inference, caching, and vector search at scale
  3. Human-Centered Design: Perceptual color spaces, not raw RGB
  4. Privacy First: 100% local, no cloud APIs, EXIF stripping
  5. Full Stack: Python backend + TypeScript frontend + real-time updates

It's a real-world application of algorithms you'd study in ML, Computer Vision, and Systems Design courses!

Want to see it in action?

Experience the vibe curator live on the portfolio!

Rishabh Goenka