Technical Deep Dive

VectorVibe

When ML Meets Color Science: Building an AI-Powered Photo Curator

🎯 The Problem Space

Imagine having 1000+ photos on your phone. Every day, you want to display a small collection (3-9 photos) that share a cohesive “vibe” — similar colors, themes, and visual harmony.

Challenge: How do you define “vibe” mathematically? How do you search through thousands of images efficiently?

Answer: Multi-modal similarity search combining perceptual color science with deep learning embeddings.

πŸ—οΈ System Architecture

┌──────────────────────────────────────────────────────────────────┐
│                     INPUT: Photo Collection                      │
└───────────────────────────────┬──────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│                     FEATURE EXTRACTION LAYER                     │
│  ┌────────────────────┐            ┌────────────────────┐        │
│  │   Color Analysis   │            │  CLIP Embeddings   │        │
│  │  (CIELAB k-means)  │            │     (ViT-B/32)     │        │
│  │   5 colors/image   │            │   512-dim vector   │        │
│  └────────────────────┘            └────────────────────┘        │
└───────────────────────────────┬──────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│                      SIMILARITY COMPUTATION                      │
│       ┌────────────────────────────────────────────┐             │
│       │ Fused Distance = α·Dcolor + (1-α)·Dclip    │             │
│       │ Where α = 0.6 (60% color, 40% semantic)    │             │
│       └────────────────────────────────────────────┘             │
└───────────────────────────────┬──────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│                      VECTOR SEARCH (FAISS)                       │
│                sub-linear approximate k-NN search                │
└───────────────────────────────┬──────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│                         2D LAYOUT (UMAP)                         │
│         Dimensionality reduction for spatial positioning         │
└───────────────────────────────┬──────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│                     OUTPUT: Curated Gallery                      │
│                3-9 images with unified aesthetic                 │
└──────────────────────────────────────────────────────────────────┘

🔬 Technical Deep Dive

1. Content-Addressable Storage

image_id = SHA1(file_contents)

Why? Same image = same hash, so duplicates are detected for free. SHA-1's collision resistance is no longer cryptographically sound (the 2017 SHAttered attack produced deliberate collisions), but accidental collisions remain astronomically unlikely, which is all content addressing needs. The same scheme is used by Git, Docker, and IPFS.

Time Complexity: O(n) where n = file size
Space: 160-bit (20-byte) identifier
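The whole scheme fits in a few lines of Python's standard hashlib; the function name here is illustrative, not the project's actual API:

```python
import hashlib

def image_id(file_bytes: bytes) -> str:
    """Content-addressed ID: identical bytes always yield the same
    40-hex-character (160-bit) identifier."""
    return hashlib.sha1(file_bytes).hexdigest()

a = image_id(b"fake image bytes")
b = image_id(b"fake image bytes")   # same content -> same ID
c = image_id(b"fake image bytez")   # one byte differs -> unrelated ID
```

Because the ID is derived from content rather than filename, re-importing the same photo under a new name is a no-op.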

2. Color Feature Extraction (CIELAB + K-Means)

RGB is NOT perceptually uniform! A distance of 10 units might be noticeable in one part of RGB space but invisible in another.

Enter CIELAB: A perceptually uniform color space where Euclidean distance ≈ human-perceived color difference.

RGB Color Space          CIELAB Color Space
(Device-dependent)       (Perceptually uniform)

    G                          a* (green ← → red)
    │                               ↑
    │                               │
    │                               │
    └──── R                        └────→ b* (blue ← → yellow)
   /                               /
  B                               L* (lightness)

❌ Distance ≠ Perception      ✅ Distance = Perception

K-Means Algorithm:

  1. Initialize k=5 random centers in LAB space
  2. Assign each pixel to nearest center (Euclidean)
  3. Recompute centers as mean of assigned pixels
  4. Repeat until convergence (~10-20 iterations)

Complexity: O(n · k · i · d)
n = 262,144 pixels (512×512)
k = 5 clusters
i = ~15 iterations
d = 3 dimensions (L, a, b)
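The four steps map almost line-for-line onto NumPy. This is a toy with random init and a fixed iteration cap, standing in for the project's scikit-learn call, and it assumes the pixels have already been converted to LAB triples:

```python
import numpy as np

def kmeans_palette(pixels: np.ndarray, k: int = 5, iters: int = 15,
                   seed: int = 0) -> np.ndarray:
    """Cluster (n, 3) LAB pixels into a k-color palette."""
    rng = np.random.default_rng(seed)
    # 1. Initialize k centers by sampling k distinct pixels
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)]
    for _ in range(iters):
        # 2. Assign each pixel to its nearest center (Euclidean in LAB)
        dists = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each center as the mean of its assigned pixels
        for j in range(k):
            if (labels == j).any():
                centers[j] = pixels[labels == j].mean(axis=0)
        # 4. Loop caps at `iters` instead of testing convergence
    return centers

# Toy "image": two tight LAB clusters, clustered with k=2
pix = np.vstack([np.random.default_rng(1).normal([50, 0, 0], 1, (100, 3)),
                 np.random.default_rng(2).normal([80, 20, -20], 1, (100, 3))])
palette = kmeans_palette(pix, k=2)
```

In the real pipeline the cluster sizes also matter (dominant colors weigh more); that bookkeeping is omitted here.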

3. Semantic Embeddings (CLIP ViT-B/32)

CLIP (Contrastive Language-Image Pre-training) understands both images AND text. Trained on 400M image-text pairs.

Architecture: Vision Transformer (ViT) that splits the image into 32×32-pixel patches

Input Image (resized to 224×224)
         │
         ├─→ Split into 49 patches of 32×32 pixels
         │
         ├─→ Linear projection to 768-dim
         │
         ├─→ 12 Transformer layers
         │
         └─→ Output: 512-dim embedding vector

This vector captures SEMANTIC meaning:
• "dog" and "puppy" are close
• "beach" and "ocean" are close
• "sunset" and "sunrise" are close

Model Size: ~150M parameters
Inference: ~50ms on CPU per image
Output: 512-dimensional unit vector (L2-normalized)
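Because the embeddings are L2-normalized, cosine similarity reduces to a plain dot product. A small NumPy illustration, with random vectors standing in for real CLIP outputs:

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

rng = np.random.default_rng(42)
# Stand-ins for CLIP outputs: "puppy" is "dog" plus a small perturbation;
# "unrelated" is an independent direction in the 512-dim space.
dog = l2_normalize(rng.normal(size=512))
puppy = l2_normalize(dog + 0.1 * rng.normal(size=512))
unrelated = l2_normalize(rng.normal(size=512))

# On unit vectors, cosine similarity == dot product.
sim_close = float(dog @ puppy)
sim_far = float(dog @ unrelated)
```

This is also why inner-product vector search works directly on the embeddings with no extra normalization step at query time.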

4. Fused Similarity Metric

The magic happens here! We combine TWO distance metrics:

Dcolor = CIEDE2000(palette₁, palette₂)

Dsemantic = 1 - cos_similarity(embed₁, embed₂)

Dfused = α · Dcolor + (1-α) · Dsemantic

Where α = 0.6 (tunable: 60% color, 40% meaning)

CIEDE2000: Industry-standard color difference formula. Not just Euclidean distance — accounts for human perception quirks!
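A sketch of the fusion, with two loudly flagged simplifications: palette distance here is plain Euclidean LAB distance (CIE76) over nearest-color matches rather than true CIEDE2000 (which the project gets from colormath), and the color term is divided by 100 so both distances live on a roughly comparable scale (the scale normalization is an assumption, not documented project behavior):

```python
import numpy as np

def palette_distance(p1: np.ndarray, p2: np.ndarray) -> float:
    """Mean nearest-color distance between two (k, 3) LAB palettes.
    Plain Euclidean (CIE76) stands in for CIEDE2000 here."""
    d = np.linalg.norm(p1[:, None, :] - p2[None, :, :], axis=2)
    # Symmetric: average each palette's nearest match in the other
    return float((d.min(axis=1).mean() + d.min(axis=0).mean()) / 2)

def fused_distance(p1, p2, e1, e2, alpha: float = 0.6) -> float:
    d_color = palette_distance(p1, p2) / 100.0  # crude squash: delta-E roughly spans [0, 100]
    d_semantic = 1.0 - float(e1 @ e2)           # embeddings assumed L2-normalized
    return alpha * d_color + (1 - alpha) * d_semantic

# Identical palettes + identical embeddings -> distance 0
p = np.array([[50.0, 0.0, 0.0], [70.0, 10.0, -10.0]])
e = np.ones(4) / 2.0                            # toy unit-norm embedding
d_same = fused_distance(p, p, e, e)
d_diff = fused_distance(p, p + 20.0, e, e)      # shifted palette -> larger distance
```

Raising α pushes curation toward color harmony; lowering it favors subject-matter similarity.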

5. Vector Search with FAISS

FAISS (Facebook AI Similarity Search) — the secret sauce for fast nearest-neighbor search.

Naive Search:          FAISS (IVF):
   O(n·d)                 scans only nprobe of nlist partitions

An IVF index clusters the vectors into nlist partitions; at query time
only the nprobe partitions nearest the query are scanned:

For n=1M images (nlist=1000, nprobe=10):
• Naive: 1,000,000 vectors scanned
• IVF:   ~10,000 vectors scanned (~100x speedup!)

Index Type: Flat (exact search) for <10k images
Fallback: IVF (Inverted File Index) for >10k
Query Time: sub-linear with IVF, O(n) with Flat (both take only
milliseconds at this project's scale)
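The Flat index is exactly this exhaustive scan; a NumPy sketch of exact k-NN over unit vectors makes the O(n) baseline concrete (FAISS's IndexFlatIP returns the same neighbors, just faster and in C++):

```python
import numpy as np

def knn_flat(index_vecs: np.ndarray, query: np.ndarray, k: int):
    """Exact k-NN over L2-normalized vectors: score every row (O(n*d)),
    then take the top-k by inner product (== cosine, since unit-norm)."""
    scores = index_vecs @ query            # n inner products
    top = np.argsort(-scores)[:k]          # indices of the k best scores
    return top, scores[top]

rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 512))
db /= np.linalg.norm(db, axis=1, keepdims=True)
q = db[123]                                # query with a known exact match
ids, sims = knn_flat(db, q, k=5)
```

Since the query is row 123 itself, that row scores 1.0 and ranks first; an IVF index would reach the same answer while scoring only a fraction of the rows.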

6. 2D Layout with UMAP

UMAP (Uniform Manifold Approximation and Projection) — think t-SNE but faster and better at preserving global structure.

Input: 512-dim CLIP embedding + 15-dim color feature (5 LAB colors × 3 channels) = 527-dim

Output: 2D coordinates (x, y) ∈ [0, 1]²

Goal: Preserve local neighborhoods in high-dim space

Why? Similar images cluster together spatially. Creates visually pleasing grid layouts!
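umap-learn's raw 2D output lives on an arbitrary scale, so producing the promised [0, 1]² coordinates takes a final min-max rescale. A sketch of just that post-processing step (the UMAP call itself is omitted; `coords` stands in for its output):

```python
import numpy as np

def to_unit_square(coords: np.ndarray, pad: float = 0.0) -> np.ndarray:
    """Min-max rescale raw (n, 2) UMAP output into [pad, 1-pad]^2."""
    lo, hi = coords.min(axis=0), coords.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against a degenerate axis
    unit = (coords - lo) / span
    return pad + unit * (1 - 2 * pad)

raw = np.array([[-3.2, 10.0], [4.8, -2.5], [0.0, 4.0]])  # fake UMAP output
xy = to_unit_square(raw)
```

The rescale is monotone per axis, so the neighborhood structure UMAP found is preserved exactly; only the frame changes.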

💻 Tech Stack Breakdown

🐍
FastAPI
Modern async Python API
🗄️
SQLModel
Type-safe ORM + SQLite
🧠
PyTorch
Deep learning framework
🔍
FAISS
Vector similarity search
🎨
colormath
CIEDE2000 calculations
📊
scikit-learn
K-means clustering
⚛️
Next.js
React framework (App Router)
🎭
Framer Motion
Smooth animations
🔄
TanStack Query
Server state management

⚡ The Pipeline Flow

📁
Ingest
Hash + Extract
→
🎯
Select
FAISS Search
→
📐
Layout
UMAP 2D
→
📤
Publish
JSON Output
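Stubbed end to end, the four stages chain like this. Every name below is hypothetical (the real entry points live in app.cli); the skeleton only shows what each stage hands to the next:

```python
# Hypothetical skeleton mirroring the four stages above -- names and
# signatures are illustrative, not the project's actual API.

def ingest(paths):           # Hash + Extract: id -> features
    return {f"sha1:{i}": ("palette", "embedding") for i, _ in enumerate(paths)}

def select(features, k=9):   # FAISS search would rank by fused distance
    return list(features)[:k]

def layout(picks):           # UMAP would place each pick in [0, 1]^2
    return {p: (0.5, 0.5) for p in picks}

def publish(picks, coords):  # JSON consumed by the Next.js frontend
    return {"images": [{"id": p, "x": coords[p][0], "y": coords[p][1]}
                       for p in picks]}

picks = select(ingest(["a.jpg", "b.jpg"]))
gallery = publish(picks, layout(picks))
```

Each stage is pure with respect to its inputs, which is what makes the daily run idempotent and easy to cache.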

CLI Commands

# Run full pipeline
python -m app.cli run-daily

# Ingest new images
python -m app.cli ingest

# Compare two images (debug)
python -m app.cli explain <id1> <id2>

# Database statistics
python -m app.cli stats

📊 By The Numbers

Lines of Code:    ~3,500
Backend Files:    18 Python files
Frontend Files:   14 TypeScript files
Dependencies:     13 (backend) + 12 (frontend)

Model Stats:
├─ CLIP ViT-B/32:       ~150M parameters
├─ Embedding dim:       512
├─ Color palette:       5 colors × 3 channels (LAB)
├─ Feature vector:      527 dimensions total
└─ Search complexity:   sub-linear with FAISS (IVF)

Inference Time (CPU):
├─ CLIP embedding:      ~50ms per image
├─ Color extraction:    ~30ms per image
├─ FAISS k-NN query:    ~2ms for k=50
└─ Total pipeline:      ~100-200ms per image

🎓 CS Concepts In Action

  • ✅ Cryptographic Hashing — SHA-1 for content addressing
  • ✅ Unsupervised Learning — K-means clustering
  • ✅ Transfer Learning — Pre-trained CLIP model
  • ✅ Attention Mechanism — Vision Transformer architecture
  • ✅ Vector Similarity — Cosine distance in embedding space
  • ✅ Approximate Search — FAISS indexing strategies
  • ✅ Dimensionality Reduction — UMAP manifold learning
  • ✅ Perceptual Metrics — CIELAB + CIEDE2000
  • ✅ RESTful APIs — FastAPI async endpoints
  • ✅ Type Safety — Pydantic schemas + TypeScript
  • ✅ Reactive UI — React Query + Framer Motion

🤔 Why This Matters

This project demonstrates:

  1. Multi-modal AI: Combining color science (classical CV) with deep learning (CLIP)
  2. Production ML: Efficient inference, caching, and vector search at scale
  3. Human-Centered Design: Perceptual color spaces, not raw RGB
  4. Privacy First: 100% local, no cloud APIs, EXIF stripping
  5. Full Stack: Python backend + TypeScript frontend + real-time updates

It's a real-world application of algorithms you'd study in ML, Computer Vision, and Systems Design courses!

Want to see it in action?

Experience the vibe curator live on the portfolio!

Rishabh Goenka