When ML Meets Color Science: Building an AI-Powered Photo Curator
Imagine having 1000+ photos on your phone. Every day, you want to display a small collection (3-9 photos) that share a cohesive "vibe": similar colors, themes, and visual harmony.
Challenge: How do you define "vibe" mathematically? How do you search through thousands of images efficiently?
Answer: Multi-modal similarity search combining perceptual color science with deep learning embeddings.
┌───────────────────────────────────────────────────┐
│              INPUT: Photo Collection              │
└─────────────────────────┬─────────────────────────┘
                          │
                          ▼
┌───────────────────────────────────────────────────┐
│             FEATURE EXTRACTION LAYER              │
│  ┌─────────────────┐       ┌─────────────────┐    │
│  │ Color Analysis  │       │ CLIP Embeddings │    │
│  │ (CIELAB k-means)│       │   (ViT-B/32)    │    │
│  │ 5 colors/image  │       │ 512-dim vector  │    │
│  └─────────────────┘       └─────────────────┘    │
└─────────────────────────┬─────────────────────────┘
                          │
                          ▼
┌───────────────────────────────────────────────────┐
│              SIMILARITY COMPUTATION               │
│    ┌─────────────────────────────────────────┐    │
│    │ Fused Distance = α·Dcolor + (1-α)·Dclip │    │
│    │ Where α = 0.6 (60% color, 40% semantic) │    │
│    └─────────────────────────────────────────┘    │
└─────────────────────────┬─────────────────────────┘
                          │
                          ▼
┌───────────────────────────────────────────────────┐
│               VECTOR SEARCH (FAISS)               │
│     Approximate k-NN search in sublinear time     │
└─────────────────────────┬─────────────────────────┘
                          │
                          ▼
┌───────────────────────────────────────────────────┐
│                 2D LAYOUT (UMAP)                  │
│ Dimensionality reduction for spatial positioning  │
└─────────────────────────┬─────────────────────────┘
                          │
                          ▼
┌───────────────────────────────────────────────────┐
│              OUTPUT: Curated Gallery              │
│         3-9 images with unified aesthetic         │
└───────────────────────────────────────────────────┘

Every image gets a stable identifier derived from its contents:

image_id = SHA1(file_contents)

Why? Same image = same hash, and the hash's collision resistance makes it practically impossible for two different images to share an ID. The same content-addressing idea underpins Git, Docker, and IPFS.
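A minimal sketch of this content-addressed ID using only Python's standard library (the `image_id` helper and its signature are illustrative, not the project's actual code):

```python
import hashlib

def image_id(path: str) -> str:
    """Content-addressed ID: hash the raw bytes, not the filename."""
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()

# Identical bytes always map to the same 40-char hex ID:
print(hashlib.sha1(b"hello").hexdigest())
# aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d
```

Renaming or moving a file never changes its ID, so re-ingesting the same photo is naturally deduplicated.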
RGB is NOT perceptually uniform! A distance of 10 units might be noticeable in one part of RGB space but invisible in another.
Enter CIELAB: a perceptually uniform color space where Euclidean distance ≈ human-perceived color difference.
RGB Color Space                      CIELAB Color Space
(Device-dependent)                   (Perceptually uniform)

      G                                    a* (green ↔ red)
      │                                    │
      │                                    │
      └───── R                             └────── b* (blue ↔ yellow)
     /                                    /
    B                                    L* (lightness)

Distance ≠ Perception                Distance ≈ Perception

K-Means Algorithm: every pixel is converted to CIELAB and clustered with k-means (k = 5); the cluster centroids become the image's 5-color palette.
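A bare-bones version of that palette extraction, sketched as Lloyd's iterations in NumPy (assuming pixels are already converted to CIELAB; a production pipeline would more likely use a library implementation such as scikit-learn's KMeans):

```python
import numpy as np

def palette_kmeans(pixels: np.ndarray, k: int = 5, iters: int = 20,
                   seed: int = 0) -> np.ndarray:
    """Lloyd's k-means over (n, 3) LAB pixels; returns k centroid colors."""
    rng = np.random.default_rng(seed)
    centroids = pixels[rng.choice(len(pixels), k, replace=False)].copy()
    for _ in range(iters):
        # Assign each pixel to its nearest centroid (Euclidean in LAB).
        d = np.linalg.norm(pixels[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned pixels.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = pixels[labels == j].mean(axis=0)
    return centroids

# Toy "image": two tight color clusters; k=2 typically recovers both.
rng = np.random.default_rng(1)
pix = np.vstack([rng.normal(20, 1, (100, 3)),
                 rng.normal(80, 1, (100, 3))])
pal = palette_kmeans(pix, k=2)
```

Because the clustering runs in CIELAB, centroids that are far apart really do look different, which is exactly what a display palette needs.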
CLIP (Contrastive Language-Image Pre-training) understands both images AND text. Trained on 400M image-text pairs.
Architecture: Vision Transformer (ViT) with 32×32 patches
Input Image (resized to 224×224)
      │
      ├── Split into 32×32 patches (7×7 = 49)
      │
      ├── Linear projection to 768-dim
      │
      ├── 12 Transformer layers
      │
      └── Output: 512-dim embedding vector
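As a sanity check on those numbers: CLIP's standard preprocessing resizes images to 224×224, so a 32×32 patch grid yields 7×7 = 49 patches, plus one class token in the transformer's input sequence:

```python
# ViT-B/32 patch arithmetic (standard CLIP preprocessing resizes to 224x224).
image_size = 224
patch_size = 32

patches_per_side = image_size // patch_size   # 7
num_patches = patches_per_side ** 2           # 49
seq_len = num_patches + 1                     # +1 for the class token -> 50
print(patches_per_side, num_patches, seq_len)
```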
This vector captures SEMANTIC meaning:
• "dog" and "puppy" are close
• "beach" and "ocean" are close
• "sunset" and "sunrise" are close

The magic happens here! We combine TWO distance metrics:
Dcolor = CIEDE2000(palette₁, palette₂)
Dsemantic = 1 - cos_similarity(embed₁, embed₂)
Dfused = α · Dcolor + (1-α) · Dsemantic

Where α = 0.6 (tunable: 60% color, 40% meaning)
CIEDE2000: the industry-standard color-difference formula. Not just Euclidean distance: it adds lightness, chroma, and hue weighting terms that track quirks of human perception.
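The fusion formula itself is easy to sketch. One caveat: the palette term below uses plain Euclidean distance in LAB (ΔE 1976) as a stand-in, since the real pipeline's CIEDE2000 formula is considerably more involved:

```python
import numpy as np

def fused_distance(palette_a, palette_b, embed_a, embed_b, alpha=0.6):
    """Blend a palette distance with a semantic (cosine) distance.

    Stand-in: mean Euclidean distance in LAB (Delta E 1976) between
    matched palette entries; the real pipeline uses CIEDE2000.
    """
    d_color = np.mean(np.linalg.norm(np.asarray(palette_a, float) -
                                     np.asarray(palette_b, float), axis=1))
    a = np.asarray(embed_a, float)
    b = np.asarray(embed_b, float)
    d_semantic = 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return alpha * d_color + (1 - alpha) * d_semantic

# Identical palette and embedding: both terms vanish.
pal = [[50.0, 10.0, -10.0]] * 5
emb = np.zeros(512)
emb[0] = 1.0
print(fused_distance(pal, pal, emb, emb))  # 0.0
```

In practice the two terms live on very different scales (ΔE values can reach ~100 while cosine distance is at most 2), so each is typically normalized before blending, otherwise α does not really mean "60% color".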
FAISS (Facebook AI Similarity Search) is the secret sauce for fast nearest-neighbor search.
Naive Search:      O(n) per query (one distance per indexed vector)
FAISS (IVF + PQ):  sublinear; each query probes only a handful of inverted-list clusters

For n = 1,000 images:
• Naive: 1,000 distance computations
• FAISS: only the vectors in the probed clusters (a small fraction of the index)

For n = 1,000,000 images the gap widens to several orders of magnitude.
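For contrast, the naive O(n) baseline that FAISS replaces fits in a few lines of NumPy; a FAISS IndexIVFPQ avoids most of these distance computations by probing only a few clusters of the index:

```python
import numpy as np

def knn_naive(query: np.ndarray, index: np.ndarray, k: int = 5):
    """Brute-force k-NN: one distance per indexed vector, i.e. O(n) per query."""
    dists = np.linalg.norm(index - query, axis=1)  # n distance computations
    nearest = np.argsort(dists)[:k]
    return nearest, dists[nearest]

rng = np.random.default_rng(0)
index = rng.normal(size=(1000, 8))   # 1,000 "images" as 8-dim toy vectors
query = index[42] + 0.001            # a near-duplicate of vector 42
ids, _ = knn_naive(query, index, k=3)
print(ids[0])  # 42
```

The toy dimensions here are illustrative; the real index holds the 512-dim CLIP embeddings.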
UMAP (Uniform Manifold Approximation and Projection): think t-SNE, but faster and better at preserving global structure.
Input: 512-dim embedding + 15-dim color histogram = 527-dim
Output: 2D coordinates (x, y) ∈ [0, 1]²
Goal: Preserve local neighborhoods in high-dim space
Why? Similar images cluster together spatially. Creates visually pleasing grid layouts!
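The projection itself is a library call (umap-learn's `UMAP(n_components=2).fit_transform`), but squeezing its arbitrary-range output into the [0, 1]² layout square is a simple min-max rescale, sketched here with NumPy:

```python
import numpy as np

def to_unit_square(coords: np.ndarray) -> np.ndarray:
    """Min-max rescale (n, 2) coordinates into [0, 1] per axis."""
    lo = coords.min(axis=0)
    span = coords.max(axis=0) - lo
    span[span == 0] = 1.0  # guard against a degenerate (constant) axis
    return (coords - lo) / span

xy = np.array([[-3.0, 10.0], [5.0, 2.0], [1.0, 6.0]])  # raw UMAP output
print(to_unit_square(xy))
```

Because min-max scaling is monotone per axis, neighbors in the UMAP plane stay neighbors in the final grid.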
# Run full pipeline
python -m app.cli run-daily

# Ingest new images
python -m app.cli ingest

# Compare two images (debug)
python -m app.cli explain <id1> <id2>

# Database statistics
python -m app.cli stats
Lines of Code: ~3,500
Backend Files: 18 Python files
Frontend Files: 14 TypeScript files
Dependencies: 13 (backend) + 12 (frontend)

Model Stats:
├─ CLIP ViT-B/32: ~150M parameters
├─ Embedding dim: 512
├─ Color palette: 5 colors × 3 channels (LAB)
├─ Feature vector: 527 dimensions total
└─ Search complexity: sublinear with FAISS

Inference Time (CPU):
├─ CLIP embedding: ~50ms per image
├─ Color extraction: ~30ms per image
├─ FAISS k-NN query: ~2ms for k=50
└─ Total pipeline: ~100-200ms per image
This project demonstrates a real-world application of algorithms you'd study in ML, Computer Vision, and Systems Design courses: perceptual color science, contrastive embeddings, approximate nearest-neighbor search, and manifold learning.
Experience the vibe curator live on the portfolio!
Rishabh Goenka