SkillsAgg

ask-gallery


Design, build, and deploy Ask Gallery — a semantic photo search system for mobile devices. Use when asked to create image search features using natural language queries, build photo analysis pipelines with VLM/CLIP/OCR/face recognition, design vector search architectures with TiDB/Meilisearch/Kafka/EKS, or implement swappable ML model registries. Triggers on keywords like semantic search, photo search, image captioning, visual question answering, gallery AI, image embedding, multimodal search.

0 stars · 0 forks · 0 installs

Install Command: `npx skills add ecmoce/ask-gallery`
Author: ecmoce
Repository: ecmoce/ask-gallery
Discovered via: github topic
Weekly installs: 0
Quality score: 20/100
Last commit: 2/24/2026

SKILL.md

---
name: ask-gallery
description: Design, build, and deploy Ask Gallery — a semantic photo search system for mobile devices. Use when asked to create image search features using natural language queries, build photo analysis pipelines with VLM/CLIP/OCR/face recognition, design vector search architectures with TiDB/Meilisearch/Kafka/EKS, or implement swappable ML model registries. Triggers on keywords like semantic search, photo search, image captioning, visual question answering, gallery AI, image embedding, multimodal search.
---

# Ask Gallery — Semantic Photo Search System

Build a production-grade system that lets users search phone photos using natural language (e.g., "find receipt photos and calculate expenses", "show beach photos where I appear").

## Architecture Overview

Two pipelines: **Ingestion** (process images) and **Search** (answer queries).

```
User Photo → S3 → Kafka → [VLM | CLIP | OCR | Face | GPS→Place] → TiDB + Meilisearch
User Query → Query Analyzer (LLM) → TiDB Vector + Meilisearch Place Search → RRF Fusion → Re-Ranker → Answer Generator → Response
```

## Ingestion Pipeline

Process each uploaded image through parallel Kafka consumers:

1. **VLM Captioning** — Generate detailed captions + semantic tags
2. **Image Embedding** — Create CLIP vectors for image-to-text matching
3. **Text Embedding** — Embed captions/tags for text-to-text search
4. **OCR** — Extract text from receipts, signs, documents
5. **Face Detection** — Detect and encode faces for person search
6. **GPS + Reverse Geocoding** — GPS → Place names → Meilisearch indexing

Store structured metadata and vectors in TiDB (SQL + vector columns with ADCPE quantization), the place index in Meilisearch, and raw images in S3.

### Ingestion Implementation

```
# Kafka consumer fan-out pattern
For each image-uploaded event:
  1. Store raw image in S3
  2. Fan-out to parallel processors via Kafka topics
  3. Each processor writes results to DB independently
  4. Mark image-processed when all complete
```

Key config: see `scripts/model_config.py` for model selection.
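The fan-out step above can be sketched as a pure routing function: given one image-uploaded event, it produces one message per processor topic, each of which would then be handed to a Kafka producer. This is a minimal sketch — the topic names and event fields are illustrative assumptions, not the skill's actual configuration.

```python
import json

# Hypothetical topic names — the real ones would live in the Kafka topic config.
PROCESSOR_TOPICS = [
    "vlm-captioning",
    "image-embedding",
    "text-embedding",
    "ocr",
    "face-detection",
    "geocoding",
]

def fan_out(event: dict) -> list[tuple[str, bytes]]:
    """Turn one image-uploaded event into one message per processor topic.

    Each (topic, payload) pair would be sent via a Kafka producer; every
    processor consumes its own topic and writes results independently,
    so a failure in one does not block the others.
    """
    payload = json.dumps({
        "image_id": event["image_id"],
        "s3_key": event["s3_key"],
    }).encode("utf-8")
    return [(topic, payload) for topic in PROCESSOR_TOPICS]

messages = fan_out({"image_id": "img-123", "s3_key": "photos/img-123.jpg"})
print(len(messages))  # one message per parallel processor: 6
```

Completion tracking ("mark image-processed when all complete") would sit on top of this, e.g. a per-image counter decremented as each processor reports back.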

## Search Pipeline

1. **Query Analysis** — LLM parses intent, filters (person/location/date), and required post-processing
2. **Hybrid Search** — Parallel: TiDB vector search (ADCPE-compressed) + Meilisearch place search
3. **Reciprocal Rank Fusion (RRF)** — Combine TiDB + Meilisearch scores using rank-based fusion (k=60)
4. **Re-Ranking** — Cross-encoder scores top-K candidates for final precision boost
5. **Answer Generation** — LLM summarizes results based on original query intent
6. **Session Cache** — Redis stores context for follow-up questions (30min TTL)
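Step 3's rank-based fusion is simple enough to show directly. The sketch below implements standard Reciprocal Rank Fusion with the k=60 constant mentioned above; the example IDs are illustrative, not real search results.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over result lists of 1 / (k + rank).

    Documents that rank well in several lists accumulate higher scores,
    so agreement between TiDB vector search and Meilisearch is rewarded.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["img-7", "img-3", "img-9"]   # TiDB vector search order
place_hits = ["img-3", "img-5", "img-7"]    # Meilisearch place search order
print(rrf_fuse([vector_hits, place_hits]))
# → ['img-3', 'img-7', 'img-5', 'img-9']
```

Note that RRF needs only ranks, not raw scores, which is why it works across engines whose scoring scales are incomparable.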

### Query Analysis Prompt Pattern

```
Given user query, extract:
- intent: find | compare | calculate | summarize | rank
- entities: person names, locations, objects, dates
- filters: date_range, face_id, location_bbox, object_type
- post_processing: ocr_read | count | rank_aesthetic | calculate_sum
```
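The analyzer's reply has to be validated before it drives the search. A minimal sketch, assuming the LLM is asked to answer in JSON with the field names from the pattern above (the `QueryPlan` type and validation logic are illustrative, not part of the skill's codebase):

```python
import json
from dataclasses import dataclass, field

VALID_INTENTS = {"find", "compare", "calculate", "summarize", "rank"}

@dataclass
class QueryPlan:
    intent: str
    entities: dict = field(default_factory=dict)
    filters: dict = field(default_factory=dict)
    post_processing: list = field(default_factory=list)

def parse_query_plan(llm_reply: str) -> QueryPlan:
    """Validate the analyzer's JSON reply against the prompt schema above."""
    data = json.loads(llm_reply)
    if data["intent"] not in VALID_INTENTS:
        raise ValueError(f"unknown intent: {data['intent']}")
    return QueryPlan(
        intent=data["intent"],
        entities=data.get("entities", {}),
        filters=data.get("filters", {}),
        post_processing=data.get("post_processing", []),
    )

plan = parse_query_plan(
    '{"intent": "calculate", "filters": {"object_type": "receipt"},'
    ' "post_processing": ["ocr_read", "calculate_sum"]}'
)
print(plan.intent, plan.post_processing)
# → calculate ['ocr_read', 'calculate_sum']
```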

## Model Registry

All models are swappable via `scripts/model_config.py`. Each entry has: model name, parameters, VRAM requirement, speed benchmark, accuracy score, recommended GPU.

| Category | Default Model | Top Upgrade Option | Key Alternatives |
|---|---|---|---|
| VLM | Qwen2.5-VL-7B (6GB) | **Qwen3-VL-8B** (8GB, DocVQA 96.5%) | CogVLM2-19B, InternVL3-78B, GPT-5.2-Vision |
| Image Embedding | SigLIP-SO400M (2GB) | **SigLIP2-SO400M** (2GB, +2.9% ImageNet) | EVA-CLIP-8B, OpenCLIP-ViT-bigG-14 |
| Text Embedding | BGE-M3 (2GB) | **Qwen3-Embedding-8B** (8GB, MTEB #1 70.58) | NV-Embed-v2, voyage-3-large |
| Re-Ranker | BGE Reranker v2 M3 (2GB) | **Qwen3-Reranker-8B** (8GB, 100+ languages) | Cohere-Rerank-v3.5, bge-reranker-v2.5 |
| OCR | PaddleOCR v4 (1GB) | **PP-OCRv5** (1GB, +13%p accuracy) | Google Cloud Vision, EasyOCR |
| Face | InsightFace buffalo_l (1GB) | **AdaFace** (1GB, +0.05% LFW) | ArcFace-R100, FaceNet-v2 |
| LLM (Query) | Claude Sonnet 4 (API) | **GPT-5.2** (API, Arena ELO 1420) | Claude Opus 4.6, Gemini 3 Pro |

### Swapping Models

```python
from model_config import set_current_model, get_current_model
# Switch VLM at runtime
set_current_model("vlm", "cogvlm2-llama3-chat-19B")
# Check current
info = get_current_model("vlm")
```
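One plausible shape for the registry behind those two functions is a nested dict keyed by category, plus a mutable "current" pointer per category. This is a hedged sketch, not the contents of `scripts/model_config.py` — the entries and hardware values here are placeholders:

```python
# Hypothetical in-memory registry; the real one lives in scripts/model_config.py
# and carries the full entry (parameters, speed benchmark, accuracy score, etc.).
MODEL_REGISTRY = {
    "vlm": {
        "qwen2.5-vl-7b": {"vram_gb": 6, "gpu": "g5.xlarge"},        # placeholder specs
        "cogvlm2-llama3-chat-19B": {"vram_gb": 19, "gpu": "g5.2xlarge"},
    },
}
_current = {"vlm": "qwen2.5-vl-7b"}

def set_current_model(category: str, name: str) -> None:
    """Switch the active model for a category; reject unregistered names."""
    if name not in MODEL_REGISTRY[category]:
        raise KeyError(f"{name} not registered under {category}")
    _current[category] = name

def get_current_model(category: str) -> dict:
    """Return the active model's entry, including its name."""
    name = _current[category]
    return {"name": name, **MODEL_REGISTRY[category][name]}

set_current_model("vlm", "cogvlm2-llama3-chat-19B")
print(get_current_model("vlm")["name"])  # → cogvlm2-llama3-chat-19B
```

Keeping the registry as plain data is what makes "swappable without code changes" work: adding a model is a new dict entry, not a new code path.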

## AWS Infrastructure

- **EKS**: GPU pool (g5.xlarge, 2-10 nodes) + CPU pool (m6i.xlarge, 2-8 nodes)
- **Kafka**: Strimzi KRaft, 3 brokers, 9 topics with retention policies
- **TiDB**: TiDB Operator on Kubernetes (3 TiDB + 3 TiKV + 3 PD nodes)
- **Meilisearch**: 2-3 replicas with persistent volumes
- **S3**: 3 buckets (images, models, backups)
- **ElastiCache**: Redis r6g for session cache
- **ALB**: TLS termination, HTTP→HTTPS redirect

### Scaling Rules

- Ingestion: KEDA autoscaler on Kafka consumer lag (2-8 GPU pods)
- Search: HPA on CPU 70% / Memory 80%, PDB min 2 available
- TiDB: Auto-scale TiDB nodes based on query QPS, TiKV nodes based on storage
- Meilisearch: Scale replicas based on search throughput

## Local Development

```bash
cd prototype/
docker-compose up -d  # Starts: FastAPI, TiDB, Meilisearch, Kafka, Redis, MinIO
# Ingestion: http://localhost:8001
# Search API: http://localhost:8000
# TiDB: mysql://root@localhost:4000/ask_gallery
# Meilisearch: http://localhost:7700
```

## Key Design Decisions

1. **Fan-out via Kafka** — Each processor runs independently; failure in one doesn't block others
2. **TiDB unified storage** — SQL + vectors in one database, ADCPE quantization for 32x compression
3. **Meilisearch for places** — Bucket sort ranking (words→typo→proximity→attribute→sort→exactness) + geo-filtering for location-based queries with typo tolerance
4. **Reciprocal Rank Fusion** — Combine TiDB vector + Meilisearch ranking scores mathematically
5. **Cross-encoder re-ranking** — Final precision boost on top candidates
6. **Session-based follow-ups** — Redis caches search context for conversational refinement
7. **Model Registry pattern** — All ML models swappable without code changes

## File Reference

Read reference docs for detailed specifications:
- `references/architecture.md` — Full system architecture with diagrams
- `references/api-spec.md` — REST API endpoint specifications
- `references/models.md` — Detailed model comparison and benchmarks
- `references/prompt-templates.md` — VLM/LLM prompt templates

Run `grep -n "keyword" references/*.md` for specific topics in large files.
