Tags: qwen, multimodal, local-llm, video-search, open-source

Find Any Moment in Hours of Video — No Transcription Required

SentrySearch uses Qwen3-VL embeddings to search video footage by natural language queries, running entirely offline on consumer hardware.

Reviewed and published by Vlad Makarov
3 min read

"Red truck running a stop sign." Type that into SentrySearch, and it will scrub through hours of dashcam footage and hand you the exact clip — trimmed, saved, ready. No transcription. No frame captioning. No cloud API. Just raw video pixels embedded into the same vector space as your words.

How It Works

Developer Soham Rajadhyaksha originally built SentrySearch for Tesla Sentry Mode footage, but it works with any MP4 file. The tool splits video into 30-second overlapping chunks, downscales them to 480p at 5fps, automatically skips still frames, and then embeds each chunk using Qwen3-VL — Alibaba's multimodal vision-language model.
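The chunking step can be sketched in a few lines. This is an illustrative reconstruction, not the tool's actual code; the overlap length (5 seconds here) and the still-frame threshold are assumptions, since the article only specifies the 30-second chunk size.

```python
def chunk_spans(duration_s, chunk_s=30.0, overlap_s=5.0):
    """Yield (start, end) spans covering a video with overlapping chunks.

    Each span would then be extracted at 480p / 5fps before embedding.
    """
    step = chunk_s - overlap_s
    spans = []
    start = 0.0
    while start < duration_s:
        spans.append((start, min(start + chunk_s, duration_s)))
        if start + chunk_s >= duration_s:
            break
        start += step
    return spans

def is_still(frame_a, frame_b, threshold=2.0):
    """Crude still-frame check: mean absolute pixel difference below a threshold.

    Chunks whose consecutive frames are all 'still' can be skipped entirely.
    """
    diff = sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)
    return diff < threshold
```

For a 70-second clip, `chunk_spans(70)` produces three overlapping spans: `(0, 30)`, `(25, 55)`, and `(50, 70)`.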

The key insight is that Qwen3-VL can natively embed video, not just individual frames. Unlike CLIP-based approaches that analyze one frame at a time and miss motion entirely, Qwen3-VL processes actual video sequences. When you search "person walking across the parking lot," it understands walking as temporal movement, not just a static pose.

Embeddings go into a local ChromaDB vector database. Search queries get embedded into the same space and matched via cosine similarity. The whole pipeline runs on a single CLI command.
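The matching step reduces to cosine similarity between the query embedding and each stored chunk embedding. A minimal sketch, assuming embeddings are already computed (SentrySearch itself delegates this to ChromaDB):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, chunk_vecs, top_k=3):
    """Rank chunk ids by similarity to the query embedding.

    chunk_vecs: {chunk_id: embedding} as produced by the indexing pass.
    """
    ranked = sorted(chunk_vecs,
                    key=lambda cid: cosine(query_vec, chunk_vecs[cid]),
                    reverse=True)
    return ranked[:top_k]
```

With three indexed chunks `{"a": [1, 0], "b": [0, 1], "c": [0.7, 0.7]}`, a query embedded as `[1, 0]` returns `["a", "c"]` for `top_k=2`.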

Hardware requirements:

  • NVIDIA GPU with 18GB+ VRAM: full 8B model
  • Apple Silicon with 24GB+ RAM: full 8B model via MPS
  • 8-16GB VRAM: quantized 8B or smaller 2B model
  • No GPU: not recommended (CPU float32 inference is too slow)
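The table above maps cleanly to a small selection helper. The model names below are illustrative placeholders, not the tool's actual flags:

```python
def pick_model(vram_gb=None, apple_ram_gb=None):
    """Choose a model configuration from available hardware,
    mirroring the requirements table. Returns None when CPU-only."""
    if vram_gb is not None and vram_gb >= 18:
        return "qwen3-vl-8b"            # full 8B model on CUDA
    if apple_ram_gb is not None and apple_ram_gb >= 24:
        return "qwen3-vl-8b-mps"        # full 8B model via MPS
    if vram_gb is not None and vram_gb >= 8:
        return "qwen3-vl-8b-quant"      # quantized 8B (or drop to 2B)
    return None                         # CPU float32: too slow to recommend
```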

The demo video on YouTube shows the full workflow.

Why This Matters

Video search has been a frustratingly unsolved problem for most people. The standard approach — transcribe audio, caption frames, search the text — loses everything that makes video video: motion, spatial relationships, visual context. A red truck running a stop sign has no audio cue and might not trigger a frame caption. Multimodal embeddings skip the text middleman entirely.

SentrySearch hit 1,200 GitHub stars within days of adding local model support. The original Gemini-only version scored 433 points on Hacker News, but the community's #1 request was local support — no API keys, no data leaving the machine. This update delivered exactly that, and the response from r/LocalLLaMA reflected it.

The practical applications go well beyond dashcams. Security footage, personal video archives, body cam review, wildlife cameras — anywhere you have hours of video and need to find specific moments. The fact that it runs entirely offline on consumer hardware, thanks to advances in local inference, makes it genuinely useful rather than just technically impressive.
