by groundlight
Expose HuggingFace zero‑shot object detection models as MCP tools, allowing language and vision‑language models to locate objects and zoom into image regions directly.
Mcp Vision provides a Model Context Protocol (MCP) server that wraps HuggingFace computer‑vision models—primarily zero‑shot object detection pipelines—and makes them available as callable tools. It lets downstream agents such as Claude or other LLMs request object detection results or cropped images without re‑implementing vision logic.
git clone git@github.com:groundlight/mcp-vision.git
cd mcp-vision
make build-docker
Add an mcp-vision entry to claude_desktop_config.json. Example for GPU:
"mcpServers": {
"mcp-vision": {
"command": "docker",
"args": ["run", "-i", "--rm", "--runtime=nvidia", "--gpus", "all", "mcp-vision"],
"env": {}
}
}
Replace with the CPU variant if no GPU is available.
The server exposes two tools:
locate_objects – supply image_path, a list of candidate_labels, and optionally a specific HuggingFace model; returns a list of detections.
zoom_to_object – supply image_path, a target label, and optionally a model; returns a cropped image of the highest‑scoring detection.
Both tools default to google/owlvit-large-patch14 and accept an alternative model via the hf_model argument. The server can also be run without Docker using the uv Python package manager.
Q: Do I need a GPU?
A: A GPU accelerates model loading and inference, especially for the default large model. CPU works but may be slow; you can switch to a smaller model via DEFAULT_OBJDET_MODEL.
Q: How do I choose a different HuggingFace model?
A: Pass the model name to the hf_model parameter of either tool; see the example request after this FAQ. Any model listed under the zero‑shot object detection pipeline tag on HuggingFace is compatible.
Q: Can I run the server without Docker?
A: Yes. Install dependencies with uv install and start the server via uv run python mcp_vision.
Q: What formats are accepted for image_path?
A: Both local file paths and publicly accessible URLs are supported.
Q: Why does Claude sometimes fall back to web search instead of calling the tools?
A: When web search is enabled, Claude prioritises it. Disable web search in Claude Desktop for deterministic tool usage.
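For illustration, a locate_objects call that overrides the default model might look like the MCP tools/call request below (the image URL, labels, and model choice are placeholder values, not taken from mcp-vision's documentation):
{
"method": "tools/call",
"params": {
"name": "locate_objects",
"arguments": {
"image_path": "https://example.com/street.jpg",
"candidate_labels": ["advertising board", "shop sign"],
"hf_model": "google/owlvit-base-patch32"
}
}
}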
A Model Context Protocol (MCP) server exposing HuggingFace computer vision models such as zero-shot object detection as tools, enhancing the vision capabilities of large language or vision-language models.
This repo is in active development. See below for details of currently available tools.
Clone the repo:
git clone git@github.com:groundlight/mcp-vision.git
Build a local docker image:
cd mcp-vision
make build-docker
Add this to your claude_desktop_config.json:
If your local environment has access to an NVIDIA GPU:
"mcpServers": {
"mcp-vision": {
"command": "docker",
"args": ["run", "-i", "--rm", "--runtime=nvidia", "--gpus", "all", "mcp-vision"],
"env": {}
}
}
Or, CPU only:
"mcpServers": {
"mcp-vision": {
"command": "docker",
"args": ["run", "-i", "--rm", "mcp-vision"],
"env": {}
}
}
When running on CPU, the default large-size object detection model may take a long time to load and run inference. Consider using a smaller model as DEFAULT_OBJDET_MODEL (you can also tell Claude directly to use a specific model).
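As a sketch, assuming the container reads DEFAULT_OBJDET_MODEL from its environment, a CPU configuration could pin a smaller model by passing the variable through docker's -e flag (the model shown is just one example of a smaller zero-shot detector):
"mcpServers": {
"mcp-vision": {
"command": "docker",
"args": ["run", "-i", "--rm", "-e", "DEFAULT_OBJDET_MODEL=google/owlvit-base-patch32", "mcp-vision"],
"env": {}
}
}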
(Beta) It is possible to run the public docker image directly without building locally; however, the download time may interfere with Claude's loading of the server.
"mcpServers": {
"mcp-vision": {
"command": "docker",
"args": ["run", "-i", "--rm", "--runtime=nvidia", "--gpus", "all", "groundlight/mcp-vision:latest"],
"env": {}
}
}
The following tools are currently available through the mcp-vision server:
locate_objects
image_path (string): URL or file path
candidate_labels (list of strings): possible objects to detect
hf_model (string, optional): defaults to "google/owlvit-large-patch14", which could be slow on a non-GPU machine
zoom_to_object
image_path (string): URL or file path
label (string): object label to find, zoom, and crop to
hf_model (string, optional): defaults to "google/owlvit-large-patch14", which could be slow on a non-GPU machine
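Under the hood these tools wrap HuggingFace's zero-shot object detection pipeline. Below is a minimal Python sketch of the equivalent direct calls (illustrative only, not mcp-vision's actual source; the image path and labels are placeholders):
# Zero-shot object detection with mcp-vision's default model,
# plus the crop step that zoom_to_object performs on the best hit.
from transformers import pipeline
from PIL import Image

detector = pipeline("zero-shot-object-detection", model="google/owlvit-large-patch14")
image = Image.open("street.jpg")  # placeholder local image

# locate_objects: score each candidate label against the image
detections = detector(image, candidate_labels=["advertising board", "shop sign"])
# each detection: {"score": float, "label": str, "box": {"xmin", "ymin", "xmax", "ymax"}}

# zoom_to_object: crop to the highest-scoring detection
best = max(detections, key=lambda d: d["score"])
box = best["box"]
image.crop((box["xmin"], box["ymin"], box["xmax"], box["ymax"])).save("zoomed.jpg")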
Run Claude Desktop with Claude 3.7 Sonnet and mcp-vision configured as an MCP server in claude_desktop_config.json. The prompt used in the example video and blog post was:
From the information on that advertising board, what is the type of this shop?
Options:
The shop is a yoga studio.
The shop is a cafe.
The shop is a seven-eleven.
The shop is a milk tea shop.
The image is the first image in the V*Bench/GPT4V-hard dataset and can be found here: https://huggingface.co/datasets/craigwu/vstar_bench/blob/main/GPT4V-hard/0.JPG (use the download link).
Run locally using the uv package manager:
uv install
uv run python mcp_vision
Build the Docker image locally:
make build-docker
Run the Docker image locally:
make run-docker-cpu
or
make run-docker-gpu
[Groundlight Internal] Push the Docker image to Docker Hub (requires DockerHub credentials):
make push-docker
If Claude Desktop is failing to connect to mcp-vision:
On accounts that have web search enabled, Claude will, as far as we know, prefer web search over local MCP tools. Disable web search for best results.
Discover more MCP servers with similar functionality and use cases
by danny-avila
Provides a customizable ChatGPT‑like web UI that integrates dozens of AI models, agents, code execution, image generation, web search, speech capabilities, and secure multi‑user authentication, all open‑source and ready for self‑hosting.
by ahujasid
BlenderMCP integrates Blender with Claude AI via the Model Context Protocol (MCP), enabling AI-driven 3D scene creation, modeling, and manipulation. This project allows users to control Blender directly through natural language prompts, streamlining the 3D design workflow.
by pydantic
Enables building production‑grade generative AI applications using Pydantic validation, offering a FastAPI‑like developer experience.
by GLips
Figma-Context-MCP is a Model Context Protocol (MCP) server that provides Figma layout information to AI coding agents. It bridges design and development by enabling AI tools to directly access and interpret Figma design data for more accurate and efficient code generation.
by mcp-use
Easily create and interact with MCP servers using custom agents, supporting any LLM with tool calling and offering multi‑server, sandboxed, and streaming capabilities.
by sonnylazuardi
This project implements a Model Context Protocol (MCP) integration between Cursor AI and Figma, allowing Cursor to communicate with Figma for reading designs and modifying them programmatically.
by lharries
WhatsApp MCP Server is a Model Context Protocol (MCP) server for WhatsApp that allows users to search, read, and send WhatsApp messages (including media) through AI models like Claude. It connects directly to your personal WhatsApp account via the WhatsApp web multi-device API and stores messages locally in a SQLite database.
by idosal
GitMCP is a free, open-source remote Model Context Protocol (MCP) server that transforms any GitHub project into a documentation hub, enabling AI tools to access up-to-date documentation and code directly from the source to eliminate "code hallucinations."
by Klavis-AI
Klavis AI provides open-source Multi-platform Control Protocol (MCP) integrations and a hosted API for AI applications. It simplifies connecting AI to various third-party services by managing secure MCP servers and authentication.