by groundlight
Expose HuggingFace zero‑shot object detection models as MCP tools, allowing language and vision‑language models to locate objects and zoom into image regions directly.
mcp-vision provides a Model Context Protocol (MCP) server that wraps HuggingFace computer‑vision models—primarily zero‑shot object detection pipelines—and makes them available as callable tools. It lets downstream agents such as Claude or other LLMs request object detection results or cropped images without re‑implementing vision logic.
Tools:
- locate_objects – supply image_path, a list of candidate_labels, and optionally a specific HuggingFace model; returns a list of detections.
- zoom_to_object – supply image_path, a target label, and optionally a model; returns a cropped image of the highest‑scoring detection.

Key features:
- Uses google/owlvit-large-patch14 as the default model.
- Two tools: locate_objects (returns a detection list) and zoom_to_object (returns a cropped image).
- Any zero‑shot object detection model can be selected via the hf_model argument.
- Local development via the uv Python package manager.
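Under the hood these tools correspond to the standard transformers zero‑shot object detection pipeline. A minimal sketch of that pipeline call (illustrative only, not the server's actual code; the image filename is hypothetical):

from transformers import pipeline

# Load a zero-shot object detection pipeline with the default checkpoint.
detector = pipeline(
    "zero-shot-object-detection",
    model="google/owlvit-large-patch14",
)

# Detect the requested labels in an image (local path, URL, or PIL image).
detections = detector(
    "street_scene.jpg",  # hypothetical example image
    candidate_labels=["person", "bicycle"],
)
# Each detection is a dict with a score, a label, and a bounding box, e.g.
# {"score": 0.92, "label": "person", "box": {"xmin": 10, "ymin": 20, "xmax": 150, "ymax": 300}}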
Q: Do I need a GPU?
A: A GPU accelerates model loading and inference, especially for the default large model. CPU works but may be slow; you can switch to a smaller model via DEFAULT_OBJDET_MODEL.
Q: How do I choose a different HuggingFace model?
A: Pass the model name to the hf_model parameter of either tool. Any model listed under the zero‑shot object detection pipeline tag on HuggingFace is compatible.
Q: Can I run the server without Docker?
A: Yes. Install dependencies with uv install and start the server via uv run python mcp_vision.
Q: What formats are accepted for image_path?
A: Both local file paths and publicly accessible URLs are supported.
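One plausible way a server like this could support both (a sketch, assuming PIL and requests; load_image is a hypothetical helper, not necessarily the project's own):

from io import BytesIO

import requests
from PIL import Image

def load_image(image_path: str) -> Image.Image:
    # Fetch over HTTP(S) if the path looks like a URL, otherwise open locally.
    if image_path.startswith(("http://", "https://")):
        resp = requests.get(image_path, timeout=30)
        resp.raise_for_status()
        return Image.open(BytesIO(resp.content))
    return Image.open(image_path)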
Q: Why does Claude sometimes fall back to web search instead of calling the tools?
A: When web search is enabled, Claude prioritises it. Disable web search in Claude Desktop for deterministic tool usage.
A Model Context Protocol (MCP) server exposing HuggingFace computer vision models such as zero-shot object detection as tools, enhancing the vision capabilities of large language or vision-language models.
This repo is in active development. See below for details of currently available tools.
Clone the repo:
git clone git@github.com:groundlight/mcp-vision.git
Build a local docker image:
cd mcp-vision
make build-docker
Add this to your claude_desktop_config.json:
If your local environment has access to an NVIDIA GPU:
"mcpServers": {
"mcp-vision": {
"command": "docker",
"args": ["run", "-i", "--rm", "--runtime=nvidia", "--gpus", "all", "mcp-vision"],
"env": {}
}
}
Or, CPU only:
"mcpServers": {
"mcp-vision": {
"command": "docker",
"args": ["run", "-i", "--rm", "mcp-vision"],
"env": {}
}
}
When running on CPU, the default large-size object detection model may take a long time to load and run inference. Consider setting a smaller model as DEFAULT_OBJDET_MODEL (you can also tell Claude directly to use a specific model).
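Assuming DEFAULT_OBJDET_MODEL is read as an environment variable (the variable name comes from this README; the snippet itself is only a sketch), the override could work like this:

import os

from transformers import pipeline

# Use the override if set, otherwise fall back to the large default checkpoint.
model_name = os.environ.get("DEFAULT_OBJDET_MODEL", "google/owlvit-large-patch14")
detector = pipeline("zero-shot-object-detection", model=model_name)

A smaller checkpoint such as google/owlvit-base-patch32 could then be supplied through the "env" block of the claude_desktop_config.json entry.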
(Beta) It is possible to run the public docker image directly without building locally; however, the download time may interfere with Claude's loading of the server.
"mcpServers": {
"mcp-vision": {
"command": "docker",
"args": ["run", "-i", "--rm", "--runtime=nvidia", "--gpus", "all", "groundlight/mcp-vision:latest"],
"env": {}
}
}
The following tools are currently available through the mcp-vision server:
- locate_objects – detect objects in an image. Inputs: image_path (string) URL or file path; candidate_labels (list of strings) list of possible objects to detect; hf_model (optional string), will use "google/owlvit-large-patch14" by default, which could be slow on a non-GPU machine.
- zoom_to_object – zoom into and crop to an object in an image. Inputs: image_path (string) URL or file path; label (string) object label to find, zoom, and crop to; hf_model (optional string), will use "google/owlvit-large-patch14" by default, which could be slow on a non-GPU machine.
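Conceptually, zoom_to_object chains detection and cropping: run the detector for the single target label, take the highest-scoring box, and crop to it. A rough sketch (illustrative, not the server's actual implementation; URL handling is omitted):

from PIL import Image
from transformers import pipeline

def zoom_to_object(image_path: str, label: str,
                   hf_model: str = "google/owlvit-large-patch14") -> Image.Image:
    detector = pipeline("zero-shot-object-detection", model=hf_model)
    image = Image.open(image_path)
    detections = detector(image, candidate_labels=[label])
    if not detections:
        raise ValueError(f"no '{label}' detected in {image_path}")
    best = max(detections, key=lambda d: d["score"])  # highest-scoring detection
    box = best["box"]
    return image.crop((box["xmin"], box["ymin"], box["xmax"], box["ymax"]))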
Run Claude Desktop with Claude Sonnet 3.7 and mcp-vision configured as an MCP server in claude_desktop_config.json.
The prompt used in the example video and blog post was:
From the information on that advertising board, what is the type of this shop?
Options:
The shop is a yoga studio.
The shop is a cafe.
The shop is a seven-eleven.
The shop is a milk tea shop.
Note: the image is the first image in the V*Bench/GPT4V-hard dataset and can be found here: https://huggingface.co/datasets/craigwu/vstar_bench/blob/main/GPT4V-hard/0.JPG (use the download link).
Run locally using the uv
package manager:
uv install
uv run python mcp_vision
Build the Docker image locally:
make build-docker
Run the Docker image locally:
make run-docker-cpu
or
make run-docker-gpu
[Groundlight Internal] Push the Docker image to Docker Hub (requires DockerHub credentials):
make push-docker
If Claude Desktop is failing to connect to mcp-vision:
On accounts that have web search enabled, Claude will generally prefer web search over local MCP tools. Disable web search for best results.