Segment Anything (Meta)


Core Functions and Features

Segment Anything Model (SAM) is a new AI model from Meta AI that can "cut out" any object, in any image, with a single click. SAM is a promptable segmentation system with zero-shot generalization to unfamiliar objects and images, without the need for additional training.

Variety of Input Prompts: Prompts specifying what to segment in an image allow for a wide range of segmentation tasks without additional training. Supported prompt types are foreground/background points, bounding boxes, and masks. Text prompts are explored in the paper, but that capability has not been released.
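As an illustration of the point prompts described above, here is a minimal sketch of how clicks might be packed into the parallel coordinate and label lists that the released `segment_anything` package's `SamPredictor.predict` takes as `point_coords` and `point_labels` (label 1 for foreground, 0 for background). The helper name `build_point_prompt` and the click coordinates are hypothetical; the commented-out call at the end assumes the package, a checkpoint, and NumPy are available.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates


def build_point_prompt(
    foreground: Sequence[Point],
    background: Sequence[Point],
) -> Tuple[List[List[float]], List[int]]:
    """Hypothetical helper: merge clicks into parallel coords/labels lists.

    Foreground clicks get label 1, background clicks get label 0.
    """
    coords = [[float(x), float(y)] for x, y in list(foreground) + list(background)]
    labels = [1] * len(foreground) + [0] * len(background)
    return coords, labels


# One click on the object, two clicks on regions to exclude (made-up values).
coords, labels = build_point_prompt(
    foreground=[(500, 375)],
    background=[(100, 80), (900, 600)],
)
print(coords)
print(labels)

# With the package installed, the prompt could then be passed on roughly like:
#   predictor.set_image(image)  # image: HxWx3 uint8 RGB array
#   masks, scores, logits = predictor.predict(
#       point_coords=np.array(coords),
#       point_labels=np.array(labels),
#       multimask_output=True,
#   )
```

Keeping coordinates and labels in two parallel lists mirrors the two separate arguments of the predictor, so background points are just extra rows with label 0.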

Flexible Integration: SAM's promptable design enables flexible integration with other systems. It can take input prompts from other systems, such as taking a user's gaze from an AR/VR headset to select an object, or using bounding box prompts from an object detector to enable text-to-object segmentation.
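To make the object-detector integration concrete, the sketch below converts a detector box into a box prompt. It assumes the detector reports boxes as (x, y, width, height), which is only one common convention, and converts them to the corner-based XYXY layout that `SamPredictor.predict` expects for its `box` argument. The function name `xywh_to_xyxy` and the sample values are hypothetical.

```python
from typing import List, Sequence


def xywh_to_xyxy(
    box: Sequence[float], image_width: int, image_height: int
) -> List[float]:
    """Convert an (x, y, w, h) box to [x0, y0, x1, y1], clamped to the image."""
    x, y, w, h = box
    if w < 0 or h < 0:
        raise ValueError("box width and height must be non-negative")
    x0 = min(max(float(x), 0.0), float(image_width))
    y0 = min(max(float(y), 0.0), float(image_height))
    x1 = min(max(float(x + w), 0.0), float(image_width))
    y1 = min(max(float(y + h), 0.0), float(image_height))
    return [x0, y0, x1, y1]


# A made-up detection on a 640x480 image.
print(xywh_to_xyxy((10, 20, 30, 40), 640, 480))
# A detection that spills past the image edge gets clamped.
print(xywh_to_xyxy((600, 400, 100, 100), 640, 480))
```

Clamping is a design choice for this sketch: it keeps the prompt inside the image when a detector overshoots the border.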

Extensible Outputs: Output masks can be used as inputs to other AI systems. For example, object masks can be tracked in videos, enable image editing applications, be lifted to 3D, or be used for creative tasks like collaging.
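As a small example of a downstream use, the following sketch "cuts out" an object by turning a binary mask into an alpha channel. It uses plain nested lists so that it runs without dependencies; real pipelines would more likely use array libraries. The function name `cut_out` and the tiny 2x2 image are hypothetical, and the mask is assumed to be a per-pixel boolean grid of the same size as the image.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]
RGBA = Tuple[int, int, int, int]


def cut_out(image: List[List[RGB]], mask: List[List[bool]]) -> List[List[RGBA]]:
    """Return an RGBA image that is opaque inside the mask, transparent outside."""
    if len(image) != len(mask) or any(
        len(img_row) != len(mask_row) for img_row, mask_row in zip(image, mask)
    ):
        raise ValueError("image and mask must have the same height and width")
    return [
        [(r, g, b, 255 if keep else 0) for (r, g, b), keep in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


# 2x2 toy image; the mask keeps only the top-left and bottom-right pixels.
image = [[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (9, 9, 9)]]
mask = [[True, False], [False, True]]
cutout = cut_out(image, mask)
print(cutout)
```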

Zero-shot Generalization: SAM has learned a general notion of what objects are -- this understanding enables zero-shot generalization to unfamiliar objects and images without requiring additional training.

Efficient & Flexible Model Design: SAM is designed to be efficient enough to power its data engine. The model is decoupled into 1) a one-time image encoder (ViT-H based, 632M parameters, takes ~0.15 seconds on an NVIDIA A100 GPU) and 2) a lightweight mask decoder (Transformer based, 4M parameters combined with the prompt encoder, takes ~50ms on CPU in the browser using multithreaded SIMD execution). In terms of platform support, the image encoder is implemented in PyTorch and requires a GPU; the prompt encoder and mask decoder can run directly with PyTorch or be converted to ONNX and run efficiently on CPU or GPU on any platform that supports ONNX Runtime. The model was trained for 3-5 days on 256 A100 GPUs.
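The benefit of decoupling can be shown with back-of-the-envelope arithmetic: the image is encoded once, and every further prompt only pays for the lightweight decoder. The sketch below uses the ~0.15 s and ~50 ms figures quoted above purely as illustrative constants. They were measured on different hardware (A100 GPU versus browser CPU), so the totals are not a benchmark, and the function names are hypothetical.

```python
ENCODER_MS = 150  # one-time image embedding, per the ~0.15 s figure above
DECODER_MS = 50   # per-prompt mask decoding, per the ~50 ms figure above


def decoupled_latency_ms(num_prompts: int) -> int:
    """Encode the image once, then decode each prompt against the cached embedding."""
    return ENCODER_MS + num_prompts * DECODER_MS


def coupled_latency_ms(num_prompts: int) -> int:
    """Hypothetical baseline that re-encodes the image for every prompt."""
    return num_prompts * (ENCODER_MS + DECODER_MS)


for n in (1, 10):
    print(n, decoupled_latency_ms(n), coupled_latency_ms(n))
```

With a single prompt the two designs cost the same; the gap grows with every extra click on the same image, which is what makes interactive use practical.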

Data Engine and Dataset

SAM's advanced capabilities are the result of its training on millions of images and masks collected through a model-in-the-loop "data engine." Researchers used SAM and its data to interactively annotate images and update the model, repeating this cycle to improve both. After annotating enough masks with SAM's help, they leveraged SAM's ambiguity-aware design to annotate new images fully automatically by presenting SAM with a grid of points on an image and asking SAM to segment everything at each point. The final dataset, SA-1B, includes more than 1.1 billion segmentation masks collected on ~11 million licensed and privacy-preserving images.
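The fully automatic stage relies on prompting the model with a regular grid of points. A minimal sketch of generating such a grid is shown below, assuming evenly spaced points placed at cell centers so that none falls on the image border. The function name `build_point_grid`, the grid density, and the image size are assumptions for illustration, not the exact settings used to build SA-1B.

```python
from typing import List, Tuple


def build_point_grid(
    n_per_side: int, width: int, height: int
) -> List[Tuple[float, float]]:
    """Return n_per_side**2 (x, y) pixel coordinates at the centers of a regular grid."""
    if n_per_side < 1:
        raise ValueError("n_per_side must be at least 1")
    step = 1.0 / n_per_side
    # Cell centers in normalized [0, 1] coordinates: 0.5/n, 1.5/n, ...
    centers = [(i + 0.5) * step for i in range(n_per_side)]
    # Row-major order: all x positions for the first y, then the next y, and so on.
    return [(cx * width, cy * height) for cy in centers for cx in centers]


grid = build_point_grid(2, width=100, height=50)
print(grid)
```

Each grid point would then be used as a single foreground point prompt, and the resulting masks filtered and deduplicated downstream.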

Frequently Asked Questions (FAQ)

Does the model produce mask labels? No, the model predicts object masks only and does not generate labels.
Does the model work on videos? Currently the model only supports images or individual frames from videos.
Where can I find the code? Code is available on GitHub.

Pricing Information

No pricing information is mentioned on the page; the model code and dataset are available for free.

Visits: 463.2K

Country: United States

Design assistant Open Source
