OWL-ViT

Open-vocabulary object detection model by Google using vision transformers.

Open Source | Self Hosted | Offline Capable | GPU Required (6GB+ VRAM)

About

OWL-ViT (Vision Transformer for Open-World Localization) is Google's open-vocabulary object detector: it accepts free-text queries rather than a fixed label set. It transfers CLIP-style image-text pretraining to detection, so it can localize objects described by arbitrary text without needing detection training data for each queried category. It is distributed within Google's Scenic research codebase for attention-based vision models and is released under the Apache 2.0 license.
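The idea of matching free-text queries to image regions can be illustrated with a small toy sketch. It is not the Scenic implementation: the vectors are hand-written stand-ins for the embeddings a real text encoder and detection head would produce, the function names (`cosine`, `label_boxes`) and the 0.5 threshold are hypothetical, and the sketch uses plain cosine similarity rather than whatever learned scoring the real model applies.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def label_boxes(box_embeddings, query_embeddings, queries, threshold=0.5):
    """Assign each candidate box the best-matching text query.

    Boxes whose best similarity falls below `threshold` are dropped,
    which stands in for "no queried object here".
    """
    results = []
    for box_index, box_vec in enumerate(box_embeddings):
        scores = [cosine(box_vec, q) for q in query_embeddings]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            results.append((box_index, queries[best], scores[best]))
    return results


# Toy data (assumed for illustration): two text queries and three candidate boxes.
queries = ["a photo of a cat", "a photo of a remote control"]
query_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
box_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]]

detections = label_boxes(box_embeddings, query_embeddings, queries)
for box_index, label, score in detections:
    print(f"box {box_index}: {label} ({score:.2f})")
```

Because the label set is just the list of query strings, changing what the detector looks for means changing `queries` and their embeddings, with no retraining step in this sketch. The third box matches neither query and is filtered out by the threshold.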

Details

Price: Free
Platform: Local/Desktop
Difficulty: Intermediate (3/5)
License: Apache-2.0
Minimum VRAM: 6 GB
Added: Apr 3, 2026

Related Tools

Simple and effective multi-object tracking using every detection box.

Open Source | Self Hosted | Offline | GPU 4GB+ | Intermediate

Monocular depth estimation model producing detailed depth maps from single images.

Open Source | Self Hosted | Offline | GPU 4GB+ | Easy

End-to-end object detection with transformers by Meta, eliminating hand-designed components.

Open Source | Self Hosted | Offline | GPU 8GB+ | Advanced

Self-supervised vision transformer by Meta producing universal visual features.

Open Source | Self Hosted | Offline | GPU 6GB+ | Intermediate

Unified vision foundation model by Microsoft for captioning, detection, and segmentation.

Open Source | Self Hosted | Offline | GPU 6GB+ | Intermediate

Robust multi-object tracking combining motion and appearance cues.

Open Source | Self Hosted | Offline | GPU 4GB+ | Intermediate