OWL-ViT
Open-vocabulary object detection model by Google using vision transformers.
About
OWL-ViT (Vision Transformer for Open-World Localization), developed by Google, performs open-vocabulary object detection: it accepts free-text queries rather than a fixed label set. It transfers CLIP-style image-text pretraining to detection, so it can localize objects described by arbitrary text, including categories for which it saw no detection annotations. It is distributed within Google's Scenic research codebase for attention-based vision models and is released under the Apache 2.0 license.
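The free-text querying described above can be illustrated with a toy sketch. It assumes that each candidate box and each text query has already been mapped to an embedding vector, and it labels a box with the query whose embedding is most similar. The three-dimensional vectors, the `label_boxes` helper, and the 0.5 threshold are made up for illustration; in the real model the embeddings come from the image and text encoders, and this is not the Scenic implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def label_boxes(box_embeddings, query_embeddings, threshold=0.5):
    """Give each box its best-matching text query, or None if no query is close enough.

    Hypothetical helper: the name and the threshold value are illustrative only.
    """
    labels = []
    for box_vec in box_embeddings:
        scores = {text: cosine(box_vec, vec) for text, vec in query_embeddings.items()}
        best = max(scores, key=scores.get)
        labels.append(best if scores[best] >= threshold else None)
    return labels


# Made-up embeddings standing in for text-encoder and image-encoder outputs.
queries = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a remote control": [0.0, 1.0, 0.0],
}
boxes = [
    [0.9, 0.1, 0.0],  # close to the "cat" query
    [0.1, 0.8, 0.1],  # close to the "remote control" query
    [0.0, 0.0, 1.0],  # matches neither query
]

labels = label_boxes(boxes, queries)
print(labels)
```

Because the label set is just the list of query strings, changing what the detector looks for only requires changing the text, not retraining on a new set of classes.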
Details
- Price: Free
- Platform: Local/Desktop
- Difficulty: Intermediate (3/5)
- License: Apache-2.0
- Minimum VRAM: 6 GB
- Added: Apr 3, 2026
Related Tools
Simple and effective multi-object tracking using every detection box.
Monocular depth estimation model producing detailed depth maps from single images.
End-to-end object detection with transformers by Meta, eliminating hand-designed components.
Self-supervised vision transformer by Meta producing universal visual features.
Unified vision foundation model by Microsoft for captioning, detection, and segmentation.
Robust multi-object tracking combining motion and appearance cues.