Search-TTA: A Multimodal Test-Time Adaptation Framework for Visual Search in the Wild Demo

Click any of the examples below to run the multimodal inference demo. To try the test-time adaptation feature, switch to the other tab above.
If you encounter an error, refresh the browser and rerun the demo, or try again the next day. We will improve this in the future.
Project Website | LISA-AVS Demo

Satellite Image

Full Taxonomy Name (optional)

Sound Input (optional)

Ground-level Image (optional)

Heat-map Results

Ground Image Query

Text Query

Sound Query

In-Domain Taxonomy

Examples

Satellite Image	Ground-level Image (optional)	Full Taxonomy Name (optional)	Sound Input (optional)

Out-Domain Taxonomy

Examples

Satellite Image	Ground-level Image (optional)	Full Taxonomy Name (optional)	Sound Input (optional)

The satellite image CLIP encoder is fine-tuned on Sentinel-2 Level 2A satellite images and taxonomy images (with GPS locations) from iNaturalist. The sound CLIP encoder is fine-tuned on a subset of the same taxonomy images and their corresponding sounds from iNaturalist. Some of this iNaturalist data is also used in TaxaBind. Note that although some of the examples above produce poor probability distributions, these distributions are improved by our test-time adaptation framework during the search process.
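
For readers wondering how a query could turn into a heat map, the sketch below shows one common CLIP-style approach: score each satellite image patch by cosine similarity to the query embedding, then normalize the scores with a softmax so they form a probability distribution. This page does not state how the demo computes its heat maps, so the function names, the `temperature` parameter, and the toy 2-dimensional embeddings are all assumptions made for illustration, not the demo's actual implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_heatmap(patch_embeddings, query_embedding, temperature=0.1):
    """Hypothetical helper: turn per-patch embeddings into a probability grid.

    patch_embeddings is a grid (list of rows) of embedding vectors, one per
    satellite image patch. The returned grid has the same shape and its
    values sum to 1.
    """
    # Scaled cosine similarity between each patch and the query.
    scores = [
        [cosine(patch, query_embedding) / temperature for patch in row]
        for row in patch_embeddings
    ]
    # Softmax over all patches; subtract the peak for numerical stability.
    peak = max(max(row) for row in scores)
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


# Toy 2x2 grid of made-up 2-D patch embeddings and a made-up query embedding.
patches = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.7, 0.7], [-1.0, 0.0]],
]
query = [1.0, 0.0]
heatmap = similarity_heatmap(patches, query)
```

In this toy grid, the top-left patch points in the same direction as the query and so receives the largest share of probability, while the bottom-right patch points the opposite way and receives the smallest. A lower `temperature` would concentrate the distribution more sharply on the best-matching patches.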

Model Inputs

Live Heatmap Output

Taxonomy