The satellite image CLIP encoder is fine-tuned using Sentinel-2 Level 2A satellite imagery and taxonomy images (with GPS locations) from iNaturalist. The sound CLIP encoder is fine-tuned using a subset of the same taxonomy images and their corresponding sounds from iNaturalist. Some of the examples above yield poor probability distributions; our test-time adaptation framework refines these during the search process.
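
To make the idea of a probability distribution from aligned encoders concrete, here is a minimal sketch of one way such a distribution could be formed: cosine similarity between a query embedding (from a taxonomy image or sound) and per-patch satellite embeddings, followed by a temperature-scaled softmax. This is an assumption for illustration, not necessarily how the released model computes its scores; the function names, the softmax normalization, and the `temperature` value are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def patch_probabilities(query_emb, patch_embs, temperature=0.07):
    """Turn query-to-patch similarities into a distribution over patches.

    Hypothetical helper: softmax with a temperature is an assumed choice,
    not something confirmed by the description above.
    """
    logits = [cosine_similarity(query_emb, p) / temperature for p in patch_embs]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 2-D embeddings standing in for real encoder outputs.
query = [1.0, 0.0]
patches = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
probs = patch_probabilities(query, patches)
print(probs)
```

In this toy case the patch whose embedding points in the same direction as the query receives the most probability mass. A poorly aligned encoder would spread or misplace that mass, which is the kind of distribution the test-time adaptation step is meant to improve.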