Abstract: Deep learning has become a dominant force in artificial intelligence; however, its substantial energy consumption and associated carbon emissions have received relatively little attention.
Aligning objects with corresponding textual descriptions is a fundamental challenge and a realistic requirement in vision-language understanding. While recent multimodal embedding models excel at ...
Abstract: Text-Pedestrian Image Retrieval employs textual description of pedestrian's appearance to identify the corresponding pedestrian image. This task involves modality discrepancy and the ...