Open-Tag: A Generative Framework for Open-World Multimodal Tagging

¹Chinese Academy of Sciences Institute of Automation, ²University of Chinese Academy of Sciences, ³Aerospace Information Research Institute, Chinese Academy of Sciences
Teaser Image

Three Paradigms for Multimodal Tagging: (a) closed-set classification with fixed labels, (b) open-vocabulary matching with candidate tag texts, (c) open-world generation of free-form tag sequences.

Abstract

Multimodal tagging supports content understanding by assigning concise, semantically relevant tags to visual inputs. However, real-world tagging is inherently open-ended: user-generated content is noisy, long-tailed, and continuously evolving, which challenges conventional closed-set and open-vocabulary classification methods. We propose Open-Tag, a generative framework for Open-world Multimodal Tagging that produces unordered, variable-length tag sequences in natural language without relying on predefined tag sets.

Open-Tag introduces two key innovations: (1) Order-Prompted Tag Sequence Generation, which maps learnable, order-agnostic queries to latent tag semantics and enables permutation-invariant tag generation, and (2) Multi-Source Retrieval-Augmented Generation, which fuses tag candidates from heterogeneous retrieval systems across visual, textual, and metadata modalities. A score normalization and aggregation strategy makes the fusion robust and improves the diversity and grounding of the generated tags.

To evaluate Open-Tag, we construct two large-scale datasets, CREATE-Tag (Chinese video) and PEXEL-Tag (English image), comprising over 3M videos and 160K images annotated with tens of thousands of real-user tags. We propose a novel open-set evaluation metric, Tag Gain, to quantify the generation of relevant but previously unseen tags. Experiments show that Open-Tag outperforms state-of-the-art baselines on both closed-set F1 and open-set Tag Gain, highlighting its generalization and novel tag discovery capabilities.

#1 Method

Model Image

Overview of the Open-Tag Framework: (a) Multimodal Hybrid Encoder fuses visual (image frames or video clips) and textual (user titles) features via cross-modal attention to produce a unified representation. (b) Order Prompt Encoder generates sample-specific learnable queries, aligned with ground-truth tags via bipartite matching and contrastive learning. (c) Multi-source Tag Recommendation module retrieves tag candidates from similar multimodal content via multiple retrieval systems, forming an external prompt for generation. (d) Prompt-guided Tag Decoder auto-regressively generates tag sequences using implicit semantics from order prompts and explicit knowledge from retrieved tags. All components are jointly trained end-to-end.

Prompt-Tag Alignment

We align order prompts one-to-one with ground-truth tags via bipartite matching, so that each order prompt is semantically matched to a distinct tag.
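The one-to-one alignment can be illustrated with a small sketch. It assumes prompts and tags are already embedded as vectors and uses a dot product as the matching score; both choices, and the function name `match_prompts_to_tags`, are illustrative assumptions rather than the paper's implementation. The search is a brute-force enumeration that is only practical for a handful of prompts; a Hungarian-algorithm solver such as `scipy.optimize.linear_sum_assignment` would typically be used at scale.

```python
from itertools import permutations


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def match_prompts_to_tags(prompt_vecs, tag_vecs):
    """Find the one-to-one prompt/tag assignment with maximal total similarity.

    Returns (assignment, score), where assignment maps prompt index -> tag index.
    Brute force over all injective assignments: an illustrative stand-in for
    the Hungarian algorithm, not an efficient implementation.
    """
    n_prompts, n_tags = len(prompt_vecs), len(tag_vecs)
    if n_tags > n_prompts:
        raise ValueError("need at least as many prompts as tags")
    # sim[p][j] = similarity between prompt p and tag j
    sim = [[dot(p, t) for t in tag_vecs] for p in prompt_vecs]
    best_score, best_perm = float("-inf"), ()
    # perm[j] is the prompt assigned to tag j; prompts are never reused
    for perm in permutations(range(n_prompts), n_tags):
        score = sum(sim[p][j] for j, p in enumerate(perm))
        if score > best_score:
            best_score, best_perm = score, perm
    return {p: j for j, p in enumerate(best_perm)}, best_score


# Toy example: three order prompts, two ground-truth tags.
prompts = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
tags = [[0.0, 1.0], [1.0, 0.0]]
assignment, score = match_prompts_to_tags(prompts, tags)
```

In this toy case, prompt 0 is matched to tag 1 and prompt 1 to tag 0, while prompt 2 stays unmatched because there are more prompts than tags.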

Model Image

Multi-source Video Retrieval

Given a query video, multiple retrieval systems (video-to-video, title-to-title, and cross-modal retrieval) return relevant videos with their associated tags and scores. The retrieved results are aggregated, and a consensus scoring mechanism selects the top-ranked tags as explicit generation guidance.
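The exact normalization and consensus rule are not specified here, so the following is only a minimal sketch under stated assumptions: scores are min-max normalized within each retrieval source, and a tag's consensus score is the sum of the normalized scores of all retrieved items carrying it. The function names, the input layout, and the source names in the example are hypothetical.

```python
from collections import defaultdict


def min_max_normalize(scores):
    """Rescale scores to [0, 1]; a constant list maps to all ones."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_tag_candidates(results_by_source, top_k=5):
    """Aggregate tags from several retrieval sources into one ranked list.

    results_by_source maps a source name to a list of (score, tags) hits.
    Scores are normalized per source so that sources with different score
    ranges contribute comparably (an assumed strategy, not the paper's).
    """
    tag_scores = defaultdict(float)
    for hits in results_by_source.values():
        if not hits:
            continue
        weights = min_max_normalize([score for score, _ in hits])
        for weight, (_, tags) in zip(weights, hits):
            for tag in set(tags):  # count each tag once per retrieved item
                tag_scores[tag] += weight
    # Highest consensus score first; ties broken alphabetically
    ranked = sorted(tag_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tag for tag, _ in ranked[:top_k]]


# Toy example with two sources whose raw scores live on different scales.
results = {
    "video_to_video": [(0.9, ["cat", "pet"]), (0.5, ["dog"])],
    "title_to_title": [(12.0, ["cat", "funny"]), (4.0, ["pet"])],
}
top_tags = fuse_tag_candidates(results, top_k=3)
```

Here "cat" ranks first because both sources support it with their top-scoring hit, which is the kind of cross-source agreement a consensus score is meant to reward.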

Model Image

#2 Evaluation

Datasets

We focus on CREATE-Tag, which provides 210K training [Data Download] and 5K test [Data Download] samples. The training and test sets share 2,795 tags. For evaluation, we split these into 705 common and 2,090 rare tags.

We also build PEXEL-Tag, a newly collected dataset from the Pexels platform, comprising 162K training [Data Download] and 5K test [Data Download] samples. The dataset contains 28,094 unique tags, with 5,669 shared between training and testing. We further categorize these into 1,627 common and 4,042 rare tags for performance analysis.

Results Image

Open-set Metric (Tag Gain)

To evaluate the model’s ability to generate novel and informative tags, we introduce an open-set evaluation protocol based on Tag Gain, which jointly considers tag novelty, visual relevance, and semantic diversity.

Given a generated tag set \(T_{gen} = \{t_1, t_2, \dots, t_k\}\) and the corresponding visual input \(v\), we construct a filtered subset \(\hat{T}_{gen}\) by enforcing two criteria:

  • Visual relevance: Each candidate tag \(t_j \in T_{gen}\) must be sufficiently aligned with the image content. We measure this using the CLIP-based similarity between the tag and the visual input.
    \[ \mathrm{sim}_{\text{CLIP}}(t_j, v) > \tau_{\text{rel}}. \]
  • Semantic diversity: To avoid redundancy, each selected tag should be semantically dissimilar to the tags already included in \(\hat{T}_{gen}\). This is enforced by requiring:
    \[ \forall t' \in \hat{T}_{gen},\quad \mathrm{sim}_{\text{BGE}}(t_j, t') < \tau_{\text{div}}. \]

We apply a greedy filtering process: starting from an empty set \(\hat{T}_{gen} = \emptyset\), we iterate through the candidate tags (optionally sorted by visual relevance), and sequentially add a tag \(t_j\) to \(\hat{T}_{gen}\) only if it satisfies both conditions above. Typically, we set \(\tau_{\text{rel}}=0.3\) and \(\tau_{\text{div}}=0.8\). Formally:

\[ \hat{T}_{gen} = \left\{ t_j \in T_{gen} \;\middle|\; \mathrm{sim}_{\text{CLIP}}(t_j, v) > \tau_{\text{rel}} \land \forall t' \in \hat{T}_{gen},\ \mathrm{sim}_{\text{BGE}}(t_j, t') < \tau_{\text{div}} \right\}. \]
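The greedy filtering above can be sketched as follows. The two similarity functions are passed in as callables so the sketch stays independent of any particular CLIP or BGE implementation; `greedy_filter` and the toy scores are illustrative assumptions, and candidates are sorted by visual relevance, which the text describes as optional.

```python
def greedy_filter(tags, relevance, tag_similarity, tau_rel=0.3, tau_div=0.8):
    """Greedily build the filtered tag set from generated tags.

    relevance(tag) stands in for the CLIP similarity between a tag and the
    visual input; tag_similarity(a, b) stands in for the BGE similarity
    between two tags. Both are supplied by the caller.
    """
    kept = []
    for tag in sorted(tags, key=relevance, reverse=True):
        # Visual relevance: discard tags not aligned with the visual input
        if relevance(tag) <= tau_rel:
            continue
        # Semantic diversity: discard near-duplicates of already kept tags
        if all(tag_similarity(tag, other) < tau_div for other in kept):
            kept.append(tag)
    return kept


# Toy similarity scores standing in for CLIP and BGE outputs.
toy_relevance = {"cat": 0.35, "kitten": 0.33, "sofa": 0.31, "beach": 0.10}


def toy_tag_similarity(a, b):
    return 0.9 if {a, b} == {"cat", "kitten"} else 0.2


kept = greedy_filter(list(toy_relevance), toy_relevance.get, toy_tag_similarity)
```

With these toy scores, "kitten" is dropped as a near-duplicate of "cat" and "beach" is dropped for low visual relevance, leaving "cat" and "sofa".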

Based on the filtered tag set \(\hat{T}_{gen}\), we define two variants of Tag Gain to assess open-world tagging:

  • Known Tag Gain (\(\Delta_{\text{known}}\)): The number of relevant tags that are not annotated for the sample but were seen during training, relative to the number of ground-truth tags. Here \(T_{\text{gt}}\) is the sample's ground-truth tag set and \(T_{\text{train}}\) is the set of tags seen in training:
    \[ T_{\text{known}} = \left\{ t \in \hat{T}_{gen} \;\middle|\; t \notin T_{\text{gt}},\ t \in T_{\text{train}} \right\}, \]
    \[ \Delta_{\text{known}} = \frac{ \left| T_{\text{known}} \right| }{ \left| T_{\text{gt}} \right| }. \]
  • Novel Tag Gain (\(\Delta_{\text{novel}}\)): The number of relevant generated tags that are unseen in training, relative to the number of ground-truth tags:
    \[ T_{\text{novel}} = \left\{ t \in \hat{T}_{gen} \;\middle|\; t \notin T_{\text{train}} \right\}, \]
    \[ \Delta_{\text{novel}} = \frac{ \left| T_{\text{novel}} \right| }{ \left| T_{\text{gt}} \right| }. \]
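As a worked example of the two definitions, the sketch below computes both gains for a single sample with plain set operations. The function name `tag_gain` and the toy tags are illustrative, and the input is assumed to be the already filtered tag set.

```python
def tag_gain(filtered_tags, gt_tags, train_tags):
    """Return (known_gain, novel_gain) for one sample.

    filtered_tags: tags that survived relevance/diversity filtering
    gt_tags:       ground-truth tags of the sample
    train_tags:    all tags seen during training
    """
    filtered, gt, train = set(filtered_tags), set(gt_tags), set(train_tags)
    if not gt:
        raise ValueError("ground-truth tag set must be non-empty")
    # Relevant, unannotated for this sample, but seen during training
    known = {t for t in filtered if t not in gt and t in train}
    # Relevant and never seen during training
    novel = {t for t in filtered if t not in train}
    return len(known) / len(gt), len(novel) / len(gt)


# Toy sample: two ground-truth tags, three filtered generated tags.
known_gain, novel_gain = tag_gain(
    filtered_tags=["cat", "sofa", "cozy corner"],
    gt_tags=["cat", "pet"],
    train_tags=["cat", "pet", "sofa", "dog"],
)
```

Here "sofa" counts toward the known gain and "cozy corner" toward the novel gain, so both values are 1/2 = 0.5. Because the denominator is the ground-truth set size rather than the generated set size, a gain can exceed 100% when a model produces more valid extra tags than there are annotations.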

#3 Results

Open-Set Evaluation

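| Method | CREATE-Tag \(T_{\text{known}}\) | CREATE-Tag \(\Delta_{\text{known}}\) | CREATE-Tag \(T_{\text{novel}}\) | CREATE-Tag \(\Delta_{\text{novel}}\) | PEXEL-Tag \(T_{\text{known}}\) | PEXEL-Tag \(\Delta_{\text{known}}\) | PEXEL-Tag \(T_{\text{novel}}\) | PEXEL-Tag \(\Delta_{\text{novel}}\) |
|---|---|---|---|---|---|---|---|---|
| Bin | 1.2 | 28.1% | 0.00 | 0.00% | 0.8 | 21.4% | 0.00 | 0.00% |
| ASL | 1.4 | 32.5% | 0.00 | 0.00% | 0.9 | 24.6% | 0.00 | 0.00% |
| Order-Free | 1.4 | 31.6% | 0.00 | 0.00% | 1.0 | 27.4% | 0.00 | 0.00% |
| Orderless | 1.8 | 41.3% | 0.00 | 0.00% | 1.4 | 36.4% | 0.00 | 0.00% |
| Tag2Text | 1.7 | 38.5% | 0.00 | 0.00% | 1.1 | 30.2% | 0.00 | 0.00% |
| ML-Decoder | 1.8 | 42.2% | 0.00 | 0.00% | 1.2 | 33.8% | 0.00 | 0.00% |
| RAM | 1.8 | 42.5% | 0.00 | 0.00% | 1.3 | 34.2% | 0.00 | 0.00% |
| RAM++ | 2.0 | 45.6% | 0.00 | 0.00% | 1.4 | 36.7% | 0.00 | 0.00% |
| Open-Book | 1.6 | 37.1% | 0.15 | 3.43% | 1.2 | 31.9% | 0.13 | 3.11% |
| Baseline | 1.6 | 37.3% | 0.10 | 2.24% | 1.3 | 34.7% | 0.11 | 2.63% |
| + OPG | 2.2 | 52.3% | 0.30 | 7.04% | 2.0 | 49.1% | 0.24 | 5.66% |
| + RAG | 4.8 | 80.3% | 3.95 | 72.8% | 4.5 | 94.3% | 3.61 | 71.2% |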

Qualitative Examples on CREATE-Tag

Results Image

Qualitative Examples on PEXEL-Tag

Results Image