Open-Tag: A Generative Framework for Open-World Multimodal Tagging

¹Chinese Academy of Sciences Institute of Automation ²University of Chinese Academy of Sciences ³Aerospace Information Research Institute, Chinese Academy of Sciences

Abstract

Multimodal tagging supports content understanding by assigning concise, semantically relevant tags to visual inputs. However, real-world tagging is inherently open-ended: user-generated content is noisy, long-tailed, and continuously evolving, which challenges conventional closed-set and open-vocabulary classification methods. We propose Open-Tag, a generative framework for Open-world Multimodal Tagging that produces unordered, variable-length tag sequences in natural language without relying on predefined tag sets.

Open-Tag introduces two key innovations: (1) an Order-Prompted Tag Sequence Generation that maps learnable, order-agnostic queries to latent tag semantics, enabling permutation-invariant tag generation, and (2) a Multi-Source Retrieval-Augmented Generation that fuses tag candidates from heterogeneous retrieval systems across visual, textual, and metadata modalities. A score normalization and aggregation strategy ensures robust fusion, enhancing the diversity and grounding of generated tags.

To evaluate Open-Tag, we construct two large-scale datasets, CREATE-Tag (Chinese video) and PEXEL-Tag (English image), comprising over 3M videos and 160K images annotated with tens of thousands of real-user tags. We propose a novel open-set evaluation metric, Tag Gain, to quantify the generation of relevant but previously unseen tags. Experiments show that Open-Tag outperforms state-of-the-art baselines on both closed-set F1 and open-set Tag Gain, highlighting its generalization and novel tag discovery capabilities.

#1 Method

Overview of the Open-Tag Framework: (a) Multimodal Hybrid Encoder fuses visual (image frames or video clips) and textual (user titles) features via cross-modal attention to produce a unified representation. (b) Order Prompt Encoder generates sample-specific learnable queries, aligned with ground-truth tags via bipartite matching and contrastive learning. (c) Multi-source Tag Recommendation module retrieves tag candidates from similar multimodal content via multiple retrieval systems, forming an external prompt for generation. (d) Prompt-guided Tag Decoder auto-regressively generates tag sequences using implicit semantics from order prompts and explicit knowledge from retrieved tags. All components are jointly trained end-to-end.
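The bipartite matching step in (b) can be illustrated with a small sketch. It is not the authors' implementation: the function names, the use of cosine similarity as the matching score, and the brute-force search (standing in for a Hungarian-style solver, and only practical for a handful of queries) are all assumptions made for illustration.

```python
from itertools import permutations
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_queries_to_tags(query_vecs, tag_vecs):
    """Return {tag_index: query_index} maximizing the total similarity.

    Hypothetical helper: assumes there are at least as many queries as tags,
    and enumerates every one-to-one assignment (toy sizes only).
    """
    n_q, n_t = len(query_vecs), len(tag_vecs)
    if n_t > n_q:
        raise ValueError("need at least as many queries as ground-truth tags")
    sim = [[cosine(q, t) for t in tag_vecs] for q in query_vecs]
    best_score, best_assign = -math.inf, ()
    for assign in permutations(range(n_q), n_t):
        # assign[t] is the query index paired with tag t
        score = sum(sim[q][t] for t, q in enumerate(assign))
        if score > best_score:
            best_score, best_assign = score, assign
    return {t: q for t, q in enumerate(best_assign)}
```

Because the assignment is computed per sample, the order in which the ground-truth tags are listed does not affect which query each tag is paired with, which is the property an order-agnostic objective needs.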

Multi-source Video Retrieval

Given a query video, multiple retrieval systems (video-to-video, title-to-title, and cross-modal retrieval) return relevant videos with their associated tags and scores. The retrieved results are aggregated, and a consensus scoring mechanism selects the top-ranked tags as explicit guidance for generation.
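One way such a fusion could work is sketched below. The text does not specify the normalization or the consensus rule, so the min-max normalization per retrieval system, the summation of normalized scores per tag, and the names `min_max_normalize` and `fuse_tags` are assumptions chosen for illustration.

```python
from collections import defaultdict


def min_max_normalize(scores):
    """Rescale raw scores from one retrieval system to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_tags(results_by_source, top_k=5):
    """Aggregate tags across retrieval systems and return the top_k tags.

    results_by_source maps a source name to a list of (score, tags) hits,
    one hit per retrieved item. Scores are only comparable within a source,
    hence the per-source normalization before summing (assumed rule).
    """
    consensus = defaultdict(float)
    for hits in results_by_source.values():
        if not hits:
            continue
        weights = min_max_normalize([score for score, _ in hits])
        for weight, (_, tags) in zip(weights, hits):
            for tag in set(tags):
                consensus[tag] += weight
    # Highest consensus first; ties broken alphabetically for determinism.
    ranked = sorted(consensus.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tag for tag, _ in ranked[:top_k]]
```

With min-max scaling the lowest-scoring hit of each source contributes zero weight; a softer scheme such as rank-based weighting would be an equally reasonable choice under the same description.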

#2 Evaluation

Datasets

We focus on CREATE-Tag, which provides 210K training and 5K test samples. The training and test sets share 2,795 tags. For evaluation, we split these into 705 common and 2,090 rare tags.

We also build PEXEL-Tag, a newly collected dataset from the Pexels platform, comprising 162K training and 5K test samples. The dataset contains 28,094 unique tags, of which 5,669 are shared between training and testing. We further categorize the shared tags into 1,627 common and 4,042 rare tags for performance analysis.

Open-set Metric (Tag Gain)

To evaluate the model’s ability to generate novel and informative tags, we introduce an open-set evaluation protocol based on Tag Gain, which jointly considers tag novelty, visual relevance, and semantic diversity.

Given a generated tag set \(T_{gen} = \{t_1, t_2, \dots, t_k\}\) and the corresponding visual input \(v\), we construct a filtered subset \(\hat{T}_{gen}\) by enforcing two criteria:

Visual relevance: Each candidate tag \(t_j \in T_{gen}\) must be sufficiently aligned with the visual content. We measure this using the CLIP-based similarity between the tag and the visual input:
\[ \mathrm{sim}_{\text{CLIP}}(t_j, v) > \tau_{\text{rel}}. \]
Semantic diversity: To avoid redundancy, each selected tag should be semantically dissimilar to the tags already included in \(\hat{T}_{gen}\). This is enforced by requiring:
\[ \forall t' \in \hat{T}_{gen},\quad \mathrm{sim}_{\text{BGE}}(t_j, t') < \tau_{\text{div}}. \]

We apply a greedy filtering process: starting from an empty set \(\hat{T}_{gen} = \emptyset\), we iterate through the candidate tags (optionally sorted by visual relevance), and sequentially add a tag \(t_j\) to \(\hat{T}_{gen}\) only if it satisfies both conditions above. Typically, we set \(\tau_{\text{rel}}=0.3\) and \(\tau_{\text{div}}=0.8\). Formally:

\[ \hat{T}_{gen} = \left\{ t_j \in T_{gen} \;\middle|\; \mathrm{sim}_{\text{CLIP}}(t_j, v) > \tau_{\text{rel}} \land \forall t' \in \hat{T}_{gen},\ \mathrm{sim}_{\text{BGE}}(t_j, t') < \tau_{\text{div}} \right\}. \]
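The greedy filtering procedure can be sketched as follows. The two similarity functions are passed in as callables because the concrete CLIP and BGE model calls are not given here; `relevance` and `pairwise_sim` are hypothetical stand-ins for \(\mathrm{sim}_{\text{CLIP}}(t, v)\) and \(\mathrm{sim}_{\text{BGE}}(t, t')\), and the sketch always takes the optional sort by visual relevance.

```python
def greedy_filter(tags, relevance, pairwise_sim, tau_rel=0.3, tau_div=0.8):
    """Build the filtered tag list by greedy selection.

    relevance(tag)      -> similarity between the tag and the visual input
    pairwise_sim(a, b)  -> semantic similarity between two tags
    Both callables are assumptions standing in for CLIP and BGE scoring.
    """
    # Visit the most visually relevant candidates first.
    ordered = sorted(tags, key=relevance, reverse=True)
    kept = []
    for tag in ordered:
        # Criterion 1: visual relevance must exceed tau_rel.
        if relevance(tag) <= tau_rel:
            continue
        # Criterion 2: the tag must differ enough from every tag already kept.
        if all(pairwise_sim(tag, other) < tau_div for other in kept):
            kept.append(tag)
    return kept
```

Since each tag is compared only against tags accepted before it, the result depends on the visiting order; sorting by relevance means that among near-duplicates the most visually grounded tag is the one retained.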

Based on the filtered tag set \(\hat{T}_{gen}\), we define two variants of Tag Gain to assess open-world tagging:

Known Tag Gain (\(\Delta_{\text{known}}\)): The number of relevant but unannotated tags that were seen during training, relative to the number of ground-truth tags:
\[ T_{\text{known}} = \left\{ t \in \hat{T}_{gen} \;\middle|\; t \notin T_{\text{gt}},\ t \in T_{\text{train}} \right\}, \] \[ \Delta_{\text{known}} = \frac{ \left| T_{\text{known}} \right| }{ \left| T_{\text{gt}} \right| }. \]
Novel Tag Gain (\(\Delta_{\text{novel}}\)): The number of relevant generated tags that are unseen in training, relative to the number of ground-truth tags:
\[ T_{\text{novel}} = \left\{ t \in \hat{T}_{gen} \;\middle|\; \ t \notin T_{\text{train}} \right\}, \] \[ \Delta_{\text{novel}} = \frac{ \left| T_{\text{novel}} \right| }{ \left| T_{\text{gt}} \right| }. \]
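A minimal per-sample sketch of the two definitions above, assuming the filtered tag set has already been computed. The function name `tag_gain` and the returned dictionary keys are illustrative choices, not part of the original protocol.

```python
def tag_gain(filtered_tags, gt_tags, train_vocab):
    """Compute Known and Novel Tag Gain for one sample.

    filtered_tags: tags that survived relevance and diversity filtering
    gt_tags:       annotated ground-truth tags for the sample
    train_vocab:   every tag observed in the training set
    """
    filtered, gt, train = set(filtered_tags), set(gt_tags), set(train_vocab)
    if not gt:
        raise ValueError("Tag Gain is undefined for a sample without ground-truth tags")
    # Relevant, not annotated for this sample, but seen during training.
    known = {t for t in filtered if t not in gt and t in train}
    # Relevant and never seen during training.
    novel = {t for t in filtered if t not in train}
    return {
        "T_known": len(known),
        "delta_known": len(known) / len(gt),
        "T_novel": len(novel),
        "delta_novel": len(novel) / len(gt),
    }
```

For example, with filtered tags `cat, sofa, windowsill, nap`, ground truth `cat, indoor`, and a training vocabulary of `cat, indoor, sofa, nap`, the known set is `{sofa, nap}` and the novel set is `{windowsill}`, giving \(\Delta_{\text{known}} = 2/2 = 1.0\) and \(\Delta_{\text{novel}} = 1/2 = 0.5\).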

#3 Results

Open-Set Evaluation

Method	CREATE-Tag T_known	CREATE-Tag Δ_known	CREATE-Tag T_novel	CREATE-Tag Δ_novel	PEXEL-Tag T_known	PEXEL-Tag Δ_known	PEXEL-Tag T_novel	PEXEL-Tag Δ_novel
Bin	1.2	28.1%	0.00	0.00%	0.8	21.4%	0.00	0.00%
ASL	1.4	32.5%	0.00	0.00%	0.9	24.6%	0.00	0.00%
Order-Free	1.4	31.6%	0.00	0.00%	1.0	27.4%	0.00	0.00%
Orderless	1.8	41.3%	0.00	0.00%	1.4	36.4%	0.00	0.00%
Tag2Text	1.7	38.5%	0.00	0.00%	1.1	30.2%	0.00	0.00%
ML-Decoder	1.8	42.2%	0.00	0.00%	1.2	33.8%	0.00	0.00%
RAM	1.8	42.5%	0.00	0.00%	1.3	34.2%	0.00	0.00%
RAM++	2.0	45.6%	0.00	0.00%	1.4	36.7%	0.00	0.00%
Open-Book	1.6	37.1%	0.15	3.43%	1.2	31.9%	0.13	3.11%
Baseline	1.6	37.3%	0.10	2.24%	1.3	34.7%	0.11	2.63%
+ OPG	2.2	52.3%	0.30	7.04%	2.0	49.1%	0.24	5.66%
+ RAG	4.8	80.3%	3.95	72.8%	4.5	94.3%	3.61	71.2%

Three Paradigms for Multimodal Tagging: (a) closed-set classification with fixed labels, (b) open-vocabulary matching with candidate tag texts, (c) open-world generation of free-form tag sequences.
