Source Identification for Generative Models

Can we provide the source of generation?

This is a problem formulation for finding the source of generation.

1. Introduction

In the training of a deep neural network, we collect a large amount of data and process it only for training purposes. This pre-processing mostly decreases the amount of information: it abandons irrelevant information in the data and provides compact representations of the training data by discarding whatever does not help the training. Filtering information in this way is essential to match the training objective efficiently and robustly; for example, classification models use features correlated with the labels, and in NLP we use natural sentences without metadata. Although such filtering can benefit the training, we also discard metadata that identifies the information, such as the source of the data. One example would be the copyright of lyrics or of the contents of a book. For example, can you identify the source of the sentence below, that is, who said it?

Creating safe AGI that benefits all of humanity
- From : ???

The source of the sentence is the website of OpenAI ©. As the source is entangled with the sentence, we gain additional information such as its authorship and copyright.

Even though there are benefits to knowing the source of the data, the source link may not be required to train neural networks; for example, the causal language modeling of GPT and the masked language modeling of BERT do not require the source link to maximize the training objective. However, the source information is highly important when our objective is not performance but tracing the source of the information that the generative model provides. In short,

Discarding irrelevant information is beneficial for the training purpose alone, but then we do not know who made the information.

With the progress of generative models, another training mechanism is required to address safety issues for both models and data.


2. Problem Definition

The fundamental problem is the identification of the source of a corpus, which we call the source identification problem (SIP). Given a corpus, scraped from a website or generated by ChatGPT, we identify the original source of the corpus. The causal link between the source $s$, the text $x$, the generator $G$, and the generated text $\hat{x}$ is as follows:

\[\begin{equation} s \rightarrow x \rightarrow G \rightarrow \hat{x} \end{equation}\]

The SIP is the problem of identifying the source of the text $x$ or of the generated text $\hat{x}$.

\[\begin{gather} x \rightarrow s \\ \hat{x} \rightarrow s \end{gather}\]

However, it is an ill-posed problem because multiple sources are possible: the mapping from a corpus to its source is neither injective nor surjective. For example, the corpus "apple is delicious" is quite a common sentence and many websites can include it (non-injective), while the corpus "fine apple is pricked by pineapple." may have no source at all and instead be generated by a generative model (non-surjective). In addition, if the text is slightly changed by adding "The", the corpora have almost the same meaning and the sources should be equal.

\[\begin{equation} \hat{x} \rightarrow \{s_1, s_2, \cdots\} \end{equation}\]

As such, the source identification problem is a problem of generating the possible multiple sources of a given corpus. Although functionally learning Equation (4) can solve the source identification problem, we do not know the actual meaning of mapping corpora to sources. One reason is that a corpus carries both semantic and lexical information, and two different corpora can carry exactly the same information.


3. Building Blocks of SIP

Before proceeding with the detailed discussion of SIP, we first give proper definitions of sources and corpora. C-corpora (copyrighted corpora) are semantic or lexical contents in a source. A source is the genuine location of c-corpora.

3.1 Copyrighted-Corpora

Which Information

A copyrighted corpus is copyrighted from two perspectives, semantic and lexical.

Ex) Creating safe AGI that benefits all of humanity

  • Semantic : AI system benefits all people
  • Lexical : Creating, safe, AGI, benefit, humanity

As such, we have the following properties for copyrighted corpora.

Which Sources

One additional property is that a corpus can be formed from the corpora of multiple sources. For example, we can combine two corpora, one from OpenAI and one from Meta, to make the following sentence. [© from Meta's Action]

Ex) Creating safe AGI that benefits all of humanity by keeping people safe and making a positive impact.

Definition of Copyrighted-Corpora

Definition [Copyrighted-Corpora]
A copyrighted corpus is a corpus whose lexical or semantic meaning is similar to a corpus from a source, or which is combined from multiple corpora originating from possibly multiple sources.

This definition is similar to how we think about copyright. Even though a sentence does not exactly match any sentence in the training dataset, multiple sentences from other sources can be used to form the sentence by extracting (1) semantic information and (2) lexical information.

3.2 Source

A source is the origin of sentences. The origin could be a company, a person, or a website that holds the copyright of the sentences. We list some possible sources.

There are several ways to provide the source link of generated sentences.

  1. Generation : directly generate the sources with labels.
  2. Watermark : inject identifiable keys into sentences which are visible at the inference step.
  3. Web search engine : search websites with a GPT and provide links.
  4. Prompt tuning : ask GPT to provide the source.
| Methods | Procedure | Pros | Cons |
|---|---|---|---|
| Generation | $G \rightarrow \hat{x} \rightarrow \hat{G} \rightarrow s$ | Easily learnable | Not scalable, hard to trust |
| Watermark | $G \rightarrow (s), h \rightarrow \hat{x}$ | Only requires inference steps | Unclear how to apply it to NLP (Fourier frequency) |
| Web search | $G \rightarrow [s_1,s_2,\cdots ] \rightarrow s_k \rightarrow \hat{x}$ | Direct identification | Does not reflect the model's knowledge at inference |
| Prompt tuning | $G \rightarrow \hat{x}, S$ | The easiest way | Hard to trust |

Identification via Generation

The simplest way is to train another GPT model to recover the source link of the sentences. When a GPT model $G$ generates sentences $\hat{x}$, another GPT-like model $\hat{G}$ generates the description of the source $s$.

\[G \rightarrow \hat{x} \rightarrow \hat{G} \rightarrow s\]
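A minimal sketch of this route, assuming a hypothetical (text, source) training pair and using GPT-2 from Hugging Face purely as a stand-in for $\hat{G}$; the `[SOURCE]` separator is an illustrative assumption, not part of the formulation above.

```python
# Sketch: fine-tune a second causal LM (G_hat) to emit a source description
# given a (generated) text. The separator "[SOURCE]" and the example pair are
# illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
g_hat = AutoModelForCausalLM.from_pretrained("gpt2")

def make_example(text, source):
    # G_hat is trained on sequences of the form "<text> [SOURCE] <source>".
    prompt = f"{text}\n[SOURCE] {source}"
    return tokenizer(prompt, return_tensors="pt", truncation=True, max_length=256)

batch = make_example("Creating safe AGI that benefits all of humanity", "OpenAI website")
loss = g_hat(**batch, labels=batch["input_ids"]).loss
loss.backward()   # one illustrative gradient step; a real run needs an optimizer loop
```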

Watermark

Previous work has been done in the vision domain, where representations are continuous. A watermark is a simple way to encode information in the representation, and it has been explored in diffusion models. Although it is a compelling way to encode a secret key in generated content, it is only available for the syntactic representation and remains unclear for the semantic meaning itself. When a GPT model $G$ generates sentences $\hat{x}$, the inner representation $h$ carries the source of the sentences.

\[G \rightarrow (s), h \rightarrow \hat{x}\]

WebGPT and Bing

One way to prevent copyright issues is to make the model directly use the corpus of a source at inference time. This method can provide the source link to the end users. Recently, WebGPT uses the Bing search engine to improve factuality and to provide external links [OpenAI's blog].

\[G \rightarrow [s_1,s_2,s_3,\cdots ] \rightarrow s_k \rightarrow \hat{x}\]
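A minimal sketch of this route under a strong simplification: instead of a real search engine, a small in-memory candidate pool is ranked with TF-IDF similarity and the best-matching link is attached to the generated sentence. The URLs and candidate texts are made up for illustration.

```python
# Sketch: retrieve a candidate source for a generated sentence and surface the link.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

candidates = {                                  # stand-in for search engine results
    "https://openai.com/about": "Creating safe AGI that benefits all of humanity",
    "https://example.org/fruit": "Apple is a delicious fruit enjoyed worldwide",
}
generated = "Creating safe AGI that benefits all of humanity"

vectorizer = TfidfVectorizer().fit(list(candidates.values()) + [generated])
scores = cosine_similarity(vectorizer.transform([generated]),
                           vectorizer.transform(list(candidates.values())))[0]
best_url = list(candidates)[scores.argmax()]
print(f"{generated}\n[source] {best_url}")
```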

GPT Prompt tuning

One direct way is to ask the model to provide the source link [related post].

Please provide sources for the previous answer
Please provide URL sources
Please provide 10 URL sources

\[G \rightarrow \hat{x}, S\]

Methods

We propose a source-generation fine-tuning framework to learn this auxiliary task.

Figure 1. An illustration of token communication in a fine-tuned GPT for the source generation task. In this example, the generated text is composed of two texts whose sources are Wikipedia and the Oxford dictionary.
Figure 2. Framework of the source generation task. (1) Pre-training data is prepared without source tags. (2) A GPT model is trained on the pre-training data by language modeling. (3) The pre-training data is tagged with source information. (4) The GPT model is fine-tuned to generate the source of the generated texts.
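A minimal sketch of steps (3) and (4) of Figure 2, assuming a `<SRC>` special token and two toy records; GPT-2 stands in for the pre-trained GPT model.

```python
# Sketch: (3) tag pre-training texts with their source, (4) fine-tune the same
# model with ordinary causal language modeling on the tagged corpus.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_special_tokens({"additional_special_tokens": ["<SRC>"]})
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

records = [  # illustrative (text, source) records
    {"text": "Anarchism is a political philosophy ...", "source": "Wikipedia"},
    {"text": "A concise definition of the word ...", "source": "Oxford dictionary"},
]
tagged = [f'{r["text"]} <SRC> {r["source"]}' for r in records]   # step (3)

batch = tokenizer(tagged, return_tensors="pt", padding=True, truncation=True)
loss = model(**batch, labels=batch["input_ids"]).loss            # step (4)
# In a real run, padded positions should be masked out of the labels.
```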

Watermark

Defender: Watermark 🌊

For content $x$, a watermark $b$ is given to protect the content. The defender encodes an invisible watermark into the content with an encoder $E$ that takes the content and the watermark and generates the watermarked content $x_{wm}$:

\[x_{wm} = E(x,b)\]

The decoder $D$ takes $x_{wm}$ and recovers the watermark $b$.

Attacker: Distort Watermark 👾

The attacker tries to modify the watermarked content $x_{wm}$ into an attacked content $\tilde{x}_{wm}$ with a modifier $g$, in order to remove the encoded information $b$.

\[\tilde{x}_{wm} = g(x_{wm})\]

The goal of the attacker is to minimize the recovery of watermark in the content.

\[\min L(b, D(\tilde{x}_{wm}))\]

where $L$ is a matching score such as bitwise accuracy.
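A toy sketch of the defender/attacker game above in a continuous (image-like) domain; $E$, $D$, and $g$ below are deliberately simplistic stand-ins, not a real watermarking scheme (this toy decoder even needs the original content as a reference).

```python
# Sketch: encoder E hides bits as a small perturbation, decoder D recovers them,
# attacker g adds noise to destroy them; bitwise accuracy measures recovery.
import torch

def E(x, b, eps=0.01):
    return x + eps * (2 * b.float() - 1)          # embed bits {0,1} as +/- eps

def D(x_wm, x_ref):
    return ((x_wm - x_ref) > 0).long()            # recover bits from the perturbation sign

def g(x_wm, noise=0.02):
    return x_wm + noise * torch.randn_like(x_wm)  # attacker distortion

def bitwise_accuracy(b, b_hat):
    return (b == b_hat).float().mean().item()

x = torch.rand(64)                 # original content
b = torch.randint(0, 2, (64,))     # watermark bits
x_wm = E(x, b)
print("clean recovery :", bitwise_accuracy(b, D(x_wm, x)))
print("after attack   :", bitwise_accuracy(b, D(g(x_wm), x)))
```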

Provenance Detection


Watermark videos

A pseudorandom function $f(w_{t-c+1}, \cdots, w_{t-1}, i)$ maps the latest $c$ tokens and a vocabulary index $i$ to $r_{t,i} \in [0,1]$; these values form a pseudo probability distribution.
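A minimal sketch of such a pseudorandom function, assuming a shared secret key and SHA-256 hashing; this illustrates the interface only, not a specific published watermarking scheme.

```python
# Sketch: hash the last c token ids together with a candidate index i (and a
# secret key) into a pseudo-random value r_{t,i} in [0, 1).
import hashlib

SECRET_KEY = b"watermark-key"   # assumed to be shared by generator and detector

def prf(context_tokens, i, c=4):
    window = tuple(context_tokens[-c:])                 # latest c tokens
    payload = SECRET_KEY + repr((window, i)).encode()
    digest = hashlib.sha256(payload).digest()
    return int.from_bytes(digest[:8], "big") / 2**64    # deterministic value in [0, 1)

r = prf([101, 2023, 318, 257], i=42)   # r_{t,i} used to bias sampling / run detection
```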

Parameter Efficient Fine-tuning


2023.09.23 : Given A Language Model and Any Text, Verify the Source Prediction Performance

Let $f$ be a generative language model and $\theta$ its parameters trained on a large corpus. Our goal is to verify the source $y$ of the text tokens $x=(x_1, x_2, \cdots, x_n)$. To decode the source of the text, we use the logits of the decoder outputs, which we call the traced logits of the text $x$.

\[T(x) = (\hat{p}_2, \hat{p}_3, \cdots, \hat{p}_{n+1})\]

where $\hat{p}_i = P( \cdot \vert x_1, x_2, \cdots, x_{i-1} ; \theta) \in \mathbb{R}^{V}$ is the conditional probability distribution over the vocabulary $V$ given the previous tokens.

These logits differ from previous watermark-based data provenance: a watermark assumes that the text is the decoded output of the generative model, while our method makes no such assumption. Therefore, we consider how a generative model encodes any given text.
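A minimal sketch of computing $T(x)$ with a small causal LM (Pythia-70m here, purely for illustration): the text does not need to be an output of the model.

```python
# Sketch: run any text through a causal LM and keep the per-position conditional
# distributions over the vocabulary as the traced logits T(x).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")

def traced_logits(text):
    ids = tokenizer(text, return_tensors="pt").input_ids      # shape (1, n)
    with torch.no_grad():
        logits = model(ids).logits                            # shape (1, n, V)
    # Row i (1-indexed) is p_hat_{i+1} = P(. | x_1, ..., x_i), so the rows form
    # T(x) = (p_hat_2, ..., p_hat_{n+1}).
    return torch.softmax(logits[0], dim=-1)                   # shape (n, V)

T = traced_logits("Creating safe AGI that benefits all of humanity")
```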


2023.10.09

๋ชจ๋ธ์ด ๋ฌธ์žฅ์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ์•”๊ธฐํ•˜๊ณ  ์žˆ์œผ๋ฉด์„œ, ํŠน์ • ๋ ˆ์ด๋ธ”์— ๋Œ€ํ•œ ํ™•๋ฅ ๊ฐ’์ด ๋†’์€ ๊ฒฝ์šฐ, ํ•ด๋‹น ์›์ฒœ ์†Œ์Šค์— ๋Œ€ํ•œ ์ถœ์ฒ˜์˜ ํ™•์‹ ์ด ๋†’์•„์ง„๋‹ค.

We have to decide which kinds of negative samples to use. For example, we can consider cases ranging from only exactly matching sentences, to sentences with similar meaning, to sentences that use similar vocabulary. We can also compare the original data with samples that the GPT model generates from that source. The required level of generalization for source detection differs across these cases, which means the type and performance of the prediction model must differ accordingly. With soft negative samples, even a simple model can detect the source. However, if two sources contain similar data, distinguishing the two classes is not easy, and we have to consider which features to use to separate them. In the end, the classifier


2023.10.10 : Extending the Training Data: XML Data Analysis and Addition

====================
** INCLUDE **
LF-Amazon-131K
trn
From label 294805 131073
294805it [00:00, 1010858.06it/s]
294805it [00:00, 2783772.85it/s]
num inputs: 294805
num outputs: 294805
['Methodical Bible study: A new ', 'GeoPuzzle U.S.A. and Canada - ']
['4315:1.0\n', '112532:1.0 113827:1.0\n']
====================
Wikipedia-500K
trn
From label 1813391 501070
1813391it [00:11, 163675.08it/s]
1813391it [00:00, 2942573.18it/s]
num inputs: 1813391
num outputs: 1813391
['Anarchism redirect2|anarchis', 'Albedo other uses use dm']
['81199:1.0 83757:1.0 83805:1.0 ', '144177:1.0 144212:1.0 182348:1']


2023.10.11 : Cross Entropy with Increasing Number of Labels Experiment

We first simply collect the results of training a classifier and then improve the training method incrementally, because the SIP problem is ultimately close to an XML problem. Before that, it is necessary to report the results of a plain classifier. In doing so, we should compare training with the cross entropy loss and with the binary cross entropy loss.

Let $S_i$ be the set of positive labels and $N_i$ a shortlist of negative labels for instance $i$. The standard training loss for a classification problem is the cross entropy (CE) loss, defined as follows:

\[\mathcal{L}_{CE} = -\sum_i^N \sum_{\ell \in S_i} y_{i\ell} \log \frac{\exp{(W_{\ell} z_i)}}{\sum_{\ell' } \exp{ (W_{\ell'} z_i) }}\]

With the PG19 dataset and 100 samples per label, we evaluate a basic ReLU-based encoder on the classification problem.

# Model Spec linear_hidden_size=1024 linear_activation=relu linear_n_layers=2
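A minimal sketch of this baseline, assuming GPT hidden states of dimension 512 (the true dimension depends on the backbone); the encoder follows the model spec above and the classifier rows play the role of $W_{\ell}$ in the CE loss.

```python
# Sketch: 2-layer ReLU encoder over GPT hiddens + linear classifier + CE loss.
import torch
import torch.nn as nn

num_labels, hidden_size, gpt_dim = 1000, 1024, 512   # gpt_dim is an assumption

encoder = nn.Sequential(
    nn.Linear(gpt_dim, hidden_size), nn.ReLU(),
    nn.Linear(hidden_size, hidden_size), nn.ReLU(),
)
classifier = nn.Linear(hidden_size, num_labels)       # rows correspond to W_l
criterion = nn.CrossEntropyLoss()

z = encoder(torch.randn(32, gpt_dim))                 # a batch of GPT hidden states
y = torch.randint(0, num_labels, (32,))               # source labels
loss = criterion(classifier(z), y)
loss.backward()
```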

We observe that the training is not scalable as the number of labels increases. We draw two conclusions from this experiment.

Training

Evaluation

Changed loss: CE → BCE

Binary cross entropy loss is defined as follows:

\[\mathcal{L}_{BCE} = -\sum_i^N \sum_{\ell \in S_i \cup N_i} \left[ y_{i\ell} \log (\sigma (W_{\ell} z_i )) + (1-y_{i\ell}) \log (1- \sigma (W_{\ell} z_i )) \right]\]

where $\sigma$ is sigmoid function $\sigma(x) = \frac{1}{1+\exp{(-x)}}$.
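A minimal sketch of the one-versus-all BCE objective: every label gets an independent sigmoid, so the loss can later be restricted to a shortlist of labels instead of the full label set. Sizes are illustrative.

```python
# Sketch: BCE over per-label logits W_l z_i with multi-hot targets.
import torch
import torch.nn as nn

num_labels, dim = 1000, 1024
W = nn.Linear(dim, num_labels)                # W_l as the rows of a linear layer
criterion = nn.BCEWithLogitsLoss()            # applies the sigmoid internally

z = torch.randn(32, dim)                      # encoded inputs z_i
targets = torch.zeros(32, num_labels)
targets[torch.arange(32), torch.randint(0, num_labels, (32,))] = 1.0   # positives

loss = criterion(W(z), targets)               # all labels are scored here;
loss.backward()                               # a shortlist would slice the columns
```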

Considerations on the Input (GPT Hiddens)

Key question: is learning possible because the features are meaningful features of the words, or because the representation space is already separated by class? That is, even without memorizing the sentences, is it possible to find the source simply because the representations differ?

If the GPT model has learned the sources so that the features of subsequent tokens are distinguishable per class, these features can be used to find the source. If the GPT model carries enough information to memorize the source of a given sentence, does it also have a representation space meaningful enough to map the sentence to its source?

  1. The representation spaces are separated enough to memorize the sources individually.
  2. The representations are meaningful enough to memorize the sources.

2023.10.13 Finalizing the Training Structure of the Paper

1. Performance Comparison Across Training Settings

1.1 Plain Training

We discuss BCE vs. CE and compare the two training results.

Increasing Number of Labels 
Increasing Number of Words 
Increasing Model Size 

1.2 DeepXML Setting

We check the performance when additional training is applied on top of BCE.

The additional training consists of the following steps (a small shortlisting sketch follows the notes below):
1. Pseudo-label training via clustering
2. Hard negative sampling
3. Training (DeepXML framework)

Increasing Number of Labels 
Increasing Number of Words 
Increasing Model Size 

* Note 1 : We can first run a primary training pass over all source data.
* Note 2 : Afterwards, we can fine-tune on additional source data.
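A minimal sketch of the clustering and hard-negative shortlisting steps, with random vectors standing in for learned label and instance features; the DeepXML-style stage-wise training itself is not shown.

```python
# Sketch: k-means over label centroids gives pseudo (meta) labels; the most
# confusable non-positive labels per instance form the hard-negative shortlist.
import numpy as np
from sklearn.cluster import KMeans

num_labels, dim, k = 1000, 128, 32
label_emb = np.random.randn(num_labels, dim)            # stand-in for label centroids
meta_labels = KMeans(n_clusters=k, n_init=10).fit_predict(label_emb)  # step 1

def hard_negatives(z, positives, top=50):               # step 2
    scores = label_emb @ z                               # score every label for instance z
    scores[list(positives)] = -np.inf                    # exclude the true labels
    return np.argsort(-scores)[:top]                     # hardest (most similar) negatives

shortlist = hard_negatives(np.random.randn(dim), positives={3, 17})
```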

2. Encoder Ablation

* Simplest 
* Deep MLP 
* RNN
* Transformer Encoder 

2023.10.13 : XML Literature Review

We summarize recent XML research and introduce the parts to be used in the SIP paper.

As the training of source identification is not scalable with the CE loss, the alternative is the binary cross entropy (BCE) loss, which is widely used in the one-versus-all training framework [1,2,3] and in the field of extreme classification (EC) or extreme multi-label classification (XML). Deep learning-based XML methods have three properties in common: feature learning, negative label shortlisting, and classifier training [3]. We follow the XML training framework to train the source identification problem with integer labels. We do not utilize label features; for example, the book name is not used.
Although recent works utilize label features to handle the lack of training samples [5,6,7], source identification cannot directly use label features, because the input GPT hidden representations are not sentence embeddings and the similarity between labels and inputs is not guaranteed. Therefore, the SIP problem is an EC problem without label features.

AttentionXML forms a tree over label features [6]. SiameseXML extends a zero-shot Siamese architecture to the few-shot classification problem [7]. Renee proposes an optimization loss to stabilize training and scales training to a far more extreme number of labels (1B) [2]. DEXA [4] uses additional parameters to complement the gap of pure label feature embeddings.

[1] Rifkin, Ryan, and Aldebaro Klautau. "In Defense of One-vs-All Classification." The Journal of Machine Learning Research 5 (2004): 101-141.

[2] Jain, Vidit, et al. "Renee: End-to-End Training of Extreme Classification Models." Proceedings of Machine Learning and Systems 5 (2023). (Renee)

[3] Dahiya, Kunal, et al. "DeepXML: A Deep Extreme Multi-label Learning Framework Applied to Short Text Documents." Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 2021. (DeepXML)

[4] Dahiya, Kunal, et al. "Deep Encoders with Auxiliary Parameters for Extreme Classification." Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2023. (DEXA)

[5] Jiang, Ting, et al. "LightXML: Transformer with Dynamic Negative Sampling for High-Performance Extreme Multi-label Text Classification." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. No. 9. 2021. (LightXML)

[6] You, Ronghui, et al. "AttentionXML: Label Tree-based Attention-aware Deep Model for High-Performance Extreme Multi-label Text Classification." Advances in Neural Information Processing Systems 32 (2019). (AttentionXML)

[7] Dahiya, Kunal, et al. "SiameseXML: Siamese Networks Meet Extreme Classifiers with 100M Labels." International Conference on Machine Learning. PMLR, 2021. (SiameseXML)

[8] Bhatia, K., Dahiya, K., Jain, H., Kar, P., Mittal, A., Prabhu, Y., and Varma, M. "The Extreme Classification Repository: Multi-label Datasets and Code." 2016. http://manikvarma.org/downloads/XC/XMLRepository.html


2023.10.16 : Gathering Hiddens

Since we learn a mapping from the distribution of GPT hiddens to the source, we first collect the GPT hiddens (the last representations before the vocabulary projection) and train only the prediction task separately. Combining XML training and GPT hiddens directly uses too much memory. For example, the PG19 data has more than 20,000 labels; collecting 100 samples per label requires storing a tensor of size (2,000,000, Tokens, Dims). Therefore, we collect 100 samples for 1,000 labels and try replacing the CE loss with binary CE, as well as with the loss proposed in Renee.
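A minimal sketch of the collection step, using Pythia-70m as an example backbone; the sample list and the shard file name are placeholders.

```python
# Sketch: store, for each text, the last hidden states (before the vocabulary
# projection) together with its source label.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
model = AutoModel.from_pretrained("EleutherAI/pythia-70m")

def gather_hiddens(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.last_hidden_state[0]                    # shape (tokens, dims)

samples = [("Some passage from a PG19 book ...", 42)]  # (text, label) placeholders
features = [(gather_hiddens(t).cpu(), y) for t, y in samples]
torch.save(features, "hiddens_shard_000.pt")
```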

| Models | Amazon131K | PG19 | Wikipedia500 |
|---|---|---|---|
| Pythia-70m | ✅ | ✅ | |
| Pythia-160m | ✅ | - | |
| Pythia-410m | ✅ | - | |
| Pythia-1b | ✅ | | |
| Pythia-1.4b | ✅ | | |
| Pythia-2.8b | ✅ | | |
| Pythia-6.9b | ✅ | | |
| Pythia-12b | ✅ | | |
| LLama-chat-7b | - | | |
| LLama-7b | - | | |
| LLama-chat-13b | - | | |
| LLama-13b | - | | |
| GPT2-XL | - | | |
| OPT-350m | - | | |

Data stats