Can we provide the source of generation?
This is a problem formulation for finding the source of generation.
To train a deep neural network, we collect a large amount of data and pre-process it solely for the training objective.
The pre-processing mostly reduces the amount of information: it discards irrelevant parts of the data and provides compact representations of the training data. This filtering is essential to match the training objective efficiently and robustly.
Creating safe AGI that benefits all of humanity
- From : ???
The source of the sentence is the website of OpenAI ©. As the source is entangled with the sentence, we obtain additional information, such as who holds the copyright and where the sentence is located.
Even though there are benefits to knowing the source location of the data, the source link may not be required to train neural networks.
For the training purpose alone, discarding irrelevant information is beneficial. But then we no longer know who made the information.
With the progress of generative models, another training mechanism is required to address safety issues for both models and data.
The fundamental problem is the identification of the source of a corpus, called the source identification problem (SIP). Given a corpus, scraped from a website or generated by ChatGPT, we identify its original source. The causal link between source $s$, text $x$, generator $G$, and generated text $\hat{x}$ is as follows:
\[\begin{equation} s \rightarrow x \rightarrow G \rightarrow \hat{x} \end{equation}\]The SIP is the problem of identifying the source of the text $x$ or of the generated text $\hat{x}$:
\[\begin{gather} x \rightarrow s \\ \hat{x} \rightarrow s \end{gather}\]However, this is an ill-posed problem because multiple sources are possible: finding the source of a corpus is neither injective nor surjective. For example, the corpus "apple is delicious" is a quite common sentence, so many websites can contain it (non-injective), while the corpus "fine apple is pricked by pineapple." may have no source at all but be generated by a generative model (non-surjective). In addition, if the text is slightly changed by adding "The", the corpora have almost the same meaning and the sources should be equal.
\[\begin{equation} \hat{x} \rightarrow \{s_1, s_2, \cdots\} \end{equation}\]As such, the source identification problem is a problem of generating the possible multiple sources of a given corpus. Although functionally learning Equation (4) can solve the source identification problem, we do not know the actual meaning of mapping corpora to sources. One reason is that a corpus carries both semantic and lexical information, and two different corpora can carry exactly the same information.
Before proceeding to a detailed discussion of SIP, we give proper definitions for sources and corpora. C-corpora (copyrighted corpora) are semantic or lexical contents in a source. A source is the genuine location of c-corpora.
A copyrighted corpus is copyrighted from two perspectives, semantic and lexical.
Ex) Creating safe AGI that benefits all of humanity
- Semantic : AI system benefits all people
- Lexical : Creating, safe, AGI, benefit, humanity
As such, we have the following properties for copyrighted corpora:

Semantic Source
: the semantic meaning of a copyrighted corpus comes from the source

Lexical Source
: the lexical information of a copyrighted corpus comes from the source

One additional property is that a corpus can be formed from multiple source corpora. For example, we can combine two corpora, one from OpenAI and one from Meta, to make the following sentence. [© from Meta's Action]
Ex) Creating safe AGI that benefits all of humanity by keeping people safe and making a positive impact.
Definition [Copyrighted-Corpora]
A copyrighted corpus is a corpus whose lexical or semantic content is similar to a corpus from a source, or is formed by combining multiple corpora originating from possibly multiple sources.
This definition is similar to how we think about copyright. Even though a sentence does not exactly match any sentence in the training dataset, multiple sentences from other sources can be used to form it by extracting (1) semantic information and (2) lexical information.
A source is the origin of sentences. The origin could be a company, a person, or a website that holds the copyright of the sentences. We list some possible sources.
Company
: The company that holds the copyright of the overall contents (OpenAI, Meta, …)

Person
: The labeled person who spoke or wrote the sentences (Martin Luther King, …)

Website
: A well-known website (Wiki, Reddit, …)

Hyperlink
: A specific link of a website (https://distill.pub/2020/circuits/)

There are several ways to provide the source link of generated sentences.
| Methods | Procedure | Pros | Cons |
|---|---|---|---|
| Generation | $G \rightarrow \hat{x} \rightarrow \hat{G} \rightarrow s$ | Easily learnable | Not scalable, hard to trust |
| Watermark | $G \rightarrow (s), h \rightarrow \hat{x}$ | Only requires inference steps | Unclear how to apply to NLP (Fourier frequency) |
| Web Search | $G \rightarrow [s_1,s_2,\cdots ] \rightarrow s_k \rightarrow \hat{x}$ | Direct identification | Does not provide the model's knowledge at inference |
| Prompt Tuning | $G \rightarrow \hat{x}, S$ | The easiest way | Hard to trust |
The simplest way is to train another GPT model to recover the source link of the sentences. When a GPT model $G$ generates sentences $\hat{x}$, another GPT-like model $\hat{G}$ generates the description of the sources $S$.
\[G \rightarrow \hat{x} \rightarrow \hat{G} \rightarrow s\]Previous work has been done in the vision domain, which has continuous representations.
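As a rough illustration of this approach, the sketch below fine-tunes a second causal language model on (text, source) pairs so that it learns to complete a text with a description of its source; the checkpoint ("gpt2") and the "SOURCE:" prompt format are placeholder assumptions, not a fixed design.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Minimal sketch: fine-tune a GPT-like model \hat{G} to generate the source
# description after the text. "gpt2" and the "SOURCE:" format are placeholders.
tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
g_hat = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.AdamW(g_hat.parameters(), lr=5e-5)

def train_step(text, source):
    # Concatenate the (possibly generated) text with its source description.
    prompt = f"{text}\nSOURCE: {source}{tok.eos_token}"
    batch = tok(prompt, return_tensors="pt", truncation=True, max_length=512)
    out = g_hat(**batch, labels=batch["input_ids"])  # causal LM loss over the sequence
    opt.zero_grad()
    out.loss.backward()
    opt.step()
    return out.loss.item()
```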
Watermarking is a simple way to encode information in the representation and has previously been explored in diffusion models.
One way to prevent copyright issues is to make a model directly use the corpus from a source in the inference procedure. This method can provide the source link to the end users. Recently, WebGPT used the Bing search engine to improve factuality and to provide external links.
One direct way is to ask the model to provide the source link [related post].
\[G \rightarrow \hat{x}, S\]Example prompts:
- Please provide sources for the previous answer
- Please provide URL sources
- Please provide 10 URL sources
We propose a source-generation fine-tuning framework to learn this auxiliary task.
For content $x$, a watermark $b$ is given to protect the content. The defender encodes an invisible watermark into the content with an encoder $E$ that takes the content and the watermark and generates the watermarked content $x_{wm}$:
\[x_{wm} = E(x,b)\]The decoder $D$ takes $x_{wm}$ and recovers the watermark $b$.
The attacker tries to modify the watermarked content $x_{wm}$ into an attacked content $\tilde{x}_{wm}$ with a modifier $g$ to remove the encoded information $b$:
\[\tilde{x}_{wm} = g(x_{wm})\]The goal of the attacker is to minimize the recovery of the watermark from the content:
\[\min_g L(b, D(\tilde{x}_{wm}))\]where $L$ is a matching score, such as bitwise accuracy, between the original and the recovered watermark.
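The following toy sketch makes the roles of $E$, $D$, $g$, and $L$ concrete on a continuous content vector; the spread-spectrum style carrier, the additive-noise attacker, and all dimensions are illustrative assumptions rather than the actual watermarking scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
D, B = 512, 32                                 # content dimension, number of watermark bits
carrier = rng.standard_normal((B, D))          # defender's secret carrier directions

def encode(x, b, alpha=0.05):
    """E(x, b): embed bits b in {0,1}^B into the content vector x."""
    signs = 2.0 * np.asarray(b) - 1.0          # bits -> ±1
    return x + alpha * signs @ carrier         # x_wm = E(x, b)

def decode(x_wm):
    """D(x_wm): recover the bits from the sign of the projection onto the carriers."""
    return (carrier @ x_wm > 0).astype(int)

def attack(x_wm, sigma=0.1):
    """g(x_wm): additive-noise attacker trying to destroy the embedded bits."""
    return x_wm + sigma * rng.standard_normal(x_wm.shape)

x = rng.standard_normal(D)
b = rng.integers(0, 2, B)
x_tilde = attack(encode(x, b))
bit_acc = (decode(x_tilde) == b).mean()        # matching score L(b, D(x_tilde))
print(f"bitwise accuracy after attack: {bit_acc:.2f}")
```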
A pseudorandom function $f(w_{t-c+1}, \cdots, w_{t-1}, i)$ maps the latest tokens $w_{t-c+1}, \cdots, w_{t-1}$ and a vocabulary index $i$ to a value $r_{t,i} \in [0,1]$. This acts as a pseudo-probability distribution over the vocabulary.
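One hash-based way to construct such a pseudorandom function is sketched below; the SHA-256 construction and the window size are assumptions for illustration, not a specific published scheme.

```python
import hashlib

def pseudorandom_score(prev_tokens, i, c=4):
    """f(w_{t-c+1}, ..., w_{t-1}, i): map the context window of previous token ids
    and a vocabulary index i to a deterministic pseudorandom value r_{t,i} in [0, 1)."""
    context = prev_tokens[-(c - 1):]                      # w_{t-c+1}, ..., w_{t-1}
    key = ",".join(map(str, context)) + f"|{i}"
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64      # normalize 64 bits to [0, 1)
```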
Let $f$ be a generative language model with parameters $\theta$ trained on a large corpus. Our goal is to verify the source $y$ of text tokens $x=(x_1, x_2, \cdots, x_n)$. To decode the source of the text, we use the logits of the decoder outputs, which we call the traced logits of the text $x$.
\[T(x) = (\hat{p}_2, \hat{p}_3, \cdots, \hat{p}_{n+1})\]where $\hat{p}_i = P( \cdot \vert x_1, x_2, \cdots, x_{i-1} ; \theta) \in \mathbb{R}^{V}$ is the conditional probability distribution over the vocabulary of size $V$ given the previous tokens.
These logits differ from previous watermark-based data provenance: watermarking assumes that the text is the decoded output of the generative model, while our method makes no such assumption. Therefore, we consider how a generative model encodes arbitrary text.
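As a minimal sketch (assuming a HuggingFace GPT-2 checkpoint as the generative model), the traced logits can be read off a single forward pass, since the output at position $i$ already conditions on $x_1, \cdots, x_i$:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def traced_logits(text):
    """Return T(x) = (p_2, ..., p_{n+1}) for any text, generated or not."""
    ids = tok(text, return_tensors="pt").input_ids      # (1, n)
    logits = model(ids).logits                          # (1, n, V)
    probs = torch.softmax(logits, dim=-1)[0]            # row j conditions on x_1..x_{j+1}
    return probs                                        # (n, V)

def traced_feature(text):
    """Mean-pool the traced logits into a fixed-size feature for a source classifier."""
    return traced_logits(text).mean(dim=0)              # (V,)
```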
If the model memorizes information about a sentence, a high probability for a specific label increases our confidence about the corresponding source.
We also have to decide which kinds of negative samples to use. For example, we can consider cases ranging from only exactly matching sentences, to sentences with similar meaning, to sentences that use similar vocabulary. In addition, for a given source text, we can compare the original with samples generated by GPT. The required level of generalization differs per source detection setting, which means that the type and the performance of the prediction model must differ as well. With soft negative samples, even a simple model can detect the source; however, if the data come from similar sources, distinguishing the two classes is not easy, and we must think about which features can separate them. In the end, the classifier ...
```
====================
** INCLUDE **
LF-Amazon-131K
trn
From label 294805 131073
294805it [00:00, 1010858.06it/s]
294805it [00:00, 2783772.85it/s]
num inputs: 294805
num outputs: 294805
['Methodical Bible study: A new ', 'GeoPuzzle U.S.A. and Canada - ']
['4315:1.0\n', '112532:1.0 113827:1.0\n']
====================
Wikipedia-500K
trn
From label 1813391 501070
1813391it [00:11, 163675.08it/s]
1813391it [00:00, 2942573.18it/s]
num inputs: 1813391
num outputs: 1813391
['Anarchism redirect2|anarchis', 'Albedo other uses use dm']
['81199:1.0 83757:1.0 83805:1.0 ', '144177:1.0 144212:1.0 182348:1']
```
To start simply, we first collect results from training a plain classifier and then gradually improve the training method; ultimately, the SIP problem is close to an XML problem. Before that, it is necessary to report the results of training a simple classifier, comparing training with cross-entropy loss against binary cross-entropy loss.
Let $S_i$ be the set of positive labels and $N_i$ a shortlist of negative labels for sample $i$. The standard training loss for a classification problem is the cross-entropy (CE) loss, defined as follows:
\[\mathcal{L}_{CE} = -\sum_i^N \sum_{\ell \in S_i} y_{i\ell} \log \frac{\exp{(W_{\ell} z_i)}}{\sum_{\ell' } \exp{ (W_{\ell'} z_i) }}\]With the PG19 dataset and 100 samples per label, we evaluate a basic ReLU-based encoder for the classification problem.
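A minimal sketch of this baseline, assuming mean-pooled GPT hidden states as inputs and placeholder sizes:

```python
import torch
import torch.nn as nn

D_HIDDEN, D_ENC, N_LABELS = 512, 256, 1000   # placeholder sizes, not the exact PG19 setup

encoder = nn.Sequential(nn.Linear(D_HIDDEN, D_ENC), nn.ReLU())
classifier = nn.Linear(D_ENC, N_LABELS)      # rows are the W_l in the CE loss
opt = torch.optim.Adam(list(encoder.parameters()) + list(classifier.parameters()), lr=1e-3)
ce = nn.CrossEntropyLoss()

def ce_step(z, y):
    """z: (batch, D_HIDDEN) pooled hidden states, y: (batch,) integer source labels."""
    loss = ce(classifier(encoder(z)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```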
We observe that the training is not scalable as the number of labels increases. We draw two conclusions from this experiment.
The binary cross-entropy (BCE) loss is defined as follows:
\[\mathcal{L}_{BCE} = -\sum_i^N \sum_{\ell \in S_i \cup N_i} \left[ y_{i\ell} \log (\sigma (W_{\ell} z_i )) + (1-y_{i\ell}) \log (1- \sigma (W_{\ell} z_i )) \right]\]where $\sigma$ is the sigmoid function $\sigma(x) = \frac{1}{1+\exp{(-x)}}$ and the sum runs over both the positive labels and the shortlisted negatives.
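A corresponding one-versus-all sketch, evaluating the loss only on the positives $S_i$ and a shortlist of negatives; the random shortlist here is a placeholder for a hard-negative shortlist:

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def bce_step(logits, pos, n_neg=50):
    """logits: (N_LABELS,) scores W_l z_i for one sample; pos: positive label ids S_i."""
    pos = torch.as_tensor(pos)
    neg = torch.randint(0, logits.shape[0], (n_neg,))   # stand-in for a hard shortlist N_i
    neg = neg[~torch.isin(neg, pos)]                    # keep negatives disjoint from S_i
    shortlist = torch.cat([pos, neg])
    targets = torch.cat([torch.ones(len(pos)), torch.zeros(len(neg))])
    return bce(logits[shortlist], targets)
```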
Key question: is training possible because the features are meaningful per word, or because the representation space is separated per class? That is, even if the model does not memorize the sentences, can we find the source simply because the representations differ?
If a GPT model has learned the source data so that the features of subsequent tokens are distinguishable per class, these features can be used to find the source. If the GPT model carries enough information to memorize the source of a sentence, does it also have a representation space meaningful enough to map the text to its source?
1. Performance comparison across training settings

1.1 Simple training

We compare the two training results while discussing BCE and CE.
* Increasing Number of Labels
* Increasing Number of Words
* Increasing Model Size
1.2 DeepXML setting
We check the performance when additional training is performed on top of BCE.
The additional training consists of the following steps (a minimal sketch of the first two steps follows the list):
1. Pseudo-label training via clustering
2. Hard Negative Sampling
3. Training (DeepXML Framework)
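A minimal sketch of the first two steps, assuming each label is represented by the mean pooled hidden state of its training samples (all names and sizes are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_labels(label_centroids, n_clusters=64):
    """Step 1: group labels into clusters that act as pseudo (meta) labels."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(label_centroids)
    return km.labels_, km.cluster_centers_

def hard_negative_shortlist(z, positives, label_centroids, cluster_ids, centers, k=100):
    """Step 2: shortlist hard negatives for an input embedding z by first picking the
    most similar clusters, then the most similar non-positive labels inside them."""
    top_clusters = np.argsort(-(centers @ z))[:4]
    candidates = np.where(np.isin(cluster_ids, top_clusters))[0]
    candidates = candidates[~np.isin(candidates, positives)]
    scores = label_centroids[candidates] @ z
    return candidates[np.argsort(-scores)[:k]]
```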
* Increasing Number of Labels
* Increasing Number of Words
* Increasing Model Size
* Note 1 : Primary training can be performed on all source data.
* Note 2 : Afterwards, fine-tuning can be performed on additional source data.
2. Encoder Ablation
* Simplest
* Deep MLP
* RNN
* Transformer Encoder
We summarize recent research on XML and outline the content to be written in the SIP paper.
As training for source identification is not scalable with the CE loss, the alternative is the binary cross-entropy (BCE) loss, which is widely used in the one-versus-all training framework [1,2,3] and in the field of extreme classification (EC), also called extreme multi-label classification (XML). Deep learning-based XML methods have three components in common: feature learning, negative label shortlisting, and classifier training [3]. We follow the XML training framework to train the source identification problem with integer labels. We do not utilize label features; for example, the book name is not used.
Although recent works utilize label features to handle the lack of training samples [5,6,7], source identification cannot directly use label features: the GPT hidden representations used as input embeddings are not sentence embeddings, so similarity between labels and inputs is not guaranteed. Therefore, the SIP problem is an EC problem without label features.
AttentionXML forms a tree over label features [6]. SiameseXML extends a zero-shot Siamese architecture to the few-shot classification problem [7]. Renee proposes an optimization loss to stabilize training and scales training to a far more extreme number of labels (1B) [2]. DEXA [4] uses additional parameters to complement the gap of pure label feature embeddings.
[1] Rifkin, Ryan, and Aldebaro Klautau. "In defense of one-vs-all classification." The Journal of Machine Learning Research 5 (2004): 101-141.
[2] Jain, Vidit, et al. "Renee: End-to-End Training of Extreme Classification Models." Proceedings of Machine Learning and Systems 5 (2023). (Renee)
[3] Dahiya, Kunal, et al. "DeepXML: A deep extreme multi-label learning framework applied to short text documents." Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 2021. (DeepXML)
[4] Dahiya, Kunal, et al. "Deep Encoders with Auxiliary Parameters for Extreme Classification." Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2023. (DEXA)
[5] Jiang, Ting, et al. "LightXML: Transformer with dynamic negative sampling for high-performance extreme multi-label text classification." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. No. 9. 2021. (LightXML)
[6] You, Ronghui, et al. "AttentionXML: Label tree-based attention-aware deep model for high-performance extreme multi-label text classification." Advances in Neural Information Processing Systems 32 (2019). (AttentionXML)
[7] Dahiya, Kunal, et al. "SiameseXML: Siamese networks meet extreme classifiers with 100M labels." International Conference on Machine Learning. PMLR, 2021. (SiameseXML)
[8] Bhatia, K., Dahiya, K., Jain, H., Kar, P., Mittal, A., Prabhu, Y., and Varma, M. "The extreme classification repository: Multi-label datasets and code." http://manikvarma.org/downloads/XC/XMLRepository.html, 2016.
Since we learn a mapping from the distribution of GPT hidden states to source labels, we first collect the GPT hidden states (the last representation before the vocabulary projection) and train only the prediction task separately.
Combining XML with GPT hidden states uses too much memory. For example, PG19 has more than 20,000 labels, and collecting 100 samples per label requires storing a tensor of size (2,000,000, Tokens, Dims).
Therefore, we collect 100 samples for each of 1,000 labels and compare replacing the CE loss with binary CE, and also try the loss proposed in Renee.
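A minimal sketch of this collection step, assuming a Pythia-70m checkpoint, mean pooling over tokens, and a float16 memmap on disk to keep memory manageable; all of these choices are assumptions:

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL, N_SAMPLES, D = "EleutherAI/pythia-70m", 100_000, 512   # placeholder sizes

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

# float16 memmap on disk: N_SAMPLES x D instead of (samples, tokens, dims) in RAM
store = np.memmap("hiddens.f16", dtype=np.float16, mode="w+", shape=(N_SAMPLES, D))

@torch.no_grad()
def collect(texts):
    for idx, text in enumerate(texts):
        ids = tok(text, return_tensors="pt", truncation=True, max_length=512).input_ids
        h = model(ids, output_hidden_states=True).hidden_states[-1]   # (1, n, D)
        store[idx] = h.mean(dim=1).squeeze(0).to(torch.float16).numpy()  # mean-pool over tokens
    store.flush()
```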
| Models | Amazon131K | PG19 | Wikipedia-500K |
|---|---|---|---|
| Pythia-70m | ✓ | ✓ | |
| Pythia-160m | ✓ | - | |
| Pythia-410m | ✓ | - | |
| Pythia-1b | ✓ | | |
| Pythia-1.4b | ✓ | | |
| Pythia-2.8b | ✓ | | |
| Pythia-6.9b | ✓ | | |
| Pythia-12b | ✓ | | |
| LLama-chat-7b | - | | |
| LLama-7b | - | | |
| LLama-chat-13b | - | | |
| LLama-13b | - | | |
| GPT2-XL | - | | |
| OPT-350m | - | | |