Mathematical Synthetic Token Dataset
How I make a synthetic dataset for GPT training.
Introduction
Recently, the mechanisms of the transformer architecture have been studied and several of their properties have been discovered.
To investigate these mechanisms, every step of the training process should be observable. However, because most foundation models are trained on complex NLP corpora, their exact underlying states are unclear. To make the procedure transparent, the dataset must be under control. In this project we develop a synthetic dataset for training transformer models. Specifically, we build mathematical token sequences with customizable rules and simple labels.
Design
We first build the simplest token dataset and then subclass it to make a mathematical token dataset. Each mathematical dataset covers a single operation, such as addition or subtraction, over a user-defined number of digit symbols. A compositional dataset then combines several of these single-operation datasets.
Note that these mathematical synthetic datasets carry no labels by themselves; each item is only a sequence of tokens representing an operation. To generate labels, we define a "domain". A domain is a source of sequences that are assumed to share a label. Because the datasets are built from operations (+, -, *, //), we can assign synthetic labels to each sequence. For example, the label "the sequence contains zero" or "the sequence does not contain zero" is assigned to a sequence when the corresponding condition is met.
Tokens
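The vocabulary consists of two kinds of symbolic tokens:
- digits: the numbers 0 ~ N
- special tokens: the operators + - * //, plus the equality, modulo-equality, negative-sign, and end-of-sequence tokens defined below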
def _make_special_tokens(self):
    # implemented
    # vocab_len is the number of digit tokens + the number of special tokens
    self.equal_token = self.num_symbols + 0
    self.mod_equal_token = self.num_symbols + 1
    self.add_token = self.num_symbols + 2
    self.multiply_token = self.num_symbols + 3
    self.subtract_token = self.num_symbols + 4
    self.divide_token = self.num_symbols + 5
    self.negative_token = self.num_symbols + 6
    self.eos_token = self.num_symbols + 7
    self.num_special_tokens = 8
    self.vocab_len = self.num_symbols + self.num_special_tokens
    # digit tokens decode to their own string form
    self.token_to_symbol = {
        i: str(i) for i in range(self.num_symbols)
    }
    self.token_to_symbol[self.equal_token] = "="
    self.token_to_symbol[self.mod_equal_token] = "=(m)"
    self.token_to_symbol[self.add_token] = "+"
    self.token_to_symbol[self.multiply_token] = "*"
    self.token_to_symbol[self.subtract_token] = "-"
    self.token_to_symbol[self.divide_token] = "/"
    self.token_to_symbol[self.negative_token] = "-"
    self.token_to_symbol[self.eos_token] = "[eos]>"
    # Note: subtract_token and negative_token both render as "-", so this
    # inverse mapping keeps only one of them under the "-" key.
    self.symbol_to_token = {v: k for k, v in self.token_to_symbol.items()}
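To make the token layout concrete, here is a minimal standalone sketch of the same table outside the class. The helper names `make_vocab` and `decode` are hypothetical and are not part of `green.tokens`; the offsets and symbols are copied from the method above.

```python
def make_vocab(num_symbols):
    """Build a token -> symbol table (hypothetical helper, mirrors the method above)."""
    # Special symbols in offset order: equal, mod_equal, add, multiply,
    # subtract, divide, negative, eos (strings copied from the listing above).
    special = ["=", "=(m)", "+", "*", "-", "/", "-", "[eos]>"]
    token_to_symbol = {i: str(i) for i in range(num_symbols)}
    for offset, symbol in enumerate(special):
        token_to_symbol[num_symbols + offset] = symbol
    return token_to_symbol


def decode(tokens, token_to_symbol):
    """Render a token tuple as a space-separated string."""
    return " ".join(token_to_symbol[t] for t in tokens)


vocab = make_vocab(num_symbols=5)
print(len(vocab))                      # 13
print(decode((0, 7, 1, 5, 1), vocab))  # 0 + 1 = 1
```

Decoding is unambiguous, but encoding from symbols is not, because two different tokens share the "-" symbol. Looking tokens up by attribute name (for example `subtract_token`) avoids that ambiguity.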
Examples
- Addition: a single addition operation is used.
- Composition: four types of operations are combined to construct a single dataset.
- Modulo: the equation is replaced by a modulo equation.
- Domain: labels are assigned to each sequence.
Example 1
Addition
from green.tokens import print_dataset
from green.tokens import MathematicalTokenDatasetV1
train_dataset = MathematicalTokenDatasetV1(True, num_symbols=5, seed=0, modulo=False, modulo_int=None)
valid_dataset = MathematicalTokenDatasetV1(False, num_symbols=5, seed=0, modulo=False, modulo_int=None)
print_dataset(train_dataset, valid_dataset)
# These lines are the printed results
------- Train: True
------- Length: 10
------- num_symbols: 5
------- vocab_len 13
sequence indices : [array([ 1, 6, 8, 9, 14, 4, 2, 13, 10, 7])]
tokens / decoded
(0, 7, 1, 5, 1) 0 + 1 = 1
(1, 7, 1, 5, 2) 1 + 1 = 2
(1, 7, 3, 5, 4) 1 + 3 = 4
(2, 7, 0, 5, 2) 2 + 0 = 2
(4, 7, 0, 5, 4) 4 + 0 = 4
(0, 7, 4, 5, 4) 0 + 4 = 4
(0, 7, 2, 5, 2) 0 + 2 = 2
(3, 7, 1, 5, 4) 3 + 1 = 4
(2, 7, 1, 5, 3) 2 + 1 = 3
(1, 7, 2, 5, 3) 1 + 2 = 3
------- Train: False
------- Length: 5
------- num_symbols: 5
------- vocab_len 13
sequence indices : [[0, 3, 5, 11, 12]]
tokens / decoded
(0, 7, 0, 5, 0) 0 + 0 = 0
(0, 7, 3, 5, 3) 0 + 3 = 3
(1, 7, 0, 5, 1) 1 + 0 = 1
(2, 7, 2, 5, 4) 2 + 2 = 4
(3, 7, 0, 5, 3) 3 + 0 = 3
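The printed output is consistent with enumerating every pair (a, b) whose sum is still a valid digit, in lexicographic order, and then splitting the indices into train and validation sets. The sketch below reproduces that enumeration. It is an assumption inferred from the output, not the library's code: `addition_sequences` and `split_indices` are hypothetical names, the two-thirds ratio is read off the printed lengths, and the shuffle will not reproduce the exact permutation shown above.

```python
import random


def addition_sequences(num_symbols):
    """Enumerate (a, +, b, =, a+b) for all pairs whose sum is a valid digit."""
    equal_token = num_symbols + 0
    add_token = num_symbols + 2
    sequences = []
    for a in range(num_symbols):
        for b in range(num_symbols):
            if a + b < num_symbols:  # the result must itself be a digit token
                sequences.append((a, add_token, b, equal_token, a + b))
    return sequences


def split_indices(n, seed=0):
    """Shuffle indices with a fixed seed and keep about two thirds for training."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = (2 * n) // 3
    return indices[:n_train], sorted(indices[n_train:])


sequences = addition_sequences(num_symbols=5)
train_idx, valid_idx = split_indices(len(sequences), seed=0)
print(len(sequences), len(train_idx), len(valid_idx))  # 15 10 5
print(sequences[11])                                   # (2, 7, 2, 5, 4)
```

Index 11 decodes to 2 + 2 = 4, which matches the fourth validation row above (validation indices 0, 3, 5, 11, 12).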
Example 2
Composition (Add+Subtract+Multiply+Divide)
from green.tokens import MathematicalCompositionalDatasetV4
train_dataset = MathematicalCompositionalDatasetV4(True, num_symbols=3, seed=0, modulo=False, modulo_int=None)
valid_dataset = MathematicalCompositionalDatasetV4(False, num_symbols=3, seed=0, modulo=False, modulo_int=None)
print_dataset(train_dataset, valid_dataset)
# These lines are the printed results
------- Train: True
------- Length: 19
------- num_symbols: 3
------- vocab_len 11
sequence indices : [array([5, 2, 1, 3]), array([7, 2, 1, 4, 8, 6]), array([6, 2, 1, 7, 3]), array([5, 2, 1, 3])]
tokens / decoded
(2, 5, 0, 3, 2) 2 + 0 = 2
(0, 5, 2, 3, 2) 0 + 2 = 2
(0, 5, 1, 3, 1) 0 + 1 = 1
(1, 5, 0, 3, 1) 1 + 0 = 1
(2, 7, 1, 3, 1) 2 - 1 = 1
(0, 7, 2, 3, 9, 2) 0 - 2 = - 2
(0, 7, 1, 3, 9, 1) 0 - 1 = - 1
(1, 7, 1, 3, 0) 1 - 1 = 0
(2, 7, 2, 3, 0) 2 - 2 = 0
(2, 7, 0, 3, 2) 2 - 0 = 2
(2, 6, 0, 3, 0) 2 * 0 = 0
(0, 6, 2, 3, 0) 0 * 2 = 0
(0, 6, 1, 3, 0) 0 * 1 = 0
(2, 6, 1, 3, 2) 2 * 1 = 2
(1, 6, 0, 3, 0) 1 * 0 = 0
(2, 8, 2, 3, 1) 2 / 2 = 1
(1, 8, 1, 3, 1) 1 / 1 = 1
(0, 8, 2, 3, 0) 0 / 2 = 0
(1, 8, 2, 3, 0) 1 / 2 = 0
------- Train: False
------- Length: 10
------- num_symbols: 3
------- vocab_len 11
sequence indices : [[0, 4], [0, 3, 5], [0, 4, 5], [0, 4]]
tokens / decoded
(0, 5, 0, 3, 0) 0 + 0 = 0
(1, 5, 1, 3, 2) 1 + 1 = 2
(0, 7, 0, 3, 0) 0 - 0 = 0
(1, 7, 0, 3, 1) 1 - 0 = 1
(1, 7, 2, 3, 9, 1) 1 - 2 = - 1
(0, 6, 0, 3, 0) 0 * 0 = 0
(1, 6, 1, 3, 1) 1 * 1 = 1
(1, 6, 2, 3, 2) 1 * 2 = 2
(0, 8, 1, 3, 0) 0 / 1 = 0
(2, 8, 1, 3, 2) 2 / 1 = 2
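Two details of the output above are worth spelling out: a negative result is written as the negative token followed by its absolute value, and division is floor division with zero divisors skipped. A minimal sketch under those assumptions follows. The function names are hypothetical, and the filtering rule (keep a sequence only if the absolute value of the result is a valid digit) is inferred from the printed sequences, not taken from the library.

```python
def encode_result(value, num_symbols):
    """Encode a result; negatives become (negative_token, abs(value))."""
    negative_token = num_symbols + 6
    if value < 0:
        return (negative_token, -value)
    return (value,)


def compositional_sequences(num_symbols):
    """Enumerate add, subtract, multiply and floor-divide sequences in that order."""
    equal_token = num_symbols + 0
    divide_token = num_symbols + 5
    operations = [
        (num_symbols + 2, lambda a, b: a + b),   # add
        (num_symbols + 4, lambda a, b: a - b),   # subtract
        (num_symbols + 3, lambda a, b: a * b),   # multiply
        (divide_token, lambda a, b: a // b),     # divide (floor division)
    ]
    sequences = []
    for op_token, fn in operations:
        for a in range(num_symbols):
            for b in range(num_symbols):
                if op_token == divide_token and b == 0:
                    continue  # division by zero is undefined
                result = fn(a, b)
                if abs(result) >= num_symbols:
                    continue  # the result must be expressible with digit tokens
                sequences.append(
                    (a, op_token, b, equal_token) + encode_result(result, num_symbols)
                )
    return sequences


sequences = compositional_sequences(num_symbols=3)
print(len(sequences))  # 29
```

The count of 29 equals the 19 training plus 10 validation sequences printed above (6 additions, 9 subtractions, 8 multiplications, 6 divisions).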
Example 3
Modulo + Composition
from green.tokens import MathematicalCompositionalDatasetV2
train_dataset = MathematicalCompositionalDatasetV2(True, num_symbols=3, seed=0, modulo=True, modulo_int=2)
valid_dataset = MathematicalCompositionalDatasetV2(False, num_symbols=3, seed=0, modulo=True, modulo_int=2)
print_dataset(train_dataset, valid_dataset)
# These lines are the printed results
------- Train: True
------- Length: 12
------- num_symbols: 3
------- vocab_len 11
sequence indices : [array([7, 2, 1, 4, 8, 6]), array([7, 2, 1, 4, 8, 6])]
(2, 5, 1, 4, 2, 1) 2 + 1 =(m) 2 1
(0, 5, 2, 4, 2, 0) 0 + 2 =(m) 2 0
(0, 5, 1, 4, 2, 1) 0 + 1 =(m) 2 1
(1, 5, 1, 4, 2, 0) 1 + 1 =(m) 2 0
(2, 5, 2, 4, 2, 0) 2 + 2 =(m) 2 0
(2, 5, 0, 4, 2, 0) 2 + 0 =(m) 2 0
(2, 6, 1, 4, 2, 0) 2 * 1 =(m) 2 0
(0, 6, 2, 4, 2, 0) 0 * 2 =(m) 2 0
(0, 6, 1, 4, 2, 0) 0 * 1 =(m) 2 0
(1, 6, 1, 4, 2, 1) 1 * 1 =(m) 2 1
(2, 6, 2, 4, 2, 0) 2 * 2 =(m) 2 0
(2, 6, 0, 4, 2, 0) 2 * 0 =(m) 2 0
------- Train: False
------- Length: 6
------- num_symbols: 3
------- vocab_len 11
sequence indices : [[0, 3, 5], [0, 3, 5]]
(0, 5, 0, 4, 2, 0) 0 + 0 =(m) 2 0
(1, 5, 0, 4, 2, 1) 1 + 0 =(m) 2 1
(1, 5, 2, 4, 2, 1) 1 + 2 =(m) 2 1
(0, 6, 0, 4, 2, 0) 0 * 0 =(m) 2 0
(1, 6, 0, 4, 2, 0) 1 * 0 =(m) 2 0
(1, 6, 2, 4, 2, 0) 1 * 2 =(m) 2 0
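In the modulo output, each sequence reads "a op b =(m) m r": the modulo-equality token is followed by the modulus and then the remainder. The sketch below encodes one such sequence. `modulo_sequence` is a hypothetical helper, and it assumes (from the printed rows) that this dataset combines addition and multiplication and that the modulus is written as a digit token, so it has to be smaller than `num_symbols`.

```python
def modulo_sequence(a, op, b, num_symbols, modulo_int):
    """Encode `a op b =(m) modulo_int remainder` as a token tuple."""
    if not 0 < modulo_int < num_symbols:
        raise ValueError("modulo_int must be a non-zero digit token")
    mod_equal_token = num_symbols + 1
    op_tokens = {"+": num_symbols + 2, "*": num_symbols + 3}
    value = a + b if op == "+" else a * b
    return (a, op_tokens[op], b, mod_equal_token, modulo_int, value % modulo_int)


print(modulo_sequence(2, "+", 1, num_symbols=3, modulo_int=2))  # (2, 5, 1, 4, 2, 1)
print(modulo_sequence(2, "*", 2, num_symbols=3, modulo_int=2))  # (2, 6, 2, 4, 2, 0)
```

Because the remainder is always smaller than the modulus, every pair of digits yields a valid sequence, which is why each operation contributes all 9 pairs here (6 training, 3 validation).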
Example 4
Domain Labels
Consider the following domains:
# Domain labels
# 0: when the first symbol is odd
# 1: when the first symbol is even
# 2: when the first symbol and the second symbol are both odd
# 3: when the first symbol and the second symbol are both even
# 4: when zero is included in the sequence
# 5: when zero is not included in the sequence
from green.tokens import MathematicalDomainDatasetV1
from green.tokens import print_domain_dataset
train_domain = MathematicalDomainDatasetV1(train_dataset)
valid_domain = MathematicalDomainDatasetV1(valid_dataset)
print_domain_dataset(train_domain, valid_domain)
# These lines are the printed results
------- Train: True
------- Length: 10
------- num_symbols: 3
------- vocab_len 11
sequence indices : [array([5, 2, 1, 3]), array([7, 2, 1, 4, 8, 6])]
(2, 5, 0, 3, 2) 2 + 0 = 2 domain: [1, 3, 4]
(0, 5, 2, 3, 2) 0 + 2 = 2 domain: [1, 3, 4]
(0, 5, 1, 3, 1) 0 + 1 = 1 domain: [1, 4]
(1, 5, 0, 3, 1) 1 + 0 = 1 domain: [0, 4]
(2, 7, 1, 3, 1) 2 - 1 = 1 domain: [1, 5]
(0, 7, 2, 3, 9, 2) 0 - 2 = - 2 domain: [1, 3, 4]
(0, 7, 1, 3, 9, 1) 0 - 1 = - 1 domain: [1, 4]
(1, 7, 1, 3, 0) 1 - 1 = 0 domain: [0, 2, 4]
(2, 7, 2, 3, 0) 2 - 2 = 0 domain: [1, 3, 4]
(2, 7, 0, 3, 2) 2 - 0 = 2 domain: [1, 3, 4]
------- Train: False
------- Length: 5
------- num_symbols: 3
------- vocab_len 11
sequence indices : [[0, 4], [0, 3, 5]]
(0, 5, 0, 3, 0) 0 + 0 = 0 domain: [1, 3, 4]
(1, 5, 1, 3, 2) 1 + 1 = 2 domain: [0, 2, 5]
(0, 7, 0, 3, 0) 0 - 0 = 0 domain: [1, 3, 4]
(1, 7, 0, 3, 1) 1 - 0 = 1 domain: [0, 4]
(1, 7, 2, 3, 9, 1) 1 - 2 = - 1 domain: [0, 5]
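The six domain rules can be written as a small function that reproduces the labels printed above. This is a sketch, not the implementation of `MathematicalDomainDatasetV1`: `domain_labels` is a hypothetical name, and it assumes that "first symbol" and "second symbol" mean the two operands (positions 0 and 2) and that "zero is included" considers digit tokens only.

```python
def domain_labels(tokens, num_symbols):
    """Return the list of domain ids (0-5) that a token sequence belongs to."""
    first, second = tokens[0], tokens[2]            # the two operands
    digits = [t for t in tokens if t < num_symbols]  # ignore special tokens
    labels = []
    if first % 2 == 1:
        labels.append(0)  # first symbol is odd
    else:
        labels.append(1)  # first symbol is even
    if first % 2 == 1 and second % 2 == 1:
        labels.append(2)  # both operands are odd
    if first % 2 == 0 and second % 2 == 0:
        labels.append(3)  # both operands are even
    if 0 in digits:
        labels.append(4)  # zero appears in the sequence
    else:
        labels.append(5)  # zero does not appear in the sequence
    return labels


print(domain_labels((2, 5, 0, 3, 2), num_symbols=3))     # [1, 3, 4]
print(domain_labels((1, 7, 2, 3, 9, 1), num_symbols=3))  # [0, 5]
```

A sequence can belong to several domains at once, so each item gets a list of labels and not a single class.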