A general dataloader build on top of Pytorch Dataloader.
Install pip install hakkero-dataloader
and run the following command to build index.
hakkero -h
usage: hakkero [-h] [--version] [--filename FILENAME] [--output OUTPUT] --dtype {legacy,message,preference} [--num_workers NUM_WORKERS] [--not_shuf]
build index for dataset
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
--filename FILENAME full filename of jsonl file
--output OUTPUT output path for saving data.jsonl and index.h5
--dtype {legacy,message,preference}
data type
--num_workers NUM_WORKERS
number of workers
--not_shuf not shuf data
from hakkero.dataset import get_data
dp_world_size, dp_rank = 1, 0
tokenizer = ...
batch_size = 4
max_length = 4096
n_workers = 2
dataset, dataloader, forward_keys = get_data(
config="/path/to/dataset config",
dp_rank=dp_rank,
dp_world_size=dp_world_size,
tokenizer=tokenizer,
batch_size=batch_size,
max_length=max_length,
# segment and tokenize strategy or set them in `config` and let strategy_segment=None and strategy_tokenize=None:
st_segment="naive",
st_tokenize="legacy",
# add bos/eos token for legacy tokenize strategy
add_bos_token=True,
add_eos_token=True,
# norm dataset weight with tokens of target
norm_weight_with_n_targets=False,
# keep <think>xxx</think> in message or not
# no - not keep <think>xx</think> (default)
# last - keep <think>xx</think> in last turn
# all - keep <think>xx</think> in all turns
keep_think="no",
homogeneous=True,
seed=9527,
n_workers=n_workers,
is_preference=False,
use_unpad_data=False,
use_unpad_in_pad=False,
)
prefetcher = dataloader.prefetch(n_workers, drop_last=False)
for step, batch in enumerate(prefetcher, start=0):
print(batch)
example of config
:
{
"hermes25_1":
{
"group": "en",
"name": "hermes25_1",
"epoch": 1,
"path": "hermes25",
"strategy":
{
"st_segment": "integrous",
"st_tokenize": "hg"
},
"weight": 0.5
},
"hermes25_2":
{
"group": "en",
"name": "hermes25_1",
"epoch": 1,
"path": "hermes25",
"strategy":
{
"st_segment": "integrous",
"st_tokenize": "hg"
},
"weight": 0.5
}
}
See segmentation.py and tokenization.py for more details.
integrous
: discard sample that is too long, exceedmax_length
concat
: split long sample, concat it with previous segment, shuffle all segments- not support preference data.
naive
: split long sample with random length, shuffle all segments- not support preference data.
unbiased
: split long sample exceedmax_length
with random length, shuffle all segments.- not support preference data.
-
legacy
:\n\n
as delimiter to join text and usetokenizer.encode
to encode the input.-
format of input data
{ "uid": "xxx", "data": { "title": "xxx", "summary": "xxx", "abstract": "xxx", "text": "xxx", "question": "xxx", "answer": "xxx", "code": "xxx", "label": "xxx" } }
- All fields except
label
are stripped and joined with "\n\n" as the context. label
is the target to learn for finetuning (pretrain data should not have thelabel
field).- See func
legacy
in tokenization.py for more details.
- All fields except
-
extra parameters:
add_bos_token
,add_eos_token
-
-
hg
: huggingface message data, usetokenizer.apply_chat_template
to encode the input.-
format of input data
{ "uid": "xx", "data": [ {"role": "user", "content": "xxx"}, {"role": "assistant", "content": "xxx"}, ... ] }
See func
huggingface_message
in tokenization.py for more details. -
extra parameters:
keep_think
: support keep<think>xx</think>
or notno
- not keep<think>xx</think>
(default)last
- keep<think>xx</think>
in last turnall
- keep<think>xx</think>
in all turns
-
-
chatml
: chat message data, use chatml to encode the input.-
format of input data
{ "uid": "xx", "data": [ {"role": "user", "content": "xxx"}, {"role": "assistant", "content": "xxx"}, ... ] }
See func
chatml_message
in tokenization.py for more details.
-
-
chatml_qwen2_vl_message
: chat message vl data, use chatml to encode the input.-
format of input data
{ "uid": "xx", "data": [ { "role": "user", "content": [ { "type": "image", "image": "images/2.jpg" }, { "type": "text", "text": 6CFC "他是谁?" } ] }, { "role": "assistant", "content": [ { "type": "text", "text": "他是来自拜仁慕尼黑的托马斯·穆勒。" } ] }, ... ] }
See func
chatml_qwen2_vl_message
in tokenization.py for more details. Only support "integrous" segmentation strategies
-
-
hg_preference
: preference data, usetokenizer.apply_chat_template
to encode the input.-
format of input data
{ "uid": "xx", "data": { "context": [ {"role": "user", "content": "xxx"}, {"role": "assistant", "content": "xxx"}, ... {"role": "user", "content": "xxx"} ], "chosen": "chosen response", "rejected": "rejected response" } }
See func
huggingface_preference
in tokenization.py for more details. -
extra parameters:
keep_think
: support keep<think>xx</think>
or notno
- not keep<think>xx</think>
(default)last
- keep<think>xx</think>
in last turnall
- keep<think>xx</think>
in all turns
-
-
chatml_preference
: preference data, use chatml to encode the input.-
format of input data
{ "uid": "xx", "data": { "context": [ {"role": "user", "content": "xxx"}, {"role": "assistant", "content": "xxx"}, ... {"role": "user", "content": "xxx"} ], "chosen": "chosen response", "rejected": "rejected response" } }
See func
chatml_preference
in tokenization.py for more details.
-