GitHub - airlsyn/hakkero-dataloader

hakkero-dataloader

A general dataloader build on top of Pytorch Dataloader.

1. How to use

1.1 Build Index

Install pip install hakkero-dataloader and run the following command to build index.

hakkero -h

usage: hakkero [-h] [--version] [--filename FILENAME] [--output OUTPUT] --dtype {legacy,message,preference} [--num_workers NUM_WORKERS] [--not_shuf]

build index for dataset

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --filename FILENAME   full filename of jsonl file
  --output OUTPUT       output path for saving data.jsonl and index.h5
  --dtype {legacy,message,preference}
                        data type
  --num_workers NUM_WORKERS
                        number of workers
  --not_shuf            not shuf data

1.2 Use In Training

from hakkero.dataset import get_data

dp_world_size, dp_rank = 1, 0
tokenizer = ...
batch_size = 4
max_length = 4096
n_workers = 2

dataset, dataloader, forward_keys = get_data(
    config="/path/to/dataset config",
    dp_rank=dp_rank,
    dp_world_size=dp_world_size,
    tokenizer=tokenizer,
    batch_size=batch_size,
    max_length=max_length,
    # segment and tokenize strategy or set them in `config` and let strategy_segment=None and strategy_tokenize=None: 
    st_segment="naive",
    st_tokenize="legacy",
    # add bos/eos token for legacy tokenize strategy
    add_bos_token=True,
    add_eos_token=True,
    # norm dataset weight with tokens of target
    norm_weight_with_n_targets=False,
    # keep <think>xxx</think> in message or not
    # no - not keep <think>xx</think> (default)
    # last - keep <think>xx</think> in last turn
    # all - keep <think>xx</think> in all turns
    keep_think="no",
    homogeneous=True,
    seed=9527,
    n_workers=n_workers,
    is_preference=False,
    use_unpad_data=False,
    use_unpad_in_pad=False,
)

prefetcher = dataloader.prefetch(n_workers, drop_last=False)
for step, batch in enumerate(prefetcher, start=0):
  print(batch)

example of config:

{
    "hermes25_1":
    {
        "group": "en",
        "name": "hermes25_1",
        "epoch": 1,
        "path": "hermes25",
        "strategy":
        {
            "st_segment": "integrous",
            "st_tokenize": "hg"
        },
        "weight": 0.5
    },
    "hermes25_2":
    {
        "group": "en",
        "name": "hermes25_1",
        "epoch": 1,
        "path": "hermes25",
        "strategy":
        {
            "st_segment": "integrous",
            "st_tokenize": "hg"
        },
        "weight": 0.5
    }
}

2. Supported Strategies

See segmentation.py and tokenization.py for more details.

2.1 Segmentation Strategies

integrous: discard sample that is too long, exceed max_length
concat: split long sample, concat it with previous segment, shuffle all segments
- not support preference data.
naive: split long sample with random length, shuffle all segments
- not support preference data.
unbiased: split long sample exceed max_length with random length, shuffle all segments.
- not support preference data.

2.2 Tokenization Strategies

legacy: \n\n as delimiter to join text and use tokenizer.encode to encode the input.
- format of input data
```
{
  "uid": "xxx",
  "data":
  {
      "title": "xxx",
      "summary": "xxx",
      "abstract": "xxx",
      "text": "xxx",
      "question": "xxx",
      "answer": "xxx",
      "code": "xxx",
      "label": "xxx"
  }
}
```
  - All fields except label are stripped and joined with "\n\n" as the context.
  - label is the target to learn for finetuning (pretrain data should not have the label field).
  - See func legacy in tokenization.py for more details.
- extra parameters:
  - add_bos_token, add_eos_token
hg: huggingface message data, use tokenizer.apply_chat_template to encode the input.
- format of input data
```
{
  "uid": "xx",
  "data": [
    {"role": "user", "content": "xxx"},
    {"role": "assistant", "content": "xxx"},
     ...
  ]
}
```
  See func huggingface_message in tokenization.py for more details.
- extra parameters:
  - keep_think: support keep <think>xx</think> or not
    - no - not keep <think>xx</think> (default)
    - last - keep <think>xx</think> in last turn
    - all - keep <think>xx</think> in all turns

chatml: chat message data, use chatml to encode the input.

format of input data

{
  "uid": "xx",
  "data": [
    {"role": "user", "content": "xxx"},
    {"role": "assistant", "content": "xxx"},
     ...
  ]
}

See func chatml_message in tokenization.py for more details.

chatml_qwen2_vl_message: chat message vl data, use chatml to encode the input.

format of input data

{
  "uid": "xx",
  "data": [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "images/2.jpg"
            },
            {
                "type": "text",
                "text": 
6CFC
"他是谁？"
            }
        ]
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "他是来自拜仁慕尼黑的托马斯·穆勒。"
            }
        ]
    },
     ...
  ]
}

See func chatml_qwen2_vl_message in tokenization.py for more details. Only support "integrous" segmentation strategies

hg_preference: preference data, use tokenizer.apply_chat_template to encode the input.
- format of input data
```
{
  "uid": "xx",
  "data": {
    "context": [
      {"role": "user", "content": "xxx"},
      {"role": "assistant", "content": "xxx"},
      ...
      {"role": "user", "content": "xxx"}
    ],
    "chosen": "chosen response",
    "rejected": "rejected response"
  }
}
```
  See func huggingface_preference in tokenization.py for more details.
- extra parameters:
  - keep_think: support keep <think>xx</think> or not
    - no - not keep <think>xx</think> (default)
    - last - keep <think>xx</think> in last turn
    - all - keep <think>xx</think> in all turns

chatml_preference: preference data, use chatml to encode the input.

format of input data

{
  "uid": "xx",
  "data": {
    "context": [
      {"role": "user", "content": "xxx"},
      {"role": "assistant", "content": "xxx"},
      ...
      {"role": "user", "content": "xxx"}
    ],
    "chosen": "chosen response",
    "rejected": "rejected response"
  }
}

See func chatml_preference in tokenization.py for more details.

Name		Name	Last commit message	Last commit date
Latest commit History 48 Commits
.github/workflows		.github/workflows
hakkero		hakkero
sbin		sbin
.gitignore		.gitignore
MANIFEST.in		MANIFEST.in
Makefile		Makefile
README.md		README.md
__init__.py		__init__.py
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

hakkero-dataloader

1. How to use

1.1 Build Index

1.2 Use In Training

2. Supported Strategies

2.1 Segmentation Strategies

2.2 Tokenization Strategies

About

Releases 13

Packages

Languages

airlsyn/hakkero-dataloader

Folders and files

Latest commit

History

Repository files navigation

hakkero-dataloader

1. How to use

1.1 Build Index

1.2 Use In Training

2. Supported Strategies

2.1 Segmentation Strategies

2.2 Tokenization Strategies

About

Resources

Stars

Watchers

Forks

Releases 13

Packages 0

Languages

Packages