8000 GitHub - airlsyn/hakkero-dataloader
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

airlsyn/hakkero-dataloader

Repository files navigation

hakkero-dataloader

A general dataloader build on top of Pytorch Dataloader.

1. How to use

1.1 Build Index

Install pip install hakkero-dataloader and run the following command to build index.

hakkero -h

usage: hakkero [-h] [--version] [--filename FILENAME] [--output OUTPUT] --dtype {legacy,message,preference} [--num_workers NUM_WORKERS] [--not_shuf]

build index for dataset

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --filename FILENAME   full filename of jsonl file
  --output OUTPUT       output path for saving data.jsonl and index.h5
  --dtype {legacy,message,preference}
                        data type
  --num_workers NUM_WORKERS
                        number of workers
  --not_shuf            not shuf data

1.2 Use In Training

from hakkero.dataset import get_data

dp_world_size, dp_rank = 1, 0
tokenizer = ...
batch_size = 4
max_length = 4096
n_workers = 2

dataset, dataloader, forward_keys = get_data(
    config="/path/to/dataset config",
    dp_rank=dp_rank,
    dp_world_size=dp_world_size,
    tokenizer=tokenizer,
    batch_size=batch_size,
    max_length=max_length,
    # segment and tokenize strategy or set them in `config` and let strategy_segment=None and strategy_tokenize=None: 
    st_segment="naive",
    st_tokenize="legacy",
    # add bos/eos token for legacy tokenize strategy
    add_bos_token=True,
    add_eos_token=True,
    # norm dataset weight with tokens of target
    norm_weight_with_n_targets=False,
    # keep <think>xxx</think> in message or not
    # no - not keep <think>xx</think> (default)
    # last - keep <think>xx</think> in last turn
    # all - keep <think>xx</think> in all turns
    keep_think="no",
    homogeneous=True,
    seed=9527,
    n_workers=n_workers,
    is_preference=False,
    use_unpad_data=False,
    use_unpad_in_pad=False,
)

prefetcher = dataloader.prefetch(n_workers, drop_last=False)
for step, batch in enumerate(prefetcher, start=0):
  print(batch)

example of config:

{
    "hermes25_1":
    {
        "group": "en",
        "name": "hermes25_1",
        "epoch": 1,
        "path": "hermes25",
        "strategy":
        {
            "st_segment": "integrous",
            "st_tokenize": "hg"
        },
        "weight": 0.5
    },
    "hermes25_2":
    {
        "group": "en",
        "name": "hermes25_1",
        "epoch": 1,
        "path": "hermes25",
        "strategy":
        {
            "st_segment": "integrous",
            "st_tokenize": "hg"
        },
        "weight": 0.5
    }
}

2. Supported Strategies

See segmentation.py and tokenization.py for more details.

2.1 Segmentation Strategies

  • integrous: discard sample that is too long, exceed max_length
  • concat: split long sample, concat it with previous segment, shuffle all segments
    • not support preference data.
  • naive: split long sample with random length, shuffle all segments
    • not support preference data.
  • unbiased: split long sample exceed max_length with random length, shuffle all segments.
    • not support preference data.

2.2 Tokenization Strategies

  • legacy: \n\n as delimiter to join text and use tokenizer.encode to encode the input.

    • format of input data

      {
        "uid": "xxx",
        "data":
        {
            "title": "xxx",
            "summary": "xxx",
            "abstract": "xxx",
            "text": "xxx",
            "question": "xxx",
            "answer": "xxx",
            "code": "xxx",
            "label": "xxx"
        }
      }
      • All fields except label are stripped and joined with "\n\n" as the context.
      • label is the target to learn for finetuning (pretrain data should not have the label field).
      • See func legacy in tokenization.py for more details.
    • extra parameters:

      • add_bos_token, add_eos_token
  • hg: huggingface message data, use tokenizer.apply_chat_template to encode the input.

    • format of input data

      {
        "uid": "xx",
        "data": [
          {"role": "user", "content": "xxx"},
          {"role": "assistant", "content": "xxx"},
           ...
        ]
      }

      See func huggingface_message in tokenization.py for more details.

    • extra parameters:

      • keep_think: support keep <think>xx</think> or not
        • no - not keep <think>xx</think> (default)
        • last - keep <think>xx</think> in last turn
        • all - keep <think>xx</think> in all turns
  • chatml: chat message data, use chatml to encode the input.

    • format of input data

      {
        "uid": "xx",
        "data": [
          {"role": "user", "content": "xxx"},
          {"role": "assistant", "content": "xxx"},
           ...
        ]
      }

      See func chatml_message in tokenization.py for more details.

  • chatml_qwen2_vl_message: chat message vl data, use chatml to encode the input.

    • format of input data

      {
        "uid": "xx",
        "data": [
          {
              "role": "user",
              "content": [
                  {
                      "type": "image",
                      "image": "images/2.jpg"
                  },
                  {
                      "type": "text",
                      "text": 
      6CFC
      "他是谁?"
                  }
              ]
          },
          {
              "role": "assistant",
              "content": [
                  {
                      "type": "text",
                      "text": "他是来自拜仁慕尼黑的托马斯·穆勒。"
                  }
              ]
          },
           ...
        ]
      }

      See func chatml_qwen2_vl_message in tokenization.py for more details. Only support "integrous" segmentation strategies

  • hg_preference: preference data, use tokenizer.apply_chat_template to encode the input.

    • format of input data

      {
        "uid": "xx",
        "data": {
          "context": [
            {"role": "user", "content": "xxx"},
            {"role": "assistant", "content": "xxx"},
            ...
            {"role": "user", "content": "xxx"}
          ],
          "chosen": "chosen response",
          "rejected": "rejected response"
        }
      }

      See func huggingface_preference in tokenization.py for more details.

    • extra parameters:

      • keep_think: support keep <think>xx</think> or not
        • no - not keep <think>xx</think> (default)
        • last - keep <think>xx</think> in last turn
        • all - keep <think>xx</think> in all turns
  • chatml_preference: preference data, use chatml to encode the input.

    • format of input data

      {
        "uid": "xx",
        "data": {
          "context": [
            {"role": "user", "content": "xxx"},
            {"role": "assistant", "content": "xxx"},
            ...
            {"role": "user", "content": "xxx"}
          ],
          "chosen": "chosen response",
          "rejected": "rejected response"
        }
      }

      See func chatml_preference in tokenization.py for more details.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages

0