Seeded preprocessing by le1nux · Pull Request #295 · Modalities/modalities · GitHub

Seeded preprocessing #295

Merged

le1nux merged 18 commits into main from seeded_preprocessing on Feb 20, 2025
Conversation

@le1nux (Member) commented on Jan 27, 2025

What does this PR do?

This PR adds:

  • a FileExistencePolicy specifying how to react when a preprocessed file already exists
  • seeding for the API endpoints create_shuffled_dataset_chunk and shuffle_tokenized_data
  • shuffle_tokenized_data moved to the API layer rather than the dataloader
  • a hashing function calculate_hashed_seed that calculates a seed value from a list of hashed strings (sketched below)

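As a rough, illustrative sketch of the last point (not necessarily the merged implementation), such a helper could derive a deterministic seed by hashing the strings and folding the digest into a valid RNG seed range:

    import hashlib

    def calculate_hashed_seed(hashed_strings: list[str], max_seed: int = 2**32 - 1) -> int:
        # Illustrative sketch only: combine all strings into one digest.
        hasher = hashlib.sha256()
        for s in hashed_strings:
            hasher.update(s.encode("utf-8"))
        # Fold the digest into a range accepted by typical RNGs.
        return int.from_bytes(hasher.digest(), byteorder="big") % max_seed
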
Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)
  • I have updated the internal changelog (CHANGELOG_DEV.md)

@le1nux le1nux requested review from mali-git and fromm-m January 28, 2025 10:44
@le1nux le1nux marked this pull request as ready for review January 28, 2025 10:44
@flxst (Member) left a comment

Great work! :)

Added some comments, mostly suggestions for minor improvements. However, the branch does not contain the latest version of main.

@@ -47,16 +76,8 @@ def create_raw_data_index(
     """
     index_path = LargeFileLinesReader.default_index_path(src_path, index_path)
     if index_path.exists():
-        if file_existence_policy == FileExistencePolicy.SKIP:
-            get_logger(name="main").warning(f"Index already exists at {str(index_path)}. Skipping index creation.")
+        if not enforce_file_existence_policy(index_path, file_existence_policy):
             return
Member:

I personally find the logic with the returned boolean a bit confusing. To me, it would make more sense to swap True/False for Override/Skip and then use

if enforce_file_existence_policy(index_path, file_existence_policy):
     return

In addition, one could integrate the previous line if index_path.exists(): into enforce_file_existence_policy.

But I guess this is a matter of personal taste in the end, nothing wrong with the current solution either.
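For illustration only (this is not the code merged in this PR), the suggested variant, with the existence check folded in and True meaning "stop", might look roughly like this, using minimal stand-ins for the project's enum and logger:

    import logging
    from enum import Enum
    from pathlib import Path

    class FileExistencePolicy(Enum):
        SKIP = "skip"
        OVERRIDE = "override"

    def enforce_file_existence_policy(file_path: Path, policy: FileExistencePolicy) -> bool:
        """Return True if the caller should stop because the file already exists."""
        if not file_path.exists():
            return False
        if policy == FileExistencePolicy.SKIP:
            logging.getLogger("main").warning(f"File already exists at {file_path}. Skipping.")
            return True
        if policy == FileExistencePolicy.OVERRIDE:
            file_path.unlink()  # assumption: override deletes the stale file so it can be recreated
            return False
        raise ValueError(f"Unknown file existence policy: {policy}")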

Member Author:

The bool flag returned by the function indicates whether we should continue or not. We could make it explicit via

allow_continue = enforce_file_existence_policy(index_path, file_existence_policy)
if not allow_continue:
    return

What do you think?

Member:

Yeah that's good. I would personally use the opposite boolean for readability, e.g. stop_process instead of allow_continue (and then ofc get rid of the not), but either is fine.

Member Author:

Changed in eb818e6.

pkl_encoded_index = f.read()
index_base = pickle.loads(pkl_encoded_index)
# Step 2: Shuffle the index
rng = Random(seed)
Member:

Is there a particular reason why Random is used here? Asking because this seems very similar to the shuffle_file_chunks_in_place method in create_chunks.py, where np.random is used. Maybe one could even use the same function for both shuffling processes?

Member Author:

Yes, we could and probably should. Since I have already prepared the training data with the current state of the code, I suggest we create an issue and fix this after we have pushed the new modalities version. That way, we stay consistent for now.

Member:

Created an issue: #304
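
As an illustrative sketch of what a shared helper could look like (not necessarily what the issue will implement), both shuffling paths could call one seeded, NumPy-based function:

    import numpy as np

    def seeded_shuffle_in_place(items: list, seed: int) -> None:
        # One deterministic, seeded shuffle shared by the tokenized-data index
        # and the file-chunk shuffling in create_chunks.py.
        rng = np.random.default_rng(seed)
        rng.shuffle(items)

Calling it as seeded_shuffle_in_place(index_base, seed) would then replace the Random(seed)-based shuffle shown above.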

@fromm-m (Member) commented on Feb 19, 2025

@le1nux I think it makes sense to resolve @flxst's comments before I have a look.

@fromm-m (Member) left a comment

LGTM

@le1nux le1nux merged commit d74edd5 into main Feb 20, 2025
3 checks passed
@le1nux le1nux deleted the seeded_preprocessing branch February 20, 2025 11:45