Seeded preprocessing by le1nux · Pull Request #295 · Modalities/modalities · GitHub

Seeded preprocessing #295

Merged

le1nux merged 18 commits into main from seeded_preprocessing on Feb 20, 2025
Conversation

@le1nux (Member) commented on Jan 27, 2025

What does this PR do?

This PR adds:

  • a FileExistencePolicy specifying how to react when a preprocessed file already exists
  • seeding for the API endpoints create_shuffled_dataset_chunk and shuffle_tokenized_data
  • shuffle_tokenized_data moved to the API layer rather than the dataloader
  • a hashing function calculate_hashed_seed that calculates a seed value from a list of hashed strings (sketched below)

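As a rough, illustrative sketch of the last point (not necessarily the merged implementation), such a helper could derive a deterministic seed by hashing the strings and folding the digest into a valid RNG seed range:

    import hashlib

    def calculate_hashed_seed(hashed_strings: list[str], max_seed: int = 2**32 - 1) -> int:
        # Illustrative sketch only: combine all strings into one digest.
        hasher = hashlib.sha256()
        for s in hashed_strings:
            hasher.update(s.encode("utf-8"))
        # Fold the digest into a range accepted by typical RNGs.
        return int.from_bytes(hasher.digest(), byteorder="big") % max_seed
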
Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)
  • I have updated the internal changelog (CHANGELOG_DEV.md)

@le1nux le1nux requested review from mali-git and fromm-m January 28, 2025 10:44
@le1nux le1nux marked this pull request as ready for review January 28, 2025 10:44
@flxst (Member) left a comment

Great work! :)

Added some comments, mostly suggestions for minor improvements. However, the branch does not contain the latest version of main.

@@ -47,16 +76,8 @@ def create_raw_data_index(
     """
     index_path = LargeFileLinesReader.default_index_path(src_path, index_path)
     if index_path.exists():
-        if file_existence_policy == FileExistencePolicy.SKIP:
-            get_logger(name="main").warning(f"Index already exists at {str(index_path)}. Skipping index creation.")
+        if not enforce_file_existence_policy(index_path, file_existence_policy):
             return
Member:

I personally find the logic with the returned boolean a bit confusing. To me, it would make more sense to swap True/False for Override/Skip and then use

if enforce_file_existence_policy(index_path, file_existence_policy):
     return

In addition, one could integrate the previous line if index_path.exists(): into enforce_file_existence_policy.

But I guess this is a matter of personal taste in the end, nothing wrong with the current solution either.
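For illustration only (this is not the code merged in this PR), the suggested variant, with the existence check folded in and True meaning "stop", might look roughly like this, using minimal stand-ins for the project's enum and logger:

    import logging
    from enum import Enum
    from pathlib import Path

    class FileExistencePolicy(Enum):
        SKIP = "skip"
        OVERRIDE = "override"

    def enforce_file_existence_policy(file_path: Path, policy: FileExistencePolicy) -> bool:
        """Return True if the caller should stop because the file already exists."""
        if not file_path.exists():
            return False
        if policy == FileExistencePolicy.SKIP:
            logging.getLogger("main").warning(f"File already exists at {file_path}. Skipping.")
            return True
        if policy == FileExistencePolicy.OVERRIDE:
            file_path.unlink()  # assumption: override deletes the stale file so it can be recreated
            return False
        raise ValueError(f"Unknown file existence policy: {policy}")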

Member Author:

The bool flag returned by the function indicates whether we should continue or not. We could make it explicit via

allow_continue = enforce_file_existence_policy(index_path, file_existence_policy)
if not allow_continue:
    return

What do you think?

Member:

Yeah that's good. I would personally use the opposite boolean for readability, e.g. stop_process instead of allow_continue (and then ofc get rid of the not), but either is fine.

Member Author:

Changed in eb818e6.

pkl_encoded_index = f.read()
index_base = pickle.loads(pkl_encoded_index)
# Step 2: Shuffle the index
rng = Random(seed)
Member:

Is there a particular reason why Random is used here? Asking because this seems very similar to the shuffle_file_chunks_in_place method in create_chunks.py, where np.random is used. Maybe one could even use the same function for both shuffling processes?

Member Author:

Yes, we could and probably should. Since I have already prepared the training data with the current state of the code, I suggest we create an issue and fix this after we have pushed the new modalities version. That way, we stay consistent for now.

Member:

Created an issue: #304
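
As an illustrative sketch of what a shared helper could look like (not necessarily what the issue will implement), both shuffling paths could call one seeded, NumPy-based function:

    import numpy as np

    def seeded_shuffle_in_place(items: list, seed: int) -> None:
        # One deterministic, seeded shuffle shared by the tokenized-data index
        # and the file-chunk shuffling in create_chunks.py.
        rng = np.random.default_rng(seed)
        rng.shuffle(items)

Calling it as seeded_shuffle_in_place(index_base, seed) would then replace the Random(seed)-based shuffle shown above.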

@fromm-m (Member) commented on Feb 19, 2025

@le1nux I think it makes sense to resolve @flxst's comments before I have a look.

@fromm-m (Member) left a comment

LGTM

@le1nux le1nux merged commit d74edd5 into main Feb 20, 2025
3 checks passed
@le1nux le1nux deleted the seeded_preprocessing branch February 20, 2025 11:45