-
Notifications
You must be signed in to change notification settings - Fork 10
Seeded preprocessing #295
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Seeded preprocessing #295
Conversation
…et_chunk to be of ".pbin"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! :)
Added some comments, mostly suggestions for minor improvements. However, the branch does not contain the latest version of main
.
@@ -47,16 +76,8 @@ def create_raw_data_index( | |||
""" | |||
index_path = LargeFileLinesReader.default_index_path(src_path, index_path) | |||
if index_path.exists(): | |||
if file_existence_policy == FileExistencePolicy.SKIP: | |||
get_logger(name="main").warning(f"Index already exists at {str(index_path)}. Skipping index creation.") | |||
if not enforce_file_existence_policy(index_path, file_existence_policy): | |||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally find the logic with the returned boolean a bit confusing. To me, it would make more sense to swap True/False for Override/Skip and then use
if enforce_file_existence_policy(index_path, file_existence_policy):
return
In addition, one could integrate the previous line if index_path.exists():
into enforce_file_existence_policy
.
But I guess this is a matter of personal taste in the end, nothing wrong with the current solution either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bool flag returned by the function indicates whether we should continue or not. we could make it explicit via
allow_continue = enforce_file_existence_policy(index_path, file_existence_policy)
if not allow_continue:
return
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's good. I would personally use the opposite boolean for readability, e.g. stop_process
instead of allow_continue
(and then ofc get rid of the not
), but either is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed in eb818e6
pkl_encoded_index = f.read() | ||
index_base = pickle.loads(pkl_encoded_index) | ||
# Step 2: Shuffle the index | ||
rng = Random(seed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a particular reason why Random
is used here? Asking because this seems very similar to the shuffle_file_chunks_in_place
method in create_chunks.py
, where np.random
is used. Maybe one could even use the same function for both shuffling processes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we could do this and probably we should. Since I prepared the training data with this state already, I suggest we create an issue and fix this after we pushed the new modalities version. This way, we are consistent for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created an issue: #304
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
What does this PR do?
This PR adds
FileExistencePolicy
specifying how to react when a preprocessed file already existscreate_shuffled_dataset_chunk
andshuffle_tokenized_data
shuffle_tokenized_data
to the API layer rather than dataloadercalculate_hashed_seed
to calculate a seed value from a list of hashed stringsChecklist before submitting final PR
python tests/tests.py
)CHANGELOG_DEV.md
)