Programmatically remove control characters from LLM response #1367
Conversation
❌ Changes requested. Reviewed everything up to 8f01649 in 2 minutes and 26 seconds
More details
- Looked at 58 lines of code in 1 file
- Skipped 0 files when reviewing
- Skipped posting 5 drafted comments based on config settings
1. instructor/dsl/partial.py:51
   - Draft comment: Consider adding type hints and a docstring to 'remove_control_chars' for clarity.
   - Reason this comment was not posted: Confidence changes required: 50% <= threshold 85%
2. instructor/dsl/partial.py:55
   - Draft comment: Add type annotations and a brief docstring to 'process_potential_object' to clarify its intended usage.
   - Reason this comment was not posted: Confidence changes required: 50% <= threshold 85%
3. instructor/dsl/partial.py:230
   - Draft comment: Potential bug: Using 'chunk_buffer += chunk' with a string extends the list with individual characters. Use 'chunk_buffer.append(chunk)' instead.
   - Reason this comment was not posted: Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 80% vs. threshold = 85%
The code seems to be handling streaming JSON chunks. The current implementation splits strings into characters which may be intentional since it's later joined back together. However, if chunk is meant to be preserved as a whole string, using append() would be more appropriate. The behavior difference could impact how the JSON is processed.
I could be wrong about the intention - maybe splitting into characters is actually desired for some JSON parsing reason. The code seems to work in practice since it's joining the characters back together.
While the current code technically works by splitting and rejoining, using append() would be cleaner and more efficient if preserving whole chunks is intended. The risk of bugs from character-level manipulation makes this worth fixing.
Keep the comment as it identifies a legitimate code improvement that would make the chunk handling more robust and efficient.
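A quick plain-Python illustration of the behavior difference being weighed here (a hypothetical snippet, not the PR's code):

```python
chunk_buffer: list[str] = []

chunk_buffer += '{"a": 1}'        # += iterates the string: ['{', '"', 'a', '"', ':', ' ', '1', '}']
print(chunk_buffer)

chunk_buffer = []
chunk_buffer.append('{"a": 1}')   # append() keeps the chunk intact: ['{"a": 1}']
print(chunk_buffer)

# Either way, "".join(chunk_buffer) reproduces the original text, which is why the
# character-level version still "works" when the buffer is only ever joined back together.
print("".join(chunk_buffer))
```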
4. instructor/dsl/partial.py:51
   - Draft comment: Consider adding type hints for 'remove_control_chars' and 'process_potential_object' to improve code clarity.
   - Reason this comment was not posted: Confidence changes required: 66% <= threshold 85%
5. instructor/dsl/partial.py:116
   - Draft comment: Typo: In the comment on lines 116-117, 'it's' should be 'its' for proper grammar.
   - Reason this comment was not posted: Comment was not on a location in the diff, so it can't be submitted as a review comment.
Workflow ID: wflow_VWpKaCcd3rGRO960
Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.
potential_object += chunk
obj = from_json(
    (potential_object.strip() or "{}").encode(), partial_mode=partial_mode
chunk_buffer += chunk
Using += on a list with a string will extend the list by characters. Replace with append() to keep chunks intact.
chunk_buffer += chunk
chunk_buffer.append(chunk)
)
yield obj
if chunk_buffer:
    potential_object += remove_control_chars(chunk_buffer[0])
Potential bug: In the final check, only processing chunk_buffer[0] ignores remaining buffered content. Consider joining all elements (e.g., using ''.join(chunk_buffer)).
potential_object += remove_control_chars(chunk_buffer[0])
potential_object += remove_control_chars(''.join(chunk_buffer))
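To see why this matters, a small hypothetical example (not the PR's code) of what chunk_buffer[0] drops compared to joining the whole buffer:

```python
# Hypothetical buffered chunks of a streamed JSON response.
chunk_buffer = ['{"name": "Ja', 'ne", "age":', ' 30}']

only_first = chunk_buffer[0]        # '{"name": "Ja'                -> truncated, unparseable JSON
all_chunks = "".join(chunk_buffer)  # '{"name": "Jane", "age": 30}' -> complete JSON object

print(only_first)
print(all_chunks)
```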
Important
Add remove_control_chars() to clean LLM responses and refactor model_from_chunks() in partial.py for improved data processing.
- remove_control_chars() to remove control characters from strings using regex.
- model_from_chunks() in partial.py to use remove_control_chars() for cleaning JSON chunks.
- process_potential_object() to handle object processing and validation.
- model_from_chunks() now buffers chunks and removes control characters before processing.
This description was created by Ellipsis for 8f01649. It will automatically update as commits are pushed.
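The summary above does not show the implementation itself; as a rough sketch of what a regex-based cleaner and the chunk buffering could look like (the regex, function bodies, and the accumulate_chunks name are assumptions for illustration, not the PR's exact code):

```python
import re
from collections.abc import Iterable, Iterator

# Assumed pattern: strip C0/C1 control characters while keeping JSON-relevant whitespace (\t, \n, \r).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]")


def remove_control_chars(text: str) -> str:
    """Remove control characters that would otherwise break JSON parsing of an LLM response."""
    return _CONTROL_CHARS.sub("", text)


def accumulate_chunks(chunks: Iterable[str]) -> Iterator[str]:
    """Simplified stand-in for the buffering behavior described for model_from_chunks()."""
    chunk_buffer: list[str] = []
    potential_object = ""
    for chunk in chunks:
        chunk_buffer.append(chunk)                                    # buffer the raw chunk
        potential_object += remove_control_chars("".join(chunk_buffer))  # clean before processing
        chunk_buffer.clear()
        yield potential_object  # the real code would attempt a partial JSON parse here
```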