8000 Programmatically remove control characters from LLM response by xtzie · Pull Request #1367 · 567-labs/instructor · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Programmatically remove control characters from LLM response #1367

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 3, 2025

Conversation

xtzie
Copy link
Contributor
@xtzie xtzie commented Mar 3, 2025

Important

Add remove_control_chars() to clean LLM responses and refactor model_from_chunks() in partial.py for improved data processing.

  • Functionality:
    • Add remove_control_chars() to remove control characters from strings using regex.
    • Refactor model_from_chunks() in partial.py to use remove_control_chars() for cleaning JSON chunks.
    • Introduce process_potential_object() to handle object processing and validation.
  • Behavior:
    • model_from_chunks() now buffers chunks and removes control characters before processing.
    • Handles remaining buffer after loop to ensure all data is processed.

This description was created by Ellipsis for 8f01649. It will automatically update as commits are pushed.

Copy link
Contributor
@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested. Reviewed everything up to 8f01649 in 2 minutes and 26 seconds

More details
  • Looked at 58 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 5 drafted comments based on config settings.
1. instructor/dsl/partial.py:51
  • Draft comment:
    Consider adding type hints and a docstring to 'remove_control_chars' for clarity.
  • Reason this comment was not posted:
    Confidence changes required: 50% <= threshold 85%
    None
2. instructor/dsl/partial.py:55
  • Draft comment:
    Add type annotations and a brief docstring to 'process_potential_object' to clarify its intended usage.
  • Reason this comment was not posted:
    Confidence changes required: 50% <= threshold 85%
    None
3. instructor/dsl/partial.py:230
  • Draft comment:
    Potential bug: Using 'chunk_buffer += chunk' on a string extends the list with individual characters. Use 'chunk_buffer.append(chunk)' instead.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 80% vs. threshold = 85%
    The code seems to be handling streaming JSON chunks. The current implementation splits strings into characters which may be intentional since it's later joined back together. However, if chunk is meant to be preserved as a whole string, using append() would be more appropriate. The behavior difference could impact how the JSON is processed.
    I could be wrong about the intention - maybe splitting into characters is actually desired for some JSON parsing reason. The code seems to work in practice since it's joining the characters back together.
    While the current code technically works by splitting and rejoining, using append() would be cleaner and more efficient if preserving whole chunks is intended. The risk of bugs from character-level manipulation makes this worth fixing.
    Keep the comment as it identifies a legitimate code improvement that would make the chunk handling more robust and efficient.
4. instructor/dsl/partial.py:51
  • Draft comment:
    Consider adding type hints for 'remove_control_chars' and 'process_potential_object' to improve code clarity.
  • Reason this comment was not posted:
    Confidence changes required: 66% <= threshold 85%
    None
5. instructor/dsl/partial.py:116
  • Draft comment:
    Typo: In the comment on lines 116-117, 'it's' should be 'its' for proper grammar.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.

Workflow ID: wflow_VWpKaCcd3rGRO960


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

potential_object += chunk
obj = from_json(
(potential_object.strip() or "{}").encode(), partial_mode=partial_mode
chunk_buffer += chunk
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using += on a list with a string will extend the list by characters. Replace with append() to keep chunks intact.

Suggested change
chunk_buffer += chunk
chunk_buffer.append(chunk)

)
yield obj
if chunk_buffer:
potential_object += remove_control_chars(chunk_buffer[0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential bug: In the final check, only processing chunk_buffer[0] ignores remaining buffered content. Consider joining all elements (e.g., using ''.join(chunk_buffer)).

Suggested change
potential_object += remove_control_chars(chunk_buffer[0])
potential_object += remove_control_chars(''.join(chunk_buffer))

@jxnl jxnl merged commit ccfa60a into 567-labs:main Mar 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0