Programmatically remove control characters from LLM response #1367

xtzie · 2025-03-03T08:31:31Z

Important

Add remove_control_chars() to clean LLM responses and refactor model_from_chunks() in partial.py for improved data processing.

Functionality:
- Add remove_control_chars() to remove control characters from strings using regex.
- Refactor model_from_chunks() in partial.py to use remove_control_chars() for cleaning JSON chunks.
- Introduce process_potential_object() to handle object processing and validation.
Behavior:
- model_from_chunks() now buffers chunks and removes control characters before processing.
- Handles remaining buffer after loop to ensure all data is processed.

^{This description was created by}^{for 8f01649. It will automatically update as commits are pushed.}

ellipsis-dev

❌ Changes requested. Reviewed everything up to 8f01649 in 2 minutes and 26 seconds

More details

Looked at 58 lines of code in 1 files
Skipped 0 files when reviewing.
Skipped posting 5 drafted comments based on config settings.

1. instructor/dsl/partial.py:51

Draft comment:
Consider adding type hints and a docstring to 'remove_control_chars' for clarity.
Reason this comment was not posted:
Confidence changes required: 50% <= threshold 85%
None

2. instructor/dsl/partial.py:55

Draft comment:
Add type annotations and a brief docstring to 'process_potential_object' to clarify its intended usage.
Reason this comment was not posted:
Confidence changes required: 50% <= threshold 85%
None

3. instructor/dsl/partial.py:230

Draft comment:
Potential bug: Using 'chunk_buffer += chunk' on a string extends the list with individual characters. Use 'chunk_buffer.append(chunk)' instead.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 80% vs. threshold = 85%
The code seems to be handling streaming JSON chunks. The current implementation splits strings into characters which may be intentional since it's later joined back together. However, if chunk is meant to be preserved as a whole string, using append() would be more appropriate. The behavior difference could impact how the JSON is processed.
I could be wrong about the intention - maybe splitting into characters is actually desired for some JSON parsing reason. The code seems to work in practice since it's joining the characters back together.
While the current code technically works by splitting and rejoining, using append() would be cleaner and more efficient if preserving whole chunks is intended. The risk of bugs from character-level manipulation makes this worth fixing.
Keep the comment as it identifies a legitimate code improvement that would make the chunk handling more robust and efficient.

4. instructor/dsl/partial.py:51

Draft comment:
Consider adding type hints for 'remove_control_chars' and 'process_potential_object' to improve code clarity.
Reason this comment was not posted:
Confidence changes required: 66% <= threshold 85%
None

5. instructor/dsl/partial.py:116

Draft comment:
Typo: In the comment on lines 116-117, 'it's' should be 'its' for proper grammar.
Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.

Workflow ID: wflow_VWpKaCcd3rGRO960

Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

ellipsis-dev · 2025-03-03T08:34:06Z

instructor/dsl/partial.py

-            potential_object += chunk
-            obj = from_json(
-                (potential_object.strip() or "{}").encode(), partial_mode=partial_mode
+            chunk_buffer += chunk


Using += on a list with a string will extend the list by characters. Replace with append() to keep chunks intact.

Suggested change

chunk_buffer += chunk

chunk_buffer.append(chunk)

ellipsis-dev · 2025-03-03T08:34:06Z

instructor/dsl/partial.py

+            )
+            yield obj
+        if chunk_buffer:
+            potential_object += remove_control_chars(chunk_buffer[0])


Potential bug: In the final check, only processing chunk_buffer[0] ignores remaining buffered content. Consider joining all elements (e.g., using ''.join(chunk_buffer)).

Suggested change

potential_object += remove_control_chars(chunk_buffer[0])

potential_object += remove_control_chars(''.join(chunk_buffer))

Programmatically remove control characters from LLM response

8f01649

ellipsis-dev bot reviewed Mar 3, 2025

View reviewed changes

jxnl merged commit ccfa60a into 567-labs:main Mar 3, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Programmatically remove control characters from LLM response #1367

Programmatically remove control characters from LLM response #1367

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

	potential_object += remove_control_chars(chunk_buffer[0])
	potential_object += remove_control_chars(''.join(chunk_buffer))

Uh oh!

Programmatically remove control characters from LLM response #1367

Programmatically remove control characters from LLM response #1367

Uh oh!

Conversation

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!