feat: Add support for extracting citation and hallucination information from Granite 3.2 model output by hickeyma · Pull Request #51 · ibm-granite/granite-io · GitHub

feat: Add support for extracting citation and hallucination information from Granite 3.2 model output #51


Merged
merged 16 commits into from
Mar 12, 2025

Conversation

hickeyma
Collaborator
@hickeyma hickeyma commented Mar 5, 2025

Parser which supports extraction of Granite 3.2 output as follows:

  • Text (main response)
  • Citations
  • Hallucinations

Closes #36

Note: Big thank you to @yannisk2 for writing the parser engine.

@hickeyma hickeyma marked this pull request as draft March 5, 2025 15:58
@hickeyma hickeyma force-pushed the feat/handle-citations branch from 73e0063 to f306fd9 Compare March 5, 2025 16:01
@frreiss
Collaborator
frreiss commented Mar 5, 2025

I presume the intent here is to integrate the output parser into Granite3Point2InputOutputProcessor.output_to_result() and to have that method return a new, Granite-specific subclass of ChatCompletionResult that contains additional fields for the parsed citations and hallucination output?

@hickeyma
Collaborator Author
hickeyma commented Mar 5, 2025

> I presume the intent here is to integrate the output parser into Granite3Point2InputOutputProcessor.output_to_result() and to have that method return a new, Granite-specific subclass of ChatCompletionResult that contains additional fields for the parsed citations and hallucination output?

@frreiss Yes, I am first integrating the parser and aligning it with the project. I aim to finish the PR by wiring the parser output into the processor.

@hickeyma hickeyma force-pushed the feat/handle-citations branch 3 times, most recently from 6124411 to 1b721c7 Compare March 7, 2025 13:03
@hickeyma hickeyma marked this pull request as ready for review March 7, 2025 19:37
@hickeyma hickeyma changed the title from "feat: Add parser for Granite 3.2 model output" to "feat: Add support for extracting citation and hallucination information from Granite 3.2 model output" Mar 7, 2025
Collaborator
@frreiss frreiss left a comment

Looking good overall, some changes needed with regard to document texts and error handling.

Also it would be preferable to use Anthropic's citation format if feasible.

hickeyma added a commit that referenced this pull request Mar 10, 2025
Review comments:
- #51 (review)

Updates:
- Updated the Citations, Documents and Hallucination fields so that they are
only serialized with the AssistantMessage object if set.
- Moved the testdata directory to a specific granite_3_2 directory to keep
the test data self-contained
- Moved the Granite 3.2 specific implementation (IO processor and parser) to
the Granite 3.2 module under IO
- Modified the parser to log errors and return gracefully to the caller rather
than raising exceptions

Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
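
The "only serialized if set" behaviour from the first update can be illustrated with a minimal sketch. It uses a standard-library dataclass as a stand-in; `AssistantMessageSketch` and its `to_dict()` method are hypothetical names for illustration and are not the project's actual `AssistantMessage` implementation.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class AssistantMessageSketch:
    """Hypothetical stand-in for the project's AssistantMessage."""

    content: str
    # Optional fields default to None, meaning "not set".
    citations: Optional[list] = None
    documents: Optional[list] = None
    hallucinations: Optional[list] = None

    def to_dict(self) -> dict:
        # Keep only the fields that were actually set.
        return {key: value for key, value in asdict(self).items() if value is not None}
```

With this shape, a message without citations serializes to just its `content`, so callers that do not use RAG see no extra keys.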
@hickeyma hickeyma requested a review from frreiss March 10, 2025 20:55
@hickeyma
Collaborator Author
hickeyma commented Mar 10, 2025

Thanks @frreiss for the review and feedback. Updated and ready for review again.

hickeyma and others added 14 commits March 11, 2025 21:45
Parser for Granite's output response:
- Text
- Citation
- Hallucination

Closes #36

Co-authored-by: Yannis Katsis <yannis.katsis@ibm.com>
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Script tests the parser capability by calling
Granite model with RAG and citations.

Closes #36

Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
The parser requires documents to be a list of dictionaries with
keys "doc_id" and "text". The output from the model is in this format however:

<co>1</co> Document 1: "RAG, retrieval-augmented generation, is a technique
that grants generative artificial intelligence models information retrieval capabilities."

This commit converts from the string format to a dictionary.

Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
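
A minimal sketch of the string-to-dictionary conversion described in the commit above, assuming each document line follows the `<co>N</co> Document M: "..."` shape shown in the message. The function name `doc_str_to_dict` and the regular expression are illustrative assumptions, not the project's actual code.

```python
import re
from typing import Optional

# Assumed line shape: <co>CITATION_ID</co> Document DOC_ID: "TEXT"
_DOC_LINE = re.compile(
    r'^<co>(?P<citation_id>\d+)</co>\s*Document\s+(?P<doc_id>\d+):\s*"(?P<text>.*)"\s*$',
    re.DOTALL,  # the quoted text may wrap across lines
)


def doc_str_to_dict(line: str) -> Optional[dict]:
    """Convert one document string into a dict with "doc_id" and "text" keys."""
    match = _DOC_LINE.match(line.strip())
    if match is None:
        return None  # not in the assumed format
    # The document id comes from "Document M", not from the <co> citation tag.
    return {"doc_id": match.group("doc_id"), "text": match.group("text")}
```

Keeping the citation id and the document id as separate capture groups makes it harder to confuse the two when building the dictionary.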
- Ruff
- Pylint

Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Co-authored-by: Yannis Katsis <yannis.katsis@ibm.com>
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Updates as follows:
- Make internal functions pseudo private
- Add any missing arg and return types
- Change main entry function to `parse_model_output`
- Change main entry function to take the model output
as its only param
- Extend the main entry function to extract documents from the output
instead of having them passed in addition to the response as before

Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Unit tests failing because nltk data not downloaded.

Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
- Fix the extraction of constituents to make it more flexible and able
to handle hallucination and citations together in the output
- Refactor the extraction of citations to use the underlying
`_split_model_output_into_parts` output
- Escape specific metacharacters as raised by pep8

Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
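
The idea of splitting the output into constituents so that citations and hallucinations can appear together can be sketched as follows. The marker strings `# Citations:` and `# Hallucinations:` and the function `split_model_output` are assumptions made for illustration; this is not the project's `_split_model_output_into_parts`.

```python
# Assumed section markers; the real constants may differ.
CITATION_START = "# Citations:"
HALLUCINATION_START = "# Hallucinations:"


def split_model_output(output: str) -> dict:
    """Split raw output into response, citations and hallucinations parts."""
    markers = {"citations": CITATION_START, "hallucinations": HALLUCINATION_START}
    found = []
    for name, marker in markers.items():
        pos = output.find(marker)
        if pos != -1:
            found.append((pos, name, marker))
    found.sort()  # handle the sections in the order they appear

    # Everything before the first marker is the main response.
    first = found[0][0] if found else len(output)
    parts = {"response": output[:first].strip(), "citations": None, "hallucinations": None}
    for index, (pos, name, marker) in enumerate(found):
        # A section ends where the next one starts, or at the end of the output.
        end = found[index + 1][0] if index + 1 < len(found) else len(output)
        parts[name] = output[pos + len(marker):end].strip()
    return parts
```

Sorting by position rather than assuming a fixed order keeps the split working when only one section is present or when the sections are swapped.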
Parses the constituent parts of Granite 3.2 model output
and adds them as separate fields to the generated output.

Constituent parts are:
- Citations
- Documents
- Hallucinations

Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
These unit tests are added to the
Granite3Point2InputOutputProcessor.output_to_result()
unit tests. They cover parsing citations and hallucinations
from the output and setting them as data on the class.

This commit also includes:
- A fix for the response (minus constituents) not being
set in the content field
- Constants for citation and hallucination start points

Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Review comments:
- #51 (review)

Updates:
- Updated the Citations, Documents and Hallucination fields so that they are
only serialized with the AssistantMessage object if set.
- Moved the testdata directory to a specific granite_3_2 directory to keep
the test data self-contained
- Moved the Granite 3.2 specific implementation (IO processor and parser) to
the Granite 3.2 module under IO
- Modified the parser to log errors and return gracefully to the caller rather
than raising exceptions

Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
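
The last update, logging errors and returning gracefully instead of raising, might look roughly like the sketch below. Both functions and the returned dictionary shape are hypothetical; the `# Citations:` marker is also an assumption.

```python
import logging

logger = logging.getLogger(__name__)


def _strict_parse(model_output: str) -> dict:
    """Hypothetical strict parser that raises when the citation section is missing."""
    if "# Citations:" not in model_output:
        raise ValueError("no citation section found")
    response, _, citations = model_output.partition("# Citations:")
    return {"response": response.strip(), "citations": citations.strip()}


def parse_gracefully(model_output: str) -> dict:
    """Log parse failures and fall back to the raw output instead of raising."""
    try:
        return _strict_parse(model_output)
    except ValueError as err:
        logger.error("Could not parse model output: %s", err)
        # The caller still gets a usable result with the unparsed text.
        return {"response": model_output, "citations": None}
```

The trade-off is that callers must check for missing fields rather than catch an exception, which suits a processor that should still return the main response when parsing fails.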
After moving Granite 3.2 specific code to its own directory, new cassettes
for pytest-recording (vcrpy) are required because the tests also moved.

Running this command generates the new cassettes:
`tox -e unit -- --record-mode=new_episodes`

Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
`_convert_doc_strs_to_dicts()` was returning the citation id
instead of the document id for each document

Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
When finding the citation span, the parser needs to use the text of
the source document that was passed as input to the chat completion
call. That input is the source of truth, not the model output,
which may contain only a subset of the document.

Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
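
A minimal sketch of locating a citation span in the source document text rather than in the model output, as the commit above describes. The function name `find_citation_span` and the exact-substring matching are assumptions for illustration; the real parser may match differently.

```python
from typing import Optional, Tuple


def find_citation_span(source_doc_text: str, cited_text: str) -> Optional[Tuple[int, int]]:
    """Return (begin, end) offsets of cited_text within the source document."""
    # Search the document passed as input, since the model may echo only part of it.
    begin = source_doc_text.find(cited_text)
    if begin == -1:
        return None  # cited text does not occur verbatim in the source
    return begin, begin + len(cited_text)
```

Offsets computed this way refer to the caller's own document, so they stay valid even when the model output truncates the document text.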
@hickeyma hickeyma force-pushed the feat/handle-citations branch from 70ea66a to 48571a0 Compare March 11, 2025 21:50
Collaborator
@frreiss frreiss left a comment

LGTM. Thanks, @hickeyma!

@frreiss frreiss merged commit d763770 into main Mar 12, 2025
11 checks passed
@hickeyma hickeyma deleted the feat/handle-citations branch March 12, 2025 09:20
mandel pushed a commit to mandel/granite-io that referenced this pull request Apr 2, 2025
Review comments:
- ibm-granite#51 (review)

Updates:
- Updated the Citations, Documents and Hallucination fields so that they are
only serialized with the AssistantMessage object if set.
- Moved the testdata directory to a specific granite_3_2 directory to keep
the test data self-contained
- Moved the Granite 3.2 specific implementation (IO processor and parser) to
the Granite 3.2 module under IO
- Modified the parser to log errors and return gracefully to the caller rather
than raising exceptions

Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Signed-off-by: Louis Mandel <lmandel@us.ibm.com>
Successfully merging this pull request may close these issues.

Add parser for Granite 3.2 citation output