8000 Add EncodingFormat for FHIR files by crisely09 · Pull Request #883 · mlcommons/croissant · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Add EncodingFormat for FHIR files #883

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 20 commits into
base: main
Choose a base branch
from

Conversation

crisely09
Copy link

We would like to use Croissant recordsets to read FHIR (nested JSON Lines), wildly used in the medical sector.
This PR is an "easy" approach to enable the support for FHIR (application/fhir+json) encoding format.

@crisely09 crisely09 requested a review from a team as a code owner May 28, 2025 09:54
Copy link
github-actions bot commented May 28, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@ccl-core ccl-core self-requested a review May 28, 2025 10:01
@ccl-core
Copy link
Contributor

Hi @crisely09 , thank you for your contribution!
To increase our test coverage and enrich the example datasets for Croissant users, would you mind adding an example dataset which uses the new FHIR format to https://github.com/mlcommons/croissant/tree/main/datasets/1.1 ?

@crisely09
Copy link
Author

Hi @crisely09 , thank you for your contribution! To increase our test coverage and enrich the example datasets for Croissant users, would you mind adding an example dataset which uses the new FHIR format to https://github.com/mlcommons/croissant/tree/main/datasets/1.1 ?

I have added the example metadata into the datasets folder. I am not sure how to generate the output folder.
Also, I don't know what is the format error I am getting in the read.py file.
Thanks a lot for your help.

@ccl-core
Copy link
Contributor

Thanks! I'll review later on today.
You can generate the output records using this script: https://github.com/mlcommons/croissant/blob/main/python/mlcroissant/mlcroissant/scripts/load.py

@crisely09
Copy link
Author

I have noticed that the way the json is loaded is suuuuper slow, I am trying something to accelerate the Reading of Json files when jsonPath is used.

@crisely09
Copy link
Author

OK, I have fixed most of the issues, I really don't know how to fix the MyPy and Python format flows.

"""Parsed all JSONs defined in the fields of RecordSet and outputs a pd.DF."""
series = {}
for field in fields:
json_path = field.source.extract.json_path
if json_path is None:
jp = field.source.extract.json_path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be in favor of keeping the old variable names, for readibility (same below)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced jp with json_path.

@@ -1,20 +1,219 @@
"""Parse JSON operation."""

import json
import jmespath
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know these libraries are not that big, but I was wondering whether we should rather lazily load them?

@ccl-core
Copy link
Contributor

OK, I have fixed most of the issues, I really don't know how to fix the MyPy and Python format flows.

Yeah, mypy is annoying :S So the logs point to:
mlcroissant/_src/operation_graph/operations/read.py:137: error: Incompatible types in assignment (expression has type "JsonlReader", variable has type "JsonReader") [assignment]
So it seems like MyPy believes the variable reader is expected to hold an object of type JsonReader -- I guess MyPy infers the type of reader from its first assignment reader = JsonReader(self.fields)? Have you tried to explicitly declare the possible types for reader, like with reader: JsonReader | JsonlReader before the conditional block? I guess another option could be to use a typing.Protocol, but I would give it a try with the first method first...

For the formatting error, have you tried runnin black with the same specifications (--check --line-length 88 --preview etc) as we do in the tests? This should hopefully fix the tests.

"""
# raw JSON fallback: one‐cell DataFrame
fh.seek(0)
content = json.load(fh)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether it might make sense to use orjson.loads here as well? Wouldn't it maximise performance and be more consistent?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, makes total sense.

@ccl-core
Copy link
Contributor

Thank you @crisely09 for the PR! I like this approach, the refactoring of the JSON parsing logic into the two classes makes the codebase cleaner and more modular. And having support for FHIR-formatted data is great!

I left a few comments, let me know if you have further problems with the tests. I'll be OOO next week, but maybe @marcenacp or @benjelloun can unblock you with the needed approvals if needed?

@crisely09
Copy link
Author

OK, I have fixed most of the issues, I really don't know how to fix the MyPy and Python format flows.

Yeah, mypy is annoying :S So the logs point to: mlcroissant/_src/operation_graph/operations/read.py:137: error: Incompatible types in assignment (expression has type "JsonlReader", variable has type "JsonReader") [assignment] So it seems like MyPy believes the variable reader is expected to hold an object of type JsonReader -- I guess MyPy infers the type of reader from its first assignment reader = JsonReader(self.fields)? Have you tried to explicitly declare the possible types for reader, like with reader: JsonReader | JsonlReader before the conditional block? I guess another option could be to use a typing.Protocol, but I would give it a try with the first method first...

I went back to the logs, and the errors seem to be related to files I haven't modified, base_node.py for instance.

@crisely09
Copy link
Author

Thank you @crisely09 for the PR! I like this approach, the refactoring of the JSON parsing logic into the two classes makes the codebase cleaner and more modular. And having support for FHIR-formatted data is great!

I left a few comments, let me know if you have further problems with the tests. I'll be OOO next week, but maybe @marcenacp or @benjelloun can unblock you with the needed approvals if needed?

Thanks a lot @ccl-core for the careful review ! I think I have addressed all your comments, feel free to have another look.

@crisely09
Copy link
Author

Hello @ccl-core, I had to fix a few things, but some tests are still failing. I am not sure I am causing this to fail, could you have a look?

@ccl-core
Copy link
Contributor

Hello @ccl-core, I had to fix a few things, but some tests are still failing. I am not sure I am causing this to fail, could you have a look?

Hi @crisely09 , sorry, I was OOO last week :) Let me try to see if I can reproduce the mypy errors in my workspace!

@ccl-core
Copy link
Contributor

Hi @crisely09 , the mypy errors were due to a new version of mypy, and were unrelated to your changes, as you already pointed out (the CI was failing since a few weeks anyways https://github.com/mlcommons/croissant/actions/workflows/ci.yml :) )
I sent #890 that should hopefully fix the issue.

import jsonpath_rw
import orjson
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove?


# Load entire JSON file (could be a list or a single dict).
raw = fh.read()
data = orjson.loads(raw)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can see here an example of how to lazily load a library: 4fbd358

@ccl-core
Copy link
Contributor
ccl-core commented Jun 12, 2025

Hello @ccl-core, I had to fix a few things, but some tests are still failing. I am not sure I am causing this to fail, could you have a look?

I believe the mypy tests should be fixed now. The failures in the notebook tests probably stem from the refactored JSON parsing logic.

@crisely09
Copy link
Author

Hello @ccl-core, I had to fix a few things, but some tests are still failing. I am not sure I am causing this to fail, could you have a look?

I believe the mypy tests should be fixed now. The failures in the notebook tests probably stem from the refactored JSON parsing logic.

Thank you for the review!!
I will have a look at the parsing logic, to keep the expected behavior for this type of files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0