Add EncodingFormat for FHIR files #883

crisely09 · 2025-05-28T09:54:25Z

We would like to use Croissant recordsets to read FHIR (nested JSON Lines), widely used in the medical sector.
This PR is an "easy" approach to enable the support for FHIR (application/fhir+json) encoding format.

github-actions · 2025-05-28T09:54:38Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

ccl-core · 2025-05-28T10:05:48Z

Hi @crisely09 , thank you for your contribution!
To increase our test coverage and enrich the example datasets for Croissant users, would you mind adding an example dataset which uses the new FHIR format to https://github.com/mlcommons/croissant/tree/main/datasets/1.1 ?

crisely09 · 2025-05-28T10:31:39Z

> Hi @crisely09 , thank you for your contribution! To increase our test coverage and enrich the example datasets for Croissant users, would you mind adding an example dataset which uses the new FHIR format to https://github.com/mlcommons/croissant/tree/main/datasets/1.1 ?

I have added the example metadata into the datasets folder. I am not sure how to generate the output folder.
Also, I don't know what the format error I am getting in the read.py file is.
Thanks a lot for your help.

ccl-core · 2025-05-28T10:36:50Z

Thanks! I'll review later on today.
You can generate the output records using this script: https://github.com/mlcommons/croissant/blob/main/python/mlcroissant/mlcroissant/scripts/load.py

crisely09 · 2025-05-28T10:57:57Z

I have noticed that the way the JSON is loaded is super slow; I am trying something to accelerate the reading of JSON files when jsonPath is used.

crisely09 · 2025-05-28T13:32:02Z

OK, I have fixed most of the issues, I really don't know how to fix the MyPy and Python format flows.

ccl-core · 2025-05-29T20:54:09Z

python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py

    """Parsed all JSONs defined in the fields of RecordSet and outputs a pd.DF."""
    series = {}
    for field in fields:
-        json_path = field.source.extract.json_path
-        if json_path is None:
+        jp = field.source.extract.json_path


I would be in favor of keeping the old variable names, for readability (same below).

Replaced jp with json_path.

ccl-core · 2025-05-29T21:11:57Z

python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py

@@ -1,20 +1,219 @@
 """Parse JSON operation."""

+import json
+import jmespath


I know these libraries are not that big, but I was wondering whether we should rather lazily load them?
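One way to defer such imports is sketched below. The `lazy_import` and `parse_json_bytes` names are hypothetical and this is not necessarily how mlcroissant handles it; the fallback to the standard `json` module is an assumption added only so the snippet runs without `orjson` installed.

```python
import importlib
import json
from functools import lru_cache
from types import ModuleType
from typing import Any


@lru_cache(maxsize=None)
def lazy_import(name: str) -> ModuleType:
    """Import `name` on first call and cache the module object afterwards."""
    return importlib.import_module(name)


def parse_json_bytes(raw: bytes) -> Any:
    """Parse JSON bytes; the parser module is only imported when this runs."""
    try:
        # Nothing is imported at module load time, only here on first use.
        parser = lazy_import("orjson")
    except ImportError:
        # Assumption for this sketch: fall back to the standard library.
        parser = json
    return parser.loads(raw)
```

With this shape, importing the module that defines `parse_json_bytes` stays cheap, and the cost of loading the parser is paid only by callers that actually read JSON.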

ccl-core · 2025-05-29T21:52:32Z

> OK, I have fixed most of the issues, I really don't know how to fix the MyPy and Python format flows.

Yeah, mypy is annoying :S So the logs point to:
mlcroissant/_src/operation_graph/operations/read.py:137: error: Incompatible types in assignment (expression has type "JsonlReader", variable has type "JsonReader") [assignment]
So it seems like MyPy believes the variable reader is expected to hold an object of type JsonReader -- I guess MyPy infers the type of reader from its first assignment reader = JsonReader(self.fields)? Have you tried to explicitly declare the possible types for reader, like with reader: JsonReader | JsonlReader before the conditional block? I guess another option could be to use a typing.Protocol, but I would give it a try with the first method first...

For the formatting error, have you tried running black with the same specifications (--check --line-length 88 --preview, etc.) as we do in the tests? This should hopefully fix the tests.
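A minimal sketch of the explicit-annotation suggestion for the mypy error, assuming two unrelated reader classes. The `JsonReader` and `JsonlReader` below are hypothetical stand-ins (the real classes in mlcroissant have different bodies), and `make_reader` and `kind` are invented for illustration.

```python
from typing import List, Union


class JsonReader:
    """Hypothetical stand-in for a reader of whole JSON documents."""

    def __init__(self, fields: List[str]) -> None:
        self.fields = fields

    def kind(self) -> str:
        return "json"


class JsonlReader:
    """Hypothetical stand-in for a reader of JSON Lines files."""

    def __init__(self, fields: List[str]) -> None:
        self.fields = fields

    def kind(self) -> str:
        return "jsonl"


def make_reader(fields: List[str], is_json_lines: bool) -> Union[JsonReader, JsonlReader]:
    # Without this declaration, mypy infers `JsonReader` from the first
    # assignment and rejects the later `JsonlReader` assignment.
    # On Python 3.10+ this can be spelled `JsonReader | JsonlReader`.
    reader: Union[JsonReader, JsonlReader]
    if not is_json_lines:
        reader = JsonReader(fields)
    else:
        reader = JsonlReader(fields)
    return reader
```

If more reader types are added later, a `typing.Protocol` describing the shared methods may scale better than a growing union.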

ccl-core · 2025-05-29T21:22:14Z

python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py

+        """
+        # raw JSON fallback: one‐cell DataFrame
+        fh.seek(0)
+        content = json.load(fh)


I wonder whether it might make sense to use orjson.loads here as well? Wouldn't it maximise performance and be more consistent?

Yes, makes total sense.

ccl-core · 2025-05-29T21:55:45Z

Thank you @crisely09 for the PR! I like this approach, the refactoring of the JSON parsing logic into the two classes makes the codebase cleaner and more modular. And having support for FHIR-formatted data is great!

I left a few comments, let me know if you have further problems with the tests. I'll be OOO next week, but maybe @marcenacp or @benjelloun can unblock you with the needed approvals if needed?

crisely09 · 2025-05-30T15:24:53Z

> > OK, I have fixed most of the issues, I really don't know how to fix the MyPy and Python format flows.
>
> Yeah, mypy is annoying :S So the logs point to: mlcroissant/_src/operation_graph/operations/read.py:137: error: Incompatible types in assignment (expression has type "JsonlReader", variable has type "JsonReader") [assignment] So it seems like MyPy believes the variable reader is expected to hold an object of type JsonReader -- I guess MyPy infers the type of reader from its first assignment reader = JsonReader(self.fields)? Have you tried to explicitly declare the possible types for reader, like with reader: JsonReader | JsonlReader before the conditional block? I guess another option could be to use a typing.Protocol, but I would give it a try with the first method first...

I went back to the logs, and the errors seem to be related to files I haven't modified, base_node.py for instance.

crisely09 · 2025-05-30T15:27:02Z

> Thank you @crisely09 for the PR! I like this approach, the refactoring of the JSON parsing logic into the two classes makes the codebase cleaner and more modular. And having support for FHIR-formatted data is great!
>
> I left a few comments, let me know if you have further problems with the tests. I'll be OOO next week, but maybe @marcenacp or @benjelloun can unblock you with the needed approvals if needed?

Thanks a lot @ccl-core for the careful review! I think I have addressed all your comments, feel free to have another look.

crisely09 · 2025-06-06T09:43:48Z

Hello @ccl-core, I had to fix a few things, but some tests are still failing. I am not sure I am causing this to fail, could you have a look?

ccl-core · 2025-06-11T08:59:38Z

> Hello @ccl-core, I had to fix a few things, but some tests are still failing. I am not sure I am causing this to fail, could you have a look?

Hi @crisely09 , sorry, I was OOO last week :) Let me try to see if I can reproduce the mypy errors in my workspace!

ccl-core · 2025-06-12T15:00:52Z

Hi @crisely09 , the mypy errors were due to a new version of mypy and were unrelated to your changes, as you already pointed out (the CI had been failing for a few weeks anyway: https://github.com/mlcommons/croissant/actions/workflows/ci.yml :) )
I sent #890 that should hopefully fix the issue.

ccl-core · 2025-06-12T15:08:26Z

python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py

 import jsonpath_rw
+import orjson


ccl-core · 2025-06-12T15:11:51Z

python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py

+
+        # Load entire JSON file (could be a list or a single dict).
+        raw = fh.read()
+        data = orjson.loads(raw)


You can see here an example of how to lazily load a library: 4fbd358

ccl-core · 2025-06-12T15:17:56Z

> Hello @ccl-core, I had to fix a few things, but some tests are still failing. I am not sure I am causing this to fail, could you have a look?

I believe the mypy tests should be fixed now. The failures in the notebook tests probably stem from the refactored JSON parsing logic.

crisely09 · 2025-06-23T07:18:55Z

> > Hello @ccl-core, I had to fix a few things, but some tests are still failing. I am not sure I am causing this to fail, could you have a look?
>
> I believe the mypy tests should be fixed now. The failures in the notebook tests probably stem from the refactored JSON parsing logic.

Thank you for the review!!
I will have a look at the parsing logic, to keep the expected behavior for this type of file.

add reading option for fhir

fe626df

crisely09 requested a review from a team as a code owner May 28, 2025 09:54

ccl-core self-requested a review May 28, 2025 10:01

crisely09 added 3 commits May 28, 2025 12:05

little reformatting

4501421

add fhir dataset example

5d82a63

small addition to metadata

1dd393f

ccl-core reviewed May 28, 2025


crisely09 added 2 commits May 28, 2025 12:55

added output for serviceRequest loading record-set

92a9c75

simplify a bit the metadata file

cc18426

crisely09 added 6 commits May 28, 2025 14:00

Read JSON files faster

265e93f

bring back previous definition of the parse_json_content

1ee3986

few format fixes

30fde7c

align dataset metadata example

350e9a5

fall back to jsonpath_rw when there is recursive-descent

ed78906

fix flake8

a4dce21

ccl-core reviewed May 29, 2025


ccl-core reviewed May 29, 2025


ccl-core reviewed May 29, 2025

crisely09 added 2 commits May 30, 2025 16:55

Black format fixes, add tests for classes, other suggested changes

3fb1277

updated output from dataset

bf76353

crisely09 added 3 commits May 30, 2025 17:05

fix isort

3504bf6

fix test expectations

5e5b9b2

fix format

062ab96

crisely09 added 3 commits May 30, 2025 17:31

fix flakes

5c790b0

fix expectation of tests

d0f36f6

if not replaced to if is None

c331ae3

ccl-core reviewed Jun 12, 2025
