-
Notifications
You must be signed in to change notification settings - Fork 71
Add EncodingFormat for FHIR files #883
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
Hi @crisely09 , thank you for your contribution! |
I have added the example metadata into the datasets folder. I am not sure how to generate the output folder. |
Thanks! I'll review later on today. |
I have noticed that the way the json is loaded is suuuuper slow, I am trying something to accelerate the Reading of Json files when jsonPath is used. |
OK, I have fixed most of the issues, I really don't know how to fix the MyPy and Python format flows. |
"""Parsed all JSONs defined in the fields of RecordSet and outputs a pd.DF.""" | ||
series = {} | ||
for field in fields: | ||
json_path = field.source.extract.json_path | ||
if json_path is None: | ||
jp = field.source.extract.json_path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be in favor of keeping the old variable names, for readibility (same below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replaced jp
with json_path
.
@@ -1,20 +1,219 @@ | |||
"""Parse JSON operation.""" | |||
|
|||
import json | |||
import jmespath |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know these libraries are not that big, but I was wondering whether we should rather lazily load them?
python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py
Show resolved
Hide resolved
Yeah, mypy is annoying :S So the logs point to: For the formatting error, have you tried runnin black with the same specifications ( |
python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py
Show resolved
Hide resolved
""" | ||
# raw JSON fallback: one‐cell DataFrame | ||
fh.seek(0) | ||
content = json.load(fh) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder whether it might make sense to use orjson.loads here as well? Wouldn't it maximise performance and be more consistent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, makes total sense.
python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py
Show resolved
Hide resolved
Thank you @crisely09 for the PR! I like this approach, the refactoring of the JSON parsing logic into the two classes makes the codebase cleaner and more modular. And having support for FHIR-formatted data is great! I left a few comments, let me know if you have further problems with the tests. I'll be OOO next week, but maybe @marcenacp or @benjelloun can unblock you with the needed approvals if needed? |
I went back to the logs, and the errors seem to be related to files I haven't modified, |
Thanks a lot @ccl-core for the careful review ! I think I have addressed all your comments, feel free to have another look. |
Hello @ccl-core, I had to fix a few things, but some tests are still failing. I am not sure I am causing this to fail, could you have a look? |
Hi @crisely09 , sorry, I was OOO last week :) Let me try to see if I can reproduce the mypy errors in my workspace! |
Hi @crisely09 , the mypy errors were due to a new version of mypy, and were unrelated to your changes, as you already pointed out (the CI was failing since a few weeks anyways https://github.com/mlcommons/croissant/actions/workflows/ci.yml :) ) |
import jsonpath_rw | ||
import orjson |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove?
|
||
# Load entire JSON file (could be a list or a single dict). | ||
raw = fh.read() | ||
data = orjson.loads(raw) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can see here an example of how to lazily load a library: 4fbd358
I believe the mypy tests should be fixed now. The failures in the notebook tests probably stem from the refactored JSON parsing logic. |
Thank you for the review!! |
We would like to use Croissant recordsets to read FHIR (nested JSON Lines), wildly used in the medical sector.
This PR is an "easy" approach to enable the support for FHIR (application/fhir+json) encoding format.