10000 Search CQ DIrectory for orgs within a radius from patient address by RamilGaripov · Pull Request #1271 · metriport/metriport · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Search CQ DIrectory for orgs within a radius from patient address #1271

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 28 commits into from
Dec 8, 2023

Conversation

RamilGaripov
Copy link
Contributor
@RamilGaripov RamilGaripov commented Dec 1, 2023

Ref: https://github.com/metriport/metriport-internal/issues/1302

Dependencies

Description

  • Added AWS Location Service to Core and API stack
  • Added functionality to find CQ directory organizations within a radius around patients' addresses
  • Added internal endpoints to create mock cq organizations for testing - this is temporary
  • Removed update functionality for CQ directory entry rows
  • Removed listAllOrganizations from the carequality-sdk

Testing

Local tests:

  • Thoroughly tested locally: 70k rows of mocked organizations get created and inserted in 16-17 seconds.

Tests on staging:

  • 100k rows of mocked organizations created and inserted in 2 min 20 seconds. The API was very slow after that for a couple of minutes. Note that the prod db instance is 4x bigger than that on staging, so it should handle it more easily.
  • Added some sleep cycles in between batches to ease the load on the db: 100k rows took 3 min 20 seconds, and the API seemed to work completely fine right after.

Follow up / Related issues:

  • Updating the Address model to include geog coordinates - 1326

Release Plan

  • Run some SQL queries in staging to clean up the previous migrations - context
  • Run some SQL queries in prod to clean up the previous migrations
  • Release NPM (alpha)
  • Release NPM (prod)
  • Merge upstream
  • Merge this

Comment on lines 36 to 38
): Promise<CQOrganizationCreateResponse[]> => {
const oids = orgDataArray.map(data => data.oid);
const existingEntries = await getCQDirectoryEntriesByOids(oids);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is critical AFAIK, so we probably want to do some tweaks we don't usually do for regular DB/queries/lookups:

  1. oids can probably have 10s of 1,000s of records, rights? So we probably want to limit the amount of IDs we want to sent to the DB, this will be translated into select * from ... where id in (LIST), we don't want LIST to have 20,000 ids 😅
    • maybe chunk it to 100 entries per query and issue around 20 queries in parallel?
  2. getCQDirectoryEntriesByOids() should only return the exact data we need, in this case just the record's id;
  3. (looping trough the entries here is expensive, we're talking about +1,000 items) we should probably consider having a map instead of an array (I would even consider the listAllOrganizations from the SDK to return a map, so one less loop through 10s of 1,000s of records);
    • but that only if its worth doing it - i.e., no point converting to map if we're only going to iterate over it once (might as well just use the array)
  4. regardless of using a map or an array, we can add a property to the original payload to indicate whether it should be updated or created (I think it will be cheaper to filter twice later than keep adding items to array one by one like lines 45 and 47);
  5. speaking of which, we probably want to avoid object deconstructing like line 45 - can't we just set the id of orgData?
  6. re: reconstructing: also want to avoid that on lines 57 and 58 - could return separated properties
    • actually, do we absolutely need to return the result? if not, this means create and update also don't need to return

Copy link
Contributor Author
@RamilGaripov RamilGaripov Dec 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far, I took into account 2, 4, 5 and 6.

#1 I marked as a TODO, and for 3, we only iterate thru it once.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#1 I marked as a TODO

As in, this PR's TODO, or TODO for later/another PR?

Also, I forgot about this:
7. Is it worth to run check for existing entries and update them selectively, or should we just update it all? If we always recreate, we save (1) loading data (which I believe is like +99% of the records, right?) and (2) comparing it in memory (which will not be cheap, very CPU intensive considering the amount of data).
8. Do we need the directory's ID? If not, removing the PK also removes the associated index, which will speed up a lot building it.
9. We should prob remove indexes before we recreate the directory and add them afterwards, this should speed up a lot the whole process.

Copy link
Contributor Author
@RamilGaripov RamilGaripov Dec 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Later TODOs. I will update the description and create new issues to link here.

For 7), I think it's worth only updating the entries that need an update. I think at some point, when we start querying the directory on a daily basis, we will probably want to know which organizations actually got updated, so we can re-query for new organization x patient associations, especially in the case of an address change.

For 8), the directory ID is the OID of the organization.

For 9), I'll look into it, but might need clarifications - will take it to Slack

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. That's a good point, but I would make sure that's the case, because I assume the difference in execution time from always create vs. update to be significant. If we need to update, could look into .upsert() so we don't need to load and compare.

  2. Gotcha. And are we going to query by OID? Or always by lat/lon?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. probably OID and point, rather than lat/lon. Maybe we should add another index for the point?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should add another index for the point?

If we're going to query the directory using that as the main filter, likely yes. Would need to look at the filter/conditions to be able to better recommend what to do here. We can add it when we PR this query.

const orgs = [];

// TODO: MAKE SURE LAT LON WITHIN RANGE
for (const coord of coordinates) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's usually well under 5, could prob issue parallel requests to the DB - although I know the majority will be one, so NABD.

Comment on lines 79 to 80
Sequelize.literal(`earth_box(ll_to_earth (${coord.lat}, ${coord.lon}), ${radiusInMeters}) @> ll_to_earth (lat, lon)
AND earth_distance(ll_to_earth (${coord.lat}, ${coord.lon}), ll_to_earth (lat, lon)) < ${radiusInMeters}`),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be something we can do some improvement as well. I'm assuming this is doing a full-table scan, right?

We probably can't do much about it since the filter criteria is applying a function on the columns with dynamic values - still worth checking how earth_box and earth_distance work WRT indexes, but maybe we can store ll_to_earth(lat, lon) as a precomputed column. It might not help much, but just something that crossed my mind as well.

IIRC, we don't have indexes on lat, long, but we don't do any query on those without applying a function to the values, right? If so, unless Postgres' geolocation functions can leverage it, not worth adding those.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm that's a good point.. storing that instead of, or in addition to, the coordinates could save some time when querying for each patient's nearby facilities

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking to see if it's worth the time (I'd recommend you do too), and TIL Posgres has a type of index that supports complex types, it's called GiST: https://blog.devgenius.io/generalized-search-tree-gist-indexes-in-postgresql-ba77e1bf26a1

Found about it here (good reading): https://dba.stackexchange.com/questions/314984/optimizing-geolocation-query-in-postgres-with-earth-box-and-earth-distance

Found this too: https://stackoverflow.com/questions/18059740/how-to-create-a-postgresql-index-that-includes-latitude-longitude-using-gist

I'd look into these to better understand how it works, best practices, and caveats, while building this type of solution.

@RamilGaripov RamilGaripov marked this pull request as ready for review December 5, 2023 02:58
console.log("Parsed", orgs.length, "organizations for the CQ directory.");

for (let i = 0; i <= orgs.length; i += BATCH_SIZE) {
const batch = orgs.slice(i, i + BATCH_SIZE) as CQDirectoryEntryDataWithUpdateAndId[];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also save cycles skipping this .slice() and always passing the whole directory, indicating from which position to start working at and how many entries to be processed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to implement that, but with how getCQDirectoryEntriesIdsAndLastUpdated() is built, I had to either use slice, or have some for loops.. i felt like using slice() in this case to break it up into batches isn't as bad

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have some for loops

Yeah, the simplest form of "for loop":

for (const i = initialPos; i < Math.min(initialPos + batchSize, arr.length); i++) {
   ...
}

10000
async function computeEarthPoint(orgData: CQDirectoryEntryData): Promise<string | undefined> {
if (orgData.lat && orgData.lon) {
const query = "SELECT ll_to_earth(:lat, :lon) as point";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use sequelize orm instead of raw sql querries. So something like this instead:
const pointQueryResult = await CQDirectoryEntryModel.findOne({ attributes: [[sequelize.fn('ll_to_earth', orgData.lat, orgData.lon), 'point']], raw: true,

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd consider the pros and cons here. I'm 100% for using Sequelize's ORM goodies in the vast majority of cases, especially to have type safety. But, if we don't get type safety from it and it makes it harder to code, I'd go with pure SQL for very specific applications/use cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, in this particular case, couldn't we compute it in memory? This is likely some standard computation, so we probably don't need to ask the DB do calculate it for us.

This answer might leads us on the right path, I got to this source code for the ll_to_earth function.

const entries = await CQDirectoryEntryModel.findAll({
where: {
oid: oids,
},
Copy link
Contributor
@jonahkaye jonahkaye Dec 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add:
attributes: ['id', 'oid', 'lastUpdated'],
to the search to improve efficiency by only retrieving necessary fields. Also sorta similar to the suggestion Raf has been making to explicitly decompose patient: { id, cxId } in function headers

const lat = parseFloat(position.latitude.value);
const lon = parseFloat(position.longitude.value);
if (lat < -90 || lat > 90) return [];
if (lon < -180 || lon > 180) return [];
Copy link
Contributor
@jonahkaye jonahkaye Dec 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parseFloat can return NaN if non numeric characters in position lat long. Maybe check for NaNs:
if (isNaN(lat) || isNaN(lon)) return [];

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh cool, didnt know that!

Ref: #000

 - api@1.10.1-alpha.0
 - @metriport/api-sdk@7.8.3-alpha.0
 - carequality-cert-runner@1.0.1-alpha.0
 - @metriport/carequality-sdk@0.1.6-alpha.0
 - @metriport/commonwell-cert-runner@1.10.7-alpha.0
 - @metriport/commonwell-jwt-maker@1.7.7-alpha.0
 - @metriport/commonwell-sdk@4.9.7-alpha.0
 - connect-widget@1.7.11-alpha.0
 - @metriport/core@1.6.8-alpha.0
 - @metriport/ihe-gateway-sdk@0.1.9-alpha.0
 - infrastructure@1.5.9-alpha.0
 - @metriport/react-native-sdk@1.5.7-alpha.0
 - @metriport/shared@0.1.7-alpha.0
 - tester-node@1.1.11-alpha.0
 - utils@1.7.10-alpha.0
Copy link
Member
@leite08 leite08 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LFG!!! 🚀 🚀 🚀

@RamilGaripov RamilGaripov merged commit 22671b3 into develop Dec 8, 2023
@RamilGaripov RamilGaripov deleted the 1302-search-cq-directory branch December 8, 2023 19:04
Copy link
github-actions bot commented Dec 8, 2023

🎉 This PR is included in version 5.28.0-develop.3 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

Copy link
github-actions bot commented Dec 9, 2023

🎉 This PR is included in version 5.29.0-develop.1 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

Copy link

🎉 This PR is included in version 5.29.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0