Support automatic de-duplication of interactions · Issue #769 · lenskit/lkpy · GitHub

Support automatic de-duplication of interactions #769

mdekstrand opened this issue May 23, 2025 · 0 comments
Labels
data Data management support.

mdekstrand commented May 23, 2025

Some data sets, such as the old Amazon ratings, include fully duplicated entries. Supporting these is a stepping stone to #770.

The DatasetBuilder.add_interactions method should support de-duplicating interactions as they are added to the dataset.

Proposed interface:

  • Deprecate the allow_repeats option (to both add_interactions and add_relationships, and the corresponding relationship class method).
  • Add a new option repeats, which accepts one of four values: allow, forbid, remove, and remove-duplicates.
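
One way the deprecation could be handled is sketched below. This is a minimal illustration, not LensKit code: the helper name resolve_repeats, the mapping of allow_repeats=True/False onto allow/forbid, and the default of allow are all assumptions that the issue does not specify.

```python
import warnings

# The four values proposed in this issue.
VALID_REPEATS = ("allow", "forbid", "remove", "remove-duplicates")


def resolve_repeats(repeats=None, allow_repeats=None):
    """Hypothetical helper: merge the deprecated flag into the new option."""
    if allow_repeats is not None:
        warnings.warn(
            "allow_repeats is deprecated; use repeats instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # Assumed mapping: an explicit `repeats` wins over the old flag.
        if repeats is None:
            repeats = "allow" if allow_repeats else "forbid"
    if repeats is None:
        repeats = "allow"  # assumed default, not stated in the issue
    if repeats not in VALID_REPEATS:
        raise ValueError(f"unknown repeats mode: {repeats!r}")
    return repeats
```

Under these assumptions, existing callers that pass allow_repeats=False keep their current behaviour (an error on repeats) while seeing a deprecation warning.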

The options work as follows:

  • allow permits repeated relationship records, including full duplicates. The schema's repeat field is set to either ALLOWED or PRESENT, as appropriate.
  • forbid rejects repeated relationship records, raising an error if any are present. The schema's repeat field is set to FORBIDDEN.
  • remove removes repeated relationship records (records that share the same set of entity IDs). The schema's repeat field is set to FORBIDDEN.
  • remove-duplicates removes duplicate relationship records (records that have the same entity IDs and the same field values, i.e. rows that are fully duplicated in the input table). The schema's repeat field is set to ALLOWED or PRESENT, as appropriate.
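
The four modes can be illustrated with a small pure-Python sketch over a list of row dicts. It is not the LensKit implementation: apply_repeats and the RepeatPolicy enum are hypothetical names, keeping the first occurrence of a repeated record is an assumption (the issue does not say which record survives), and PRESENT is assumed to mean that repeats remain in the stored data.

```python
from enum import Enum


class RepeatPolicy(Enum):
    # Hypothetical stand-in for the schema's repeat field values.
    FORBIDDEN = "forbidden"
    ALLOWED = "allowed"
    PRESENT = "present"


def apply_repeats(rows, entity_cols, repeats="allow"):
    """Apply one proposed `repeats` mode; return (kept_rows, policy)."""

    def entity_key(row):
        # Identity of a relationship record: its entity IDs only.
        return tuple(row[c] for c in entity_cols)

    def full_key(row):
        # Identity of a full duplicate: entity IDs plus all other fields.
        return tuple(sorted(row.items()))

    def has_repeats(rs):
        keys = [entity_key(r) for r in rs]
        return len(set(keys)) < len(keys)

    def drop(rs, key):
        # Keep the first record for each key (an assumption of this sketch).
        seen, kept = set(), []
        for r in rs:
            k = key(r)
            if k not in seen:
                seen.add(k)
                kept.append(r)
        return kept

    if repeats == "forbid":
        if has_repeats(rows):
            raise ValueError("repeated relationship records are forbidden")
        return list(rows), RepeatPolicy.FORBIDDEN
    if repeats == "remove":
        return drop(rows, entity_key), RepeatPolicy.FORBIDDEN
    if repeats == "allow":
        kept = list(rows)
    elif repeats == "remove-duplicates":
        kept = drop(rows, full_key)
    else:
        raise ValueError(f"unknown repeats mode: {repeats!r}")
    # Repeats may remain in these two modes, so report what is in the data.
    policy = RepeatPolicy.PRESENT if has_repeats(kept) else RepeatPolicy.ALLOWED
    return kept, policy


rows = [
    {"user_id": 1, "item_id": 10, "rating": 4.0},
    {"user_id": 1, "item_id": 10, "rating": 4.0},  # full duplicate
    {"user_id": 1, "item_id": 10, "rating": 5.0},  # repeat, different field
    {"user_id": 2, "item_id": 10, "rating": 3.0},
]
```

With this sample, remove leaves two rows, while remove-duplicates leaves three: the fully duplicated row is dropped, but user 1 still has two distinct ratings for item 10, so repeats remain present.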
@mdekstrand mdekstrand added the data Data management support. label May 23, 2025
@mdekstrand mdekstrand moved this from Backlog to Ready in LensKit Development May 23, 2025
Projects
Status: Ready
Development

No branches or pull requests
