Add postgresql as an alternative to the sqlite3 database by dholth · Pull Request #199 · conda/conda-index

Add postgresql as an alternative to the sqlite3 database #199


Open · dholth wants to merge 54 commits into main

Conversation

@dholth (Contributor) commented Mar 27, 2025

Description

Taking a cue from Steam, we place a random unique identifier on disk. This and the subdir will become part of the "path" key or be placed in a separate column, so that one postgresql database can handle multiple indexes. (The current sqlite scheme uses a separate database per subdir and a plain archive name.)
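A minimal sketch of the key scheme described above (values and filenames are illustrative, not this PR's exact code):

import os

# Hypothetical: a random channel_id is generated once and persisted, then
# combined with the subdir and archive filename to form a globally unique
# "path" key, letting one postgres database hold many channels and subdirs.
channel_id = os.urandom(8).hex()  # e.g. "3f9a1c2b4d5e6f70"
path_key = f"{channel_id}/linux-64/somepkg-1.0-0.conda"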

Checklist - did you ...

  • Add a file to the news directory (using the template) for the next release's release notes?
  • Add / update necessary tests?
  • Add / update outdated documentation?

@conda-bot conda-bot added the cla-signed [bot] added once the contributor has signed the CLA label Mar 27, 2025
@github-project-automation github-project-automation bot moved this to 🆕 New in 🔎 Review Mar 27, 2025
def postgresql_fixture():
"""
Run a local postgresql server for testing.
"""
@dholth (Contributor, Author) commented:

Random port?
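One hedged way to do that (the helper name is illustrative; the fixture would pass the port through to the server it launches):

import socket

def free_port() -> int:
    # Binding to port 0 asks the OS for any unused port.
    with socket.socket() as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]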

__table__: Table
__tablename__ = "stat"

stage = mapped_column(TEXT, default="indexed", nullable=False, primary_key=True)

Review comment:

Maybe an enum type containing the values s3, clone, indexed?

@dholth (Contributor, Author) replied:

I thought the user might want to invent more stages.
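For reference, a sketch of the suggested enum alternative; a native postgres enum rejects stages that were not declared up front, which is exactly the flexibility traded away here (the "fs" value is taken from the listdir code below, the rest from the suggestion):

from sqlalchemy import Enum
from sqlalchemy.orm import mapped_column

stage = mapped_column(
    Enum("s3", "clone", "indexed", "fs", name="stage_enum"),
    default="indexed",
    nullable=False,
    primary_key=True,
)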

size = mapped_column(Integer)
sha256 = mapped_column(TEXT)
md5 = mapped_column(TEXT)
last_modified = mapped_column(TEXT)

Review comment:

A datetime type?

@dholth (Contributor, Author) replied:

This is a literal Last-Modified header from S3, e.g. "Wed, 21 Oct 2015 07:28:00 GMT" (the standard HTTP date format).

channel_url=channel_url,
upstream_stage=upstream_stage,
)
self.db_filename = self.channel_root / ".cache" / "cache.json"

Review comment:

This db_filename is required by CondaIndexCache, I suppose, but what is it for in the postgres implementation?

Comment on lines 55 to 57
self.db_filename.parent.mkdir(parents=True)
self.db_filename.write_text(json.dumps({"channel_id": os.urandom(8).hex()}))
self.cache_is_brand_new = True

Review comment:

Seems artificial to me; I wonder what CondaIndexCache is expecting there? Could cache_is_brand_new be based somehow on the postgres database itself, e.g. whether it is empty?

@dholth (Contributor, Author) replied:

It was used to decide whether to convert the old json-files cache to SQL.

else:
self.cache_is_brand_new = False

self.channel_id = json.loads(self.db_filename.read_text())["channel_id"]

Review comment:

OK, I suppose we need to provide the unique channel_id; that makes more sense to me now.

"""
# If recording information about the channel_root, use '_ROOT' for nice
# prefix searches
return f"{self.channel_id}/{self.subdir or '_ROOT'}/"

Review comment:

Would that mean there will be a separate postgres database per subdir + _ROOT?

@dholth (Contributor, Author) replied:

One postgresql database; each (channel, subdir) pair has a unique prefix. "_ROOT" is an idea to give a unique prefix when tracking information about channel/<files> rather than channel/subdir/<files>.
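Illustrative example of the resulting prefixes (the channel_id value is made up):

channel_id = "3f9a1c2b4d5e6f70"
subdir_prefix = f"{channel_id}/linux-64/"  # keys for channel/linux-64/<files>
root_prefix = f"{channel_id}/_ROOT/"       # keys for channel/<files>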

Comment on lines 101 to 105
connection.execute(
stat.delete().where(stat.c.path.like(self.database_path_like))
)
for item in listdir_stat:
connection.execute(stat.insert(), {**item, "stage": "fs"})

Review comment:

What does it do exactly? Delete the rows and insert? If so, can we do an update instead?

@dholth (Contributor, Author) replied:

This is the standard behavior, where "exactly the files in os.listdir() need to be present in the index", but there are ways to skip calling this function.
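For comparison, a hedged sketch of the per-row upsert the reviewer hints at, reusing connection, stat, and listdir_stat from the hunk above and the postgresql-specific insert this PR uses elsewhere; note it would not remove rows for files that disappeared from disk, which the wholesale delete handles:

from sqlalchemy.dialects.postgresql import insert as pg_insert

for item in listdir_stat:
    row = {**item, "stage": "fs"}
    connection.execute(
        pg_insert(stat)
        .values(row)
        .on_conflict_do_update(
            index_elements=["stage", "path"],  # the composite primary key
            set_={"mtime": row["mtime"], "size": row["size"]},  # assumed columns
        )
    )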

"""
Write cache for a single package to database.
"""
database_path = self.database_path(fn)

Review comment:

What is database_path in the postgresql case?

@dholth (Contributor, Author) replied on Apr 4, 2025:

It is <random per channel id>/subdir/package.conda so that all packages in a given repodata.json output can be selected with a range query.
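A sketch of that range query against the stat table from this PR (the prefix value is illustrative):

from sqlalchemy import select

prefix = "3f9a1c2b4d5e6f70/linux-64/"  # <random per channel id>/subdir/
rows = connection.execute(
    select(stat).where(stat.c.path.like(prefix + "%"))
)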

database_path = self.database_path(fn)
with self.engine.begin() as connection:
for have_path in members:
table = PATH_TO_TABLE[have_path]

Review comment:

What does PATH_TO_TABLE mean in the postgresql case?

@dholth (Contributor, Author) replied:

It maps the filename info/index.json etc. to table names.
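An illustrative sketch of such a mapping; the exact archive members and table names in the PR may differ:

PATH_TO_TABLE = {
    "info/index.json": "index_json",
    "info/about.json": "about",
    "info/paths.json": "paths",
    "info/recipe/meta.yaml": "recipe",
}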

mtime: Number
size: Number


class CondaIndexCache:
@M-Waszkiewicz-Anaconda commented Apr 4, 2025:

Maybe consider moving that class to a separate file and leaving sqlitecache.py and psqlcache.py as pure implementations of that CondaIndexCache? That way we would not need to import sqlitecache stuff in psqlcache, which is misleading.

@M-Waszkiewicz-Anaconda commented Apr 4, 2025:

BTW, @dholth, we would probably need a tool which converts an existing sqlite db cache into postgres. Is that something you will also work on, or should the sirius team handle this?

@jezdez (Member) left a comment:

It might be easier to split the refactoring of the cache interface into a separate PR while the Postgres feature is worked on; that would also make it a little easier to review.

For the postgres backend, I'd love to see connection handling that isn't localized in the various index cache methods, out of concern for integration with the rest of the conda-index functionality that parallelizes some parts.

I also don't see how this handles errors yet; what would happen if a postgres-, SQLAlchemy-, or psycopg-level error shows up? Does it roll back the change to the index for the other tables?

@jezdez (Member) commented:

It might be good to separate the files by which db backend they implement. Could we move this into conda_index/postgres so it has two files, conda_index/postgres/model.py and conda_index/postgres/cache.py? The same pattern could be applied to the sqlite-based code.

@dholth (Contributor, Author) replied:

We use the postgresql-specific JSONB column type in model. In psqlcache we use a few things that are postgresql-specific, or at least not in sqlalchemy's core insert(), such as its UPSERT (on_conflict_do_update). This is meant as a postgresql-only backend through sqlalchemy, not as a way to allow any SQL database to be used with conda-index.
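A minimal sketch of that postgresql-only typing (table and column names are illustrative):

from sqlalchemy import TEXT, Column, MetaData, Table
from sqlalchemy.dialects.postgresql import JSONB

metadata_obj = MetaData()

index_json = Table(
    "index_json",
    metadata_obj,
    Column("path", TEXT, primary_key=True),
    Column("index_json", JSONB),  # JSONB only exists on postgresql
)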

@dholth (Contributor, Author) added:

I see that we could rename 'alchemy' to 'postgres', keeping the same basic structure and isolation of the optional dependencies.

@dholth (Contributor, Author) commented May 5, 2025:

> It might be easier to split the refactoring of the cache interface into a separate PR while the Postgres feature is worked on; that would also make it a little easier to review.
>
> For the postgres backend, I'd love to see connection handling that isn't localized in the various index cache methods, out of concern for integration with the rest of the conda-index functionality that parallelizes some parts.
>
> I also don't see how this handles errors yet; what would happen if a postgres-, SQLAlchemy-, or psycopg-level error shows up? Does it roll back the change to the index for the other tables?

All updates for a particular archive (package file) are in the same transaction. If there is an error, they are rolled back.
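That follows from the engine.begin() pattern visible in the diff above; a sketch of the semantics:

# engine.begin() opens a transaction that commits when the block exits
# normally and rolls back on any exception, so the per-table writes for
# one archive land together or not at all.
with engine.begin() as connection:
    for have_path in members:
        ...  # per-table insert/update for this archive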

@dholth (Contributor, Author) commented May 8, 2025:

> For the postgres backend, I'd love to see connection handling that isn't localized in the various index cache methods, out of concern for integration with the rest of the conda-index functionality that parallelizes some parts.

SQLAlchemy's default connection pool handles this: https://docs.sqlalchemy.org/en/20/core/pooling.html#sqlalchemy.pool.QueuePool
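A sketch of tuning that pool at engine creation (the URL and sizes are illustrative, not this PR's settings):

from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://user:pass@localhost/conda_index",
    pool_size=5,       # connections the QueuePool keeps open
    max_overflow=10,   # extra connections allowed under burst load
)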

@jezdez (Member) commented May 8, 2025:

Liking where this is going!

sha256 = mapped_column(TEXT)
md5 = mapped_column(TEXT)
last_modified = mapped_column(TEXT)
etag = mapped_column(TEXT)
@dholth (Contributor, Author) commented:

Alternatively,

stat = Table(
    "stat",
    metadata_obj,
    Column("stage", TEXT, default="indexed", nullable=False, primary_key=True),
    Column("path", TEXT, nullable=False, primary_key=True),
    Column("mtime", Integer),
    Column("size", Integer),
    Column("sha256", TEXT),
    Column("md5", TEXT),
    Column("last_modified", TEXT),
    Column("etag", TEXT),
)
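Both forms describe the same underlying table; the Core Table spelling just skips the ORM mapping that mapped_column implies.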

@dholth requested a review from @jezdez on May 14, 2025.
Labels
cla-signed [bot] added once the contributor has signed the CLA
Projects
Status: 🆕 New
4 participants