Add postgresql as an alternative to the sqlite3 database #199
base: main
Conversation
def postgresql_fixture():
    """
    Run a local postgresql server for testing.
    """
Random port?
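One way the fixture could avoid a fixed port is to ask the OS for a free one. A stdlib sketch; the `free_port` helper and the fixture wiring are assumptions, not part of this PR:

```python
import socket

def free_port() -> int:
    """Bind to port 0 so the OS assigns an unused TCP port, then return it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

# The fixture could then start the test postgres on free_port() instead of 5432.
```

Note there is a small race (another process could grab the port before postgres binds it), but in practice this is the usual approach for test fixtures.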
conda_index/alchemy/model.py
Outdated
__table__: Table
__tablename__ = "stat"

stage = mapped_column(TEXT, default="indexed", nullable=False, primary_key=True)
Maybe an enum type containing the s3, clone, indexed values?
I thought the user might want to invent more stages.
conda_index/alchemy/model.py
Outdated
size = mapped_column(Integer)
sha256 = mapped_column(TEXT)
md5 = mapped_column(TEXT)
last_modified = mapped_column(TEXT)
A datetime type?
This is a literal Last-Modified header from S3, for example.
conda_index/alchemy/psqlcache.py
Outdated
    channel_url=channel_url,
    upstream_stage=upstream_stage,
)
self.db_filename = self.channel_root / ".cache" / "cache.json"
This db_filename is required by CondaIndexCache, I suppose, but what is it for in the case of the Postgres implementation?
conda_index/alchemy/psqlcache.py
Outdated
self.db_filename.parent.mkdir(parents=True)
self.db_filename.write_text(json.dumps({"channel_id": os.urandom(8).hex()}))
self.cache_is_brand_new = True
This seems artificial to me; I wonder what CondaIndexCache is expecting there? Could cache_is_brand_new be based somehow on the Postgres database itself, e.g. on whether it is empty?
It was used to decide whether to convert the old JSON-files cache to SQL.
conda_index/alchemy/psqlcache.py
Outdated
else:
    self.cache_is_brand_new = False

self.channel_id = json.loads(self.db_filename.read_text())["channel_id"]
OK, I suppose we need to provide a unique channel_id; that makes more sense to me now.
conda_index/alchemy/psqlcache.py
Outdated
""" | ||
# If recording information about the channel_root, use '_ROOT' for nice | ||
# prefix searches | ||
return f"{self.channel_id}/{self.subdir or '_ROOT'}/" |
Would that mean there will be a separate Postgres database for each subdir plus _ROOT?
One PostgreSQL database; each channel and subdir has a unique prefix. "_ROOT" is an idea to give a unique prefix when tracking information about channel/<files> rather than channel/subdir/<files>.
conda_index/alchemy/psqlcache.py
Outdated
connection.execute(
    stat.delete().where(stat.c.path.like(self.database_path_like))
)
for item in listdir_stat:
    connection.execute(stat.insert(), {**item, "stage": "fs"})
What does it do exactly? Delete the rows and insert? If so, can we do an update instead?
This is the standard behavior, where exactly the files in os.listdir() need to be present in the index, but there are ways to skip calling this function.
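To make the delete-then-insert sync concrete, here is a minimal stdlib sqlite3 sketch of the same pattern; the table layout and the prefix are illustrative, not the PR's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stat (stage TEXT NOT NULL, path TEXT NOT NULL, size INTEGER, "
    "PRIMARY KEY (stage, path))"
)
prefix = "chan123/linux-64/"
# A stale row for a file that is no longer on disk:
conn.execute("INSERT INTO stat VALUES ('fs', ?, 1)", (prefix + "gone.conda",))

# What os.listdir() currently reports:
listdir_stat = [{"path": prefix + "a.conda", "size": 10},
                {"path": prefix + "b.conda", "size": 20}]

with conn:  # one transaction
    # Remove every 'fs' row under this subdir's prefix, then re-insert the
    # current directory listing -- the table ends up mirroring the filesystem.
    conn.execute("DELETE FROM stat WHERE path LIKE ?", (prefix + "%",))
    conn.executemany(
        "INSERT INTO stat (stage, path, size) VALUES ('fs', :path, :size)",
        listdir_stat,
    )

rows = sorted(p for (p,) in conn.execute("SELECT path FROM stat"))
```

The stale row disappears and only the listed files remain, which is the invariant the function maintains.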
conda_index/alchemy/psqlcache.py
Outdated
""" | ||
Write cache for a single package to database. | ||
""" | ||
database_path = self.database_path(fn) |
What is database_path in the PostgreSQL case?
It is <random per channel id>/subdir/package.conda, so that all packages in a given repodata.json output can be selected with a range query.
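A small sketch of why such a path key is useful: with the channel id and subdir as a prefix, a single LIKE (or range) query selects every package for one subdir. The channel ids and package names below are made up, and this uses stdlib sqlite3 rather than the PR's SQLAlchemy models:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE index_json (path TEXT PRIMARY KEY, name TEXT)")
rows = [
    ("ab12cd34/linux-64/numpy-1.26.conda", "numpy"),
    ("ab12cd34/linux-64/scipy-1.11.conda", "scipy"),
    ("ab12cd34/noarch/pip-23.3.conda", "pip"),
    ("ffee0011/linux-64/numpy-1.26.conda", "numpy"),  # a different channel
]
conn.executemany("INSERT INTO index_json VALUES (?, ?)", rows)

# All packages for one channel's linux-64 subdir, via a prefix query:
subdir_rows = conn.execute(
    "SELECT name FROM index_json WHERE path LIKE ? ORDER BY name",
    ("ab12cd34/linux-64/%",),
).fetchall()
```

Because the primary key is ordered, the same selection can be done as an index range scan on `path >= prefix AND path < prefix || high-sentinel`.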
conda_index/alchemy/psqlcache.py
Outdated
database_path = self.database_path(fn)
with self.engine.begin() as connection:
    for have_path in members:
        table = PATH_TO_TABLE[have_path]
What does PATH_TO_TABLE mean in the PostgreSQL case?
It maps the filename info/index.json etc. to table names.
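In other words, PATH_TO_TABLE is a plain mapping from archive member paths to table names. A hypothetical sketch; these member paths and table names are illustrative, not necessarily the exact ones in the PR:

```python
# Hypothetical mapping from archive member path to cache table name.
PATH_TO_TABLE = {
    "info/index.json": "index_json",
    "info/about.json": "about",
    "info/recipe/meta.yaml": "recipe",
}

def table_for(member: str) -> str:
    """Look up which table stores the extracted contents of this member."""
    return PATH_TO_TABLE[member]
```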
conda_index/index/sqlitecache.py
Outdated
mtime: Number
size: Number


class CondaIndexCache:
Maybe consider moving that class to a separate file and leaving sqlitecache.py and psqlcache.py as pure implementations of that CondaIndexCache? With this we would not need to import sqlitecache code in psqlcache, which is misleading.
BTW, @dholth, we would probably need a tool that converts an existing sqlite cache database into Postgres. Is that something you will also work on, or should the sirius team handle it?
It might be easier to split the refactoring of the cache interface into a separate PR while the Postgres feature is worked on; that would also make it a little easier to review.
For the Postgres backend, I'd love to see connection handling that isn't localized in the various index cache methods, for concerns of integration with the rest of the conda-index functionality that parallelizes some parts.
I also don't see how this handles errors yet. What would happen if a Postgres, SQLAlchemy, or psycopg level error shows up? Does it roll back the change to the index for the other tables?
conda_index/alchemy/model.py
Outdated
It might be good to separate the files by which DB backend they implement. Could we move this into conda_index/postgres so it has two files, conda_index/postgres/model.py and conda_index/postgres/cache.py? The same pattern could be applied to the sqlite-based code.
We use the PostgreSQL-specific JSONB column type in model. In psqlcache we use a few features that are PostgreSQL-specific, or at least not part of SQLAlchemy's core insert(), such as its UPSERT (on_conflict_do_update). This is meant as a PostgreSQL-only backend through SQLAlchemy, not as a way to allow any SQL database to be used with conda-index.
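For readers unfamiliar with the UPSERT in question: SQLAlchemy's PostgreSQL dialect exposes it as insert(...).on_conflict_do_update(...). The SQL-level idea can be sketched with stdlib sqlite3, which also supports ON CONFLICT ... DO UPDATE in modern versions; the table and values are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stat (stage TEXT, path TEXT, size INTEGER, "
    "PRIMARY KEY (stage, path))"
)
conn.execute("INSERT INTO stat VALUES ('indexed', 'x/pkg.conda', 1)")
# A second write for the same primary key updates in place instead of failing:
conn.execute(
    "INSERT INTO stat VALUES ('indexed', 'x/pkg.conda', 2) "
    "ON CONFLICT (stage, path) DO UPDATE SET size = excluded.size"
)
(size,) = conn.execute(
    "SELECT size FROM stat WHERE path = 'x/pkg.conda'"
).fetchone()
```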
I see; we could rename 'alchemy' to 'postgres', keeping the same basic structure and isolation of the optional dependencies.
All updates for a particular archive (package file) are in the same transaction. If there is an error, they are rolled back.
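The per-archive rollback can be illustrated at the SQL level. A stdlib sqlite3 sketch; SQLAlchemy's engine.begin() context manager rolls back on exception in the same way, and the tables and paths here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE index_json (path TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE about (path TEXT PRIMARY KEY)")

try:
    with conn:  # commits on success, rolls back on exception
        conn.execute("INSERT INTO index_json VALUES ('c/pkg.conda')")
        conn.execute("INSERT INTO about VALUES ('c/pkg.conda')")
        raise RuntimeError("simulated failure mid-archive")
except RuntimeError:
    pass

# Neither table keeps a partial write for the failed archive.
counts = [conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
          for t in ("index_json", "about")]
```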
SQLAlchemy's default connection pool: https://docs.sqlalchemy.org/en/20/core/pooling.html#sqlalchemy.pool.QueuePool
Liking where this is going!
sha256 = mapped_column(TEXT)
md5 = mapped_column(TEXT)
last_modified = mapped_column(TEXT)
etag = mapped_column(TEXT)
Alternatively,
stat = Table(
"stat",
metadata_obj,
Column("stage", TEXT, default="indexed", nullable=False, primary_key=True),
Column("path", TEXT, nullable=False, primary_key=True),
Column("mtime", Integer),
Column("size", Integer),
Column("sha256", TEXT),
Column("md5", TEXT),
Column("last_modified", TEXT),
Column("etag", TEXT),
)
Description
Taking a cue from Steam, we place a random unique identifier on disk. This and the subdir will become part of the "path" key or be placed in a separate column, so that one postgresql database can handle multiple indexes. (The current sqlite scheme uses a separate database per subdir and a plain archive name.)
Checklist - did you ...
Add a file to the news directory (using the template) for the next release's release notes?