Add postgresql as an alternative to the sqlite3 database by dholth · Pull Request #199 · conda/conda-index

Add postgresql as an alternative to the sqlite3 database #199


Open · dholth wants to merge 54 commits into main

Conversation

@dholth (Contributor) commented Mar 27, 2025

Description

Taking a cue from Steam, we place a random unique identifier on disk. This and the subdir will become part of the "path" key or be placed in a separate column, so that one postgresql database can handle multiple indexes. (The current sqlite scheme uses a separate database per subdir and a plain archive name.)
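A minimal sketch of the key scheme described above (values and filenames are illustrative, not this PR's exact code):

import os

# Hypothetical: a random channel_id is generated once and persisted, then
# combined with the subdir and archive filename to form a globally unique
# "path" key, letting one postgres database hold many channels and subdirs.
channel_id = os.urandom(8).hex()  # e.g. "3f9a1c2b4d5e6f70"
path_key = f"{channel_id}/linux-64/somepkg-1.0-0.conda"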

Checklist - did you ...

  • Add a file to the news directory (using the template) for the next release's release notes?
  • Add / update necessary tests?
  • Add / update outdated documentation?

@conda-bot conda-bot added the cla-signed [bot] added once the contributor has signed the CLA label Mar 27, 2025
@github-project-automation github-project-automation bot moved this to 🆕 New in 🔎 Review Mar 27, 2025
def postgresql_fixture():
"""
Run a local postgresql server for testing.
"""
@dholth (Contributor, Author) commented:

Random port?
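One hedged way to do that (the helper name is illustrative; the fixture would pass the port through to the server it launches):

import socket

def free_port() -> int:
    # Binding to port 0 asks the OS for any unused port.
    with socket.socket() as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]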

__table__: Table
__tablename__ = "stat"

stage = mapped_column(TEXT, default="indexed", nullable=False, primary_key=True)

Review comment:

Maybe an enum type containing the values s3, clone, indexed?

@dholth (Contributor, Author) replied:

I thought the user might want to invent more stages.
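For reference, a sketch of the suggested enum alternative; a native postgres enum rejects stages that were not declared up front, which is exactly the flexibility traded away here (the "fs" value is taken from the listdir code below, the rest from the suggestion):

from sqlalchemy import Enum
from sqlalchemy.orm import mapped_column

stage = mapped_column(
    Enum("s3", "clone", "indexed", "fs", name="stage_enum"),
    default="indexed",
    nullable=False,
    primary_key=True,
)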

size = mapped_column(Integer)
sha256 = mapped_column(TEXT)
md5 = mapped_column(TEXT)
last_modified = mapped_column(TEXT)

Review comment:

A datetime type?

@dholth (Contributor, Author) replied:

This is a literal Last-Modified header from S3, e.g. "Wed, 21 Oct 2015 07:28:00 GMT" (the standard HTTP date format).

channel_url=channel_url,
upstream_stage=upstream_stage,
)
self.db_filename = self.channel_root / ".cache" / "cache.json"

Review comment:

This db_filename is required by CondaIndexCache, I suppose, but what is it for in the postgres implementation?

Comment on lines 55 to 57
self.db_filename.parent.mkdir(parents=True)
self.db_filename.write_text(json.dumps({"channel_id": os.urandom(8).hex()}))
self.cache_is_brand_new = True

Review comment:

Seems artificial to me; I wonder what CondaIndexCache is expecting there? Could cache_is_brand_new be based somehow on the postgres database itself, e.g. whether it is empty?

@dholth (Contributor, Author) replied:

It was used to decide whether to convert the old json-files cache to SQL.

else:
self.cache_is_brand_new = False

self.channel_id = json.loads(self.db_filename.read_text())["channel_id"]

Review comment:

OK, I suppose we need to provide the unique channel_id; that makes more sense to me now.

"""
# If recording information about the channel_root, use '_ROOT' for nice
# prefix searches
return f"{self.channel_id}/{self.subdir or '_ROOT'}/"

Review comment:

Would that mean there will be a separate postgres database per subdir + _ROOT?

@dholth (Contributor, Author) replied:

One postgresql database; each (channel, subdir) pair has a unique prefix. "_ROOT" is an idea to give a unique prefix when tracking information about channel/<files> rather than channel/subdir/<files>.
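Illustrative example of the resulting prefixes (the channel_id value is made up):

channel_id = "3f9a1c2b4d5e6f70"
subdir_prefix = f"{channel_id}/linux-64/"  # keys for channel/linux-64/<files>
root_prefix = f"{channel_id}/_ROOT/"       # keys for channel/<files>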

Comment on lines 101 to 105
connection.execute(
stat.delete().where(stat.c.path.like(self.database_path_like))
)
for item in listdir_stat:
connection.execute(stat.insert(), {**item, "stage": "fs"})

Review comment:

What does it do exactly? Delete the rows and insert? If so, can we do an update instead?

@dholth (Contributor, Author) replied:

This is the standard behavior, where "exactly the files in os.listdir() need to be present in the index", but there are ways to skip calling this function.
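For comparison, a hedged sketch of the per-row upsert the reviewer hints at, reusing connection, stat, and listdir_stat from the hunk above and the postgresql-specific insert this PR uses elsewhere; note it would not remove rows for files that disappeared from disk, which the wholesale delete handles:

from sqlalchemy.dialects.postgresql import insert as pg_insert

for item in listdir_stat:
    row = {**item, "stage": "fs"}
    connection.execute(
        pg_insert(stat)
        .values(row)
        .on_conflict_do_update(
            index_elements=["stage", "path"],  # the composite primary key
            set_={"mtime": row["mtime"], "size": row["size"]},  # assumed columns
        )
    )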

"""
Write cache for a single package to database.
"""
database_path = self.database_path(fn)

Review comment:

What is database_path in the postgresql case?

@dholth (Contributor, Author) replied on Apr 4, 2025:

It is <random per channel id>/subdir/package.conda so that all packages in a given repodata.json output can be selected with a range query.
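A sketch of that range query against the stat table from this PR (the prefix value is illustrative):

from sqlalchemy import select

prefix = "3f9a1c2b4d5e6f70/linux-64/"  # <random per channel id>/subdir/
rows = connection.execute(
    select(stat).where(stat.c.path.like(prefix + "%"))
)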

database_path = self.database_path(fn)
with self.engine.begin() as connection:
for have_path in members:
table = PATH_TO_TABLE[have_path]

Review comment:

What does PATH_TO_TABLE mean in the postgresql case?

@dholth (Contributor, Author) replied:

It maps the filename info/index.json etc. to table names.
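An illustrative sketch of such a mapping; the exact archive members and table names in the PR may differ:

PATH_TO_TABLE = {
    "info/index.json": "index_json",
    "info/about.json": "about",
    "info/paths.json": "paths",
    "info/recipe/meta.yaml": "recipe",
}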

mtime: Number
size: Number


class CondaIndexCache:
@M-Waszkiewicz-Anaconda commented Apr 4, 2025:

Maybe consider moving that class to a separate file and leaving sqlitecache.py and psqlcache.py as pure implementations of that CondaIndexCache? That way we would not need to import sqlitecache stuff in psqlcache, which is misleading.

@M-Waszkiewicz-Anaconda commented Apr 4, 2025:

BTW, @dholth, we would probably need a tool which converts an existing sqlite db cache into postgres. Is that something you will also work on, or should the sirius team handle this?

@jezdez (Member) left a comment:

It might be easier to split the refactoring of the cache interface into a separate PR while the Postgres feature is worked on; that would also make it a little easier to review.

For the postgres backend, I'd love to see connection handling that isn't localized in the various index cache methods, out of concern for integration with the rest of the conda-index functionality that parallelizes some parts.

I also don't see how this handles errors yet; what would happen if a postgres-, SQLAlchemy-, or psycopg-level error shows up? Does it roll back the change to the index for the other tables?

@jezdez (Member) commented:

It might be good to separate the files by which db backend they implement. Could we move this into conda_index/postgres so it has two files, conda_index/postgres/model.py and conda_index/postgres/cache.py? The same pattern could be applied to the sqlite-based code.

@dholth (Contributor, Author) replied:

We use the postgresql-specific JSONB column type in model. In psqlcache we use a few things that are postgresql-specific, or at least not in sqlalchemy's core insert(), such as its UPSERT (on_conflict_do_update). This is meant as a postgresql-only backend through sqlalchemy, not as a way to allow any SQL database to be used with conda-index.
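A minimal sketch of that postgresql-only typing (table and column names are illustrative):

from sqlalchemy import TEXT, Column, MetaData, Table
from sqlalchemy.dialects.postgresql import JSONB

metadata_obj = MetaData()

index_json = Table(
    "index_json",
    metadata_obj,
    Column("path", TEXT, primary_key=True),
    Column("index_json", JSONB),  # JSONB only exists on postgresql
)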

@dholth (Contributor, Author) added:

I see that we could rename 'alchemy' to 'postgres', keeping the same basic structure and isolation of the optional dependencies.

@dholth (Contributor, Author) commented May 5, 2025:

> It might be easier to split the refactoring of the cache interface into a separate PR while the Postgres feature is worked on; that would also make it a little easier to review.
>
> For the postgres backend, I'd love to see connection handling that isn't localized in the various index cache methods, out of concern for integration with the rest of the conda-index functionality that parallelizes some parts.
>
> I also don't see how this handles errors yet; what would happen if a postgres-, SQLAlchemy-, or psycopg-level error shows up? Does it roll back the change to the index for the other tables?

All updates for a particular archive (package file) are in the same transaction. If there is an error, they are rolled back.
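That follows from the engine.begin() pattern visible in the diff above; a sketch of the semantics:

# engine.begin() opens a transaction that commits when the block exits
# normally and rolls back on any exception, so the per-table writes for
# one archive land together or not at all.
with engine.begin() as connection:
    for have_path in members:
        ...  # per-table insert/update for this archive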

@dholth (Contributor, Author) commented May 8, 2025:

> For the postgres backend, I'd love to see connection handling that isn't localized in the various index cache methods, out of concern for integration with the rest of the conda-index functionality that parallelizes some parts.

SQLAlchemy's default connection pool handles this: https://docs.sqlalchemy.org/en/20/core/pooling.html#sqlalchemy.pool.QueuePool
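A sketch of tuning that pool at engine creation (the URL and sizes are illustrative, not this PR's settings):

from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://user:pass@localhost/conda_index",
    pool_size=5,       # connections the QueuePool keeps open
    max_overflow=10,   # extra connections allowed under burst load
)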

@jezdez (Member) commented May 8, 2025:

Liking where this is going!

sha256 = mapped_column(TEXT)
md5 = mapped_column(TEXT)
last_modified = mapped_column(TEXT)
etag = mapped_column(TEXT)
@dholth (Contributor, Author) commented:

Alternatively,

stat = Table(
    "stat",
    metadata_obj,
    Column("stage", TEXT, default="indexed", nullable=False, primary_key=True),
    Column("path", TEXT, nullable=False, primary_key=True),
    Column("mtime", Integer),
    Column("size", Integer),
    Column("sha256", TEXT),
    Column("md5", TEXT),
    Column("last_modified", TEXT),
    Column("etag", TEXT),
)
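Both forms describe the same underlying table; the Core Table spelling just skips the ORM mapping that mapped_column implies.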

@dholth requested a review from @jezdez on May 14, 2025.
Labels
cla-signed [bot] added once the contributor has signed the CLA
Projects
Status: 🆕 New
4 participants