kvserver: limit the number of kv spans configurable by a tenant #70555

Closed
irfansharif opened this issue Sep 22, 2021 · 7 comments · Fixed by #77337
Labels
A-kv Anything in KV that doesn't belong in a more specific category. A-multitenancy Related to multi-tenancy A-zone-configs branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) GA-blocker T-kv KV Team

Comments

@irfansharif
Contributor
irfansharif commented Sep 22, 2021

#66348 outlined a scheme to support zone configs for secondary tenants. Since the unit of what we can configure in KV is a Range, tenants now being able to set zone configs grants them the ability to induce an arbitrary number of splits in KV -- something we want to put guardrails against (see RFC for more details).

There's some discussion elsewhere on how we could achieve this; copying over one proposal in particular:

We could have each pod lease out a local quota, counting against the tenant's global limit. For N pods of a tenant, and a limit L, we could either lease out partial quotas L/N, or have a single pod lease out L entirely and require all DDLs/zcfgs to go through that one pod. The scheme below describes the general case of each pod getting some fractional share of the total quota; a rough sketch of such a quota pool follows the list.

  • When each pod considers a DDL/zcfg txn, before committing it will attempt to acquire a quota from its local leased limit. The quota acquired is an over-estimate of how much the txn will cost (in terms of additional splits induced); we cannot under-estimate as KV could then later reject the committed zcfg.
    • If there's insufficient quota, we'll abort the txn and error out accordingly.
    • If there's sufficient quota, we'll commit the txn and rely on the eventual reconciliation to reliably succeed. We'll queue up a work item for the reconciliation job, marking alongside it the lease expiration of the quota it was accepted with.
  • When the local pool runs out of quota and has to request/lease more, it needs to ensure that all the txns that have been committed have been reconciled. We don't want a situation where a quota is leased and allocated to an operation in the work queue, that work item lingers past its lease expiration date, and we're then able to lease the "same" quota again and queue up a second work item that will now get rejected. We need to make sure that when we lease a limit at t, there are no unprocessed work items with leased tokens expiring before t.
  • Whenever the work item is processed, the amount we over-estimated by could either be returned to the local quota pool (with the same expiration ts), or we could simply drop it entirely and rely on the local quota pool to eventually re-acquire it from the global pool.
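
To make the leasing scheme concrete, here is a minimal, self-contained sketch of what a per-pod quota pool might look like. Everything below (the localQuota type, acquire/release, the error message) is hypothetical and illustrative -- it is not actual CockroachDB code.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// localQuota models a per-pod pool of split tokens leased out of the tenant's
// global limit, per the scheme above. Names and structure are illustrative only.
type localQuota struct {
	remaining int       // tokens still available under the current lease
	expiry    time.Time // lease expiration for these tokens
}

var errInsufficientQuota = errors.New("insufficient span quota; aborting txn")

// acquire reserves an over-estimate of the splits a DDL/zcfg txn may induce.
// If the lease has expired or the pool is short, the caller would go back to
// the global pool to lease more (after ensuring committed txns are reconciled).
func (q *localQuota) acquire(estimate int, now time.Time) error {
	if now.After(q.expiry) || q.remaining < estimate {
		return errInsufficientQuota
	}
	q.remaining -= estimate
	return nil
}

// release returns the unused portion of an over-estimate once the committed
// zone config has been reconciled with KV.
func (q *localQuota) release(unused int) {
	q.remaining += unused
}

func main() {
	q := &localQuota{remaining: 10, expiry: time.Now().Add(time.Hour)}
	fmt.Println(q.acquire(4, time.Now())) // <nil>
	fmt.Println(q.acquire(8, time.Now())) // insufficient span quota; aborting txn
	q.release(2)                          // over-estimated by 2; returned after reconciliation
	fmt.Println(q.remaining)              // 8
}
```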

The RFC also captures a possible schema we could use on the host tenant to store these limits:

CREATE TABLE system.spans_configurations_per_tenant (
    tenant_id               INT,
    max_span_configurations INT
)

The actual RPCs/interfaces are TBD. We probably also want the ability to configure these limits on a per-tenant basis.
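
Purely as a hypothetical illustration (the issue leaves the actual RPCs/interfaces TBD), a host-side interface for these per-tenant limits might look something like the following; every name here is made up:

```go
package tenantlimits

import "context"

// SpanLimits is a hypothetical host-side interface for reading and writing
// per-tenant span-configuration limits; not an actual CockroachDB API.
type SpanLimits interface {
	// SetLimit sets the maximum number of span configurations for a tenant.
	SetLimit(ctx context.Context, tenantID uint64, maxSpanConfigs int64) error
	// GetLimit returns the limit currently configured for a tenant.
	GetLimit(ctx context.Context, tenantID uint64) (int64, error)
}
```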

Epic CRDB-10563

Jira issue: CRDB-10116

@blathers-crl blathers-crl bot added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Sep 22, 2021
@irfansharif irfansharif added A-kv Anything in KV that doesn't belong in a more specific category. A-zone-configs A-multitenancy Related to multi-tenancy labels Sep 22, 2021
@exalate-issue-sync exalate-issue-sync bot added the T-kv KV Team label Oct 11, 2021
@irfansharif
Contributor Author
irfansharif commented Nov 3, 2021

This is something we'll try tackling (or at least prototyping) in November. A brief plug for why it's important:

After #67679, enabling the span configs infrastructure by default (something we're also looking to do over November) would mean that for every single schema object in a multi-tenant cluster, we would be inducing a range split. This is exactly what happens with the host tenant today where every table/index/partition gets its own range. Before #67679, for each tenant we started off with just a mega-range containing the entire tenant ID prefix.

--- Seeing range boundaries on the host tenant, with a single secondary tenant pre-67679. The last one is of interest.
> SELECT range_id, start_pretty, end_pretty from crdb_internal.ranges;
  range_id |         start_pretty          |          end_pretty
-----------+-------------------------------+--------------------------------
         1 | /Min                          | /System/NodeLiveness
         2 | /System/NodeLiveness          | /System/NodeLivenessMax
         3 | /System/NodeLivenessMax       | /System/tsd
         4 | /System/tsd                   | /System/"tse"
         5 | /System/"tse"                 | /Table/SystemConfigSpan/Start
         6 | /Table/SystemConfigSpan/Start | /Table/11
         7 | /Table/11                     | /Table/12
         8 | /Table/12                     | /Table/13
                                        ...
        42 | /Table/46                     | /Table/47
        43 | /Table/47                     | /Tenant/123
        44 | /Tenant/123                   | /Max

After #67679 + enabling the infra.

> SELECT range_id, start_pretty, end_pretty from crdb_internal.ranges;
  range_id |          start_pretty          |           end_pretty
-----------+--------------------------------+---------------------------------
                                        ...
        42 | /Table/46                      | /Table/47
        43 | /Table/47                      | /Tenant/123
        44 | /Tenant/123                    | /Tenant/123/Table/3
        45 | /Tenant/123/Table/3            | /Tenant/123/Table/4
        46 | /Tenant/123/Table/4            | /Tenant/123/Table/5
        47 | /Tenant/123/Table/5            | /Tenant/123/Table/6
                                        ...
        83 | /Tenant/123/Table/45           | /Tenant/123/Table/46
        84 | /Tenant/123/Table/46           | /Tenant/123/Table/47
        85 | /Tenant/123/Table/47           | /Max

Having O(descriptors across all tenants) ranges in the system is going to be difficult for us to work with, and can readily be used as an attack vector. We'd either need to impose a hard limit on the number of descriptor objects a tenant is allowed to have, or do what this issue suggests -- introducing a limit based on the number of individually configurable spans. This is still a pretty rudimentary safeguard, but fear not, we can do much better on the KV side to move away from always splitting on descriptor boundaries: #72389.

(+cc @shralex, @nvanbenschoten, @ajwerner, @arulajmani)

@irfansharif
Contributor Author

In the short-term (22.1) all we'll do is maintain a counter for secondary tenants and, for every schema change, use a translator-like component to determine how many splits are being added/removed. We don't need to look at the corresponding zone config for the descriptor as that has no bearing here. We'll inc/decrement our counter accordingly, comparing it against a per-tenant limit exposed by the host -- perhaps as a "tenant read-only cluster setting", to use the verbiage from the multi-tenant cluster settings RFC. If we're over the limit, we'll abort the schema change txn. KV of course would still maintain its internal counter for the number of splits/etc. per-tenant, and use that to reject span config updates if they go overboard. For unmodified tenants, we don't expect to run into these errors.

This scheme, in contrast to the one described above, has the downside that it would serialize all schema changes. That feels reasonable for the short term, especially considering it only applies to secondary tenants. Incorporating leasing in any form would purely be a performance optimization -- to have real schema change concurrency.
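
For illustration, here is a minimal, self-contained sketch of the counter-based check described above, with hypothetical names throughout (this is not the eventual implementation):

```go
package main

import (
	"errors"
	"fmt"
)

// spanCounter is a toy stand-in for the per-tenant counter described above:
// every schema change computes a delta of configurable spans, and the txn is
// aborted if the new total would exceed the host-set limit.
type spanCounter struct {
	used  int // spans currently accounted to this tenant
	limit int // per-tenant limit exposed by the host
}

var errOverLimit = errors.New("exceeded limit for number of table spans")

// apply accounts for a schema change adding (positive delta) or removing
// (negative delta) configurable spans; only additions over the limit fail.
func (c *spanCounter) apply(delta int) error {
	if delta > 0 && c.used+delta > c.limit {
		return errOverLimit
	}
	c.used += delta
	return nil
}

func main() {
	c := &spanCounter{used: 48, limit: 50}
	fmt.Println(c.apply(1))  // <nil>
	fmt.Println(c.apply(5))  // exceeded limit for number of table spans
	fmt.Println(c.apply(-3)) // <nil>: drops always free up quota
	fmt.Println(c.used)      // 46
}
```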

@ajstorm
Collaborator
ajstorm commented Jan 6, 2022

How would this work for internal APIs like AdminSplit? Would they get caught by this mechanism and tracked accordingly? Or would they enter KV below the "schema change"-level tracking? I'm asking because we may be able to enable AdminSplit directly if it gets tracked by this scheme.

@irfansharif
Contributor Author

we may be able to enable AdminSplit directly if it gets tracked by this scheme.

I don't think AdminSplit can directly leverage the mechanisms described above, at least not in this current form, and not without additional leg work. AdminSplits are issued directly to KV as non-transactional requests (follow code here). The SQL-side counter we're maintaining is maintained transactionally, in the tenant's own keyspace. Crucially, it's maintained without ever having to ask KV about its own trackers. To track an AdminSplit request (these typically have expiration timestamps) we'd have to maintain this SQL-side counter in tandem with issuing the request to KV, where it may or may not be rejected (some form of 2PC maybe?). In addition, once the AdminSplit request has expired, the SQL-side counter would probably want to reflect the change -- it would perhaps need to reach out to KV to reconcile what the current state of those requests is. This is all doable with some effort, and there might be simpler approaches, but I'm not sure it's quite as "drop-in" as everything else this issue is focused on.

I know we recently filed #74389 -- should we brainstorm there? It's not yet clear to me that we even want to carry forward AdminSplit/Scatter as an API, let alone extend it to secondary tenants. There was some KV internal discussion here on the subject.

@ajwerner
Contributor
ajwerner commented Jan 7, 2022

More internal discussion on AdminSplit and "advisory splits" here.

irfansharif added a commit to irfansharif/cockroach that referenced this issue Feb 14, 2022
Splitter returns the set of all possible split points for the given
table descriptor. It steps through every "unit" that we can apply
configurations over (table, indexes, partitions and sub-partitions) and
figures out the actual key boundaries that we may need to split over.
For example:

  CREATE TABLE db.parts(i INT PRIMARY KEY, j INT) PARTITION BY LIST (i) (
    PARTITION one_and_five    VALUES IN (1, 5),
    PARTITION four_and_three  VALUES IN (4, 3),
    PARTITION everything_else VALUES IN (6, default)
  );

Assuming a table ID of 108, we'd generate:

  /Table/108
  /Table/108/1
  /Table/108/1/1
  /Table/108/1/2
  /Table/108/1/3
  /Table/108/1/4
  /Table/108/1/5
  /Table/108/1/6
  /Table/108/1/7
  /Table/108/2

---

This is going to serve as the underlying library for cockroachdb#70555. In a future
commit we'll introduce two system tables, one for all tenants (both
guest and host) and one on just the host.

The first will be maintained by the SQL pod, using the library, as part
of every schema change operation. We'll record in it a single row
containing the total number of splits the pod's currently using up by
comparing the before/after of a descriptor operation to see if the
commit would add or reduce the number of necessary splits. We'll abort
the txn if it's over a pre-set limit (INF for the host tenant, and using
a KV-settable-only cluster setting for guests: cockroachdb#73857).

The second table will be used only for KV's internal bookkeeping,
sitting behind the KVAccessor interface. We'll use it to hard-reject
RPCs installing span configs that induce more splits than a tenant is
allowed. With the SQL pods coordinating using the KV-set limits, this
would only happen as a result of faulty SQL code.

Put together, we'll have a co-operative form of per-tenant split limits.

Release note: None
craig bot pushed a commit that referenced this issue Feb 18, 2022
75803: spanconfig: introduce spanconfig.Splitter r=irfansharif a=irfansharif


76751: ci: build `workload` in CI r=rail a=rickystewart

This brings us to parity with the pre-Bazel `Compile Builds` job in CI.

Release note: None

76758: backup: include tenant_settings in cluster backup r=RaduBerinde a=RaduBerinde

We include the tenant_settings table in full cluster backups. This
means that when we restore a cluster, all overrides should be in
place.

The overrides are treated as a host cluster property. When restoring
only specific tenants, the overrides will not be carried over.

In the future, we can add tenant-specific overrides to tenant metadata
so restoring a single tenant restores the overrides as well.

Release note: None

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Mar 11, 2022
77639: spanconfig: re-write spanconfig.Splitter r=irfansharif a=irfansharif

The earlier implementation relied on decoding keys in order to determine
precise split points. When plugging this library into the component
designed for tenant-side span config limiting (#77337, to address
#70555), we realized it's not possible to grab a hold of type-hydrated
table descriptors (needed for decoding). This is because today it's
possible to GC type descriptors before GC-ing table descriptors that
refer to them. Given the integration point for this library was around
the GC job, we had to forego decoding routines.

Instead of computing the precise split keys, we can however compute how
many there are without having to look at keys at all.
Consider our table hierarchy:

    table -> index -> partition -> partition -> (...) -> partition

Each partition is either a PARTITION BY LIST kind (where it can then be
further partitioned, or not), or a PARTITION BY RANGE kind (no further
partitioning possible). We can classify each parent-child link into two
types:

(a) Contiguous      {index, list partition} -> range partition
(b) Non-contiguous  table -> index, {index, list partition} -> list partition

- Contiguous links are the sort where each child span is contiguous with
  another, and the set of all child spans encompasses the parent's
  span. For an index that's partitioned by range:
```
    CREATE TABLE db.range(i INT PRIMARY KEY, j INT)
    PARTITION BY RANGE (i) (
      PARTITION less_than_five       VALUES FROM (minvalue) to (5),
      PARTITION between_five_and_ten VALUES FROM (5) to (10),
      PARTITION greater_than_ten     VALUES FROM (10) to (maxvalue)
    );
```
  With table ID as 106, the parent index span is `/Table/106/{1-2}`. The
  child spans are `/Table/106/1{-/5}`, `/Table/106/1/{5-10}` and
  `/Table/106/{1/10-2}`. They're contiguous; put together they wholly
  encompass the parent span.

- Non-contiguous links, by contrast, are when child spans are neither
  contiguous with respect to one another, nor do they start and end at
  the parent span's boundaries. For a table with a secondary index:
```
    CREATE TABLE db.t(i INT PRIMARY KEY, j INT);
    CREATE INDEX idx ON db.t (j);
    DROP INDEX db.t@idx;
    CREATE INDEX idx ON db.t (j);
```
  With table ID as 106, the parent table span is `/Table/10{6-7}`. The
  child spans are `/Table/106/{1-2}` and `/Table/106/{3-4}`. Compared to the
  parent span, we're missing `/Table/106{-/1}`, `/Table/106/{2-3}`,
  `/Table/10{6/4-7}`.

For N children:
- For a contiguous link, the number of splits equals the number of child
  elements (i.e. N).
- For a non-contiguous link, the number of splits equals N + 1 + N. For N
  children, there are N - 1 gaps between them, plus 2 gaps at the start and
  end of the parent span. Summing that with the N child spans themselves, we
  get to the formula above. This assumes that the N child elements aren't
  further subdivided; if they are, we can compute it recursively, and the
  formula becomes N + 1 + Σ(grand child spans). A rough sketch of this
  recursive count follows the list.
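
Here is a self-contained illustration of that recursive count. The node
type and splits helper are made up for this sketch and are not the actual
spanconfig.Splitter code:
```go
package main

import "fmt"

// node is a table, index, or partition; the link to its children is either
// contiguous (children wholly cover the parent span) or non-contiguous.
type node struct {
	contiguous bool   // zero value (false) means a non-contiguous link
	children   []node // empty for leaves (e.g. range partitions)
}

// splits counts the individually configurable spans under n.
func splits(n node) int {
	if len(n.children) == 0 {
		return 1
	}
	sum := 0
	for _, c := range n.children {
		sum += splits(c)
	}
	if n.contiguous {
		return sum // contiguous: children account for the whole parent span
	}
	// Non-contiguous: N-1 gaps between children plus 2 boundary gaps (= N+1),
	// plus the children's own spans.
	return len(n.children) + 1 + sum
}

func main() {
	// An index list-partitioned into three leaves (non-contiguous link).
	idx := node{children: []node{{}, {}, {}}}
	// A table with that single index (table -> index is also non-contiguous).
	tbl := node{children: []node{idx}}
	fmt.Println(splits(idx)) // 7 = 3 + 1 + 3
	fmt.Println(splits(tbl)) // 9 = 1 + 1 + 7
}
```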

Computing split count this way does come with one downside: we might be
over-counting. When comparing keys, we could for example recognize that
adjacent partition-by-list values have no "gap" between them; we could
also do this by comparing encoded keys with one another. We just didn't,
because it's more annoying code to write, and over-counting here is more
than fine for our purposes.

Release justification: non-production code
Release note: None

77653: rfcs: update storage parameters for TTL r=rafiss a=otan

Release justification: non-production code change

Release note: None

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Co-authored-by: Oliver Tan <otan@cockroachlabs.com>
irfansharif added a commit to irfansharif/cockroach that referenced this issue Mar 11, 2022
Fixes cockroachdb#70555. In order to limit the number of span configs a tenant's
able to install, we introduce a tenant-side spanconfig.Limiter. It
presents the following interface:

    // Limiter is used to limit the number of span configs installed by
    // secondary tenants. It considers the committed and uncommitted
    // state of a table descriptor and computes the "span" delta, each
    // unit we can apply a configuration over. It uses these deltas to
    // maintain an aggregate counter, informing the caller if exceeding
    // the configured limit.
    type Limiter interface {
      ShouldLimit(
        ctx context.Context, txn *kv.Txn,
        committed, uncommitted catalog.TableDescriptor,
      ) (bool, error)
    }

This limiter only applies to secondary tenants. The counter is
maintained in a newly introduced (tenant-only) system table, using the
following schema:

    CREATE TABLE system.span_count (
      singleton  BOOL DEFAULT TRUE,
      span_count INT NOT NULL,
      CONSTRAINT "primary" PRIMARY KEY (singleton),
      CONSTRAINT single_row CHECK (singleton),
      FAMILY "primary" (singleton, span_count)
    );

We need just two integration points for spanconfig.Limiter:
- Right above CheckTwoVersionInvariant, where we're able to hook into
  the committed and to-be-committed descriptor state before txn commit.
- In the GC job, when gc-ing table state. We decrement a table's split
  count when GC-ing the table for good.

The per-tenant span config limit used is controlled by a new tenant
read-only cluster setting: spanconfig.tenant_limit. Multi-tenant cluster
settings (cockroachdb#73857) provides the infrastructure for the host tenant to be
able to control this setting cluster wide, or to target a specific
tenant at a time.

We also need a migration here, to start tracking span counts for
clusters with pre-existing tenants. We introduce a migration that scans
over all table descriptors and seeds system.span_count with the right
value. Given cluster version gates disseminate asynchronously, we also
need a preliminary version to start tracking incremental changes.

It's useful to introduce the notion of debt. This will be handy if/when
we lower per-tenant limits, and also in the migration above since it's
possible for pre-existing tenants to have committed state in violation
of the prescribed limit. When in debt, schema changes that add new
splits will be rejected (dropping tables/indexes/partitions/etc. will
work just fine).

When attempting a txn that goes over the configured limit, the UX is as
follows:

    > CREATE TABLE db.t2(i INT PRIMARY KEY);
    pq: exceeded limit for number of table spans

Release note: None
Release justification: low risk, high benefit change
@blathers-crl
blathers-crl bot commented Mar 22, 2022

Hi @irfansharif, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@nvanbenschoten nvanbenschoten added the branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 label Mar 22, 2022
@irfansharif irfansharif added the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Mar 22, 2022
@irfansharif irfansharif removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. blocks-22.1.0-beta.1 labels Mar 29, 2022
irfansharif added a commit to irfansharif/cockroach that referenced this issue Apr 4, 2022
Fixes cockroachdb#70555. In order to limit the number of span configs a tenant's
able to install, we introduce a tenant-side spanconfig.Limiter. It
presents the following interface:

    // Limiter is used to limit the number of span configs installed by
    // secondary tenants. It takes in a delta (typically the difference
    // in span configs between the committed and uncommitted state in
    // the txn), uses it to maintain an aggregate counter, and informs
    // the caller if exceeding the prescribed limit.
    type Limiter interface {
      ShouldLimit(
        ctx context.Context, txn *kv.Txn, delta int,
      ) (bool, error)
    }

The delta is computed using a static helper, spanconfig.Delta:

    // Delta considers both the committed and uncommitted state of a
    // table descriptor and computes the difference in the number of
    // spans we can apply a configuration over.
    func Delta(
      ctx context.Context, s Splitter,
      committed, uncommitted catalog.TableDescriptor,
    ) (int, error)

This limiter only applies to secondary tenants. The counter is
maintained in a newly introduced (tenant-only) system table, using the
following schema:

    CREATE TABLE system.span_count (
      singleton  BOOL DEFAULT TRUE,
      span_count INT NOT NULL,
      CONSTRAINT "primary" PRIMARY KEY (singleton),
      CONSTRAINT single_row CHECK (singleton),
      FAMILY "primary" (singleton, span_count)
    );

We need just two integration points for spanconfig.Limiter:
- Right above CheckTwoVersionInvariant, where we're able to hook into
  the committed and to-be-committed descriptor state before txn commit;
- In the GC job, when gc-ing table state. We decrement a table's split
  count when GC-ing the table for good.

The per-tenant span config limit used is controlled by a new tenant
read-only cluster setting: spanconfig.tenant_limit. Multi-tenant cluster
settings (cockroachdb#73857) provides the infrastructure for the host tenant to be
able to control this setting cluster wide, or to target a specific
tenant at a time.

We also need a migration here, to start tracking span counts for
clusters with pre-existing tenants. We introduce a migration that scans
over all table descriptors and seeds system.span_count with the right
value. Given cluster version gates disseminate asynchronously, we also
need a preliminary version to start tracking incremental changes.

It's useful to introduce the notion of debt. This will be handy if/when
we lower per-tenant limits, and also in the migration above since it's
possible for pre-existing tenants to have committed state in violation
of the prescribed limit. When in debt, schema changes that add new
splits will be rejected (dropping tables/indexes/partitions/etc. will
work just fine).

When attempting a txn that goes over the configured limit, the UX is as
follows:

    > CREATE TABLE db.t42(i INT PRIMARY KEY);
    pq: exceeded limit for number of table spans

Release note: None
Release justification: low risk, high benefit change

Release note: None
craig bot pushed a commit that referenced this issue Apr 7, 2022
77337: spanconfig: limit # of tenant span configs r=irfansharif a=irfansharif


79462: colexecproj: break it down into two packages r=yuzefovich a=yuzefovich

**colexecproj: split up default cmp proj op file into two**

This commit splits up a single file containing two default comparison
projection operators into two files. This is done in preparation of
the following commit (which will move one of the operators to a
different package).

Release note: None

**colexecproj: extract a new package for projection ops with const**

This commit extracts a new `colexecprojconst` package out of
`colexecproj` that contains all projection operators with one
constant argument. This will allow for faster build speeds since both
packages contain tens of thousands of lines of code.

Special care had to be taken for the default comparison operator because we
need to generate two files in different packages based on a single
template. I followed the precedent of `sort_partitioner.eg.go` which had
to do the same.

Addresses: #79357.

Release note: None

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
@craig craig bot closed this as completed in 90054c6 Apr 8, 2022
blathers-crl bot pushed a commit that referenced this issue Apr 8, 2022
@kenliu-crl
Contributor

manually reviewed and brought up to date
