CN118489106A - Fast database expansion with storage and computation decoupled architecture - Google Patents
Fast database expansion with storage and computation decoupled architecture
- Publication number
- CN118489106A (application CN202280087387.1A)
- Authority
- CN
- China
- Prior art keywords
- node
- database
- volume
- data
- storage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
Techniques for fast online expansion of databases in a database service via a split architecture including decoupled storage and compute layers are described. A cluster of database (DB) nodes is expanded to add a new DB node. The expansion includes determining a split of the data of a first volume managed by an existing DB node. A second DB node is obtained, and the first volume is cloned in accordance with a lightweight replication technique to produce a second volume for use by the second DB node. After cloning, a set of database modifications is applied to the second volume based on modifications caused by database traffic related to the volume that was received by the first DB node during cloning of the first volume. Each DB node may then delete the portions of its volume that are not needed by that node, based on the split.
Description
Background
Traditionally, databases can be "extended" (scaled out) to accommodate ever-increasing demand using a technique called sharding (sometimes called partitioning). The key idea of sharding is to subdivide a user's data across multiple backing nodes (or "shards") so that the load is dispersed. Typically, a proxy layer is responsible for mapping a user's query to one or more of these shards, e.g., based on a user-provided shard key. However, a key challenge of such a sharding implementation is how to implement dynamic expansion in an efficient manner, i.e., the ability to incorporate new shards into the system while the system remains operational. Traditionally, this process involves rebalancing (i.e., copying) data from existing shards to a newly created shard.
Drawings
Various embodiments according to the present disclosure will be described with reference to the drawings, in which:
FIG. 1 is a diagram illustrating problematic aspects of expanding a database cluster in an online manner.
FIG. 2 is a diagram illustrating a rapid online expansion of a database via a split architecture including decoupled storage and compute layers in a database service, according to some embodiments.
FIG. 3 is a diagram illustrating exemplary components of a database system computing layer utilized with different storage layers, according to some embodiments.
FIG. 4 is a diagram illustrating exemplary components of a distributed storage tier of a database system utilized with different computing tiers according to some embodiments.
FIG. 5 is a diagram illustrating exemplary operational aspects of a split architecture database including decoupled storage and compute layers in a database service, according to some embodiments.
FIG. 6 is a flowchart illustrating the operation of a method of expanding a database cluster using a storage and computing decoupled architecture, according to some embodiments.
FIG. 7 illustrates an example provider network environment, according to some embodiments.
FIG. 8 is a block diagram of an example provider network providing storage services and hardware virtualization services to customers, according to some embodiments.
FIG. 9 is a block diagram illustrating an example computer system that may be used in some embodiments.
Detailed Description
The present disclosure relates to methods, devices, systems, and non-transitory computer-readable storage media for implementing fast database expansion with an architecture based on decoupled storage and compute layers.
As described herein, dynamically expanding a database to incorporate new shards into a cluster is a key challenge and typically involves rebalancing efforts implemented by copying data from one or more existing shards to a new shard or a set of new shards. For example, FIG. 1 is a diagram illustrating problematic aspects of expanding a database cluster 106A. In this example, the database may be implemented by database service 110 (e.g., of cloud provider network 100) via database cluster 106A, where database service 110 may implement a large number of different database clusters 106A-106M for one or more (and typically many) different users at any point in time.
In general, complex databases (e.g., relational databases, document databases, etc.) that provide complex analysis/query functionality may be constructed using database clusters 106A of one or more database nodes 114. A database node may include an instantiation of a database, for example in the form of a database engine 116 (including a query engine) having associated data (shown here as a collection of "records" 118) belonging to the database that is "managed" by the node. In some embodiments, the database node may be implemented by launching a virtual machine or other type of computing instance having database software pre-installed or installed after the instance is launched. In this example, to increase processing power, a database may be "sharded" across a plurality of different database nodes, meaning herein that a first database node is responsible for a first portion of the database's data (e.g., records A-L 118A of a particular record table or collection, or data sets 1-10 of a data set collection, etc.), while a second database node may be responsible for a second portion of the database (e.g., records M-Z 118B).
In some cases, the control plane 104 components of the database service may determine that rebalancing should occur: for example, a particular database node 114B is exhausting its available resources (e.g., storage, memory, compute, bandwidth), the user desires to rebalance the database (e.g., via receiving a command from the user to do so), and so on. Thus, the control plane may wish to expand the cluster 106A of nodes, e.g., by adding a new database node 114C, which may then be responsible for one or more portions of existing data (e.g., from the first database node, the second database node, or both), e.g., records T-Z 118C may be managed by the new node, thereby relieving the second database node of that responsibility. Thus, the overall capacity of the database is increased, which may provide improved performance for its clients 102A-102B (e.g., executing within a service 108 of the provider network, or within another third-party network external to the provider network that is accessible via one or more intermediate networks 160 (such as the internet), and accessing the database cluster 106 via the proxy layer 112 comprised of one or more database proxies).
However, such rebalancing processes are very slow due to a mismatch between storage capacity and storage ingestion and/or data transfer rates. For example, there may be a large amount of stored data that needs to be transmitted and added to a new database node (reflecting the new "shard"), but the amount of bandwidth available for transmission and the processing power available to perform the necessary updates are relatively limited, resulting in long network transmission times and/or ingestion times, as indicated by circle (A) in FIG. 1.
Furthermore, the rebalancing process can place additional load on the source database nodes, as reflected by circle (B). Thus, dynamic expansion may exacerbate capacity shortages in the short term, resulting in database throttling or a risk of complete shutdown/unavailability due to the excessive load imposed on the shards, thus significantly affecting the performance of the existing cluster during the expansion process.
According to some embodiments, database expansion is optimized by pushing rebalancing logic to a separate layer. FIG. 2 is a diagram illustrating a split architecture including a decoupled storage layer 204 and compute layer 202 in a database service, according to some embodiments. Embodiments utilize a two-layer architecture: a compute layer 202 that manages query execution; and a storage layer 204 that manages data persistence and consistency. Because of its limited interface, the storage layer may be designed in a multi-tenant fashion to support workloads from different users (e.g., users of a multi-tenant provider network that implements databases, such as via a managed database service). Multi-tenancy of the storage layer may allow the system to support heat management in an efficient manner. In some embodiments, the system provider may over-provision the storage tier beyond the needs of any single user. The resource utilization of data rebalancing can thus be amortized across a huge user base. Thus, in some embodiments, the storage layer may be implemented in a multi-tenant fashion, allowing users to utilize a larger capacity pool.
In some embodiments, the storage layer may be designed to support copy-on-write snapshots. A copy-on-write snapshot (e.g., of data obtained from a source shard/node) allows the system to quickly start a new shard/node (by making the data available as a virtual storage volume for the new shard using the copy-on-write snapshot) without immediately (or ever) performing a complete physical copy of the underlying data. Embodiments include implementing a shard "split" primitive in which two nodes agree to split the data of a single shard. The copy-on-write primitive can greatly reduce the time required to instantiate a new shard/node by avoiding costly bulk data loading.
In some embodiments, "lightweight" database head nodes 214 of the database cluster 208A may be used to rapidly increase capacity. For example, the system may utilize a shared-disk architecture via the use of these "lightweight" head nodes 214 that do not require actual replication of the data. The system can utilize these read-only copies to facilitate the dynamic expansion process without affecting the primary database node. For example, an embodiment may use a read-only copy to replay the change log records that arrived while the copy-on-write snapshot was being instantiated (e.g., via log transfer 252). By doing so, embodiments avoid affecting the foreground workload on the user's primary "writer" instance.
In some implementations, the storage layer may perform extremely fast data transfers. For example, the storage tier may use copy-on-write volume cloning, where a new volume 234B (created based on database volume 234A) is accessible by a different database head node 214X using the same underlying storage volume data 250. When a database node makes changes to its "own" (shared) view of a volume, data may be written only to that volume's specific individual underlying storage area, effectively "forking" the underlying volume into individual versions based on the original version. In some embodiments, this allows a new database node to be started/configured for use at an extremely fast rate when compared to prior techniques that require the underlying data to be transferred in its entirety as a new copy. Furthermore, when a new database node is started, the underlying data of the volume itself need not be moved immediately (or at all) to the database node (or to a separate storage location within the storage layer); e.g., it may be moved lazily, only when needed.
As a specific example, in some embodiments, the data sets of records 1-10 may be managed by the first database node 214A. The control plane 104 may detect that the first database node 214A has a heavy load and determine to extend the database by adding additional database nodes 214X to mitigate some of the heavy load. In some embodiments, a copy-on-write policy may be used to obtain a "snapshot" of the volumes used by the first database node, and the snapshot may be used to create a second database node 214X that may access the same data via a cloned database volume 234B, which may have different volume metadata 230B (different from volume metadata 230A) but point to (at least initially) the same underlying volume data 250 stored by the one or more storage devices 235.
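To make the cloning step concrete, the following sketch models copy-on-write volume cloning at the metadata level. The class and method names (Volume, clone, write_page) are hypothetical illustrations of the idea described above, not the service's actual implementation.

```python
# Minimal sketch of copy-on-write volume cloning (illustrative names only).
# A clone gets new volume metadata but initially points at the same underlying
# pages as its source; only pages written after the clone are forked.

class Volume:
    def __init__(self, volume_id, shared_pages):
        self.volume_id = volume_id
        self.shared_pages = shared_pages   # underlying data, shared by reference
        self.private_pages = {}            # pages forked by writes to this volume

    def clone(self, new_volume_id):
        # Near-instant: only new metadata is created; no page data is copied.
        return Volume(new_volume_id, self.shared_pages)

    def read_page(self, page_id):
        # Prefer the private (forked) copy, fall back to the shared data.
        return self.private_pages.get(page_id, self.shared_pages.get(page_id))

    def write_page(self, page_id, data):
        # Copy-on-write: the shared version stays untouched for other volumes.
        self.private_pages[page_id] = data


source = Volume("vol-234A", {1: b"data sets 1-10"})
new_volume = source.clone("vol-234B")              # metadata-only operation
new_volume.write_page(1, b"data sets 9-10 only")   # forks page 1 for the clone
assert source.read_page(1) == b"data sets 1-10"    # source view is unchanged
```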
In some embodiments, for any modifications/writes that occur during this snapshot generation window, a modification log 228 may be created to capture the changes made to the volume provided to the new database node 214X (or to the particular portion of the volume that is "given" to the new node). These recorded modifications may be sent via log transfer 252 (e.g., via a "stream" type transfer that lasts for the duration of the entire transfer) to allow the new node to replay the changes at circle (4B) and "catch up" to the same state as the first database node.
In other embodiments, the underlying storage layer 204 itself may be configured, at circle (2A'), to record changes to the underlying database volume during snapshot generation (e.g., as a set of change/redo logs) and during the subsequent attachment of the volume to a new database instance, as reflected by circle (4A'). The "new" database volume (e.g., which may include all data, or pointers to the original data) may then be updated accordingly by the storage layer replaying the log itself at circle (4B'), thereby removing this burden from the database nodes in the compute layer. These actions may be performed by the storage layer control plane 244, or at least partially controlled by the storage layer control plane 244 (e.g., via configuring the associated storage nodes at circle (2A')) and performed by storage nodes 206A-206M.
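A minimal sketch of the log capture/replay idea follows; the record and log structures are assumptions for illustration, since the description does not specify the modification log format.

```python
# Illustrative sketch of capturing writes during cloning and replaying them
# onto the clone before cutover (hypothetical structure and field names).

from dataclasses import dataclass, field

@dataclass
class LogRecord:
    lsn: int          # log sequence number
    partition: int    # logical partition the change belongs to
    op: str           # e.g., "insert" or "delete"
    payload: dict

@dataclass
class ModificationLog:
    records: list = field(default_factory=list)

    def append(self, record: LogRecord):
        self.records.append(record)

    def replay_onto(self, target, target_partitions, after_lsn=0):
        # Replay only records for the partitions the new node will own,
        # skipping anything already reflected in the cloned volume.
        for rec in sorted(self.records, key=lambda r: r.lsn):
            if rec.lsn > after_lsn and rec.partition in target_partitions:
                target.apply(rec)
```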
At some point, for example, when it is determined that both database nodes 214 are up-to-date, the proxy layer 112 (e.g., a set of one or more database proxy nodes, software routers, etc.) may be updated (e.g., by updating its mapping information 216 cache, associating a type of database traffic (such as traffic associated with a particular partition of the database) with a particular database node/network address) to cause traffic associated with a first portion of the data set (e.g., data sets 1-8) to continue to be sent to the first database node 214A, while traffic associated with a second portion of the data set (e.g., data sets 9-10) is sent to the new database node 214X. For example, control plane 104 may send one or more messages to proxy layer 112 to allow the proxy layer to update its mapping information 216 and thus be able to identify a partition key (e.g., a particular value of a particular field or set of fields) from a client's query and map the partition key to an identifier of the database head node that manages the associated data.
At this point or later, the database nodes 214 may then delete the portions of their volumes 234 for which they are no longer responsible; e.g., the first existing database node may delete data sets 9-10, while the new second database node may delete data sets 1-8. This may allow the database node itself to preserve as much capacity as possible, for example, by pruning the amount of addressable storage space that has been utilized to allow for further growth (even though the underlying storage may not actually be modified). Notably, such pruning need not occur immediately, and may occur at a later point in time, such as when the load on the database is reduced. In some embodiments pruning may occur at a fine-grained level (e.g., record-by-record, table-by-table, etc.), but in other embodiments pruning may occur via custom logic in the storage layer that is capable of implementing larger-scale deletions using a single command (or a limited set of commands).
Thus, a powerful database can be enabled to expand extremely quickly and efficiently. In some embodiments, a fast, lightweight "clone" may be performed on a database volume, wherein volume metadata 230 associated with the volume (e.g., including pointers or identifiers of the underlying data) may be created in the storage tier for the new clone volume without the need to copy or modify the underlying data of the volume, and then each side of the clone (e.g., each database head node) arranges for deletion of the (roughly) half of the data partitions for which it is no longer responsible. This clone-and-drop approach provides fast and efficient data migration because data movement occurs at the storage layer, and the source database instance does not incur the processing, memory, or bandwidth costs associated with data transfer.
Thus, embodiments disclosed herein allow for a quick expansion of a database via a fast volume cloning technique to efficiently create copies of existing shards (e.g., within seconds to minutes), and thereafter perform metadata updates to cause the proxy layer to redirect requests to newly established shards. In some embodiments, after cloning is complete, the system may delete unwanted data from each half of the splitting operation.
Thus, embodiments may be particularly fast because the addition of shards is completed quickly, e.g., within seconds or minutes, rather than the tens of minutes or hours required in similar systems. Embodiments are also relatively easy to understand, implement, and operate, and thus easy to implement and maintain over time.
In addition, embodiments are also efficient in that the expansion mechanism has minimal impact on the user's foreground workload and may incur zero (or near zero) downtime, as the expansion mechanism does not trigger errors or downtime visible to the client application. One fundamental challenge in increasing capacity is doing so without causing downtime visible to the clients of the database. In the embodiments disclosed herein, the cloning operation, while very fast, may not be instantaneous. Thus, in some embodiments, to avoid stopping writes, embodiments supplement the cloning process with a lightweight log capture/replay mechanism, where writes to the database that arrive during the cloning process are replayed onto the child shards before initiating the switchover.
As shown in fig. 2, the database may be implemented in a provider network 100 (or "cloud" provider network) that provides users with the ability to use one or more of various types of computing-related resources, such as computing resources (e.g., executing Virtual Machine (VM) instances and/or containers, executing batch jobs, executing code without provisioning servers), data/storage resources (e.g., object storage, block-level storage, data archive storage, databases and database tables, etc.), network-related resources (e.g., configuring a virtual network including multiple sets of computing resources, content Delivery Networks (CDNs), domain Name Services (DNS)), application resources (e.g., databases, application build/deployment services), access policies or roles, identity policies or roles, machine images, routers, and other data processing resources, etc. These and other computing resources may be provided as services such as: hardware virtualization services that may execute computing instances, storage services that may store data objects, and the like. A user (or "customer") of provider network 100 may use one or more user accounts associated with the customer account, but these terms may be used interchangeably to some extent depending on the context of use. A user may interact with provider network 100 via one or more interfaces across one or more intermediary networks (e.g., the internet), such as through the use of Application Programming Interface (API) calls, via a console implemented as a website or application, and so forth. An API refers to an interface and/or communication protocol between a client and a server such that if a client makes a request in a predefined format, the client should receive a response in a specific format or initiate a defined action. In the cloud provider network context, APIs provide a gateway to enable clients to access the cloud infrastructure by allowing them to obtain data from or cause actions within the cloud provider network, thereby enabling the development of applications that interact with resources and services hosted in the cloud provider network. The API may also enable different services of the cloud provider network to exchange data with each other. The interface may be part of or act as a front end to the control plane of the provider network 100, including a "back end" service that supports and implements services that may be provided more directly to the customer.
For example, a cloud provider network (or simply "cloud") generally refers to a large pool of accessible virtualized computing resources, such as computing, storage, and networking resources, applications, and services. The cloud may provide convenient, on-demand network access to a shared pool of configurable computing resources that may be programmatically configured and released in response to client commands. These resources may be dynamically configured and reconfigured to adjust to variable loads. Thus, cloud computing may be considered both as an application for service delivery via publicly accessible networks (e.g., the internet, cellular communication networks), as well as hardware and software in a cloud provider data center that provides these services.
The cloud provider network may be formed as a plurality of areas, where an area is a geographic area in which the cloud provider gathers data centers. Each zone includes a plurality of (e.g., two or more) Available Zones (AZ) connected to each other via a private high-speed network (e.g., a fiber-optic communication connection). The AZ (also referred to as a "zone") provides an isolated fault domain that includes one or more data center facilities having separate power supplies, separate networking, and separate cooling relative to the data center facilities in another AZ. A data center refers to a physical building or enclosure that houses and provides power and cooling for servers of a cloud provider network. Preferably, the AZs within an area are located far enough from each other that a natural disaster (or other failure causing event) does not affect more than one AZ at the same time or take more than one AZ offline.
Users may connect to AZ of a cloud provider network, e.g., through a Transit Center (TC), via publicly accessible networks (e.g., the internet, cellular communication networks). TC is the primary backbone location linking users to the cloud provider network and may be located at other network provider facilities (e.g., internet Service Provider (ISP), telecommunications provider) and securely connected to AZ (e.g., via VPN or direct connection). Each region may operate two or more TCs to achieve redundancy. The zones are connected to a global network that includes a private networking infrastructure (e.g., a fiber optic connection controlled by a cloud provider) that connects each zone to at least one other zone. The cloud provider network may deliver content through edge locations and regional edge cache servers from access points (or "POPs") that are outside of, but networked with, these regions. This partitioning and geographical distribution of computing hardware enables cloud provider networks to provide users with low-latency resource access with a high degree of fault tolerance and stability worldwide.
In general, the services and operations of a provider network can be broadly subdivided into two categories: control plane operations carried on the logical control plane and data plane operations carried on the logical data plane. The data plane represents movement of user data through the distributed computing system, and the control plane represents movement of control signals through the distributed computing system. The control plane typically includes one or more control plane components distributed across and implemented by one or more control servers. Control plane traffic typically includes management operations such as system configuration and management (e.g., resource placement, hardware capacity management, diagnostic monitoring, system status information). The data plane includes user resources (e.g., compute instances, containers, block storage volumes, databases, file stores) implemented on the provider network. Data plane traffic typically includes non-administrative operations such as transferring user data to and from user resources. The control plane components are typically implemented on a set of servers separate from the data plane servers, and the control plane traffic and data plane traffic may be sent over separate/distinct networks.
To provide these and other computing resource services, provider network 100 typically relies on virtualization technology. For example, virtualization techniques may provide a user with the ability to control or use computing resources (e.g., a "computing instance," such as a VM that uses a guest operating system (O/S) that operates using a hypervisor that may or may not further operate above an underlying host O/S, a container that may or may not operate in a VM, a computing instance that may execute on "bare metal" hardware without the underlying hypervisor), where one or more computing resources may be implemented using a single electronic device. Thus, users may directly use computing resources hosted by the provider network (e.g., provided by hardware virtualization services) to perform various computing tasks. Additionally or alternatively, the user may indirectly use the computing resources by submitting code to be executed by the provider network (e.g., via an on-demand code execution service), which in turn executes the code using one or more computing resources, typically without the user having any control or knowledge of the underlying computing instance involved.
The provider network 100 shown in FIG. 2 includes a database service 110, as well as any number of other services that it may provide to users. According to some embodiments, database service 110 enables users to create, manage, and use databases (e.g., relational databases) in a cloud-based environment in a manner that provides enhanced security, availability, and reliability relative to other database environments. In some embodiments, database service 110 features a distributed, fault tolerant, and self-healing storage system that automatically expands (e.g., implemented in part using an extensible storage service). In some embodiments, the database system provided by database service 110 organizes the basic operations of the database (e.g., query processing, transaction management, caching, and storage) into layers that may be individually and independently scalable. For example, in some embodiments, each database instance provided by database service 110 utilizes a compute layer (which may include one or more database nodes 214, sometimes also referred to as "head nodes"), a separate distributed storage layer (which may include multiple storage nodes 206 that collectively perform some operations traditionally performed in the database layers of existing database systems), and an optional backup storage layer.
In general, a database is a group of data, a collection of records, or other groupings of data objects stored in a data store. In some embodiments, the data store includes one or more direct or network attached storage devices (e.g., storing data that may be disclosed as virtual storage volumes) accessible to the database engine 118 (e.g., block-based storage devices such as hard disk drives or solid state drives, file systems, etc.). As described above, in some embodiments, the data store is managed by a separate storage service. In some embodiments, managing the data store at a separate storage service includes distributing data among a plurality of different storage nodes (e.g., storage nodes 206) to provide redundancy and availability of the data.
In some embodiments, the data of the database is stored in one or more portions of a data store, such as data pages. One or more data values, records, or objects may be stored in a data page. In some embodiments, the data page also includes metadata or other information for providing access to the database. For example, the data pages may store data dictionaries, transaction logs, undo and redo log records, etc., but these may also be stored separately. The query engine 120 of the database engine 118 performs access requests (e.g., requests to read, obtain, query, write, update, modify, or otherwise access) based on state information. The state information may include, for example, a data dictionary, undo log, transaction log/table, index structures, mapping information, a data page cache or buffer, etc., or any other information used to perform an access request with respect to the database. For example, the state information may include mapping information (e.g., an index) for obtaining data records that match certain search criteria (e.g., query predicates).
In some embodiments, some operations of the database (e.g., backup, restore, log manipulation, and/or various space management operations) may be offloaded from the database engine 118/compute layer to the storage layer and may be distributed across multiple storage nodes and associated storage devices. For example, in some embodiments, rather than the database engine 118 applying changes to the database (or its data pages) and then sending the modified data pages to the storage layer, it may be the responsibility of the storage layer itself to apply the changes to the stored database (and its data pages). In such embodiments, the database engine 118 sends redo log records, rather than modified data pages, to the storage layer. The storage tier then performs the redo processing (e.g., applying the redo log records) in a distributed manner (e.g., by a background process running on storage nodes 206).
Further specific details of exemplary embodiments are now provided. In some implementations, the database service may implement the expansion model based on four underlying abstractions. First, a "collection" may be equivalent to a relational table, and embodiments may support sharded collections whose contents are distributed over a group of nodes in a node cluster. Next, a "partition" represents a subset of the data of a single collection. Embodiments support partitioning in a variety of ways, e.g., hash partitioning according to a user-provided key, where each partition corresponds to a range of hash key values. In some embodiments, a "logical shard" represents the set of partitions for the same key space range across all collections in the system. For example, logical shard "37" manages the 37th hash bucket of all sharded collections. Further, a "physical shard" may represent the resources (e.g., compute and storage) required to manage a set of logical shards. Each physical shard may be mapped to a classic cluster.
The relationship between these entities may be as follows. A collection maps to one or more partitions, each partition corresponding to a subset of the key space. Each partition maps to a single logical shard, which is shared with other collections within the cluster. Each logical shard may be associated with a particular physical shard.
In some embodiments, the database service manages cluster metadata using two storage mechanisms: global metadata and local metadata. In some embodiments, global metadata 230 is managed by the database service control plane and may be stored in a storage system (e.g., a relational database or otherwise). Global metadata 230 manages information about the physical shards, including their number, configuration, and endpoints (e.g., URLs or other network addresses). The global metadata may also store an eventually consistent snapshot of the logical shard information, as described below.
In some embodiments, local metadata 232 is stored with the user data on a database shard (e.g., database head node 214). Embodiments use local metadata 232 to manage logical shard metadata and collection metadata. Note that in some embodiments, the system need not explicitly store partition metadata.
Global metadata
Global metadata has two main roles. First, it is the authoritative source of information about the physical shards (i.e., the group of clusters). Second, the global metadata stores an eventually consistent description of the logical shards. In steady state, a database service proxy may obtain the complete physical-to-logical shard map from the global metadata alone. The global metadata may include the following information for each cluster:
● Endpoint identifiers (e.g., URLs, network addresses, unique identifiers, etc.) of the associated clusters.
● Version number incremented with each change of cluster configuration.
● Wall clock time stamp of cluster configuration.
- ● Number of logical shards. For example, this may be a power-of-two value (e.g., 1, 2, 4, 8, 16, etc.).
- ● For each physical shard, embodiments maintain the following information:
- A unique identifier value uniquely identifying the physical shard.
- An endpoint identifier (e.g., URL, network address, etc.) for the shard.
- A shard index used for routing requests.
- The "state" of the shard, where valid states may include CREATING, ACTIVE, etc. These states may be used when adding or removing shards.
Each physical shard may be associated with an index that is used to map physical shards to logical shards. The mapping scheme is similar to consistent hashing in that both the servers and the keys are managed in the same hash key space. Consider an example with 128 logical shards and 4 physical shards. The physical shards are assigned index values of 0, 32, 64, and 96. The first physical shard manages logical shards [0, 32); the second physical shard manages logical shards [32, 64); and so on. Dynamic expansion may be achieved by introducing a new physical shard into the system. The system may assign it a shard index equidistant between two existing shard indexes.
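The index arithmetic in this example can be sketched as follows; the helper names are hypothetical, and the sketch is a simplification of the consistent-hashing-like scheme described above.

```python
# Sketch of the physical-to-logical shard index mapping (hypothetical names).
# With 128 logical shards and physical shards at indexes 0, 32, 64, and 96,
# each physical shard owns the logical shards from its own index up to (but
# not including) the next physical shard's index.

import bisect

NUM_LOGICAL_SHARDS = 128
physical_shard_indexes = [0, 32, 64, 96]   # sorted physical shard indexes

def owning_physical_index(logical_shard: int) -> int:
    # Largest physical index that is <= the logical shard number.
    pos = bisect.bisect_right(physical_shard_indexes, logical_shard) - 1
    return physical_shard_indexes[pos]

def split_index(left: int, right: int) -> int:
    # A new physical shard is placed equidistant between two existing indexes.
    return (left + right) // 2

assert owning_physical_index(37) == 32                 # logical shard 37 -> index 32
physical_shard_indexes.insert(1, split_index(0, 32))   # dynamic expansion adds index 16
assert owning_physical_index(20) == 16
```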
Backup/restore presents challenges to managing cluster metadata. In general, implementations assume that the shard configuration may change between cluster snapshots. Thus, embodiments maintain all previous cluster configurations referencing potentially valid restore points. However, the proxy layer may only care about the latest metadata snapshot of the cluster.
Local metadata
Each physical shard may manage local metadata describing the logical shards that exist locally. Embodiments maintain a logical shard map that contains the following information:
- ● Logical shard number
- ● Logical shard state, e.g., one of active, migrating, inaccessible, or failed
- ● Physical shard successor (optional). For recently moved logical shards, the system may record the successor physical shard.
Embodiments also use local metadata to manage collection metadata. Embodiments assume that the metadata for a given collection is maintained on a particular shard, and the following information may be maintained for each collection (a sketch follows this list):
- ● The database and collection names. In general, these act together as the primary key for the collection metadata.
- ● A unique identifier value that uniquely identifies the collection.
- ● The shard key (or null for unsharded collections).
- ● The logical shards on which the data is stored. An unsharded collection may map to a single logical shard, while a sharded collection may map to all logical shards.
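A minimal sketch of these local metadata records, using hypothetical field names and types, might look like the following.

```python
# Minimal sketch of the local metadata described above (hypothetical names).
# Each physical shard keeps a logical shard map plus per-collection metadata.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class LogicalShardEntry:
    shard_number: int
    state: str                                        # "active", "migrating", "inaccessible", or "failed"
    successor_physical_shard: Optional[str] = None    # set for recently moved shards

@dataclass
class CollectionMetadata:
    database_name: str           # together with collection_name, the primary key
    collection_name: str
    collection_id: str           # unique identifier for the collection
    shard_key: Optional[str]     # None for unsharded collections
    logical_shards: List[int]    # one entry if unsharded, all logical shards otherwise
```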
Steady state request routing
In this section, a flow of data plane queries is provided, according to some embodiments, starting with a static system without any client connections.
First, the client opens a new TCP connection to its database endpoint (e.g., a network address within the cloud private network in which the client operates). The load balancer may route connections to multi-tenant agents of the database agent layer described herein.
The proxy layer may extract the cloud provider network endpoint identifier from the request information (e.g., supplied by the load balancer). The proxy layer may find shard metadata corresponding to the cluster. The shard metadata is cached for subsequent reuse.
The client performs authentication and may then send a query using the database wire protocol. The proxy layer may decode the query and extract the target database and collection. The proxy layer may look up the collection metadata to obtain the shard key for the collection. The collection metadata is cached for subsequent reuse.
The proxy layer may hash the shard key and index into the array of logical shards. The proxy layer consults the shard metadata to map from the logical shard to a physical shard and routes the request to the appropriate physical shard. The proxy layer gathers the shard responses and sends a single aggregated response to the client, ending the flow.
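The steady-state request path can be summarized with the following sketch; the metadata objects and their lookup methods are hypothetical, and caching is elided to keep the flow visible.

```python
# Sketch of steady-state request routing at the proxy layer (hypothetical
# names). Caching of shard and collection metadata is omitted for brevity.

import hashlib

NUM_LOGICAL_SHARDS = 128

def hash_shard_key(value: str) -> int:
    digest = hashlib.sha256(value.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_LOGICAL_SHARDS

def route_query(query, shard_metadata, collection_metadata):
    # 1. Decode the query to find the target database/collection.
    collection = collection_metadata[(query.database, query.collection)]
    # 2. Extract the shard key value and hash it into a logical shard index.
    logical_shard = hash_shard_key(query.values[collection.shard_key])
    # 3. Map the logical shard to the physical shard that owns it.
    physical_shard = shard_metadata.owner_of(logical_shard)
    # 4. Return the endpoint the request should be forwarded to.
    return physical_shard.endpoint
```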
Resolving the query involves mapping from the customer-visible collection name (e.g., "my_collection") to the low-level partition name (e.g., "ls2_a8098c1a-f 86e-11da-bd1a-00112444be1 e"). Embodiments choose to implement this name mapping within the proxy layer, but other embodiments use alternative policies where name resolution is pushed to the database engine itself, e.g., by leveraging the support of declarative partitions by the database engine. However, proxy-centric approaches have the benefit of limiting the distributed system complexity to one logical component.
Embodiments manage metadata related to physical shards, logical shards, and collections. Embodiments have the option of storing metadata globally or locally on the shards. The physical shard metadata may be stored globally to provide access to the local metadata.
In contrast, in some embodiments, the collection metadata is stored on a single-tenant shard. Storing the collection metadata locally on the shards may reduce the blast radius of a noisy client. This is particularly important because list operations (e.g., list collections, list databases) are difficult to cache at the proxy layer.
There are several methods for mapping collection metadata to shards. One strategy is to store all collection metadata on a designated shard ("shard 0"). This approach is simple, but if the collection metadata is accessed frequently, there is a risk of introducing hotspots. A second approach is to split the collection metadata across the individual shards. Essentially, the collection metadata can be treated as a sharded collection, with the collection name as the shard key. This approach has the benefit of distributing the load more evenly across the cluster. A final strategy is to store the complete collection metadata redundantly on each physical shard. The replicated-metadata approach may facilitate functionality that relies on resolving customer-facing collection names (e.g., DML auditing).
The final form of metadata is the placement of the logical shards. Embodiments use shard-local metadata as the authoritative source of information about logical shards. This allows the system to manage logical shard metadata and data using local transactions. Embodiments also copy the logical shard information to the global metadata with eventual consistency. This construction allows a proxy of the proxy layer, in steady state, to obtain the complete physical-to-logical shard map from the global metadata alone, thereby reducing the miss penalty of filling a cold cache. During a shard split event, a proxy may update its view of the logical shard state by pulling a routing "hint" from the physical shard.
Shard split protocol
The shard split protocol is used to introduce new physical shards into the cluster. The basis of the protocol is the volume cloning function (e.g., not copying any/all of the underlying volume data, but only creating metadata for a new volume that points to the existing underlying data), which is used to quickly bootstrap a copy of an existing shard. After replication is complete, the system deletes half the data on each shard and updates the metadata (global and local) to reflect the update. Embodiments may also utilize a lightweight log capture/replay mechanism in order to render the protocol non-blocking. Embodiments initiate logging prior to cloning and ensure that all log records are replayed onto the child before initiating the switchover.
The protocol is now outlined at a high level, in sufficient detail to enable one of ordinary skill in the art to practice the protocol. First, a new physical shard (the "child") is registered in the global metadata in the CREATING state at circle (1) of FIG. 2. At this point, the shard is not yet backed by any data.
Next, as step two, a split point is selected to subdivide the parent's logical shards, and logging is enabled at circle (2) for the target logical shards to be migrated (as reflected by circle (2A)); alternatively, as shown by circle (2B), the storage tier control plane 244 may be configured to perform its own logging. As step three, open transactions are quiesced (e.g., terminated or otherwise allowed to reach an inactive state). The logging primitive may not capture the effects of pre-existing transactions. Thus, embodiments introduce a barrier wait for open transactions and may incorporate a timeout to force-close transactions that exceed the time limit.
As step four, the target logical shards are marked as having the MIGRATING state in the logical shard map (e.g., in the local metadata 232) on the parent shard (e.g., database head node 214A). The child shard may be recorded here as the designated successor.
In step five, the parent volume is cloned (as reflected by circle (3)) using a lightweight cloning function that does not replicate any/all of the underlying data of the volume. The system then waits for the child shard to become active.
At step six, the log records are applied to the child shard (e.g., according to the logical stream of changes provided at circle (4A) from parent head node 214A, or via the storage layer control plane at circle (4B)); this step may enforce a timeout to ensure convergence if necessary, at the cost of increased downtime. As step seven, the target logical shards are marked as INACCESSIBLE on the parent shard. This results in temporary failures of in-flight queries and new queries. In step eight, all remaining log records are applied, and as a ninth step, the target logical shards are marked with the ACTIVE state on the child shard and the DELETING state on the parent shard.
As step ten, the global metadata is updated at circle (5) to mark the child physical shard as ACTIVE, allowing full use of the new database head node 214X and continued use of the parent database head node 214A. At step eleven, the data of the logical shards in the DELETING state is deleted, as reflected by circles (A1) and (A2). This is a clean-up step that may occur at any time; e.g., it may be performed fully or partially during the shard split protocol (e.g., during or near the actual cloning), or in a lazy or delayed manner after the shard split protocol ends.
Request routing during the shard split process is similar to the steady-state routing described above. However, during the switchover period (e.g., steps seven and eight), the logical shards may not be accessible. A request sent to the parent for a logical shard that is being migrated may generate an error, where the error code indicates to the proxy layer that the request may be safely retried. Additionally, in some embodiments, the error includes a routing "hint" value that allows the proxy to update its routing cache without consulting the global metadata. Thus, the hint avoids a "thundering herd" against the global metadata (where a large number of participants attempt to access the same resource, here the related metadata, at the same time) and allows the global metadata to become consistent eventually.
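The retry-with-hint behavior might be sketched as follows; the error code, cache API, and send function are hypothetical placeholders for whatever the proxy implementation actually uses.

```python
# Sketch of proxy-side handling of a retryable error carrying a routing "hint"
# during a shard split (hypothetical error shape and cache API).

def execute_with_hint(request, route_cache, send):
    shard = route_cache.lookup(request.logical_shard)
    response = send(shard, request)
    if response.error_code == "SHARD_MIGRATED":
        # The hint names the successor physical shard, so the proxy can update
        # its route cache without consulting global metadata (avoiding a
        # thundering herd on the metadata store), then safely retry.
        route_cache.update(request.logical_shard, response.hint_endpoint)
        response = send(response.hint_endpoint, request)
    return response
```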
In some embodiments, the database engine supports operations on logical shards to support dynamic cluster expansion. Three high-level operations may be supported: first, a lightweight logging mechanism; second, concurrency control primitives; and third, a barrier wait mechanism for in-flight transactions.
The system supports a lightweight logical log that describes a set of modifications to the collection partitions within a logical shard. The lightweight log is conceptually similar to a change stream. However, the lightweight log may be smaller (and faster) because embodiments assume there are reader nodes that can reconstruct the complete system state. For example, a lightweight log may describe only the set of documents/records that are inserted or deleted within a given window. One possible entry format is as follows:
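The entry format itself is not reproduced in this text. One illustrative shape, consistent with the insert/delete description above and using hypothetical field names, could be:

```python
# Illustrative (hypothetical) lightweight log entry; the actual format is not
# reproduced in this description. It records only inserts and deletes within a
# logical shard, relying on a reader that can reconstruct full state.

from dataclasses import dataclass

@dataclass
class LightweightLogEntry:
    lsn: int              # monotonically increasing sequence number
    logical_shard: int    # logical shard the change belongs to
    collection_id: str    # target collection partition
    op: str               # "insert" or "delete"
    document_id: str      # identifier of the inserted/deleted document/record
```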
Second, embodiments support concurrency control at logical shard granularity. The system may support at least three possible states, as described below. Note that the logical shard state persists after a system restart.
1: Active. All collection partitions within the logical shard are accessible.
2: Inaccessible. Accessing the collection partitions may return an error. Read-only accesses that started before the state change may be allowed to continue (assuming continuity of the underlying snapshot).
3: Failed. The logical shard is no longer present locally. Any access attempt returns an error. This is a terminal state (i.e., it has no outgoing edges).
Finally, embodiments may support a barrier wait primitive for operations on logical shards. The barrier wait returns once the pre-existing transactions on the logical shard have terminated. The barrier wait accepts a timeout parameter, after which any ongoing operations are immediately aborted.
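A compact sketch of the logical shard states and the barrier wait primitive follows; the transition set and class shape are assumptions for illustration, and durable persistence of the state across restarts is omitted.

```python
# Sketch of logical shard concurrency control and barrier wait (hypothetical
# names). The real system persists the state across restarts; this sketch only
# models the transitions and the wait-for-open-transactions barrier.

import threading

ALLOWED_TRANSITIONS = {
    "ACTIVE": {"INACCESSIBLE"},
    "INACCESSIBLE": {"ACTIVE", "FAILED"},
    "FAILED": set(),                 # terminal state: no outgoing edges
}

class LogicalShard:
    def __init__(self):
        self.state = "ACTIVE"
        self.open_transactions = 0
        self._cond = threading.Condition()

    def set_state(self, new_state):
        if new_state not in ALLOWED_TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state

    def barrier_wait(self, timeout_seconds):
        # Returns True once all pre-existing transactions have terminated;
        # after the timeout, remaining operations would be aborted.
        with self._cond:
            return self._cond.wait_for(lambda: self.open_transactions == 0,
                                       timeout=timeout_seconds)
```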
Thus, for further understanding, pseudocode for the shard split process is provided as follows:
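The pseudocode itself is not reproduced in this text. A hedged reconstruction from the numbered steps above, using hypothetical function names, might read:

```python
# Hedged reconstruction of the shard split flow from the steps above
# (hypothetical names; error handling and retries omitted). Not the verbatim
# pseudocode, which is not reproduced in this description.

def split_shard(parent, global_metadata, storage_layer, proxy_layer):
    child = global_metadata.register_physical_shard(state="CREATING")   # step 1
    targets = parent.choose_split_point()                               # step 2
    parent.enable_logging(targets)
    parent.barrier_wait_open_transactions(timeout_seconds=30)           # step 3
    parent.mark_logical_shards(targets, "MIGRATING", successor=child)   # step 4
    child_volume = storage_layer.clone_volume(parent.volume)            # step 5 (copy-on-write)
    child.attach(child_volume)
    child.apply_log(parent.drain_log(targets))                          # step 6
    parent.mark_logical_shards(targets, "INACCESSIBLE")                 # step 7
    child.apply_log(parent.drain_log(targets))                          # step 8 (remaining records)
    child.mark_logical_shards(targets, "ACTIVE")                        # step 9
    parent.mark_logical_shards(targets, "DELETING")
    global_metadata.set_shard_state(child, "ACTIVE")                    # step 10
    parent.delete_data(targets)                                         # step 11 (may be lazy)
    child.delete_logical_shards_except(targets)
    proxy_layer.refresh_routing()
    return child
```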
to appreciate the separation architecture in further detail, fig. 3 is a diagram illustrating aspects of a computational layer for a database in a database service, according to some embodiments. In this example, database system 300 includes a respective database engine head node 214 for each of a number of databases and a storage tier implemented, for example, using a distributed storage service (which may or may not be visible to clients of the database system, such as shown by database clients 102A-102N). As shown in this example, one or more of database clients 102102A-102102N may access database head node 214 (e.g., head node 214A, a..the head node 214N, each of which is a component of a respective database instance) via one or more networks (e.g., these components may be network-addressable and accessible to database clients 102102A-102102N) and through proxy layer 112. However, in different embodiments, a storage tier may or may not be network-addressable and may or may not be accessible to the storage clients 102102A-102102N, which may be employed by the database system to store database volumes (such as data pages of one or more databases, and redo log records and/or other metadata associated therewith) and/or copies of database volumes on behalf of the database clients 102102A-102102N, as well as perform other functions of the database system described herein. For example, in some embodiments, the storage layer 204 may perform various storage, access, change log recording, recovery, log record manipulation, and/or space management operations via the storage nodes 206A-206M in a manner that is not visible to the storage clients 102A-102N.
As previously described, each database instance may include a database engine head node 214 that receives requests (e.g., queries to read or write data, etc.) from various clients, such as programs (e.g., applications) and/or subscribers (users). The database engine head node 214 may parse the request, optimize it, and formulate an execution plan to perform the associated database operations. In the example shown in FIG. 3, the query engine component 120 of the database engine 118 of database engine head node 214A may perform these functions for queries received from database client 102A and targeted to the database instance of which database engine head node 214A is a component. In some implementations, the database engine 118 can return a query response to the database client 102, which can include a record set, a write acknowledgement, a requested data page (or portion thereof), an error message, and/or other responses, as appropriate. As shown in this example, database engine head node 214A may also include a client-side storage service driver 325 that may route read requests and/or redo log records to the various storage nodes 206 within storage layer 204, receive write acknowledgements from storage layer 204, receive requested data pages from storage layer 204, and/or return data pages, error messages, or other responses to query engine 305 (which may in turn return them to database client 102A). The client-side storage service driver 325 may maintain mapping information about the database volumes stored in storage tier 204, such that the particular protection group that maintains a partition of a database volume may be determined. Read requests and redo log records may then be routed to the storage nodes 206 that are members of that protection group, based on the user data partition to which the read request is directed or to which the redo log record pertains.
In this example, database engine head node 214A includes a data page cache 335 in memory 304, in which recently accessed data pages may be temporarily stored. In some embodiments, if an online restore operation occurs, the data pages in the cache may be retained to perform faster query processing operations using local data, etc. Additionally or alternatively, the database engine head node 214A may include a transaction and consistency management component that may be responsible for providing transactionality and consistency in the database instance of which database engine head node 214A is a component. For example, the component may be responsible for ensuring the atomicity, consistency, and isolation properties of the database instance and of transactions directed to the database instance. As shown in FIG. 3, database engine head node 214A may also include a transaction log 340 and an undo log 345 that may be employed by the transaction and consistency management component to track the state of individual transactions and roll back any locally cached results of uncommitted transactions.
Note that other database engine head nodes 214 (e.g., nodes 214B-214N) may include similar components and may perform similar functions on queries (or other database statements) received by one or more of the database clients 102A-102N and directed to respective database instances in which the database engine head nodes are components.
In some implementations, the storage layer 204 described herein may organize data into various logical data volumes, extents that are made durable across a protection group of storage nodes (which may include database partitions (e.g., user data space) of the volumes and log segments of the volumes), segments (which may be the data stored on a single storage node of the protection group), and pages for storage on one or more storage nodes. For example, in some embodiments, each database is represented by a logical volume, and each logical volume is divided into a plurality of extents across a set of storage nodes. A protection group may be made up of different storage nodes in the storage service that together make an extent durable. An extent may be made durable using a plurality of segments, each of which is located on a particular one of the storage nodes in the protection group.
In some embodiments, each data page may be stored in a segment such that each segment stores a set of one or more data pages and a change log (also referred to as a redo log) for each data page stored by the segment (e.g., a log of redo log records). Thus, the change log of a segment may hold log records for the protection group of which the segment is a member. The storage node may receive redo log records (and/or additional or alternative log records) and merge them to create a new version of the corresponding data page, for example lazily and/or in response to a request for a data page or a database crash; if a data page is shared between a database and a copy of the database, the merge may create a different version of the page that is included in the copy and is not visible to the database. In some embodiments, the data pages and/or change logs may be mirrored on multiple storage nodes according to a variable configuration, such as in a protection group (which may be specified by the client on behalf of which the database is maintained in the database system). For example, in different embodiments, one, two, or three copies of the data or change log may be stored in one, two, or three different availability zones or regions, depending on a default configuration, application-specific persistence preferences, or client-specified persistence preferences.
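As a rough illustration of how a storage node might merge redo log records to materialize a page version lazily, the toy sketch below applies records in LSN order up to a requested LSN; the record layout and page representation are simplified assumptions, not the actual on-disk format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RedoRecord:
    lsn: int       # logical sequence number of the change
    offset: int    # byte offset within the page that the change touches
    data: bytes    # replacement bytes for that range

def materialize_page(base_page: bytes, redo_log: List[RedoRecord], up_to_lsn: int) -> bytes:
    """Apply all redo records with LSN <= up_to_lsn, in LSN order, to a base page image."""
    page = bytearray(base_page)
    for record in sorted(redo_log, key=lambda r: r.lsn):
        if record.lsn > up_to_lsn:
            break
        page[record.offset:record.offset + len(record.data)] = record.data
    return bytes(page)

# Example: lazily materializing a 16-byte page when it is requested.
base = bytes(16)
log = [RedoRecord(lsn=10, offset=0, data=b"\x01\x02"),
       RedoRecord(lsn=11, offset=4, data=b"\xff")]
print(materialize_page(base, log, up_to_lsn=11).hex())
```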
The block diagram in fig. 4 illustrates one embodiment of a storage layer. In particular, FIG. 4 is a diagram illustrating aspects of a storage tier for a database in a database service according to some embodiments. In at least some embodiments, the storage nodes 206A-206M can store data for different clients as part of a multi-tenant storage service. For example, each segment may correspond to a different protection group and volume for a different client.
In some embodiments, a client (from the perspective of the storage tier), such as database engine head node 214, may communicate with the storage nodes 206A-206M of storage tier 204 that store data as part of a database managed by a client-side storage service driver at the client. In this example, the storage tier 204 includes a plurality of storage nodes 206A-206M, each of which may include storage for the data pages 441 and the redo logs 439 for the segments it stores, and hardware and/or software that may perform various segment management functions 437. For example, each storage node 206 of the storage tier 204 may include hardware and/or software that may perform at least a portion of any or all of the following: replication (locally, e.g., within a storage node), merging redo logs to generate data pages, log management (e.g., manipulating log records), crash recovery (e.g., determining candidate log records for volume recovery), creating snapshots of segments stored at the storage node, and/or space management (e.g., for a segment or state storage). Each storage node 206 of the storage tier 204 may also have one or more attached storage devices 435 (e.g., SSDs, HDDs, or other persistent storage) that may store data blocks on behalf of clients (e.g., users, client applications, and/or database service subscribers).
In the example shown in fig. 4, storage node 206A of storage layer 204 includes data page 441, segment redo log 439, segment management function 437, and attached storage 435. Similarly, storage node 206M of storage layer 204 includes data page 441, segment redo log 439, segment management function 447, and attached storage 435.
In some implementations, each of the storage nodes 206 in the storage layer 204 implements a set of processes, running on the operating system of the node, that manage communication with the database engine head node (e.g., to receive redo logs, send back data pages, etc.). In some embodiments, all data blocks written to the storage system may be backed up to long-term and/or archive storage (e.g., in a remote key-value durable backup storage system).
In some embodiments, the storage layer 204 may also implement a storage layer control plane 244. Storage tier control plane 244 may be one or more computing nodes that may perform a variety of different storage system management functions. For example, the storage layer control plane 244 may implement a volume manager (not shown) that may maintain mapping information or other metadata for a volume, such as current volume state, current writers, truncation tables or other truncation information, or any other information for a volume, while the volume is maintained in different blocks, segments, and protection groups. The volume manager may communicate with clients of the storage tier 204, such as client-side storage drivers, to "mount" or "unmount" volumes for the clients, and to provide the client-side storage drivers with mapping information, protection group policies, and various other information needed to send write and read requests to the storage nodes 206. The volume manager may provide the maintained information to a storage client, such as database engine head node 214 or a client-side storage driver, or to other system components, such as backup agent 418. For example, the volume manager may provide a current volume state (e.g., clean, dirty, or restored), a current epoch or other version indicator of the volume, and/or any other information about the data volume.
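The volume manager's role of handing out volume state, mapping information, and an epoch when a client opens a volume can be pictured with the minimal sketch below; the class, the epoch-bump-on-open behavior, and the field names are illustrative assumptions only.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class VolumeMetadata:
    state: str                                   # e.g. "clean", "dirty", or "restored"
    epoch: int                                   # version indicator handed to the current writer
    protection_groups: List[str] = field(default_factory=list)

class VolumeManager:
    """Toy volume manager: tracks per-volume metadata and hands it out when a volume is opened."""

    def __init__(self) -> None:
        self._volumes: Dict[str, VolumeMetadata] = {}

    def create_volume(self, volume_id: str, protection_groups: List[str]) -> None:
        self._volumes[volume_id] = VolumeMetadata("clean", epoch=1,
                                                  protection_groups=protection_groups)

    def open_volume(self, volume_id: str) -> VolumeMetadata:
        """'Mount' a volume for a client: bump the epoch so stale writers can be fenced off."""
        meta = self._volumes[volume_id]
        meta.epoch += 1
        return meta
```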
In at least some embodiments, the storage tier control plane 244 can implement a backup management component. The backup management component may implement or direct a plurality of backup agents 418 that may backup data volumes stored at storage nodes. For example, in some embodiments, a task queue may be implemented that identifies backup operations to be performed on the data volumes (e.g., LSN ranges describing redo log records included in chunks or portions of data to be uploaded to a backup data store). Volume backup metadata may be included as part of the backup performed by backup agent 418, including volume geometry or configuration. Changes made to the database after the restore operation may be included in the log. In some embodiments, backup of the log records may be performed by backup agent 418 whether the log records are within exclusion or not.
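One plausible shape for the backup task queue mentioned above, in which each task names an LSN range of redo log records to upload, is sketched here; the class and method names are hypothetical and the real mechanism may differ.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional

@dataclass
class BackupTask:
    volume_id: str
    first_lsn: int   # inclusive start of the redo log range to upload
    last_lsn: int    # inclusive end of the range

class BackupTaskQueue:
    """Toy task queue drained by backup agents; each task names a chunk of redo log to upload."""

    def __init__(self) -> None:
        self._tasks: Deque[BackupTask] = deque()

    def enqueue_range(self, volume_id: str, first_lsn: int, last_lsn: int) -> None:
        self._tasks.append(BackupTask(volume_id, first_lsn, last_lsn))

    def next_task(self) -> Optional[BackupTask]:
        return self._tasks.popleft() if self._tasks else None
```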
FIG. 5 is a diagram illustrating exemplary operational aspects of a split architecture database including decoupled storage and compute layers in a database service, according to some embodiments. In this example, one or more client 102 processes can read and/or store data from/to one or more databases maintained by a database system that includes the database engine 118 of the computing layer 202 and the separate storage layer 204. In the example shown in fig. 5, database head node 214A includes a database layer component (e.g., database engine 118) and the client-side storage driver 325 described elsewhere herein. In some embodiments, the database engine 118 may perform functions such as query parsing, optimization, and execution, and transaction and consistency management, for example, as discussed in fig. 3, and/or may store data pages, transaction logs, and/or undo logs (such as those stored by the data page cache 335, the transaction log 340, and the undo log 345 of fig. 3). In various embodiments, database engine 118 may have obtained a volume epoch indicator or other identifier from the storage layer 204, granting it write access to a particular data volume, for example by sending a request to the storage layer 204 to open the data volume.
In this example, one or more client 102 processes may send a database request 515 (e.g., a query or other statement, which may include read and/or write requests for data stored by one or more of the storage nodes 206A-206N implemented by the computing device 538) to the database engine 118, and may receive a database response 517 (e.g., a response including a write acknowledgement and/or requested data) back from the computing layer 202. Each database request 515 that results in a need to write a data page may be parsed and optimized by database engine 118 to generate one or more write record requests 541, which may be sent to client-side storage driver 325 for subsequent routing to storage layer 204. In this example, client-side storage driver 325 may generate one or more redo log records 531 corresponding to each write record request 541, and may send them to particular storage nodes 206 of the particular protection group that stores the partition of user data space to which the write record requests pertain in the storage layer 204. The storage nodes 206 may perform various peer-to-peer communications to copy a redo log record 531 received at a storage node to other storage nodes that may not have received the redo log record 531. For example, not every storage node may need to receive a redo log record in order to satisfy a write quorum (e.g., three out of five storage nodes may be sufficient). The remaining storage nodes that did not receive or acknowledge the redo log record may receive an indication of the redo log record from the peer storage nodes that have acknowledged or received it. The client-side storage driver 325 may generate, for each redo log record, metadata that includes an indication of the previous log sequence number of the log record maintained at the particular protection group. The storage layer 204 may return a corresponding write acknowledgement 523 for each redo log record 531 to the client-side storage driver 325. The client-side storage driver 325 may pass these write acknowledgements to the database engine 118 (as write responses 542), which may then send the corresponding response (e.g., write acknowledgement or result set) as one of the database query responses 517 to one or more client 102 processes.
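The write-quorum behavior described above can be summarized with the following sketch, in which a redo log record is treated as durable once enough protection group members acknowledge it and the remaining members catch up via peer-to-peer gossip; the function signature and the quorum size of three are assumptions for illustration.

```python
from typing import Callable, Iterable, Set

def write_redo_record(record: bytes,
                      protection_group: Iterable[str],
                      send_to_node: Callable[[str, bytes], bool],
                      write_quorum: int = 3) -> Set[str]:
    """Send a redo log record to members of a protection group until a write quorum acknowledges it."""
    acked: Set[str] = set()
    for node in protection_group:
        if send_to_node(node, record):     # stand-in for the real storage-node RPC
            acked.add(node)
        if len(acked) >= write_quorum:
            break                          # remaining members catch up via peer-to-peer gossip
    return acked

# Example: three of five members acknowledge, so the driver can report the write as durable.
members = ["sn-1", "sn-2", "sn-3", "sn-4", "sn-5"]
acks = write_redo_record(b"redo-lsn-42", members, send_to_node=lambda node, rec: True)
assert len(acks) >= 3
```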
In this example, each database request 515 that generates a request to read a data page may be parsed and optimized by database engine 118 to generate one or more read page requests 543, which may be sent to client-side storage driver 325 for subsequent routing to storage tier 204. In this example, client-side storage driver 325 may send these requests to particular ones of the storage nodes 206 of storage tier 204, and storage tier 204 may return the requested data pages 533 to client-side storage driver 325. The client-side storage driver 325 may send the returned data pages as return page responses 544 to the database engine 118, and the database engine 118 may then use the data pages to generate a database query response 517.
In some implementations, various error and/or data loss messages 534 may be sent from the storage layer 204 to the database engine 118 (specifically, to the client-side storage driver 325). These messages may be passed from the client-side storage driver 325 to the database engine 118 as error and/or loss report messages 545 and then passed to one or more client 102 processes, optionally along with (or in lieu of) a database query response 517.
In some embodiments, backup agent 418 may receive a peer indication from storage node 206. By evaluating these indications, backup agent 418 may identify additional redo log records received at storage node 206 that have not yet been backed up. Backup agent 418 may send a chunk or object containing a set of redo log records 551 to backup storage system 570 for storage as part of a backup version of the data volume. In some embodiments, a data page 553 may be requested from a storage node for creating a full backup of a data volume (rather than a log record describing changes to the data volume) or a data volume copy of a data page that may reference a data page stored in another data volume in the backup storage system 570 and sent to the backup storage system 570.
In some embodiments, APIs 531-534 of storage layer 204 and APIs 541-545 of client-side storage driver 325 may expose the functionality of storage layer 204 to database engine 118 as if database engine 118 were a client of storage layer 204. For example, the database engine 118 (via the client-side storage drive 325) may write redo log records or request data pages via these APIs to perform (or facilitate performance of) various operations (e.g., storage, access, change log records, recovery, and/or space management operations) of the database system implemented by the combination of the database engine 118 and the storage tier 204. As shown in fig. 5, the storage layer 204 may store data blocks on storage nodes 206A-206N, each of which may have multiple SSDs attached. In some embodiments, storage layer 204 may provide high persistence for stored data blocks by applying various types of redundancy schemes.
Note that in various embodiments, API calls and responses between the computing layer 202 and the storage layer 204 (e.g., APIs 531-534) and/or between the client-side storage driver 325 and the database engine 118 (e.g., APIs 541-545), and between the backup agent 418 and the backup data store 570 in fig. 5, may be performed over a secure proxy connection (e.g., a connection managed by a gateway control plane), or may be performed over a public network, or alternatively may be performed over a private channel such as a Virtual Private Network (VPN) connection. These and other APIs to and/or between components of the database systems described herein may be implemented according to different techniques, including, but not limited to, Simple Object Access Protocol (SOAP) techniques and Representational State Transfer (REST) techniques. For example, these APIs may (but need not) be implemented as SOAP APIs or RESTful APIs. SOAP is a protocol for exchanging information in the context of web-based services. REST is an architectural style for distributed hypermedia systems. A RESTful API (which may also be referred to as a RESTful web-based service) is a web-based service API implemented using HTTP and REST techniques. In some embodiments, the APIs described herein may be wrapped with client libraries in various languages, including, but not limited to, C, C++, Java, C#, and Perl, to support integration with system components.
In the storage systems described herein, a block may be a logical concept representing a highly durable unit of storage that may be combined with (concatenated or striped across) other blocks to represent a volume. Each block may be made persistent by being a member of a single protection group. A block may provide an LSN-type read/write interface for a contiguous byte sub-range having a fixed size defined at creation time. Read/write operations to a block may be mapped to one or more appropriate segment read/write operations by the containing protection group. As used herein, the term "volume block" may refer to a block that is used to represent a particular byte sub-range within a volume.
As described above, a data volume may be composed of multiple blocks, each block represented by a protection group that is composed of one or more segments. In some implementations, log records pointing to different blocks may have interleaved LSNs. In order for changes to a volume to be persistent up to a particular LSN, it may be desirable for all log records up to that LSN to be persistent, regardless of which block they belong to. In some embodiments, a client may track outstanding log records that have not yet become persistent, and once all ULRs up to a particular LSN become persistent, the client may send a volume durable LSN (VDL) message to one protection group in the volume. The VDL may be written to all synchronous mirror segments of the protection group (i.e., the group members). This is sometimes referred to as an "unconditional VDL," and it may be periodically saved to each segment (or, more specifically, to each protection group) along with the write activity that occurs on the segment. In some implementations, the unconditional VDL may be stored in a log sector header.
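A minimal sketch of tracking the volume durable LSN on the client side, assuming the client simply records issued and durable LSNs, might look like the following; this is an illustration of the bookkeeping only, not the actual algorithm.

```python
from typing import Set

class VolumeDurabilityTracker:
    """Toy tracker for the volume durable LSN (VDL): the highest LSN such that every log
    record at or below it is durable, regardless of which block the record belongs to."""

    def __init__(self) -> None:
        self._outstanding: Set[int] = set()   # LSNs issued but not yet durable
        self._highest_issued = 0
        self._vdl = 0

    def issue(self, lsn: int) -> None:
        self._outstanding.add(lsn)
        self._highest_issued = max(self._highest_issued, lsn)

    def mark_durable(self, lsn: int) -> int:
        """Record that a log record became durable and return the (possibly advanced) VDL."""
        self._outstanding.discard(lsn)
        ceiling = min(self._outstanding) - 1 if self._outstanding else self._highest_issued
        self._vdl = max(self._vdl, ceiling)
        return self._vdl
```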
As used herein, in some embodiments, a volume may be a logical concept that represents a highly durable storage unit understood by a user/client/application of the storage system. A volume may be stored or maintained in a distributed repository that appears to the user/client/application as a single consistent, ordered log of write operations to the various user pages of the repository. Each write operation may be encoded in a User Log Record (ULR) representing a logically ordered change to the contents of a single user page within the volume. As noted above, a ULR may also be referred to herein as a redo log record. Each ULR may include a unique identifier (e.g., a Logical Sequence Number (LSN)) assigned from a log sequence number space. Each ULR may be saved to one or more synchronous segments in the log-structured store that form a Protection Group (PG) maintaining the partition (i.e., block) of user data space to which the update indicated by the log record pertains, in order to provide high persistence and availability for the ULR. A volume may provide an LSN-type read/write interface for a variable-size contiguous byte range. In some implementations, a volume may be composed of multiple blocks, each made persistent by a protection group. In such implementations, a volume may represent a storage unit composed of a mutable sequence of contiguous volume blocks. Reads and writes to a volume may be mapped to corresponding reads and writes to the constituent volume blocks. In some implementations, the volume size may be changed by adding or removing volume blocks at the end of the volume.
FIG. 6 is a flowchart illustrating operations 600 of a method for expanding a database cluster using a storage and computing decoupled architecture, according to some embodiments. Some or all of the operations 600 (or other processes described herein, or variations and/or combinations thereof) are performed under control of one or more computer systems configured with executable instructions and are implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) that collectively execute on one or more processors. The code is stored on a computer readable storage medium, for example, in the form of a computer program comprising instructions executable by one or more processors. The computer-readable storage medium is non-transitory. In some embodiments, one or more (or all) of operations 600 are performed by database service 110 of fig. 2.
Operation 600 comprises determining, at block 602, to add a new Database (DB) node to a cluster of one or more DB nodes implementing a database in a database service, wherein the cluster comprises a first DB node of a computation layer of the database service. In some embodiments, block 602 includes receiving a user-initiated command to expand the database on behalf of the database service, or determining that an automatic expansion condition of an expansion policy is satisfied, wherein the automatic expansion condition is based at least in part on metrics associated with the cluster or with respective DB nodes of the cluster.
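As one hedged example of an automatic expansion condition based on per-node metrics, the check below flags a cluster for scale-out when any node crosses a CPU or storage threshold; the metric names and threshold values are assumptions, not prescribed values.

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass
class NodeMetrics:
    node_id: str
    cpu_utilization: float       # 0.0 - 1.0
    storage_used_bytes: int

def should_add_db_node(metrics: Iterable[NodeMetrics],
                       cpu_threshold: float = 0.8,
                       storage_threshold_bytes: int = 2 * 1024 ** 4) -> bool:
    """Return True when any DB node in the cluster crosses a scale-out threshold."""
    return any(m.cpu_utilization >= cpu_threshold or
               m.storage_used_bytes >= storage_threshold_bytes
               for m in metrics)
```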
Operation 600 comprises, at block 604, expanding the cluster of one or more DB nodes to add at least the new DB node. Block 604 includes, at block 610, determining a split of the data of a first volume managed by the first DB node, the split including a first portion of the data that is to remain managed by the first DB node and a second portion of the data that is to be managed by the new DB node.
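The split determined at block 610 could, in the simplest case, be an even division of the partition keys managed by the first DB node, as in the sketch below; a real split might instead weight partitions by data size or traffic.

```python
from typing import List, Tuple

def split_partitions(partition_keys: List[str]) -> Tuple[List[str], List[str]]:
    """Split the partition keys managed by the first DB node into the portion it keeps
    and the portion handed to the new DB node (an even split by count; a real split
    might instead weight partitions by data size or traffic)."""
    ordered = sorted(partition_keys)
    midpoint = len(ordered) // 2
    return ordered[:midpoint], ordered[midpoint:]
```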
Block 604 includes obtaining a second DB node in the computation layer to use as a new DB node at block 612. Obtaining the second DB node may include, for example, obtaining a compute instance from a hardware virtualization service, starting a compute instance, and so on.
Block 604 includes cloning a first volume at block 614 within a storage tier different from the compute tier to produce a second volume for use by a second DB node, wherein the cloning does not involve making a complete copy of the data of the first volume. In some embodiments, block 614 includes creating a copy-on-write volume as the second volume based on creating one or more metadata elements for the second volume that identify the data of the first volume, wherein the copy-on-write volume does not include a different copy of the data of the first volume.
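The copy-on-write cloning of block 614 can be pictured as creating only metadata that points at the source volume, with pages copied into the clone lazily on first write; the sketch below is a toy model of that idea, and the structures shown are assumptions rather than the storage layer's actual metadata.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Volume:
    volume_id: str
    pages: Dict[int, bytes] = field(default_factory=dict)   # pages written locally to this volume
    parent: Optional["Volume"] = None                        # source volume, if this is a clone

def clone_volume(source: Volume, new_volume_id: str) -> Volume:
    """Create a copy-on-write clone: only metadata pointing at the source is created."""
    return Volume(volume_id=new_volume_id, parent=source)

def read_page(volume: Volume, page_no: int) -> bytes:
    """Reads fall through to the parent until the clone has written its own copy of a page."""
    if page_no in volume.pages:
        return volume.pages[page_no]
    if volume.parent is not None:
        return read_page(volume.parent, page_no)
    raise KeyError(page_no)

def write_page(volume: Volume, page_no: int, data: bytes) -> None:
    """Writes land in the clone's private copy, leaving the source volume untouched."""
    volume.pages[page_no] = data
```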
Block 604 includes, at block 616, applying a set of database modifications to the second volume after cloning, wherein the set of database modifications results from database traffic relating to the second portion received by the first DB node during cloning of the first volume.
In some embodiments, block 616 occurs within/through the storage layer without involvement of the computing layer, and may include: identifying, by the storage layer, a modification to the first volume; and causing, by the storage layer, the modification to be applied to the second volume.
Block 604 includes updating the proxy layer to begin sending database traffic related to the second portion of the data to the second DB node at block 618. In some embodiments, block 618 includes at least transmitting a message to the proxy layer, the message including data identifying a set of one or more partition key values and identifying the second DB node.
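Block 618's proxy-layer update can be thought of as installing new routes from partition key values to the second DB node, as in this minimal sketch; the routing-table shape and method names are illustrative assumptions.

```python
from typing import Dict, Iterable

class ProxyRoutingTable:
    """Toy proxy-layer routing cache mapping partition key values to DB node endpoints."""

    def __init__(self, default_node: str) -> None:
        self._default_node = default_node
        self._routes: Dict[str, str] = {}

    def update_routes(self, partition_keys: Iterable[str], db_node: str) -> None:
        """Apply an update message: start sending traffic for these keys to the given node."""
        for key in partition_keys:
            self._routes[key] = db_node

    def node_for(self, partition_key: str) -> str:
        return self._routes.get(partition_key, self._default_node)
```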
In some embodiments, block 604 further comprises configuring the first DB node to insert data corresponding to the set of database modifications into a modification log, and transmitting the data from the modification log to the second DB node, wherein application of the set of database modifications occurs at least in part via the second DB node of the computing layer and comprises obtaining, by the second DB node, data corresponding to the set of database modifications from the first DB node.
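A hedged sketch of replaying the modification log on the second DB node, filtering to only the partitions that moved, follows; the entry format and the apply callback are hypothetical stand-ins for executing the captured modifications against the second volume.

```python
from typing import Callable, Iterable, Set, Tuple

# Each modification log entry is (partition_key, captured_modification); only entries for
# partitions moving to the new DB node need to be replayed against the cloned volume.
ModEntry = Tuple[str, str]

def replay_modification_log(entries: Iterable[ModEntry],
                            moved_partitions: Set[str],
                            apply_modification: Callable[[str], None]) -> int:
    """Replay, on the second DB node, the modifications captured while the clone was made."""
    replayed = 0
    for partition_key, modification in entries:
        if partition_key in moved_partitions:
            apply_modification(modification)   # stand-in for executing against the second volume
            replayed += 1
    return replayed
```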
Optionally, in some embodiments, the operation 600 further comprises: at block 630, causing the first DB node to delete the second portion of the data from the first volume; and causing the second DB node to delete the first portion of the data from the second volume at block 632. In some embodiments, blocks 630 and/or 632 may be completed after cluster expansion, but one or both of these blocks may be initiated during expansion of block 604 (and optionally performed partially or fully during expansion), and may be completed within some amount of time after expansion ends, as deletion may not be time critical, and thus delaying its completion until after expansion occurs may allow the expanded database to be put into production for faster use.
In some embodiments, the database service is implemented within a multi-tenant cloud provider network, and the storage tier provides the first volume and the second volume to the first DB node and the second DB node via use of a plurality of storage nodes implemented by a plurality of computing devices, wherein the plurality of computing devices are multi-tenant and provide the storage tier service of the database service to a plurality of different databases of a plurality of different users.
In some embodiments, operation 600 further comprises waiting for a period of time to complete an existing transaction involving the first DB node prior to cloning of the first volume, and then performing cloning of the first volume.
In some embodiments, during or after cloning of the first volume, but prior to the update of the proxy layer, operation 600 further comprises: receiving, at the first DB node, a request relating to the second portion of data to be managed by the new DB node; and sending a response message to the proxy layer indicating that the request is denied. In some embodiments, the response message indicates that the request may be retried, and in some embodiments, the response message further includes a hint value that allows the proxy layer to update its routing cache to send additional requests involving the second portion of data to the second DB node.
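The retry-with-hint behavior can be illustrated as follows: on a denied response, the proxy updates its routing cache from the hint (if present) and retries against the indicated node; the response fields and cache structure are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class DeniedResponse:
    retryable: bool                  # whether the proxy may retry the request
    hint_node: Optional[str] = None  # node now responsible for the partition, if provided

def handle_denied(routing_cache: Dict[str, str], partition_key: str,
                  response: DeniedResponse, current_node: str) -> Optional[str]:
    """Update the proxy's routing cache from the hint, then decide where (or whether) to retry."""
    if response.hint_node is not None:
        routing_cache[partition_key] = response.hint_node
    if not response.retryable:
        return None                  # surface the error to the client instead of retrying
    return routing_cache.get(partition_key, current_node)
```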
FIG. 7 illustrates an example provider network (or "service provider system") environment, according to some embodiments. Provider network 700 may provide resource virtualization for customers via one or more virtualization services 710 that allow customers to purchase, lease, or otherwise obtain instances 712 of virtualized resources (including, but not limited to, computing resources and storage resources) implemented on devices within one or more provider networks in one or more data centers. A local Internet Protocol (IP) address 716 may be associated with a resource instance 712; the local IP address is the internal network address of the resource instance 712 on the provider network 700. In some embodiments, the provider network 700 may also provide public IP addresses 714 and/or public IP address ranges (e.g., Internet Protocol version 4 (IPv4) or Internet Protocol version 6 (IPv6) addresses) that customers may obtain from the provider network 700.
Conventionally, provider network 700 may allow a customer of a service provider (e.g., a customer operating one or more customer networks 750A-750C (or "client networks") including one or more customer devices 752) to dynamically associate at least some public IP addresses 714 assigned or allocated to the customer with particular resource instances 712 assigned to the customer via virtualization service 710. The provider network 700 may also allow a customer to remap a public IP address 714, previously mapped to one virtualized computing resource instance 712 assigned to the customer, to another virtualized computing resource instance 712 also assigned to the customer. Using virtualized computing resource instances 712 and public IP addresses 714 provided by the service provider, a customer of the service provider (such as an operator of customer networks 750A-750C) can, for example, implement customer-specific applications and present the customer's applications on an intermediate network 740 such as the internet. Other network entities 720 on the intermediate network 740 may then generate traffic destined for a destination public IP address 714 published by the client networks 750A-750C; the traffic is routed to the service provider data center and, at the data center, routed via the network substrate to the local IP address 716 of the virtualized computing resource instance 712 that is currently mapped to the destination public IP address 714. Similarly, response traffic from the virtualized computing resource instance 712 may be routed back over the intermediate network 740 to the source entity 720 via the network substrate.
A local IP address, as used herein, refers to an internal or "private" network address of a resource instance in, for example, a provider network. A local IP address may be within an address block reserved by Internet Engineering Task Force (IETF) Request for Comments (RFC) 1918 and/or have an address format specified by IETF RFC 4193, and may be mutable within the provider network. Network traffic originating outside the provider network is not routed directly to a local IP address; instead, the traffic uses a public IP address that maps to the local IP address of the resource instance. The provider network may include networking devices or equipment that provide Network Address Translation (NAT) or similar functionality to perform the mapping from public IP addresses to local IP addresses and from local IP addresses to public IP addresses.
A public IP address is an Internet-mutable network address assigned to a resource instance by the service provider or by a customer. Traffic routed to a public IP address is translated, for example via a 1:1 NAT, and forwarded to the corresponding local IP address of the resource instance.
Some public IP addresses may be assigned to specific resource instances by the provider network infrastructure; these public IP addresses may be referred to as standard public IP addresses, or simply standard IP addresses. In some embodiments, the mapping of standard IP addresses to local IP addresses of resource instances is a default startup configuration for all resource instance types.
At least some of the public IP addresses may be assigned to or obtained by a customer of the provider network 700; the customer may then assign its assigned public IP addresses to particular resource instances assigned to the customer. These public IP addresses may be referred to as client public IP addresses, or simply client IP addresses. Instead of being assigned to a resource instance by the provider network 700 as in the case of a standard IP address, a client IP address may be assigned to a resource instance by a client, e.g., via an API provided by the service provider. Unlike standard IP addresses, client IP addresses are assigned to client accounts and can be remapped to other resource instances by the corresponding clients as needed or desired. A client IP address is associated with the client account, not a particular resource instance, and the client controls the IP address until the client chooses to release it. Unlike conventional static IP addresses, client IP addresses allow clients to mask resource instance or availability zone failures by remapping the client's public IP addresses to any resource instance associated with the client account. For example, client IP addresses enable a client to engineer around problems with the client's resource instances or software by remapping the client IP address to an alternate resource instance.
FIG. 8 is a block diagram of an example provider network environment providing storage services and hardware virtualization services to customers, according to some embodiments. The hardware virtualization service 820 provides a plurality of computing resources 824 (e.g., computing instances 825, such as VMs) to the guest. Computing resources 824 may be provided to customers of provider network 800 (e.g., customers implementing customer network 850), for example, as services. Each computing resource 824 may be provided with one or more local IP addresses. The provider network 800 may be configured to route packets from the local IP address of the computing resource 824 to the public internet destination and to route packets from the public internet source to the local IP address of the computing resource 824.
Provider network 800 may provide client network 850, coupled to intermediate network 840 via local network 856, for example, with the ability to implement virtual computing system 892 via hardware virtualization service 820 coupled to intermediate network 840 and provider network 800. In some embodiments, the hardware virtualization service 820 may provide one or more APIs 802, such as web services interfaces, via which the client network 850 can access functionality provided by the hardware virtualization service 820, such as via a console 894 (e.g., web-based application, standalone application, mobile application, etc.) of the client device 890. In some embodiments, each virtual computing system 892 at the customer network 850 may correspond to a computing resource 824 leased or otherwise provided to the customer network 850 at the provider network 800.
From an instance of virtual computing system 892 and/or another client device 890 (e.g., via console 894), a client can access functionality of storage service 810, e.g., via one or more APIs 802, to access and store data from and to storage resources 818A-818N (e.g., folders or "buckets," virtualized volumes, databases, etc.) of virtual data store 816 provided by provider network 800. In some embodiments, a virtualized data storage gateway (not shown) may be provided at customer network 850 that may locally cache at least some data (e.g., frequently accessed data or critical data) and may communicate with storage service 810 via one or more communication channels to upload new or modified data from the local cache such that a master data store (virtualized data store 816) is maintained. In some embodiments, a user via virtual computing system 892 and/or another client device 890 may mount and access volumes of virtual data store 816 via storage service 810 acting as a storage virtualization service, and these volumes may appear to the user as local (virtualized) storage 898.
Although not shown in fig. 8, virtualization services may also be accessed from resource instances within provider network 800 via API 802. For example, a client, device service provider, or other entity may access a virtualization service from within a corresponding virtual network on provider network 800 via API 802 to request allocation of one or more resource instances within the virtual network or within another virtual network.
Illustrative System
In some embodiments, a system implementing some or all of the techniques described herein may include a general-purpose computer system (such as computer system 900 shown in fig. 9) that includes or is configured to access one or more computer-accessible media. In the illustrated embodiment, computer system 900 includes one or more processors 910 coupled to a system memory 920 via an input/output (I/O) interface 930. Computer system 900 also includes a network interface 940 coupled to I/O interface 930. Although fig. 9 illustrates computer system 900 as a single computing device, in various embodiments, computer system 900 may include one computing device or any number of computing devices configured to operate together as a single computer system 900.
In various embodiments, computer system 900 may be a single processor system including one processor 910, or a multi-processor system including several processors 910 (e.g., two, four, eight, or another suitable number). Processor 910 may be any suitable processor capable of executing instructions. For example, in various embodiments, processor 910 may be a general purpose processor or an embedded processor implementing any of a variety of Instruction Set Architectures (ISAs), such as the x86, ARM, powerPC, SPARC, or MIPS ISAs, or any other suitable ISA. In a multiprocessor system, each of processors 910 typically, but not necessarily, implement the same ISA.
The system memory 920 may store instructions and data that may be accessed by the processor 910. In various embodiments, system memory 920 may be implemented using any suitable memory technology, such as Random Access Memory (RAM), static RAM (SRAM), synchronous Dynamic RAM (SDRAM), non-volatile/flash-type memory, or any other type of memory. In the illustrated embodiment, program instructions and data that implement one or more desired functions, such as those methods, techniques, and data described above, are shown stored within system memory 920 as database service code 925 (e.g., executable to implement database service 110 in whole or in part) and data 926.
In some embodiments, I/O interface 930 may be configured to coordinate I/O traffic between processor 910, system memory 920, and any peripheral devices in the device, including network interface 940 and/or other peripheral interfaces (not shown). In some embodiments, I/O interface 930 may perform any necessary protocol, timing, or other data transformations to convert data signals from one component (e.g., system memory 920) into a format suitable for use by another component (e.g., processor 910). In some embodiments, for example, I/O interface 930 may include devices that support attachment over various types of peripheral buses, such as variants of the Peripheral Component Interconnect (PCI) bus standard or the Universal Serial Bus (USB) standard. In some embodiments, for example, the functionality of I/O interface 930 may be split into two or more separate components, such as a north bridge and a south bridge. Additionally, in some embodiments, some or all of the functionality of the I/O interface 930 (such as an interface to the system memory 920) may be incorporated directly into the processor 910.
For example, the network interface 940 may be configured to allow data to be exchanged between the computer system 900 and other devices 960 attached to one or more networks 950 (such as other computer systems or devices shown in fig. 1). In various embodiments, for example, network interface 940 may support communication via any suitable wired or wireless general-purpose data network, such as various types of ethernet networks. In addition, network interface 940 may support communication via a telecommunications/telephony network (such as an analog voice network or a digital fiber communications network), via a Storage Area Network (SAN) (such as a fibre channel SAN), and/or via any other suitable type of network and/or protocol.
In some embodiments, computer system 900 includes one or more offload cards 970A or 970B (including one or more processors 975, and possibly one or more network interfaces 940) connected using an I/O interface 930 (e.g., a bus implementing a version of the Peripheral Component Interconnect Express (PCI-E) standard, or another interconnect such as a QuickPath interconnect (QPI) or UltraPath interconnect (UPI)). For example, in some embodiments, computer system 900 may act as a host electronic device hosting computing resources such as compute instances (e.g., operating as part of a hardware virtualization service), and one or more offload cards 970A or 970B execute a virtualization manager that may manage compute instances executing on the host electronic device. As an example, in some embodiments, the offload card 970A or 970B may perform compute instance management operations such as pausing and/or un-pausing compute instances, launching and/or terminating compute instances, performing memory transfer/copy operations, and the like. In some embodiments, these management operations may be performed by the offload card 970A or 970B in cooperation with a hypervisor (e.g., upon request from a hypervisor) executed by the other processors 910A-910N of the computer system 900. However, in some embodiments, the virtualization manager implemented by the offload card 970A or 970B may accommodate requests from other entities (e.g., from the compute instance itself) and may not cooperate with (or serve) any separate hypervisor.
In some embodiments, system memory 920 may be one embodiment of a computer-accessible medium configured to store program instructions and data as described above. However, in other embodiments, program instructions and/or data may be received, transmitted, or stored on different types of computer-accessible media. In general, computer-accessible media may include any non-transitory storage media or memory media, such as magnetic or optical media, e.g., disks or DVD/CDs coupled to computer system 900 via I/O interface 930. Non-transitory computer-accessible storage media may also include any volatile or non-volatile media such as RAM (e.g., SDRAM, double Data Rate (DDR) SDRAM, SRAM, etc.), read Only Memory (ROM), etc., that may be included in some embodiments of computer system 900 as system memory 920 or another type of memory. Furthermore, computer-accessible media may include transmission media or signals such as electrical, electromagnetic, or digital signals transmitted via a communication medium (such as a network and/or wireless link), such as may be implemented via network interface 940.
The various embodiments discussed or presented herein may be implemented in a wide variety of operating environments, which in some cases may include one or more user computers, computing devices, or processing devices that may be used to operate any of a number of applications. The user devices or client devices may include any of a number of general purpose personal computers, such as desktop or laptop computers running a standard operating system, as well as cellular, wireless, and handheld devices running mobile software and capable of supporting a number of networking and messaging protocols. Such a system may also include a number of workstations running any of a variety of commercially available operating systems and other known applications for purposes such as development and database management. These devices may also include other electronic devices such as dummy terminals, thin clients, gaming systems, and/or other devices capable of communicating via a network.
Most embodiments use at least one network familiar to those skilled in the art to support communications using any of a variety of widely available protocols, such as transmission control protocol/internet protocol (TCP/IP), file Transfer Protocol (FTP), universal plug and play (UPnP), network File System (NFS), common Internet File System (CIFS), extensible messaging and presence protocol (XMPP), appleTalk, and the like. The network may include, for example, a Local Area Network (LAN), a Wide Area Network (WAN), a Virtual Private Network (VPN), the internet, an intranet, an extranet, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network, and any combination thereof.
In embodiments using web servers, the web servers may run any of a variety of server or middle-tier applications, including HTTP servers, File Transfer Protocol (FTP) servers, Common Gateway Interface (CGI) servers, data servers, Java servers, business application servers, and the like. The servers may also be capable of executing programs or scripts in response to requests from user devices, such as by executing one or more web applications that may be implemented as one or more scripts or programs written in any programming language (such as Java, C, C#, or C++) or any scripting language (such as Perl, Python, PHP, or TCL), as well as combinations thereof. The servers may also include database servers, including, but not limited to, those commercially available from Oracle(R), Microsoft(R), Sybase(R), IBM(R), and the like. The database servers may be relational or non-relational (e.g., "NoSQL"), distributed or non-distributed, and the like.
The environments disclosed herein may include various data stores, as well as other memories and storage media as discussed above. These may reside in various locations, such as on a storage medium local to (and/or residing in) one or more computers, or on a storage medium remote from any or all of the computers on the network. In a particular set of embodiments, the information may reside in a Storage Area Network (SAN) familiar to those skilled in the art. Similarly, any necessary files for performing the functions attributed to a computer, server, or other network device may be stored locally and/or remotely as appropriate. Where the system includes computerized devices, each such device may include hardware elements that may be electrically coupled via a bus, including, for example, at least one Central Processing Unit (CPU), at least one input device (e.g., a mouse, keyboard, controller, touch screen, or keypad), and/or at least one output device (e.g., a display device, printer, or speaker). Such a system may also include one or more storage devices, such as disk drives, optical storage devices, and solid state storage devices, such as Random Access Memory (RAM) or Read Only Memory (ROM), as well as removable media devices, memory cards, flash memory cards, and the like.
Such devices may also include a computer-readable storage medium reader, a communication device (e.g., modem, network card (wireless or wired), infrared communication device, etc.), and working memory as described above. The computer-readable storage medium reader may be connected to or configured to receive a computer-readable storage medium representing remote, local, fixed, and/or removable storage devices and storage media for temporarily and/or more permanently containing, storing, transmitting, and retrieving computer-readable information. The system and various devices will typically also include a number of software applications, modules, services, or other elements within at least one working memory device, including an operating system and application programs, such as a client application or web browser. It should be appreciated that alternative embodiments may have many variations from the above description. For example, custom hardware may also be used, and/or particular elements may be implemented in hardware, software (including portable software, such as applets), or both. In addition, connections to other computing devices, such as network input/output devices, may be employed.
Storage media and computer-readable media for storing code or portions of code may include any suitable media known or used in the art, including storage media and communication media implemented in any method or technology for storage of information, such as, but not limited to, volatile and nonvolatile media and removable and non-removable media, including RAM, ROM, Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by the system device. Based on the present disclosure and the teachings provided herein, a person of ordinary skill in the art will appreciate other ways and/or methods of implementing the various embodiments.
In the foregoing description, various embodiments are described. For purposes of explanation, specific configurations and details are set forth in order to provide a thorough understanding of the embodiments. It will also be apparent, however, to one skilled in the art that the embodiments may be practiced without these specific details. Furthermore, well-known features may be omitted or simplified in order not to obscure the described embodiments.
Bracketed text and boxes with dashed borders (e.g., large dashes, small dashes, dot dashes, and dots) are used herein to illustrate optional operations to add additional features to some embodiments. However, such representations should not be construed to mean that these are the only options or alternative operations, and/or that in some embodiments, the boxes with solid borders are not optional.
Reference numerals with suffix letters (e.g., 818A-818N) may be used to indicate that one or more instances of the referenced entity may be present in various embodiments, and when multiple instances are present, each instance need not be identical, but may share some general features or function in a common manner. Furthermore, the use of a particular suffix is not meant to imply that a particular number of entities are present, unless specifically indicated to the contrary. Thus, in various embodiments, two entities using the same or different suffix letters may or may not have the same number of instances.
References to "one embodiment," "an example embodiment," etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Furthermore, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
Furthermore, in the various embodiments described above, disjunctive language such as the phrase "at least one of A, B, or C" is intended to be understood to mean A, B, or C, or any combination thereof (e.g., A, B, and/or C), unless specifically indicated otherwise. Similarly, language such as "at least one or more of A, B, and C" (or "one or more of A, B, and C") is intended to be understood to mean A, B, or C, or any combination thereof (e.g., A, B, and/or C). Thus, such disjunctive language is neither intended to, nor should it be construed to, imply that a given embodiment requires at least one of A, at least one of B, and at least one of C to each be present.
As used herein, the term "based on" (or the like) is an open term used to describe one or more factors that affect a determination or other action. It should be understood that the term does not exclude additional factors that may affect the determination or action. For example, the determination may be based solely on the listed factors or on the factors and one or more additional factors. Thus, if action a is "based on" B, it should be understood that B is one factor affecting action a, but this does not preclude the action from being based on one or more other factors, such as factor C, as well. However, in some cases, action a may be based entirely on B.
Articles such as "a/an" should generally be construed to include one or more of the recited items unless expressly stated otherwise. Thus, phrases such as "a device configured as … …" or "a computing device" are intended to include one or more stated devices. The one or more stated devices may be collectively configured to perform the recited operations. For example, "a processor configured to perform operations A, B and C" may include a first processor configured to perform operation a working with a second processor configured to perform operations B and C.
Furthermore, the word "may" is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). The words "include," "including," and "comprising" are used to indicate open-ended relationships and therefore mean including, but not limited to. Similarly, the words "have," "having," and "has" also indicate open-ended relationships, and thus mean having, but not limited to. The terms "first," "second," "third," and so forth as used herein are used as labels for the nouns that they precede and do not imply any type of ordering (e.g., spatial, temporal, logical, etc.) unless such an ordering is explicitly indicated otherwise.
At least some implementations of the disclosed technology may be described in view of the following clauses:
1. A computer-implemented method, comprising:
Determining to add a new Database (DB) node to a cluster of one or more DB nodes implementing a database in a database service, wherein the cluster of DB nodes comprises a first DB node operating in a computation layer of the database service, the computation layer being implemented by a first one or more electronic devices and managing query execution, wherein the database service further comprises a storage layer implemented by a second one or more electronic devices, the storage layer comprising a storage node managing storage and persistence of data of a database operated by the computation layer;
expanding the cluster of DB nodes to add at least the new DB node, comprising:
Determining a split of data of a first volume managed by the first DB node, the split comprising a first portion of the data that is to remain managed by the first DB node and a second portion of the data that is to be managed by the new DB node;
Obtaining a second DB node in the computation layer to be used as the new DB node;
Cloning the first volume within the storage layer to produce a second volume for use by the second DB node, wherein the cloning does not involve making a complete copy of the first volume within the storage layer;
configuring the second DB node to utilize the second volume;
Applying a set of database modifications to the second volume after the cloning, wherein the set of database modifications results from traffic received by the first DB node during the cloning of the first volume; and
Updating a proxy layer to begin sending database traffic relating to the second portion of the data to the second DB node;
Causing the first DB node to delete the second portion of the data from the first volume; and
Causing the second DB node to delete the first portion of data from the second volume.
2. The computer-implemented method of clause 1, further comprising waiting, prior to the cloning of the first volume, for a period of time for an existing transaction involving the first DB node to complete, and thereafter performing the cloning of the first volume.
3. The computer-implemented method of any of clauses 1-2, wherein during or after the cloning of the first volume, but before the updating of the proxy layer, the method further comprises:
Receiving, at the first DB node, a request relating to the second portion of data to be managed by the new DB node; and
Transmitting a response message to the proxy layer indicating that the request is denied, wherein the response message at least one of:
(a) indicates that the request can be retried; or
(b) includes a hint value that allows the proxy layer to update its routing cache to send additional requests involving the second portion of data to the second DB node.
4. A computer-implemented method, comprising:
determining to add a new Database (DB) node to a cluster of one or more DB nodes implementing a database in a database service, wherein the cluster comprises a first DB node of a computation layer of the database service; and
Expanding the cluster of one or more DB nodes to add at least the new DB node, comprising:
Determining a split of data of a first volume managed by the first DB node, the split comprising a first portion of the data that is to remain managed by the first DB node and a second portion of the data that is to be managed by the new DB node;
Obtaining a second DB node in the computation layer to be used as the new DB node;
cloning the first volume within a storage tier different from the compute tier to produce a second volume for use by the second DB node, wherein the cloning does not involve making a complete copy of the data of the first volume;
Applying a set of database modifications to the second volume after the cloning, wherein the set of database modifications results from database traffic involving the second portion received by the first DB node during the cloning of the first volume; and
The proxy layer is updated to begin sending database traffic relating to the second portion of the data to the second DB node.
5. The computer-implemented method of clause 4, further comprising:
Causing the first DB node to delete the second portion of the data from the first volume; and
Causing the second DB node to delete the first portion of data from the second volume.
6. The computer-implemented method of clause 5, wherein after the expanding of the cluster, the first DB node deletes at least some of the second portion of the data from the first volume or the second DB node deletes at least some of the first portion of data from the second volume.
7. The computer-implemented method of any of clauses 4 to 6, wherein expanding the cluster further comprises:
configuring the first DB node to insert data corresponding to the set of database modifications into a modification log; and
Transmitting the data from the modification log to the second DB node,
Wherein the applying of the set of database modifications occurs at least in part via the second DB node of the computation layer and comprises obtaining, by the second DB node, the data corresponding to the set of database modifications from the first DB node.
8. The computer-implemented method of any of clauses 4 to 7, wherein the application of the set of database modifications occurs within the storage layer without participation of the computing layer, and comprising:
Identifying, by the storage layer, a modification to the first volume; and
Causing, by the storage layer, the modification to be applied to the second volume.
9. The computer-implemented method of any of clauses 4 to 8, wherein cloning the first volume to produce the second volume comprises:
Creating a copy-on-write volume as the second volume based on creating one or more metadata elements for the second volume that identify the data of the first volume, wherein the copy-on-write volume does not include a different copy of the data of the first volume.
10. The computer-implemented method of any of clauses 4 to 9, wherein the determining to add the new DB node to the cluster comprises:
(a) Receiving a user-initiated command, on behalf of the database service, to expand the database; or
(b) Determining that an automatic expansion condition of an expansion policy is satisfied, wherein the automatic expansion condition is based at least in part on metrics associated with the cluster or respective DB nodes of the cluster.
11. The computer-implemented method of any of clauses 4 to 10, wherein the database service is implemented within a multi-tenant cloud provider network, and wherein the storage tier provides the first volume and the second volume to the first DB node and the second DB node via use of a plurality of storage nodes implemented by a plurality of computing devices, wherein the plurality of computing devices are multi-tenant and provide storage tier services of the database service to a plurality of different databases of a plurality of different users.
12. The computer-implemented method of any of clauses 4 to 11, wherein updating the proxy layer comprises at least:
Transmitting a message to the proxy layer, the message including data identifying a set of one or more partition key values and identifying the second DB node.
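Clause 12 reduces the proxy update to a single message naming the moved partition key values and the DB node now responsible for them. A minimal sketch of building and applying such a message (the field names are assumptions):

```python
def build_routing_update(partition_keys, new_node_id: str) -> dict:
    """Control-plane side: these partition key values now belong to `new_node_id`."""
    return {"partition_keys": list(partition_keys), "db_node": new_node_id}

def apply_routing_update(routing_table: dict, message: dict) -> None:
    """Proxy side: repoint each listed partition key at the new DB node."""
    for key in message["partition_keys"]:
        routing_table[key] = message["db_node"]

routes = {"k1": "db-node-1", "k2": "db-node-1"}
apply_routing_update(routes, build_routing_update(["k2"], "db-node-2"))
assert routes == {"k1": "db-node-1", "k2": "db-node-2"}
```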
13. The computer-implemented method of any of clauses 4 to 12, further comprising:
Waiting a period of time, before the cloning of the first volume is performed, for an existing transaction involving the first DB node to complete.
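Clause 13 adds a quiesce step: before the clone starts, the system waits, typically up to some bound, for transactions already open on the first DB node to drain. A sketch assuming an `open_transaction_count` callable and an illustrative timeout:

```python
import time

def wait_for_quiesce(open_transaction_count, timeout_s: float = 30.0,
                     poll_s: float = 0.5) -> bool:
    """Poll until the first DB node reports no in-flight transactions or the
    timeout expires; returns True if the node drained in time."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if open_transaction_count() == 0:
            return True
        time.sleep(poll_s)
    return False
```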
14. The computer-implemented method of any of clauses 4 to 13, wherein during or after the cloning of the first volume, but before the updating of the proxy layer, the method further comprises:
Receiving, at the first DB node, a request relating to the second portion of data to be managed by the new DB node; and
Sending, to the proxy layer, a response message indicating that the request is rejected.
15. The computer-implemented method of clause 14, wherein the response message indicates that the request can be retried.
16. The computer-implemented method of clause 15, wherein the response message further comprises a hint value that allows the proxy layer to update its routing cache to send additional requests related to the second portion of data to the second DB node.
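Clauses 14 to 16 cover requests that race with the split: the first DB node rejects a request for the moved portion, and the rejection can be marked retryable and carry a hint that lets the proxy repoint its routing cache and retry against the second DB node. An end-to-end sketch with assumed message fields:

```python
def handle_rejection(routing_cache: dict, request: dict, response: dict, send):
    """Proxy-side handling of a retryable rejection that carries a routing hint."""
    if response.get("status") == "rejected" and response.get("retryable"):
        hint = response.get("hint")   # e.g. {"partition_key": ..., "db_node": ...}
        if hint:
            routing_cache[hint["partition_key"]] = hint["db_node"]
        return send(request, routing_cache.get(request["partition_key"]))
    return response

# The old node refuses a request for a moved key and hints at db-node-2.
cache = {"user#42": "db-node-1"}
rejection = {"status": "rejected", "retryable": True,
             "hint": {"partition_key": "user#42", "db_node": "db-node-2"}}
result = handle_rejection(cache, {"partition_key": "user#42", "op": "get"},
                          rejection, send=lambda req, node: f"served by {node}")
assert cache["user#42"] == "db-node-2" and result == "served by db-node-2"
```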
17. A system, comprising:
A first one or more computing devices for implementing a computing layer of a database service in a multi-tenant provider network to manage query execution for a database;
A second one or more computing devices for implementing a storage tier of the database service to manage storage and persistence of data of the database operated by the computing tier; and
A third one or more computing devices for implementing a control plane of the database service in the multi-tenant provider network, the control plane comprising instructions that, when executed, cause the control plane to:
Determining to add a new DB node to a cluster of one or more Database (DB) nodes implementing a database, wherein the cluster includes a first DB node implemented in the computing layer; and
Expanding the cluster of one or more DB nodes to add at least the new DB node, comprising:
Determining a split of data of a first volume managed by the first DB node, the split comprising a first portion of the data that is to remain managed by the first DB node and a second portion of the data that is to be managed by the new DB node;
Obtaining a second DB node in the computation layer to be used as the new DB node;
Causing the storage layer to clone the first volume to generate a second volume for use by the second DB node, wherein the cloning does not involve making a complete copy of the data of the first volume;
Causing a set of database modifications to be applied to the second volume after the cloning, wherein the set of database modifications results from database traffic involving the second portion that is received by the first DB node during the cloning of the first volume; and
Updating a proxy layer to begin sending database traffic relating to the second portion of the data to the second DB node.
18. The system of clause 17, wherein the control plane further comprises instructions that when executed further cause the control plane to:
Causing the first DB node to delete the second portion of the data from the first volume; and
Causing the second DB node to delete the first portion of data from the second volume.
19. The system of any one of clauses 17 to 18, wherein at least one of causing the first DB node to delete the second portion of the data from the first volume or causing the second DB node to delete the first portion of data from the second volume occurs after the expansion of the cluster.
20. The system of any of clauses 17 to 19, wherein to expand the cluster, the control plane further:
Configuring the first DB node to insert data corresponding to the set of database modifications into a data stream,
Wherein the applying of the set of database modifications occurs at least in part via the computation layer and comprises obtaining, by the second DB node, the data corresponding to the set of database modifications from the data stream.
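Clauses 17 to 20 cast the same flow as control-plane behavior: determine the split, obtain the second DB node, have the storage layer perform the copy-on-write clone, stream the catch-up modifications to the new node, then repoint the proxy and schedule the lazy deletes. A compressed orchestration sketch tying the steps together follows; every call below is a placeholder rather than an API of the disclosure:

```python
def scale_out(control_plane, cluster_id: str, split) -> None:
    """Hypothetical control-plane workflow; ordering mirrors clauses 17-20."""
    first_node = control_plane.get_node(cluster_id, split.source_node)
    second_node = control_plane.provision_db_node(cluster_id)            # compute layer
    stream = control_plane.create_modification_stream(first_node,
                                                      split.moving_partitions)
    second_volume = control_plane.clone_volume(first_node.volume_id)     # storage layer, no full copy
    control_plane.attach_volume(second_node, second_volume)
    control_plane.replay_stream(second_node, stream)                     # catch-up writes
    control_plane.update_proxy_routing(split.moving_partitions, second_node)
    control_plane.schedule_pruning(first_node, second_node, split)       # lazy deletes (clauses 18-19)
```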
The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the disclosure as set forth in the claims.
Claims (15)
1. A computer-implemented method, comprising:
determining to add a new Database (DB) node to a cluster of one or more DB nodes implementing a database in a database service, wherein the cluster comprises a first DB node of a computation layer of the database service; and
Expanding the cluster of one or more DB nodes to add at least the new DB node, comprising:
Determining a split of data of a first volume managed by the first DB node, the split comprising a first portion of the data that is to remain managed by the first DB node and a second portion of the data that is to be managed by the new DB node;
Obtaining a second DB node in the computation layer to be used as the new DB node;
cloning the first volume within a storage tier different from the compute tier to produce a second volume for use by the second DB node, wherein the cloning does not involve making a complete copy of the data of the first volume;
applying a set of database modifications to the second volume after the cloning, wherein the set of database modifications results from database traffic involving the second portion that is received by the first DB node during the cloning of the first volume; and
Updating a proxy layer to begin sending database traffic relating to the second portion of the data to the second DB node.
2. The computer-implemented method of claim 1, further comprising:
Causing the first DB node to delete the second portion of the data from the first volume; and
Causing the second DB node to delete the first portion of data from the second volume.
3. The computer-implemented method of claim 2, wherein after the expanding of the cluster, the first DB node deletes at least some of the second portion of the data from the first volume or the second DB node deletes at least some of the first portion of data from the second volume.
4. The computer-implemented method of any of claims 1-3, wherein expanding the cluster further comprises:
configuring the first DB node to insert data corresponding to the set of database modifications into a modification log; and
Transmitting the data from the modification log to the second DB node,
Wherein the applying of the set of database modifications occurs at least in part via the second DB node of the computation layer and comprises obtaining, by the second DB node, the data corresponding to the set of database modifications from the first DB node.
5. The computer-implemented method of any of claims 1-4, wherein the application of the set of database modifications occurs within the storage layer without participation of the computation layer, and comprises:
Identifying, by the storage layer, a modification to the first volume; and
Causing, by the storage layer, the modification to be applied to the second volume.
6. The computer-implemented method of any of claims 1 to 5, wherein cloning the first volume to produce the second volume comprises:
Creating a copy-on-write volume as the second volume based on creating one or more metadata elements for the second volume that identify the data of the first volume, wherein the copy-on-write volume does not include a different copy of the data of the first volume.
7. The computer-implemented method of any of claims 1-6, wherein the determining to add the new DB node to the cluster comprises:
Receiving, by the database service, a user-initiated command for expanding the database; or
Determining that an automatic expansion condition of an expansion policy is satisfied, wherein the automatic expansion condition is based at least in part on metrics associated with the cluster or respective DB nodes of the cluster.
8. The computer-implemented method of any of claims 1 to 7, wherein the database service is implemented within a multi-tenant cloud provider network, and wherein the storage tier provides the first volume and the second volume to the first DB node and the second DB node via use of a plurality of storage nodes implemented by a plurality of computing devices, wherein the plurality of computing devices are multi-tenant and provide storage tier services of the database service to a plurality of different databases of a plurality of different users.
9. The computer-implemented method of any of claims 1 to 8, wherein updating the proxy layer comprises at least:
Transmitting a message to the proxy layer, the message including data identifying a set of one or more partition key values and identifying the second DB node.
10. The computer-implemented method of any of claims 1 to 9, further comprising:
Waiting a period of time, before the cloning of the first volume is performed, for an existing transaction involving the first DB node to complete.
11. The computer-implemented method of any of claims 1 to 10, wherein during or after the cloning of the first volume, but before the updating of the proxy layer, the method further comprises:
Receiving, at the first DB node, a request relating to the second portion of data to be managed by the new DB node; and
Sending, to the proxy layer, a response message indicating that the request is rejected.
12. The computer-implemented method of claim 11, wherein the response message indicates that the request can be retried.
13. The computer-implemented method of claim 12, wherein the response message further includes a hint value that allows the proxy layer to update its routing cache to send additional requests related to the second portion of data to the second DB node.
14. A system, comprising:
A first one or more computing devices for implementing a computing layer of a database service in a multi-tenant provider network to manage query execution for a database;
A second one or more computing devices for implementing a storage tier of the database service to manage storage and persistence of data of the database operated by the computing tier; and
A third one or more computing devices for implementing a control plane of the database service in the multi-tenant provider network, the control plane comprising instructions that, when executed, cause the control plane to:
Determining to add a new DB node to a cluster of one or more Database (DB) nodes implementing a database, wherein the cluster includes a first DB node implemented in the computing layer; and
Expanding the cluster of one or more DB nodes to add at least the new DB node, comprising:
Determining a split of data of a first volume managed by the first DB node, the split comprising a first portion of the data that is to remain managed by the first DB node and a second portion of the data that is to be managed by the new DB node;
Obtaining a second DB node in the computation layer to be used as the new DB node;
Causing the storage layer to clone the first volume to generate a second volume for use by the second DB node, wherein the cloning does not involve making a complete copy of the data of the first volume;
Causing a set of database modifications to be applied to the second volume after the cloning, wherein the set of database modifications results from database traffic involving the second portion that is received by the first DB node during the cloning of the first volume; and
Updating a proxy layer to begin sending database traffic relating to the second portion of the data to the second DB node.
15. The system of claim 14, wherein the control plane further comprises instructions that when executed further cause the control plane to:
Causing the first DB node to delete the second portion of the data from the first volume; and
Causing the second DB node to delete the first portion of data from the second volume.
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US63/283,364 | 2021-11-26 | ||
US17/547,673 US20230169093A1 (en) | 2021-11-26 | 2021-12-10 | Fast database scaling utilizing a decoupled storage and compute architecture |
US17/547,673 | 2021-12-10 | ||
PCT/US2022/080353 WO2023097229A1 (en) | 2021-11-26 | 2022-11-22 | Fast database scaling utilizing a decoupled storage and compute architecture |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118489106A true CN118489106A (en) | 2024-08-13 |
Family
ID=92197531
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202280087387.1A Pending CN118489106A (en) | 2021-11-26 | 2022-11-22 | Fast database expansion with storage and computation decoupled architecture |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118489106A (en) |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11271893B1 (en) | Systems, methods and devices for integrating end-host and network resources in distributed memory | |
US11855905B2 (en) | Shared storage model for high availability within cloud environments | |
US11693572B2 (en) | Optimized deduplication based on backup frequency in a distributed data storage system | |
US11893264B1 (en) | Methods and systems to interface between a multi-site distributed storage system and an external mediator to efficiently process events related to continuity | |
US12063270B2 (en) | Commissioning and decommissioning metadata nodes in a running distributed data storage system | |
US9549026B2 (en) | Software-defined network attachable storage system and method | |
US8700842B2 (en) | Minimizing write operations to a flash memory-based object store | |
US8484161B2 (en) | Live file system migration | |
US20210344772A1 (en) | Distributed database systems including callback techniques for cache of same | |
US10140312B2 (en) | Low latency distributed storage service | |
US20170277715A1 (en) | File system mode switching in a distributed storage service | |
US10852996B2 (en) | System and method for provisioning slave storage including copying a master reference to slave storage and updating a slave reference | |
US10244069B1 (en) | Accelerated data storage synchronization for node fault protection in distributed storage system | |
US11455290B1 (en) | Streaming database change data from distributed storage | |
US10642783B2 (en) | System and method of using in-memory replicated object to support file services wherein file server converts request to block I/O command of file handle, replicating said block I/O command across plural distributed storage module and performing said block I/O command by local storage module | |
US11537619B1 (en) | Replica group modification in a distributed database | |
US20230237170A1 (en) | Consistent access control lists across file servers for local users in a distributed file server environment | |
US11341001B1 (en) | Unlimited database change capture for online database restores | |
US20230169093A1 (en) | Fast database scaling utilizing a decoupled storage and compute architecture | |
EP4437427A1 (en) | Fast database scaling utilizing a decoupled storage and compute architecture | |
US11461192B1 (en) | Automatic recovery from detected data errors in database systems | |
Stamatakis et al. | Scalability of replicated metadata services in distributed file systems | |
CN118489106A (en) | Fast database expansion with storage and computation decoupled architecture | |
US11397752B1 (en) | In-memory ingestion for highly available distributed time-series databases | |
US11809404B1 (en) | Mixed-mode replication for sharded database systems |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||