
CN114201444B - Method, medium, program product, system, and apparatus for storage management

Info

Publication number
CN114201444B
Authority
CN
China
Prior art keywords
page
value
reference counter
data
chip memory
Prior art date
Legal status
Active
Application number
CN202111480059.1A
Other languages
Chinese (zh)
Other versions
CN114201444A (en)
Inventor
杨经纬
李甲
赵鹏
徐立宝
谢钢锋
王磊
许飞翔
仇小钢
Current Assignee
Hexaflake Nanjing Information Technology Co Ltd
Original Assignee
Hexaflake Nanjing Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hexaflake Nanjing Information Technology Co Ltd filed Critical Hexaflake Nanjing Information Technology Co Ltd
Priority to CN202111480059.1A priority Critical patent/CN114201444B/en
Publication of CN114201444A publication Critical patent/CN114201444A/en
Priority to PCT/CN2022/107493 priority patent/WO2023103397A1/en
Application granted granted Critical
Publication of CN114201444B publication Critical patent/CN114201444B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/781 On-chip cache; Off-chip memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0866 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches for peripheral storage systems, e.g. disk cache
    • G06F12/0873 Mapping of cache memory to specific storage devices or parts thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 Handling requests for interconnection or transfer
    • G06F13/20 Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/28 Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A method, medium, program product, system, and apparatus for storage management are described herein. In some embodiments of the present disclosure, a page to be accessed by an application is determined, the page having data stored therein; a value of a first reference counter corresponding to the page is set based on the number of processing engines to be started to run the application; the value of the first reference counter is updated based on the access status of the application to the page on the processing engines; and the data stored in the page is released or replaced based on the updated value of the first reference counter. By maintaining the first reference counter, the reuse rate of pages, and thus of the storage space of the on-chip memory, can be increased.

Description

Method, medium, program product, system, and apparatus for storage management
Technical Field
Embodiments of the present disclosure relate generally to the field of electronics and, more particularly, relate to a method, medium, program product, system, and apparatus for storage management.
Background
Processing systems such as graphics processing units (GPUs) have been proposed in which multiple processor cores provide parallel, multi-threaded processing and can therefore deliver higher processing speeds. These systems can break complex computations into smaller tasks to be processed in parallel by multiple cores and multiple threads, thereby reducing processing time.
In some cases, the amount of data to be processed by the program (e.g., tensor data) may be large, while the capacity of the on-chip memory (e.g., L2 cache) is limited, so that a large amount of data cannot be loaded into the on-chip memory at the same time, which may affect the parallel processing efficiency of the data.
Disclosure of Invention
Embodiments of the present disclosure provide a scheme for storage management.
In a first aspect, a storage management method is provided. The method includes determining a page to be accessed by an application program, the page having data stored therein; setting a value of a first reference counter corresponding to a page based on the number of processing engines to be started to run the application; updating a value of a first reference counter based on an access state of an application to a page on a processing engine; and releasing or replacing data in the page based on the updated value of the first reference counter.
According to this scheme, the value of the first counter can reflect how the page is used by the PEs, for example how many PEs are still using the page and how many have finished with it, so that the page is not deleted or replaced while it is still in use. By maintaining the first reference counter, the reuse rate of pages, and thus of the storage space of the on-chip memory, can be increased.
In some embodiments, the method further comprises: setting a value of a second reference counter corresponding to the page based on a ready state of data in the page in the on-chip memory or the off-chip memory; and running an application on the processing engine based on the value of the second reference counter.
In some embodiments, setting the value of the second reference counter for the page includes: setting a second reference counter to a first value if the data in the page is not ready in on-chip memory or off-chip memory; and setting the second reference counter to a second value if the data in the page is ready in the on-chip memory or the off-chip memory.
In some embodiments, running the application includes: if the second reference counter is the first value, preventing the application from performing access operations on the page at the processing engine; and if the second reference counter is a second value, allowing the application to perform an access operation on the page at the processing engine.
In some embodiments, setting the value of the first reference counter for the page includes: the value of the first reference counter is set equal to the number of processing engines.
In some embodiments, updating the value of the first reference counter comprises: if the access operation of the application to the page on one of the processing engines is completed, the value of the first reference counter is decremented by one.
In some embodiments, releasing or replacing data in a page includes: if the updated value of the first reference counter indicates that there is no processing engine to perform an access operation on the page, the data in the page is released or replaced from on-chip memory.
In some embodiments, another application is to access a page, and releasing the data in the page or replacing the page from on-chip memory includes: if the updated value of the first reference counter indicates that there is no processing engine to perform an access operation on the page and the value of the second reference counter indicates that the page is accessible, the data in the page is replaced with data to be accessed by another application.
In some embodiments, pages have corresponding page table entries in the page table and are mapped to physical addresses in physical memory space.
In a second aspect of the present disclosure, a computer-readable storage medium is provided. The medium stores a plurality of programs configured for execution by one or more processing units, the plurality of programs comprising instructions for performing the method of the first aspect.
In a third aspect of the present disclosure, a computer program product is provided. The computer program product comprises a plurality of programs configured for execution by one or more processing units, the plurality of programs comprising instructions for performing the method of the first aspect.
In a fourth aspect of the present disclosure, an accelerator system is provided. The accelerator system includes: a processing unit; and a memory coupled to the processing unit, the memory having instructions stored therein that when executed by the processing unit perform the method of the first aspect.
In a fifth aspect of the present disclosure, an apparatus for storage management is provided. The apparatus includes a page determining unit configured to determine a page to be accessed by an application program, the page storing data therein; a first counter setting unit configured to set a value of a first reference counter corresponding to a page based on the number of processing engines to be started to run an application; a first counter updating unit configured to update a value of a first reference counter based on an access state of the application program to the page on the processing engine; and a data release or replacement unit configured to release or replace data in the page based on the updated value of the first reference counter.
In some embodiments, the apparatus further comprises: a second counter setting unit configured to set a value of a second reference counter corresponding to the page based on a ready state of data in the page in the on-chip memory or the off-chip memory; and a program running unit configured to run the application program on the processing engine based on the value of the second reference counter.
In some embodiments, the second counter setting unit includes: a first value setting unit configured to set the second reference counter to a first value if the data in the page is not ready in the on-chip memory or the off-chip memory; and a second value setting unit configured to set the second reference counter to a second value if the data in the page is ready in the on-chip memory or the off-chip memory.
In some embodiments, the program execution unit includes: an access blocking unit configured to prevent the application from performing an access operation on the page on the processing engine if the second reference counter is a first value; and an access start unit configured to allow the application to perform an access operation on the page on the processing engine if the second reference counter is a second value.
In some embodiments, the first counter setting unit is configured to: the value of the first reference counter is set equal to the number of processing engines.
In some embodiments, the first counter updating unit is configured to: if an access operation of an application to a page on one of the processing engines is completed, the value of the first reference counter is decremented by one.
In some embodiments, the data release or replacement unit is configured to: if the updated value of the first reference counter indicates that there is no processing engine to perform an access operation on the page, the data in the page is released or replaced from on-chip memory.
In some embodiments, another application is to access the page, and the data release or replacement unit is configured to: if the updated value of the first reference counter indicates that there is no processing engine to perform an access operation on the page and the value of the second reference counter indicates that the page is accessible, the data in the page is replaced with the data to be accessed by the other application.
The summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be apparent from the following more particular descriptions of exemplary embodiments of the disclosure as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the disclosure.
FIG. 1 illustrates a schematic diagram of an example environment in which various embodiments of the present disclosure may be implemented;
FIG. 2 illustrates a schematic block diagram of a chip according to some embodiments of the disclosure;
FIG. 3 illustrates a schematic block diagram of a parallel processing engine architecture, according to some embodiments of the present disclosure;
FIG. 4 illustrates an example of on-chip virtual storage space, according to some embodiments of the present disclosure;
FIG. 5 illustrates a schematic flow diagram of a method of storage management according to some embodiments of the present disclosure;
FIG. 6 illustrates a schematic flow diagram of a method of storage management according to further embodiments of the present disclosure;
FIG. 7 illustrates a schematic block diagram of an apparatus for storage management, according to some embodiments of the present disclosure; and
fig. 8 shows a schematic block diagram of an apparatus for storage management according to further embodiments of the present disclosure.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are illustrated in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The term "comprising" and variations thereof as used herein means open ended, i.e., "including but not limited to. The term "or" means "and/or" unless specifically stated otherwise. The term "based on" means "based at least in part on". The terms "one example embodiment" and "some embodiments" mean "at least one example embodiment. The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like, may refer to different or the same object. Other explicit and implicit definitions are also possible below.
As mentioned above, the amount of data (e.g., tensor data) to be accessed at the time of execution of an application program may be large, and the capacity of on-chip memory (e.g., L2 cache) is limited, so that a large amount of data cannot be loaded into the on-chip memory at the same time, which may affect the parallel processing efficiency of the data.
In some embodiments of the present disclosure, a scheme for on-chip virtual storage is presented. Unlike virtual storage techniques that utilize secondary storage devices (e.g., hard disk, remote memory, etc.) to extend the main memory space, in embodiments of the present disclosure, the on-chip memory and off-chip memory of the accelerator system are combined into a unified virtual memory space. The data to be accessed by the application program is addressed in the virtual storage space, so that a larger unified addressable storage space is provided for the application program, the usable memory space is expanded, and the parallel processing efficiency is improved, particularly for large-size data such as tensor data.
FIG. 1 illustrates a schematic diagram of an example environment 100 in which various embodiments of the present disclosure may be implemented. The example environment 100 may be, for example, an electronic device with computing capability such as a computer. In some embodiments, the example environment 100 includes, for example, a central processing unit (CPU) 20, a system memory 10, a north bridge/memory bridge 30, an accelerator system 40, a device memory 50, and a south bridge/input-output (IO) bridge 60. The system memory 10 may be, for example, a volatile memory such as dynamic random access memory (DRAM). The north bridge/memory bridge 30 integrates, for example, a memory controller and a PCIe controller; it is responsible for data exchange between the CPU 20 and high-speed interfaces, and bridges the CPU 20 and the south bridge/IO bridge 60. The south bridge/IO bridge 60 serves the computer's low-speed interfaces, such as a Serial Advanced Technology Attachment (SATA) controller. The accelerator system 40 may include, for example, devices or chips such as graphics processing units (GPUs) and artificial intelligence (AI) accelerators for accelerating the processing of graphics, video, and the like. The device memory 50 may be, for example, a volatile memory such as DRAM located external to the accelerator system 40.
In this disclosure, the device memory 50 is also referred to as off-chip memory, i.e., memory located outside the chip of the accelerator system 40. In contrast, the accelerator system 40 also has volatile memory within its chip, such as a level one (L1) cache and optionally a level two (L2) cache. This will be described in detail below in connection with some embodiments of the present disclosure.
While one example environment 100 in which embodiments of the present disclosure may be implemented is shown in FIG. 1, the present disclosure is not limited thereto. Some embodiments of the present disclosure may also be used in other application environments having accelerator systems such as GPUs, for example ARM and RISC-V architectures.
Fig. 2 illustrates a schematic block diagram of an accelerator system 200, according to some embodiments of the present disclosure. The accelerator system 200 may be, for example, one particular implementation of the chip of the accelerator system 40 of FIG. 1. The accelerator system 200 is, for example, an accelerator system chip such as a GPU. In some embodiments, accelerator system 200 includes a stream processor (SP) 210, a page table means 220, a processing engine (PE) unit 230, a direct memory access (DMA) controller 240, an L1 cache 260, and an L2 cache 250.
The accelerator system 200 may be controlled by a host device such as the CPU 20 and receive instructions from the CPU 20. SP 210 analyzes instructions from CPU 20 and assigns the analyzed operations to PE unit 230, page table means 220, and DMA controller 240 for processing.
The page table means 220 maintains page tables for managing on-chip virtual storage accessible to the accelerator system 200. As will be described in detail below, in embodiments of the present disclosure, on-chip memory, such as L2 cache 250, and off-chip memory, such as device memory 50 in FIG. 1, form a virtual storage system with a uniform virtual addressing space. The page tables in page table means 220 may be accessed and updated jointly by SP 210, PE unit 230, and DMA controller 240.
PE unit 230 may include one or more processing engines (PEs) PE_1, PE_2, …, PE_N, where N represents an integer greater than or equal to 1. Each processing engine may be associated with a corresponding L1 cache. For example, as shown in FIG. 2, PE_1 may be associated with L1_1, PE_2 may be associated with L1_2, and so on. Each PE in PE unit 230 may be a single-instruction multiple-thread (SIMT) device. Fig. 3 illustrates a schematic diagram of a parallel PE structure 300 of SIMT in accordance with some embodiments of the present disclosure. The parallel PE structure 300 shown in fig. 3 may be implemented within a PE in PE unit 230.
As shown, there may be one or more threads 320-1, 320-2, … 320-M in the PE, where M is an integer greater than or equal to 1, and the data to be processed by each thread is from a respective buffer 310-1, 310-2, … 310-M. Each thread in a PE may have its own register file and all threads of each PE also share a unified register file (uniform register file) (not shown).
Multiple PEs may perform the same or different processing tasks in parallel, and may perform address translation and access to target data in memory in parallel, thereby reducing processing time. For example, when performing computing tasks such as deep learning (DL), a PE may perform processing such as sorting or convolution on the data to be processed.
A user (e.g., a programmer) may write an application to achieve a particular goal. For an application requiring a large amount of computation, the application may be run in parallel on multiple PEs, each processing a different portion of the data. Such an application is also called a kernel. Further, one or more threads may be started on each PE. Each thread may exchange thread-level data between its own register file and the memory subsystem. It will be appreciated that a user may specify that multiple (e.g., tens, hundreds, or even more) threads be started on a PE to perform certain operations in parallel. Each thread has its own arithmetic logic execution unit and uses its own memory address, employing a typical load-store architecture. Each execution unit includes a floating-point/fixed-point unit supporting multiple data types, and an arithmetic logic unit.
Most instructions perform arithmetic and logical operations, such as addition, subtraction, multiplication, and division of floating-point and fixed-point numbers, or logical AND, OR, and NOT. The operands come from registers. Memory read-write instructions provide data exchange between the registers and on-chip/off-chip memory. In general, all execution units in a PE may execute the same instruction in synchronization. By using predicate registers, part of the execution units may be masked, thereby implementing the function of branch instructions.
In some embodiments, the data processed by the accelerator system 200 may be multi-dimensional or one-dimensional tensor data. For example, in some embodiments, the tensor may be a four-dimensional tensor having four dimensions D1, D2, D3, and D4, and the tensor's size may differ across the dimensions. In other embodiments, the tensor may be a one-, two-, three-, or higher-dimensional tensor, which is not limiting of the present disclosure.
Further, in embodiments of the present disclosure, tensors may internally support element types such as uint8, int8, bfloat16, float16, uint16, int16, float32, int32, uint32, and other custom element types; this, too, is not limiting of the present disclosure. Tensor addressing uses the element as its basic unit. For example, if the element type is int8, the addressing base unit is one byte; if the element type is int16, the addressing base unit is two bytes, and so on.
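As a small illustration of element-unit addressing (a sketch only; the type tags and helpers below are hypothetical, not an interface defined by this disclosure), the byte offset of an element is its linear element index scaled by the element size:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical element-type tags mapped to their sizes in bytes. */
typedef enum { ELEM_INT8, ELEM_INT16, ELEM_FLOAT16, ELEM_BFLOAT16,
               ELEM_INT32, ELEM_FLOAT32 } elem_type;

static size_t elem_size(elem_type t)
{
    switch (t) {
    case ELEM_INT8:                      return 1; /* base unit: one byte  */
    case ELEM_INT16: case ELEM_FLOAT16:
    case ELEM_BFLOAT16:                  return 2; /* base unit: two bytes */
    default:                             return 4; /* int32 / float32      */
    }
}

/* Addressing is in units of elements: byte offset = index * element size. */
static size_t elem_byte_offset(size_t elem_index, elem_type t)
{
    return elem_index * elem_size(t);
}
```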
In a typical computing system, on-chip storage is faster and consumes less power, but its capacity is limited, while off-chip storage has longer access latency, higher power consumption, and relatively low bandwidth. Typically, on-chip storage is designed as a cache and is not explicitly addressable. In a typical computer system, main memory is usually off-chip storage, and data accesses use physical addresses.
As mentioned above, unlike existing on-chip memory mapping and management approaches, in embodiments of the present disclosure the on-chip memory is managed as virtual memory rather than as an L2 cache, forming a uniformly addressable virtual memory space together with the off-chip memory and providing programs with a virtual on-chip storage view. The data to be accessed by an application is managed by a page table indicating the mapping between the logical addresses of the data in the virtual memory space and the physical addresses in on-chip memory or off-chip memory.
Thus, when an application is run, data is accessed via the page table, while physically the data may be stored in on-chip memory or off-chip memory. The unified virtual on-chip storage space benefits not only storage space management but also program design and execution. For example, an application may address the data to be accessed using logical addresses, without knowing the data's physical address information or on which physical medium the virtually stored data resides. This allows a programmer to configure the different data to be processed conveniently and flexibly, as long as the logical address corresponding to the portion of data each application is to process is defined. The running program need not manage data migration.
Storage management in virtual storage space according to some embodiments of the present disclosure will be described below with reference to fig. 4 and 5. Fig. 4 illustrates a schematic block diagram of a portion of a virtual storage space 400, according to some embodiments of the present disclosure. FIG. 5 illustrates a flow chart of an example process 500 of storage management. The process 500 may be implemented in the accelerator system 200.
Virtual memory space 400 is mapped to on-chip memory and off-chip memory. On-chip memory refers to on-chip memory of the accelerator system 200, such as the L2 cache in FIG. 2, which may be Static Random Access Memory (SRAM) or other types of on-chip memory. Off-chip memory is, for example, off-chip memory of accelerator system 200, such as device memory 50 in FIG. 1, which may be Dynamic Random Access Memory (DRAM) or other type of off-chip memory.
In process 500, at block 510, a page table for a virtual memory space, such as a page table for the virtual memory space 400, is created based on the data to be accessed at execution of an application. The page table indicates at least the mapping relationship between the logical addresses of the data to be accessed in virtual memory space 400 and the physical addresses in on-chip memory or off-chip memory. At block 520, the page table is utilized to access the data while the application is being run.
In some embodiments, the page table is maintained in the accelerator system 200, for example in the page table means 220. In some embodiments, the SP 210 may receive a command sequence sent by the host to initiate the running of the application. The SP 210 may create a page table corresponding to data according to the data to be accessed at the execution of the application program to indicate a mapping relationship between a logical address and a physical address of the data.
In some embodiments, the storage structure of the data to be accessed in the virtual storage space can be flexibly defined in different application programs. In particular, data to be accessed upon execution of an application may be organized in virtual memory space 400 in segments and pages. Herein, a "segment" is sometimes also referred to as a memory segment or data segment, and a "page" is sometimes also referred to as a memory page or data page.
The data may be divided into one or more segments, and each segment may include one or more pages. The number and size of segments, and the number and size of pages may be determined according to the application. The running of an application in each PE may use one or more segments, each of which may include one or more pages. The data to be accessed at the runtime of the application may include data to be processed by the application, such as tensor data or other forms of data. The data to be accessed at runtime of the application may also include application-related program instructions.
Since the physical addresses of data in on-chip memory or off-chip memory can be addressed by logical addresses in virtual memory space, the programmer need not care about the actual physical memory locations of the data at the time of programming. This provides a great degree of flexibility to the programmer in defining segments and pages of data, and also makes full use of on-chip storage space. In processing tasks such as machine learning, more matrix multiplication operations may need to be performed. Thus, dividing the data into larger chunks (e.g., larger segments or pages) that are executed in parallel by different PEs would be highly advantageous for improving computing performance.
Further, since it is not necessary to know the physical address information of each segment and page in which data is located, a programmer can specify a portion of data to be processed by a logical address in an application program. For example, the programmer only needs to configure the overall data (e.g., tensor data) to be processed in the application program and the structure attribute information, and the data portions to be processed by the respective PEs. The logical addresses may be mapped to physical addresses of on-chip or off-chip memory by establishing page tables while running a program on the accelerator system 200.
As an example, in fig. 4, a virtual storage space 400 is used to store tensor data having three dimensions D1, D2, and D3; the figure schematically illustrates a first segment S1, a second segment S2, and a third segment S3 of the data (program) storage space of a single application. Different applications may use different numbers of segments. Each data segment may have a different size, so a programmer may flexibly configure segments based on design needs. In some embodiments, the number of segments an application occupies may be limited; for example, it may be specified that an application may occupy at most 16 segments.
In some embodiments, within a segment, at least one page may also be provided to further subdivide the data. For tensor data, the division into pages may be implemented in any one or more dimensions, and the numbers of pages divided in each dimension are independent of one another. For example, segment S1 in FIG. 4 may have 4 pages P[1], P[2], P[3], and P[4]; the second segment S2 has only one page, and so on. Here, the page size is defined by the application program and may be variable. In embodiments of the present disclosure, each segment may have a different number of pages, so a programmer may flexibly configure the size of the pages within a segment based on design needs. For example, since the data of an entire page is loaded into on-chip memory when the application runs, the page size can be configured to fit in on-chip memory as a whole, so that the on-chip memory space can be fully utilized.
Further, each segment may be accessed, including read, write, or execute, by one or more PEs. For example, segment S1 may be accessed by 8 PEs (i.e., PE_1, PE_2, PE_3, PE_4, PE_5, PE_6, PE_7, PE_8), where segment S1 stores data in the form of tensors that these PEs are to process at runtime. It will be appreciated that to improve data processing performance, data may be processed in parallel by multiple threads at each PE. For example, in FIG. 4, the data for segment S1 may be designated as being processed by PE_1, PE_2, PE_3, PE_4, PE_5, PE_6, PE_7, and PE_8. The application itself may be stored in a segment in addition to the data to be processed by the application. For example, segment S2 may be used to store program instructions for one or more applications. The program instructions stored in segment S2 may be executed by one or more PEs.
In establishing the page table, the SP 210 in the accelerator system 200 may establish page table entries in the page table that respectively correspond to the page identifications (also referred to as "page numbers") of the pages into which the data is partitioned, each page table entry indicating at least the mapping relationship between the corresponding page and its physical address in on-chip memory or off-chip memory. The page identification (or page number) of a page is derived from the logical address of the data to be accessed.
In defining logical addresses, each segment may have a segment identification and reference address data, referred to as an anchor or reference point. For example, if a segment is divided into multiple sections for execution on different PEs, the reference address data may represent the starting coordinate point of the data assigned to each PE. For example, the reference address data may be coordinates such as (0, 0) or (0,4,0,0) over the dimensions of the tensor. Multiple PEs may have the same or different reference address data.
The data within a segment may be addressed within the segment relative to the reference address data. The logical address of the data within a segment may include a segment identification of the segment in which the data is located, the base address data, and an intra-segment offset address, wherein the intra-segment offset address may include a page identification of a page in which the data is located, and an offset value of the page relative to the base address data within the segment.
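For concreteness, the fields of such a logical address might be grouped as in the sketch below; the field names and widths are assumptions for illustration, not a format defined by this disclosure:

```c
#include <stdint.h>

/* A logical address in the virtual storage space: the identification of the
 * segment holding the data, that segment's reference address data (anchor),
 * and an intra-segment offset consisting of a page identification plus an
 * offset value relative to the anchor. Widths are illustrative only. */
typedef struct {
    uint16_t segment_id;  /* segment identification                        */
    uint32_t anchor[4];   /* reference address data, e.g., one starting
                             coordinate per tensor dimension               */
    uint32_t page_id;     /* page identification within the segment        */
    uint32_t offset;      /* offset of the data relative to the anchor     */
} logical_address;
```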
In some embodiments, in the page table, each page table entry may include a page identification of the page and a physical address to which the page is mapped, which may be a physical address in on-chip memory or off-chip memory. In some embodiments, the number of page table entries established in the page table may be limited, and the number may be configured according to the actual application. In some embodiments, the page table is stored in on-chip memory to facilitate subsequent quick access to the page table.
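A page table entry of the kind just described could then be sketched as follows. This is again an assumption-laden sketch: the reference-counter fields anticipate the counters discussed later in this document, and the layout is illustrative rather than the disclosure's actual format:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_VCOUNTERS 3  /* e.g., up to three reference counters per page */

/* One page table entry: page identification, the physical address the page
 * maps to (in on-chip or off-chip memory), and per-page reference counters. */
typedef struct {
    uint32_t page_id;                   /* page identification (page number) */
    uint64_t phys_addr;                 /* mapped physical address           */
    bool     on_chip;                   /* true if mapped to on-chip memory  */
    uint32_t v_counter[NUM_VCOUNTERS];  /* e.g., v_counter[0]: data-ready,
                                           v_counter[1]: PEs still accessing */
} page_table_entry;
```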
After the page table is established, it is used to access data while the application is running. For example, the SP 210 may receive a command sequence from the host including storage mapping information and other commands, such as an initialization command. The SP 210 may create a page table based on the storage mapping information and store it in the page table means 220. The SP 210 may then control the running of the application on the PEs.
At run time, if a PE is to access data in a target segment while running the application, the page identification (or page number) of the target page where the data is located is derived from the logical address. The logical address is further used to determine the intra-page offset address of the data within the page, which indicates the starting location of the data to be accessed within the page. The PE may access the page table via the page table means 220 to locate the corresponding page table entry based on the page identification, and read from that entry the physical address of the target page in on-chip or off-chip memory. In some implementations, an address translator may be included in the PE to perform the translation between logical and physical addresses. The PE can then access on-chip or off-chip memory using the determined physical address to access the corresponding data portion.
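Putting these pieces together, the lookup can be modeled in software as a search of the page table for the target page's entry followed by adding the intra-page offset (a sketch of what a hardware address translator would do; the struct below is the hypothetical layout from the earlier sketch):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t page_id;    /* page identification     */
    uint64_t phys_addr;  /* mapped physical address */
} page_table_entry;

/* Locate the page table entry for the target page by its page
 * identification, then form the physical byte address by adding the
 * intra-page offset. Returns 0 on a miss; real hardware would instead
 * fault or trigger loading of the page. */
static uint64_t translate(const page_table_entry *pt, size_t n_entries,
                          uint32_t page_id, uint64_t intra_page_offset)
{
    for (size_t i = 0; i < n_entries; i++)
        if (pt[i].page_id == page_id)
            return pt[i].phys_addr + intra_page_offset;
    return 0; /* page table miss */
}
```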
When an application is running, data is accessed by a physical address and an intra-page offset address. The access mode may be direct or indirect. Direct access means the data is accessed in place, whether it is located in off-chip memory or on-chip memory. Indirect access means the data to be accessed is first loaded into on-chip memory and then accessed; when the target page storing the data is mapped to off-chip memory, this requires loading the data from off-chip to on-chip memory. The access mode may be a default or may be set by the programmer as desired.
In some embodiments, where a direct access approach is employed, the page table that is constructed indicates the mapping relationship between the logical address of the data in the virtual memory space and the physical address on the on-chip memory or off-chip memory. When running an application, if it is found by the determined physical address that the target page is mapped to off-chip memory or on-chip memory, the data of the target page located in the off-chip memory or on-chip memory may be directly read or may be directly written to the off-chip memory or on-chip memory based on the physical address and the intra-page offset address. For example, if the data of the target page includes data to be processed by the application, the data to be processed may be read from or written to the off-chip memory or on-chip memory. For another example, if the data of the target page includes program instructions of an application program, the program instructions may be fetched and executed directly from off-chip memory or on-chip memory, or the program instructions may be written directly to off-chip memory or on-chip memory.
In some embodiments, where an indirect access approach is employed, it is first ensured that the data to be accessed is placed into on-chip memory, and then subsequent access operations are performed. In this case, the constructed page table indicates the mapping relationship between the logical address of the data in the virtual memory space and the physical address of the on-chip memory. If a target page mapped into off-chip memory is to be read while running an application, the physical address of the target page in off-chip memory may be used to load the data of the target page from off-chip memory to on-chip memory for access. In some embodiments, the SP 210 may instruct the DMA controller 240 in the accelerator system 200 to read data from off-chip memory and cache to on-chip memory. In some embodiments, the DMA operations and execution of the application may operate in parallel to enable stream processing. After being loaded into on-chip memory, the physical address of the target page loaded into on-chip memory may be determined through the page table, and the intra-page offset address of the data to be read may be determined. Data may be read from the on-chip memory based on the physical address of the target page in the on-chip memory and the determined intra-page offset address.
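The indirect-access path thus amounts to "stage, then access". Below is a toy software model under assumed names (the first memcpy stands in for the transfer that DMA controller 240 performs; nothing here is the disclosure's actual interface):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Toy arrays standing in for off-chip DRAM and on-chip memory (e.g., L2). */
static uint8_t off_chip_mem[16 * PAGE_SIZE];
static uint8_t on_chip_mem[4 * PAGE_SIZE];

/* Indirect read: the target page is mapped to off-chip memory, so first
 * stage the whole page into an on-chip slot (the DMA controller's job),
 * then service the read from on-chip memory at the intra-page offset. */
static void indirect_read(size_t off_chip_page, size_t on_chip_slot,
                          size_t intra_page_offset, void *dst, size_t n)
{
    uint8_t *page = on_chip_mem + on_chip_slot * PAGE_SIZE;
    memcpy(page, off_chip_mem + off_chip_page * PAGE_SIZE, PAGE_SIZE);
    memcpy(dst, page + intra_page_offset, n); /* access now hits on-chip */
}
```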
In some embodiments, if a target page mapped to off-chip memory is to be written to while the application is running, the data of the target page may first be written to on-chip memory using the physical address of the target page in on-chip memory and the determined intra-page offset address. After the application has run, the data of the target page is flushed from on-chip memory to off-chip memory by the SP 210 using the physical address of the target page in off-chip memory. For example, the SP 210 may flush data from on-chip memory to off-chip memory through a FLUSH command. This frees more on-chip memory for use at run time.
In some embodiments, in addition to recording the mapping between the page's logical and physical addresses, each page table entry in the page table indicates the values of one or more reference counters of the corresponding page. As described in more detail below with reference to fig. 6, the values of the reference counters may be used to manage the data dependencies of pages. The value of the reference counter in each page table entry may be updated based on the state in which the corresponding page is referenced in on-chip memory or off-chip memory, and in particular based on at least one of: the ready state of the corresponding page's data in on-chip or off-chip memory, or the state of access to the page by the PEs that are to access it.
In some embodiments, tensor data may be stored in an on-chip cache, such as L2 cache 250. But because the on-chip high-speed memory has limited capacity, when a tensor is large the programmer may divide it into multiple segments, each describing a portion of the tensor. The application may be started multiple times, each time moving one segment of the tensor from off-chip storage to on-chip storage in advance via DMA controller 240 for use by the application's operations. After the application has been started several times, all segments of the tensor have been processed and the whole run is finished. When the on-chip cache is large enough to hold all the tensors the application is to access, a tensor needs only one segment description and the application needs to be started only once.
In a parallel processing acceleration system, the same application may be run on one or more PEs. These applications are written to process specific data, such as tensor data. As previously described, data may be stored page by page in virtual memory space and may be used by an application at runtime after being written to on-chip memory. Thus, the same page may be used by different PEs. In this case, management of pages is important. In other embodiments of the present disclosure, it is also proposed to manage data dependencies on pages in virtual storage space using reference counters corresponding to the pages.
Fig. 6 illustrates a flow chart of a process 600 for storage management according to further embodiments of the present disclosure. The process 600 may be implemented in the accelerator system 200.
As shown in FIG. 6, at block 610, a page to be accessed by an application is determined, the page having data stored therein. In the accelerator system 200, the SP 210 may receive a command sequence sent by a host to initiate the running of an application. By analyzing the command sequence, the SP 210 may determine the pages to be accessed by the application to be run.
In some embodiments, an application may access one or more pages. Herein, "accessing" a page refers to reading or writing data in, or executing instructions from, a page in a memory space, where the memory space may be a virtual memory space obtained using the on-chip virtualization technique described above, or a memory space in an accelerator system that does not use such a technique.
As one example, in a task associated with a machine learning model, an application may be configured to perform a matrix multiplication; the application may then access three pages for storing data, where a first page stores matrix A, a second page stores matrix B, and a third page stores the result of multiplying matrix A by matrix B. In some embodiments, the addressing information of the pages to be accessed may be determined from the application's command sequence. For example, the first page storing matrix A may be page P[1] in segment 1 (S1), the second page storing matrix B may be page P[2] in segment 2 (S2), and the third page storing the matrix multiplication result may be page P[5] in segment 3 (S3). In some embodiments, a PE may fetch instructions from a page in which program instructions are stored and execute them.
At block 620, the value of the first reference counter for the page is set based on the number of PEs to be started to run the application.
An application may perform access operations on data on one or more PEs. In embodiments of the present disclosure, access to a page is managed by setting reference counters (v_counters) to avoid the data in the page being deleted or replaced while an associated PE is still using it.
In some embodiments, the values of the reference counters may be maintained in the page table. In the page table, each page table entry corresponds to a page and includes address information for the page, which supports the logical-to-physical address translation described previously; it may also include the values of reference counters. In some embodiments, each page may correspond to one or more reference counters, which may be set to respective values, as described below.
In some embodiments, the value of the first reference counter corresponding to the page (denoted v_counter[1]) may be set based on the number of PEs that are to run the application, in order to track the PEs' access to the page. In some embodiments, the value of the first reference counter may be set equal to the number of PEs that are to run the application.
In some embodiments, the value of another reference counter (sometimes referred to herein as a second reference counter) corresponding to the page may also be set for characterizing the ready state of data in the page in on-chip memory or off-chip memory. By maintaining the value of the second reference counter, it is avoided that the data in the page is accessed when it is not yet ready, e.g. for subsequent calculations.
Specifically, after determining the page to access, the SP 210 may set the value of the second reference counter corresponding to the page based on the ready state of the page in on-chip memory or off-chip memory. When the application is started on the PE, an access operation may be performed based on the value of the second reference counter.
Depending on the actual storage and execution of the respective application, the data in the page may be originally stored in on-chip memory or may be stored in off-chip memory. For a page used to store the results of the processing, such as the page used to store the results of the matrix multiplication in the example above, the data in that page is not completely written to on-chip memory until the computation of the matrix multiplication is completed.
In some embodiments, if access to the data is direct access, the value of the second reference counter may be set in consideration of the ready state of the data in on-chip memory or off-chip memory. In some embodiments, if access to the data is indirect access, i.e., the data needs to be loaded from off-chip memory to on-chip memory to reduce access latency, the value of the second reference counter may be set in consideration of the ready state of the data in on-chip memory.
In some embodiments, if it is determined that the data in the page is not ready in on-chip memory or off-chip memory, the value of the second reference counter may be set to a first value, e.g., may be set to 1, to indicate that the data for the page cannot yet be accessed. For example, if data in a page is to be moved from off-chip memory to on-chip memory for re-access, or if the data in the page is to be obtained after the end of the computation, then the value of the second reference counter corresponding to the page is set to 1 at the beginning of the move or computation to avoid access to the page by other entities when the move or computation is not complete. In some embodiments, if the data in the page is ready in on-chip memory or off-chip memory, the second reference counter is set to a second value indicating that the page is accessible or the data is ready, e.g., may be set to 0.
In some embodiments, the second reference counter may be set to the first value (e.g., 1) by the SP 210. Continuing the example above, if matrix A in the first page P[1] is physically stored in off-chip memory, the SP 210 may set a second reference counter (e.g., denoted v_counter[0]) in the page table entry corresponding to the first page to 1 to indicate that the page is being loaded, and the SP 210 may instruct the DMA controller 240 to load matrix A into on-chip memory. After completing the data loading of matrix A, the DMA controller 240 may set v_counter[0] to 0 in the page table entry corresponding to the first page. The value of the second reference counter v_counter[0] may be set similarly for the second page P[2] holding matrix B.
For the third page P[5], which stores the matrix multiplication result, if the application running on the PEs is to write the result to that page, the PEs may first set the counter (e.g., v_counter[0], which serves as the first counter from the writer's perspective and as the second counter from the reader's perspective) in the page table entry corresponding to the third page P[5] to the number of PEs, to avoid the page being accessed before the result has been completely written. After the result writing is completed, the PEs set v_counter[0] in the page table entry corresponding to the third page P[5] to 0.
For a PE that is to access a page, whether the data in the page is ready may be determined from the value of the counter. Specifically, if the value of the second reference counter indicates that the data in the page is not yet accessible (e.g., the value is 1 or the number of PEs), access to the page must wait, and the PE may first prevent the application from performing the access operation. In some embodiments, the PE may periodically poll the corresponding page table entry in the page table to determine the ready state of the data in the page. In some embodiments, if the value of the second reference counter shows that the data in the page is not ready, an interrupt may also be sent to the host to inform it that the data in the page is not ready.
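In software terms, the gate on the second reference counter is a wait-until-zero check before any access operation. A minimal sketch, assuming an atomic counter and a polling loop in place of the hardware page-table walk:

```c
#include <stdatomic.h>

/* v_counter[0] for one page: nonzero while the page's data is not ready
 * (still being loaded by DMA, or results still being written); zero once
 * the data is ready and the page may be accessed. */
static _Atomic unsigned v_counter0;

/* PE side: block the application's access operation until the page is
 * ready, modeling the periodic polling of the page table entry described
 * above (a real PE might instead raise an interrupt to the host). */
static void wait_until_page_ready(void)
{
    while (atomic_load(&v_counter0) != 0)
        ; /* spin */
}
```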
The setting of the values of some of the reference counters for the page is discussed above. For a better understanding, a specific example will be described. Still assume that one application is configured to perform multiplication of matrix a and matrix B, and that the application is to run on 4 PEs. The following is an example of the command stream of SP 210.
  • LOAD P[0] (where P[0] may include the application and static global parameters);
  • LOAD P[1] (matrix A, i.e., input data to be processed):
    ◦ set the counter v_counter[0] corresponding to page P[1] to 1, and direct DMA controller 240 to load matrix A from off-chip memory to on-chip memory;
    ◦ set the counter v_counter[1] corresponding to page P[1] to the number of PEs used (e.g., 4);
  • LOAD P[2] (matrix B, i.e., input data to be processed):
    ◦ set the counter v_counter[0] corresponding to page P[2] to 1, and direct DMA controller 240 to load matrix B from off-chip memory to on-chip memory;
    ◦ set the counter v_counter[1] corresponding to page P[2] to the number of PEs used (e.g., 4);
  • INIT P[5] (for storing the result of matrix A multiplied by matrix B):
    ◦ initialize the counter v_counter[0] corresponding to page P[5], e.g., to 4.
After determining that the page to be used is ready, the application may begin performing access operations on the selected PE or PEs.
In process 600, at block 630, the value of the first reference counter is updated based on the access status of the application to the page on the PE. As previously described, the value of the first reference counter is set based on the number of PEs. The value of the first reference counter may be used to reflect the real-time access status of the application on the PE to the page to which the first reference counter corresponds, by updating in real-time.
In some embodiments, if a PE has completed an access operation of an application to a page, the PE may update the value of the first reference counter of the corresponding page to reflect that the PE has completed use of the page. For example, the PE may decrement the value of the first reference counter by one. As access operations to a page by applications running on the respective PEs are completed, the value of the first reference counter is decremented.
After all PEs have completed the application's access operations on a page, the value of the first reference counter is updated to a value indicating that no PE is to access the page. For example, if the value of the first reference counter (e.g., v_counter[1]) corresponding to a page is set to 4, then after the application's access operations on the page have completed on all 4 PEs, the value of v_counter[1] is 0, indicating that no PE is to access the page.
At block 640, the data in the page is released or replaced based on the updated value of the first reference counter. In some embodiments, if the updated value of the first reference counter indicates that there are no PEs to perform access operations on the page, e.g., the value of the first reference counter is 0, then this means that the use of the page by the associated PEs has been completed. The data in the page may be released, e.g., deleted from on-chip memory, or replaced with other data. The choice of release or replacement depends on the particular application.
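The life cycle of the first reference counter can thus be modeled as set-to-PE-count, decrement-per-PE, release-at-zero. A hedged sketch with assumed names (the atomic operations are an implementation assumption; the disclosure only requires that each PE decrements the counter once on completion):

```c
#include <stdatomic.h>
#include <stdbool.h>

static _Atomic unsigned v_counter1; /* first reference counter of a page */

/* SP side: before launching the application, set the counter to the
 * number of PEs to be started to run it. */
static void set_first_counter(unsigned num_pes)
{
    atomic_store(&v_counter1, num_pes);
}

/* PE side: called when this PE completes its access operations on the
 * page. Returns true for the last finishing PE, at which point no PE is
 * to access the page and its data may be released from on-chip memory
 * or replaced with other data. */
static bool finish_access(void)
{
    return atomic_fetch_sub(&v_counter1, 1) == 1;
}
```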
The value of the first counter can reflect how the page is used by the PEs, e.g., how many PEs are still using the page and how many have finished with it, avoiding deletion or replacement while the page is still in use. By maintaining the first reference counter, the reuse rate of pages, and thus of the storage space of the on-chip memory, can be increased.
Continuing the SP 210 command stream example above: after executing LOAD P[1], LOAD P[2], and INIT P[5], the SP 210 continues with the following commands:
  • LAUNCH (run the application on the selected PEs):
    ◦ the LAUNCH command establishes the logical-to-physical address mappings for P[0], P[1], P[2], and P[5];
    ◦ each PE queries the value of v_counter[0] for each page, and the application's access operations on the PE refer to the value of v_counter[1] for those pages;
    ◦ after the matrix multiplication is completed, the result is put into P[5];
    ◦ each PE, when its computing task completes, decrements by 1 the v_counter[1] or v_counter[0] corresponding to each accessed page;
  • FLUSH P[5]: update the states of P[0], P[1], P[2], and P[5], and write the matrix multiplication result in P[5] to external memory. The P[5] data in on-chip memory is released.
In the example above, suppose further applications in the accelerator system 200 are to access the operation result in P[5]. Since the result in P[5] is already located in on-chip memory, the value of the corresponding reference counter (v_counter[0]) is 0 from a reader's perspective, and the data in P[5] can be used directly when the application runs on a PE. That is, the result in P[5] need not be written to external memory and then reloaded into on-chip memory.
In some embodiments, if an application is to access a page, for example, to write new data to the page, then the value of the reference counter corresponding to the page is also queried at runtime of the application. If the value of the first reference counter corresponding to the page indicates that there are no PEs to perform access operations on the page and the value of the second reference counter indicates that the page is accessible, e.g., the values of the first and second reference counters corresponding to the page are both 0, then the data to be accessed by the application may be substituted for the data already in the page and the value of the first reference counter corresponding to the page is updated synchronously. Note that the application here may be another run of the same application that previously accessed the page, or may be a different application.
The use of some of the reference counters corresponding to a page has been discussed above. The reference counters corresponding to a page may be used to manage various uses of the page.
In some embodiments, multiple reference counters may be maintained in a page table entry of a page table, such as two or more (e.g., 3) reference counters. In the case of maintaining the values of multiple reference counters, the values of some of the counters may be selected as needed to indicate the ready state of the data in the page and the access states of the applications to the page on the respective PEs. The values of the unused reference counters may be initialized to 0. Thus, whether a page is accessible, can be deleted, or can be replaced may be determined by judging whether the values of all the counters corresponding to the page are 0.
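Under these assumptions, the "all counters are 0" test could look like the sketch below; the field names and the choice of three counters are illustrative only.

```cpp
#include <algorithm>
#include <array>
#include <cstdio>

// A page table entry carrying several reference counters; unused counters
// stay initialized to 0 (layout and names are assumptions).
struct PageTableEntry {
    unsigned long      physical_addr;
    std::array<int, 3> counters;  // e.g., v_counter[0..2]
};

// The page can be deleted or replaced only when every counter is 0.
bool page_reusable(const PageTableEntry& pte) {
    return std::all_of(pte.counters.begin(), pte.counters.end(),
                       [](int c) { return c == 0; });
}

int main() {
    PageTableEntry pte{0x8000, {0, 0, 0}};
    std::printf("reusable: %s\n", page_reusable(pte) ? "yes" : "no");
}
```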
Further, it should be understood that the values of the counters given above are examples, and that other values may be set as long as the indicated state can be reflected.
Fig. 7 illustrates a schematic block diagram of an apparatus 700 for storage management, according to some embodiments of the present disclosure. The apparatus 700 may be implemented as or included in the accelerator system 200 of fig. 2. The apparatus 700 may include a plurality of modules for performing corresponding steps in the method 500 as discussed in fig. 5.
As shown in fig. 7, the apparatus 700 comprises a creation unit 710 configured to create a page table for a virtual memory space based on data to be accessed at execution of an application program. The virtual memory space is mapped to the on-chip memory and the off-chip memory. The page table indicates at least a mapping relationship between logical addresses of the data in the virtual memory space and physical addresses on the on-chip memory or the off-chip memory. The apparatus 700 further comprises an access unit 720 configured to access the data using the page table when the application program is executed.
In some embodiments, the data is divided into at least one segment, each segment including at least one page. In some embodiments, the creation unit 710 is configured to: establish, in the page table, page table entries corresponding to the pages into which the data is divided, each page table entry indicating at least the mapping relationship between the logical address of the corresponding page in the virtual storage space and the physical address on the on-chip memory or the off-chip memory.
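The following sketch illustrates such page table creation for one segment: the data is divided into pages, and one entry per page records the logical-to-physical mapping. PAGE_SIZE, the entry layout, and the linear physical placement are assumptions made for the example.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr std::uint64_t PAGE_SIZE = 4096;  // assumed page size

struct PageTableEntry {
    std::uint32_t page_id;        // page identification within the segment
    std::uint64_t physical_addr;  // address in on-chip or off-chip memory
};

// Build one page table entry per page of the data to be accessed.
std::vector<PageTableEntry> create_page_table(std::uint64_t data_bytes,
                                              std::uint64_t phys_base) {
    std::vector<PageTableEntry> table;
    const std::uint64_t num_pages = (data_bytes + PAGE_SIZE - 1) / PAGE_SIZE;
    for (std::uint64_t p = 0; p < num_pages; ++p)
        table.push_back({static_cast<std::uint32_t>(p), phys_base + p * PAGE_SIZE});
    return table;
}

int main() {
    auto table = create_page_table(10000, 0x100000);  // 10000 bytes -> 3 pages
    std::printf("pages: %zu\n", table.size());
}
```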
In some embodiments, each page table entry in the page table also indicates the value of the reference counter for the corresponding page. In some embodiments, the value of the reference counter in each page table entry is updated based on at least one of: the ready state of the data of the corresponding page in the on-chip memory or the off-chip memory, or the access state of the corresponding page by the processing engine that is to access the page.
In some embodiments, the logical address of the data in the virtual memory space indicates a segment identification of the segment in which the data is located, a page identification of the page in which the data is located, and an offset value of the data relative to the base address of the page.
In some embodiments, the data includes tensor data and/or program instructions.
In some embodiments, the page table is stored in on-chip memory.
In some embodiments, the access unit comprises: a logical address determination unit configured to determine a target page from the logical address of the data in the virtual storage space; an address translation unit configured to determine a physical address of the target page in the on-chip memory or the off-chip memory using the page table; an intra-page offset address determination unit configured to determine an intra-page offset address of the data from the logical address; and an address-based access unit configured to access the data using the physical address of the target page and the intra-page offset address.
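The cooperation of these units can be sketched as follows. The bit layout of the logical address (a page identification in the upper bits, a 12-bit intra-page offset in the lower bits) and the hash-map page table are assumptions chosen purely for illustration.

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>

constexpr int OFFSET_BITS = 12;  // assumed 4 KiB pages

struct Translation {
    std::uint64_t physical_page_base;  // from the address translation unit
    std::uint64_t intra_page_offset;   // from the intra-page offset unit
};

Translation translate(const std::unordered_map<std::uint64_t, std::uint64_t>& page_table,
                      std::uint64_t logical_addr) {
    const std::uint64_t page_id = logical_addr >> OFFSET_BITS;            // target page
    const std::uint64_t offset  = logical_addr & ((1u << OFFSET_BITS) - 1);
    return {page_table.at(page_id), offset};  // physical base plus offset in page
}

int main() {
    std::unordered_map<std::uint64_t, std::uint64_t> page_table{{0x2, 0x40000}};
    const Translation t = translate(page_table, (0x2ULL << OFFSET_BITS) | 0x1A4);
    std::printf("base=0x%llx offset=0x%llx\n",
                (unsigned long long)t.physical_page_base,
                (unsigned long long)t.intra_page_offset);
}
```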
In some embodiments, the address-based access unit is configured to: if the access to the target page includes a read of the target page, reading data directly from the on-chip memory or the off-chip memory using the physical address and the intra-page offset address; and if the access to the target page includes a write to the target page, writing data directly to the on-chip memory or the off-chip memory using the physical address and the intra-page offset address.
In some embodiments, the target page is mapped into the off-chip memory, and the physical addresses determined using the page table include the physical address of the target page in the off-chip memory and a physical address of the target page in the on-chip memory. The address-based access unit may be further configured to: if the access to the target page includes a read of the target page, loading data of the target page from the off-chip memory to the on-chip memory using the physical address of the target page in the off-chip memory, and reading the data from the on-chip memory based on the physical address of the target page in the on-chip memory and the intra-page offset address; and if the access to the target page includes a write to the target page, writing data to the on-chip memory using the physical address of the target page in the on-chip memory and the intra-page offset address, and flushing data of the target page from the on-chip memory to the off-chip memory using the physical address of the target page in the off-chip memory.
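A hedged sketch of these two access paths follows. Modeling on-chip and off-chip memory as byte vectors, and flushing the whole page on every write, are simplifications for illustration only; real hardware would batch the flush, e.g., on a FLUSH command.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

struct Page {
    bool              on_chip;   // does the page currently reside on-chip?
    std::vector<char> off_chip;  // backing data in off-chip memory
    std::vector<char> cache;     // staging area in on-chip memory
};

// Load the target page from off-chip to on-chip memory if not yet present.
static void ensure_on_chip(Page& p) {
    if (!p.on_chip) { p.cache = p.off_chip; p.on_chip = true; }
}

char read_byte(Page& p, std::size_t offset) {
    ensure_on_chip(p);           // read path: load first, then read on-chip
    return p.cache[offset];
}

void write_byte(Page& p, std::size_t offset, char v) {
    ensure_on_chip(p);
    p.cache[offset] = v;         // write path: write to on-chip memory, then
    p.off_chip = p.cache;        // flush the page back to off-chip memory
}

int main() {
    Page p{false, std::vector<char>(4096, 'x'), {}};
    write_byte(p, 7, 'y');
    std::printf("%c %c\n", read_byte(p, 7), p.off_chip[7]);  // prints: y y
}
```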
Fig. 8 shows a schematic block diagram of an apparatus 800 for storage management according to further embodiments of the present disclosure. The apparatus 800 may be implemented as or included in the accelerator system 200 of fig. 2. The apparatus 800 may include a plurality of modules for performing corresponding steps in the method 600 as discussed in fig. 6.
As shown in fig. 8, the apparatus 800 includes a page determining unit 810 configured to determine a page to be accessed by an application program, the page having data stored therein. The apparatus 800 further comprises a first counter setting unit 820 configured to set the value of the first reference counter corresponding to the page based on the number of processing engines to be started to run the application. The apparatus 800 further comprises a first counter updating unit 830 configured to update the value of the first reference counter based on the access status of the application to the page on the processing engine. The apparatus 800 further comprises a data release or replacement unit 840 configured to release or replace data in the page based on the updated value of the first reference counter.
In some embodiments, the apparatus 800 may further include: a second counter setting unit configured to set a value of a second reference counter corresponding to the page based on a ready state of data in the page in the on-chip memory or the off-chip memory; and a program running unit configured to run the application program on the processing engine based on the value of the second reference counter.
In some embodiments, the second counter setting unit includes: a first value setting unit configured to set the second reference counter to a first value if the data in the page is not ready in the on-chip memory or the off-chip memory; and a second value setting unit configured to set the second reference counter to a second value if the data in the page is ready in the on-chip memory or the off-chip memory.
In some embodiments, the program execution unit includes: an access blocking unit configured to block the application program from performing an access operation on the page on the processing engine if the second reference counter is the first value; and an access start unit configured to allow the application program to start performing an access operation on the page on the processing engine if the second reference counter is the second value.
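This gating can be sketched as follows, with busy-waiting standing in for whatever blocking mechanism the hardware provides; the concrete values (1 for the first, "not ready" value and 0 for the second, "ready" value) are assumptions consistent with the examples above.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> v_counter0{1};  // second reference counter: 1 = data not ready

// A PE blocks its access operation on the page while the counter holds the
// first value, and starts the access once it holds the second value.
void pe_access_page() {
    while (v_counter0.load() != 0)
        std::this_thread::yield();
    std::printf("PE: page ready, performing access operation\n");
}

int main() {
    std::thread pe(pe_access_page);
    v_counter0.store(0);  // e.g., the data loader marks the page ready
    pe.join();
}
```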
In some embodiments, the first counter setting unit 820 is configured to: the value of the first reference counter is set equal to the number of processing engines.
In some embodiments, the first counter updating unit 830 is configured to: if an access operation of an application to a page on one of the processing engines is completed, the value of the first reference counter is decremented by one.
In some embodiments, the data release or replacement unit 840 is configured to: if the updated value of the first reference counter indicates that there is no processing engine to perform an access operation on the page, the data in the page is released from the on-chip memory or replaced.
In some embodiments, another application is to access the page. In some embodiments, the data release or replacement unit 840 is configured to: if the updated value of the first reference counter indicates that there is no processing engine to perform an access operation on the page, and the value of the second reference counter indicates that the page is accessible, the data in the page is replaced with data to be accessed by another application.
In some embodiments, pages have corresponding page table entries in the page table and are mapped to physical addresses in physical memory space.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (17)

1. A storage management method, comprising:
determining a page to be accessed by an application program, wherein tensor data is stored in the page;
setting a value of a first reference counter corresponding to the page based on a number of processing engines to be started to run the application;
setting a value of a second reference counter corresponding to the page based on a ready state of tensor data in the page in on-chip memory or off-chip memory;
running the application on the processing engine based on the value of the second reference counter;
updating the value of the first reference counter based on the access status of the application to the page on the processing engine; and
releasing or replacing the tensor data in the page based on the updated value of the first reference counter.
2. The method of claim 1, wherein setting the value of the second reference counter for the page comprises:
setting the second reference counter to a first value if tensor data in the page is not ready in the on-chip memory or the off-chip memory; and
setting the second reference counter to a second value if tensor data in the page is ready in the on-chip memory or the off-chip memory.
3. The method of claim 2, wherein running the application comprises:
if the second reference counter is the first value, preventing the application from performing an access operation on the page on the processing engine; and
if the second reference counter is the second value, allowing the application program to perform an access operation on the page on the processing engine.
4. The method of claim 1, wherein setting the value of the first reference counter corresponding to the page comprises:
the value of the first reference counter is set equal to the number of processing engines.
5. The method of claim 1, wherein updating the value of the first reference counter comprises:
if the access operation of the application to the page on one of the processing engines is complete, the value of the first reference counter is decremented by one.
6. The method of claim 1, wherein releasing or replacing tensor data in the page comprises:
releasing the page from the on-chip memory or replacing tensor data in the page if the updated value of the first reference counter indicates that there is no processing engine to perform an access operation on the page.
7. The method of claim 1, wherein another application is to access the page and releasing the page from the on-chip memory or replacing tensor data in the page comprises:
if the updated value of the first reference counter indicates that there is no processing engine to perform an access operation on the page and the value of the second reference counter indicates that the page is accessible, replacing tensor data in the page with tensor data to be accessed by the other application.
8. The method of claim 1, wherein the page has a corresponding page table entry in a page table and is mapped to a physical address in a physical memory space.
9. A computer readable storage medium storing a plurality of programs configured for execution by one or more processing units, the plurality of programs comprising instructions for performing the method of any of claims 1-8.
10. An accelerator system, comprising:
a processing unit; and
a memory coupled with the processing unit, the memory having instructions stored therein, which when executed by the processing unit, perform the method of any of claims 1-8.
11. An apparatus for storage management, comprising:
a page determining unit configured to determine a page to be accessed by an application program, the page storing tensor data therein;
a first counter setting unit configured to set a value of a first reference counter corresponding to the page based on the number of processing engines to be started to run the application program;
a first counter updating unit configured to update a value of the first reference counter based on an access state of the application program to the page on the processing engine;
a data release or replacement unit configured to release or replace tensor data in the page based on the updated value of the first reference counter;
a second counter setting unit configured to set a value of a second reference counter corresponding to the page based on a ready state of tensor data in the page in an on-chip memory or an off-chip memory; and
and a program running unit configured to run the application program on the processing engine based on the value of the second reference counter.
12. The apparatus of claim 11, wherein the second counter setting unit comprises:
a first value setting unit configured to set the second reference counter to a first value if tensor data in the page is not ready in the on-chip memory or the off-chip memory; and
a second value setting unit configured to set the second reference counter to a second value if tensor data in the page is ready in the on-chip memory or the off-chip memory.
13. The apparatus of claim 12, wherein the program execution unit comprises:
an access blocking unit configured to block the application program from performing an access operation on the page on the processing engine if the second reference counter is the first value; and
an access start unit configured to allow the application program to perform an access operation on the page on the processing engine if the second reference counter is the second value.
14. The apparatus of claim 11, wherein the first counter setting unit is configured to:
the value of the first reference counter is set equal to the number of processing engines.
15. The apparatus of claim 11, wherein the first counter updating unit is configured to:
if the access operation of the application to the page on one of the processing engines is complete, the value of the first reference counter is decremented by one.
16. The apparatus of claim 11, wherein the data release or replacement unit is configured to:
releasing the page from the on-chip memory or replacing tensor data in the page if the updated value of the first reference counter indicates that there is no processing engine to perform an access operation on the page.
17. The apparatus of claim 11, wherein another application is to access the page, and the data release or replacement unit is configured to:
if the updated value of the first reference counter indicates that there is no processing engine to perform an access operation on the page and the value of the second reference counter indicates that the page is accessible, replacing tensor data in the page with tensor data to be accessed by the other application.