WO2022082342A1

WO2022082342A1 - Computer memory-side pointer chasing

Info

Publication number: WO2022082342A1
Application number: PCT/CN2020/121799
Authority: WO
Inventors: Shuangchen Li; Dimin Niu; Hongzhong Zheng; Tianchan GUAN
Original assignee: Alibaba Group Holding Limited
Priority date: 2020-10-19
Filing date: 2020-10-19
Publication date: 2022-04-28

Abstract

A computer system includes a memory-side component that executes instructions to implement memory-side pointer chasing in, for example, a graph neural network applied to the analysis of graphs.

Description

COMPUTER MEMORY-SIDE POINTER CHASING

BACKGROUND

A graph is a type of computer-resident data structure that models a set of objects (nodes) and the relationships (edges) between nodes.

Pointer chasing is a technique for traversing a computer-resident graph. Pointer chasing involves a sequence of memory accesses, where the data accessed at a first pointer address is used to determine a subsequent pointer address, forming a chain of data reads or fetches. In other words, a pointer is a link that references a location in memory, a pointer may point to another pointer, and so on. Pointer chasing is widely used in types of applications such as graph analytics and graph neural networks.

Each pointer/reference adds a performance cost. More specifically, pointer chasing operations consume substantial bandwidth and increase latency, affecting the speed at which graph analytics can be executed.

SUMMARY

Embodiments according to the present invention provide a solution to the problems described above. In general, embodiments include a computer system that includes a component that executes instructions to implement memory-side pointer chasing in, for example, a graph neural network (GNN) applied to the analysis of graphs including, but not limited to, a GNN that is applied to graph data structures or to recommendation system applications.

More specifically, in embodiments, the computer system includes (but is not limited to) a processor core, a memory device that stores data, and a component coupled to the processor core and to the memory device. The component may be referred to as a memory-side component.

In embodiments, the memory-side component includes a first buffer and a second buffer. The component can receive an instruction from the processor core. The instruction includes a first address corresponding to a first location in the memory device. The component can decode the instruction to identify the type of the instruction, and adds the first address to the first buffer. A first instance of data is read from the first location in the memory device and added to the second buffer. The first instance of data is then further processed according to the type of instruction. In an embodiment, the instruction includes both the first address and a second address (e.g., a vector) corresponding to a second location in the memory device, in which case a second instance of data is also read from the second location.

In an embodiment, the processing includes generating a pointer (a second address) based on the first instance of data. An instance of data can then be read from a location in a memory device corresponding to that second address.

In an embodiment, the processing includes incrementing the first address by a value of a stride to generate a second address corresponding to a second location in the memory device, and a second instance of data can then be read from the second location in the memory device.

In an embodiment, the processing includes determining a value of an instance of data from the values of two or more other instances of data read from one or more memory devices. For example, the mean of the instances of data read from memory can be determined.

In an embodiment, the processing includes performing an atomic operation on an instance read from the memory device, and writing the result to the memory device.

In embodiments, the computer system includes a hierarchy of memories including a cache level and a memory controller level. In an embodiment, the memory-side component is at the cache level. In another embodiment, the memory-side component is at the memory controller level. In embodiments, the computer system also includes a media controller, and the memory-side component is at the media controller level.

Because the memory-side component is at the memory side of the computer system, embodiments according to the invention decrease latency and save bandwidth, and also save processor cycles.

These and other objects and advantages of the various embodiments of the invention will be recognized by those of ordinary skill in the art after reading the following detailed description of the embodiments that are illustrated in the various drawing figures.

BRIEF DESCRIPTION OF DRAWINGS

The accompanying drawings, which are incorporated in and form a part of this specification and in which like numerals depict like elements, illustrate embodiments of the present disclosure and, together with the detailed description, serve to explain the principles of the disclosure.

Figure 1 is a block diagram illustrating an example of a computer system that includes a memory-side component in embodiments according to the present invention.

Figures 2A and 2B illustrate an example of a first instruction that can be executed by the memory-side component in embodiments according to the present invention.

Figures 3A and 3B illustrate an example of a second instruction that can be executed by the memory-side component in embodiments according to the present invention.

Figure 4 illustrates an example of a third instruction that can be executed by the memory-side component in embodiments according to the present invention.

Figures 5A and 5B illustrate an example of a fourth instruction that can be executed by the memory-side component in embodiments according to the present invention.

Figure 6 illustrates an example of a fifth instruction that can be executed by the memory-side component in embodiments according to the present invention.

Figure 7 is a block diagram illustrating a first architecture of a computer system that includes a memory-side component in embodiments according to the present invention.

Figure 8 is a block diagram illustrating a second architecture of a computer system that includes a memory-side component in embodiments according to the present invention.

Figure 9 is a block diagram illustrating a third architecture of a computer system that includes a memory-side component in embodiments according to the present invention.

Figure 10 is a block diagram illustrating a fourth architecture of a computer system that includes a memory-side component in embodiments according to the present invention.

Figure 11 is a block diagram illustrating a fifth architecture of a computer system that includes a memory-side component in embodiments according to the present invention.

Figure 12 is a flowchart of an example of a method of memory-side pointer chasing in embodiments according to the present invention.

Figures 13A and 13B illustrate an example of a comparison of conventional pointer chasing and pointer chasing in embodiments according to the present invention.

DETAILED DESCRIPTION

Reference will now be made in detail to the various embodiments of the present disclosure, examples of which are illustrated in the accompanying drawings. While described in conjunction with these embodiments, it will be understood that they are not intended to limit the disclosure to these embodiments. On the contrary, the disclosure is intended to cover alternatives, modifications and equivalents, which may be included within the spirit and scope of the disclosure as defined by the appended claims. Furthermore, in the following detailed description of the present disclosure, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. However, it will be understood that the present disclosure may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the present disclosure.

Some portions of the detailed descriptions that follow are presented in terms of procedures, logic blocks, processing, and other symbolic representations of operations on data bits within a computer memory. These descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. In the present application, a procedure, logic block, process, or the like, is conceived to be a self-consistent sequence of steps or instructions leading to a desired result. The steps are those utilizing physical manipulations of physical quantities. Usually, although not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated in a computer system. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as transactions, bits, values, elements, symbols, characters, samples, pixels, or the like.

It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout the present disclosure, discussions utilizing terms such as “receiving, ” “accessing, ” “determining, ” “storing, ” “selecting, ” “generating, ” “sending, ” “performing, ” “reading, ” “fetching, ” “writing, ” “decoding, ” “encoding, ” “responding, ” “adding, ” “processing, ” or the like, refer to actions and processes of an apparatus or computer system or similar electronic computing device or system (e.g., the system of Figure 1) . A computer system or similar electronic computing device manipulates and transforms data represented as physical (electronic) quantities within memories, registers or other such information storage, transmission or display devices.

Some elements or embodiments described herein may be discussed in the general context of computer-executable instructions residing on some form of computer-readable storage medium, such as program modules, executed by one or more computers or other devices. By way of example, and not limitation, computer-readable storage media may comprise non-transitory computer storage media and communication media. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform particular tasks or implement particular abstract data types. The functionality of the program modules may be combined or distributed as desired in various embodiments.

Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, random access memory (RAM) , read only memory (ROM) , electrically erasable programmable ROM (EEPROM) , flash memory (e.g., an SSD) or other memory technology, compact disk ROM (CD-ROM) , digital versatile disks (DVDs) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can accessed to retrieve that information.

Communication media can embody computer-executable instructions, data structures, and program modules, and includes any information delivery media. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF) , infrared and other wireless media. Combinations of any of the above can also be included within the scope of computer-readable media.

Embodiments according to the present invention include a computer system that includes a component that executes instructions to implement memory-side pointer chasing for the analysis of graphs using a graph neural network.

Figure 1 is a block diagram illustrating an example of a computer system 100 in embodiments according to the present invention. The computer system 100 is described below as including certain devices or components; however, the invention is not so limited. A computer system according to the present invention may include devices or components in addition to those described, including more than one of the devices or components that are mentioned below.

In the Figure 1 embodiments, the computer system 100 includes a processor 110, which includes at least one processor core 111, at least one cache 112, and a memory controller 113. The computer system 110 also includes a memory controller 120, which may or may not be the same memory controller as the memory controller 113. The

memory controller

113 or 120 is coupled to a media controller 125, which in turn is coupled to a memory device 130 that stores data. The memory device 130 may be or may include, for example, dynamic random access memory (DRAM) , Hybrid Memory Cube (HMC) , High Bandwidth Memory (HBM) , or non-volatile dual inline memory module (NVDIMM) .

Thus, the computer system 100 includes a hierarchy of levels of memories that includes the cache 112 at a cache level, the memory controller 113 at a memory controller level, and the media controller 125 at a media controller level.

In embodiments according to the present invention, a component 140, also referred to herein as a memory-side component, is coupled to the processor 110 and to the memory device 130. The memory-side component 140 can be attached at the cache level (to the cache 112) , attached at the memory controller level (to the memory controller 113) , or attached at the media controller level (to the media controller 125) . Because the memory-side component 140 is at the memory side of the computer system 100, embodiments according to the invention decrease latency and save bandwidth, and also save processor cycles. Additional information is provided below in the discussion of Figures 7, 8, 9, 10, and 11.

In embodiments, the memory-side component 140 of Figure 1 includes a first (e.g., request) buffer 141 and a second (e.g. data) buffer 142 or scoreboard. Significantly, there is a data path 143 from the second buffer 142 to the first buffer 141. The memory-side component 140 also includes a decoder 144 (or decoding functionality) and an encoder 146 (or encoding functionality) . The functions performed by the memory-side component 140 are described further below.

A memory management unit (MMU) 150 is coupled to the memory-side component 140. In embodiments, the memory-side component 140 is also coupled to a communication interface 160. The communication interface 160 may be, for example, a network interface controller, an input/output (IO) , or a router. The communication interface 160 can be coupled to one or more other memory-side components (like the component 140) , media controllers (like the media controller 125) , and memory devices (like the memory device 130) . Additional information is provided below in the discussion of Figures 7, 8, 9, 10, and 11.

Embodiments according to the invention include an extension to an instruction set architecture (ISA) or an application programming interface (API) that includes instructions for memory-side pointer chasing. The instructions can be issued by the processor 110 (by the processor core 111) and executed by the memory-side component 140. Examples of those instructions are listed in Figures 2A, 3A, 4, 5A, and 6.

Figure 2A is a listing of an example of a first instruction “ISAEXT_PC, ” referred to herein as a pointer chasing instruction 200, that can be executed by the memory-side component 140 (Figure 1) in embodiments according to the present invention. With reference also to Figure 2B, the instruction 200 includes a first address addr corresponding to a first location in the memory device 130 (which may be referred to as the first memory device) . A first instance of data D (addr) is read from the first location in the first memory device 130 into the second buffer 142; this may be referred to as the first hop.

The first instance of data can be sent from the second buffer 142 to the processor 110. The first instance of data may also be used to generate a pointer (a second address) . That is, the second address is an address based on the first instance of data. The second address can be generated by the MMU 150, as described further below. A second instance of data D (D (addr) ) can then be read from a second location at the second address in a memory device of the computer system 100; this may be referred to as the second hop.

The second instance of data may be read from the first memory device 130, or it may be read from one of the other memory devices via the communication interface 160. In other words, the second hop may or may not be to the same memory device as the first hop. In general, a hop may or may not be to the same memory device as the preceding hop.

If the second hop is to the memory device 130, then the second address is sent from the second buffer 142 to the first buffer 141 over the data path 143. In that case, the memory-side component 140 uses the second address to access an instance of data at the corresponding location in the memory device 130.

If the second hop is to, for example, one of the other memory devices instead of the first memory device 130, then the second address is sent from the second buffer to the encoder 146 and encoded by the memory-side component 140, then sent to the other memory device over the communication interface 160.

An address based on an instance of data (e.g., the second address, which is based on the first instance of data) is a logical address that needs to be translated into a physical address. If the translation is nonlinear, then the MMU 150 performs the translation. If the translation is linear, then the MMU 150 does not need to perform the translation.

In an embodiment, the MMU 150 is implemented as hierarchical MMUs. In the nonlinear translation, the physical address is divided into two parts: a memory channel address represented by a first part, and an address inside the memory channel represented by a second part, which is smaller than the first part. With hierarchical MMUs, the channel address is local and so data comparisons on the first part of the address are not needed. Only data comparisons on the second and smaller part are needed, which reduces the time needed to perform the comparisons.

The process of using data at one address to generate another address is repeated until the pointer chasing instruction 200 is executed to completion, at which point the final instance of data is read from one of the memory devices of the computer system 100. The final instance of data is sent to the processor core 111 (e.g., to the cache 112) . Generally speaking, the first instance of data is sent to the processor 110. Note that the final instance of data may be the first instance of data read from the first memory device 130; that is, that first instance of data may be the only instance of data read during execution of the pointer chasing instruction 200 and may not be used to generate a second address.

Figure 3A is a listing of an example of a second instruction “ISAEXT_vec_PC, ” referred to herein as a vector (or multiple) pointer chasing instruction 300, that can be executed by the memory-side component 140 (Figure 1) in embodiments according to the present invention. With reference also to Figure 3B, the vector pointer chasing instruction 300 includes a vector address vaddr that includes multiple (e.g., two or more) addresses corresponding to different locations in the memory device 130. That is, relative to the pointer chasing instruction 200 of Figure 2A, the vector pointer chasing instruction 300 includes the first address addr and at least one additional address. The multiple addresses share the same instruction but different instances of data are read simultaneously or concurrently. The different instances of data can each be used to generate addresses, and the data read from those addresses can each be used to generate addresses, and so on, in a manner similar to that described above for the pointer chasing instruction 200, until the vector pointer chasing instruction 300 is executed to completion.

Figure 4 is a listing of an example of a third instruction “ISAEXT_Atomic_PC, ” referred to herein as an atomic-operation pointer chasing instruction 400, that can be executed by the memory-side component 140 (Figure 1) in embodiments according to the present invention. An atomic operation is a term of the art. The atomic-operation instruction 400 includes an address addr corresponding to a location in the memory device 130. The atomic-operation instruction 400 also specifies an atomic operation that is to be performed on an instance of data that is read from that location, and writes the result of that operation back to a location (e.g., the same location that the instance of data was read from) in the memory device 130. An atomic operation will automatically trigger a cache flush and so cache coherency is maintained.

Figure 5A is a listing of an example of a fourth instruction “ISAEXT_Gather_PC, ” referred to herein as a gather pointer chasing instruction 500, that can be executed by the memory-side component 140 (Figure 1) in embodiments according to the present invention. With reference also to Figure 5B, the gather pointer chasing instruction 500 includes an address addr corresponding to a first location in the memory device 130, and also specifies a value of a width or stride. In a manner similar to the vector pointer chasing instruction 300 of Figure 3A, the gather pointer chasing instruction 500 causes different instances of data to be read simultaneously or concurrently using the same instruction. However, when the memory-side component 140 executes the gather pointer chasing instruction 500, a first instance of data D (addr) is read from a first location corresponding to the first address addr, and a second instance of data is read from a second location that is separated from the first location by the value of the stride. That is, the address of the second location is the first address incremented by the value of the stride. For example, if the value of the stride is one and the first instance of data D (addr) is read from the address addr in the memory device 130, then the second instance of data D (addr+1) is read from the address addr+1 corresponding to the location in the memory device 130 that is next to the location of the first instance of data.

As illustrated in the example of Figure 5B, the different instances of data D (addr) and D (addr+1) can each be used to generate addresses, and the two instances of data read from those addresses can each be used to generate addresses, and so on, in a manner similar to that described above for the

instructions

200 and 300. For example, for a stride value of one, the data D (addr) can be used to generate a first address, the data D (addr+1) can be used to generate a second address, an instance of data D (D (addr) ) can be read from the memory location corresponding to the first address, an instance of data D (D (addr) +1) can be read from the memory location adjacent to the first address, an instance of data D (D (addr+1) ) can be read from the memory location corresponding to the second address, and an instance of data D (D (addr+1) +1) can be read from the memory location adjacent to the second address, and so on, until the gather pointer chasing instruction 500 is executed to completion.

Figure 6 is a listing of an example of a fifth instruction “ISAEXT_Collective_PC, ” referred to herein as a collective-communication pointer chasing instruction 600, that can be executed by the memory-side component 140 (Figure 1) in embodiments according to the present invention. In this example, the collective-communication pointer chasing instruction 600 includes the gather pointer chasing instruction 500 (Figure 5) in the portion of the listing labeled 602 in Figure 6. When the portion of the listing labeled 604 is executed, the collective-communication pointer chasing instruction 600 determines a value of an instance of data from the values of the instances of data that are read from memory when the portion 602 of the instruction is executed. As examples, the mean, the maximum, the minimum, or the sum of the instances of data read from memory can be determined. In an embodiment, collective-communication pointer chasing operations will mark memory regions where pointer chasing is to be performed as non-cacheable regions or will manually flush the cache.

Figure 7 is a block diagram illustrating a first architecture 700 of the computer system 100 in embodiments according to the present invention. In the architecture 700, the memory-side component 140 is located at the cache level. In the example of Figure 7, the memory-side component 140 receives a pointer chasing instruction (see Figures 2A, 3A, 4, 5A, and 6) from the processor 110 (from the processor core 111) and fetches (reads) an instance of data from the memory device 130 to the memory-side component 140 at the cache level. The memory-side component 140 processes the instance of data depending on the type of the pointer chasing instruction, as described previously herein, and without having to necessarily rely on further instructions from the processor 110.

Figure 8 is a block diagram illustrating a second architecture 800 of the computer system 100 in embodiments according to the present invention. In the architecture 800, the memory-side component 140 is located at the memory controller level. In the example of Figure 8, the memory-side component 140 receives a pointer chasing instruction (see Figures 2A, 3A, 4, 5A, and 6) from the processor 110 (from the processor core 111) and fetches (reads) an instance of data from the memory device 130 to the memory-side component 140 at the memory controller level. The memory-side component 140 processes the instance of data depending on the type of the pointer chasing instruction, as described previously herein, and without having to necessarily rely on further instructions from the processor 110.

Figure 9 is a block diagram illustrating a third architecture 900 of the computer system 100 in embodiments according to the present invention. In the architecture 900, the memory-side component 140 is located at the media controller level. In the example of Figure 9, the memory-side component 140 receives a pointer chasing instruction (see Figures 2A, 3A, 4, 5A, and 6) from the processor 110 (from the processor core 111) , and fetches (reads) an instance of data from the memory device 130 to the memory-side component 140 at the media controller level. The memory-side component 140 processes the instance of data depending on the type of the pointer chasing instruction, as described previously herein, and without having to necessarily rely on further instructions from the processor 110.

Figure 10 is a block diagram illustrating a fourth architecture 1000 of the computer system 100 in embodiments according to the present invention. Relative to the architecture 900 of Figure 9, the architecture 1000 includes a second memory device 170, a second media controller 1025, and two memory-

side components

140a and 140b. The memory-

side components

140a and 140b are each exemplified by the memory-side component 140 of Figure 1. In the architecture 1000, the memory-

side components

140a and 140b are located at the media controller level.

In the example of Figure 10, the memory-side component 140a receives a pointer chasing instruction (see Figures 2A, 3A, 4, 5A, and 6) from the processor 110 (from the processor core 111) , and fetches (reads) an instance of data from the memory device 130. The memory-side component 140a processes the instance of data depending on the type of the pointer chasing instruction, as described previously herein, and without having to necessarily rely on further instructions from the processor 110.

In the example of Figure 10, the MMU 150 (Figure 1) translates the instance of data read from the memory device 130 into an address. That address may be for a location in the memory device 130 or for a location in the memory device 170. In the former case, another instance of data is read from the memory device 130. In the latter case, the address is forwarded to the memory-side component 140b through the communication interface 160 (Figure 1) . The memory-side component 140b can use that address to read an instance of data in the memory device 170. That instance of data can be used to generate another address, and so on. This type of process –in which instances of data are read from any of the memory devices based on the type of the pointer chasing instruction from the processor 110 –continues until that instruction executes to completion.

Thus, generally speaking, Figure 10 illustrates an example of an inter-channel memory architecture, in which instances of data can be read from different memory devices and in which pointer chasing can hop to different locations in a memory device, or can hop from one memory device to another memory device, without having to necessarily rely on further instructions from the processor 110.

Figure 11 is a block diagram illustrating a fifth architecture 1100 of the computer system 100 in embodiments according to the present invention. Relative to the architecture 1000 of Figure 10, the architecture 1100 includes a third memory device 1130, a fourth memory device 1170, a third media controller 1125, a fourth media controller 1126, and two additional memory-

side components

140c and 140d. The memory-

side components

140c and 140d are each exemplified by the memory-side component 140 of Figure 1. In the architecture 1100, the memory-

side components

140c and 140d are located at the media controller level.

In the example of Figure 11, the memory-side component 140a receives a pointer chasing instruction (see Figures 2A, 3A, 4, 5A, and 6) from the processor 110 (from the processor core 111) , and fetches (reads) an instance of data from the memory device 130. The memory-side component 140a processes the instance of data depending on the type of the pointer chasing instruction, as described previously herein, and without having to necessarily rely on further instructions from the processor 110.

In the example of Figure 11, the MMU 150 (Figure 1) translates the instance of data read from the memory device 130 into an address. That address may be for a location in the memory device 130 or for a location in one of the

other memory devices

170, 1130, or 1170. In the former case, another instance of data is read from the memory device 130. In the latter case, the address is forwarded to the memory-

side component

140b, 140c, or 140d based on the translated address. The memory-

side component

140b, 140c, or 140d can use that address to read an instance of data from the

corresponding memory device

170, 1130, or 1170. That instance of data can be used to generate another address, and so on. This type of process –in which instances of data are read from any of the memory devices based on the type of the pointer chasing instruction from the processor 110 –continues until that instruction executes to completion.

Thus, generally speaking, Figure 11 illustrates an example of a networked memory architecture, in which instances of data can be read from different memory devices and in which pointer chasing can hop from one location to another location in a memory device, or can hop from any one memory device to any other memory device in a network, without having to necessarily rely on further instructions from the processor 110.

Figure 12 is a flowchart 1200 of a method of memory-side pointer chasing in embodiments according to the present invention. All or some of the operations represented by the blocks in the flowchart 1200 can be implemented as computer-executable instructions residing on some form of non-transitory computer-readable storage medium, and executed by a system such as the computer system 100, namely the memory-side component 140, of Figure 1.

In block 1202 of Figure 12, with reference also to Figure 1, an instruction (e.g., an instruction such as the instructions of Figures 2A, 3A, 4, 5A, and 6) is received at the memory-side component 140 from a processor core 111 of a processor 110. The instruction includes a first address corresponding to a first location in a memory that includes at least a first memory device that stores data.

In block 1204, the memory-side component 140 determines the type of the instruction.

In block 1206, the memory-side component 140 receives a first instance of data read from the first location in the first memory device.

In block 1208, the memory-side component 140 processes the first instance of data according to the type of the instruction.

In an embodiment, the processing includes generating an address based on the first instance of data. An instance of data can then be read from a location in a memory device corresponding to that address.

In an embodiment, the processing includes incrementing the first address by a value of a stride to generate another address corresponding to a second location in the memory device, and a second instance of data can then be read from the second location in the memory device.

In an embodiment, the instruction also includes information for a second address (e.g., a vector) corresponding to a second location in the memory device, in which case a second instance of data is also read from the second location. That second instance of data is then processed according to the type of instruction.

The process parameters and sequence of steps described and/or illustrated herein are given by way of example only and can be varied as desired. For example, while the steps illustrated and/or described herein may be shown or discussed in a particular order, these steps do not necessarily need to be performed in the order illustrated or discussed. The various example methods described and/or illustrated herein may also omit one or more of the steps described or illustrated herein or include additional steps in addition to those disclosed.

While the foregoing disclosure sets forth various embodiments using specific block diagrams, flowcharts, and examples, each block diagram component, flowchart step, operation, and/or component described and/or illustrated herein may be implemented, individually and/or collectively, using a wide range of configurations. In addition, any disclosure of components contained within other components should be considered as examples because many other architectures can be implemented to achieve the same functionality.

Figures 13A and 13B illustrate an example of a comparison of conventional pointer chasing and pointer chasing in embodiments according to the present invention. In this example, a computer system 1302 according to the present invention includes memory-side components 140a-140d that are the media controller level like in the architecture 1100 of Figure 11, while a conventional computer system 1301 does not include these memory-side components. In this example, an instance of data D (addr) is read from the memory device 1310, that instance of data is used to generate an address for a second instance of data D (D (addr) ) in the memory device 1310, and the second instance of data is used to generate an address for an instance of data D (D (D (addr) ) ) in the memory device 1312.

For this example, the latency between the processor and a media controller is taken to be 200 nanoseconds (ns) , the latency associated with the memory devices is taken to be 100 ns, and the latency of the communication interfaces is taken to be 900 ns. In this example, the total latency for conventional pointer chasing is 13,500 ns, while the total latency for pointer chasing according to the present invention is only 5,900 ns. The difference in latencies is because, in conventional pointer chasing, interactions with the processor 1321 occur with each data read, while those interactions are not necessary in pointer chasing according to the invention, as mentioned above.

Thus, embodiments according to the invention reduce latency. In addition, embodiments according to the invention reduce process-to-memory bandwidth consumption and save processor cycles.

In addition to the disclosure herein, embodiments according to the present invention can be used in a packed memory access (PMA) memory subsystem in which hardware components at both the host and the media side accumulate and pack, unpack, and compress read/write requests, and send and respond in a single transaction.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the disclosure is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the disclosure.

Embodiments according to the invention are thus described. While the present disclosure has been described in particular embodiments, the invention should not be construed as limited by such embodiments, but rather construed according to the following claims.

Claims

A computer-implemented method, comprising:

a component receiving an instruction from a processor core, the instruction comprising a first address corresponding to a first location in memory, wherein the memory comprises a first memory device that stores data;

the component determining a type of the instruction;

the component receiving a first instance of data read from the first location in the first memory device; and

the component processing the first instance of data according to the type of the instruction.
The computer-implemented method of Claim 1, wherein the instruction further comprises information for a second address corresponding to a second location in the first memory device, and wherein the method further comprises receiving, at the component, a second instance of data read from the second location, the component also processing the second instance of data according to the type of the instruction.
The computer-implemented method of Claim 1, wherein said processing comprises generating a second address in the memory based on the first instance of data received from the first memory device, and wherein the method further comprises receiving, at the component, a second instance of data read from a second location corresponding to the second address.
The computer-implemented method of Claim 1, wherein said processing comprises incrementing the first address by a value of a stride to generate a second address corresponding to a second location in the first memory device, and wherein the method further comprises receiving, at the component, a second instance of data read from the second location in the first memory device.
The computer-implemented method of Claim 1, wherein the instruction further comprises information for a second address corresponding to a second location in the memory, wherein the method further comprises receiving, at the component, a second instance of data read from the second location, and wherein said processing comprises determining a third instance of data from the first and second instances of data.
The computer-implemented method of Claim 1, wherein the component comprises a first buffer coupled to a second buffer, wherein said processing comprises generating a second address based on the first instance of data received from the first memory device; and wherein the method further comprises the component sending the second address from the second buffer to the first buffer.
The computer-implemented method of Claim 1, wherein said processing comprises:

performing an atomic operation on the first instance of data to generate a second instance of data based on the first instance of data; and

writing the second instance of data to the first memory device.
A computer system, comprising:

a processor core;

memory comprising a first memory device that stores data and a second memory device that stores data; and

a first component comprising a first buffer and a second buffer, the component coupled to the processor core and to the first memory device, wherein the component is operable for performing operations comprising:

receiving a first instruction from the processor core, the first instruction comprising a first address corresponding to a first location in the first memory device;

decoding the first instruction to identify a type of the instruction;

adding the first address to the first buffer;

receiving, into the second buffer, a first instance of data read from the first location in the first memory device; and

processing the first instance of data read from the first memory device according to the type of the instruction.
The computer system of Claim 8, further comprising a hierarchy of memories comprising a cache at a cache level in the hierarchy and a memory controller at a memory controller level in the hierarchy, wherein the component is at the cache level and is coupled to the cache.
The computer system of Claim 8, further comprising a hierarchy of memories comprising a cache at a cache level in the hierarchy and a memory controller at a memory controller level in the hierarchy, wherein the component is at the memory controller level and is coupled to the memory controller.
The computer system of Claim 8, further comprising a media controller coupled to the processor core and to the first memory device, wherein the component is coupled to the media controller.
The computer system of Claim 8, wherein the first instruction further comprises information for a second address corresponding to a second location in the first memory device, wherein the component is also operable for performing operations comprising:

receiving a second instance of data read from the second location; and

processing the second instance of data received from the first memory device according to the type of operation.
The computer system of Claim 8, wherein said processing comprises generating a second address in the memory based on the first instance of data received from the first memory device.
The computer system of Claim 8, wherein said processing comprises generating a second address in the first memory device based on the first instance of data received from the first memory device, and wherein the component is also operable for performing an operation comprising adding the second address to the first buffer.
The computer system of Claim 8, wherein said processing comprises generating a second address in the memory based on the first instance of data received from the first memory device, wherein the computer system further comprises a memory management unit coupled to the second buffer and operable for translating the second address from a logical address into a physical address, and wherein the component is also operable for performing operations comprising encoding the physical address.
The computer system of Claim 8, wherein said processing comprises incrementing the first address by a value of a stride to generate a second address corresponding to a second location in the first memory device, wherein the component is also operable for performing operations comprising receiving a second instance of data read from the second location in the first memory device.
The computer system of Claim 8, wherein the instruction further comprises information for a second address in the memory, wherein the component is also operable for performing an operation comprising receiving a second instance of data read from the second address, and wherein said processing comprises determining a third instance of data from the first and second instances of data.
The computer system of Claim 8, wherein said processing comprises:

performing an atomic operation on the first instance of data to generate a second instance of data based on the first instance of data; and

writing the second instance of data to the first memory.
A non-transitory computer-readable storage medium comprising computer-executable instructions stored therein, the computer-executable instructions comprising:

instructions to decode an instruction received from a processor, the instruction received from the processor comprising a first address corresponding to a first location in a first memory device that stores data;

an instruction to read a first instance of data from the first location in the first memory device;

instructions to process the first instance of data, wherein the instructions to process the first instance of data comprise:

an instruction to generate a second address based on the first instance of data received from the first memory device;

an instruction to increment the first address by a value of a stride to generate a second address corresponding to a second location in the first memory device;

instructions to determine a third instance of data from the first instance of data and from a second instance of data read from a second location in a memory device;

instructions to generate a second address based on the first instance of data received from the first memory, and to send the second address from a second buffer to a first buffer, wherein the first buffer and the second buffer are coupled to the processor and to the first memory device;

instructions to perform an atomic operation on the first instance of data to generate a second instance of data based on the first instance of data, and to write the second instance of data that is based on the first instance of data to the first memory device;

an instruction to send an instance of data read from the first memory device to the processor; and

an instruction to send an instance of data read from the first memory device to a memory management unit coupled to the processor.
The non-transitory computer-readable storage medium of Claim 19, wherein the instruction received from the processor further comprises information for a second address corresponding to a second location in the first memory device.