CN115881188A - Integrating resistive memory systems into multi-core CPU die to achieve large-scale memory parallelism - Google Patents
- Publication number
- CN115881188A; CN202110982177.6A
- Authority
- CN
- China
- Prior art keywords
- memory
- resistive
- sub
- arrays
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Semiconductor Memories (AREA)
Abstract
A monolithic Integrated Circuit (IC) computing device is disclosed having multiple independent processing cores (multi-core) and an embedded non-volatile resistive memory that acts as system memory. The resistive system memory is fabricated over a substrate, and the logic circuits including the processing cores are fabricated on the substrate. Further, access circuitry for operating on the resistive system memory, as well as circuitry including memory controllers, routing devices, and other logic components, is disposed at least partially on the substrate. Large main memory capacities of tens or hundreds of Gigabytes (GB) are provided and can operate with multiple processing cores, all on a single chip. This monolithic integration provides close physical proximity between the processing cores and main memory, promotes significant memory parallelism, reduces power consumption, and eliminates off-chip main memory access requests.
Description
Government licensing rights
The invention was made with government support under contract number FA807514D0002 awarded by the United States Air Force. The government has certain rights in this invention.
Technical Field
The present disclosure relates generally to an integrated circuit including a network-on-chip computing system, e.g., a multi-core chip with large capacity embedded resistive system memory and very high parallelism between the processing core and the embedded system memory.
Background
Resistive memory represents a recent innovation in the field of integrated circuit technology. While most resistive memory technologies are in the development phase, various technical concepts have been demonstrated by the inventors and are in one or more verification phases to prove or disprove the relevant theory or technology. Resistive memory technology is expected to have significant advantages over competing technologies in the semiconductor electronics industry.
The resistive memory cell may be configured to have multiple states with different resistance values. For example, a one-bit resistive memory cell may be configured to exist in either a relatively low resistance state or a relatively high resistance state. A multi-bit cell may have additional states whose resistance values differ from each other and from the relatively low and relatively high resistance states. The state of the resistive memory cell represents a discrete logical information state, facilitating digital memory operations. When many such memory cells are combined into an array, larger capacity digital memory storage becomes feasible.
Resistive memory also shows significant promise in its ability to scale to more advanced (e.g., smaller) technology nodes. Made in part from thin films and having a rather simple geometry relative to some integrated circuit devices, individual resistive memory cells can reliably operate at very small lithographic feature sizes. As feature sizes continue to decrease, the power efficiency and density of resistive memories further increase, leading to increased performance and flexibility of the technology.
In view of the above, practical development of technology using resistive memory is still continuing.
Disclosure of Invention
The following presents a simplified summary of the specification in order to provide a basic understanding of some aspects of the specification. This summary is not an extensive overview of the specification. It is intended to neither identify key or critical elements of the specification nor delineate the scope of any particular embodiment of the specification or the scope of any claim. Its purpose is to present some concepts of the specification in a simplified form as a prelude to the more detailed description that is presented in this disclosure.
The present disclosure provides a monolithic Integrated Circuit (IC) computing device having a plurality of independent processing cores (multi-core), and an embedded non-volatile resistive memory that acts as system memory or Random Access Memory (RAM). The resistive system memory is fabricated over a substrate, and the logic circuits including the processing cores are fabricated on the substrate. Further, access circuitry for operating on the resistive system memory, and circuitry including the memory controllers, routing devices, and other logic components, are disposed at least partially on the substrate. Since resistive memory is very small and highly scalable to advanced process nodes, large main memory capacities (e.g., hundreds of Gigabytes (GB) or more) can be implemented together with many processing cores, all on a single die. This monolithic integration provides close physical proximity between the processing cores and the main memory, promoting significant parallelism therebetween. Additional embodiments subdivide a large main memory array into many sub-arrays, each of which is independently accessible. Combined with many embedded processing cores, each operable to access any independent sub-array, this achieves large-scale parallelism between the processing cores and the resistive system memory, and extremely high performance together with reduced power consumption. Various embodiments of the foregoing are provided herein, including alternative or additional features and characteristics.
In further embodiments, the present disclosure provides an integrated circuit device. The integrated circuit device may include a plurality of processing cores formed on a substrate of the integrated circuit device, and a resistive memory array structure formed over the substrate of the integrated circuit device and at least partially overlying the plurality of processing cores. The resistive memory array structure may include a plurality of resistive memory sub-arrays, each resistive memory sub-array including a non-volatile two-terminal resistive switching memory cell. Further, the integrated circuit may include an access circuit formed at least partially on the substrate of the integrated circuit device, the access circuit providing independent operative access to a respective resistive memory sub-array of the plurality of resistive memory sub-arrays. In embodiments, the access circuit may be integrated within a logic circuit comprising a processing core formed on a substrate of an integrated circuit device. The access circuitry may be integrated within the processing core in a fine-grained cohesive manner. 
Still further, the integrated circuit may include a plurality of memory controllers. The plurality of memory controllers may include a first group of memory controllers communicatively coupled with a first processing core of the plurality of processing cores and operable to receive a first memory instruction from the first processing core and, in response, execute the first memory instruction on a first group of resistive memory sub-arrays of the plurality of resistive memory sub-arrays. The plurality of memory controllers may also include a second group of memory controllers communicatively coupled with a second processing core of the plurality of processing cores and operable to receive a second memory instruction from the second processing core and, in response, execute the second memory instruction on a second group of resistive memory sub-arrays of the plurality of resistive memory sub-arrays. In one or more embodiments, the first memory instruction or the second memory instruction is a memory read that returns less than 128 bytes of data.
Additional embodiments of the present disclosure provide a method of manufacturing an integrated circuit device. The method may include providing logic circuitry on a substrate of the chip, the logic circuitry including a plurality of processing cores and a cache memory for the processing cores, and providing, at least in part, access circuitry for individual sub-arrays of resistive system memory on the chip substrate. Further, the method may include disposing, at least in part, circuitry on a substrate of the chip, the circuitry including a plurality of memory controllers for each of the plurality of processing cores. According to various embodiments, the method may further include forming a non-volatile two-terminal resistive memory device including an independent sub-array of resistive system memory overlaid on the substrate and overlaid on at least a portion of the logic circuit, the access circuit, or the circuit including the plurality of memory controllers. Still further, the method may include forming an electrical connection between a respective portion of the access circuitry on the chip substrate and each individual sub-array of the resistive system memory overlaid on the chip substrate, and forming an electrical connection between circuitry comprising each memory controller and the respective portion of the access circuitry. The method may also include providing a communication path between logic circuitry including a plurality of processing cores and circuitry including a plurality of memory controllers, and configuring a memory controller of the plurality of memory controllers to implement memory instructions on an associated independent sub-array of the resistive system memory in response to main memory requests originating from the cache memory of the logic circuitry.
In another embodiment of the present disclosure, an integrated circuit device is provided. The integrated circuit device may include a plurality of processor tiles, wherein a processor tile includes a processing core, a cache memory and a cache controller, a memory controller, and a multiple data memory instruction set, and wherein the plurality of processor tiles are formed on a substrate of the integrated circuit device. The integrated circuit device may also include a resistive memory array structure formed on the substrate of the integrated circuit device, the resistive memory array structure including a plurality of independently addressable sub-arrays formed of non-volatile two-terminal resistive switching memory, wherein a portion of the independently addressable sub-arrays is managed by the memory controller. Further, the integrated circuit device may include access circuitry formed at least partially on the substrate of the integrated circuit device, the access circuitry interconnecting the memory controller with the portion of the independently addressable sub-arrays managed by the memory controller. In various embodiments, the integrated circuit device may further include a command and data bus interconnecting respective ones of the plurality of processor tiles, wherein the resistive memory array structure serves as system memory for the processing cores of the processor tiles.
The following description and the annexed drawings set forth certain illustrative aspects of the specification. These aspects are indicative, however, of but a few of the various ways in which the principles of the specification may be employed. Other advantages and novel features of the specification will become apparent from the following detailed description of the specification when considered in conjunction with the drawings.
Drawings
Many aspects, embodiments, objects, and advantages of the invention will become apparent from the following detailed description when considered in conjunction with the accompanying drawings in which like reference characters refer to the same parts throughout. In the description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It is understood, however, that certain aspects of the present disclosure may be practiced without these specific details, or with other methods, components, materials, etc. In other instances, well-known structures and devices are shown in block diagram form in order to facilitate describing the present disclosure.
FIG. 1 depicts a block diagram of an example monolithic sub-chip-level computing architecture of an Integrated Circuit (IC) chip in an embodiment of the disclosure.
FIG. 2 illustrates a block diagram of an example circuit layout for a portion of a substrate in the disclosed monolithic computing architecture, in accordance with further embodiments.
FIG. 3 shows a simplified diagram of a perspective view of a monolithic computing device with resistive system memory overlaid on a substrate circuit in one or more embodiments.
FIG. 4 depicts a block diagram of an example operational arrangement for memory access for an embedded resistive system memory of the disclosed monolithic computing device.
FIG. 5 illustrates a block diagram of a network-on-chip architecture for embedded memory access by multiple processing cores of the disclosed IC chip in one or more embodiments.
FIG. 6 depicts a block diagram of a network-on-chip architecture for multi-core embedded memory access in accordance with further disclosed embodiments.
FIG. 7 illustrates an example 2D arrangement of processing cores and memory controller devices for a monolithic IC computing architecture in one or more disclosed aspects.
FIG. 8 depicts an example 2D arrangement of processing cores and memory controller devices for a monolithic IC computing architecture in additional aspects.
FIG. 9 depicts a block diagram of an example parallel memory access architecture by which the disclosed IC chip implements massive memory parallelism, in a further aspect.
FIG. 10 depicts a graph of memory parallelism for an example 64 processor architecture with embedded resistive memory for different parallel instruction modalities.
FIG. 11 illustrates a processing tile having multiple embedded resistive memory clusters monolithically integrated with a processor core in at least one embodiment.
FIG. 12 illustrates a flow diagram of an example method in one or more aspects for fabricating a monolithic IC chip with embedded resistive memory and high memory parallelism.
FIG. 13 depicts a flowchart of an example method for implementing a main memory request in conjunction with a cache process of the disclosed IC chip in one embodiment.
FIG. 14 illustrates a flow chart of an example method for fabricating a monolithic IC chip with embedded resistive memory, according to further disclosed embodiments.
FIG. 15 illustrates a block diagram of an example electronic operating environment, in accordance with certain embodiments of the present disclosure.
Detailed Description
Introduction
The present disclosure relates to a monolithic Integrated Circuit (IC) device having a plurality of processing cores, and an embedded non-volatile resistive memory serving as a main memory (or Random Access Memory (RAM)) for the plurality of processing cores. The use of a non-volatile main memory facilitates applications that do not require a continuous external power source, as the threat of data loss is avoided or greatly mitigated. Furthermore, highly scalable resistive switching two-terminal memory cells (also referred to as resistive switching memory cells, resistive memory cells, or resistive memory) can provide very high system memory capacity, such as tens or hundreds of gigabits (Gb) or more, far beyond the capacity of embedded Dynamic Random Access Memory (DRAM). To achieve data throughput between the processing cores and the system memory that approaches or exceeds that of modern DRAM memory, a high degree of parallelism between the processing cores and the embedded resistive system memory is provided. High parallelism can be achieved by a variety of mechanisms, including a large number of processing cores, a very large number of independently operable resistive memory sub-arrays, an embedded memory controller serving each processing core, and a multi-threaded, multi-data, and non-blocking multi-data memory instruction set, among others. In at least one example embodiment, to which the disclosure is in no way limited, a 20mm by 20mm IC chip based on 16 nanometer (nm) process technology is provided having at least 64 processing cores and more than 32GB of non-volatile system memory, arranged in more than 8000 independently operable memory sub-arrays of 2048 cells each, in a single two-dimensional (2D) crossbar array. In this embodiment, stacking two of these 2D crossbar arrays implements 64GB of non-volatile system memory and over 16000 independent sub-arrays.
Likewise, stacking eight 2D crossbar arrays enables 256GB of non-volatile system memory and over 64000 independent sub-arrays. Furthermore, with the extremely high wiring density achieved through Very Large Scale Integration (VLSI) semiconductor fabrication techniques, high data throughput between multiple cores and system memory can be achieved, supporting thousands or tens of thousands of concurrent memory requests. Other examples within the scope of the present disclosure include other process technologies (e.g., 14nm process technology, 12nm process technology, 7nm process technology, etc.) that facilitate even greater memory capacity, a greater number of individually accessible sub-arrays, a greater number of cells per sub-array, or the like, or combinations thereof. Still further, additional or alternative features of a monolithic memory chip with a multi-core processor and embedded resistive memory that are known in the art, or made known to one of ordinary skill in the art by way of the context provided herein, are considered to be within the scope of the present disclosure.
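The stacking figures quoted in the example embodiment scale linearly with the number of 2D crossbar layers. The following is a minimal sketch of that arithmetic only; the names LayerSpec and stacked_totals are hypothetical and introduced for illustration, and the per-layer values are the lower bounds stated in the text ("more than 32GB", "more than 8000"), not measured data.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LayerSpec:
    """Per-layer lower bounds quoted for the example 16nm embodiment."""
    capacity_gb: int = 32    # "more than 32GB" per 2D crossbar array
    sub_arrays: int = 8000   # "more than 8000" independently operable sub-arrays


def stacked_totals(layers: int, layer: LayerSpec = LayerSpec()) -> Tuple[int, int]:
    """Return (capacity in GB, sub-array count) for a stack of identical layers."""
    if layers < 1:
        raise ValueError("at least one crossbar layer is required")
    return layers * layer.capacity_gb, layers * layer.sub_arrays


if __name__ == "__main__":
    for n in (1, 2, 8):
        capacity, sub_arrays = stacked_totals(n)
        print(f"{n} layer(s): {capacity} GB, {sub_arrays} sub-arrays")
```

Under these assumptions, two layers give 64GB and 16000 sub-arrays, and eight layers give 256GB and 64000 sub-arrays, matching the figures above.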
As used herein, the term "processing core" refers to any suitable analog or digital instruction and data execution device capable of being contained within an integrated circuit chip. Suitable examples of processing cores include general purpose devices, such as a Central Processing Unit (CPU). Other suitable examples include dedicated devices, such as accelerators and the like. Examples include Graphics Processing Units (GPUs), Digital Signal Processors (DSPs), Physics Processing Units (PPUs), Application Specific Instruction set Processors (ASIPs), network processors, image processors, and so forth. Other examples known in the art, or made known to one of ordinary skill in the art by way of the context provided herein, are considered to be within the scope of the present disclosure.
In one or more additional embodiments, the processor tiles of the disclosed multi-core monolithic IC computing chip may utilize the sub-page-size access capability of two-terminal resistive switching memory sub-arrays. For other memory technologies, such as DRAM or FLASH memory, the minimum amount of memory that can be accessed per memory request is a page (e.g., an entire row of an array or sub-array) of data. If a memory request requires only a portion of the data stored in a page, extraneous data is returned, thereby reducing the throughput of useful data. As used herein, the term "useful data throughput" refers to the ratio of desired or target data transferred between main memory and a set of processing cores to the total data transferred (including extraneous data). By enabling sub-page-sized memory access requests, finer granularity of data access may be achieved. For example, in some cases the fetch size (e.g., 1 byte, 2 bytes, 4 bytes, etc.) may be similar or equal to the size of the useful data, or only moderately larger. This results in a higher throughput of useful data between the processing cores and the system memory. Thus, the disclosed embodiments may minimize or avoid fetching redundant data, further reduce power consumption, and maximize useful data throughput.
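The "useful data throughput" ratio defined above can be illustrated with a short calculation. This is a sketch under stated assumptions: the function name useful_data_throughput and the 2048-byte page size are hypothetical values chosen for illustration and are not taken from the disclosure.

```python
def useful_data_throughput(useful_bytes: int, fetched_bytes: int) -> float:
    """Ratio of desired data to total data transferred for one memory request."""
    if fetched_bytes <= 0:
        raise ValueError("fetched_bytes must be positive")
    if not 0 <= useful_bytes <= fetched_bytes:
        raise ValueError("useful_bytes must lie between 0 and fetched_bytes")
    return useful_bytes / fetched_bytes


PAGE_BYTES = 2048   # hypothetical page size for a page-granular memory
WANTED_BYTES = 4    # the request only needs a 4-byte value

# Page-granular access returns the whole page for a 4-byte need.
page_ratio = useful_data_throughput(WANTED_BYTES, PAGE_BYTES)

# Sub-page access fetches exactly the 4 bytes that are needed.
sub_page_ratio = useful_data_throughput(WANTED_BYTES, 4)

if __name__ == "__main__":
    print(f"page-granular fetch: {page_ratio:.6f}")
    print(f"sub-page fetch:      {sub_page_ratio:.6f}")
```

With these illustrative numbers, the page-granular request transfers 512 bytes for every useful byte, while the sub-page request transfers only useful data.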
In various embodiments disclosed herein, variable access granularity may be implemented. In such embodiments, the disclosed processing core (or cache controller) may specify a non-fixed data fetch size. Conventional memory is limited to fetching large contiguous blocks of data (e.g., 128 bytes for many DRAM main memory systems) on each main memory access. This is effective for programs that exhibit good spatial reuse (spatial locality), facilitating high memory bandwidth. However, for programs with lower spatial reuse, fetching large blocks of data can result in lower useful data throughput because most of the data returned per memory request is ignored or wasted. Resistive memory may support multiple fetch sizes, and may support variable fetch sizes that change with each memory request. As a result, the disclosed computing architecture incorporates dynamic fetch size requests to resistive main memory, which can be dynamically configured to match the spatial reuse of the target memory observed at runtime.
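One way a variable fetch size could be matched to observed spatial reuse is sketched below. The disclosure does not specify a selection policy, so everything here is an assumption made for illustration: the menu of supported sizes (FETCH_SIZES), the FetchSizePredictor class, and the rule of covering the largest contiguous run of bytes consumed by recent requests.

```python
from collections import deque

# Hypothetical menu of supported fetch sizes in bytes (128 mirrors the
# conventional DRAM block size mentioned in the text).
FETCH_SIZES = (1, 2, 4, 8, 16, 32, 64, 128)


def choose_fetch_size(observed_run_bytes: int) -> int:
    """Smallest supported fetch size that covers the observed contiguous run."""
    if observed_run_bytes < 1:
        raise ValueError("observed_run_bytes must be at least 1")
    for size in FETCH_SIZES:
        if size >= observed_run_bytes:
            return size
    return FETCH_SIZES[-1]


class FetchSizePredictor:
    """Tracks how many contiguous bytes recent requests actually consumed."""

    def __init__(self, history: int = 4) -> None:
        self._recent = deque(maxlen=history)  # oldest observation drops out first

    def record(self, used_bytes: int) -> None:
        """Record the contiguous bytes a completed request really used."""
        self._recent.append(used_bytes)

    def next_fetch_size(self) -> int:
        """Fetch size for the next request, based on the recent history."""
        if not self._recent:
            return FETCH_SIZES[-1]  # no history yet: fall back to the largest block
        return choose_fetch_size(max(self._recent))
```

A program that repeatedly consumes 2 to 4 bytes per request would settle on a 4-byte fetch under this policy, while a streaming program would drift back toward the largest block size.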
In accordance with one or more additional embodiments, the disclosed monolithic computing chip may be partially or fully fabricated utilizing a Complementary Metal Oxide Semiconductor (CMOS) fabrication process. This enables the processing logic circuitry, cache and cache controller circuitry, routing device circuitry, memory controller circuitry, and high capacity embedded resistive memory arrays to be fabricated through a series of CMOS logic processing steps to form a complete computing architecture on a single IC chip. In one or more embodiments, this includes a plurality of processing cores, command and data routing devices, and integrated command and data paths between the processing cores and the routing devices, resulting in a network-on-chip architecture including very high capacity resistive system memory. This results in a significant advance of system-on-chip devices over the prior art.
A variety of resistive memory technologies, and their features having characteristics suitable for use in various embodiments, are considered to be within the scope of the present disclosure. As used herein, a resistive switching memory cell may be a two-terminal memory device that includes a circuit member having two conductive contacts (e.g., electrodes or terminals) with an active region between the two conductive contacts. In the context of resistive switching memories, the active region of a two-terminal memory device exhibits multiple stable or semi-stable resistive states, each resistive state having a different resistance. Further, respective ones of the plurality of states may be formed or activated in response to appropriate electrical signals applied at the two conductive contacts. The appropriate electrical signal may be a voltage value, a current value, a voltage or current polarity, or the like, or a suitable combination thereof. Non-exhaustive examples of resistive switching two-terminal memory devices include resistive random access memory (ReRAM), Phase Change RAM (PCRAM), conductive bridge RAM (CB-RAM), and Magnetic RAM (MRAM).
One example of a resistive memory is a filamentary resistive memory cell. In general, the composition of filamentary resistive memory cells may vary from device to device, with different components being selected to achieve desired characteristics (e.g., volatile/non-volatile resistive switching, on/off current ratios, switching times, read times, memory endurance, program/erase cycles, etc.). One example of a filamentary resistive memory cell may include: conductive layers, such as metals, metal alloys (including, for example, metal-metal alloys such as TiW, and the like, as well as various suitable metal-nonmetal alloys), and metal nitrides (including, for example, TiN, TaN, or other suitable metal-nitride compounds). In such embodiments, a conductive filament (e.g., formed from ions) may facilitate conductivity through at least a subset of the resistive switching layer (RSL), and as an example, the resistance of the filament-based device may be determined by the tunneling resistance between the filament and the conductive layer.
The RSL may comprise, for example, an undoped amorphous Si-containing layer, a semiconductor layer having intrinsic characteristics, silicon nitride (e.g., SiN, Si_3N_4, SiN_x, etc., where x is a positive number), silicon sub-oxide (e.g., SiO_x, where x has a value between 0.1 and 2), metal oxides, metal nitrides, non-stoichiometric silicon compounds, silicon and nitrogen containing materials, metal and nitrogen containing materials, and the like. Other examples of amorphous and/or non-stoichiometric materials suitable for use in the RSL may include Si_XGe_YO_Z (where X, Y and Z are respective suitable positive numbers), silicon oxide (e.g., SiO_N, where N is a suitable positive number), silicon oxynitride, undoped amorphous silicon (a-Si), amorphous SiGe (a-SiGe), TaO_B (where B is a suitable positive number), HfO_C (where C is a suitable positive number), TiO_D (where D is a suitable number), Al_2O_E (where E is a suitable positive number), other suitable oxides and the like, nitrides (e.g., AlN, SiN) and the like, or suitable combinations thereof (e.g., see below).
In some embodiments, an RSL used as part of a non-volatile memory device (non-volatile RSL) may include a relatively large number (e.g., compared to a volatile selector device) of material voids or defects to trap neutral metal particles within the RSL (e.g., at relatively low voltages, such as < ~3 volts). The large number of voids or defects may promote the formation of a thick, stable structure of neutral metal particles. In such a structure, the trapped particles can hold the non-volatile memory device in a low resistance state in the absence of an external stimulus (e.g., power), thereby enabling non-volatile operation. In other embodiments, the RSL for a volatile selector device (volatile RSL) may have very few material voids or defects. Because few voids or defects are available to trap particles, the conductive filaments formed in such an RSL can be very thin, and unstable in the absence of a suitably high external stimulus (e.g., an electric field, a voltage greater than about 0.5 volts, 1 volt, 1.5 volts, etc., an electric current, joule heating, or a suitable combination thereof). Furthermore, the particles can be selected to have a high surface energy and good diffusivity within the RSL. This allows the conductive filament to form quickly in response to an appropriate stimulus, but also to deform readily, for example, in response to the external stimulus dropping below a deformation magnitude. Note that a volatile RSL and conductive filament for the selector device may have different electrical characteristics than the conductive filament and non-volatile RSL for the non-volatile memory device. For example, the selector device RSL may have a higher material resistance and may have a higher on/off current ratio, etc.
The active metal-containing layer for the filament-based memory cell may include, among others: silver (Ag), gold (Au), titanium (Ti), titanium nitride (TiN) or other suitable compounds of titanium, nickel (Ni), copper (Cu), aluminum (Al), chromium (Cr), tantalum (Ta), iron (Fe), manganese (Mn), tungsten (W), vanadium (V), cobalt (Co), platinum (Pt), hafnium (Hf), and palladium (Pd). In some aspects of the present disclosure, other suitable conductive materials, as well as compounds, oxides, nitrides, alloys, or combinations of the foregoing or similar materials, may be used for the active metal-containing layer. Further, in at least one embodiment, a non-stoichiometric compound, such as a non-stoichiometric metal oxide or metal nitride (e.g., AlO_x, AlN_x, CuO_x, CuN_x, AgO_x, AgN_x, etc., where x is a suitable positive number, 0 < x < 2, which may have different values for different non-stoichiometric compounds), or other suitable metal compounds, may be used for the active metal-containing layer.
In some embodiments, the disclosed filamentary resistive switching devices can include an active metal layer comprising a metal nitride selected from the group consisting of: TiN_x, TaN_x, AlN_x, CuN_x, WN_x, and AgN_x, where x is a positive number. In further embodiments, the active metal layer may include a metal oxide selected from the group consisting of: TiO_x, TaO_x, AlO_x, CuO_x, WO_x, and AgO_x. In one or more further embodiments, the active metal layer may include a metal oxynitride selected from the group consisting of: TiO_aN_b, AlO_aN_b, CuO_aN_b, WO_aN_b, and AgO_aN_b, where a and b are positive numbers. The disclosed filamentary resistive switching devices may also include a switching layer comprising a switching material selected from the group consisting of: SiO_y, AlN_y, TiO_y, TaO_y, AlO_y, CuO_y, TiN_x, TiN_y, TaN_x, TaN_y, SiO_x, SiN_y, AlN_x, CuN_x, CuN_y, AgN_x, AgN_y, TiO_x, TaO_x, AlO_x, CuO_x, AgO_x, and AgO_y, where x and y are positive numbers and y is greater than x. Various combinations of the above are contemplated and considered to be within the scope of embodiments of the present invention.
In one example, the disclosed filamentary resistive switching devices include a particle donor layer (e.g., an active metal-containing layer) that includes a metal compound, and a resistive switching layer. In one alternative embodiment of this example, the particle donor layer comprises a metal nitride MN_x (e.g., AgN_x, TiN_x, AlN_x, etc.) and the resistive switching layer comprises a metal nitride MN_y (e.g., AgN_y, TiN_y, AlN_y, etc.), where y and x are positive numbers, and in some cases y is greater than x. In another alternative embodiment of this example, the particle donor layer comprises a metal oxide MO_x (e.g., AgO_x, TiO_x, AlO_x, etc.) and the resistive switching layer comprises a metal oxide MO_y (e.g., AgO_y, TiO_y, AlO_y, etc.), where y and x are positive numbers, and in some cases y is greater than x. In yet another option, the metal compound of the particle donor layer is MN_x (e.g., AgN_x, TiN_x, AlN_x, etc.) and the resistive switching layer is selected from the group consisting of MO_y (e.g., AgO_y, TiO_y, AlO_y, etc.) and SiO_y, where the relative atomic ratios x and y may be suitable stoichiometric or non-stoichiometric values in this disclosure. As used herein, variables x, a, b, etc., representing values or ratios of one element relative to another (or other) element in a compound can have different values that apply to the corresponding compound, and are not intended to represent the same or similar values or ratios across compounds.
As described above, application of a programming voltage (also referred to as a "programming pulse") to one of the electrodes of a two-terminal memory may result in the formation of a conductive filament (e.g., RSL) in the interface layer. By convention, and as generally described herein, the TE receives the programming pulses and the BE is grounded (or held at a lower voltage or opposite polarity than the programming pulses), but this is not intended to limit all embodiments. Conversely, applying an "erase pulse" to one of the electrodes (typically a pulse of opposite polarity to the programming pulse or applied as a programming pulse to the opposite electrode) may disrupt the continuity of the filament, for example by driving metal particles or other material forming the filament back to the active metal source of the non-volatile filamentary device. For volatile filamentary devices, lowering the voltage below the activation threshold voltage (or holding voltage in some embodiments) can cause the metal particles to disperse to form volatile filaments, resulting in discontinuity of the volatile filaments. The characteristics of such a conductive filament and its presence or absence affect the electrical characteristics of a two-terminal memory cell, e.g., decreasing the resistance between the two terminals and/or increasing the conductance between the two terminals when the conductive filament is present, as opposed to when the conductive filament is absent.
A program or erase pulse can be followed by a read pulse. The read pulse is typically lower in amplitude than the program or erase pulse and is typically insufficient to affect the conductive filament and/or change the state of the two-terminal (non-volatile) memory cell. When a read pulse is applied to one of the electrodes of a two-terminal memory, the measured current (e.g., I_on) may indicate the conductive state of the two-terminal memory cell. For example, when a conductive filament has been formed (e.g., in response to application of a programming pulse), the conductance of the cell is greater than otherwise, and the measured current (e.g., I_on) in response to a read pulse will be greater. On the other hand, when the conductive filament is removed (e.g., in response to application of an erase pulse), the interface layer has a relatively high resistance, so the resistance of the cell is high, the conductance of the cell is low, and the measured current (e.g., I_off) in response to a read pulse will be smaller.
Conventionally, when a conductive filament is formed, the memory cell is said to be in an "on state" with high conductance. When the conductive filament is not present, the memory cell is said to be in an "off state". Non-volatile memory cells in an on state or an off state may be logically mapped to binary values, such as, for example, "1" and "0". It should be understood that the convention used herein in connection with the state of a cell or with an associated logical binary mapping is not intended to be limiting, as other conventions, including the reverse convention, may be used in connection with the disclosed subject matter. The techniques detailed herein are described and illustrated in connection with single-level cell (SLC) memory, but it should be understood that the disclosed techniques may also be used with multi-level cell (MLC) memory, in which a single memory cell may retain a set of measurably different states representing multi-bit information. Embodiments of the present disclosure may increase the capacity of the disclosed memory array by incorporating MLC memory cells instead of SLC memory cells, with the capacity increasing in proportion to the number of bits per MLC memory cell (e.g., a two-bit MLC cell may double the disclosed memory capacity, a four-bit MLC cell may quadruple the disclosed memory capacity, etc.).
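The proportional capacity scaling described above can be illustrated with a minimal Python sketch. The function names are hypothetical and chosen only for illustration; the 2048 x 2048 cell count is borrowed from the sub-array size used as an example elsewhere in this description.

```python
def capacity_bits(num_cells: int, bits_per_cell: int = 1) -> int:
    """Total bits stored by an array of identical cells (SLC when bits_per_cell == 1)."""
    if num_cells < 0 or bits_per_cell < 1:
        raise ValueError("num_cells must be >= 0 and bits_per_cell must be >= 1")
    return num_cells * bits_per_cell


def states_per_cell(bits_per_cell: int) -> int:
    """Number of measurably different states a cell must hold to store that many bits."""
    if bits_per_cell < 1:
        raise ValueError("bits_per_cell must be >= 1")
    return 2 ** bits_per_cell


cells = 2048 * 2048                      # one example sub-array
slc = capacity_bits(cells, 1)            # single-level cell baseline
mlc2 = capacity_bits(cells, 2)           # two-bit MLC: double the capacity
mlc4 = capacity_bits(cells, 4)           # four-bit MLC: quadruple the capacity
print(mlc2 // slc, mlc4 // slc)          # capacity multipliers relative to SLC
print(states_per_cell(2), states_per_cell(4))
```

The sketch also makes the cost of MLC visible: capacity grows linearly with bits per cell, while the number of distinguishable states the cell must hold grows as a power of two.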
As used herein, a resistive memory structure may be formed as a two-dimensional (2D) array between intersecting conductive lines of an IC chip, such as between back-end-of-line conductive lines (e.g., metals, metal alloys/compounds, doped semiconductors, etc.). Stacking multiple two-dimensional arrays can produce a three-dimensional (3D) array called a 3D crossbar array. In a 3D crossbar array, two-terminal memory cells are formed at the intersections of two metal lines within each 2D array, and a plurality of such 2D arrays stacked on top of each other form a 3D crossbar structure. Two general conventions are provided for the arrangement of memory cells in a 2D or 3D array. The first convention is a 1T1R memory array, in which each memory cell is isolated from the electrical effects (e.g., current flow, including leakage path current) of surrounding circuitry by an associated transistor. The second convention is a 1TnR memory array (n being a positive number greater than 1), in which a sub-array of multiple memory cells (e.g., 2K × 2K cells, or other suitable array sizes) is isolated from the electrical effects of surrounding circuitry and sub-arrays by a single transistor (or group of transistors). In the 1TnR context, each memory cell may include a selector device (e.g., a volatile, two-terminal filamentary resistance device) electrically connected in series with the two-terminal non-volatile memory cell between intersecting conductive lines of a crossbar array. The selector device has a very high off-resistance, and when the voltage applied across the conductive lines is below the activation magnitude of the selector device, the selector device can greatly suppress current leakage between the conductive lines. Because two-terminal memory cells can be fabricated from thin films at much smaller sizes than transistors and are highly scalable, a 1TnR array with a large value of n can yield very high storage densities.
Example monolithic computing architecture
Various aspects or features of the disclosure are described with reference to the drawings, wherein like reference numerals are used to refer to like elements throughout. In the description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It is understood, however, that certain aspects of the disclosure may be practiced without these specific details, or with other methods, components, materials, etc. In other instances, well-known structures and devices are shown in block diagram form in order to facilitate describing the present disclosure.
Fig. 1 illustrates a block diagram of an example monolithic sub-chip level computing architecture 100, in accordance with one or more embodiments of the present disclosure. In some embodiments, the computing architecture 100 may form part of a monolithic network-on-chip architecture. In particular, computing architecture 100 may include multiple processor tiles 104_NM connected with X resistive memory sub-arrays 130 that serve as non-volatile system memory (e.g., random access memory, or RAM) for the processor tiles. The resistive memory sub-arrays 130 use the high density and scalability of two-terminal resistive memory technology (e.g., resistive-switching, two-terminal memory, also known as resistive random access memory, or ReRAM) to achieve a large embedded non-volatile system memory capacity. Due to the non-volatility of the system memory, reduced power consumption and simplified power maintenance circuitry may be implemented for the computing architecture 100. Furthermore, small or self-limiting size memory access requests may maximize useful data throughput of the computing architecture 100. Still further, by enabling massive memory parallelism, the various disclosed embodiments can achieve overall data throughput approaching or even exceeding that of modern DRAM memory systems, which is a significant improvement in embedded system-on-chip (or network-on-chip) architectures provided by the present disclosure.
As shown, computing architecture 100 may include a substrate layer 110, the substrate layer 110 including logic circuitry and other active device circuitry, including a plurality of processor tiles 104_NM, where N and M are suitable positive integers. In an embodiment, the substrate layer 110 includes N × M processor tiles. In certain embodiments, N × M comprises a number selected from the group consisting of: 64 processor tiles, 128 processor tiles, 256 processor tiles, and 512 processor tiles. In other embodiments, substrate layer 110 may include other numbers of processor tiles (e.g., 8, 16, or 32 processor tiles, 1024 processor tiles, 2048 processor tiles, or other suitable numbers).
Above the substrate layer 110 are a plurality of back end layers 115. The back end layers 115 may fully or partially overlap the processor tiles 104_NM. Not depicted in the computing architecture 100 (but see, e.g., fig. 2 and 3, etc.) are memory access circuits for electrically accessing the back end layers 115, including the resistive memory (ReMEM) banks fabricated between the back end layers 115. The memory access circuitry may include row and column control, sense arrays, voltage and current control circuitry, multiplexers, clock sources, and so on (see, e.g., fig. 4 and 14, below). The back end layers may partially or completely overlie the memory access circuitry, among other active or passive components formed in the substrate layer 110.
Processor tiles 104_NM formed on the substrate layer 110 may be formed by a CMOS fabrication process. Furthermore, the memory access circuitry for accessing and controlling the ReMEM memory banks 130 may be formed by a CMOS fabrication process. For example, the logic and access circuitry may be formed entirely, or at least in part, by a front-end-of-line CMOS process. In addition, the ReMEM memory banks 130 may also be formed by a CMOS process, including at least in part a back-end-of-line CMOS process. This facilitates integration of the substrate layer 110 and the back end layers 115 into a single monolithic chip on a single die or wafer (e.g., a 20 mm × 20 mm chip, or other suitably sized chip).
An example component diagram of each processor tile 104 is illustrated by processor tile 120. Processor tile 120 may include a processing core (or cores) 122 including logic circuitry formed on substrate layer 110. Further, for example, a cache memory and cache controller 124 may be provided for caching data associated with one or more process threads executed by the processing core 122, retrieving cached data in response to a cache hit, or issuing a memory request to one or more of the X number of ReMEM banks 130 in response to a cache miss, where X is a suitable positive integer. Memory access circuitry may be within processor tile 120 (or adjacent to processor tile 120 in some embodiments) that provides electrical connections and control components enabling independent access to each of the X number of ReMEM memory banks 130.
As shown, processor tile 120 may include one or more memory controllers 125 to facilitate the execution of memory operations on the ReMEM banks 130 connected to processor tile 120. The memory controller 125 may be configured to operate in accordance with the physical requirements of the resistive memory cells forming the ReMEM memory banks 130. Example configurations include a read latency configuration, a write latency configuration, an overwrite configuration, power control to activate/deactivate a subset of the ReMEM memory banks 130 or to activate one or more bits or bytes of memory included in the ReMEM memory banks 130, address decoding to identify the physical location of the memory cells identified by a memory request, error correction coding and instructions, data verification instructions to verify correct read or write results, and so forth. In some implementations, the processor tile 120 includes multiple memory controllers 125 per processor core 122. As one example, processor tile 120 may include three memory controllers per processor core (e.g., see fig. 7, below). As an alternative example, processor tile 120 may include eight memory controllers per processor core (e.g., see fig. 8, below). In other embodiments, other suitable numbers of memory controllers may be provided for each processor core 122.
The memory controller 125 may also operate with a stored multiple data instruction set 126. The multiple data instruction set may provide instructions and rules for issuing multiple concurrent memory requests for the processor core 122 (or for each processor core 122, in the case of multiple cores per tile). One example includes a multi-threaded instruction set in which each thread executing on the processor core 122 issues at least one memory request to system memory (e.g., the ReMEM memory banks 130). According to this example, the processor core 122 is capable of concurrently executing n threads, n being a suitable positive integer (e.g., 4 threads, or other suitable number). Another example of a multiple data instruction set includes Single Instruction Multiple Data (SIMD) instructions. This type of multiple data instruction set refers to an instruction (single instruction) that is implemented concurrently on multiple processing cores, or within multiple process threads, etc., using different sets of data (multiple data). Scatter-gather is an example SIMD instruction that may be incorporated within the multiple data instruction set and is operable with memory controller 125. In general, SIMD instructions may be described as vector instructions, which are extensions of the common scalar instruction set. Scatter-gather refers to a SIMD instruction that may perform multiple, e.g., y, memory operations (e.g., to different physical memory access locations) within a given instruction, allowing multiple reads or writes at non-sequential memory addresses from the same SIMD instruction. For y memory operations, also referred to as y-way scatter-gather (e.g., 8-way scatter-gather, or another suitable integer), y × n physical location accesses may be issued concurrently by the memory controller 125 for each processor core 122.
As yet another example, the multiple data instruction set 126 may include a non-blocking scatter-gather SIMD instruction, in which, for each process thread of the processor core 122, multiple non-blocking memory requests of the processor core 122 (or cache controller 124) are organized in sequence by the memory controller 125. A non-blocking memory request is a memory request that can be completed or otherwise performed independently of other memory requests, and thus can be issued concurrently by the processor core 122 (or cache controller 124) without halting activity on the processor core 122. Multiple non-blocking memory requests, e.g., z memory requests, may be organized by the memory controller 125 to be issued concurrently for each of the n process threads, each request defining y physical location accesses. This may result in z × y × n concurrent physical memory accesses per processor core 122, thereby achieving large-scale memory parallelism between the processor core 122 and the resistive system memory of the computing architecture 100. In one or more embodiments, the memory controller 125 may be configured to stack 8-deep non-blocking memory requests in sequence and to issue physical memory requests to 8 × y sub-arrays of resistive system memory (e.g., where each sub-array is a subset of the ReMEM banks 130, see fig. 2, and may include sub-arrays within memory banks connected to other processor tiles 104_NM and their associated memory controllers). As an illustrative and non-limiting example, where the multiple data instruction set 126 is configured for 8-deep non-blocking organization, 8-way scatter-gather instructions, and 4 threads executing per core, the memory controller 125 may issue 8 × 8 × 4, or 256, memory requests concurrently per processor core 122 (or per cache controller 124).
However, this example is in no way limiting and other suitable values of n process threads, y-way scatter-gather implementations, and z-non-blocking request organization may be implemented by the computing architecture 100 and other computing architectures of the network-on-chip systems disclosed herein.
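As a minimal sketch of the arithmetic in the preceding example, the number of concurrent physical accesses per core is the product of the non-blocking depth z, the scatter-gather width y, and the thread count n. The function and parameter names below are hypothetical and used only for illustration.

```python
def concurrent_accesses_per_core(z_deep: int, y_way: int, n_threads: int) -> int:
    """Return z * y * n, the concurrent physical memory accesses per processor core."""
    for name, value in (("z_deep", z_deep), ("y_way", y_way), ("n_threads", n_threads)):
        if value < 1:
            raise ValueError(f"{name} must be a positive integer, got {value}")
    return z_deep * y_way * n_threads


# Example from the text: 8-deep non-blocking, 8-way scatter-gather, 4 threads per core.
print(concurrent_accesses_per_core(z_deep=8, y_way=8, n_threads=4))  # 256

# Blocking scatter-gather (z = 1) reduces to the y * n case described earlier.
print(concurrent_accesses_per_core(z_deep=1, y_way=8, n_threads=4))  # 32
```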
To facilitate access to ReMEM banks connected to different processor tiles 104_NM, processor tile 120 may include a router device 128. Router device 128 may be configured to distribute commands and data to other processor tiles 104_NM, and to receive commands and data from other processor tiles 104_NM. As an operational example, the processor core 122 (or cache controller 124) may, upon decoding a memory address associated with a ReMEM bank 130 connected to processor tile 104_01 (e.g., in response to a cache miss or other request from the processor core 122), issue a memory command (e.g., read, write, overwrite, etc.) to processor tile 104_01. The router device 128 located on that processor tile will receive the memory command and provide it to the associated memory controller 125 of processor tile 104_01. Acknowledgements or data associated with the memory request may be returned by processor tile 104_01 to the processor core 122 (or cache controller 124) of processor tile 120. As a result, as an example embodiment, the processor core 122/cache controller 124 of processor tile 120 may access not only the X ReMEM memory banks 130 connected to processor tile 120, but also any of the ReMEM memory banks included within the computing architecture 100.
The individual memory cells of the ReMEM memory bank 130 may include a series combination of volatile resistance-switching selector devices and non-volatile resistance-switching memory cells. As a result, each of the ReMEM banks 130 may be embodied by a plurality of 1TnR sub-arrays. In an embodiment, n =2048 by 2048 memory cells, although other sizes of 1TnR sub-arrays may be provided in various embodiments of the disclosed computing architecture. In an alternative embodiment, the sub-array may be accessed by a set of multiple transistors. In these embodiments, instead of 1TnR, the subarray may be 2TnR, 4TnR, or other suitable number.
Fig. 2 shows a block diagram of an example circuit layout of the individual sub-arrays 240 of the aforementioned ReMEM memory bank 130 of fig. 1. In the embodiment of fig. 2, the ReMEM memory bank 130 is managed by a bank controller 230. Bank controller 230 activates up to L independently activatable sub-arrays 240 of the ReMEM bank 130. As shown in fig. 2, the ReMEM bank 130 includes 8 independently activatable sub-arrays 240, although in various embodiments the ReMEM bank 130 may be organized into other numbers of sub-arrays (e.g., 2 sub-arrays, 4 sub-arrays, 16 sub-arrays, 32 sub-arrays, etc.). Thus, the number of independently activatable sub-arrays, L_act, may vary according to one or more embodiments, and in at least one embodiment may be determined dynamically, for example, in response to a request size attribute of a read/write request.
In at least one embodiment, the ReMEM memory bank 130 may be replicated across the IC die, over the substrate of the IC die. Given a particular die size and sub-array size, a given number, N_sa, of resistive memory sub-arrays 240 may be formed on a chip. Likewise, based on the area consumed by access circuit 210 and processor circuit 220, a number, N_core, of processor cores may be formed on the chip.
In operation, a given main memory access (e.g., an access to ReMEM memory bank 130) causes bank controller 230 to activate a number of resistive memory sub-arrays 240 equal to L_act. Bank controller 230 fetches bits from each activated sub-array 240, aggregates them, and then returns the fetched bits in response to the main memory access. The fetch size of a given memory bank = (number of bits retrieved from each sub-array) × L_act (the number of activated sub-arrays). The number of bits retrieved per sub-array may be configured at the time of manufacture of the disclosed computing architecture, or may be selectively programmed in a post-manufacture configuration. Unlike in DRAM, the fetch size of resistive memory in a crossbar array is decoupled from the page size of the crossbar array. In other words, the granularity of data entering and exiting the sub-array is independent of the sub-array page size. Specifically, any number of memory cells may be activated by individually applying appropriate voltages to the selected cross-point cells of interest (e.g., up to the number of memory cells that can be activated within the maximum wordline current of the page to which they are connected). Even if a larger page size is selected (e.g., to better amortize the substrate area consumed by the access circuitry), the fetch size of the page may be as small as 1 bit, as large as the maximum wordline current can support, or any suitable number in between (e.g., 2 bits, 4 bits, 8 bits, etc.), and in some embodiments may be dynamically configured after fabrication.
As mentioned above, a set of sub-arrays equal in number to L_act forms a single ReMEM memory bank 130. In some embodiments, bank controller 230 may be configured to change L_act per memory request, thereby changing the fetch size of a given ReMEM bank 130. For example, with the sub-array fetch size set to 1 byte, bank controller 230 may fetch a total of 4 bytes for a given memory transaction by activating and executing sub-array fetches on four (4) sub-arrays 240. In other embodiments, L_act is fixed for each bank controller 230, so the minimum memory request granularity is expressed in multiples of the sub-array fetch size × L_act. In these latter embodiments, for L_act equal to 8 as shown in fig. 2, the minimum fetch size would be 8 × the sub-array fetch size (e.g., 8 bytes for a 1-byte sub-array fetch size, 4 bytes for a 4-bit sub-array fetch size, and so on). Larger memory requests may be served by activating multiple ReMEM banks 130. As yet another example, to fetch a standard cache block of 64 bytes, 64 sub-arrays (where the sub-array fetch size is 1 byte) within 8 different ReMEM banks 130 may be activated to obtain 64 bytes of data. In these embodiments, the disclosed computing architecture may dynamically change the total fetch size for a given memory transaction by configuring bank controller 230 to activate an appropriate number of sub-arrays 240 in response to the memory transaction.
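The fetch-size relationships above can be checked with a short Python sketch. The helper names are hypothetical; the only relationships assumed are the ones stated in the text (bank fetch size = sub-array fetch size × L_act, and one bank comprises L_act sub-arrays).

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for positive b."""
    return -(-a // b)


def bank_fetch_bits(subarray_fetch_bits: int, l_act: int) -> int:
    """Fetch size of one bank: bits fetched per sub-array times activated sub-arrays."""
    return subarray_fetch_bits * l_act


def subarrays_needed(request_bytes: int, subarray_fetch_bits: int) -> int:
    """Sub-arrays that must be activated to serve a request of the given size."""
    return ceil_div(request_bytes * 8, subarray_fetch_bits)


def banks_needed(request_bytes: int, subarray_fetch_bits: int, l_act: int) -> int:
    """Banks that must be activated when each bank contributes l_act sub-arrays."""
    return ceil_div(subarrays_needed(request_bytes, subarray_fetch_bits), l_act)


# Fixed L_act = 8: minimum fetch is 8 bytes for 1-byte sub-array fetches,
# and 4 bytes for 4-bit sub-array fetches.
print(bank_fetch_bits(8, 8) // 8, bank_fetch_bits(4, 8) // 8)

# Variable L_act: a 4-byte transaction with 1-byte sub-array fetches uses 4 sub-arrays.
print(subarrays_needed(4, 8))

# A 64-byte cache block with 1-byte sub-array fetches: 64 sub-arrays in 8 banks.
print(subarrays_needed(64, 8), banks_needed(64, 8, 8))
```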
With one controller 230 per bank, the maximum number of outstanding memory requests at any given time (equal to the number of ReMEM banks 130) is N_sa / L_act. Different embodiments of the present disclosure may have different amounts of memory parallelism. In an embodiment including 2 kilo (K) × 2K sub-arrays (or, more precisely, 2048 × 2048 sub-arrays in the specific example) and a 400 mm² die, the number of sub-arrays, N_sa, in the overall main memory may be 64K at each 2D crossbar level. For example, in an embodiment in which L_act equals 8, the disclosed computing architecture may support up to 8K outstanding concurrent main memory requests across the IC chip.
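The bank-count arithmetic above can be sketched in Python as follows. The function names are hypothetical. The capacity figure is derived arithmetic rather than a number stated in the text, and it assumes single-level cells and one bit stored per cross-point of a 2048 × 2048 sub-array.

```python
K = 1024


def max_outstanding_requests(n_sa: int, l_act: int) -> int:
    """Number of banks, and hence of independent outstanding requests: N_sa / L_act."""
    if l_act < 1 or n_sa % l_act != 0:
        raise ValueError("n_sa must be a positive multiple of l_act")
    return n_sa // l_act


def level_capacity_gib(n_sa: int, rows: int, cols: int, bits_per_cell: int = 1) -> float:
    """Raw capacity of one 2D crossbar level in GiB (assumes every cell stores data)."""
    total_bits = n_sa * rows * cols * bits_per_cell
    return total_bits / 8 / 2**30


requests = max_outstanding_requests(n_sa=64 * K, l_act=8)
print(requests, requests == 8 * K)               # 8192 True
print(level_capacity_gib(64 * K, 2048, 2048))    # 32.0
```

Under those assumptions, one 2D level of 64K sub-arrays holds about 32 GiB, which is in line with the tens of gigabytes of main memory mentioned in the abstract.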
Maximum chip-wide memory parallelism is achieved when all memory requests use the minimum fetch size. If the amount of data fetched by each memory request exceeds the minimum, there are effectively fewer resistive memory banks available for independent memory requests. However, the maximum memory parallelism available at smaller fetch sizes is useful, for example, for memory-intensive applications with irregular memory access patterns that lack spatial reuse. Many applications, including graph computing and sparse matrix workloads, exhibit these characteristics and can make maximum use of the parallelism of the disclosed computing architecture.
FIG. 3 illustrates a perspective view of an example resistive memory sub-array 300 of the disclosed monolithic computing architecture, in accordance with alternative or additional embodiments of the present disclosure. Sub-array 300 includes a substrate 310 layer at the base of the IC die, and one or more layers 315 (e.g., back end layers) over substrate 310. An example layout is shown of logic circuitry and memory access circuitry within substrate 310, and thus beneath the footprint of the resistive memory sub-array 320 of the resistive main memory. This example layout provides memory access decoder 330 along a first edge of substrate 310 and memory access sense amplifier 340 along a second edge of substrate 310 perpendicular to the first edge. The remaining area of the substrate 310 includes non-memory circuitry 350, which may include logic circuitry, memory controller circuitry, router device circuitry, cache and cache controller circuitry, power control circuitry, multiplexers for routing power, connectors for routing data or instructions, and other suitable active or passive logic devices for a processor device (e.g., processing core). It should be understood that such processor devices, also referred to herein as processing cores, may include general-purpose processors such as a Central Processing Unit (CPU), or special-purpose processors such as accelerators (e.g., Graphics Processing Units (GPUs), etc.), or any other suitable CMOS processing architecture.
In the arrangement depicted in fig. 3, a set of access circuits is provided for the resistive memory sub-array 300 (and, implicitly, for other resistive memory sub-arrays adjacent to the resistive memory sub-array 300 within the disclosed monolithic computing architecture). Thus, the larger the sub-array, the better the area of the access circuitry is amortized, leaving more area for integrating the CPU logic. For example, for a sub-array containing 2K × 2K memory cells, assuming a 16 nm technology node is used for the resistive memory, one embodiment uses approximately 26% of the area under the sub-array for the access circuitry (e.g., including access decoder 330 plus sense amplifier 340), leaving 74% of the area for non-memory circuitry 350. With different process technologies (e.g., 14 nm, 12 nm, 7 nm, etc., or in some embodiments even larger process technologies: 22 nm, 28 nm, etc.), the substrate area is divided differently between the access circuitry and the processing circuitry. Generally, by implementing logic circuits under a set of memory sub-arrays, all in a single IC die (e.g., with fine-grained integration of logic circuits and access circuits), the distance between the processor core and the main memory may be minimized, and a large number of wires (e.g., hundreds of thousands or millions of wires, as available in modern VLSI manufacturing, and possibly the larger numbers of wires available in future semiconductor manufacturing technologies) may be provided to interconnect the processor core and the main memory. The high degree of interconnection between processing cores (e.g., 64 or more processing cores) and independently accessible memory banks (e.g., 8K or more independent sub-arrays per 2D crossbar) facilitates very high processor-to-main-memory interconnectivity and, thus, extremely high memory access parallelism.
The physical proximity between the non-memory circuitry 350 and the resistive memory sub-array 320 may significantly reduce the power consumption associated with main memory access requests.
One difference between the disclosed resistive switching memory and, for example, DRAM is that the read latency (and write latency) of the resistive switching memory is longer. For example, in some resistive memory technologies, read latencies on the order of hundreds of nanoseconds are observed. Main memory throughput similar to or greater than that available in DRAM systems is achieved, despite the longer latency, by providing a high degree of parallelism: a very large number of memory access requests can be issued per clock cycle (or group of clock cycles) and performed concurrently by a correspondingly large number of memory banks. Furthermore, very large data paths (e.g., 256 bits or more) may be implemented using the large number of interconnects between processing cores and memory banks. In some embodiments, the disclosed computing architecture may achieve 23.4 giga-traversed edges per second (GTEPS) with 16K-way memory parallelism, compared to 2.5 GTEPS for a DRAM system with a total data throughput of 320 GB/s. Furthermore, finer access granularity (e.g., 8 bytes, 4 bytes, 1 byte, etc.) may facilitate higher useful data throughput compared to DRAM, which retrieves data in blocks of at least 128 bytes per memory access.
Table 1 provides example embodiments of the disclosed monolithic computing architecture, but the disclosure is not limited to these embodiments. One advantage of the disclosed monolithic computing architecture is the scalability of its components. For example, when the number of processor tiles 120 is scaled up, the core count increases and the number of memory controllers 125 also increases, providing more access points into the ReMEM sub-arrays 130. Typically, the number of sub-arrays, N_sa, remains fixed for a given chip design. Thus, increasing the number of processor tiles 120 also reduces the number of resistive memory sub-arrays controlled by each memory controller 125. Table 1 gives the number of resistive memory sub-arrays and resistive memory banks per processor tile 120 as the number of processor tiles 120 increases from 64 to 1024 (e.g., for one or more embodiments having N_sa = 64K and L_act = 8; see fig. 2 above). Likewise, increasing the number of processor tiles 120 increases the number of router devices 128, providing greater network capacity for routing memory request packets between cores and memory controllers 125. In other words, scaling up the number of processor tiles 120 increases the parallelism of both computation and memory controllers while reducing the number of resistive memory banks per memory controller. Scaling the number of processor tiles 120 up or down thus significantly increases or decreases system memory parallelism.
Number of processor tiles | 64 | 128 | 256 | 512 | 1024 |
Sub-arrays per tile | 1024 | 512 | 256 | 128 | 64 |
Memory banks per tile | 128 | 64 | 32 | 16 | 8 |
Table 1: Number of sub-arrays and memory banks per processor tile 120, for different numbers of processor tiles, in an embodiment with N_sa = 64K and L_act = 8
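The entries of Table 1 follow directly from dividing a fixed N_sa among the tiles and grouping L_act sub-arrays per bank. A minimal Python sketch, with a hypothetical function name, reproduces them:

```python
def per_tile_resources(n_tiles: int, n_sa: int = 64 * 1024, l_act: int = 8) -> tuple[int, int]:
    """Return (sub-arrays per tile, banks per tile) for a fixed chip-wide N_sa."""
    if n_sa % n_tiles != 0:
        raise ValueError("n_sa must divide evenly among the tiles")
    subarrays = n_sa // n_tiles
    if subarrays % l_act != 0:
        raise ValueError("sub-arrays per tile must be a multiple of l_act")
    return subarrays, subarrays // l_act


# Reproduce the columns of Table 1 (N_sa = 64K, L_act = 8).
for tiles in (64, 128, 256, 512, 1024):
    subarrays, banks = per_tile_resources(tiles)
    print(f"{tiles:5d} tiles: {subarrays:5d} sub-arrays/tile, {banks:4d} banks/tile")
```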
Fig. 4 illustrates a block diagram of an example monolithic sub-chip level computing architecture 400, in accordance with further embodiments of the present disclosure. Computing architecture 400 illustrates an architecture with multiple memory controllers per processor core, which facilitates high system memory parallelism for the disclosed computing architecture. The processor core 410 is communicatively coupled with a plurality of memory controllers 420. Each memory controller 420 is likewise connected to a set of resistive memory sub-arrays 430 that includes a plurality of independent sub-arrays 440. Because processor core 410 is connected to many independent sub-arrays 440 through multiple memory controllers 420, processor core 410 may generate a large number of memory requests that can be executed concurrently on independent sub-arrays 440. This allows processor core 410 to amortize the relatively long access latency of independent sub-arrays 440 (e.g., hundreds of nanoseconds) across many memory requests, thereby achieving high overall throughput.
Given the access time, the general equation for the degree of parallelism needed to support a required bandwidth is as follows:
memory parallelism = (bytes/second) × (seconds/access) × (accesses/byte)
The first term of the above equation gives the required bandwidth, in bytes/second; the second term gives the access latency, in seconds/access; and the third term gives the reciprocal of the number of bytes transferred per access, in accesses/byte. Consider the problem of matrix multiplication performed on multiple cores, where a core running at 1 GHz sustains two 64-bit floating-point multiply-accumulate operations per cycle. The required bandwidth is 4 × 8 bytes/nanosecond = 32 GB/s. For a main memory access equal to 8 bytes per access (e.g., to facilitate reasonably high access granularity and good useful data throughput), given an access time of 200 nanoseconds, the minimum parallelism per core would be:
32 bytes/ns × 200 ns/access × (1 access / 8 bytes) = 800
The result, 800, is the number of concurrent memory requests per core that computing architecture 400 must sustain to maintain a data rate of 32 GB/s, given a 200 ns access time and 8 bytes per access. As access times increase, the required parallelism increases proportionally. For example, a 500 ns access time requires 2.5 times as much parallelism, or 2000 concurrent memory requests per core. This parallelism exceeds the capacity of conventional main memory (e.g., DRAM) by several orders of magnitude.
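The parallelism calculation above can be expressed as a short Python sketch. The function names are hypothetical. The factor of two operands per multiply-accumulate is an assumption used here to account for the "4 × 8 bytes" figure in the text (two operations per cycle, each reading two 8-byte operands); integer nanosecond units are used to keep the arithmetic exact.

```python
def required_bandwidth_bytes_per_ns(macs_per_cycle: int, operands_per_mac: int,
                                    bytes_per_operand: int, cycles_per_ns: int) -> int:
    """Operand traffic a core must sustain, in bytes per nanosecond (equal to GB/s)."""
    return macs_per_cycle * operands_per_mac * bytes_per_operand * cycles_per_ns


def required_parallelism(bandwidth_bytes_per_ns: int, latency_ns: int,
                         bytes_per_access: int) -> int:
    """Concurrent requests needed: bandwidth x latency / bytes per access, rounded up."""
    in_flight_bytes = bandwidth_bytes_per_ns * latency_ns
    return -(-in_flight_bytes // bytes_per_access)  # integer ceiling division


# 1 GHz core, two 64-bit MACs per cycle, two operands each (assumed): 32 bytes/ns.
bandwidth = required_bandwidth_bytes_per_ns(2, 2, 8, 1)
print(bandwidth)                                   # 32

print(required_parallelism(bandwidth, 200, 8))     # 800
print(required_parallelism(bandwidth, 500, 8))     # 2000
```

The second function is an instance of the familiar relationship between throughput, latency, and the amount of work in flight: the bytes that must be outstanding at any time equal bandwidth times latency, and dividing by the bytes per access gives the request count.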
It should be appreciated that the above does not impose a requirement of 800 or 2000 memory channels per core; rather, the memory system must be able to manage the above number of concurrent requests that overlap in time. For example, each memory controller 420 may concurrently control multiple memory banks, each of which may be in a different state, thereby allowing each memory controller 420 to "pipeline" multiple memory requests concurrently.
Referring to fig. 5, an embodiment of an example monolithic multiprocessor network-on-chip (NoC) architecture 500 is described. The NoC architecture 500 provides a command and data path 510 that connects multiple processor cores, including processor core_1 410 through processor core_X 412, where X is a suitable number greater than 1 (e.g., 32, 64, 128, 256, 512, etc.). The processor cores are hereinafter collectively referred to as processor cores_1-X 410-412. Processor cores_1-X 410-412 are each connected to a separate memory subsystem, each memory subsystem including multiple memory controllers_1-X 420-422, with the memory controllers respectively connected to corresponding groups of memory sub-arrays 430_1 ... 430_X. Each group of memory sub-arrays 430_1 ... 430_X connected to a memory controller includes multiple independently accessible memory sub-arrays 440_1-440_X. Routing devices (not depicted, but see fig. 1 above) associated with each processor core may issue memory access requests to different memory subsystems through the command and data path 510, as appropriate. For example, if a memory request issued by processor core_1 410 targets an independent sub-array 440_X within the group of memory sub-arrays 430_X of processor core_X 412, the memory request can be submitted by the routing device onto the command and data path 510 and received by the corresponding routing device connected to processor core_X 412. A response served from independent sub-array 440_X is returned on the command and data path 510 and received at processor core_1 410. As one of ordinary skill in the art will appreciate from the context provided herein, the NoC architecture 500 may accommodate a similar process involving multiple memory requests that originate from multiple cores and target multiple physical data locations in the memory subsystems of multiple other cores (or some of the same cores).
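The local-versus-remote request handling described above can be sketched in Python. Everything in this sketch is hypothetical: the text does not specify how banks are assigned to tiles, so a simple contiguous mapping (bank index divided by banks per tile) is assumed purely for illustration, and the names Route, owner_tile, and route_request are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    source_tile: int   # tile whose core issued the request
    owner_tile: int    # tile whose memory controller serves the target bank
    local: bool        # True when no NoC traversal is needed


def owner_tile(bank_id: int, banks_per_tile: int) -> int:
    """Assumed contiguous mapping: banks 0..B-1 belong to tile 0, B..2B-1 to tile 1, etc."""
    return bank_id // banks_per_tile


def route_request(source_tile: int, bank_id: int, banks_per_tile: int, n_tiles: int) -> Route:
    """Decide whether a request is served locally or forwarded over the NoC."""
    if not 0 <= source_tile < n_tiles:
        raise ValueError("source_tile out of range")
    if not 0 <= bank_id < banks_per_tile * n_tiles:
        raise ValueError("bank_id out of range")
    owner = owner_tile(bank_id, banks_per_tile)
    return Route(source_tile, owner, owner == source_tile)


# 64 tiles with 128 banks each (the first column of Table 1).
print(route_request(source_tile=0, bank_id=5, banks_per_tile=128, n_tiles=64))
print(route_request(source_tile=0, bank_id=128 * 63, banks_per_tile=128, n_tiles=64))
```

In the second call the request must cross the command and data path to reach the owning tile, and the response would travel back the same way; the first call is served by the issuing tile's own memory controller.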
Where applications of the NoC computing architecture desire a large amount of data sharing between threads and cores of a multi-core system, significant congestion may occur on the data and communication paths 510 interconnecting the multiple cores. For example, congestion may arise within routers connecting each core to data and communication paths 510, as requests for non-local data must pass through and utilize routers connected to other cores. Fig. 6 illustrates a monolithic multi-processor network-on-chip computing architecture 600 in accordance with an alternative or additional embodiment that may minimize NoC congestion. The computing architecture 600 may manage both simpler applications that share a small number of memory requests between cores (and associated memory subsystems) and congestion situations involving significant sharing between cores and memory subsystems. The computing architecture 600 places the memory controllers 410-427 as endpoints on the data and communication path 610 such that each memory controller 410-427 on the NoC has equal bandwidth. The computing architecture 600 creates a truly distributed main memory, where the cores 410-412 act as clients to the memory controllers 410-427, rather than their owners.
FIG. 7 illustrates an example 2D arrangement 700 of processing cores and memory controllers for the disclosed NoC computing architecture, in further disclosed embodiments. The 2D arrangement 700 includes 3 memory controllers 420 per processor core 410, further facilitating high memory parallelism. Each processor core 410 may exchange commands and data with connected memory controllers 420, and each memory controller 420 may likewise send and receive data and commands with other connected memory controllers 420. The arrangement of the memory controllers 420 and the processor cores 410 may be conceptual (depicting interactivity, but not the physical location of associated memory circuitry and logic circuitry), or may reflect the physical arrangement of memory circuitry and logic circuitry on the substrate of the corresponding memory controllers 420 and processor cores 410.
FIG. 8 depicts an alternative 2D arrangement 800 of processing cores and memory controllers for the disclosed NoC computing architecture in other embodiments. The 2D arrangement 800 includes 8 memory controllers 420 per processor core 410, increasing memory parallelism over that provided by the 2D arrangement 700. Similar to FIG. 7, the arrangement of memory controllers 420 and processor cores 410 for the 2D arrangement 800 may be conceptual or may reflect the physical arrangement of memory circuitry and logic circuitry on a substrate.
In some implementations, a processor tile of a computing architecture (e.g., computing architecture 100 of fig. 1 described above) may have a single core, network router, and memory controller per processor tile 120. This simplicity is attractive to minimize design overhead, but is inflexible as all hardware resources scale out at the same rate with the number of processor tiles. Some alternative implementations instead form heterogeneous tiles. In these embodiments, a separate memory controller tile that integrates a memory controller with a router device may be implemented in addition to a separate processor tile (e.g., in connection with the computing architecture 400 or the NoC architecture 500 in at least some embodiments). In these embodiments, the number of memory controllers (and routers) may be decoupled from the number of processor cores provided by the 2D arrangements 700, 800 of fig. 7 and 8. The heterogeneous tiles can be independently designed for the size of each core and the number of access points into the resistive memory sub-array.
FIG. 9 depicts a block diagram of an example NoC monolithic computing system 900, according to one or more embodiments of the disclosure. Depicted is a single independent IC chip 902 having multiple processing cores 910-912, each having access to cache and cache controllers 914-916, respectively. On a cache miss, a request is issued to main memory through the network-on-chip architecture 920. In some embodiments, the request may be for a complete cache block; in other embodiments, it may be for a portion of a cache block (e.g., a minimum fetch size, such as 8 bytes or other suitable minimum fetch size) or for multiple cache blocks. A memory controller associated with resistive main memory 930 may return data associated with the memory request.
In one or more embodiments, the NoC monolithic computing system 900 may include 64 processing cores 910-912 with 8K independent banks of resistive main memory 930 in each of two stacked 2D crossbar arrays, for a total memory parallelism of 16K memory requests. With an access latency to the resistive main memory 930 of approximately 700 ns and a minimum fetch size of 8 bytes (e.g., a minimum sub-array fetch of 1 byte and L_act = 8), the computing system 900 is estimated to achieve 23.4 GTEPS. This high performance, coupled with the non-volatility of system memory, the high capacity of system memory (e.g., 64 GB), and the elimination of off-chip memory accesses, is expected to significantly improve on the state of existing processing systems.
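The relationship between the parallelism, latency, and throughput figures above can be checked with Little's law (throughput equals requests in flight divided by latency per request). The sketch below is an interpretation, not a derivation stated in the disclosure: it assumes the memory system stays fully occupied and that each memory request corresponds to one traversed edge (GTEPS counts billions of traversed edges per second). The function name is hypothetical.

```python
def sustained_requests_per_second(outstanding_requests: int, latency_s: float) -> float:
    """Little's law: throughput = requests in flight / latency per request."""
    return outstanding_requests / latency_s


banks_per_layer = 8 * 1024   # "8K independent banks" in each crossbar layer
layers = 2                   # two stacked 2D crossbar arrays
latency_s = 700e-9           # approximate access latency given in the text

parallelism = banks_per_layer * layers                      # 16K concurrent requests
rate = sustained_requests_per_second(parallelism, latency_s)
gteps = rate / 1e9           # assumption: one traversed edge per memory request
print(f"{parallelism} requests in flight -> {gteps:.1f} G requests/s")
```

Under these assumptions the result rounds to 23.4 billion requests per second, which agrees with the estimate quoted above.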
FIG. 10 illustrates a graph 1000 of memory parallelism for an example embedded resistive memory computing architecture, according to one or more embodiments described herein. The graph 1000 plots different instruction sets for achieving parallelism in a multi-core system along the horizontal axis and the maximum number of concurrent memory requests along the vertical axis, for a 64-core computing system.
The basic parallelism is provided by the independence of multiple cores, where each core issues a single memory request that is separate from those of the other cores. For a system with a number C of core/processor tiles, each of which can issue at most a single memory request per clock cycle, C concurrent memory requests can be issued and executed simultaneously (such a system is sometimes referred to in the art as a scalar system). For a 64-core scalar system, there may be 64 concurrent memory requests. The next step is an example where separate process threads can execute independently and concurrently on different cores (superscalar), or where a core is configured to switch between hardware contexts to issue multiple memory requests across threads in an interleaved fashion (multithreading). This results in a multiplier n for the number of process threads interleaved, issued, and executed by the C cores, which reaches a maximum of n × C concurrent memory requests. Where n = 4 in a 64-core system, the number of concurrent memory requests increases to 256.
As a further refinement, multiple-data instruction sets (e.g., Single Instruction Multiple Data (SIMD)) that can process multiple data elements concurrently may be implemented to further increase memory parallelism. For example, SIMD pipeline processing supports scatter-gather, allowing each subword from a single scatter-gather operation to generate a separate memory request for a different physical memory location. For y-way scatter-gather, memory parallelism may be increased to y × n × C concurrent memory requests. Depicted in FIG. 10 is an 8-way scatter-gather paradigm that, in conjunction with 4-way multithreading and a 64-core system, achieves 2048 concurrent memory requests. This level of memory parallelism far exceeds that of conventional DRAM systems, but with a greater number of resistive memory sub-arrays (e.g., ~64K) per chip and a smaller L_act value (e.g., 8), still greater memory parallelism can be supported (e.g., 8000 concurrent requests). In embodiments with a smaller L_act value, such as 4, even greater memory parallelism is physically possible (e.g., 16000 concurrent requests).
To further increase memory parallelism, some embodiments of the present disclosure implement non-blocking SIMD scatter-gather. For a blocking instruction, when a core issues a long-latency memory operation (e.g., in response to a cache miss generating a main memory request), the core stalls while waiting for the results of the memory operation to be returned. Non-blocking memory operations, on the other hand, are those for which a core may continue to execute while long-latency memory operations are still pending. As one example, a write operation may be made non-blocking in conjunction with a data buffer configured to temporarily hold the store data before it is written to memory. In various embodiments, a per-register present bit is provided in a register file to identify instructions that depend on non-blocking loads, and to organize successive non-blocking instructions so as to delay a core stall until a dependent instruction is encountered. Alternatively, a scoreboard structure may be organized in memory to continuously identify and organize non-blocking instructions. Both non-blocking loads and stores allow a single thread to issue multiple memory requests if multiple non-blocking memory operations are encountered in succession. Thus, the amount of memory parallelism generated is limited by the number of outstanding non-blocking operations allowed.
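The per-register present bit mechanism can be illustrated with a small model. This is a minimal sketch under assumptions, not the disclosed circuit: the class `RegisterFile`, the tuple-based instruction format, and the function `requests_before_stall` are hypothetical, and load address registers and memory timing are ignored.

```python
class RegisterFile:
    """Toy register file with one 'present' bit per register (hypothetical model)."""

    def __init__(self, num_regs: int):
        self.present = [True] * num_regs
        self.outstanding = []  # destination registers of loads still in flight

    def issue_load(self, dst: int) -> None:
        # Non-blocking load: mark the destination as not yet present and keep going.
        self.present[dst] = False
        self.outstanding.append(dst)

    def can_execute(self, sources) -> bool:
        # An instruction must stall if any source register still waits on a load.
        return all(self.present[r] for r in sources)

    def complete_load(self, dst: int) -> None:
        # Data returned from memory: set the present bit again.
        self.present[dst] = True
        self.outstanding.remove(dst)


def requests_before_stall(program, num_regs: int = 16) -> int:
    """Count loads in flight when the first dependent instruction is reached."""
    rf = RegisterFile(num_regs)
    for op, dst, sources in program:
        if not rf.can_execute(sources):
            break  # the core stalls here until the data returns
        if op == "load":
            rf.issue_load(dst)
    return len(rf.outstanding)


# Four back-to-back loads, then an add that depends on two of the loaded registers.
program = [
    ("load", 1, []), ("load", 2, []), ("load", 3, []), ("load", 4, []),
    ("add", 5, [1, 2]),
]
in_flight = requests_before_stall(program)
```

In this model the thread has four requests outstanding before it stalls, whereas a program that uses each loaded value immediately would have only one.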
Embodiments of the present disclosure combine non-blocking instruction techniques with SIMD scatter-gather operations to further enhance memory parallelism. For example, buffering of write operations and tracking of dependent read operations may be applied to SIMD pipelines and register files to integrate with scatter-gather operations. In an embodiment, in an n-way multithreading system, z-deep non-blocking instructions may be organized in order for the y-way SIMD scatter-gather paradigm. This results in an overall memory parallelism of z × y × n × C. With z = 4 in the 8-way scatter-gather, 4-way multithreading, 64-core example described above, the memory parallelism increases to 8K. In FIG. 10, for z = 8, an 8-deep non-blocking instruction set is provided, resulting in a memory parallelism of 16K.
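The multiplicative build-up of memory parallelism described above can be reproduced with a few lines of arithmetic. The function name and keyword parameters below are illustrative choices; only the product z × y × n × C and the example values come from the text.

```python
def memory_parallelism(cores: int, threads: int = 1,
                       scatter_gather: int = 1, nonblocking_depth: int = 1) -> int:
    """Upper bound on concurrent memory requests: z * y * n * C."""
    return nonblocking_depth * scatter_gather * threads * cores


C = 64  # cores in the example system
levels = {
    "scalar (C)": memory_parallelism(C),
    "4-way multithreading (n*C)": memory_parallelism(C, threads=4),
    "8-way scatter-gather (y*n*C)": memory_parallelism(C, threads=4, scatter_gather=8),
    "non-blocking, z=4 (z*y*n*C)": memory_parallelism(C, 4, 8, nonblocking_depth=4),
    "non-blocking, z=8 (z*y*n*C)": memory_parallelism(C, 4, 8, nonblocking_depth=8),
}
for name, value in levels.items():
    print(f"{name:32s} {value:6d}")
```

The five values are 64, 256, 2048, 8192 (8K), and 16384 (16K), matching the figures quoted for the 64-core example.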
FIG. 11 illustrates an example processor tile 1100 demonstrating monolithic integration of a CPU with resistive main memory, according to one or more embodiments of the present disclosure. In one or more embodiments, processor tile 1100 may replace processor tile 120 of FIG. 1. Processor tile 1100 shows the location of resistive memory sub-arrays 1104A, 1104B, 1104C, 1104D (collectively, ReRAM sub-arrays 1104A-D) of a single cross-shaped ReRAM cluster 1102 relative to surrounding circuitry. The resistive memory modeled for resistive memory sub-arrays 1104A-D is resistive random access memory (ReRAM) technology produced by Crossbar, Inc. (although various other types of resistive non-volatile memory are contemplated within the scope of the present disclosure). Because a single resistive memory sub-array is small compared to the size of the processor tile 120, peripheral access circuitry for the various resistive memory sub-arrays is distributed across the processor tile 1100, affecting the logic circuitry of the processor core included in the processor tile 1100.
The arrangement of the blockage areas for the peripheral access circuits 1106B, 1106C in the CPU core is selected so that the blockages abut each other, resulting in a continuous blocked area. Note that the peripheral access circuitry represents two types of blockage. The first is placement blockage, which prevents standard cells of the CPU core from being placed in the blocked areas. The second is routing blockage at particular metal layers, which limits routing. In FIG. 11, metal layers 1-8 are blocked for routing, leaving the automatic place-and-route (APR) tool able to route over the blocked area using metal layers 9 and 10.
The above-described figures have been described with respect to various components of an integrated circuit chip, system-on-chip, or network-on-chip, including arrangements of memory arrays, memory circuits, logic circuits, and system components (e.g., memory controllers, cache controllers, router devices, etc.), as well as monolithic groups of layers used to form some or all of these components. It should be understood that in some suitable alternative aspects of the present disclosure, the various figures may include the depicted arrangement of the specified components/arrays/circuits/devices/layers therein, some of the specified components/arrays/circuits/devices/layers, or additional components/arrays/circuits/devices/layers. The sub-components may also be implemented in electrical connection with other sub-components, rather than being included within a parent component/layer. For example, memory controller 125, router 128, and SIMD instruction set 126 may be included on separate heterogeneous tiles, rather than integrated as part of processor tile 120. Further, components/arrays/circuits/devices etc. depicted in one figure should be understood as operable in other figures as applicable. For example, the processor core/memory controller organization depicted in any of fig. 4, 5, and 6 may be implemented in the architecture of fig. 1 as an alternative embodiment. Further variations, combinations, reductions, or additions of parts not specifically described herein but within the purview of one of ordinary skill in the art or reasonably suggested by the context provided herein are considered to be within the scope of the present disclosure. Further, it should be noted that one or more of the disclosed processes may be combined into a single process that provides aggregate functionality. The components of the disclosed architecture may also interact with one or more other components not specifically described herein but known to those of skill in the art.
In view of the above-described exemplary diagrams, the process methodologies that may be implemented in accordance with the disclosed subject matter will be better appreciated with reference to the flow charts of FIGS. 12-14. While, for purposes of simplicity of explanation, the methodologies of FIGS. 12-14 are shown and described as a series of blocks, it is to be understood and appreciated that the claimed subject matter is not limited by the order of the blocks, as some blocks may occur in different orders or concurrently with other blocks from what is depicted and described herein. Moreover, not all illustrated blocks may be required to implement the methodologies disclosed herein. It should be further appreciated that some or all of the methodologies disclosed throughout this specification are capable of being stored on an article of manufacture to facilitate transporting and transferring such methodologies to electronic devices. The term article of manufacture, as used, is intended to encompass a computer program accessible from any computer-readable device, device in conjunction with a carrier, or storage medium.
Referring to fig. 12, a method for fabricating a monolithic IC chip including a resistive system memory is depicted in one or more embodiments. At 1202, method 1200 may include forming logic circuits of a processor core, including cache memory and cache controller circuits, on a substrate of an integrated circuit chip. In some embodiments, logic circuits may be formed in contiguous areas of the substrate, under the back end of the wafer layout design. However, non-contiguous layouts of at least portions of the logic circuitry within non-contiguous portions of the substrate are contemplated in other embodiments.
At 1204, the method 1200 may include forming, at least in part on the substrate and adjacent to the logic circuit, a memory access circuit for operating on the non-volatile resistive system memory. The memory access circuitry may include sense amplifiers, address decoders, multiplexers to couple power to a subset of the memory sub-arrays, and so forth. At 1206, method 1200 can additionally include forming circuitry for a system memory controller at least partially on the substrate.
At 1208, the method 1200 may include providing electrical contacts to communicatively connect the logic circuit and the memory access circuit with a system memory controller. At 1210, the method 1200 may include forming a non-volatile resistive memory array overlying the substrate and the logic and memory access circuits. The resistive memory array may be formed in a cross pattern between conductive lines of a monolithic IC chip. In addition, the resistive memory array may be formed using a CMOS logic process. In an embodiment, a plurality of crossbar arrays are formed, at least partially stacked on top of each other to form a 3D memory array. A 3D memory array overlies the substrate and logic circuitry, and at least partially overlies the memory access circuitry.
At 1212, the method 1200 may include connecting a memory access circuit to the non-volatile resistive memory array to form a plurality of independently accessible sub-arrays. In an embodiment, the size of each sub-array may include about 2 thousand (2K) by about 2K memory cells. In at least one embodiment, the total number of independently accessible sub-arrays may be about 64K sub-arrays. Still further, the sub-arrays may be arranged in tiles, with each tile connected to one processor core (or in other embodiments, multiple processor cores per tile). In an embodiment, each processor tile provides 1024 sub-arrays. In further embodiments, each processor tile provides 512 sub-arrays. In other embodiments, 128 sub-arrays are provided per processor tile. In yet another embodiment, each processor tile provides 64 sub-arrays.
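The sub-array figures above imply a memory capacity that can be worked out directly. The sketch below assumes one bit per memory cell and treats "about 2K" and "about 64K" as exact powers of two; both are simplifying assumptions, and the even division of sub-arrays among 64 tiles is a hypothetical example rather than a stated configuration.

```python
CELLS_PER_SIDE = 2 * 1024       # "about 2K by about 2K" memory cells per sub-array
TOTAL_SUB_ARRAYS = 64 * 1024    # "about 64K" independently accessible sub-arrays
BITS_PER_CELL = 1               # assumption: single-level cells

bytes_per_sub_array = CELLS_PER_SIDE * CELLS_PER_SIDE * BITS_PER_CELL // 8
total_bytes = bytes_per_sub_array * TOTAL_SUB_ARRAYS


def sub_arrays_per_tile(total_sub_arrays: int, tiles: int) -> int:
    """Evenly divide the sub-arrays among processor tiles (hypothetical layout)."""
    return total_sub_arrays // tiles


print(bytes_per_sub_array // 1024, "KiB per sub-array")
print(total_bytes // 2**30, "GiB across all sub-arrays")
print(sub_arrays_per_tile(TOTAL_SUB_ARRAYS, 64), "sub-arrays per tile with 64 tiles")
```

Under these assumptions each sub-array holds 512 KiB, 64K sub-arrays hold 32 GiB in total, and 64 tiles would each receive 1024 sub-arrays, consistent with the largest per-tile count listed above.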
In an alternative implementation, independently accessible sub-arrays may be connected to the memory controller and routing tiles independently of the processor tiles. In these embodiments, a similar number of sub-arrays per memory/router tile may be provided as described above.
At 1214, method 1200 may include configuring a memory controller to independently perform memory operations on respective sub-arrays in response to processor core or cache controller commands. The memory controller may be partitioned into separate memory bank controllers that each access a number, L_act, of sub-arrays. The number of sub-arrays activated by a bank controller, L_act, corresponds to a single memory bank and serves as the minimum fetch size for a memory request of the monolithic IC chip.
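The trade-off that L_act sets between the number of banks and the minimum fetch size can be sketched as follows. The function names and the contiguous bank-to-sub-array mapping are hypothetical; the one-byte access per sub-array follows the example given elsewhere in this description (a minimum sub-array fetch of 1 byte with L_act = 8).

```python
def bank_geometry(total_sub_arrays: int, l_act: int, bytes_per_sub_array_access: int = 1):
    """Return (number of banks, minimum fetch size in bytes) for a given L_act."""
    if total_sub_arrays % l_act:
        raise ValueError("total_sub_arrays must be a multiple of l_act")
    banks = total_sub_arrays // l_act
    min_fetch = l_act * bytes_per_sub_array_access
    return banks, min_fetch


def sub_arrays_for_bank(bank: int, l_act: int) -> range:
    """Hypothetical mapping: bank b owns sub-arrays b*L_act through (b+1)*L_act - 1."""
    return range(bank * l_act, (bank + 1) * l_act)


banks_8, fetch_8 = bank_geometry(64 * 1024, l_act=8)  # 8K banks, 8-byte minimum fetch
banks_4, fetch_4 = bank_geometry(64 * 1024, l_act=4)  # 16K banks, 4-byte minimum fetch
```

Halving L_act doubles the number of independently operable banks while halving the minimum fetch size, which is why smaller L_act values support more concurrent requests.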
Referring to fig. 13, depicted is a flow diagram of an example method for operating a processor in a multi-core chip including a resistive system memory, in one or more embodiments. At 1302, method 1300 may include implementing a process thread on logic circuits of a processor core of a multi-core chip. At 1304, the method 1300 can access a cache to meet the storage requirements of the process thread. At 1306, it is determined whether the cache access caused a cache hit. If a cache hit occurs, the method 1300 may proceed to 1308. Otherwise, method 1300 proceeds to 1316.
At 1308, method 1300 can include obtaining desired data from a cache, and at 1310, executing the process instruction that uses the desired cached data. At 1312, it is determined whether the process thread is complete. If the process thread is complete, the method 1300 may proceed to reference numeral 1322 and end. Otherwise, the method 1300 proceeds to reference numeral 1314, increments the instruction set of the process thread, and returns to reference numeral 1302.
At 1316, method 1300 generates a system memory access request (e.g., a read) for less than 128 bytes of data. The multi-core chip facilitates fetch sizes smaller than a standard DRAM page of 128 bytes. Thus, an access request may be for a single cache block (e.g., 64 bytes), a half cache block (e.g., 32 bytes), or an even smaller amount of data: e.g., 16 bytes, 8 bytes, 4 bytes, 1 byte, etc.
At 1318, the method 1300 may include issuing a memory request to the on-chip resistive system memory. At 1320, method 1300 may optionally include executing additional process threads while the memory request is pending. At 1322, the additional process threads may also optionally include generating and issuing one or more additional resistive memory access requests. In various embodiments, the additional memory access requests may include multi-threaded requests (issued as part of a separate hardware context or a separate process thread), separate memory addresses for scatter-gather memory instructions, or subsequent non-blocking memory access instructions. At 1324, method 1300 may include obtaining less than 128 bytes from the system memory in response to the memory request of reference numeral 1318. From 1324, the method 1300 may proceed to 1310 and complete the process instruction for which a cache miss was determined at reference numeral 1306. Variations of the method 1300 known in the art or made known to one of ordinary skill in the art through the context provided herein are considered to be within the scope of the present disclosure.
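The hit and miss paths of method 1300 can be summarized in a short sketch. It is a simplified model under assumptions: the function `access`, the dictionary cache, and the choice to keep the fetched block in the cache are illustrative, and the optional overlap with other threads at 1320-1322 is not modeled.

```python
DRAM_PAGE_BYTES = 128  # page size the text compares against


def access(address: int, cache: dict, memory: bytes, fetch_bytes: int = 8):
    """Return (data, bytes fetched from system memory) for one access."""
    if not 0 < fetch_bytes < DRAM_PAGE_BYTES:
        raise ValueError("fetch size must be between 1 and 127 bytes")
    block = address - address % fetch_bytes        # align to the fetch size
    if block in cache:                             # 1306: cache hit
        return cache[block], 0                     # 1308: data comes from the cache
    data = memory[block:block + fetch_bytes]       # 1316-1324: small request to system memory
    cache[block] = data                            # assumption: fetched block is cached
    return data, len(data)


memory = bytes(range(256))
cache = {}
first = access(10, cache, memory)    # miss: fetches the 8-byte block starting at address 8
second = access(12, cache, memory)   # hit: same block, nothing fetched from memory
```

Only 8 bytes cross the memory interface on the miss in this sketch, compared with 128 bytes if a full DRAM-style page had to be fetched.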
Referring to fig. 14, a flow diagram of an example method 1400 of fabricating a monolithic IC chip according to an alternative or additional embodiment is provided. At 1402, the method 1400 may include disposing logic circuitry on a substrate of an integrated circuit, the logic circuitry including a plurality of processing cores and a cache/controller. The logic circuit may be provided using CMOS process technology. At 1404, method 1400 can include providing independent subarray access circuitry for resistive system memory on a substrate. In some embodiments, the access circuit may be adjacent to the logic circuit and located near the associated back-end memory sub-array according to a semiconductor design layout. In other embodiments, the access circuitry may be integrated within the logic circuitry in a fine-grained grid implementation. In other embodiments, combinations of the above may be implemented.
At 1406, method 1400 may include providing circuitry including a plurality of memory controllers per processing core, and at 1408, method 1400 may include providing circuitry including at least one router device per processing core. In at least one embodiment, the memory controller of each processing core and router device may be organized into a controller tile that is independent of the processor tile, as described herein.
At 1410, method 1400 may include setting up command and data paths interconnecting the processing cores and the router device. At 1412, the method 1400 may include forming a resistive memory structure overlying the substrate, the resistive memory structure including individual sub-arrays of resistive memory. In an embodiment, the resistive memory structure may be formed using a CMOS process (e.g., a post process). At 1414, method 1400 may provide electrical connections between the sub-array groups and the respective memory controllers. In an embodiment, the memory subsystem is connected to individual controller tiles, which are interconnected to other controller tiles and processor tiles through command and data paths. In such implementations, the memory controller tiles act as endpoints in the NoC architecture, and the processor tiles may operate as clients to the memory controller tiles.
Additionally, at 1416, the method 1400 may include configuring the memory controllers to be responsive to processing core memory requests or cache controller memory requests of the plurality of processing cores. Further, at 1418, the method 1400 may include configuring the processing cores and the cache controllers to issue a plurality of concurrent memory requests to respective sub-arrays of the resistive system memory according to a multiple-data instruction toolset. Examples of multiple-data instruction toolsets may include a multithreading instruction set, a scatter-gather SIMD multithreading instruction toolset, or a non-blocking scatter-gather SIMD multithreading instruction toolset. In various embodiments, the memory controllers are configured to distribute concurrent memory requests from multiple processing cores or cache controllers to various memory banks so as to execute multiple memory requests concurrently.
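How distributing requests across banks affects concurrency can be shown with a small scheduling model. The block-interleaved address-to-bank mapping and the function names are assumptions made for illustration; the disclosure does not specify a particular mapping.

```python
from collections import defaultdict


def schedule(addresses, num_banks: int, fetch_bytes: int = 8) -> dict:
    """Assign each request to a bank by interleaving consecutive fetch blocks across banks."""
    queues = defaultdict(list)
    for addr in addresses:
        bank = (addr // fetch_bytes) % num_banks   # hypothetical interleaved mapping
        queues[bank].append(addr)
    return dict(queues)


def rounds_needed(queues: dict) -> int:
    """Banks operate in parallel, so the deepest per-bank queue sets the number of rounds."""
    return max((len(q) for q in queues.values()), default=0)


spread = schedule([0, 8, 16, 24], num_banks=4)       # lands on four different banks
clustered = schedule([0, 32, 64, 96], num_banks=4)   # all four map to bank 0
```

With this mapping the spread requests complete in one round, while the clustered requests serialize into four rounds on a single bank, so the achievable parallelism depends on how requests fall across banks.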
In various embodiments of the present disclosure, the disclosed memory architecture may be used as a stand-alone or integrated embedded memory device with a CPU or microcomputer. For example, some embodiments may be implemented as part of a computer memory (e.g., random access memory, cache memory, read-only memory, storage memory, etc.). Other embodiments may be implemented as components of a portable memory device, for example.
FIG. 15 illustrates a block diagram of an example operating and control environment 1500 for a memory array 1502 of memory cell arrays according to aspects of the subject disclosure. In at least one aspect of the present disclosure, the memory array 1502 can include memory selected from a variety of memory cell technologies. In at least one implementation, the memory array 1502 may comprise two-terminal memory technology arranged in a compact two-dimensional or three-dimensional architecture. As disclosed herein, example architectures may include a 1T1R memory array and a 1TnR memory array (or 1TNR memory array). Suitable two-terminal memory technologies may include resistive switching memory, conductive bridge memory, phase change memory, organic memory, magnetoresistive memory, or the like, or suitable combinations of the foregoing. In some embodiments, the memory array 1502 may be a memory bank comprising a plurality of independently accessible memory sub-arrays. In further embodiments, the memory array 1502 may serve as an embedded main memory for a multi-core IC chip, as described herein.
In addition, the operating and control environment 1500 may include a row controller 1504. The row controller 1504 may be formed adjacent to, and electrically connected to, word lines of the memory array 1502. Also using the control signals of the reference and control signal generator 1518, the row controller 1504 can select a particular row of memory cells having the appropriate select voltage. In addition, the row controller 1504 may facilitate program, erase, or read operations by applying appropriate voltages on selected word lines.
Buffer 1512 may be configured to receive write data, receive erase instructions, receive status or maintenance instructions, output read data, output status information, and receive address data and command data, as well as address data for the corresponding instructions. The address data may be transferred to row controller 1504 and column controller 1506 through address register 1510. Further, input data is transmitted to the memory array 1502 via a signal input line between the sense amplifier 1508 and the input/output buffer 1512, and output data is received from the memory array 1502 via a signal output line from the sense amplifier 1508 to the buffer 1512. Input data may be received from the processing core or the cache controller, and output data may be transferred to the processing core/cache controller through the memory access circuitry.
Commands received from the processing cores or cache controllers may be provided to command interface 1516. The command interface 1516 may be configured to receive internal control signals from the processing core/cache controller and determine whether the data input to the input/output buffer 1512 is write data, a command, or an address. Where applicable, the input commands may be passed to an optional state machine 1520.
Optional state machine 1520 may be configured to manage programming and reprogramming of memory array 1502 (as well as other memory banks of a multi-bank memory array). The instructions provided to the state machine 1520 are implemented in accordance with a control logic configuration that enables the state machine to manage reads, writes, erases, data inputs, data outputs, and other functions associated with the memory cell array 1502. In some aspects, state machine 1520 may send and receive acknowledgements and negative acknowledgements regarding successful receipt or execution of various commands. In further embodiments, state machine 1520 may decode and implement state-related commands, decode and implement configuration commands, and so forth.
To perform read, write, erase, input, output, etc., functions, state machine 1520 may control clock source 1508 or reference and control signal generator 1518. Control of clock source 1508 may cause the output pulses to be configured to facilitate row controller 1504 and column controller 1506 to implement particular functions. The output pulse may be transmitted to a selected bit line, e.g., by column controller 1506, or to a word line, e.g., by row controller 1504. In some implementations, the state machine 1520 can be replaced by a memory controller for performing memory operations on the memory array 1502 as described herein. In alternative embodiments, state machine 1520 may function as a memory controller and be configured to implement the functionality of the memory controller of the present disclosure.
The illustrated aspects of the disclosure may also be practiced in distributed computing environments where certain tasks are performed by multiple monolithic IC chips that include embedded resistive memory that are linked through a communications network. In a distributed computing environment, program modules or stored information, instructions or the like may be located in both local and remote memory storage devices.
As used herein, the terms "component," "system," "architecture," or the like are intended to refer to a computer or electronic related entity, either hardware, a combination of hardware and software, software (e.g., in execution), or firmware. For example, a component may be one or more transistors, memory cells, an arrangement of transistors or memory cells, a gate array, a programmable gate array, an application specific integrated circuit, a controller, a processor, a process running on a processor, an object, an executable, a program or application accessed or interfaced with semiconductor memory, a computer, or the like, or a suitable combination thereof. The component may include erasable programming (e.g., process instructions at least partially stored in erasable memory) or hard programming (e.g., process instructions burned into non-erasable memory at the time of manufacture).
For example, processes executed from memory and a processor may both be components. As another example, an architecture may include an arrangement of electronic hardware (e.g., parallel or series transistors), processing instructions, and a processor, which implements the processing instructions in a manner suitable for the electronic hardware arrangement. Further, an architecture may include a single component (e.g., a transistor, a gate array, etc.) or an arrangement of components (e.g., a series or parallel arrangement of transistors, gate arrays connected to program circuitry, power supply lines, electrical ground, input and output signal lines, etc.). A system may include one or more components and one or more architectures. One example system may include a switching block architecture that includes cross input/output lines and pass gate transistors, as well as power supplies, signal generators, communication buses, controllers, I/O interfaces, address registers, and so forth. It is to be understood that some overlap in the definitions is contemplated and that an architecture or system may be a standalone component or a component of another architecture, system, etc.
In addition to the foregoing, the disclosed subject matter may be implemented as a method, apparatus, or article of manufacture using typical manufacturing, programming, or engineering techniques to produce hardware, firmware, software, or any suitable combination thereof to control an electronic device to implement the disclosed subject matter. The terms "apparatus" and "article of manufacture" as used herein are intended to encompass an electronic device, a semiconductor device, a computer, or a computer program accessible from any computer-readable device, carrier, or media. The computer-readable medium may include a hardware medium or a software medium. Further, the medium may include a non-transitory medium or a transmission medium. In one example, the non-transitory medium may include a computer-readable hardware medium. Specific examples of computer-readable hardware media may include, but are not limited to, magnetic storage devices (e.g., hard disks, floppy disks, magnetic strips), optical disks (e.g., compact disks (CDs), digital versatile disks (DVDs)), smart cards, and flash memory devices (e.g., cards, sticks, key drives). The computer-readable transmission medium may include carrier waves and the like. Of course, those skilled in the art will recognize that many modifications may be made to this configuration without departing from the scope or spirit of the disclosed subject matter.
What has been described above includes examples of the subject innovation. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the subject innovation, but one of ordinary skill in the art may recognize that many further combinations and permutations of the subject innovation are possible. Accordingly, the disclosed subject matter is intended to embrace all such alterations, modifications and variations that fall within the spirit and scope of the present disclosure. Furthermore, where the terms "includes", "including", "has" or "having" and variants thereof are used in the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when used as a transitional word in a claim.
Moreover, the word "exemplary" is used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other aspects or designs. Rather, use of the word "exemplary" is intended to present concepts in a concrete fashion. As used in this application, the term "or" is intended to mean an inclusive "or" rather than an exclusive "or". That is, unless specified otherwise, or clear from context, "X employs A or B" is intended to mean any of the natural inclusive permutations. That is, if X employs A; X employs B; or X employs both A and B, then "X employs A or B" is satisfied under any of the foregoing instances. In addition, the articles "a" and "an" as used in this application and the appended claims should generally be construed to mean "one or more" unless specified otherwise or the context clearly dictates otherwise.
Moreover, some portions of the detailed description have been presented in terms of algorithms or processing operations on data bits within an electronic memory. These procedural descriptions or representations are the mechanisms used by those skilled in the art to effectively convey the substance of their work to others skilled in the art. A process is here, and generally, conceived to be a self-consistent sequence of acts leading to a desired result. The acts are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical and/or magnetic signals capable of being stored, transferred, combined, compared, and/or otherwise manipulated.
It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like. It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise, or as is apparent from the preceding discussion, it is appreciated that throughout the disclosed subject matter, discussions utilizing terms such as processing, computing, copying, modeling, determining, or transmitting, refer to the action and processes of a processing system, and/or similar consumer or industrial electronic device or machine, that manipulates and transforms data or signals represented as physical (electrical or electronic) quantities within the electronic device's circuits, registers, or memories and other data or signals similarly represented as physical quantities within the machine or computer system memories or registers or other such information storage, transmission, and/or display devices.
In regard to the various functions performed by the above described components, architectures, circuits, processes, and the like, the terms (including a reference to a "means") used to describe such components are intended to correspond, unless otherwise indicated, to any component which performs the specified function of the described component (e.g., a functional equivalent), even though not structurally equivalent to the disclosed structure, which performs the function in the herein illustrated exemplary aspects of the embodiments. In addition, while a particular feature may have been disclosed with respect to only one of several implementations, such feature may be combined with one or more other features of the other implementations as may be desired and advantageous for any given or particular application. It will also be recognized that the embodiments include a system as well as a computer-readable medium having computer-executable instructions for performing the acts and/or events of the various processes.
Claims (20)
1. An integrated circuit device, comprising:
a plurality of processing cores formed on a substrate of the integrated circuit device;
a resistive memory array structure formed over the substrate of the integrated circuit device, the resistive memory array structure comprising a plurality of resistive memory sub-arrays, each resistive memory sub-array comprising a non-volatile two-terminal resistive switching memory cell;
access circuitry formed at least in part on the substrate of the integrated circuit device, the access circuitry providing independent operational access to respective resistive memory sub-arrays of the plurality of resistive memory sub-arrays; and
a plurality of memory controllers including a first set of memory controllers communicatively coupled with a first processing core of the plurality of processing cores and operable to receive a first memory instruction from the first processing core and execute the first memory instruction on a first set of resistive memory sub-arrays of the plurality of resistive memory sub-arrays in response to the first memory instruction, and a second set of memory controllers communicatively coupled with a second processing core of the plurality of processing cores and operable to receive a second memory instruction from the second processing core and execute the second memory instruction on a second set of resistive memory sub-arrays of the plurality of resistive memory sub-arrays in response to the second memory instruction, wherein the first memory instruction or the second memory instruction is a memory read that returns less than 128 bytes of data.
2. The integrated circuit device of claim 1, wherein the resistive memory array structure is at least partially overlaid on the plurality of processing cores, and further comprising a cache memory and a cache controller to service a data demand of the first processing core or the second processing core, wherein the first memory instruction or the second memory instruction originates from the cache controller in response to a cache miss associated with servicing the data demand.
3. The integrated circuit device of claim 1, further comprising:
a first router device associated with the first processing core and the first set of memory controllers;
a second router device associated with the second processing core and the second set of memory controllers; and
a command and data path interconnecting the first router device and the second router device, wherein at least one of:
the first router device decodes a memory address included in the first memory instruction that is addressed within the second set of the plurality of resistive memory sub-arrays and forwards at least a portion of the first memory instruction associated with the memory address to the second router device over the command and data path for execution by the second set of memory controllers; or
the second router device decodes a second memory address included in the second memory instruction that is addressed within the first set of the plurality of resistive memory sub-arrays and forwards at least a portion of the second memory instruction associated with the second memory address to the first router device over the command and data path for execution by the first set of memory controllers.
4. The integrated circuit device of claim 1, wherein the plurality of memory controllers are capable of concurrently servicing a number of main memory requests from the plurality of processing cores at least equal to the number of the plurality of resistive memory sub-arrays.
5. The integrated circuit device of claim 4, wherein the integrated circuit device is organized on the substrate into a number of computing tiles, wherein a computing tile of the number of computing tiles includes one of the plurality of processing cores, includes the first set of memory controllers, and includes access circuitry dedicated to and operably connected with the first set of resistive memory sub-arrays of the plurality of resistive memory sub-arrays, wherein a number of the first set of resistive memory sub-arrays of the plurality of resistive memory sub-arrays associated with the computing tile is selected from a group consisting of: about 64, about 128, about 256, about 512, and about 1024.
6. The integrated circuit device of claim 4, wherein the plurality of processing cores are selected from the group consisting of: about 16 or more processing cores, about 32 or more processing cores, about 64 or more processing cores, about 128 or more processing cores, about 256 or more processing cores, about 512 or more processing cores, and about 1024 or more processing cores.
7. The integrated circuit device of claim 4, wherein each of the plurality of processing cores is capable of issuing a single outstanding memory instruction, and the plurality of memory controllers are configured to service a number of concurrent memory instructions at least equal to the number of processing cores.
8. The integrated circuit device of claim 4, wherein each of the plurality of processing cores is a multi-threaded processing core configured to issue a second number, x, of outstanding memory instructions, and the plurality of memory controllers are configured to service a number of concurrent memory instructions equal to the number of the plurality of processing cores multiplied by x.
9. The integrated circuit device of claim 8, wherein each of the plurality of processing cores comprises n-way scatter-gather Single Instruction Multiple Data (SIMD) process instructions to facilitate each of the plurality of processing cores to issue x*n outstanding memory instructions, wherein the plurality of memory controllers are configured to service a number of concurrent memory instructions equal to the number of the plurality of processing cores multiplied by x*n.
10. The integrated circuit device of claim 9, wherein each processing core includes a non-blocking scatter-gather SIMD process instruction that organizes memory instructions into blocking and non-blocking memory instructions, including up to z consecutive non-blocking scatter-gather memory instructions, to facilitate each of the plurality of processing cores to issue up to z*x*n outstanding memory instructions, wherein the plurality of memory controllers are configured to service a number of concurrent memory instructions equal to the number of the plurality of processing cores multiplied by z*x*n.
11. The integrated circuit device of claim 1, wherein the access circuit is divided into a number of access circuit portions equal to a number of the plurality of resistive memory sub-arrays, each access circuit portion facilitating operational access to a single one of the plurality of resistive memory sub-arrays by a single one of the plurality of memory controllers.
12. The integrated circuit device of claim 1, wherein the memory read returns bytes of data selected from the group consisting of: 1 byte, 2 bytes, 4 bytes, 8 bytes, 16 bytes, 32 bytes, and 64 bytes.
13. The integrated circuit device of claim 1, wherein:
the first set of memory controllers includes a first memory controller and a second memory controller;
the first memory instruction includes a set of memory addresses located within a first memory bank of the first one of the plurality of resistive memory sub-arrays controlled by the first memory controller, and the first memory instruction includes a second set of memory addresses located within a second memory bank of the second one of the plurality of resistive memory sub-arrays controlled by the second memory controller;
the first memory controller activates a resistive memory sub-array associated with the first memory bank in response to the first memory instruction and retrieves data from a data location within at least one of the activated memory sub-arrays defined by the set of memory addresses, wherein the data location includes a first amount of data; and
the second memory controller activates a resistive memory sub-array associated with the second memory bank in response to the first memory instruction and retrieves data from a second data location within at least one of the activated memory sub-arrays defined by the second set of memory addresses, wherein the second data location includes a second amount of data, further wherein the first amount of data or the second amount of data is selected from the group consisting of: 1 byte, 2 bytes, 4 bytes, and 8 bytes of data.
14. A method of fabricating an integrated circuit device, comprising:
providing a logic circuit on a substrate of a chip, the logic circuit including a plurality of processing cores and a cache memory for the processing cores;
providing, at least in part, on the substrate of the chip, access circuitry for individual sub-arrays of resistive system memory;
providing, at least in part, on the substrate of the chip, circuitry comprising a plurality of memory controllers for each of the plurality of processing cores;
forming a non-volatile two-terminal resistive memory device including an independent sub-array of the resistive system memory overlying the substrate and overlying at least a portion of the logic circuitry, the access circuitry, or circuitry including the plurality of memory controllers;
forming electrical connections between respective portions of access circuitry on the substrate of the chip and each individual sub-array of resistive system memory overlaid on the substrate of the chip;
forming electrical connections between circuitry comprising each memory controller and respective portions of the access circuitry;
providing a communication path between the logic circuitry including the plurality of processing cores and circuitry including the plurality of memory controllers; and
configuring a memory controller of the plurality of memory controllers to implement memory instructions on an associated independent sub-array of the resistive system memory in response to main memory requests originating from the cache memory of the logic circuit.
15. The method of claim 14, further comprising: providing a plurality of router devices within the logic circuitry comprising the plurality of processing cores, and providing command and data paths interconnecting the router devices.
16. The method of claim 15, wherein the command and data path is configured to communicate memory commands between router devices of the plurality of router devices and to communicate data associated with the memory commands between the router devices.
17. The method of claim 14, further comprising: configuring the processing core or the cache memory to issue a plurality of concurrent memory requests to respective ones of the plurality of memory controllers according to a multi-process instruction set selected from the group consisting of: an n-way multithreading process set, an n*x-way scatter-gather multithreading process set, and a z*n*x-way non-blocking scatter-gather multithreading process set, wherein n, x, and z are suitable positive integers.
18. An integrated circuit device, comprising:
a plurality of processor tiles, wherein a processor tile of the plurality of processor tiles comprises a processing core, a cache memory and cache controller, a memory controller, and a multiple data memory instruction set, wherein the plurality of processor tiles are formed on a substrate of the integrated circuit device;
a resistive memory array structure formed over the substrate of the integrated circuit device and at least partially overlying the plurality of processor tiles, the resistive memory array structure comprising a plurality of independently addressable sub-arrays formed from non-volatile two-terminal resistive switching memory, wherein a portion of the independently addressable sub-arrays are managed by the memory controller;
access circuitry formed at least in part on the substrate of the integrated circuit device, the access circuitry interconnecting the memory controller with the portions of the independently addressable subarrays managed by the memory controller; and
a command and data bus interconnecting respective ones of the plurality of processor tiles, wherein the resistive memory array structure serves as system memory for the processing cores of the processor tiles.
19. The integrated circuit device of claim 18, wherein the memory controller is responsive to a memory request issued by the cache controller resulting from a cache miss, retrieves data from the portion of the independently addressable subarrays managed by the memory controller in response to the memory request, and submits the data to the cache controller or the processor tile in response to the memory request.
20. The integrated circuit device of claim 19, wherein the memory request defines a data location that is less than 128 bytes in size.
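As an illustration only, the concurrency arithmetic of claims 7 to 10 and the router forwarding decision of claim 3 can be sketched in a few lines of Python. The names `CoreConfig`, `required_concurrency`, and `route` are hypothetical, and the flat linear address map used by `route` (fixed-size sub-arrays assigned to tiles in consecutive blocks) is an assumption made for this sketch; the claims do not specify an address layout.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CoreConfig:
    """Hypothetical per-core issue parameters, named after the claim variables."""
    x: int = 1  # outstanding memory instructions per multithreaded core (claim 8)
    n: int = 1  # scatter-gather SIMD width (claim 9)
    z: int = 1  # consecutive non-blocking scatter-gather instructions (claim 10)


def outstanding_per_core(cfg: CoreConfig) -> int:
    """Upper bound on outstanding memory instructions one core can issue."""
    return cfg.z * cfg.x * cfg.n


def required_concurrency(num_cores: int, cfg: CoreConfig) -> int:
    """Concurrent memory instructions the memory controllers must service chip-wide."""
    return num_cores * outstanding_per_core(cfg)


def route(address: int, local_tile: int, subarrays_per_tile: int,
          bytes_per_subarray: int) -> Tuple[str, int, int]:
    """Decode an address into (action, tile, sub-array index within that tile).

    Assumed layout: sub-array k holds bytes [k * size, (k + 1) * size), and
    tile t owns sub-arrays [t * subarrays_per_tile, (t + 1) * subarrays_per_tile).
    """
    subarray = address // bytes_per_subarray
    tile = subarray // subarrays_per_tile
    # A router services the request locally when the address falls in its own
    # tile's sub-arrays; otherwise it forwards the request to the owning tile.
    action = "local" if tile == local_tile else "forward"
    return action, tile, subarray % subarrays_per_tile
```

Under these assumptions, a 64-core device with `x = 4` and `n = 8` would need controllers able to service `required_concurrency(64, CoreConfig(x=4, n=8))` concurrent memory instructions, and `route` returns `"forward"` whenever the decoded tile differs from the router's own tile.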
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110982177.6A CN115881188A (en) | 2021-08-25 | 2021-08-25 | Integrating resistive memory systems into multi-core CPU die to achieve large-scale memory parallelism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110982177.6A CN115881188A (en) | 2021-08-25 | 2021-08-25 | Integrating resistive memory systems into multi-core CPU die to achieve large-scale memory parallelism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115881188A (en) | 2023-03-31 |
Family
ID=85762323
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110982177.6A Pending CN115881188A (en) | 2021-08-25 | 2021-08-25 | Integrating resistive memory systems into multi-core CPU die to achieve large-scale memory parallelism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115881188A (en) |
- 2021-08-25: CN application CN202110982177.6A (CN115881188A), active, Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||