KR20220078566A

KR20220078566A - memory-based processor

Info

Publication number: KR20220078566A
Application number: KR1020227008116A
Authority: KR
Inventors: 엘라드 시티; 엘리아드 힐렐; 샤니 브라우도; 데이비드 샤미르; 갈 다얀
Original assignee: 뉴로블레이드, 리미티드.
Priority date: 2019-08-13
Filing date: 2020-08-13
Publication date: 2022-06-10
Also published as: WO2021028723A2; WO2021028723A3; EP4010808A2; EP4010808A4; TW202122993A; CN114586019A

Abstract

In some embodiments, an integrated circuit may include a substrate and a memory array disposed on the substrate, the memory array including a plurality of discrete memory banks. The integrated circuit may also include a processing array disposed on the substrate, the processing array including a plurality of processor subunits, each processor subunit of the plurality of processor subunits being associated with one or more discrete memory banks among the plurality of discrete memory banks. The integrated circuit may also include a controller configured to implement at least one security measure with respect to the operation of the integrated circuit and to take one or more remedial actions if the at least one security measure is triggered.

Description

Memory-based processor

The present disclosure relates to devices that enable memory-intensive operations. In particular, the present disclosure relates to hardware chips that include processing elements coupled to dedicated memory banks. The present disclosure also relates to devices for improving the power efficiency and speed of memory chips. In particular, the present disclosure relates to systems and methods for implementing partial refresh, or even no refresh, on a memory chip. The present disclosure also relates to memory chips of selectable capacity and to dual-port capabilities on a memory chip.

This application claims priority to U.S. Provisional Application No. 62/886,328, filed on August 13, 2019; U.S. Provisional Application No. 62/907,659, filed on September 29, 2019; U.S. Provisional Application No. 62/971,912, filed on February 7, 2020; and U.S. Provisional Application No. 62/983,174, filed on February 28, 2020. The contents of these applications are incorporated herein by reference in their entirety.

As processor speeds and memory sizes both continue to increase, a significant limitation on effective processing speed is the von Neumann bottleneck. The von Neumann bottleneck results from the throughput limitations of conventional computer architecture. In particular, data transfer from memory to the processor is often bottlenecked compared to the actual computations undertaken by the processor. Accordingly, the number of clock cycles needed to read from and write to memory increases significantly with memory-intensive processes. These clock cycles result in lower effective processing speeds because they are consumed by reading from and writing to memory and cannot be used to perform operations on data. Moreover, the computational bandwidth of the processor is generally larger than the bandwidth of the buses that the processor uses to access the memory.
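The gap between computational bandwidth and bus bandwidth can be made concrete with a simple upper bound, in the spirit of what is commonly called a roofline estimate. This is a standard modeling technique and is not part of the disclosure; the function name and all figures below are hypothetical and chosen only for illustration.

```python
def attainable_ops_per_s(peak_ops_per_s, bus_bytes_per_s, ops_per_byte):
    """Upper bound on throughput: limited by compute or by the memory bus."""
    return min(peak_ops_per_s, bus_bytes_per_s * ops_per_byte)

# Hypothetical figures, not taken from the disclosure.
PEAK_OPS = 1e12   # raw compute capability, operations per second
BUS_BW = 25e9     # memory-to-processor bus bandwidth, bytes per second

# A memory-intensive task performing 0.5 operations per byte fetched.
memory_bound = attainable_ops_per_s(PEAK_OPS, BUS_BW, 0.5)
# A compute-heavy task performing 100 operations per byte fetched.
compute_bound = attainable_ops_per_s(PEAK_OPS, BUS_BW, 100.0)

print(f"memory-intensive task: {memory_bound:.3g} ops/s")
print(f"compute-heavy task:    {compute_bound:.3g} ops/s")
```

Under these assumed numbers the memory-intensive task is capped by the bus long before the processor's compute capability is reached, which is the situation the paragraph above describes.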

These bottlenecks are particularly pronounced for memory-intensive processes, such as neural networks and other machine learning algorithms, database construction, index searching, and query processing, as well as for other tasks that involve more read and write operations than data processing operations.

In addition, the rapid growth in the volume and granularity of available digital data has created opportunities to develop machine learning algorithms and has enabled new technologies. However, it has also brought cumbersome challenges to the world of databases and parallel computing. For example, the rise of social media and the Internet of Things (IoT) is creating digital data at an unprecedented rate. This new data can be used to create algorithms for a variety of purposes, ranging from new advertising techniques to more precise control methods for industrial processes. However, the new data has been difficult to store, process, analyze, and handle.

New data sources can be massive, sometimes on the order of petabytes to zettabytes. Moreover, the growth rate of these data sources may exceed data processing capabilities. Therefore, data scientists have turned to parallel data processing techniques to tackle these challenges. In an effort to increase computational power and handle massive amounts of data, scientists have attempted to create systems and methods capable of parallel, intensive computing. However, existing systems and methods have not kept up with data processing requirements, often because the techniques employed are limited by their demand for additional resources for data management, integration of separated data, and analysis of the separated data.

To facilitate the handling of large data sets, engineers and scientists now seek to improve the hardware used to analyze data. For example, new semiconductor processors or chips (such as those described herein) may be designed specifically for data-intensive tasks by incorporating memory and processing functions in a single substrate fabricated in technologies more suited to memory operations than to arithmetic computation. With integrated circuits designed specifically for data-intensive tasks, it is possible to meet the new data processing requirements. Nonetheless, this approach to the data processing of large data sets requires solving new problems in chip design and fabrication. For instance, if new chips designed for data-intensive tasks are manufactured with the fabrication techniques and architectures used for common chips, they will suffer from poor performance and/or unacceptable yields. In addition, if the new chips are designed to operate with existing data handling methods, their performance will be poor because those methods limit the chips' ability to handle parallel operations.

In some embodiments, an integrated circuit may include a substrate and a memory array disposed on the substrate, the memory array including a plurality of discrete memory banks. The integrated circuit may also include a processing array disposed on the substrate, the processing array including a plurality of processor subunits each associated with one or more discrete memory banks among the plurality of discrete memory banks. The integrated circuit may also include a controller configured to implement at least one security measure with respect to the operation of the integrated circuit and to take one or more remedial actions if the at least one security measure is triggered.

Disclosed embodiments may also include a method of protecting an integrated circuit from counterfeiting, the method including: implementing, using a controller associated with the integrated circuit, at least one security measure with respect to the operation of the integrated circuit; and taking one or more remedial actions if the at least one security measure is triggered, wherein the integrated circuit includes: a substrate; a memory array disposed on the substrate and including a plurality of discrete memory banks; and a processing array disposed on the substrate and including a plurality of processor subunits each associated with one or more discrete memory banks among the plurality of discrete memory banks.

Disclosed embodiments may include an integrated circuit including: a substrate; a memory array disposed on the substrate and including a plurality of discrete memory banks; a processing array disposed on the substrate and including a plurality of processor subunits each associated with one or more discrete memory banks among the plurality of discrete memory banks; and a controller configured to implement at least one security measure with respect to the operation of the integrated circuit, wherein the at least one security measure includes duplicating program code in at least two different memory portions.
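One way to picture a security measure built on duplicated program code is the toy model below. The text above only states that program code is duplicated in at least two different memory portions; the comparison of the two copies by digest, the class name `DuplicatedCodeGuard`, and the "lock the device" remedy are all assumptions made for illustration.

```python
import hashlib

class DuplicatedCodeGuard:
    """Toy model: program code is kept in two memory portions and compared."""

    def __init__(self, program: bytes):
        # Two independent copies stand in for two different memory portions.
        self.portion_a = bytearray(program)
        self.portion_b = bytearray(program)
        self.locked = False  # hypothetical remedial action: lock the device

    def check(self) -> bool:
        """Return True if both copies agree; otherwise apply the remedy."""
        digest_a = hashlib.sha256(self.portion_a).digest()
        digest_b = hashlib.sha256(self.portion_b).digest()
        if digest_a != digest_b:
            # The security measure is triggered: take a remedial action.
            self.locked = True
            return False
        return True

guard = DuplicatedCodeGuard(b"\x01\x02\x03\x04")
print(guard.check())          # copies still match
guard.portion_a[0] ^= 0xFF    # simulate tampering with one copy
print(guard.check(), guard.locked)
```

A mismatch between the copies is only one conceivable trigger; an actual controller could equally compare the results of executing both copies.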

In some embodiments, a distributed processor memory chip is provided that includes a substrate, a memory array disposed on the substrate, a processing array disposed on the substrate, a first communication port, and a second communication port. The memory array may include a plurality of discrete memory banks. The processing array may include a plurality of processor subunits each associated with one or more discrete memory banks among the plurality of discrete memory banks. The first communication port may be configured to establish a communication connection between the distributed processor memory chip and an external entity other than another distributed processor memory chip. The second communication port may be configured to establish a communication connection between the distributed processor memory chip and a first additional distributed processor memory chip.

In some embodiments, a method of transferring data between a first distributed processor memory chip and a second distributed processor memory chip may include: determining, using a controller associated with at least one of the first distributed processor memory chip and the second distributed processor memory chip, whether a first processor subunit among a plurality of processor subunits disposed on the first distributed processor memory chip is ready to transfer data to a second processor subunit included in the second distributed processor memory chip; and, after determining that the first processor subunit is ready to transfer the data to the second processor subunit, initiating the transfer of the data from the first processor subunit to the second processor subunit using a clock enable signal controlled by the controller.
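The two steps of this transfer method (a readiness determination, then a transfer gated by a clock enable signal) can be sketched as a small software model. This is a behavioral illustration only: the `Subunit` and `Controller` classes, the use of a non-empty outbox as the readiness criterion, and the boolean `clock_enabled` flag are assumptions, not details from the disclosure.

```python
from dataclasses import dataclass, field

@dataclass
class Subunit:
    outbox: list = field(default_factory=list)  # data waiting to be sent
    inbox: list = field(default_factory=list)   # data received
    clock_enabled: bool = False                 # models the clock enable signal

class Controller:
    """Hypothetical controller shared by the two chips."""

    def ready_to_send(self, sender: Subunit) -> bool:
        # Assumed readiness criterion: the sender has data queued.
        return bool(sender.outbox)

    def transfer(self, sender: Subunit, receiver: Subunit) -> bool:
        # Step 1: determine whether the first subunit is ready to transfer.
        if not self.ready_to_send(sender):
            return False
        # Step 2: initiate the transfer by asserting the clock enable signal.
        sender.clock_enabled = True
        receiver.inbox.append(sender.outbox.pop(0))
        sender.clock_enabled = False
        return True

first, second = Subunit(), Subunit()   # subunits on two different chips
controller = Controller()
print(controller.transfer(first, second))   # nothing queued yet
first.outbox.append(0x2A)
print(controller.transfer(first, second), second.inbox)
```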

In some embodiments, a memory unit may include: a memory array including a plurality of memory banks; at least one controller configured to control at least one aspect of read operations with respect to the plurality of memory banks; and at least one zero value detection logic unit configured to detect a multi-bit zero value stored at a particular address of the plurality of memory banks, wherein the at least one controller and the at least one zero value detection logic unit are configured to send a zero value indicator to one or more circuits outside the memory unit in response to a zero value detection by the at least one zero value detection logic unit.

Some embodiments may include a method of detecting a zero value at a particular address of a plurality of discrete memory banks, the method including: receiving, from a circuit outside a memory unit, a request to read data stored at an address of the plurality of discrete memory banks; activating, in response to the received request, a zero value detection logic unit so that a controller detects a zero value at the received address; and transmitting, by the controller, a zero value indicator to the circuit in response to the zero value detection by the zero value detection logic unit.

Some embodiments may include a non-transitory computer-readable medium storing a set of instructions executable by a controller of a memory unit to cause the memory unit to detect a zero value at a particular address of a plurality of discrete memory banks, the instructions causing operations including: receiving, from a circuit outside the memory unit, a request to read data stored at an address of the plurality of discrete memory banks; activating, in response to the received request, a zero value detection logic unit so that the controller detects a zero value at the received address; and transmitting, by the controller, a zero value indicator to the circuit in response to the zero value detection by the zero value detection logic unit.
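The zero value detection flow described in the preceding paragraphs can be sketched as follows. The class and method names are hypothetical, and one point is an explicit assumption: this sketch returns only the indicator (and no data word) when a zero is detected, which the text above does not state.

```python
class MemoryUnit:
    """Toy memory unit with zero value detection on the read path."""

    def __init__(self, banks):
        self.banks = banks  # list of banks, each a list of multi-bit words

    def read(self, bank, address):
        """Handle a read request from a circuit outside the memory unit.

        Returns (zero_indicator, data). In this sketch, data is None when
        the zero indicator is set (an assumption, see the lead-in).
        """
        word = self.banks[bank][address]
        if word == 0:
            # Zero value detection logic fired: send only the indicator.
            return True, None
        return False, word

unit = MemoryUnit([[0, 7, 0], [3, 0, 9]])
print(unit.read(0, 0))
print(unit.read(1, 2))
```

An external circuit receiving the indicator can skip work that depends on the value, for example a multiplication whose operand is known to be zero.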

In some embodiments, a memory unit may include: one or more memory banks; a bank controller; and an address generator, wherein the address generator is configured to provide to the bank controller a current address of a current row to be accessed in an associated memory bank, to determine a predicted address of a next row to be accessed in the associated memory bank, and to provide the predicted address to the bank controller before a read operation on the current row associated with the current address is completed.
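A minimal sketch of the address-generator idea follows. The text above does not say how the predicted address is determined, so the "next sequential row" predictor used here is purely an assumption, as are the function names. The sketch counts how many accesses land on a row that had already been handed to the bank controller as a prediction.

```python
def sequential_predictor(current_row, num_rows):
    """Assumed prediction policy: the next row follows the current one."""
    return (current_row + 1) % num_rows

def count_predicted_hits(access_sequence, num_rows):
    """Count accesses whose row was predicted during the previous read."""
    hits = 0
    predicted = None
    for row in access_sequence:
        if row == predicted:
            hits += 1  # the bank controller already knew this row was next
        # The prediction for the following access is issued before the
        # read of the current row completes.
        predicted = sequential_predictor(row, num_rows)
    return hits

accesses = [0, 1, 2, 3, 7, 8]
print(count_predicted_hits(accesses, num_rows=16))
```

With this access pattern, four of the six accesses were predicted; a jump to a non-adjacent row (from 3 to 7) is a miss under the assumed policy.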

In some embodiments, a memory unit may include one or more memory banks, each memory bank of the one or more memory banks including: a plurality of rows; a first row controller configured to control a first subset of the plurality of rows; a single data input for receiving data to be stored in the plurality of rows; and a single data output for providing data retrieved from the plurality of rows.

In some embodiments, a distributed processor memory chip may include: a substrate; a memory array disposed on the substrate and including a plurality of discrete memory banks; a processing array disposed on the substrate and including a plurality of processor subunits each associated with a corresponding dedicated discrete memory bank among the plurality of discrete memory banks; a first plurality of buses, each connecting one of the plurality of processor subunits to its corresponding dedicated memory bank; and a second plurality of buses, each connecting one of the plurality of processor subunits to another one of the plurality of processor subunits. At least one of the memory banks may include at least one DRAM memory mat disposed on the substrate. At least one of the processor subunits may include one or more logic elements associated with the at least one memory mat. The at least one memory mat and the one or more logic elements may be configured to serve as a cache for one or more of the plurality of processor subunits.

In some embodiments, a method of executing at least one instruction in a distributed processor memory chip may include: retrieving one or more data values from a memory array of the distributed processor memory chip; storing the one or more data values in a register formed in a memory mat of the distributed processor memory chip; and accessing the one or more data values stored in the register in accordance with at least one instruction executed by a processor element, wherein the memory array includes a plurality of discrete memory banks disposed on a substrate; the processor element is a processor subunit among a plurality of processor subunits included in a processing array disposed on the substrate, each of the processor subunits being associated with a corresponding dedicated discrete memory bank among the plurality of discrete memory banks; and the register is provided by a memory mat disposed on the substrate.

Some embodiments may include a substrate; a processing unit disposed on the substrate; and a memory unit disposed on the substrate, wherein the memory unit is configured to store data to be accessed by the processing unit, and the processing unit includes a memory mat configured to serve as a cache for the processing unit.

Processing systems are expected to process increasing amounts of information at very high rates. For example, fifth-generation (5G) mobile Internet networks are expected to receive vast streams of information and to process those streams at ever higher rates.

A processing system may include one or more buffers and a processor. The processing operations applied by the processor may have a certain latency, which may require vast buffers. Vast buffers may be costly and/or area-consuming.

Transferring vast amounts of information from the buffers to the processor may require high-bandwidth connectors and/or high-bandwidth buses between the buffers and the processor, which may also increase the cost and area of the processing system.

Accordingly, there is a growing need for an efficient processing system.

A disaggregated server includes multiple subsystems, each of which has a unique role. For example, a disaggregated server may include one or more switching subsystems, one or more computing subsystems, and one or more storage subsystems.

The one or more computing subsystems and the one or more storage subsystems are coupled to each other via the one or more switching subsystems.

A computing subsystem may include multiple computing units.

A switching subsystem may include multiple switching units.

A storage subsystem may include multiple storage units.

The bottleneck of such a disaggregated server lies in the bandwidth required to convey information between the subsystems.

This is especially true when executing distributed calculations that require information units to be shared among all (or at least a significant part) of the computing units of the different computing subsystems.

Assuming that N computing units participate in the sharing, that N is a very large integer (for example, at least 1024), and that each of the N computing units has to send information units to (and/or receive information units from) all the other computing units, about N × N information unit transfer processes need to be performed (N senders, each addressing N - 1 other units). Such a large number of transfer processes is time- and energy-consuming and dramatically limits the throughput of the disaggregated server.
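As a quick check of the scale involved, the following sketch counts the transfers needed when every one of N computing units sends one information unit to every other unit. The function name is illustrative and does not come from the disclosure.

```python
def all_to_all_transfers(n_units):
    """Each of n_units sends one information unit to each of the others."""
    return n_units * (n_units - 1)

for n in (4, 1024):
    print(n, all_to_all_transfers(n))
```

For N = 1024 this already exceeds one million transfers, and the count grows quadratically with the number of participating computing units.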

Accordingly, there is a growing need for an efficient disaggregated server and for an efficient manner of performing distributed processing.

A database includes many entries, each of which includes multiple fields. Database processing usually involves executing one or more queries, each of which includes one or more filtering parameters (for example, identifying one or more relevant fields and one or more relevant field values) and one or more operation parameters that may determine the type of operation to be executed, a variable or constant to be used when applying the operation, and the like.

For example, a database query may request that a statistical operation (operation parameter) be performed on all records of the database in which a certain field has a value within a predefined range (filtering parameter). As another example, a database query may request the deletion (operation parameter) of records in which a certain field is smaller than a threshold (filtering parameter).
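The two example queries can be written out as a short sketch that keeps the filtering parameter and the operation parameter separate. The record layout, field names, and function names below are hypothetical and serve only to illustrate the distinction.

```python
from statistics import mean

def run_query(records, field, low, high, target, operation):
    """Apply `operation` to the `target` values of records whose `field`
    lies in [low, high] (filtering parameter: field and range)."""
    selected = [r[target] for r in records if low <= r[field] <= high]
    return operation(selected) if selected else None

def delete_below(records, field, threshold):
    """Delete records whose `field` is smaller than `threshold` and
    return the records that remain."""
    return [r for r in records if r[field] >= threshold]

records = [
    {"id": 1, "age": 25, "balance": 100.0},
    {"id": 2, "age": 40, "balance": 300.0},
    {"id": 3, "age": 35, "balance": 200.0},
    {"id": 4, "age": 70, "balance": 50.0},
]

# Statistical operation (mean) over records with age within [30, 50].
average_balance = run_query(records, "age", 30, 50, "balance", mean)
# Deletion of records whose balance is below the threshold 100.0.
remaining = delete_below(records, "balance", 100.0)
print(average_balance, [r["id"] for r in remaining])
```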

A massive database is usually stored in storage devices. In order to respond to a query, the database is sent to a memory unit, usually one database segment after another.

The entries of the database are sent from the memory unit to a processor that does not belong to the same integrated circuit as the memory unit. The entries are then processed by the processor.

For each database segment of the database stored in the memory unit, the processing includes the following steps: (i) selecting a record of the database segment; (ii) sending the record from the memory unit to the processor; (iii) filtering the record by the processor to determine whether the record is relevant; and (iv) performing one or more additional operations (summing, applying any other mathematical and/or statistical operation) on the relevant records.

The filtering step ends only after all the records have been sent to the processor and the processor has determined which records are relevant.

If the relevant entries of the database are not stored in the processor, these relevant records must be sent to the processor again after the filtering step so that they can be further processed (that is, so that the operation following the filtering can be applied).

When multiple processing operations follow a single filtering step, the result of each operation may be sent to the memory unit and then sent back to the processor.

This process is bandwidth- and time-consuming.
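The conventional flow described above, steps (i) through (iv), can be sketched so that the amount of data crossing from the memory unit to the external processor becomes visible. This models only the conventional approach that the passage criticizes; the function and variable names are hypothetical, and one "transfer" stands for one record sent to the processor.

```python
def conventional_scan(segments, is_relevant, operation):
    """Process a database segment by segment with an external processor.

    Returns (result, records_transferred, records_relevant).
    """
    transferred = 0
    relevant = []
    for segment in segments:          # segments arrive one after another
        for record in segment:        # (i) select a record
            transferred += 1          # (ii) send it to the processor
            if is_relevant(record):   # (iii) filter on the processor
                relevant.append(record)
    result = operation(relevant)      # (iv) additional operation
    return result, transferred, len(relevant)

segments = [[1, 5, 9], [12, 3, 7]]
result, moved, kept = conventional_scan(segments, lambda r: r > 6, sum)
print(result, moved, kept)
```

Every record is transferred even though only some of them pass the filter, which is why the passage calls the process bandwidth- and time-consuming.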

Accordingly, there is a growing need for an efficient manner of performing database processing.

Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) in which words or phrases from a vocabulary are mapped to vectors of elements. Conceptually, it involves a mathematical embedding from a space with many dimensions per word to a continuous vector space with a much lower dimension (see www.wikipedia.org).

Methods of generating this mapping include neural networks, dimensionality reduction on the word co-occurrence matrix, explainable knowledge-base methods, and explicit representation in terms of the context in which words appear.

Word and phrase embeddings, when used as the underlying input representation, have been shown to boost performance in NLP tasks such as syntactic parsing and sentiment analysis.

A sentence may be segmented into words or phrases, and each segment may be represented by a vector. A sentence may be represented by a matrix that includes all the vectors representing the words or phrases of the sentence.

A vocabulary that maps words to vectors may be stored in a memory unit, such as a DRAM, that can be accessed using the words or phrases (or indexes representing the words).
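A minimal sketch of this lookup is shown below: words are mapped to indexes, indexes select rows of a vector table, and the vectors of a sentence are stacked into a matrix. The four-word vocabulary and the three-element vectors are made-up values used purely for illustration.

```python
# Hypothetical vocabulary mapping words to short vectors.
vocabulary = {
    "memory":  [0.9, 0.1, 0.0],
    "chips":   [0.8, 0.2, 0.1],
    "process": [0.1, 0.9, 0.3],
    "data":    [0.2, 0.7, 0.6],
}

# Index form of the same mapping, as it might be laid out in a memory unit.
word_index = {word: i for i, word in enumerate(vocabulary)}
table = list(vocabulary.values())

def sentence_to_matrix(sentence):
    """Segment a sentence into words and stack their vectors row by row."""
    return [table[word_index[word]] for word in sentence.lower().split()]

matrix = sentence_to_matrix("Memory chips process data")
for row in matrix:
    print(row)
```

Each row of the resulting matrix requires one lookup in the table, and the rows touched depend entirely on which words the sentence happens to contain.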

The accesses may be random accesses, which reduce the throughput of the DRAM. In addition, the accesses may saturate the DRAM, especially when a large number of accesses are issued to the DRAM.

The words included in a sentence are usually highly random. Accessing a DRAM memory that stores the mapping will usually result in the low performance of random access, even when DRAM bursts are used. This is because, typically, only a small portion of the DRAM memory bank entries accessed during a burst store entries that are relevant to a particular sentence.
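The effect of scattered lookups on burst efficiency can be estimated with a small calculation. The sketch assumes that a burst fetches a fixed number of consecutive vocabulary entries and that an entry is useful only if the sentence actually needs it; the burst size and the word indexes are hypothetical.

```python
def burst_utilization(word_indices, entries_per_burst):
    """Fraction of fetched entries that the sentence actually needs."""
    needed = set(word_indices)
    bursts = {index // entries_per_burst for index in needed}
    fetched = len(bursts) * entries_per_burst
    return len(needed) / fetched

# Word indexes scattered across a large vocabulary (random-like access).
scattered = [3, 4100, 977, 20512, 64]
# Word indexes that happen to be adjacent (sequential access).
adjacent = [0, 1, 2, 3, 4, 5, 6, 7]

print(burst_utilization(scattered, entries_per_burst=8))
print(burst_utilization(adjacent, entries_per_burst=8))
```

With scattered indexes each burst delivers one useful entry out of eight, whereas adjacent indexes use the whole burst.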

Accordingly, the throughput of the DRAM memory is low and non-continuous.

Each word or phrase of the sentence is fetched from the DRAM memory under the control of a host computer located outside the integrated circuit of the DRAM memory, and the host computer must control the retrieval of each vector representing each word or phrase based on knowledge of the location of that word. This is a time- and resource-consuming task.

Data centers and other computer systems are expected to process and exchange increasing amounts of information at very high rates.

The exchange of increasing amounts of data can become the bottleneck of data centers and other computer systems, and may cause such data centers and other computer systems to utilize only a fraction of their capabilities.

FIG. 96A shows an example of a prior art database 12010 and a prior art server motherboard 12011. The database may include multiple servers, and each server may include multiple server motherboards (also denoted "CPU + Memory + Network"). Each server motherboard 12011 includes a CPU 12012 (such as an INTEL XEON) that receives traffic and is connected to a memory unit 12013 (RAM) and to multiple database accelerators (DB accelerators) 12014.

The DB accelerators are optional, and the DB acceleration operations may be performed by the CPU 12012.

All traffic flows through the CPU, and the CPU may be coupled to the DB accelerators via a link of relatively limited bandwidth, such as PCIe.

A large amount of resources is dedicated to sending units of information between the multiple motherboards.

Accordingly, there is a growing need to provide efficient data centers and other computer systems.

Artificial intelligence (AI) applications, such as neural networks, are growing dramatically in size. To cope with the increasing size of neural networks, multiple servers, each acting as an AI acceleration server (and each including a server motherboard), are used to perform neural network processing tasks such as training. An example of a system including multiple AI acceleration servers placed in different racks is shown in FIG. 1.

In a typical training session, a very large number of images are processed concurrently, producing a vast number of values, such as losses. These values are passed between the different AI acceleration servers, resulting in an extraordinary amount of traffic. For example, some neural network layers may be computed across multiple GPUs located in different AI acceleration servers and may require aggregation over the network, which consumes a great deal of bandwidth.

Carrying this extraordinary amount of traffic requires ultra-high bandwidth, which may be infeasible or not cost-effective.
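
A rough estimate shows how aggregation traffic translates into a bandwidth requirement. The sketch below assumes a ring all-reduce, a commonly used aggregation scheme that is not described in the disclosure, in which each server sends approximately 2(n - 1)/n times the payload size per aggregation. The function names and all numeric values are hypothetical.

```python
def ring_allreduce_bytes_per_server(num_servers, payload_bytes):
    """Bytes each server sends in one ring all-reduce (assumed scheme)."""
    if num_servers < 2:
        return 0.0  # a single server has nothing to exchange
    return 2 * (num_servers - 1) / num_servers * payload_bytes


def required_bandwidth_gbps(bytes_per_step, step_seconds):
    """Link bandwidth, in Gbit/s, needed to send the bytes within one step."""
    return bytes_per_step * 8 / step_seconds / 1e9


if __name__ == "__main__":
    # Hypothetical example: 4 servers aggregating 1 GB of values per second.
    sent = ring_allreduce_bytes_per_server(4, 1e9)
    print(required_bandwidth_gbps(sent, 1.0))
```

Because the per-server traffic in this model scales with the payload size, a larger network or a shorter training step raises the required link bandwidth proportionally.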

FIG. 97A shows a system 12050 that includes subsystems. Each subsystem includes a switch 12051 that connects AI acceleration servers 12052, each having a server motherboard 12055 that includes a RAM memory (RAM 12056), a central processing unit (CPU 12054), and a network interface card (NIC 12053). The CPU 12054 is connected (via a PCIe bus) to multiple AI accelerators 12057 (e.g., graphics processing units, AI chips (AI ASICs), FPGAs, and the like). The NICs are coupled to one another (e.g., by one or more switches) over a network (e.g., using Ethernet, UDP links, and the like), and these NICs may be capable of carrying the ultra-high bandwidth required by the system.

Accordingly, there is a growing need for efficient AI computing systems.

According to other disclosed embodiments, a non-transitory computer-readable storage medium may store program instructions that are executed by at least one processing device and perform one or more of the methods described herein.

The foregoing general description and the following detailed description are exemplary only and do not limit the scope of the claims of the present disclosure.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate various disclosed embodiments.
FIG. 1 schematically shows a central processing unit (CPU).
FIG. 2 schematically shows a graphics processing unit (GPU).
FIG. 3A schematically shows one embodiment of an exemplary hardware chip according to the disclosed embodiments.
FIG. 3B schematically shows another embodiment of an exemplary hardware chip according to the disclosed embodiments.
FIG. 4 schematically shows a general instruction executed by an exemplary hardware chip according to the disclosed embodiments.
FIG. 5 schematically shows a special instruction executed by an exemplary hardware chip according to the disclosed embodiments.
FIG. 6 schematically shows a processing group for use in an exemplary hardware chip according to the disclosed embodiments.
FIG. 7A schematically shows a rectangular array of processing groups according to the disclosed embodiments.
FIG. 7B schematically shows an elliptical array of processing groups according to the disclosed embodiments.
FIG. 7C schematically shows an array of hardware chips according to the disclosed embodiments.
FIG. 7D schematically shows another array of hardware chips according to the disclosed embodiments.
FIG. 8 is a flowchart of an exemplary method for compiling a series of instructions for execution on an exemplary hardware chip according to the disclosed embodiments.
FIG. 9 schematically shows a memory bank.
FIG. 10 schematically shows a memory bank.
FIG. 11 schematically shows one embodiment of an exemplary memory bank with subbank controls according to the disclosed embodiments.
FIG. 12 schematically shows another embodiment of an exemplary memory bank with subbank controls according to the disclosed embodiments.
FIG. 13 is a block diagram of an exemplary memory chip according to the disclosed embodiments.
FIG. 14 is a block diagram of an exemplary set of redundant logic blocks according to the disclosed embodiments.
FIG. 15 is a block diagram of an exemplary logic block according to the disclosed embodiments.
FIG. 16 is a block diagram of exemplary logic blocks connected by a bus according to the disclosed embodiments.
FIG. 17 is a block diagram of exemplary logic blocks connected in series according to the disclosed embodiments.
FIG. 18 is a block diagram of exemplary logic blocks connected in a two-dimensional array according to the disclosed embodiments.
FIG. 19 is a block diagram of exemplary logic blocks in a complex connection according to the disclosed embodiments.
FIG. 20 is an exemplary flowchart of a redundant block activation process according to the disclosed embodiments.
FIG. 21 is an exemplary flowchart of an address assignment process according to the disclosed embodiments.
FIG. 22 is a block diagram of an exemplary processing device according to the disclosed embodiments.
FIG. 23 is a block diagram of an exemplary processing device according to the disclosed embodiments.
FIG. 24 includes an exemplary memory configuration diagram according to the disclosed embodiments.
FIG. 25 is an exemplary flowchart of a memory configuration process according to the disclosed embodiments.
FIG. 26 is an exemplary flowchart of a memory read process according to the disclosed embodiments.
FIG. 27 is an exemplary flowchart of an execution process according to the disclosed embodiments.
FIG. 28 shows an exemplary memory chip including a refresh controller according to the present disclosure.
FIG. 29A shows an exemplary refresh controller according to the present disclosure.
FIG. 29B shows another exemplary refresh controller according to the present disclosure.
FIG. 30 is an exemplary flowchart of a process executed by a refresh controller according to the present disclosure.
FIG. 31 is an exemplary flowchart of a process implemented by a compiler according to the present disclosure.
FIG. 32 is another exemplary flowchart of a process implemented by a compiler according to the present disclosure.
FIG. 33 shows an exemplary refresh controller configured by a stored pattern according to the present disclosure.
FIG. 34 is an exemplary flowchart of a process implemented by software within a refresh controller according to the present disclosure.
FIG. 35A shows an exemplary wafer including dies according to the present disclosure.
FIG. 35B shows an exemplary memory chip connected to an input/output bus according to the present disclosure.
FIG. 35C shows an exemplary wafer including memory chips arranged in rows and connected to an input-output bus according to the present disclosure.
FIG. 35D shows two memory chips forming a group and connected to an input-output bus according to the present disclosure.
FIG. 35E shows an exemplary wafer including dies arranged in a hexagonal lattice and connected to an input-output bus according to the present disclosure.
FIGS. 36A-36D show various possible configurations of memory chips connected to an input/output bus according to the present disclosure.
FIG. 37 shows an exemplary grouping of dies sharing glue logic according to the present disclosure.
FIGS. 38A-38B show exemplary cuts through a wafer according to the present disclosure.
FIG. 38C shows an exemplary arrangement of dies on a wafer and an arrangement of input-output buses according to the present disclosure.
FIG. 39 shows exemplary memory chips on a wafer with interconnected processor subunits according to the present disclosure.
FIG. 40 is an exemplary flowchart of a process for laying out a group of memory chips from a wafer according to the present disclosure.
FIG. 41A is another exemplary flowchart of a process for laying out a group of memory chips from a wafer according to the present disclosure.
FIGS. 41B-41C are exemplary flowcharts of processes for determining cutting patterns for cutting one or more groups of memory chips from a wafer according to the present disclosure.
FIG. 42 shows an example of circuitry within a memory chip that provides dual-port access along columns according to the present disclosure.
FIG. 43 shows an example of circuitry within a memory chip that provides dual-port access along rows according to the present disclosure.
FIG. 44 shows an example of circuitry within a memory chip that provides dual-port access along both rows and columns according to the present disclosure.
FIG. 45A shows a dual read using a duplicated memory array or mat.
FIG. 45B shows a dual write using a duplicated memory array or mat.
FIG. 46 shows an example of circuitry within a memory chip that includes switching elements for dual-port access along rows according to the present disclosure.
FIG. 47A is an exemplary flowchart of a process for providing dual-port access on a single-port memory array or mat according to the present disclosure.
FIG. 47B is an exemplary flowchart of another process for providing dual-port access on a single-port memory array or mat according to the present disclosure.
FIG. 48 shows another example of circuitry within a memory chip that provides dual-port access along both rows and columns according to the present disclosure.
FIG. 49 shows an example of switching elements for dual-port access within a memory mat according to the present disclosure.
FIG. 50 shows an exemplary integrated circuit including a reduction unit configured to access partial words according to the present disclosure.
FIG. 51 shows a memory bank for using a reduction unit as described with respect to FIG. 50.
FIG. 52 shows a memory bank using a reduction unit integrated into PIM logic according to the present disclosure.
FIG. 53 shows a memory bank using PIM logic to activate switches for accessing partial words according to the present disclosure.
FIG. 54A shows a memory bank including a partitioned column multiplex that deactivates partial word access according to the present disclosure.
FIG. 54B is an exemplary flowchart of a process for partial word access in a memory according to the present disclosure.
FIG. 55 shows a conventional memory chip including multiple memory mats.
FIG. 56 shows an exemplary memory chip including an activation circuit for reducing power consumption during the opening of a line according to the present disclosure.
FIG. 57 shows another exemplary memory chip including an activation circuit for reducing power consumption during the opening of a line according to the present disclosure.
FIG. 58 shows yet another exemplary memory chip including an activation circuit for reducing power consumption during the opening of a line according to the present disclosure.
FIG. 59 shows a further exemplary memory chip including an activation circuit for reducing power consumption during the opening of a line according to the present disclosure.
FIG. 60 shows an exemplary memory chip including global word lines and local word lines for reducing power consumption during the opening of a line according to the present disclosure.
FIG. 61 shows another exemplary memory chip including global word lines and local word lines for reducing power consumption during the opening of a line according to the present disclosure.
FIG. 62 is an exemplary flowchart of a process for sequentially opening lines in a memory according to the present disclosure.
FIG. 63 shows a conventional tester for a memory chip.
FIG. 64 shows another conventional tester for a memory chip.
FIG. 65 shows an example of testing a memory chip using logic elements on the same substrate as the memory according to the present disclosure.
FIG. 66 shows another example of testing a memory chip using logic elements on the same substrate as the memory according to the present disclosure.
FIG. 67 shows yet another example of testing a memory chip using logic elements on the same substrate as the memory according to the present disclosure.
FIG. 68 shows an additional example of testing a memory chip using logic elements on the same substrate as the memory according to the present disclosure.
FIG. 69 shows a further additional example of testing a memory chip using logic elements on the same substrate as the memory according to the present disclosure.
FIG. 70 is an exemplary flowchart of a process for testing a memory chip according to the present disclosure.
FIG. 71 is an exemplary flowchart of another process for testing a memory chip according to the present disclosure.
FIG. 72A shows an integrated circuit including a memory array and a processing array according to embodiments of the present disclosure.
FIG. 72B shows a memory region inside an integrated circuit according to embodiments of the present disclosure.
FIG. 73A shows an integrated circuit with an exemplary configuration of a controller according to embodiments of the present disclosure.
FIG. 73B shows a configuration for concurrently executing replicated models according to embodiments of the present disclosure.
FIG. 74A shows an integrated circuit with another exemplary configuration of a controller according to embodiments of the present disclosure.
FIG. 74B is a flowchart of a method of protecting an integrated circuit according to embodiments of the present disclosure.
FIG. 74C shows detection elements located within a chip according to embodiments of the present disclosure.
FIG. 75A shows a scalable processor memory system including a plurality of distributed processor memory chips according to embodiments of the present disclosure.
FIG. 75B shows a scalable processor memory system including a plurality of distributed processor memory chips according to embodiments of the present disclosure.
FIG. 75C shows a scalable processor memory system including a plurality of distributed processor memory chips according to embodiments of the present disclosure.
FIG. 75D shows a dual-port distributed processor memory chip according to embodiments of the present disclosure.
FIG. 75E shows exemplary timing according to embodiments of the present disclosure.
FIG. 76 shows a processor memory chip that has an integrated controller and an interface module and that forms a scalable processor memory system according to embodiments of the present disclosure.
FIG. 77 is a flowchart for transferring data between processor memory chips in the scalable processor memory system shown in FIG. 75A according to embodiments of the present disclosure.
FIG. 78A shows a system for detecting, at the chip level, a zero value stored at one or more particular addresses of a plurality of memory banks implemented in a memory chip according to embodiments of the present disclosure.
FIG. 78B shows a memory chip for detecting, at the memory bank level, a zero value stored at one or more particular addresses of a plurality of memory banks according to embodiments of the present disclosure.
FIG. 79 shows a memory bank for detecting, at the memory mat level, a zero value stored at one or more particular addresses of a plurality of memory mats according to embodiments of the present disclosure.
FIG. 80 is a flowchart of an exemplary method of detecting a zero value at a particular address of a plurality of discrete memory banks according to embodiments of the present disclosure.
FIG. 81A shows a system for activating a next row associated with a memory bank based on a next-row prediction according to embodiments of the present disclosure.
FIG. 81B shows another embodiment of the system of FIG. 81A according to embodiments of the present disclosure.
FIG. 81C shows a first subbank row controller and a second subbank row controller of each memory subbank according to embodiments of the present disclosure.
FIG. 81D shows an embodiment for next-row prediction according to embodiments of the present disclosure.
FIG. 81E shows an embodiment of a memory bank according to embodiments of the present disclosure.
FIG. 81F shows another embodiment of a memory bank according to embodiments of the present disclosure.
FIG. 82 shows a dual-control memory bank for reducing a memory row activation penalty according to embodiments of the present disclosure.
FIG. 83A shows a first example of accessing and activating rows of a memory bank.
FIG. 83B shows a second example of accessing and activating rows of a memory bank.
FIG. 83C shows a third example of accessing and activating rows of a memory bank.
FIG. 84 shows a conventional CPU/register file and external memory architecture.
FIG. 85A shows an exemplary distributed processor memory chip including a memory mat that serves as a register file according to one embodiment.
FIG. 85B shows an exemplary distributed processor memory chip including a memory mat configured to serve as a register file according to another embodiment.
FIG. 85C shows an exemplary device including a memory mat that serves as a register file according to another embodiment.
FIG. 86 is a flowchart of an exemplary method of executing at least one instruction in a distributed processor memory chip according to the disclosed embodiments.
FIG. 87A shows an example of a separated server.
FIG. 87B shows an example of distributed processing.
FIG. 87C shows an example of a memory/processing unit.
FIG. 87D shows an example of a memory/processing unit.
FIG. 87E shows an example of a memory/processing unit.
FIG. 87F shows an example of an integrated circuit including a memory/processing unit and one or more communication modules.
FIG. 87G shows an example of an integrated circuit including a memory/processing unit and one or more communication modules.
FIG. 87H is an example of a method.
FIG. 87I is an example of a method.
FIG. 88A is an example of a method.
FIG. 88B is an example of a method.
FIG. 88C is an example of a method.
FIG. 89A is an example of a memory/processing unit and a vocabulary.
FIG. 89B is an example of a memory/processing unit.
FIG. 89C is an example of a memory/processing unit.
FIG. 89D is an example of a memory/processing unit.
FIG. 89E is an example of a memory/processing unit.
FIG. 89F is an example of a memory/processing unit.
FIG. 89G is an example of a memory/processing unit.
FIG. 89H is an example of a memory/processing unit.
FIG. 90A is an example of a system.
FIG. 90B is an example of a system.
FIG. 90C is an example of a system.
FIG. 90D is an example of a system.
FIG. 90E is an example of a system.
FIG. 90F is an example of a method.
FIG. 91A is an example of a memory and filtering system, a storage device, and a CPU.
FIG. 91B is an example of a memory and processing system, a storage device, and a CPU.
FIG. 92A is an example of a memory and processing system, a storage device, and a CPU.
FIG. 92B is an example of a memory/processing unit.
FIG. 92C is an example of a memory and filtering system, a storage device, and a CPU.
FIG. 92D is an example of a memory and processing system, a storage device, and a CPU.
FIG. 92E is an example of a memory and processing system, a storage device, and a CPU.
FIG. 92F is an example of a method.
FIG. 92G is an example of a method.
FIG. 92H is an example of a method.
FIG. 92I is an example of a method.
FIG. 92J is an example of a method.
FIG. 92K is an example of a method.
FIG. 93A is a cross-sectional view of an example of a hybrid integrated circuit.
FIG. 93B is a cross-sectional view of an example of a hybrid integrated circuit.
FIG. 93C is a cross-sectional view of an example of a hybrid integrated circuit.
FIG. 93D is a cross-sectional view of an example of a hybrid integrated circuit.
FIG. 93E is a plan view of an example of a hybrid integrated circuit.
FIG. 93F is a plan view of an example of a hybrid integrated circuit.
FIG. 93G is a plan view of an example of a hybrid integrated circuit.
FIG. 93H is a cross-sectional view of an example of a hybrid integrated circuit.
FIG. 93I is a cross-sectional view of an example of a hybrid integrated circuit.
FIG. 93J is an example of a method.
FIG. 94A is an example of a storage system, one or more devices, and a computer system.
FIG. 94B is an example of a storage system, one or more devices, and a computer system.
FIG. 94C is an example of one or more devices and a computer system.
FIG. 94D is an example of one or more devices and a computer system.
FIG. 94E is an example of a database acceleration integrated circuit.
FIG. 94F is an example of a database acceleration integrated circuit.
FIG. 94G is an example of a database acceleration integrated circuit.
FIG. 94H is an example of a database acceleration unit.
FIG. 94I is an example of a blade and a group of database acceleration integrated circuits.
FIG. 94J is an example of groups of database acceleration integrated circuits.
FIG. 94K is an example of groups of database acceleration integrated circuits.
FIG. 94L is an example of groups of database acceleration integrated circuits.
FIG. 94M is an example of groups of database acceleration integrated circuits.
FIG. 94N is an example of a system.
FIG. 94O is an example of a system.
FIG. 94P is an example of a method.
FIG. 95A is an example of a method.
FIG. 95B is an example of a method.
FIG. 95C is an example of a method.
FIG. 96A is an example of a prior art system.
FIG. 96B is an example of a system.
FIG. 96C is an example of a database accelerator board.
FIG. 96D is an example of a portion of a system.
FIG. 97A is an example of a prior art system.
FIG. 97B is an example of a system.
FIG. 97C is an example of an AI network interface card.

The following detailed description refers to the accompanying drawings. Wherever convenient, the same reference numbers are used in the drawings and the description to refer to the same or similar elements. While several illustrative embodiments are described, modifications, adaptations, and other implementations are possible. For example, the components illustrated in the drawings may be substituted, added, or modified, and the methods described may be modified by substituting, reordering, removing, or adding steps. Accordingly, the following detailed description is not limited to the disclosed embodiments and examples; the proper scope is defined by the appended claims.

Processor Architecture

Throughout this disclosure, the term 'hardware chip' refers to a semiconductor wafer (such as silicon) on which one or more circuit elements (e.g., transistors, capacitors, resistors, etc.) are formed. The circuit elements may form processing elements or memory elements. A 'processing element' refers to one or more circuit elements that together perform at least one logic function (e.g., an arithmetic function, a logic gate, another Boolean operation, etc.). A processing element may be a general-purpose processing element (e.g., a configurable plurality of transistors) or a special-purpose processing element (e.g., a particular logic gate or a plurality of circuit elements designed to perform a particular logic function). A 'memory element' refers to one or more circuit elements that can be used to store data. A 'memory element' may also be referred to as a 'memory cell'. A memory element may be dynamic (i.e., requiring electrical refresh to maintain the stored data), static (i.e., retaining data for some time after power is removed), or non-volatile.

Processing elements may be joined to form processor subunits. Accordingly, a 'processor subunit' may comprise the smallest grouping of processing elements that can execute at least one task or instruction (e.g., a task or instruction of a processor instruction set). For example, a subunit may comprise one or more general-purpose processing elements configured to execute instructions together, one or more general-purpose processing elements paired with one or more special-purpose processing elements configured to execute instructions in a complementary fashion, or the like. The processor subunits may be arranged in an array on a substrate (e.g., a wafer). Although an 'array' may have a rectangular shape, the subunits of the array may be arranged in any other shape on the substrate.

Memory elements may be joined to form memory banks. For example, a memory bank may comprise one or more lines of memory elements linked along at least one wire (or other conductive connection). Furthermore, the memory elements may be linked along at least one additional wire in another direction. For example, the memory elements may be arranged along wordlines and bitlines, as explained below. Although a memory bank may comprise lines, any other arrangement may be used to place the elements within the substrate and form a bank on the substrate. Moreover, one or more banks may be electrically joined to at least one memory controller to form a memory array. Although a memory array may comprise a rectangular arrangement of banks, the banks of the array may be arranged in any other shape on the substrate.

Throughout this disclosure, a 'bus' refers to any communicative connection between elements of a substrate. For example, a wire or line (forming an electrical connection), an optical fiber (forming an optical connection), or any other connection carrying communications between components may be referred to as a 'bus'.

Conventional processors pair general-purpose logic circuits with shared memories. The shared memory may store both the instruction sets to be executed by the logic circuits and the data used for, and resulting from, execution of those instruction sets. As described below, some conventional processors use a caching system to reduce the latency of reads from the shared memory; however, conventional caching systems remain shared. Conventional processors include central processing units (CPUs), graphics processing units (GPUs), various application-specific integrated circuits (ASICs), and the like. FIG. 1 shows an example of a CPU, and FIG. 2 shows an example of a GPU.

As shown in FIG. 1, CPU 100 may comprise a processing unit 110 that includes one or more processor subunits, such as processor subunit 120a and processor subunit 120b. Although not depicted in FIG. 1, each processor subunit may comprise a plurality of processing elements. Moreover, processing unit 110 may include one or more levels of on-chip cache. Such cache elements are generally formed on the same semiconductor die as processing unit 110, rather than being connected to processor subunits 120a and 120b via one or more buses formed in the substrate containing the processor subunits and the cache elements. Placement directly on the same die, rather than connection via a bus, is common for both first-level (L1) and second-level (L2) caches in conventional processors. Alternatively, in older processors, L2 caches were shared among processor subunits using back-side buses between the subunits and the L2 caches. Back-side buses are generally larger than the front-side buses described below. Accordingly, because the cache is to be shared with all processor subunits on the die, cache 130 may be formed on the same die as processor subunits 120a and 120b or communicatively coupled to them via one or more back-side buses. Both in embodiments without buses (e.g., the cache is formed directly on the die) and in embodiments using back-side buses, the caches are shared between the processor subunits of the CPU.

Moreover, processing unit 110 communicates with shared memory 140a and shared memory 140b. For example, memories 140a and 140b may represent memory banks of shared dynamic random access memory (DRAM). Although two banks are depicted, most conventional memory chips include between eight and sixteen memory banks. Accordingly, processor subunits 120a and 120b may use shared memories 140a and 140b to store data that the processor subunits then operate on. This arrangement, however, makes the buses between memories 140a and 140b and processing unit 110 a bottleneck whenever the clock speed of processing unit 110 exceeds the data transfer speed of the buses. This is generally the case for conventional processors, so their effective processing speed falls below the speed that their clock rate and transistor count would suggest.
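
By way of illustration only, the bottleneck described above can be expressed as a simple throughput bound. The following Python sketch is not part of the disclosure: the function name, the model (every operation needs a fixed number of bytes from shared memory), and all numeric values are assumptions chosen solely to show the effect.

```python
def sustained_ops_per_second(clock_hz, ops_per_cycle,
                             bus_bytes_per_second, bytes_per_op):
    """Upper bound on sustained operations per second.

    Hypothetical model: every operation consumes `bytes_per_op` bytes that
    must cross the bus between shared memory and the processing unit, so the
    slower of the two rates limits the result.
    """
    compute_bound = clock_hz * ops_per_cycle
    bus_bound = bus_bytes_per_second / bytes_per_op
    return min(compute_bound, bus_bound)


# Made-up figures: 2 GHz clock, 4 operations per cycle,
# 16 GB/s bus, 8 bytes fetched per operation.
peak = 2e9 * 4
actual = sustained_ops_per_second(2e9, 4, 16e9, 8)
print(peak, actual)
```

Under these assumed figures the bus, not the logic, sets the sustained rate, which is the situation the paragraph above attributes to conventional processors.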

As shown in FIG. 2, similar deficiencies persist in GPUs. GPU 200 may comprise a processing unit 210 that includes one or more processor subunits (e.g., subunits 220a, 220b, 220c, 220d, 220e, 220f, 220g, 220h, 220i, 220j, 220k, 220l, 220m, 220n, 220o, and 220p). Moreover, processing unit 210 may include one or more levels of on-chip cache and/or register files. Such cache elements are generally formed on the same semiconductor die as processing unit 210. Indeed, in the example of FIG. 2, cache 210 is formed on the same die as processing unit 210 and shared among all of the processor subunits, while caches 230a, 230b, 230c, and 230d are each formed on, and dedicated to, a subset of the processor subunits.

Moreover, processing unit 210 communicates with shared memories 250a, 250b, 250c, and 250d. For example, memories 250a, 250b, 250c, and 250d may represent memory banks of shared DRAM. Accordingly, the processor subunits of processing unit 210 may use shared memories 250a, 250b, 250c, and 250d to store data that the processor subunits then operate on. As with the CPU described above, however, this arrangement makes the buses between memories 250a, 250b, 250c, and 250d and processing unit 210 a bottleneck.

Overview of Disclosed Hardware Chips

FIG. 3A schematically depicts an embodiment of an exemplary hardware chip 300. Hardware chip 300 may comprise a distributed processor designed to mitigate the bottlenecks described above for CPUs, GPUs, and other conventional processors. A distributed processor may include a plurality of processor subunits distributed spatially on a single substrate. Moreover, as explained above, in the distributed processors of the present disclosure, the corresponding memory banks are also spatially distributed on the substrate. In some embodiments, a distributed processor is associated with a set of instructions, and each processor subunit of the distributed processor may be responsible for performing one or more tasks included in the set of instructions.

As depicted in FIG. 3A, hardware chip 300 may comprise a plurality of processor subunits, e.g., logic and control subunits 320a, 320b, 320c, 320d, 320e, 320f, 320g, and 320h. As further depicted in FIG. 3A, each processor subunit may have a dedicated memory instance. For example, logic and control subunit 320a is operably connected to dedicated memory instance 330a, logic and control subunit 320b is operably connected to dedicated memory instance 330b, logic and control subunit 320c is operably connected to dedicated memory instance 330c, logic and control subunit 320d is operably connected to dedicated memory instance 330d, logic and control subunit 320e is operably connected to dedicated memory instance 330e, logic and control subunit 320f is operably connected to dedicated memory instance 330f, logic and control subunit 320g is operably connected to dedicated memory instance 330g, and logic and control subunit 320h is operably connected to dedicated memory instance 330h.

Although FIG. 3A depicts each memory instance as a single memory bank, hardware chip 300 may include two or more memory banks as the dedicated memory instance for a processor subunit on hardware chip 300. Furthermore, although FIG. 3A depicts each processor subunit as comprising both a logic component and a control for its dedicated memory bank or banks, hardware chip 300 may use controls for the memory banks that are separate, at least in part, from the logic components. Moreover, as depicted in FIG. 3A, two or more processor subunits and their corresponding memory banks may be grouped, e.g., into processing groups 310a, 310b, 310c, and 310d. A 'processing group' may represent a spatial distinction on the substrate on which hardware chip 300 is formed. Accordingly, a processing group may include further controls for the memory banks in the group, e.g., controls 340a, 340b, 340c, and 340d. Additionally or alternatively, a 'processing group' may represent a logical grouping for the purposes of compiling code for execution on hardware chip 300. Accordingly, a compiler for hardware chip 300 (described below) may divide an overall set of instructions among the processing groups on hardware chip 300.

Furthermore, host 350 may provide instructions, data, and other input to hardware chip 300 and read output from it. Accordingly, a set of instructions may be executed entirely on a single die, e.g., the die hosting hardware chip 300. Indeed, the only communications off-die may be the loading of instructions to hardware chip 300, any input sent to hardware chip 300, and any output read from hardware chip 300. Accordingly, all calculations and memory operations may be performed on-die (on hardware chip 300) because the processor subunits of hardware chip 300 communicate with the dedicated memory banks of hardware chip 300.

FIG. 3B schematically depicts an embodiment of another exemplary hardware chip 300'. Although depicted as an alternative to hardware chip 300, the architecture depicted in FIG. 3B may be combined, at least in part, with the architecture depicted in FIG. 3A.

As depicted in FIG. 3B, hardware chip 300' may comprise a plurality of processor subunits, e.g., processor subunits 350a, 350b, 350c, and 350d. As further depicted in FIG. 3B, each processor subunit may have a plurality of dedicated memory instances. For example, processor subunit 350a is connected to dedicated memory instances 330a and 330b, processor subunit 350b is connected to dedicated memory instances 330c and 330d, processor subunit 350c is connected to dedicated memory instances 330e and 330f, and processor subunit 350d is connected to dedicated memory instances 330g and 330h. Moreover, as depicted in FIG. 3B, the processor subunits and their corresponding memory banks may be grouped, e.g., into processing groups 310a, 310b, 310c, and 310d. As explained above, a 'processing group' may represent a spatial distinction on the substrate on which hardware chip 300' is formed and/or a logical grouping for the purposes of compiling code for execution on hardware chip 300'.

As further depicted in FIG. 3B, the processor subunits may communicate with each other via buses. For example, as shown in FIG. 3B, processor subunit 350a may communicate with processor subunit 350b via bus 360a, with processor subunit 350c via bus 360c, and with processor subunit 350d via bus 360f. Similarly, processor subunit 350b may communicate with processor subunit 350a via bus 360a (as described above), with processor subunit 350c via bus 360e, and with processor subunit 350d via bus 360d. In addition, processor subunit 350c may communicate with processor subunit 350a via bus 360c (as described above), with processor subunit 350b via bus 360e (as described above), and with processor subunit 350d via bus 360b. Accordingly, processor subunit 350d may communicate with processor subunit 350a via bus 360f (as described above), with processor subunit 350b via bus 360d (as described above), and with processor subunit 350c via bus 360b (as described above). Those of ordinary skill in the art will understand that fewer buses than depicted in FIG. 3B may be used. For example, bus 360e may be eliminated so that communications between processor subunits 350b and 350c pass through processor subunit 350a and/or processor subunit 350d. Similarly, bus 360f may be eliminated so that communications between processor subunits 350a and 350d pass through processor subunit 350b or processor subunit 350c.
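
By way of illustration only, the bus connections described for FIG. 3B can be modeled as a small graph, and the effect of eliminating bus 360e can be checked by searching for a path. The Python sketch below takes the bus endpoints from the description above; the data layout, the function name, and the use of breadth-first search are assumptions made for this sketch, and the disclosure does not prescribe any routing method.

```python
from collections import deque

# Bus endpoints as listed in the description of FIG. 3B.
BUSES = {
    "360a": ("350a", "350b"),
    "360b": ("350c", "350d"),
    "360c": ("350a", "350c"),
    "360d": ("350b", "350d"),
    "360e": ("350b", "350c"),
    "360f": ("350a", "350d"),
}


def shortest_route(buses, source, target):
    """Return the subunits on a shortest path, or None if unreachable.

    Breadth-first search is an assumption of this sketch only.
    """
    neighbors = {}
    for end_a, end_b in buses.values():
        neighbors.setdefault(end_a, []).append(end_b)
        neighbors.setdefault(end_b, []).append(end_a)

    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in neighbors.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


# With all six buses, 350b reaches 350c directly over bus 360e.
full = shortest_route(BUSES, "350b", "350c")

# With bus 360e eliminated, traffic is relayed through 350a or 350d.
reduced = {name: ends for name, ends in BUSES.items() if name != "360e"}
relayed = shortest_route(reduced, "350b", "350c")
print(full)
print(relayed)
```

In this model, removing bus 360e lengthens the route between subunits 350b and 350c by one hop, which matches the relaying through subunit 350a and/or 350d described above.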

Moreover, those of ordinary skill in the art will understand that architectures other than those depicted in FIGS. 3A and 3B may be used. For example, an array of processing groups, each with a single processor subunit and a memory instance, may be arranged on a substrate. A processor subunit may additionally or alternatively form part of a controller for a corresponding dedicated memory bank, part of a controller for a corresponding dedicated memory mat, or the like.

In accordance with the architecture described above, hardware chips 300 and 300' may provide significantly greater efficiency on memory-intensive tasks than conventional architectures. For example, database operations and artificial intelligence algorithms (such as neural networks) are memory-intensive tasks on which conventional architectures are less efficient than hardware chips 300 and 300'. Accordingly, hardware chips 300 and 300' may be referred to as database accelerator processors and/or artificial intelligence accelerator processors.

Configuration of the Disclosed Hardware Chips

The hardware chip architecture described above may be configured for execution of code. For example, each processor subunit may individually execute code (defining a set of instructions) apart from the other processor subunits in the hardware chip. Accordingly, rather than relying on an operating system to manage multithreading or using multitasking (which is concurrency rather than parallelism), the hardware chips of the present disclosure may allow the processor subunits to operate fully in parallel.

In addition to the fully parallel implementation described above, at least some of the instructions assigned to each processor subunit may overlap. For example, a plurality of processor subunits on a distributed processor may execute overlapping instructions as, e.g., an implementation of an operating system or other management software, while executing non-overlapping instructions in order to perform parallel tasks within the context of that operating system or other management software.

FIG. 4 depicts an exemplary process 400 for executing a generic instruction with processing group 410. For example, processing group 410 may comprise a portion of a hardware chip of the present disclosure (e.g., hardware chip 300, hardware chip 300', or the like).

As depicted in FIG. 4, instructions may be sent to processor subunit 430, which is paired with dedicated memory instance 420. An external host (e.g., host 350) may send the instructions to processing group 410 for execution. Alternatively, host 350 may send an instruction set including the instructions for storage in memory instance 420, so that processor subunit 430 can retrieve the instructions from memory instance 420 and execute them. Accordingly, the instructions may be executed by processing element 440, a generic processing element that can be configured to execute the retrieved instructions. Moreover, processing group 410 may include a control 460 for memory instance 420. As depicted in FIG. 4, control 460 may perform any reads from and/or writes to memory instance 420 that processing element 440 requires when executing the received instructions. After execution of the instructions, processing group 410 may output the result, e.g., to the external host or to a different processing group on the same hardware chip.

In some embodiments, as depicted in FIG. 4, processor subunit 430 may further include an address generator 450. An 'address generator' may comprise a plurality of processing elements configured to determine addresses in one or more memory banks for performing reads and writes, and may also perform operations (e.g., addition, subtraction, multiplication, or the like) on the data located at the determined addresses. For example, address generator 450 may determine addresses for any reads from or writes to memory. In one example, address generator 450 may increase efficiency by overwriting a read value with a new value determined from the instruction when the read value is no longer needed. Additionally or alternatively, address generator 450 may select available addresses for storing the results of instruction execution. This may allow the reading of results to be scheduled for a later clock cycle that is more convenient for the external host. In another example, address generator 450 may determine the addresses to read from and write to during a multi-cycle calculation, such as a vector or matrix multiply-accumulate calculation. Accordingly, address generator 450 may maintain or calculate the memory addresses for reading data and writing intermediate results of the multi-cycle calculation, so that processor subunit 430 can continue processing without having to store these memory addresses.
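
By way of illustration only, the role of an address generator during a multi-cycle multiply-accumulate calculation can be sketched in software. In the Python sketch below, the memory bank is modeled as a plain list, and the function names, the stride parameter, and the memory layout are all hypothetical; the sketch shows one possible division of work, not the circuit of address generator 450.

```python
def mac_addresses(base_a, base_b, length, stride=1):
    """Yield the pair of read addresses needed on each cycle of a dot product.

    Stands in for the address generator: the caller never stores or
    computes addresses itself.
    """
    for i in range(length):
        yield base_a + i * stride, base_b + i * stride


def dot_product(memory, base_a, base_b, length, result_addr):
    """Accumulate memory[a] * memory[b] and write the result back to memory."""
    accumulator = 0
    for addr_a, addr_b in mac_addresses(base_a, base_b, length):
        accumulator += memory[addr_a] * memory[addr_b]
    memory[result_addr] = accumulator
    return accumulator


# Toy memory bank: vector [1, 2, 3] at address 0, vector [4, 5, 6] at
# address 3, and one free slot at address 6 for the result.
bank = [1, 2, 3, 4, 5, 6, 0]
dot_product(bank, base_a=0, base_b=3, length=3, result_addr=6)
print(bank[6])
```

Keeping the address sequence in a separate generator mirrors the point made above: the component doing the arithmetic can continue processing without holding the memory addresses.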

FIG. 5 depicts an exemplary process 500 for executing a specialized instruction with processing group 510. For example, processing group 510 may comprise a portion of a hardware chip of the present disclosure (e.g., hardware chip 300, hardware chip 300', or the like).

As depicted in FIG. 5, a specialized instruction (e.g., a multiply-accumulate instruction) may be sent to processing element 530, which is paired with dedicated memory instance 520. An external host (e.g., host 350) may send the instruction to processing element 530 for execution. Accordingly, the instruction may be executed, upon a given signal from the host, by processing element 530, a specialized processing element configured to execute particular instructions (including the received instruction). Alternatively, processing element 530 may retrieve the instruction from memory instance 520 for execution. Thus, in the example of FIG. 5, processing element 530 is a multiply-accumulate (MAC) circuit configured to execute MAC instructions received from the external host or retrieved from memory instance 520. After execution of the instruction, processing group 510 may output the result, e.g., to the external host or to a different processing group on the same hardware chip. Although depicted with a single instruction and a single result, a plurality of instructions may be received or retrieved and executed, and a plurality of results may be combined on processing group 510 before output.

Although depicted as a MAC circuit in FIG. 5, additional or alternative specialized circuits may be included in a processing group. For example, a MAX-read instruction (which outputs the maximum value of a vector), a MAX0-read instruction (a function, also termed a rectifier, which outputs the entire vector but applies a MAX with 0 to each element), or the like may be implemented.
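The three specialized instructions named above can be sketched in software as follows. This is only an illustrative model, not the circuit described in the disclosure; the function names `mac`, `mac_vector`, `max_read`, and `max0_read` are hypothetical.

```python
def mac(acc, a, b):
    """Multiply-accumulate: add the product a * b to the accumulator."""
    return acc + a * b


def mac_vector(xs, ws, acc=0):
    """Apply MAC element by element, as in a dot product (hypothetical helper)."""
    for x, w in zip(xs, ws):
        acc = mac(acc, x, w)
    return acc


def max_read(vector):
    """MAX-read: output the maximum value of the vector."""
    return max(vector)


def max0_read(vector):
    """MAX0-read (rectifier): output the whole vector with MAX(element, 0) applied."""
    return [max(v, 0) for v in vector]
```

For instance, `max0_read([3, -1, 0, 7])` leaves the non-negative entries unchanged and replaces `-1` with `0`.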

Although depicted separately, the general processing group 410 of FIG. 4 and the specialized processing group 510 of FIG. 5 may be combined. For example, a general processor subunit may be coupled to one or more specialized processor subunits to form a processor subunit. Accordingly, the general processor subunit may be used for all instructions that are not executable by the one or more specialized processor subunits.

One of ordinary skill in the art will understand that neural network implementations and other memory-intensive tasks may be handled with specialized logic circuits. For example, database queries, packet inspection, string comparison, and other functions may run more efficiently when executed by the hardware chips described herein.

Memory-Based Architecture for Distributed Processing

On hardware chips consistent with the present disclosure, dedicated buses may transfer data between processor subunits on the chip and/or between the processor subunits and their corresponding dedicated memory banks. The use of dedicated buses may reduce arbitration costs because competing requests are either not possible or easily avoided using software rather than hardware.

FIG. 6 schematically depicts a processing group 600. Processing group 600 may be for use in a hardware chip, e.g., hardware chip 300, hardware chip 300', or the like. Processor subunit 610 may be connected via buses 630 to memory 620. Memory 620 may comprise a random access memory (RAM) element that stores data and code for execution by processor subunit 610. In some embodiments, memory 620 may be an N-way memory (where N is a number equal to or larger than 1 that denotes the number of segments in an interleaved memory 620). Because processor subunit 610 is coupled via bus 630 to a memory 620 dedicated to processor subunit 610, N may be kept relatively small without compromising execution performance. This represents an improvement over conventional multiway register files or caches, where a lower N generally results in lower execution performance and a higher N generally results in large area and power loss.
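The role of N in an interleaved memory can be illustrated with a short sketch. The disclosure does not specify an interleaving scheme, so the low-order interleaving below (consecutive word addresses rotating through the N segments) is an assumption, and the function name `locate` is hypothetical.

```python
def locate(address, n_ways):
    """Map a word address to (segment, offset within segment).

    Assumes low-order interleaving: address k lives in segment k % N.
    This mapping is an illustrative assumption, not taken from the disclosure.
    """
    if n_ways < 1:
        raise ValueError("N must be at least 1")
    return address % n_ways, address // n_ways
```

With `n_ways=1` every address falls in segment 0, which corresponds to a non-interleaved memory.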

The size of memory 620, the number of ways N, and the width of bus 630 may be adjusted to meet the requirements of the tasks and application implementations of a system using processing group 600, according to, for instance, the size of the data involved in the task or tasks. Memory element 620 may comprise one or more types of memory known in the art, e.g., volatile memory (such as RAM, DRAM, SRAM, phase-change RAM (PRAM), magnetoresistive RAM (MRAM), resistive RAM (ReRAM), or the like) or non-volatile memory (such as flash memory or ROM). According to some embodiments, a portion of memory element 620 may comprise a first memory type, while another portion may comprise another memory type. For instance, the code region of memory element 620 may comprise a ROM element, while a data region of memory element 620 may comprise a DRAM element. As another example of such partitioning, the weights of a neural network may be stored in flash memory while the data for calculation is stored in DRAM.

Processor subunit 610 comprises a processing element 640 that may comprise a processor. As one of ordinary skill in the art will appreciate, the processor may or may not be pipelined, and may be a customized Reduced Instruction Set Computing (RISC) element or other processing scheme implemented on any commercial integrated circuit (IC) known in the art (e.g., ARM, ARC, RISC-V, etc.). In some embodiments, processing element 640 may comprise a controller that includes an Arithmetic Logic Unit (ALU) or other controller.

According to some embodiments, processing element 640, which executes received or stored code, may comprise a generic processing element and may therefore be flexible and capable of performing a wide variety of processing operations. When comparing the power consumed during performance of a specific operation, non-dedicated circuitry typically consumes more power than circuitry dedicated to that operation. Therefore, when performing specific complex arithmetic calculations, processing element 640 may consume more power and perform less efficiently than dedicated hardware would. Accordingly, in some embodiments, a controller of processing element 640 may be designed to perform specific operations (e.g., addition or "move" operations).

In one example, the specific operations may be performed by one or more accelerators 650. Each accelerator may be dedicated and programmed to perform a specific calculation (e.g., multiplication, floating-point vector operations, or the like). By using accelerators, the average power consumed per calculation per processor subunit may be lowered, and calculation throughput increases accordingly. Accelerators 650 may be chosen according to the application the system is designed to implement (e.g., execution of neural networks, execution of database queries, or the like). Accelerators 650 may be configured by processing element 640 and may operate in tandem with it to lower power consumption and accelerate calculations. The accelerators may additionally or alternatively be used, like a smart direct memory access (DMA) peripheral, to transfer data between the memory of processing group 600 and the MUX/DEMUX/input/output ports (e.g., MUX 660 and DEMUX 670).

Accelerators 650 may be configured to perform a variety of functions. For instance, one accelerator may be configured to perform the 16-bit floating-point calculations or 8-bit integer calculations that are often used in neural networks. Another example of an accelerator function is the 32-bit floating-point calculation often used during the training stage of a neural network. Yet another example of an accelerator function is query processing, such as that used in databases. In some embodiments, accelerators 650 may comprise specialized processing elements for performing these functions and/or may be configured according to configuration data stored in memory element 620 so that they can be modified.

Accelerators 650 may additionally or alternatively implement a configurable scripted list of memory movements that times the movement of data to/from memory 620 or to/from other accelerators and/or inputs/outputs. Accordingly, as explained further below, all data movement inside a hardware chip using processing group 600 may rely on software synchronization rather than hardware synchronization. For example, an accelerator in one processing group (e.g., group 600) may transfer data from its input to the accelerator every tenth cycle and then output data in the next cycle, thereby letting information flow from the memory of the processing group to another processing group.
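A scripted list of memory movements of this kind can be modeled as a table of periodic transfers. The sketch below is a minimal illustration under assumed names: the `Move` record, its `period` and `phase` fields, and `run_script` are all hypothetical and do not come from the disclosure. It reproduces the example of loading from the input every tenth cycle and emitting in the following cycle.

```python
from collections import namedtuple

# Hypothetical script entry: fire whenever cycle % period == phase.
Move = namedtuple("Move", "period phase src dst")

# Every tenth cycle move input -> accelerator, then accelerator -> output one cycle later.
SCRIPT = [
    Move(period=10, phase=0, src="input", dst="accelerator"),
    Move(period=10, phase=1, src="accelerator", dst="output"),
]


def run_script(script, cycles):
    """Return the (cycle, src, dst) transfers that the script triggers."""
    log = []
    for cycle in range(cycles):
        for move in script:
            if cycle % move.period == move.phase:
                log.append((cycle, move.src, move.dst))
    return log
```

Because every transfer time is fixed by the script, no handshake or arbitration hardware is needed in this model to decide when data moves.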

As further depicted in FIG. 6, in some embodiments, processing group 600 may further comprise at least one input multiplexer (MUX) 660 connected to its input port and at least one output demultiplexer (DEMUX) 670 connected to its output port. These MUXs/DEMUXs may be controlled by control signals (not shown) from processing element 640 and/or from one of accelerators 650, determined according to the current instruction being carried out by processing element 640 and/or the operation being executed by an accelerator of accelerators 650. In some cases, processing group 600 may be required (according to a predefined instruction from its code memory) to transfer data from its input port to its output port. Accordingly, in addition to each MUX/DEMUX being connected to processing element 640 and accelerators 650, one or more of the input MUXs (e.g., MUX 660) may be connected directly via one or more buses to an output DEMUX (e.g., DEMUX 670).

The processing group 600 of FIG. 6 may be arrayed to form a distributed processor, for example, as depicted in FIG. 7A. The processing groups may be disposed on substrate 710 to form an array. In some embodiments, substrate 710 may comprise a semiconductor substrate, such as silicon. Additionally or alternatively, substrate 710 may comprise a circuit board, such as a flexible circuit board.

As depicted in FIG. 7A, substrate 710 may include a plurality of processing groups, such as processing group 600, disposed thereon. Accordingly, substrate 710 includes a memory array that includes a plurality of banks, such as banks 720a, 720b, 720c, 720d, 720e, 720f, 720g, and 720h. Furthermore, substrate 710 includes a processing array that may include a plurality of processor subunits, such as subunits 730a, 730b, 730c, 730d, 730e, 730f, 730g, and 730h.

Furthermore, as explained above, each processing group may include a processor subunit and one or more corresponding memory banks dedicated to that processor subunit. Accordingly, as depicted in FIG. 7A, each subunit is associated with a corresponding, dedicated memory bank: processor subunit 730a is associated with memory bank 720a, processor subunit 730b with memory bank 720b, processor subunit 730c with memory bank 720c, processor subunit 730d with memory bank 720d, processor subunit 730e with memory bank 720e, processor subunit 730f with memory bank 720f, processor subunit 730g with memory bank 720g, and processor subunit 730h with memory bank 720h.

To allow each processor subunit to communicate with its corresponding dedicated memory bank(s), substrate 710 may include a first plurality of buses, each connecting one of the processor subunits to its corresponding dedicated memory bank(s). Accordingly, bus 740a connects processor subunit 730a to memory bank 720a, bus 740b connects processor subunit 730b to memory bank 720b, bus 740c connects processor subunit 730c to memory bank 720c, bus 740d connects processor subunit 730d to memory bank 720d, bus 740e connects processor subunit 730e to memory bank 720e, bus 740f connects processor subunit 730f to memory bank 720f, bus 740g connects processor subunit 730g to memory bank 720g, and bus 740h connects processor subunit 730h to memory bank 720h. Moreover, to allow each processor subunit to communicate with other processor subunits, substrate 710 may include a second plurality of buses, each connecting one of the processor subunits to another of the processor subunits. In the example of FIG. 7A, bus 750a connects processor subunit 730a to processor subunit 750e, bus 750b connects processor subunit 730a to processor subunit 750b, bus 750c connects processor subunit 730b to processor subunit 750f, bus 750d connects processor subunit 730b to processor subunit 750c, bus 750e connects processor subunit 730c to processor subunit 750g, bus 750f connects processor subunit 730c to processor subunit 750d, bus 750g connects processor subunit 730d to processor subunit 750h, bus 750h connects processor subunit 730h to processor subunit 750g, bus 750i connects processor subunit 730g to processor subunit 750g, and bus 750j connects processor subunit 730f to processor subunit 750e.

Accordingly, in the example arrangement shown in FIG. 7A, the plurality of logical processor subunits is arranged in at least one row and at least one column. The second plurality of buses connects each processor subunit to at least one adjacent processor subunit in the same row and to at least one adjacent processor subunit in the same column. The arrangement of FIG. 7A may be referred to as a "partial tile connection."
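The partial tile connection can be sketched as a set of buses between row and column neighbors on a grid. The sketch assumes that the eight subunits of FIG. 7A form a 2 x 4 grid, which is an inference from the ten buses 750a through 750j listed above rather than something the text states; the function name and the `(row, column)` coordinates are hypothetical.

```python
def partial_tile_buses(rows, cols):
    """List one bus per pair of row-adjacent or column-adjacent subunits.

    Subunits are identified by hypothetical (row, column) coordinates.
    """
    buses = []
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:  # neighbor in the same row
                buses.append(((r, c), (r, c + 1)))
            if r + 1 < rows:  # neighbor in the same column
                buses.append(((r, c), (r + 1, c)))
    return buses
```

Under the 2 x 4 assumption this yields six row buses and four column buses, ten in total.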

The arrangement shown in FIG. 7A may be modified to form a "full tile connection." A full tile connection includes additional buses connecting diagonal processor subunits. For example, the second plurality of buses may include additional buses between processor subunit 730a and processor subunit 730f, between processor subunit 730b and processor subunit 730e, between processor subunit 730b and processor subunit 730g, between processor subunit 730c and processor subunit 730f, between processor subunit 730c and processor subunit 730h, and between processor subunit 730d and processor subunit 730g.
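A full tile connection can be sketched by adding the two diagonal directions to the row and column neighbors. As before, the 2 x 4 grid and the `(row, column)` coordinates are assumptions made for illustration, and the function names are hypothetical.

```python
def full_tile_buses(rows, cols):
    """Return the set of buses joining row, column, and diagonal neighbors."""
    buses = set()
    for r in range(rows):
        for c in range(cols):
            # right, down, down-right, down-left: each pair is visited once
            for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    buses.add(((r, c), (nr, nc)))
    return buses


def diagonal_only(buses):
    """Keep the buses whose endpoints differ in both row and column."""
    return {(a, b) for a, b in buses if a[0] != b[0] and a[1] != b[1]}
```

On a 2 x 4 grid the diagonal subset has six buses, which matches the six additional subunit pairs enumerated above.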

A full tile connection may be used for convolution calculations, in which data and results stored in nearby processor subunits are used. For example, during convolutional image processing, each processor subunit may receive a tile of the image (e.g., a pixel or a group of pixels). To calculate the convolution results, each processor subunit may acquire data from all eight adjacent processor subunits, each of which has received a corresponding tile. In a partial tile connection, data from diagonally adjacent processor subunits may instead be passed through other adjacent processor subunits connected to the processor subunit. Accordingly, the distributed processor on a chip may be an artificial intelligence accelerator processor.

In a specific example of a convolutional calculation, an N x M image may be divided across a plurality of processor subunits. Each processor subunit may perform a convolution with an A x B filter on its corresponding tile. To perform the filtering on one or more pixels on a boundary between tiles, each processor subunit may require data from neighboring processor subunits whose tiles include pixels on the same boundary. Accordingly, the code generated for each processor subunit configures the subunit to calculate the convolutions and to pull data from one of the second plurality of buses whenever data is needed from an adjacent subunit. Corresponding commands to output data onto the second plurality of buses are provided to the subunits to ensure proper timing of the needed data transfers.
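The boundary exchange described above can be illustrated with a minimal two-subunit sketch. Everything here is an assumption made for illustration: the image is split by columns into two tiles, pixels outside the image are treated as zero, the filter is applied without flipping (as is common in neural networks), and the counter `pulled` merely stands in for reads that would travel over a bus between subunits. The function names are hypothetical.

```python
def filter_at(get, r, c, kernel):
    """Filter response centered at (r, c); get(r, c) supplies pixel values."""
    a, b = len(kernel), len(kernel[0])
    return sum(kernel[i][j] * get(r + i - a // 2, c + j - b // 2)
               for i in range(a) for j in range(b))


def convolve_whole(image, kernel):
    """Reference result computed by a single unit that sees the whole image."""
    rows, cols = len(image), len(image[0])

    def get(r, c):
        return image[r][c] if 0 <= r < rows and 0 <= c < cols else 0

    return [[filter_at(get, r, c, kernel) for c in range(cols)]
            for r in range(rows)]


def convolve_tiled(image, kernel, split):
    """Two subunits own columns [0, split) and [split, cols) respectively."""
    rows, cols = len(image), len(image[0])
    tiles = [
        {(r, c): image[r][c] for r in range(rows) for c in range(split)},
        {(r, c): image[r][c] for r in range(rows) for c in range(split, cols)},
    ]
    out = [[0] * cols for _ in range(rows)]
    pulled = 0  # reads served by the neighboring subunit
    for owner in (0, 1):
        own, other = tiles[owner], tiles[1 - owner]

        def get(r, c):
            nonlocal pulled
            if (r, c) in own:
                return own[(r, c)]
            if (r, c) in other:  # boundary pixel held by the neighbor
                pulled += 1
                return other[(r, c)]
            return 0  # outside the image

        for (r, c) in own:
            out[r][c] = filter_at(get, r, c, kernel)
    return out, pulled
```

The tiled result equals the single-unit result, and only pixels near the tile boundary cause reads from the neighbor, which is why the timing of those few transfers can be fixed in the generated code.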

The partial tile connection of FIG. 7A may be modified to be an N-partial tile connection. In this modification, the second plurality of buses may further connect each processor subunit to processor subunits within a threshold distance of that processor subunit (e.g., within n processor subunits) in the four directions along which the buses of FIG. 7A run (i.e., up, down, left, and right). A similar modification may be made to the full tile connection (resulting in an N-full tile connection) such that the second plurality of buses further connects each processor subunit to processor subunits within a threshold distance of that processor subunit (e.g., within n processor subunits) in the four directions along which the buses of FIG. 7A run and in the two diagonal directions.
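The N-partial tile connection can be sketched as a neighbor lookup along the four axis directions. The grid coordinates and the function name below are hypothetical and serve only to illustrate the threshold distance n.

```python
def n_partial_neighbors(r, c, rows, cols, n):
    """Subunits reachable from (r, c) within n steps up, down, left, or right."""
    found = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        for dist in range(1, n + 1):
            nr, nc = r + dr * dist, c + dc * dist
            if 0 <= nr < rows and 0 <= nc < cols:
                found.append((nr, nc))
    return found
```

Setting `n=1` reduces this to the partial tile connection of FIG. 7A; adding the diagonal direction pairs to the loop would give the N-full variant.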

Other arrangements are possible. For example, in the arrangement shown in FIG. 7B, bus 750a connects processor subunit 730a to processor subunit 730d, bus 750b connects processor subunit 730a to processor subunit 730b, bus 750c connects processor subunit 730b to processor subunit 730c, and bus 750d connects processor subunit 730c to processor subunit 730d. Accordingly, in the example shown in FIG. 7B, the plurality of processor subunits is arranged in a star pattern. The second plurality of buses connects each processor subunit to at least one adjacent processor subunit within the star pattern.

Further arrangements (not shown) are possible. For example, a neighbor connection arrangement may be used, in which the plurality of processor subunits is arranged in one or more lines (e.g., similar to the arrangement depicted in FIG. 7A). In a neighbor connection arrangement, the second plurality of buses connects each processor subunit to a processor subunit to its left in the same line, to a processor subunit to its right in the same line, to the processor subunits on both its left and its right in the same line, or the like.

In another example, an N-linear connection arrangement may be used. In an N-linear connection arrangement, the second plurality of buses connects each processor subunit to processor subunits within a threshold distance of that processor subunit (e.g., within n processor subunits). The N-linear connection arrangement may be used with the line array (described above), the rectangular array (depicted in FIG. 7A), the elliptical array (depicted in FIG. 7B), or any other geometric array.

In yet another example, an N-log connection arrangement may be used. In an N-log connection arrangement, the second plurality of buses connects each processor subunit to processor subunits within a power-of-two threshold distance of that processor subunit (e.g., within 2ⁿ processor subunits). The N-log connection arrangement may be used with the line array (described above), the rectangular array (depicted in FIG. 7A), the elliptical array (depicted in FIG. 7B), or any other geometric array.
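For a line array, the N-linear and N-log arrangements can be compared with a short sketch. It takes the text literally: N-linear connects subunits within distance n, and N-log connects subunits within distance 2ⁿ. Indexing subunits by their position in the line and the function names are assumptions made for illustration.

```python
def n_linear_neighbors(i, size, n):
    """Positions in a line of `size` subunits within distance n of position i."""
    return [j for j in range(size) if j != i and abs(j - i) <= n]


def n_log_neighbors(i, size, n):
    """Same lookup with a power-of-two threshold distance of 2**n."""
    return n_linear_neighbors(i, size, 2 ** n)
```

Under this reading, the N-log reach grows exponentially with the parameter n, whereas the N-linear reach grows linearly.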

Any of the connection schemes described above may be combined for use on the same hardware chip. For example, a full tile connection may be used in one region while a partial tile connection is used in another region. In another example, an N-linear connection arrangement may be used in one region while an N-full tile connection is used in another region.

Alternatively to, or in addition to, dedicated buses between the processor subunits of the memory chip, one or more shared buses may be used to interconnect all (or a subset of) the processor subunits of a distributed processor. Collisions on the shared buses may still be avoided by timing the data transfers on the shared buses using code executed by the processor subunits, as explained further below. Additionally to, or alternatively to, shared buses, configurable buses may be used to dynamically connect processor subunits to form groups of processor subunits connected by separate buses. For example, the configurable buses may include transistors or other mechanisms that may be controlled by a processor subunit to direct data transfers to a selected processor subunit.
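One simple way to picture software-timed sharing of a bus is a fixed slot schedule in which each subunit drives the bus only in its own cycles. The round-robin policy below is an assumption chosen for illustration; the disclosure states only that code executed by the subunits times the transfers. The function names are hypothetical.

```python
def round_robin(subunits, cycles):
    """Assign each subunit the cycles in which it alone may drive the bus."""
    count = len(subunits)
    return {name: [c for c in range(cycles) if c % count == slot]
            for slot, name in enumerate(subunits)}


def has_collision(send_cycles):
    """True if two subunits are scheduled to drive the bus in the same cycle."""
    seen = set()
    for cycles in send_cycles.values():
        for cycle in cycles:
            if cycle in seen:
                return True
            seen.add(cycle)
    return False
```

A schedule that passes `has_collision` at code-generation time needs no arbiter at run time in this model, because no two senders ever contend for the same cycle.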

In both FIG. 7A and FIG. 7B, the plurality of processor subunits of the processing array is spatially distributed among the plurality of discrete memory banks of the memory array. In other alternative embodiments (not shown), the plurality of processor subunits may be clustered in one or more regions of the substrate, and the plurality of memory banks may be clustered in one or more other regions of the substrate. In some embodiments, a combination of spatial distribution and clustering (not shown) may be used. For example, one region of the substrate may include a cluster of processor subunits, another region of the substrate may include a cluster of memory banks, and yet another region of the substrate may include a processing array distributed among memory banks.

One of ordinary skill in the art will recognize that arraying processing groups 600 on a substrate is not the only possible embodiment. For example, each processor subunit may be associated with at least two dedicated memory banks. Accordingly, processing groups 310a, 310b, 310c, and 310d of FIG. 3B may be used in lieu of, or in combination with, processing group 600 to form the processing array and the memory array. Other processing groups (not shown) including, for example, three, four, or more dedicated memory banks may also be used.

Each of the plurality of processor subunits may be configured to execute software code associated with a particular application independently of the other processor subunits included in the plurality of processor subunits. For example, as explained below, a plurality of sub-series of instructions may be grouped as machine code and provided to each processor subunit for execution.

In some embodiments, each dedicated memory bank comprises at least one dynamic random access memory (DRAM). Alternatively, the memory banks may comprise a mix of memory types, such as static random access memory (SRAM), DRAM, flash memory, or the like.

In conventional processors, data sharing between processor subunits is usually performed with shared memory. Shared memory typically requires a large portion of chip area and/or is served by a bus that is managed by additional hardware (e.g., arbiters). The bus creates bottlenecks, as described above. In addition, the shared memory, which may be external to the chip, typically includes cache coherency mechanisms and more complex caches (e.g., L1 cache, L2 cache, and shared DRAM) in order to provide accurate and up-to-date data to the processor subunits. As explained further below, the dedicated buses depicted in FIGS. 7A and 7B allow for hardware chips that are free of hardware management (e.g., arbiters). Moreover, the use of dedicated memories as depicted in FIGS. 7A and 7B allows for the elimination of complex caching layers and coherency mechanisms.

Instead, to allow each processor subunit to access data calculated by other processor subunits and/or stored in memory banks dedicated to the other processor subunits, buses are provided whose timing is performed dynamically using code individually executed by each processor subunit. This allows most, if not all, of the bus management hardware conventionally used to be eliminated. Moreover, complex caching mechanisms are replaced with direct transfers over these buses, resulting in lower latency during memory reads and writes.

Memory-Based Processing Arrays

As depicted in FIGS. 7A and 7B, a memory chip of the present disclosure may operate independently. Alternatively, memory chips of the present disclosure may be operably connected to one or more additional integrated circuits, such as a memory device (e.g., one or more DRAM banks), a system-on-a-chip, a field-programmable gate array (FPGA), or another processing and/or memory chip. In such embodiments, the tasks in a series of instructions executed by the architecture may be divided (e.g., by a compiler, as described below) between processor subunits of the memory chip and processor subunits of the additional integrated circuit(s). For example, the additional integrated circuits may comprise a host (e.g., host 350 of FIG. 3A) that inputs instructions and/or data to the memory chip and receives output from the memory chip.

In order to interconnect memory chips of the present disclosure with one or more additional integrated circuits, the memory chip may include a memory interface, such as a memory interface complying with a Joint Electron Device Engineering Council (JEDEC) standard or any revision thereof. The one or more additional integrated circuits may then connect to the memory interface. Accordingly, if the one or more additional integrated circuits are connected to a plurality of memory chips of the present disclosure, data may be shared between the memory chips through the one or more additional integrated circuits. Additionally or alternatively, the one or more additional integrated circuits may include buses that connect to buses on the memory chips of the present disclosure, such that the one or more additional integrated circuits may execute code in cooperation with the memory chips of the present disclosure. In such embodiments, the one or more additional integrated circuits further assist with distributed processing even though they may be on different substrates than the memory chips of the present disclosure.

Furthermore, memory chips of the present disclosure may be arranged in an array in order to form an array of distributed processors. For example, as depicted in FIG. 7C, one or more buses may connect a memory chip 770a to an additional memory chip 770b. In the example of FIG. 7C, memory chip 770a includes processor subunits with one or more corresponding memory banks dedicated to each processor subunit. For example, processor subunit 730a is associated with memory bank 720a, processor subunit 730b is associated with memory bank 720b, processor subunit 730e is associated with memory bank 720c, and processor subunit 730f is associated with memory bank 720d. Buses connect each processor subunit to its corresponding memory bank. Accordingly, bus 740a connects processor subunit 730a to memory bank 720a, bus 740b connects processor subunit 730b to memory bank 720b, bus 740c connects processor subunit 730e to memory bank 720c, and bus 740d connects processor subunit 730f to memory bank 720d. In addition, bus 750a connects processor subunit 730a to processor subunit 750a, bus 750b connects processor subunit 730a to processor subunit 750b, bus 750c connects processor subunit 730b to processor subunit 750f, and bus 750d connects processor subunit 730e to processor subunit 750f. As described above, other arrangements of memory chip 770a may also be used.

Similarly, memory chip 770b includes processor subunits with one or more corresponding memory banks dedicated to each processor subunit. For example, processor subunit 730c is associated with memory bank 720e, processor subunit 730d is associated with memory bank 720f, processor subunit 730g is associated with memory bank 720g, and processor subunit 730h is associated with memory bank 720h. Buses connect each processor subunit to its corresponding memory bank. Accordingly, bus 740e connects processor subunit 730c to memory bank 720e, bus 740f connects processor subunit 730d to memory bank 720f, bus 740g connects processor subunit 730g to memory bank 720g, and bus 740h connects processor subunit 730h to memory bank 720h. In addition, bus 750g connects processor subunit 730c to processor subunit 750g, bus 750h connects processor subunit 730d to processor subunit 750h, bus 750i connects processor subunit 730c to processor subunit 750d, and bus 750j connects processor subunit 730g to processor subunit 750h. As described above, other arrangements of memory chip 770b may also be used.

The processor subunits of memory chips 770a and 770b may be connected to each other using one or more buses. Accordingly, in the example of FIG. 7C, bus 750e may connect processor subunit 730b of memory chip 770a with processor subunit 730c of memory chip 770b, and bus 750f may connect processor subunit 730f of memory chip 770a with processor subunit 730c of memory chip 770b. For example, bus 750e may serve as an input bus to memory chip 770b (and thus an output bus of memory chip 770a), while bus 750f may serve as an input bus to memory chip 770a (and thus an output bus of memory chip 770b), or vice versa. Alternatively, buses 750e and 750f may both serve as two-way buses between memory chips 770a and 770b.

Buses 750e and 750f may include direct wires or may be interleaved on a high-speed connection in order to reduce the number of pins used for the inter-chip interface between memory chip 770a and integrated circuit 770b. Moreover, any of the connection arrangements described above for use within the memory chip itself may be used to connect the memory chip to one or more additional integrated circuits. For example, memory chips 770a and 770b may be connected using a full-tile or partial-tile connection rather than only the two buses shown in FIG. 7C.

Accordingly, although depicted using buses 750e and 750f, architecture 760 may include fewer buses or additional buses. For example, a single bus may be used between processor subunits 730a and 730b or between processor subunits 730f and 730c. Alternatively, additional buses may be used, for example between processor subunits 730b and 730d or between processor subunits 730f and 730d.

Furthermore, although depicted as using a single memory chip and an additional integrated circuit, a plurality of memory chips may be connected using buses as explained above. For example, as depicted in FIG. 7C, memory chips 770a, 770b, 770c, and 770d are connected in an array. Each memory chip includes processor subunits and dedicated memory banks similar to the memory chips described above. Accordingly, a description of these components is not repeated here.

In the example of FIG. 7C, memory chips 770a, 770b, 770c, and 770d are connected in a loop. Accordingly, bus 750a connects memory chips 770a and 770d, bus 750c connects memory chips 770a and 770b, bus 750e connects memory chips 770b and 770c, and bus 750g connects memory chips 770c and 770d. Although memory chips 770a, 770b, 770c, and 770d could instead be connected with full-tile connections, partial-tile connections, or other connection arrangements, the example of FIG. 7C allows for fewer pin connections between memory chips 770a, 770b, 770c, and 770d.
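The pin saving of the loop arrangement can be illustrated with a small sketch. The Python below is only an illustration under stated assumptions: the chip labels are taken from the description above, while the helper names `ring_links`, `ring_hops`, and `full_links` are hypothetical and do not appear in the disclosure. It counts the inter-chip links a loop needs and compares that with wiring every chip directly to every other chip.

```python
def ring_links(chips):
    """Return the chip pairs joined by a bus when the chips form a loop."""
    n = len(chips)
    return [(chips[i], chips[(i + 1) % n]) for i in range(n)]


def ring_hops(chips, src, dst):
    """Minimum number of inter-chip buses crossed between two chips in the loop."""
    n = len(chips)
    d = abs(chips.index(src) - chips.index(dst))
    return min(d, n - d)


def full_links(chips):
    """Chip pairs needed if every chip were wired directly to every other chip."""
    return [(a, b) for i, a in enumerate(chips) for b in chips[i + 1:]]


# Labels follow the description above; the ordering around the loop is an assumption.
CHIPS = ["770a", "770b", "770c", "770d"]
LOOP = ring_links(CHIPS)
```

With four chips, the loop uses four inter-chip links where an all-to-all wiring would use six, at the cost of some transfers crossing two links instead of one.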

Relatively Large Memories

Embodiments of the present disclosure may use dedicated memories of relatively large size compared with the shared memories of conventional processors. Using dedicated memories rather than shared memories allows the memory to grow without the accompanying loss of efficiency. As a result, memory-intensive tasks such as neural network processing and database queries may be performed more efficiently than in conventional processors, where the efficiency gains of increasing shared memory taper off due to the von Neumann bottleneck.

For example, in distributed processors of the present disclosure, a memory array disposed on the substrate of the distributed processor may include a plurality of discrete memory banks, each of which may have a capacity of at least one megabyte. The distributed processor may also include a processing array disposed on the substrate and including a plurality of processor subunits. As explained above, each of the processor subunits may be associated with a corresponding, dedicated one of the plurality of discrete memory banks. In some embodiments, the plurality of processor subunits may be spatially distributed among the plurality of discrete memory banks within the memory array. By using dedicated memories of at least one megabyte, rather than shared caches of a few megabytes for a large CPU or GPU, the distributed processors of the present disclosure gain efficiencies that are not possible in conventional systems because of the von Neumann bottleneck in CPUs and GPUs.

Different memories may be used as the dedicated memories. For example, each dedicated memory bank may comprise at least one DRAM bank. Alternatively, each dedicated memory bank may comprise at least one SRAM bank. In other embodiments, different types of memories may be combined on a single hardware chip.

As explained above, each dedicated memory may be at least one megabyte. Accordingly, each dedicated memory bank may be the same size, or at least two of the plurality of memory banks may have different sizes.

Moreover, as described above, the distributed processor may include a first plurality of buses, each connecting one of the plurality of processor subunits to its corresponding dedicated memory bank, and a second plurality of buses, each connecting one of the plurality of processor subunits to another one of the plurality of processor subunits.

Synchronization Using Software

As explained above, hardware chips of the present disclosure may manage data transfers using software rather than hardware. In particular, because the timings of transfers on the buses, reads and writes to the memories, and calculations of the processor subunits are set by the sub-series of instructions executed by the processor subunits, hardware chips of the present disclosure may execute code to prevent collisions on the buses. Accordingly, hardware chips of the present disclosure may avoid hardware mechanisms conventionally used to manage data transfers (e.g., network controllers within the chip, packet parsers and packet transferors between processor subunits, bus arbiters, a plurality of buses to avoid arbitration, or the like).

If hardware chips of the present disclosure transferred data conventionally, connecting N processor subunits with buses would require wide MUXes or bus arbitration controlled by an arbiter. Instead, as described above, embodiments of the present disclosure may use a bus that is only a wire, an optical cable, or the like between processor subunits, with the processor subunits individually executing code to avoid collisions on the buses. Accordingly, embodiments of the present disclosure may preserve space on the substrate and reduce materials cost and efficiency losses (e.g., due to the power and time consumed by arbitration). The efficiency and space gains are even greater when compared with other architectures that use first-in-first-out (FIFO) controllers and/or mailboxes.

Furthermore, as explained above, each processor subunit may include one or more accelerators in addition to one or more processing elements. In some embodiments, the accelerator(s), rather than the processing element(s), may read from and write to the buses. In such embodiments, additional efficiency may be obtained by allowing the accelerator(s) to transmit data during the same cycle in which the processing element(s) perform one or more calculations. Such embodiments, however, require additional materials for the accelerator(s). For example, additional transistors may be required to fabricate the accelerator(s).
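The benefit of letting an accelerator move data while the processing element computes can be seen with simple cycle arithmetic. The following Python sketch rests on an idealized assumption that is not stated in the disclosure: each step has a fixed compute time and a fixed transfer time, and with an accelerator the two overlap perfectly within a step. The function names and the example step values are hypothetical.

```python
def serial_cycles(steps):
    """Cycles when the processing element both computes and drives the bus."""
    return sum(compute + transfer for compute, transfer in steps)


def overlapped_cycles(steps):
    """Cycles when an accelerator transfers data while the processing element computes.

    Idealized model: within each step the longer of the two activities dominates.
    """
    return sum(max(compute, transfer) for compute, transfer in steps)


# Hypothetical workload: (compute cycles, transfer cycles) for each step.
STEPS = [(4, 2), (3, 3), (1, 5)]
SERIAL = serial_cycles(STEPS)
OVERLAPPED = overlapped_cycles(STEPS)
```

For these example values the serial total is 18 cycles and the overlapped total is 12 cycles; real gains would depend on how evenly compute and transfer times are balanced.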

The code may also account for the internal behavior, including timing and latencies, of the processor subunits (e.g., of the processing elements and/or accelerators forming part of a processor subunit). For example, a compiler (as described below) may perform pre-processing that accounts for the timing and latencies when generating the sub-series of instructions that control the data transfers.

In one example, a plurality of processor subunits may be assigned the task of calculating a neural network layer containing a plurality of neurons, each fully connected to a larger plurality of neurons in the previous layer. Assuming the data of the previous layer is evenly spread among the plurality of processor subunits, one way to perform the calculation is to configure each processor subunit to transmit its data of the previous layer to the main bus in turn; each processor subunit then multiplies this data by the weights of the corresponding neurons that it implements. Because each processor subunit calculates more than one neuron, each processor subunit transmits the data of the previous layer a number of times equal to the number of neurons. Thus, the code of each processor subunit is not the same as the code of the other processor subunits, because the processor subunits transmit at different times.
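The turn-taking scheme in this example can be sketched as a small simulation. The Python below is a simplified illustration, not the disclosed implementation: it assumes a single shared bus, one value broadcast per time slot, and a send order fixed ahead of time (subunit 0 sends all of its values, then subunit 1, and so on). For brevity each previous-layer value is broadcast once and consumed by all neurons in the same slot. The function name and data layout are hypothetical.

```python
def fully_connected_on_bus(local_inputs, weights):
    """Simulate a fully connected layer computed over a time-slotted bus.

    local_inputs[p]   : previous-layer values stored in subunit p's dedicated bank
    weights[p][j][k]  : weight from global input k to the j-th neuron of subunit p
    Returns (outputs per subunit, send schedule as (sender, local index) per slot).
    """
    # The schedule is decided before execution, so no arbiter is needed at run time.
    schedule = [(p, i) for p, vals in enumerate(local_inputs) for i in range(len(vals))]
    acc = [[0.0] * len(w) for w in weights]
    for slot, (sender, i) in enumerate(schedule):
        value_on_bus = local_inputs[sender][i]  # exactly one driver per slot
        for p, neuron_rows in enumerate(weights):
            for j, row in enumerate(neuron_rows):
                # Global input index equals the slot number under this schedule.
                acc[p][j] += row[slot] * value_on_bus
    return acc, schedule


# Hypothetical data: two subunits, two previous-layer values in each bank.
INPUTS = [[1.0, 2.0], [3.0, 4.0]]
WEIGHTS = [
    [[1, 0, 0, 1]],                  # subunit 0 implements one neuron
    [[1, 1, 1, 1], [0, 1, 0, -1]],   # subunit 1 implements two neurons
]
OUTPUTS, SCHEDULE = fully_connected_on_bus(INPUTS, WEIGHTS)
```

Because the slots in which a given subunit drives the bus differ from those of every other subunit, the per-subunit instruction streams differ, which mirrors the point made above about non-identical code.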

In some embodiments, a distributed processor may comprise a substrate (e.g., a semiconductor substrate, such as silicon, and/or a circuit board, such as a flexible circuit board) with a memory array disposed on the substrate, the memory array including a plurality of discrete memory banks, and a processing array disposed on the substrate, the processing array including a plurality of processor subunits, as depicted, e.g., in FIGS. 7A and 7B. As explained above, each of the processor subunits may be associated with a corresponding, dedicated one of the plurality of discrete memory banks. Moreover, as depicted, e.g., in FIGS. 7A and 7B, the distributed processor may further comprise a plurality of buses, each connecting one of the plurality of processor subunits to at least another one of the plurality of processor subunits.

As explained above, the plurality of buses may be controlled in software. Accordingly, the plurality of buses may be free of timing hardware logic components, such that data transfers between processor subunits and across corresponding ones of the plurality of buses are not controlled by timing hardware logic components. In one example, the plurality of buses may be free of bus arbiters, such that data transfers between processor subunits and across corresponding ones of the plurality of buses are not controlled by bus arbiters.

In some embodiments, as depicted, e.g., in FIGS. 7A and 7B, the distributed processor may further comprise a second plurality of buses connecting the plurality of processor subunits to their corresponding dedicated memory banks. Similar to the plurality of buses described above, the second plurality of buses may be free of timing hardware logic components, such that data transfers between the processor subunits and their corresponding dedicated memory banks are not controlled by timing hardware logic components. In one example, the second plurality of buses may be free of bus arbiters, such that data transfers between the processor subunits and their corresponding dedicated memory banks are not controlled by bus arbiters.

As used herein, the phrase "free of" does not necessarily imply the absolute absence of timing hardware logic components (e.g., bus arbiters, arbitration trees, FIFO controllers, mailboxes, or the like). Such components may still be included in a hardware chip described as being "free of" them. Instead, the phrase "free of" refers to the function of the hardware chip; that is, a hardware chip "free of" timing hardware logic components controls the timing of its data transfers without using any timing hardware logic components that may be included in it. For example, a hardware chip may execute code that includes sub-series of instructions controlling data transfers between its processor subunits, even if the hardware chip includes timing hardware logic components as a secondary precaution to protect against collisions caused by errors in the executed code.

As explained above, the plurality of buses may comprise at least one of wires or optical fibers between corresponding ones of the plurality of processor subunits. Accordingly, in one example, a distributed processor free of timing hardware logic components may include only wires or optical fibers, without bus arbiters, arbitration trees, FIFO controllers, mailboxes, or the like.

In some embodiments, the plurality of processor subunits is configured to transfer data across at least one of the plurality of buses in accordance with code executed by the plurality of processor subunits. Accordingly, as explained above, a compiler may organize sub-series of instructions, each sub-series comprising code executed by a single processor subunit. The sub-series of instructions may instruct the processor subunit when to transfer data onto one of the buses and when to retrieve data from the buses. When the sub-series are executed in tandem across the distributed processor, the timing of transfers between the processor subunits may be governed by the transfer and retrieve instructions included in the sub-series. Thus, the code dictates the timing of data transfers across at least one of the plurality of buses. The compiler may generate code to be executed by a single processor subunit. Additionally, the compiler may generate code to be executed by groups of processor subunits. In some cases, the compiler may treat all of the processor subunits together as if they were one super-processor (e.g., a distributed processor), and may generate code for execution by the super-processor/distributed processor so defined.
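How compile-time scheduling can stand in for a bus arbiter may be sketched as follows. This Python example is a minimal illustration and not the compiler of the disclosure: it assumes every transfer takes exactly one cycle, and it uses a simple greedy rule that places each transfer in the earliest cycle in which its bus and both endpoints are idle. The names `compile_subseries`, `has_collision`, the `SEND`/`RECV` mnemonics, and the labels `p0`, `bus0`, and so on are all hypothetical.

```python
from collections import defaultdict


def compile_subseries(transfers):
    """Turn an ordered list of (src, dst, bus) transfers into per-subunit programs.

    Returns {subunit: [(cycle, op, bus), ...]} where op is "SEND" or "RECV".
    Sender and receiver get the same cycle, so timing is fixed by the code alone.
    """
    bus_free = defaultdict(int)    # first cycle at which each bus is idle
    unit_free = defaultdict(int)   # first cycle at which each subunit is idle
    program = defaultdict(list)
    for src, dst, bus in transfers:
        cycle = max(bus_free[bus], unit_free[src], unit_free[dst])
        program[src].append((cycle, "SEND", bus))
        program[dst].append((cycle, "RECV", bus))
        bus_free[bus] = unit_free[src] = unit_free[dst] = cycle + 1
    return dict(program)


def has_collision(program):
    """True if two subunits drive the same bus in the same cycle."""
    drivers = defaultdict(int)
    for instructions in program.values():
        for cycle, op, bus in instructions:
            if op == "SEND":
                drivers[(cycle, bus)] += 1
    return any(count > 1 for count in drivers.values())


# Hypothetical workload: two transfers contend for bus0, one uses bus1.
TRANSFERS = [("p0", "p1", "bus0"), ("p2", "p3", "bus1"), ("p1", "p0", "bus0")]
PROGRAM = compile_subseries(TRANSFERS)
```

In this sketch the second use of `bus0` is pushed to cycle 1, so the generated sub-series never place two senders on one bus in the same cycle, and no run-time arbitration is needed to keep them apart.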

As explained above and depicted in FIGS. 7A and 7B, the plurality of processor subunits may be spatially distributed among the plurality of discrete memory banks within the memory array. Alternatively, the plurality of processor subunits may be clustered in one or more regions of the substrate, and the plurality of memory banks may be clustered in one or more other regions of the substrate. In some embodiments, as explained above, a combination of spatial distribution and clustering may be used.

In some embodiments, a distributed processor may comprise a substrate (e.g., a semiconductor substrate, such as silicon, and/or a circuit board, such as a flexible circuit board) with a memory array disposed on the substrate, the memory array including a plurality of discrete memory banks. A processing array may also be disposed on the substrate, the processing array including a plurality of processor subunits, as depicted, e.g., in FIGS. 7A and 7B. As explained above, each of the processor subunits may be associated with a corresponding, dedicated one of the plurality of discrete memory banks. Moreover, as depicted, e.g., in FIGS. 7A and 7B, the distributed processor may further comprise a plurality of buses, each connecting one of the plurality of processor subunits to its corresponding, dedicated one of the plurality of discrete memory banks.

As explained above, the plurality of buses may be controlled in software. Accordingly, the plurality of buses may be free of timing hardware logic components, such that data transfers between a processor subunit and its corresponding, dedicated one of the plurality of discrete memory banks, and across corresponding ones of the plurality of buses, are not controlled by timing hardware logic components. In one example, the plurality of buses may be free of bus arbiters, such that data transfers between processor subunits and across corresponding ones of the plurality of buses are not controlled by bus arbiters.

In some embodiments, as depicted, e.g., in FIGS. 7A and 7B, the distributed processor may further comprise a second plurality of buses connecting one of the plurality of processor subunits to at least another one of the plurality of processor subunits. Similar to the plurality of buses described above, the second plurality of buses may be free of timing hardware logic components, such that data transfers between the processor subunits and their corresponding dedicated memory banks are not controlled by timing hardware logic components. In one example, the second plurality of buses may be free of bus arbiters, such that data transfers between the processor subunits and their corresponding dedicated memory banks are not controlled by bus arbiters.

In some embodiments, the distributed processor may use a combination of software timing components and hardware timing components. For example, a distributed processor may comprise a substrate (e.g., a semiconductor substrate, such as silicon, and/or a circuit board, such as a flexible circuit board) with a memory array disposed on the substrate, the memory array including a plurality of discrete memory banks. A processing array may also be disposed on the substrate, the processing array including a plurality of processor subunits, as depicted, e.g., in FIGS. 7A and 7B. As explained above, each of the processor subunits may be associated with a corresponding, dedicated one of the plurality of discrete memory banks. Moreover, as depicted, e.g., in FIGS. 7A and 7B, the distributed processor may further comprise a plurality of buses, each connecting one of the plurality of processor subunits to at least another one of the plurality of processor subunits. In addition, as explained above, the plurality of processor subunits may be configured to execute software that controls the timing of data transfers across the plurality of buses so as to avoid colliding data transfers on at least one of the plurality of buses. In such an example, the software may control the timing of the data transfers, while the transfers themselves may be controlled, at least in part, by one or more hardware components.

In such embodiments, the distributed processor may further include a second plurality of buses, each connecting one of the plurality of processor subunits to a corresponding dedicated memory bank. As with the plurality of buses described above, the plurality of processor subunits may be configured to execute software that controls the timing of data transfers over the second plurality of buses so as to avoid colliding data transfers on at least one of the second plurality of buses. In such an example, as described above, the software may control the timing of the data transfers, while the transfers themselves may be controlled, at least in part, by one or more hardware elements.

Division of Code

As described above, a hardware chip of the present disclosure may execute code in parallel across the processor subunits included on the substrate forming the hardware chip. In addition, a hardware chip of the present disclosure may perform multitasking. For example, the hardware chip may perform area multitasking, in which one group of processor subunits of the hardware chip executes one task (e.g., audio processing) while another group of processor subunits of the hardware chip executes another task (e.g., image processing). In another example, the hardware chip may perform timing multitasking, in which one or more processor subunits of the hardware chip execute one task during a first time period and another task during a second time period. A combination of area multitasking and timing multitasking may also be used: during a first time period, one task is assigned to a first group of processor subunits and another task is assigned to a second group of processor subunits, after which, during a second time period, a third task is assigned to processor subunits included in the first group and the second group.
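
The combined area and timing multitasking described above can be pictured as a small schedule table. The following is only an illustrative sketch: the function name `build_schedule`, the even split into two groups, and the task labels are assumptions made for the example and are not taken from the disclosure.

```python
from typing import Dict, List


def build_schedule(num_subunits: int) -> Dict[int, List[str]]:
    """Map each time period to the task run by each processor subunit.

    Hypothetical model: subunits [0, half) form the first group and
    subunits [half, num_subunits) form the second group.
    """
    half = num_subunits // 2
    schedule = {0: [""] * num_subunits, 1: [""] * num_subunits}
    for subunit in range(num_subunits):
        # Period 0 (area multitasking): the two groups run different tasks.
        schedule[0][subunit] = "audio" if subunit < half else "image"
        # Period 1 (timing multitasking): both groups switch to a third task.
        schedule[1][subunit] = "third_task"
    return schedule


if __name__ == "__main__":
    for period, tasks in build_schedule(4).items():
        print(period, tasks)
```

In this toy model the first period splits the subunits by region, and the second period reuses all of them for a single task.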

To organize machine code for execution on a memory chip of the present disclosure, the machine code may be divided among the processor subunits of the memory chip. For example, a processor on a memory chip may include a substrate and a plurality of processor subunits disposed on the substrate. The memory chip may further include a corresponding plurality of memory banks disposed on the substrate, where each of the plurality of processor subunits is connected to at least one dedicated memory bank that is not shared with any other processor subunit of the plurality of processor subunits. Each processor subunit on the memory chip may be configured to execute a series of instructions independently of the other processor subunits. Each series of instructions may be executed by configuring one or more general processing elements of the processor subunit according to the code defining the series of instructions and/or by activating one or more special processing elements (e.g., one or more accelerators) of the processor subunit according to a sequence provided in the code defining the series of instructions.

Accordingly, each series of instructions may define a series of tasks to be performed by a single processor subunit. A single task may comprise an instruction within the instruction set defined by the architecture of one or more processing elements in the processor subunit. For example, the processor subunit may include particular registers, and a single task may push data onto a register, pull data from a register, perform an arithmetic operation on data within a register, perform a logic operation on data within a register, or the like. Moreover, the processor subunits may be configured for any number of operands, such as a 0-operand processor subunit (also called a "stack machine"), a 1-operand processor subunit (also called an "accumulator machine"), a 2-operand processor subunit (such as a RISC), a 3-operand processor subunit (such as a complex instruction set computer (CISC)), or the like. In another example, the processor subunit may include one or more accelerators, and a single task may activate an accelerator to perform a specific function, such as a MAC function, a MAX function, a MAX-0 function, or the like.
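
As a rough illustration of a series of tasks operating on registers and accelerator functions, the sketch below interprets a short task list. It is a toy model, not the disclosed instruction set: the task names (`push`, `mac`, `max0`), the register dictionary, and the reading of MAC as multiply-accumulate and MAX-0 as max(x, 0) are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def run_tasks(tasks: List[Tuple]) -> Dict[str, int]:
    """Execute a series of tasks on a hypothetical register file."""
    regs: Dict[str, int] = {}
    for op, *args in tasks:
        if op == "push":
            # Push an immediate value into a register.
            reg, value = args
            regs[reg] = value
        elif op == "mac":
            # Assumed MAC semantics: regs[dst] += regs[a] * regs[b].
            dst, a, b = args
            regs[dst] = regs.get(dst, 0) + regs[a] * regs[b]
        elif op == "max0":
            # Assumed MAX-0 semantics: clamp the register at zero from below.
            (reg,) = args
            regs[reg] = max(regs[reg], 0)
        else:
            raise ValueError(f"unknown task: {op}")
    return regs


if __name__ == "__main__":
    program = [
        ("push", "x", 3),
        ("push", "w", -2),
        ("mac", "acc", "x", "w"),
        ("max0", "acc"),
    ]
    print(run_tasks(program))
```

Each tuple stands for one task in the series; a real subunit would encode these as machine instructions rather than Python tuples.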

The series of instructions may further include tasks for reading from and writing to the dedicated memory banks of the memory chip. For example, a task may include writing a piece of data to a memory bank dedicated to the processor subunit executing the task, reading a piece of data from a memory bank dedicated to the processor subunit executing the task, or the like. In some embodiments, the reading and writing may be performed by the processor subunit in cooperation with a controller of the memory bank. For example, the processor subunit may execute a read or write task by sending a control signal to the controller to perform the read or write. In some embodiments, the control signal may include a specific address to use for the read or write. Alternatively, the processor subunit may defer to the memory controller to select an available address for the read or write.

Additionally or alternatively, the reading and writing may be performed by one or more accelerators in cooperation with the controller of the memory bank. For example, the accelerators may generate control signals for the memory controller in the same way that the processor subunit generates control signals, as described above.

In any of the embodiments described above, an address generator may also be used to direct reads and writes to specific addresses of a memory bank. For example, the address generator may comprise a processing element configured to generate memory addresses for reads and writes. The address generator may be configured to generate addresses that improve efficiency, for example by writing the result of a later calculation to the same address as the result of an earlier calculation that is no longer needed. Accordingly, the address generator may generate the control signals for the memory controller either in response to a command from the processor subunit (e.g., from a processing element included therein or from one or more accelerators therein) or in cooperation with the processor subunit. Additionally or alternatively, the address generator may generate addresses based on some configuration or registers, for example by producing a nested loop structure that iterates over particular addresses in the memory in a particular pattern.
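
The nested-loop address pattern mentioned above can be sketched as a generator driven by a few configuration values. This is a minimal illustration only; the parameter names (`base`, `outer_count`, `outer_stride`, `inner_count`, `inner_stride`) are hypothetical stand-ins for whatever settings or registers an actual address generator would use.

```python
from typing import Iterator


def nested_loop_addresses(
    base: int,
    outer_count: int,
    outer_stride: int,
    inner_count: int,
    inner_stride: int,
) -> Iterator[int]:
    """Yield addresses in a two-level nested-loop pattern."""
    for i in range(outer_count):
        for j in range(inner_count):
            # Each address is the base plus one offset per loop level.
            yield base + i * outer_stride + j * inner_stride


if __name__ == "__main__":
    # Walk two rows of three 4-byte words, rows spaced 16 bytes apart.
    for address in nested_loop_addresses(0x100, 2, 16, 3, 4):
        print(hex(address))
```

Because the pattern is fully determined by the configuration values, the processor subunit would not need to compute each address itself.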

In some embodiments, each series of instructions may comprise a set of machine code defining a corresponding series of tasks. Accordingly, the series of tasks may be packed into the machine code comprising the series of instructions. In some embodiments, as described below with reference to FIG. 8, the series of tasks may be defined by a compiler configured to distribute a higher-level series of tasks among the plurality of logic circuits as a plurality of series of tasks. For example, the compiler may generate the plurality of series of tasks based on the higher-level series of tasks such that the processor subunits, each executing its corresponding series of tasks in concert, perform the same function as that described by the higher-level series of tasks.

As described below, the higher-level series of tasks may comprise a set of instructions in a human-readable programming language. Correspondingly, the series of tasks for each processor subunit may comprise a lower-level series of tasks, each comprising a set of instructions in machine code.

As described above with reference to FIGS. 7A and 7B, the memory chip may further include a plurality of buses, each connecting one of the plurality of processor subunits to at least another one of the plurality of processor subunits. Moreover, as described above, data transfers on the plurality of buses may be controlled using software. Accordingly, data transfers across at least one of the plurality of buses may be predetermined by the series of instructions included in a processor subunit connected to the at least one of the plurality of buses. Therefore, one of the tasks included in the series of instructions may include outputting data to one of the buses or pulling data from one of the buses. Such tasks may be executed by a processing element of the processor subunit or by one or more accelerators included in the processor subunit. In the latter case, the processor subunit may perform a calculation or send a control signal to its corresponding memory bank in the same cycle in which the accelerator pulls data from, or places data on, one of the buses.

In one example, the series of instructions included in the processor subunit connected to the at least one of the plurality of buses may include a send task comprising a command for the processor subunit to write data to the at least one of the plurality of buses. Additionally or alternatively, the series of instructions included in the processor subunit connected to the at least one of the plurality of buses may include a receive task comprising a command for the processor subunit to read data from the at least one of the plurality of buses.

In addition to, or as an alternative to, distributing code among the processor subunits, data may be divided among the memory banks of the memory chip. For example, as described above, a distributed processor on a memory chip may include a plurality of processor subunits disposed on the memory chip and a plurality of memory banks disposed on the memory chip. Each memory bank of the plurality of memory banks may be configured to store data independent of the data stored in the other memory banks, and each processor subunit of the plurality of processor subunits may be connected to at least one dedicated memory bank among the plurality of memory banks. For example, each processor subunit may have access to one or more memory controllers of the one or more corresponding memory banks dedicated to that processor subunit, and no other processor subunit may have access to those memory controllers. Accordingly, because no memory controller is shared between memory banks, the data stored in each memory bank may be independent of the data stored in the other memory banks.

In some embodiments, as described below with reference to FIG. 8, the data stored in each of the plurality of memory banks may be defined by a compiler configured to distribute data among the plurality of memory banks. Moreover, the compiler may be configured to distribute data defined in a higher-level series of tasks across the plurality of memory banks using a plurality of lower-level tasks distributed among the corresponding processor subunits.

As described further below, the higher-level series of tasks may comprise a set of instructions in a human-readable programming language. Correspondingly, the series of tasks for each processor subunit may comprise a lower-level series of tasks, each comprising a set of instructions in machine code.

As described above with reference to FIGS. 7A and 7B, the memory chip may further include a plurality of buses, each connecting one of the plurality of processor subunits to one or more corresponding dedicated memory banks among the plurality of memory banks. Moreover, as described above, data transfers on the plurality of buses may be controlled using software. Accordingly, data transfers across a particular one of the plurality of buses may be controlled by the corresponding processor subunit connected to that bus. Therefore, one of the tasks included in the series of instructions may include outputting data to one of the buses or pulling data from one of the buses. As described above, such tasks may be executed by (i) a processing element of the processor subunit or (ii) one or more accelerators included in the processor subunit. In the latter case, the processor subunit may perform a calculation, or use the buses connecting the processor subunit to other processor subunits, in the same cycle in which the accelerator pulls data from, or places data on, one of the buses connected to the one or more corresponding dedicated memory banks.

Therefore, in one example, the series of instructions included in the processor subunit connected to the at least one of the plurality of buses may include a send task. The send task may comprise a command for the processor subunit to write data to the at least one of the plurality of buses for storage in the one or more corresponding dedicated memory banks. Additionally or alternatively, the series of instructions may include a receive task. The receive task may comprise a command for the processor subunit to read data from the at least one of the plurality of buses for storage in the one or more corresponding dedicated memory banks. Accordingly, the send and receive tasks in such embodiments may comprise control signals sent along the at least one of the plurality of buses to one or more memory controllers of the one or more corresponding dedicated memory banks. Moreover, the send and receive tasks may be executed by one portion of the processor subunit (e.g., by one or more accelerators of the processor subunit) concurrently with a calculation or other task executed by another portion of the processor subunit (e.g., by one or more different accelerators of the processor subunit). An example of such concurrent execution is a MAC-relay command, in which receiving, multiplying, and sending are executed together.

In addition to distributing data among the memory banks, particular portions of data may be duplicated across different memory banks. For example, as described above, a distributed processor on a memory chip may include a plurality of processor subunits disposed on the memory chip and a plurality of memory banks disposed on the memory chip. Each of the plurality of processor subunits may be connected to at least one dedicated memory bank among the plurality of memory banks, and each memory bank may be configured to store data independent of the data stored in the other memory banks. Moreover, at least some of the data stored in one particular memory bank may comprise a duplicate of data stored in at least one other memory bank of the plurality of memory banks. For example, a number, a string, or another type of data used in the series of instructions may be stored in a plurality of memory banks dedicated to different processor subunits, rather than being transferred from one memory bank to other processor subunits of the memory chip.

In one example, parallel string matching may use the data duplication described above. For example, a plurality of strings may be compared to the same string. A conventional processor would compare each of the plurality of strings to that same string in sequence. On a hardware chip of the present disclosure, the same string may be duplicated across the memory banks so that the processor subunits can compare individual strings of the plurality of strings to the duplicated string in parallel.
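
The duplication-based string matching above can be mimicked in software. In the sketch below, each "bank" is a Python dictionary holding its own copy of the pattern plus a shard of the candidate strings, and a thread pool stands in for the processor subunits. All names (`make_banks`, `match_in_bank`, `parallel_match`) and the round-robin sharding are assumptions for illustration; this is not how the hardware itself is programmed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


def make_banks(candidates: List[str], pattern: str, num_banks: int) -> List[Dict]:
    """Shard the candidates round-robin and copy the pattern into every bank."""
    banks = [{"pattern": pattern, "strings": []} for _ in range(num_banks)]
    for index, text in enumerate(candidates):
        banks[index % num_banks]["strings"].append((index, text))
    return banks


def match_in_bank(bank: Dict) -> List[Tuple[int, bool]]:
    """Compare a bank's shard against the bank's own copy of the pattern."""
    return [(index, text == bank["pattern"]) for index, text in bank["strings"]]


def parallel_match(candidates: List[str], pattern: str, num_banks: int = 4) -> List[bool]:
    """Return, for each candidate, whether it equals the pattern."""
    banks = make_banks(candidates, pattern, num_banks)
    with ThreadPoolExecutor(max_workers=num_banks) as pool:
        partial_results = list(pool.map(match_in_bank, banks))
    results = [False] * len(candidates)
    for partial in partial_results:
        for index, hit in partial:
            results[index] = hit
    return results


if __name__ == "__main__":
    print(parallel_match(["abc", "abd", "abc", "x"], "abc", num_banks=2))
```

No worker reads another bank's data, which mirrors the point of the duplication: the comparison needs no transfers between subunits.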

In some embodiments, as described below with reference to FIG. 8, the data duplicated across one particular memory bank and at least one other memory bank of the plurality of memory banks may be defined by a compiler configured to duplicate data across memory banks. Moreover, the compiler may be configured to duplicate the at least some data using a plurality of lower-level tasks distributed among the corresponding processor subunits.

Duplication of data may be useful for tasks that reuse the same portions of data in different calculations. By duplicating these portions of data, the calculations may be distributed among the processor subunits of the memory chip for parallel execution, while each processor subunit stores the portions of data in, and accesses them from, a dedicated memory bank (rather than pushing and pulling them across the buses connecting the processor subunits). In one example, the data duplicated across one particular memory bank and at least one other memory bank of the plurality of memory banks may comprise the weights of a neural network. In this example, each node in the neural network may be defined by at least one processor subunit of the plurality of processor subunits. For example, each node may comprise machine code executed by the at least one processor subunit that defines the node. Duplicating the weights then allows each processor subunit to execute machine code that effects, at least in part, a corresponding node while accessing only one or more dedicated memory banks (rather than performing data transfers with other processor subunits). The timing of reads and writes to a dedicated memory is independent of the other processor subunits, whereas the timing of data transfers between processor subunits requires synchronization (e.g., using software, as described above). Duplicating memory so as to avoid data transfers between processor subunits may therefore make overall execution more efficient.

As described above with reference to FIGS. 7A and 7B, the memory chip may include a plurality of buses, each connecting one of the plurality of processor subunits to one or more corresponding dedicated memory banks among the plurality of memory banks. Moreover, as described above, data transfers on the plurality of buses may be controlled using software. Accordingly, data transfers across a particular one of the plurality of buses may be controlled by the corresponding processor subunit connected to that bus. Therefore, one of the tasks included in the series of instructions may include outputting data to one of the buses or pulling data from one of the buses. As described above, such tasks may be executed by (i) a processing element of the processor subunit or (ii) one or more accelerators included in the processor subunit. As further described above, such tasks may include a send task and/or a receive task comprising control signals sent along at least one of the plurality of buses to one or more memory controllers of the one or more corresponding dedicated memory banks.

FIG. 8 depicts a flowchart of a method 800 for compiling a series of instructions for execution on an exemplary memory chip of the present disclosure, such as those shown in FIGS. 7A and 7B. Method 800 may be implemented by any conventional processor, whether general-purpose or special-purpose.

Method 800 may be executed as part of a computer program forming a compiler. As used herein, a "compiler" refers to any computer program that converts a higher-level language (e.g., a procedural language such as C, FORTRAN, or BASIC; an object-oriented language such as Java, C++, Pascal, or Python; or the like) into a lower-level language (e.g., assembly code, object code, machine code, or the like). The compiler may allow a person to program a series of instructions in a human-readable language, which is later converted into a machine-executable language.

At step 810, the processor may assign tasks associated with the series of instructions to different ones of the processor subunits. For example, the series of instructions may be divided into subgroups to be executed in parallel across the processor subunits. In one example, a neural network may be divided into its nodes, and one or more nodes may be assigned to different processor subunits. In this example, each subgroup may comprise a plurality of nodes connected across different layers. Thus, a processor subunit may implement a node from a first layer of the neural network, a node from a second layer that is connected to the first-layer node implemented by the same processor subunit, and so on. By assigning nodes based on their connections, data transfers between processor subunits may be reduced, which may improve efficiency, as described above.
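
One simple way to picture connection-based assignment in step 810 is a greedy walk that keeps each first-layer node and the nodes reachable from it on the same subunit. The sketch below is an assumed heuristic for illustration only, not the compiler's actual algorithm; `assign_nodes`, `cross_subunit_edges`, and the edge-dictionary format are hypothetical.

```python
from typing import Dict, List


def assign_nodes(
    edges: Dict[str, List[str]], first_layer: List[str], num_subunits: int
) -> Dict[str, int]:
    """Assign each node to a subunit, following connections from layer 1."""
    assignment: Dict[str, int] = {}
    for k, root in enumerate(first_layer):
        subunit = k % num_subunits
        stack = [root]
        while stack:
            node = stack.pop()
            if node in assignment:
                continue  # already claimed by an earlier chain
            assignment[node] = subunit
            stack.extend(edges.get(node, []))
    return assignment


def cross_subunit_edges(edges: Dict[str, List[str]], assignment: Dict[str, int]) -> int:
    """Count connections whose endpoints sit on different subunits."""
    return sum(
        1
        for src, dsts in edges.items()
        for dst in dsts
        if assignment[src] != assignment[dst]
    )


if __name__ == "__main__":
    edges = {"a0": ["b0"], "a1": ["b1"], "b0": ["c0"], "b1": ["c0"]}
    placement = assign_nodes(edges, ["a0", "a1"], num_subunits=2)
    print(placement, cross_subunit_edges(edges, placement))
```

The count returned by `cross_subunit_edges` is a proxy for the number of bus transfers the assignment would require; fewer crossing edges means fewer synchronized transfers.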

As described above with reference to FIGS. 7A and 7B, the processor subunits may be spatially distributed among the plurality of memory banks disposed on the memory chip. Accordingly, the assignment of tasks may be, at least in part, a spatial division as well as a logical division.

At step 820, the processor may generate tasks to transfer data between pairs of the processor subunits of the memory chip, each pair being connected by a bus. For example, as described above, the data transfers may be controlled using software. Accordingly, the processor subunits may be configured to push and pull data on the buses at synchronized times. The generated tasks may thus include tasks for performing this synchronized pushing and pulling of data.

As described above, step 820 may include pre-processing to account for the internal behavior of the processor subunits, including timing and latencies. For example, the processor may use known times and latencies of the processor subunits (e.g., the time to push data to a bus, the time to pull data from a bus, the latency between a calculation and a push or pull, and the like) to ensure that the generated tasks are synchronized. A data transfer comprising at least one push by one or more processor subunits and at least one pull by one or more processor subunits may therefore occur simultaneously, without incurring delays caused by timing differences between the processor subunits, latencies of the processor subunits, or the like.
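
The pre-processing in step 820 can be illustrated with a small compile-time calculation: given when each subunit is ready and how many cycles its push or pull takes to reach the bus, pick one bus cycle and back-compute when each side must issue its task. The model below is a deliberate simplification; the `Timing` fields and the `schedule_transfer` function are hypothetical names, and real latencies would come from the hardware's known characteristics.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Timing:
    ready_cycle: int  # first cycle at which the subunit can issue the task
    latency: int      # cycles from issuing the task until the bus is accessed


def schedule_transfer(sender: Timing, receiver: Timing) -> Dict[str, int]:
    """Choose issue cycles so that push and pull meet on the same bus cycle."""
    # The transfer cannot happen before both sides are able to reach the bus.
    bus_cycle = max(
        sender.ready_cycle + sender.latency,
        receiver.ready_cycle + receiver.latency,
    )
    return {
        "bus_cycle": bus_cycle,
        "push_issue": bus_cycle - sender.latency,
        "pull_issue": bus_cycle - receiver.latency,
    }


if __name__ == "__main__":
    print(schedule_transfer(Timing(ready_cycle=10, latency=2),
                            Timing(ready_cycle=7, latency=3)))
```

In this model the side that is ready early simply waits (or does other work) until its computed issue cycle, so no run-time arbitration is needed.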

In step 830, the processor may group the assigned and generated tasks into a plurality of groups of subseries instructions. For example, each group of subseries instructions may comprise a series of tasks to be executed by a single processor subunit. Each of the plurality of groups of subseries instructions may therefore correspond to a different one of the plurality of processor subunits. Accordingly, steps 810, 820, and 830 may result in dividing the series of instructions into a plurality of groups of subseries instructions. As described above, step 820 may ensure that all data transfers between the different groups are synchronized.
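As a minimal sketch of the grouping in step 830, the following assumes that each task has already been tagged with the identifier of the subunit it was assigned to. The tuple representation and the `group_into_subseries` name are illustrative assumptions, not part of the disclosure.

```python
from collections import defaultdict


def group_into_subseries(tasks):
    """Group (subunit_id, task) pairs into one ordered subseries per subunit.

    Program order is preserved within each subseries.
    """
    groups = defaultdict(list)
    for subunit_id, task in tasks:
        groups[subunit_id].append(task)
    return dict(groups)


tasks = [
    (0, "load a"),
    (1, "load b"),
    (0, "push a"),  # data transfer task generated in step 820
    (1, "pull a"),  # matching pull on the receiving subunit
]
subseries = group_into_subseries(tasks)
```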

In step 840, the processor may generate machine code corresponding to each of the plurality of groups of subseries instructions. For example, higher-level code representing the subseries instructions may be converted into lower-level code, such as machine code, executable by the corresponding processor subunits.

In step 850, the processor may assign the generated machine code corresponding to each of the plurality of groups of subseries instructions to a corresponding one of the plurality of processor subunits in accordance with the division. For example, the processor may label each subseries instruction with an identifier of the corresponding processor subunit. Thus, when the subseries instructions are uploaded to the memory chip for execution (e.g., by host 350 of FIG. 3A), each subseries may configure the correct processor subunit.

In some embodiments, assigning tasks associated with the series of instructions to different ones of the processor subunits may depend, at least in part, on the spatial proximity between two or more of the processor subunits on the memory chip. For example, as described above, efficiency may be increased by reducing the number of data transfers between processor subunits. Accordingly, the processor may minimize data transfers that move data through two or more processor subunits. The processor may therefore use the known layout of the memory chip in combination with one or more optimization algorithms (e.g., a greedy algorithm) to assign subseries to processor subunits in a way that maximizes (at least locally) transfers between adjacent subunits and minimizes (at least locally) transfers to non-neighboring processor subunits.
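One way such a proximity-aware assignment could look is sketched below. This is only an illustration under stated assumptions: the chip layout is modeled as a list of (row, column) positions, distance is Manhattan distance, and the particular greedy ordering (heaviest communicators first) is a choice made here, not something the disclosure specifies. The function and variable names are hypothetical.

```python
def manhattan(p, q):
    """Grid distance between two subunit positions given as (row, col)."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def greedy_place(traffic, slots):
    """Greedily map subseries groups onto subunit positions.

    traffic: {(group_a, group_b): number_of_transfers}
    slots:   list of free (row, col) subunit positions (the known layout)
    """
    groups = sorted({g for pair in traffic for g in pair})

    def weight(g):
        return sum(n for pair, n in traffic.items() if g in pair)

    # Place the groups that communicate the most first.
    order = sorted(groups, key=lambda g: (-weight(g), g))
    free = list(slots)
    placed = {}
    for g in order:
        def cost(slot):
            total = 0
            for (a, b), n in traffic.items():
                other = b if a == g else a if b == g else None
                if other in placed:
                    total += n * manhattan(slot, placed[other])
            return total

        best = min(free, key=cost)  # locally optimal choice for this group
        free.remove(best)
        placed[g] = best
    return placed


traffic = {("A", "B"): 10, ("B", "C"): 1}
slots = [(0, 0), (0, 1), (0, 2)]
placement = greedy_place(traffic, slots)
```

As with any greedy method, the result is only locally optimal, which matches the "at least locally" qualification in the text above.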

Method 800 may include further optimizations for the memory chips of the present disclosure. For example, the processor may group data associated with the series of instructions based on the division and assign the data to the memory banks in accordance with the grouping. Accordingly, each memory bank may hold the data used by the subseries instructions assigned to the processor subunit to which that memory bank is dedicated.

In some embodiments, grouping the data may include determining at least a portion of the data to duplicate in two or more of the memory banks. For example, as described above, some data may be used by more than one group of subseries instructions. Such data may be duplicated in the memory banks dedicated to each of the processor subunits to which the different subseries instructions are assigned. This optimization may further reduce data transfers between processor subunits.
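The duplication decision can be sketched as follows, assuming the compiler already knows which named data items each subunit's subseries reads. The dictionary representation and the `plan_bank_contents` name are hypothetical and used only for illustration.

```python
def plan_bank_contents(uses):
    """Decide what each dedicated bank holds and which items are duplicated.

    uses: {subunit_id: set of data names read by that subunit's subseries}
    Returns (banks, duplicated), where banks maps each subunit's dedicated
    bank to its contents and duplicated is the set of items stored in more
    than one bank.
    """
    banks = {sid: set(names) for sid, names in uses.items()}
    readers = {}
    for sid, names in uses.items():
        for name in names:
            readers.setdefault(name, set()).add(sid)
    duplicated = {name for name, sids in readers.items() if len(sids) > 1}
    return banks, duplicated


uses = {0: {"weights", "x0"}, 1: {"weights", "x1"}}
banks, duplicated = plan_bank_contents(uses)
```

In this sketch every shared item is copied into each bank that needs it, trading memory capacity for fewer transfers between subunits; a real compiler might duplicate only part of the shared data.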

The output of method 800 may be input to a memory chip of the present disclosure for execution. For example, a memory chip may comprise a plurality of processor subunits and a corresponding plurality of memory banks, each processor subunit being connected to at least one memory bank dedicated to that processor subunit, and the processor subunits of the memory chip may be configured to execute the machine code generated by method 800. As described with reference to FIG. 3A, host 350 may input the machine code generated by method 800 to the processor subunits for execution.

Subbanks and subcontrollers

In conventional memory banks, controllers are provided at the bank level. Each bank includes a plurality of mats, which are typically arranged in a rectangular layout but may be arranged in any other geometry. Each mat includes a plurality of memory cells, which are likewise typically arranged in a rectangular layout but may be arranged in any other geometry. Each cell may store a single bit of data (e.g., depending on whether the cell is held at a high voltage or a low voltage).

Examples of this conventional architecture are depicted in FIGS. 9 and 10. As shown in FIG. 9, at the bank level, a plurality of mats (e.g., mats 930-1, 930-2, 940-1, and 940-2) may form bank 900. In a conventional rectangular organization, bank 900 may be controlled across global wordlines (e.g., wordline 950) and global bitlines (e.g., bitline 960). Accordingly, row decoder 910 may select the correct wordline based on an incoming control signal (e.g., a request to read from an address, a request to write to an address, or the like), and global sense amplifier 920 (and/or a global column decoder, not shown in FIG. 9) may select the correct bitline based on the control signal. Amplifier 920 may also amplify any voltage levels from the selected bank during a read operation. Although depicted as using a row decoder for initial selection and performing amplification along columns, a bank may additionally or alternatively use a column decoder for initial selection and perform amplification along rows.

FIG. 10 depicts an example of a mat 1000. For example, mat 1000 may form a portion of a memory bank, such as bank 900 of FIG. 9. As depicted in FIG. 10, a plurality of cells (e.g., cells 1030-1, 1030-2, and 1030-3) may form mat 1000. Each cell may comprise a capacitor, a transistor, or other circuitry that stores at least one bit of data. For example, a cell may comprise a capacitor that is charged to represent a '1' and discharged to represent a '0', or may comprise a flip-flop having a first state representing a '1' and a second state representing a '0'. A conventional mat may comprise, for example, 512 bits by 512 bits. In embodiments where mat 1000 forms a portion of MRAM, ReRAM, or the like, a cell may comprise a transistor, resistor, capacitor, or other mechanism for isolating an ion or a portion of a material that stores at least one bit of data. For example, a cell may comprise electrolyte ions, a portion of chalcogenide glass, or the like, having a first state representing a '1' and a second state representing a '0'.

As further depicted in FIG. 10, in a conventional rectangular organization, mat 1000 may be controlled across local wordlines (e.g., wordline 1040) and local bitlines (e.g., bitline 1050). Accordingly, wordline drivers (e.g., wordline drivers 1020-1, 1020-2, . . . , 1020-x) may control the selected wordline to perform a read, write, or refresh based on a control signal (e.g., a request to read from an address, a request to write to an address, a refresh signal, or the like) from a controller associated with the memory bank of which mat 1000 forms a part. Moreover, local sense amplifiers (e.g., local amplifiers 1010-1, 1010-2, . . . , 1010-x) and/or local column decoders (not shown in FIG. 10) may control the selected bitline to perform a read, write, or refresh. The local sense amplifiers may also amplify the voltage level of the selected cell during a read operation. Although depicted as using wordline drivers for initial selection and performing amplification along columns, a mat may instead use bitline drivers for initial selection and perform amplification along rows.

As described above, a large number of mats are duplicated to form a memory bank. Memory banks may be grouped to form a memory chip. For example, a memory chip may comprise 8 to 32 memory banks. Accordingly, pairing processor subunits with memory banks on a conventional memory chip yields only 8 to 32 processor subunits. Embodiments of the present disclosure may therefore include memory chips with an additional subbank hierarchy. These memory chips may include processor subunits paired with memory subbanks that serve as their dedicated memory. This allows a larger number of sub-processors, which may in turn improve the parallelism and performance of in-memory computing.

In some embodiments of the present disclosure, the global row decoder and global sense amplifier of bank 900 may be replaced with subbank controllers. Accordingly, rather than sending control signals to a global row decoder and a global sense amplifier of the memory bank, the controller of the memory bank may direct each control signal to the appropriate subbank controller. The direction may be controlled dynamically or may be hard-wired (e.g., via one or more logic gates). In some embodiments, fuses may be used to indicate to the controller of each subbank or mat whether to pass or block the control signal to that subbank or mat. In such embodiments, faulty subbanks may thus be deactivated using the fuses.

In one example of such embodiments, a memory chip may include a plurality of memory banks, each memory bank having a bank controller and a plurality of memory subbanks, and each memory subbank having a subbank row decoder and a subbank column decoder that allow reads and writes to locations on the memory subbank. Each subbank may comprise a plurality of memory mats, each memory mat having a plurality of memory cells, and may have internal local row decoders, column decoders, and/or local sense amplifiers. The subbank row decoders and subbank column decoders may process read and write requests from the bank controller or from the subbank processor subunit used for in-memory computations on the subbank memory, as described below. Additionally, each memory subbank may further have a controller configured to determine whether to process read requests and write requests from the bank controller and/or whether to forward them to the next level (e.g., the row and column decoders on a mat) or to block them, for example, to allow an internal processing element or processor subunit to access the memory. In some embodiments, the bank controller may be synchronized to a system clock. However, the subbank controllers may not be synchronized to the system clock.

As described above, the use of subbanks may allow a memory chip to include a larger number of processor subunits than if processor subunits were paired with the memory banks of conventional chips. Accordingly, each subbank may further have a processor subunit that uses the subbank as a dedicated memory. As described above, the processor subunit may comprise a RISC, a CISC, or other general-purpose processing subunit and/or may comprise one or more accelerators. Moreover, the processor subunit may include an address generator, as described above. In any of the embodiments described above, each processor subunit may be configured to access the subbank dedicated to it using the row decoder and column decoder of the subbank, without using the bank controller. The processor subunit associated with the subbank may also handle the memory mats (including the decoder and memory redundancy mechanisms described below) and/or determine whether a read or write request from a higher level (e.g., the bank level or the memory level) is forwarded and handled accordingly.

In some embodiments, the subbank controller may further include a register that stores a state of the subbank. Accordingly, the subbank controller may return an error if it receives a control signal from the memory controller while the register indicates that the subbank is in use. In embodiments where each subbank further includes a processor subunit, the register may indicate an error if the processor subunit in the subbank is accessing the memory in conflict with an external request from the memory controller.
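The state register behavior can be modeled in software as a rough sketch. The three-state encoding, the method names, and the tuple return values below are assumptions made for illustration; the disclosure does not specify how the register is encoded or how an error is reported.

```python
from enum import Enum


class State(Enum):
    IDLE = 0   # subbank is free for external requests
    BUSY = 1   # subbank is in use by its local processor subunit
    ERROR = 2  # an external request collided with local use


class SubbankController:
    """Toy model of a subbank controller with a state register."""

    def __init__(self):
        self.state = State.IDLE  # the state register
        self.cells = {}          # address -> stored value

    def local_begin(self):
        """The local processor subunit starts using the subbank."""
        self.state = State.BUSY

    def local_end(self):
        """The local processor subunit releases the subbank."""
        self.state = State.IDLE

    def external_request(self, op, addr, value=None):
        """Handle a request arriving from the memory controller."""
        if self.state is not State.IDLE:
            self.state = State.ERROR  # record the conflict in the register
            return ("error", None)
        if op == "write":
            self.cells[addr] = value
            return ("ok", None)
        return ("ok", self.cells.get(addr, 0))


ctrl = SubbankController()
first = ctrl.external_request("write", 3, 7)
second = ctrl.external_request("read", 3)
ctrl.local_begin()
third = ctrl.external_request("read", 3)
```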

FIG. 11 depicts another example embodiment of a memory bank using subbank controllers. In the example of FIG. 11, bank 1100 has a row decoder 1110, a column decoder 1120, and a plurality of memory subbanks (e.g., subbanks 1170a, 1170b, and 1170c) with subbank controllers (e.g., controllers 1130a, 1130b, and 1130c). The subbank controllers may include address resolvers (e.g., resolvers 1140a, 1140b, and 1140c) that may determine whether to pass a request to one or more subbanks controlled by the subbank controller.

The subbank controllers may further include one or more logic circuits (e.g., logic 1150a, 1150b, and 1150c). For example, a logic circuit comprising one or more processing elements may allow one or more operations, such as refreshing cells in the subbank or clearing cells in the subbank, to be performed without processing requests from outside bank 1100. Alternatively, the logic circuit may comprise a processor subunit, as described above, such that any subbanks controlled by the subbank controller serve as the corresponding dedicated memory of the processor subunit. In the example of FIG. 11, logic 1150a has subbank 1170a as its corresponding dedicated memory, logic 1150b has subbank 1170b, and logic 1150c has subbank 1170c. In any of the embodiments described above, the logic circuits may have buses to the subbanks, for example, buses 1131a, 1131b, and 1131c. As further depicted in FIG. 11, the subbank controllers may each include a plurality of decoders, such as a subbank row decoder and a subbank column decoder, that allow reads and writes to locations on the memory bank by a processing element or processor subunit, or by a higher-level memory controller issuing commands. For example, subbank controller 1130a includes decoders 1160a, 1160b, and 1160c; subbank controller 1130b includes decoders 1160d, 1160e, and 1160f; and subbank controller 1130c includes decoders 1160g, 1160h, and 1160i. Based on a request from bank row decoder 1110, the subbank controllers may select a wordline using the decoders they include. The described system may allow the processing element or processor subunit of a subbank to access the memory without interrupting other banks or even other subbanks, so that each subbank processor subunit may perform memory computations in parallel with the other subbank processor subunits.

Furthermore, each subbank may comprise a plurality of memory mats, each memory mat having a plurality of memory cells. For example, subbank 1170a includes mats 1190a-1, 1190a-2, . . . , 1190a-x; subbank 1170b includes mats 1190b-1, 1190b-2, . . . , 1190b-x; and subbank 1170c includes mats 1190c-1, 1190c-2, . . . , 1190c-x. As further depicted in FIG. 11, each subbank may include at least one decoder. For example, subbank 1170a includes decoder 1180a, subbank 1170b includes decoder 1180b, and subbank 1170c includes decoder 1180c. Accordingly, bank column decoder 1120 may select a global bitline (e.g., bitline 1121a or 1121b) based on an external request, while the subbank selected by bank row decoder 1110 may use its column decoder to select a local bitline (e.g., bitline 1181a or 1181b) based on a local request from the logic circuit to which the subbank is dedicated. Accordingly, each processor subunit may be configured to access its dedicated subbank using the row decoder and column decoder of the subbank, without using the bank row decoder and the bank column decoder. Thus, each processor subunit may access its corresponding subbank without interrupting other subbanks. Moreover, when a request to the subbank originates outside the processor subunit, the subbank decoders may reflect the accessed data to the bank decoders. Alternatively, in embodiments where each subbank has only one row of memory mats, the local bitlines may be the bitlines of the mat rather than bitlines of the subbank.

A combination of the embodiment depicted in FIG. 11 with embodiments using subbank row decoders and subbank column decoders may also be used. For example, the bank row decoder may be eliminated while the bank column decoder is retained and local bitlines are used.

FIG. 12 depicts an example embodiment of a memory subbank 1200 having a plurality of mats. For example, subbank 1200 may represent a portion of bank 1100 of FIG. 11 or may represent an alternative implementation of a memory bank. In the example of FIG. 12, subbank 1200 includes a plurality of mats (e.g., mats 1240a and 1240b). Moreover, each mat may include a plurality of cells. For example, mat 1240a includes cells 1260a-1, 1260a-2, . . . , 1260a-x, and mat 1240b includes cells 1260b-1, 1260b-2, . . . , 1260b-x.

Each mat may be assigned a range of addresses that will be assigned to the memory cells of the mat. These addresses may be configured at production so that mats may be shuffled around and so that faulty mats may be deactivated and left unused (e.g., using one or more fuses, as described below).

Subbank 1200 receives read and write requests from memory controller 1210. Although not depicted in FIG. 12, requests from memory controller 1210 may be filtered through a controller of subbank 1200 and directed to the appropriate mat of subbank 1200 for address resolution. Alternatively, at least a portion (e.g., the higher bits) of the address of a request from memory controller 1210 may be transmitted to all mats of subbank 1200 (e.g., mats 1240a and 1240b), so that each mat processes the full address and the associated request only if the mat's assigned address range includes the address specified in the command. Similar to the subbank direction described above, the mat determination may be controlled dynamically or may be hard-wired. In some embodiments, fuses may be used to determine the address range for each mat, which also allows faulty mats to be disabled by assigning them an illegal address range. Mats may additionally or alternatively be disabled by other common methods or by connection of fuses.
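The broadcast-and-compare alternative can be sketched in software as follows. The sketch assumes a mat of 512 x 512 cells (2**18 addresses), so that the low 18 bits of an address select a cell and the higher bits form a tag compared against a fuse-programmed value. The `Mat` class, the use of `None` to stand for an illegal (never-matching) range, and the `broadcast` helper are all hypothetical modeling choices.

```python
LOW_BITS = 18  # assumption: 512 x 512 cells per mat -> 2**18 cell addresses


class Mat:
    """Toy mat with a fuse-programmed tag compared against the high address bits."""

    def __init__(self, fused_tag):
        # fused_tag=None models an illegal address range: the mat never matches.
        self.fused_tag = fused_tag
        self.cells = {}

    def handle(self, op, addr, value=None):
        """Process the request only if the high bits match; otherwise ignore it."""
        if self.fused_tag is None or (addr >> LOW_BITS) != self.fused_tag:
            return None
        offset = addr & ((1 << LOW_BITS) - 1)
        if op == "write":
            self.cells[offset] = value
            return "ok"
        return self.cells.get(offset, 0)


def broadcast(mats, op, addr, value=None):
    """Send the request to every mat; at most one mat is expected to respond."""
    responses = [r for m in mats if (r := m.handle(op, addr, value)) is not None]
    return responses[0] if responses else None


# The middle mat is faulty and was disabled; a spare mat took over tag 1.
mats = [Mat(0), Mat(None), Mat(1)]
addr = (1 << LOW_BITS) | 5
write_result = broadcast(mats, "write", addr, 9)
read_result = broadcast(mats, "read", addr)
unmapped = broadcast(mats, "read", 5 << LOW_BITS)
```

Because the tag is the only thing tying a mat to its addresses in this model, reprogramming tags is enough to shuffle mats around or to retire a faulty one.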

In any of the embodiments described above, each mat of the subbank may include a row decoder (e.g., row decoder 1230a or 1230b) for selecting a wordline in the mat. In some embodiments, each mat may further include fuses and comparators (e.g., 1220a and 1220b). As described above, the comparators may allow each mat to determine whether to process an incoming request, and the fuses may allow each mat to be deactivated if faulty. Alternatively, row decoders of the bank and/or subbank may be used in place of a row decoder in each mat.

Furthermore, in any of the embodiments described above, a column decoder included in the appropriate mat (e.g., column decoder 1250a or 1250b) may select a local bitline (e.g., bitline 1251 or 1253). The local bitline may be connected to a global bitline of the memory bank. In embodiments where the subbank has local bitlines of its own, the local bitline of the cell may be further connected to the local bitline of the subbank. Accordingly, data in the selected cell may be read through the column decoder (and/or sense amplifier) of the cell, then through the column decoder (and/or sense amplifier) of the subbank (in embodiments including a subbank column decoder and/or sense amplifier), and then through the column decoder (and/or sense amplifier) of the bank.

The mats of subbank 1200 may be duplicated and arrayed to form a memory bank (or a memory subbank). For example, a memory chip of the present disclosure may comprise a plurality of memory banks, each memory bank having a plurality of memory subbanks, and each memory subbank having a subbank controller for processing reads and writes to locations on the memory subbank. Furthermore, each memory subbank may comprise a plurality of memory mats, each memory mat having a plurality of memory cells, a mat row decoder, and a mat column decoder (e.g., as depicted in FIG. 12). The mat row decoders and mat column decoders may process read and write requests from the subbank controller. For example, the mat decoders may receive all requests and determine (e.g., using a comparator) whether to process a request based on the known address range of each mat, or the mat decoders may receive only requests within the known address range, based on selection of a mat by the subbank (or bank) controller.

Controller data transfer

In addition to sharing data using the processor subunits, the memory chips of the present disclosure may also share data using memory controllers (or subbank controllers or mat controllers). For example, a memory chip of the present disclosure may comprise a plurality of memory banks (e.g., SRAM banks, DRAM banks, or the like), each memory bank having a bank controller, a row decoder, and a column decoder that allow reads and writes to locations on the memory bank, as well as a plurality of buses connecting each controller of the plurality of bank controllers to at least one other controller of the plurality of bank controllers.

In some embodiments, the plurality of buses may be accessed without interrupting data transfers on the main buses of the memory banks connected to one or more processor subunits. Accordingly, a memory bank (or subbank) may transmit data to or receive data from its corresponding processor subunit in the same clock cycle in which it transmits data to or receives data from a different memory bank (or subbank). In embodiments where each controller is connected to a plurality of other controllers, the controller may be configurable to select one of the other controllers for sending or receiving data. In some embodiments, each controller may be connected to at least one neighboring controller (e.g., pairs of spatially adjacent controllers may be connected to one another).

Redundant logic in memory circuits

The present disclosure relates to memory chips with primary logic portions for on-chip data processing. A memory chip may include redundant logic portions that can replace defective primary logic portions and thereby increase the fabrication yield of the chip. To that end, the chip may include on-chip components that allow the logic blocks in the memory chip to be configured based on individual testing of the logic portions. This feature can improve yield because a memory chip with a larger area dedicated to logic portions is more susceptible to fabrication failures. For example, DRAM memory chips with large redundant logic portions may be susceptible to fabrication issues that reduce yield. Implementing redundant logic portions, however, can improve yield and reliability because it lets a producer or user of DRAM memory chips turn entire logic portions on or off while maintaining a high degree of parallelism. Throughout this disclosure, examples of particular memory types (e.g., DRAM) may be given for ease of describing the disclosed embodiments. It is to be understood, however, that the disclosure is not limited to the memory types specifically identified. Rather, memory types such as DRAM, flash memory, SRAM, ReRAM, PRAM, MRAM, ROM, or any other memory may be used with the disclosed embodiments, even where fewer memory types are named in a given part of the disclosure.

FIG. 13 is a block diagram of an exemplary memory chip 1300 consistent with the present disclosure. Memory chip 1300 may be implemented as a DRAM memory chip. Memory chip 1300 may also be implemented as any type of volatile or non-volatile memory, such as flash memory, SRAM, ReRAM, PRAM, and/or MRAM. Memory chip 1300 may include a substrate 1301 on which are disposed an address manager 1302, a memory array 1304 including a plurality of memory banks 1304(a,a) to 1304(z,z), memory logic 1306, business logic 1308, and redundant business logic 1310. Memory logic 1306 and business logic 1308 may constitute primary logic blocks, while redundant business logic 1310 may constitute redundant blocks. In addition, memory chip 1300 may include configuration switches, which may include deactivation switches 1312 and activation switches 1314. Deactivation switches 1312 and activation switches 1314 may also be disposed on substrate 1301. In this application, memory logic 1306, business logic 1308, and redundant business logic 1310 may also be collectively referred to as 'logic blocks.'

Address manager 1302 may include row and column decoders or other types of memory auxiliary circuitry. Alternatively or additionally, address manager 1302 may include a microcontroller or a processing unit.

In some embodiments, as shown in FIG. 13, memory chip 1300 may include a single memory array 1304 that arranges the plurality of memory blocks in a two-dimensional array on substrate 1301. In other embodiments, however, memory chip 1300 may include multiple memory arrays 1304, and each memory array 1304 may arrange memory blocks in a different configuration. For example, the memory blocks (i.e., memory banks) in at least one of the memory arrays may be arranged in a radial distribution to facilitate routing from address manager 1302 or memory logic 1306 to the memory blocks.

Business logic 1308 may be used for in-memory computation of an application that is unrelated to the logic used to manage the memory itself. For example, business logic 1308 may implement AI-related functions such as floating-point, integer, or MAC operations used as activation functions. In addition, business logic 1308 may implement database-related functions such as min, max, sort, and count, among others. Memory logic 1306 may perform tasks related to memory management, including read, write, and refresh operations. Business logic may therefore be added at one or more of the bank level, the mat level, or the level of a group of mats. Business logic 1308 may have one or more address outputs and one or more data inputs/outputs. For example, business logic 1308 may issue addresses to address manager 1302 over row/column lines. In some embodiments, however, the logic blocks may additionally or alternatively be addressed via the data inputs/outputs.

Redundant business logic 1310 may be a replica of business logic 1308. In addition, redundant business logic 1310 may be connected to deactivation switches 1312 and/or activation switches 1314, which may include small fuses/anti-fuses and which are used to disable or enable one of the instances (e.g., an instance that is connected by default) and to enable one of the other logic blocks (e.g., an instance that is disconnected by default). In some embodiments, as described below with reference to FIG. 15, the redundancy may be local to a logic block, such as business logic 1308.

In some embodiments, the logic blocks in memory chip 1300 may be connected to subsets of memory array 1304 with dedicated buses. For example, a set of memory logic 1306, business logic 1308, and redundant business logic 1310 may be connected to the first row of memory blocks in memory array 1304 (i.e., memory blocks 1304(a,a) to 1304(a,z)). A dedicated bus may allow the associated logic blocks to access data in the memory blocks quickly, without having to open communication lines through, for example, address manager 1302.

Each of the plurality of primary logic blocks may be connected to at least one of the plurality of memory banks 1304. Redundant blocks, such as redundant business block 1310, may likewise be connected to at least one of the memory instances 1304(a,a) to 1304(z,z). A redundant block may replicate at least one of the plurality of primary logic blocks, such as memory logic 1306 or business logic 1308. Deactivation switches 1312 may be connected to at least one of the plurality of primary logic blocks, and activation switches 1314 may be connected to at least one of the plurality of redundant blocks.

In these embodiments, upon detection of a fault associated with one of the plurality of primary logic blocks (memory logic 1306 and/or business logic 1308), deactivation switches 1312 may be configured to disable that primary logic block. At the same time, activation switches 1314 may be configured to enable one of the plurality of redundant blocks, such as redundant logic block 1310, that replicates the disabled primary logic block.
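
The switch-over described above can be sketched in software as follows. This is only an illustrative model, not the disclosed circuit: the `LogicBlock` class, the `repair` function, and the block names are hypothetical, and the two assignments merely stand in for the roles of deactivation switch 1312 and activation switch 1314.

```python
from dataclasses import dataclass


@dataclass
class LogicBlock:
    """Hypothetical software stand-in for a logic block and its switch state."""
    name: str
    enabled: bool


def repair(primary: LogicBlock, redundant: LogicBlock, fault_detected: bool) -> LogicBlock:
    """Return the block that should be in service after configuration."""
    if fault_detected:
        primary.enabled = False    # role played by deactivation switch 1312
        redundant.enabled = True   # role played by activation switch 1314
        return redundant
    return primary


primary = LogicBlock("business_logic_1308", enabled=True)
spare = LogicBlock("redundant_business_logic_1310", enabled=False)
active = repair(primary, spare, fault_detected=True)
print(active.name)
```

In this sketch the redundant block is enabled only when a fault is reported, so a fault-free chip keeps using the primary block.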

In addition, activation switches 1314 and deactivation switches 1312, which may collectively be referred to as 'configuration switches,' may include an external input for setting the state of the switch. For example, activation switches 1314 may be configured so that an activation signal on the external input causes a closed-switch condition, while deactivation switches 1312 may be configured so that a deactivation signal on the external input causes an open-switch condition. In some embodiments, all configuration switches in memory chip 1300 are deactivated by default and become activated after a test indicates that the associated logic block is functional and a signal is applied to the external input. Alternatively, in some cases, all configuration switches in memory chip 1300 are activated by default and are deactivated after a test indicates that the associated logic block is not functional and a deactivation signal is applied to the external input.

Regardless of whether a configuration switch is initially activated or deactivated, upon detection of a fault associated with the associated logic block, the configuration switch may disable that logic block. Where the configuration switch is initially activated, its state may be changed to deactivated in order to disable the associated logic block. Where the configuration switch is initially deactivated, its state may be left deactivated in order to disable the associated logic block. For example, an operability test may show that a certain logic block is non-operational or does not operate within certain specifications. In such cases, the logic block may be disabled by not activating its corresponding configuration switch.
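
A minimal sketch of the two default policies, assuming a switch is modeled as a boolean and the external input as an optional signal name. The function `configure` and the signal strings are hypothetical and chosen only for illustration.

```python
from typing import Optional, Tuple


def configure(default_enabled: bool, test_passed: bool) -> Tuple[Optional[str], bool]:
    """Return (signal applied to the external input, final switch state)."""
    if default_enabled and not test_passed:
        # Default-on policy: a deactivation signal is applied only on failure.
        return "deactivate", False
    if not default_enabled and test_passed:
        # Default-off policy: an activation signal is applied only on success.
        return "activate", True
    # Otherwise no signal is needed and the switch keeps its default state.
    return None, default_enabled


for default in (True, False):
    for passed in (True, False):
        signal, state = configure(default, passed)
        print(f"default={default} passed={passed} -> signal={signal} enabled={state}")
```

Under either default, the final switch state equals the test outcome, which matches the statement that a faulty block ends up disabled regardless of the initial state.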

In some embodiments, a configuration switch may be connected to two or more logic blocks and may be configured to select between them. For example, a configuration switch may be connected to both business logic 1308 and redundant logic block 1310. The configuration switch may enable redundant logic block 1310 while disabling business logic 1308.

Alternatively or additionally, at least one of the plurality of primary logic blocks (memory logic 1306 and/or business logic 1308) may be connected to a subset of the plurality of memory banks or memory instances 1304 with a first dedicated connection. At least one of the plurality of redundant blocks (e.g., redundant business block 1310) that replicates the at least one primary logic block may then be connected to the subset of the same plurality of memory banks or instances 1304 with a second dedicated connection.

Memory logic 1306 may also have different functions and capabilities than business logic 1308. For example, memory logic 1306 may be designed to enable read and write operations in memory bank 1304, while business logic 1308 may be designed to perform in-memory computations. Thus, where business logic 1308 includes a first business logic block and a second business logic block (e.g., redundant business logic 1310), it is possible to disconnect defective business logic 1308 and reconnect to redundant business logic 1310 without losing any capability.

In some embodiments, the configuration switches (including deactivation switches 1312 and activation switches 1314) may be implemented with fuses, anti-fuses, programmable devices (including one-time programmable devices), or other forms of non-volatile memory.

FIG. 14 is a block diagram of an exemplary redundant logic block set 1400 consistent with disclosed embodiments. In some embodiments, redundant logic block set 1400 may be disposed on substrate 1301. Redundant logic block set 1400 may include at least one of business logic 1308 and redundant business logic 1310, connected to switches 1312 and 1314, respectively. In addition, business logic 1308 and redundant business logic 1310 may be connected to an address bus 1402 and a data bus 1404.

In some embodiments, as shown in FIG. 14, switches 1312 and 1314 may connect the logic blocks to a clock node. In this way, the configuration switches may connect the logic blocks to, or disconnect them from, the clock signal, effectively activating or deactivating the logic blocks. In other embodiments, however, switches 1312 and 1314 may connect the logic blocks to other nodes for activation or deactivation. For example, the configuration switches may connect the logic blocks to a voltage supply node (e.g., VCC), a ground node (e.g., GND), or a clock signal. The logic blocks may thus be enabled or disabled by the configuration switches because the switches can create an open circuit or cut off the logic block's power supply.
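
As a rough behavioral illustration of activation by clock connection, the toy model below lets a block advance only when its switch connects it to the clock. The `ClockGatedBlock` class and its fields are hypothetical; real clock gating is a circuit-level mechanism that this sketch does not attempt to capture.

```python
class ClockGatedBlock:
    """Toy model: a block does work only on clock edges it actually receives."""

    def __init__(self, clock_connected: bool) -> None:
        self.clock_connected = clock_connected  # state of the configuration switch
        self.cycles_executed = 0

    def tick(self) -> None:
        # A disconnected block never sees the edge, so its state never changes.
        if self.clock_connected:
            self.cycles_executed += 1


primary = ClockGatedBlock(clock_connected=False)  # switch 1312 opened after a fault
spare = ClockGatedBlock(clock_connected=True)     # switch 1314 closed

for _ in range(4):  # four clock edges on the shared clock node
    primary.tick()
    spare.tick()

print(primary.cycles_executed, spare.cycles_executed)
```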

In some embodiments, as shown in FIG. 14, address bus 1402 and data bus 1404 may be on opposite sides of the logic blocks, which are connected in parallel to each of the buses. In this way, logic block set 1400 may facilitate the routing of the different on-chip components.

In some embodiments, each of the plurality of deactivation switches 1312 couples at least one of the plurality of primary logic blocks with a clock node, and each of the plurality of activation switches 1314 couples at least one of the plurality of redundant blocks with the clock node, so that connecting or disconnecting the clock serves as a simple activation/deactivation mechanism.

Redundant business logic 1310 of redundant logic block set 1400 allows a designer to choose, based on area and routing, which blocks are worth duplicating. For example, a chip designer may choose to duplicate larger blocks because larger blocks are more prone to failure. On the other hand, a designer may prefer to duplicate smaller blocks because they can be duplicated easily without a significant loss of space. Moreover, using the configuration of FIG. 14, a designer may readily choose which logic blocks to duplicate according to statistics of errors per area.

FIG. 15 is a block diagram of an exemplary logic block 1500 consistent with disclosed embodiments. The logic block may be business logic 1308 and/or redundant business logic 1310. In other embodiments, however, the exemplary logic block may be memory logic 1306 or another component of memory chip 1300.

Logic block 1500 presents yet another embodiment, in which logic redundancy is used within a small processor pipeline. Logic block 1500 may include a register 1508, a fetch circuit 1504, a decoder 1506, and a write-back circuit 1518. In addition, logic block 1500 may include a computation unit 1510 and a duplicated computation unit 1512. In other embodiments, however, logic block 1500 may include other units that do not form a controller pipeline but instead include sporadic processing elements comprising the required business logic.

Computation unit 1510 and duplicated computation unit 1512 may include digital circuits capable of performing digital calculations. For example, computation unit 1510 and duplicated computation unit 1512 may include an arithmetic logic unit (ALU) that performs arithmetic and bitwise operations on binary numbers. Alternatively, computation unit 1510 and duplicated computation unit 1512 may include a floating-point unit (FPU) that operates on floating-point numbers. In addition, in some embodiments, computation unit 1510 and duplicated computation unit 1512 may implement database-related functions such as min, max, count, and compare, among others.

In some embodiments, as shown in FIG. 15, computation unit 1510 and duplicated computation unit 1512 may be connected to switching circuits 1514 and 1516. When activated, the switching circuits may enable or disable the computation units.

In logic block 1500, duplicated computation unit 1512 may replicate computation unit 1510. Moreover, in some embodiments, register 1508, fetch circuit 1504, decoder 1506, and write-back circuit 1518 (collectively referred to as the 'local logic units') may be smaller than computation unit 1510. Because larger elements are more prone to problems during fabrication, a designer may decide to replicate larger units (e.g., computation unit 1510) rather than smaller units (e.g., the local logic units). Depending on historical yields and error rates, however, a designer may elect to replicate the local logic units (or the entire block) in addition to, or instead of, the large units. For example, computation unit 1510 may be larger, and therefore more error-prone, than register 1508, fetch circuit 1504, decoder 1506, and write-back circuit 1518. A designer may choose to duplicate computation unit 1510, or the entire block, rather than the other elements in logic block 1500.

Logic block 1500 may include a plurality of local configuration switches, each connected to at least one of computation unit 1510 and duplicated computation unit 1512. The local configuration switches may be configured to disable computation unit 1510 and enable duplicated computation unit 1512 when a fault is detected in computation unit 1510.

FIG. 16 is a block diagram of exemplary logic blocks connected by a bus, consistent with disclosed embodiments. In some embodiments, logic blocks 1602 (e.g., memory logic 1306, business logic 1308, or redundant business logic 1310) may be independent of one another, may be connected via a bus, and may be activated externally by addressing them specifically. For example, memory chip 1300 may include many logic blocks, each having its own ID number. In other embodiments, however, logic blocks 1602 may represent larger units that include one or more of memory logic 1306, business logic 1308, and redundant business logic 1310.

In some embodiments, each of logic blocks 1602 may be redundant with the other logic blocks 1602. This complete redundancy, in which any block can operate as a primary or a redundant block, may improve fabrication yield because a designer can disconnect faulty units while maintaining the functionality of the overall chip. For example, because all duplicate blocks may be connected to the same address and data buses, a designer may be able to disable logic areas that have a high probability of failure while maintaining similar computation capabilities. For example, the initial number of logic blocks 1602 may be greater than a target capability. Disabling some logic blocks 1602 then would not affect the target capability.
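
The benefit of providing more blocks than the target capability requires can be illustrated with a simple binomial estimate. This is not part of the disclosure: it assumes each block is functional independently with the same probability, and the function name `chip_yield` and the value 0.9 are hypothetical numbers chosen for illustration.

```python
from math import comb


def chip_yield(n_blocks: int, n_required: int, p_block_good: float) -> float:
    """Probability that at least n_required of n_blocks are functional,
    assuming independent, identically distributed block defects."""
    return sum(
        comb(n_blocks, k) * p_block_good**k * (1 - p_block_good) ** (n_blocks - k)
        for k in range(n_required, n_blocks + 1)
    )


# Hypothetical numbers: two working blocks are required, each good with p = 0.9.
no_spare = chip_yield(n_blocks=2, n_required=2, p_block_good=0.9)
one_spare = chip_yield(n_blocks=3, n_required=2, p_block_good=0.9)
print(f"no spare: {no_spare:.3f}, one spare: {one_spare:.3f}")
```

Under these assumptions a single spare block raises the estimated yield from 0.81 to about 0.97; real defect statistics are typically area-dependent and correlated, so the figures are indicative only.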

The buses connected to the logic blocks may include an address bus 1614, command lines 1616, and data lines 1618. As shown in FIG. 16, each of the logic blocks may be connected independently to each line of the bus. In certain embodiments, however, logic blocks 1602 may be connected in a structured arrangement to facilitate routing. For example, each line of the bus may be connected to a multiplexer that routes the line to different logic blocks 1602.

In some embodiments, to allow external access without knowledge of the chip's internal structure, which may change as units are enabled and disabled, each of the logic blocks may include a fused ID, such as fused ID 1604. Fused ID 1604 may include an array of switches (e.g., fuses) that determine an ID and may be connected to a managing circuit. For example, fused ID 1604 may be connected to address manager 1302. Alternatively, fused ID 1604 may be connected to higher-level memory address units. In these embodiments, fused ID 1604 may be configurable to a specific address. For example, fused ID 1604 may include a programmable, non-volatile device that determines a final ID based on instructions received from a managing circuit.

A distributed processor on a memory chip may be designed with the configuration depicted in FIG. 16. A testing procedure, executed as a built-in self-test (BIST) at chip wakeup or at factory testing, may assign running ID numbers to those of the plurality of primary logic blocks (memory logic 1306 and business logic 1308) that pass a testing protocol. The testing procedure may also assign illegal ID numbers to those of the plurality of primary logic blocks that fail the testing protocol. The testing procedure may further assign running ID numbers to those of the plurality of redundant blocks (redundant logic block 1310) that pass the testing protocol. Because redundant blocks replace failing primary logic blocks, the number of redundant blocks assigned running ID numbers may be equal to or greater than the number of primary logic blocks assigned illegal ID numbers, and those primary logic blocks are thereby disabled. In addition, each of the plurality of primary logic blocks and each of the plurality of redundant blocks may include at least one fused ID 1604. Also, as shown in FIG. 16, the bus connecting logic blocks 1602 may include a command line, a data line, and an address line.

In other embodiments, however, all logic blocks 1602 connected to the bus start out disabled and with no ID number. The blocks are tested one by one; each good logic block is given a running ID, and logic blocks that do not work are left with an illegal ID and disabled. In this manner, redundant logic blocks may improve fabrication yield by replacing blocks found to be defective during the testing process.

Address bus 1614 may couple a managing circuit to each of the plurality of memory banks, each of the plurality of primary logic blocks, and each of the plurality of redundant blocks. With these connections, upon detecting a fault associated with a primary logic block, the managing circuit may assign an invalid address to that primary logic block and a valid address to one of the plurality of redundant blocks.

For example, as shown in FIG. 16A, illegal IDs (e.g., address 0xFFF) may be configured for all logic blocks 1602(a) to 1602(c). After testing, logic blocks 1602(a) and 1602(c) are verified to be functional, while logic block 1602(b) is found to be non-functional. In FIG. 16A, unshaded logic blocks may represent logic blocks that passed the functionality test, while shaded logic blocks may represent logic blocks that failed it. The test procedure then changes the illegal IDs to legal IDs for the functional logic blocks and leaves the illegal IDs in place for the non-functional logic blocks. For example, in FIG. 16A, the addresses of logic blocks 1602(a) and 1602(c) are changed from 0xFFF to 0x001 and 0x002, respectively, whereas the address of logic block 1602(b) remains the illegal address 0xFFF. In some embodiments, an ID is changed by programming the corresponding fused ID 1604.
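
The ID assignment of FIG. 16A can be sketched as follows. The function `assign_ids` and the dictionary representation are hypothetical; only the illegal ID 0xFFF, the running IDs 0x001 and 0x002, and the pass/fail pattern come from the example above.

```python
ILLEGAL_ID = 0xFFF  # illegal address taken from the FIG. 16A example


def assign_ids(test_results: dict) -> dict:
    """Give running IDs to blocks that passed; leave the illegal ID on the rest."""
    ids = {}
    next_id = 0x001
    for block, passed in test_results.items():
        if passed:
            ids[block] = next_id  # e.g., by programming the block's fused ID
            next_id += 1
        else:
            ids[block] = ILLEGAL_ID  # block stays unaddressable, hence disabled
    return ids


ids = assign_ids({"1602(a)": True, "1602(b)": False, "1602(c)": True})
for block, block_id in ids.items():
    print(block, f"0x{block_id:03X}")
```

Because running IDs are handed out consecutively, external logic can address the working blocks as 0x001, 0x002, and so on without knowing which physical block failed.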

Different test results for logic blocks 1602 may lead to different configurations. For example, as shown in FIG. 16B, address manager 1302 may initially assign illegal IDs (i.e., 0xFFF) to all logic blocks 1602. The test results, however, may show that both logic blocks 1602(a) and 1602(b) are functional. In this case, testing logic block 1602(c) may be unnecessary because memory chip 1300 may require only two logic blocks. Therefore, to minimize testing resources, logic blocks may be tested only until the minimum number of functional logic blocks required by the product definition of memory chip 1300 is reached, leaving the other logic blocks untested. In FIG. 16B, unshaded logic blocks represent logic blocks that passed the functionality test, and shaded logic blocks represent untested logic blocks.

In these embodiments, a production tester (external or internal, automatic or manual) or a controller executing a BIST at startup may change the illegal IDs to running IDs for tested logic blocks that are found to be functional, while leaving the illegal IDs of untested logic blocks unchanged. For example, in FIG. 16B, the addresses of logic blocks 1602(a) and 1602(b) are changed from 0xFFF to 0x001 and 0x002, respectively, whereas the address of untested logic block 1602(c) remains the illegal address 0xFFF.
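The ID scheme of FIGS. 16A and 16B can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function `assign_ids`, the `passes_test` callback, and the `required` parameter are hypothetical names, and legal (running) IDs are assumed to be handed out sequentially from 0x001, as in the figures.

```python
ILLEGAL_ID = 0xFFF  # illegal address used in the FIG. 16 example


def assign_ids(blocks, passes_test, required=None):
    """Return {block: id} after testing blocks in order.

    Every block starts with the illegal ID. A block that passes its test
    receives the next legal ID. If `required` is given (hypothetical
    parameter), testing stops once that many functional blocks are found.
    """
    ids = {block: ILLEGAL_ID for block in blocks}
    next_id = 0x001
    for block in blocks:
        if required is not None and next_id > required:
            break  # enough functional blocks; the rest stay untested
        if passes_test(block):
            ids[block] = next_id
            next_id += 1
    return ids


# FIG. 16A: all blocks are tested and 1602(b) fails.
ids_a = assign_ids(["1602a", "1602b", "1602c"], lambda b: b != "1602b")

# FIG. 16B: only two functional blocks are required, so 1602(c) is never tested.
tested = []


def always_pass(block):
    tested.append(block)
    return True


ids_b = assign_ids(["1602a", "1602b", "1602c"], always_pass, required=2)
```

In both cases a block that keeps 0xFFF is unreachable over the address bus, whether it failed its test or was simply never tested.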

FIG. 17 is a block diagram of exemplary units 1702 and 1712 connected in series, consistent with disclosed embodiments. FIG. 17 may represent an entire system or chip. Alternatively, FIG. 17 may represent a block within a chip that contains other functional blocks.

Units 1702 and 1712 may represent complete units that include a plurality of logic blocks, such as memory logic 1306 and/or business logic 1308. In these embodiments, units 1702 and 1712 may also include elements required to perform operations, such as address manager 1302. In other embodiments, however, units 1702 and 1712 may represent logic units such as business logic 1308 or redundant business logic 1310.

FIG. 17 presents embodiments in which units 1702 and 1712 may need to communicate with each other. In such cases, units 1702 and 1712 may be connected in series. However, a non-functional unit may break the continuity between logic blocks. The connection between units may therefore include a bypass option in case a unit must be disabled due to a defect. The bypass option may also be part of the bypassed unit itself.

In FIG. 17, units may be connected in series (e.g., 1702(a)-(c)), and a failing unit (e.g., 1702(b)) may be bypassed when it is defective. The units may also be connected in parallel with switching circuits. For example, in some embodiments, as shown in FIG. 17, units 1702 and 1712 may be connected with switching circuits 1722 and 1728. In the example of FIG. 17, unit 1702(b) is defective; for instance, it fails a circuit functionality test. Unit 1702(b) may therefore be disabled using, for example, activation switch 1314 (not shown in FIG. 17), and/or switching circuit 1722(b) may be activated to bypass unit 1702(b) and maintain connectivity between logic blocks.

Accordingly, when a plurality of primary units are connected in series, each of the plurality of units may be connected in parallel with a parallel switch. Upon detecting a fault associated with one of the plurality of units, the parallel switch connected to that unit may be activated to connect two of the plurality of units to each other.

In other embodiments, as shown in FIG. 17, switching circuits 1728 may include one or more sampling points that introduce a delay of one or more cycles to maintain synchronization between different lines of units. When a unit is disabled and the connection between its adjacent logic blocks bypasses it, synchronization errors with other calculations may arise. For example, if a task requires data from both line A and line B, and each line is carried by an independent series of units, disabling one unit would desynchronize lines A and B and would require additional data management. To prevent desynchronization, sampling circuit 1730 may simulate the delay that would have been caused by the disabled unit 1712(b). In some embodiments, however, the parallel switch may include an anti-fuse instead of sampling circuit 1730.
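The effect of a delay-matched bypass can be illustrated with a small cycle-count model. This is a sketch under stated assumptions only: each active unit and each sampling point is assumed to cost exactly one cycle, and `run_line` and its parameters are hypothetical names that do not appear in the disclosure.

```python
def run_line(stages, disabled, value, delay_matched):
    """Pass `value` through a series line of units.

    `disabled` holds the indices of bypassed units. With `delay_matched`
    set, the bypass holds the value for one cycle (a sampling point), so
    the line keeps its original latency. Returns (result, cycles).
    """
    cycles = 0
    for index, stage in enumerate(stages):
        if index in disabled:
            if delay_matched:
                cycles += 1  # sampling point stands in for the missing unit
            continue  # bypass forwards the value unchanged
        value = stage(value)
        cycles += 1
    return value, cycles


def increment(x):
    return x + 1


line_a = [increment, increment, increment]
line_b = [increment, increment, increment]

_, cycles_a = run_line(line_a, set(), 0, delay_matched=True)
result_plain, cycles_plain = run_line(line_b, {1}, 0, delay_matched=False)
result_matched, cycles_matched = run_line(line_b, {1}, 0, delay_matched=True)
```

With a plain bypass, line B finishes one cycle before line A; with the sampling point in place both lines take the same number of cycles, so a consumer of both lines needs no extra buffering.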

FIG. 18 is a block diagram of exemplary units connected in a two-dimensional array, consistent with disclosed embodiments. FIG. 18 may represent an entire system or chip. Alternatively, FIG. 18 may represent a block within a chip that contains other functional blocks.

Units 1806 may represent autonomous units that include a plurality of logic blocks, such as memory logic 1306 and/or business logic 1308. In other embodiments, however, units 1806 may represent logic units such as business logic 1308. For convenience, the discussion of FIG. 18 may refer to elements described above with reference to FIG. 13 (e.g., memory chip 1300).

As shown in FIG. 18, units may be arranged in a two-dimensional array in which units 1806 (which may include or represent one or more of memory logic 1306, business logic 1308, and redundant business logic 1310) are interconnected via switching boxes 1808 and connection boxes 1810. In addition, to control the configuration of the two-dimensional array, the array may include I/O blocks 1804 on its periphery.

Connection boxes 1810 may be programmable and reconfigurable devices that respond to signals input from I/O blocks 1804. For example, a connection box may include a plurality of input pins from units 1806 and may also be connected to switching boxes 1808. Alternatively, connection boxes 1810 may include a group of switches connecting pins of programmable logic cells with routing tracks, while switching boxes 1808 may include a group of switches connecting different tracks.

In some embodiments, connection boxes 1810 and switching boxes 1808 may be implemented with configuration switches such as switches 1312 and 1314. In these embodiments, connection boxes 1810 and switching boxes 1808 may be configured by a production tester or by a BIST executed at chip startup.

In some embodiments, connection boxes 1810 and switching boxes 1808 may be configured after units 1806 are tested for circuit functionality. In these embodiments, I/O blocks 1804 may be used to send test signals to units 1806. Depending on the test results, I/O blocks 1804 may send programming signals that configure connection boxes 1810 and switching boxes 1808 so as to disable the units 1806 that fail the test protocol and enable the units 1806 that pass it.

In these embodiments, the plurality of primary logic blocks and the plurality of redundant blocks may be disposed on the substrate in a two-dimensional grid. Accordingly, each of the plurality of primary units 1806 and each of the plurality of redundant blocks, such as redundant business logic 1310, may be interconnected with switching boxes 1808, and input blocks may be disposed on the periphery of each line and each column of the two-dimensional grid.

FIG. 19 is a block diagram of exemplary units in a complex connection, consistent with disclosed embodiments. FIG. 19 may represent an entire system. Alternatively, FIG. 19 may represent a block within a chip that contains other functional blocks.

The complex connection of FIG. 19 includes units 1902(a)-(f) and configuration switches 1904(a)-(h). Units 1902 may represent autonomous units that include a plurality of logic blocks, such as memory logic 1306 and/or business logic 1308. In other embodiments, however, units 1902 may represent logic units such as memory logic 1306, business logic 1308, or redundant business logic 1310. Configuration switches 1904 may include one or more of deactivation switches 1312 and activation switches 1314.

As shown in FIG. 19, the complex connection may include units 1902 on two surfaces. For example, the complex connection may include two separate substrates spaced apart along the z-axis. Alternatively or additionally, units 1902 may be disposed on two surfaces of a substrate. For example, to reduce the area of memory chip 1300, substrate 1301 may be arranged in two overlapping surfaces that are connected by configuration switches 1904 arranged in three dimensions. The configuration switches may include deactivation switches 1312 and/or activation switches 1314.

The first surface of the substrate may include 'main' units 1902. These units may be enabled by default. In such embodiments, the second surface may include 'redundant' units 1902. These units may be disabled by default.

In some embodiments, configuration switches 1904 may include anti-fuses. Therefore, after units 1902 are tested, the blocks may be connected in a tile of functional units by switching certain anti-fuses to an 'always-on' state and disabling selected units 1902, even if the blocks are on different surfaces. In the example shown in FIG. 19, one of the 'main' units (unit 1902(e)) is not working. In FIG. 19, non-functional or untested blocks may be shown shaded, while blocks that were tested and found functional may be shown unshaded. Configuration switches 1904 may therefore be set so that one of the logic blocks on the other surface (e.g., 1902(f)) becomes active. In this way, even if one of the main logic blocks is defective, the memory chip still functions because the defective block is replaced with a spare logic unit.

FIG. 19 also shows that one of the units 1902 on the second surface (unit 1902(c)) was neither tested nor enabled because the main logic blocks are functional. For example, in FIG. 19, main units 1902(a) and 1902(d) both passed the functionality test, so unit 1902(c) was not tested or enabled. FIG. 19 thus illustrates the ability to select specifically which logic blocks become active depending on the test results.

In some embodiments, as shown in FIG. 19, not every unit 1902 on the first surface has a corresponding spare or redundant block. In other embodiments, however, every unit may be both primary and redundant, so that all units are redundant with one another and the redundancy is complete. Further, while some implementations may follow the star network topology shown in FIG. 19, other implementations may use parallel connections, serial connections, and/or connect the different elements to configuration switches in parallel or in series.

FIG. 20 is an exemplary flowchart illustrating a redundant block enabling process 2000, consistent with disclosed embodiments. Process 2000 may be implemented for memory chip 1300, and in particular for DRAM memory chips. In some embodiments, process 2000 may include the steps of testing each of a plurality of logic blocks disposed on the substrate of the memory chip for at least one circuit functionality; identifying defective logic blocks among the plurality of primary logic blocks based on the test results; testing at least one redundant or additional logic block disposed on the substrate of the memory chip for at least one circuit functionality; disabling the at least one defective logic block by applying an external signal to a deactivation switch; and enabling the at least one redundant block by applying an external signal to an activation switch. Here, the activation switch may be connected to the at least one redundant block and may be disposed on the substrate of the memory chip. Each step of process 2000 is described further below.

Process 2000 may include testing a plurality of logic blocks, such as business blocks 1308, and a plurality of redundant blocks (e.g., redundant business blocks 1310) (step 2002). The testing may be performed before packaging, for example using probing stations for on-wafer testing. However, process 2000 may also be performed after packaging.

The testing in step 2002 may include applying a finite sequence of test signals to every logic block in memory chip 1300 or to a subset of the logic blocks in memory chip 1300. The test signals may request a computation that is expected to yield a 0 or a 1. In other embodiments, the test signals may request reading a specific address in a memory bank or writing to a specific memory bank.

Testing techniques may be implemented to check the response of the logic blocks under an iterative process in step 2002. For example, a logic block may be tested by sending an instruction to write data to a memory bank and then verifying the integrity of the written data. In other embodiments, the test may be repeated with the algorithm applied to inverted data.
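A write-then-verify test with an inverted second pass can be sketched as below. This is a minimal illustration, not the disclosed test: `write_read_check`, the 0x55 pattern, and the 8-bit word width are assumptions chosen for the example, and the memory bank is modeled with a plain dictionary.

```python
def write_read_check(write, read, addresses, pattern=0x55, width=8):
    """Write a pattern and read it back, then repeat with the inverted pattern.

    `write(address, data)` and `read(address)` are stand-ins for the
    access path through the logic block under test. Returns True when
    every read-back matches what was written.
    """
    mask = (1 << width) - 1
    for data in (pattern & mask, ~pattern & mask):
        for address in addresses:
            write(address, data)
        for address in addresses:
            if read(address) != data:
                return False
    return True


# A healthy bank, modeled as a dictionary.
good = {}
good_ok = write_read_check(good.__setitem__, good.__getitem__, range(4))

# A bank whose bit 0 is stuck at 1 on the read path.
stuck = {}
stuck_ok = write_read_check(
    stuck.__setitem__, lambda address: stuck[address] | 0x01, range(4)
)
```

The stuck bit goes unnoticed with 0x55, whose bit 0 is already 1, and is caught only by the inverted pass, which is one reason to repeat a test with inverted data.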

In other embodiments, the testing in step 2002 may include running a model of the logic blocks to generate a target memory image based on a set of test instructions. The same sequence of instructions may then be executed on the logic blocks of the memory chip, and the results may be recorded. The memory image left by the simulation may then be compared with the image obtained from the test, and any mismatch may be flagged as a failure.
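The comparison of a simulated target memory image with the image produced by the chip can be sketched as follows. Everything here is illustrative: the two-operation instruction format, `model_execute`, and `faulty_chip_execute` (which stands in for real hardware with a broken multiplier) are hypothetical and not taken from the disclosure.

```python
def run_program(instructions, execute):
    """Execute each instruction against an initially empty memory image."""
    memory = {}
    for instruction in instructions:
        execute(memory, instruction)
    return memory


def model_execute(memory, instruction):
    """Reference model: ('add' | 'mul', destination address, a, b)."""
    op, dst, a, b = instruction
    memory[dst] = a + b if op == "add" else a * b


def faulty_chip_execute(memory, instruction):
    """Stand-in for a logic block whose multiplier always returns 0."""
    op, dst, a, b = instruction
    memory[dst] = a + b if op == "add" else 0


def mismatches(expected, actual):
    """Return the sorted addresses at which the two images differ."""
    addresses = sorted(set(expected) | set(actual))
    return [addr for addr in addresses if expected.get(addr) != actual.get(addr)]


program = [("add", 0, 2, 3), ("mul", 1, 4, 5)]
target_image = run_program(program, model_execute)
chip_image = run_program(program, faulty_chip_execute)
failed_addresses = mismatches(target_image, chip_image)
```

A non-empty `failed_addresses` list corresponds to flagging the logic block as failing.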

Alternatively, in step 2002, the testing may include shadow modeling, in which a diagnostic is generated but the results are not necessarily predicted in advance. Instead, a test using shadow modeling may be run in parallel on the memory chip and on a simulation. For example, when a logic block in the memory chip completes an instruction or task, the simulation may be signaled to execute the same instruction. Once the logic block in the memory chip completes the instruction, the architectural states of the two models are compared. If they do not match, a failure is flagged.

In some embodiments, all logic blocks (including, e.g., each of memory logic 1306, business logic 1308, and redundant business logic 1310) may be tested in step 2002. In other embodiments, however, only a subset of the logic blocks may be tested in each of several test rounds. For example, a first round may test only memory logic 1306 and its associated blocks. A second round may test only business logic 1308 and its associated blocks. A third round may test the logic blocks associated with redundant business logic 1310, depending on the results of the first two rounds.

Process 2000 may continue to step 2004. In step 2004, defective logic blocks may be identified, and defective redundant blocks may be identified as well. For example, logic blocks that fail the testing of step 2002 may be identified as defective in step 2004. In other embodiments, however, only certain defective blocks may be identified initially. For example, in some embodiments only the logic blocks associated with business logic 1308 may be identified, and defective redundant logic blocks may be identified only if they are needed to replace a defective logic block. In addition, identifying defective logic blocks may include writing the identification information of the identified defective logic blocks to a memory bank or a nonvolatile memory.

In step 2006, the defective logic blocks may be disabled. For example, using a configuration circuit, the defective logic blocks may be disabled by disconnecting them from clock, ground, and/or power nodes. Alternatively, the defective logic blocks may be disabled by configuring the connection boxes in an arrangement that avoids them. In still other embodiments, the defective logic blocks may be disabled by receiving an illegal address from address manager 1302.

In step 2008, redundant blocks that duplicate the defective logic blocks may be identified. To preserve the capabilities of the memory chip even when some logic blocks are defective, step 2008 may identify redundant blocks that are available and can duplicate the defective logic blocks. For example, if a logic block that performs vector multiplication is determined to be defective, then in step 2008 address manager 1302 or an on-chip controller may identify an available redundant logic block that also performs vector multiplication.

In step 2010, the redundant blocks identified in step 2008 may be enabled. In contrast to the disabling operation of step 2006, in step 2010 the identified redundant blocks may be enabled by connecting them to clock, ground, and/or power nodes. Alternatively, the identified redundant blocks may be enabled by configuring the connection boxes in an arrangement that connects them. In still other embodiments, the identified redundant blocks may be enabled by receiving a running address at test procedure execution time.
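Steps 2002 through 2010 can be summarized in a short sketch. It is only an illustration under stated assumptions: the `Block` record, its `works` flag (standing in for the real test outcome), and `activate_redundant_blocks` are hypothetical, and enabling or disabling is modeled as a boolean rather than as switching clock, ground, or power nodes.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    function: str            # e.g. "vector_multiply"
    redundant: bool = False
    works: bool = True       # stand-in for the result of the circuit test
    enabled: bool = False


def activate_redundant_blocks(blocks):
    """Disable defective primary blocks and enable matching spares."""
    for block in blocks:
        block.enabled = not block.redundant  # primaries on, spares off by default
    # Steps 2002-2004: test the primary blocks and collect the defective ones.
    defective = [b for b in blocks if not b.redundant and not b.works]
    for bad in defective:
        bad.enabled = False  # step 2006: disable the defective block
        # Step 2008: find an unused, working spare with the same function.
        spare = next(
            (r for r in blocks
             if r.redundant and r.works and not r.enabled
             and r.function == bad.function),
            None,
        )
        if spare is None:
            raise RuntimeError(f"no redundant block available for {bad.name}")
        spare.enabled = True  # step 2010: enable the spare
    return [b.name for b in blocks if b.enabled]


blocks = [
    Block("1308a", "vector_multiply"),
    Block("1308b", "vector_multiply", works=False),
    Block("1310a", "vector_multiply", redundant=True),
]
enabled = activate_redundant_blocks(blocks)
```

Matching on `function` reflects the requirement that the spare performs the same operation as the block it replaces; the sketch raises an error when no such spare exists.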

FIG. 21 is an exemplary flowchart illustrating an address assignment process 2100, consistent with disclosed embodiments. Process 2100 may be implemented for memory chip 1300, and in particular for DRAM memory chips. As described with reference to FIG. 16, in some embodiments the logic blocks of memory chip 1300 may be connected to a data bus and may have an address ID. Process 2100 provides an address assignment method that disables defective logic blocks and enables logic blocks that pass the test. The steps of process 2100 are described as being performed by a production tester or by a BIST executed at chip startup, but other components of memory chip 1300 and/or external devices may also perform one or more steps of process 2100.

In step 2102, the tester may disable all logic blocks and redundant blocks by assigning an illegal ID to each logic block at the chip level.

In step 2104, the tester may execute a test protocol for a logic block. For example, the tester may run the test methods described for step 2002 on one or more of the logic blocks of memory chip 1300.

In step 2106, depending on the results of the test in step 2104, the tester may determine whether the logic block is defective. If the logic block is not defective (step 2106: no), the address manager may assign a running ID to the tested logic block in step 2108. If the logic block is defective (step 2106: yes), address manager 1302 may leave the illegal ID on the defective logic block in step 2110.

In step 2112, address manager 1302 may select a redundant logic block that duplicates the defective logic block. In some embodiments, the redundant logic block that duplicates the defective logic block may have the same components and connections as the defective logic block. In other embodiments, however, the redundant logic block may have different components and/or connections while being capable of performing an equivalent operation. For example, if the defective logic block is designed to perform vector multiplication, the selected redundant logic block would also be capable of performing vector multiplication, even if it does not share the architecture of the defective block.

In step 2114, address manager 1302 may test the redundant block. For example, the tester may apply the testing techniques of step 2104 to the identified redundant block.

In step 2116, based on the results of the test in step 2114, the tester may determine whether the redundant block is defective. If the redundant block is not defective (step 2116: no), the tester may assign a running ID to the identified redundant block in step 2118. In some embodiments, process 2100 may return to step 2104 after step 2118, forming an iterative loop that tests all logic blocks in the memory chip.

If the tester determines that the redundant block is defective (step 2116: yes), then in step 2120 the tester may determine whether additional redundant blocks are available. For example, the tester may query a memory bank for information about available redundant logic blocks. If a redundant logic block is available (step 2120: yes), the tester may return to step 2112 and identify a new redundant logic block that duplicates the defective logic block. If no redundant logic block is available (step 2120: no), the tester may generate an error signal in step 2122. The error signal may include information about the defective logic block and the defective redundant blocks.
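The flow of steps 2102 through 2122 can be sketched as a single loop. This is a simplified illustration rather than the disclosed implementation: `assign_addresses` and `is_defective` are hypothetical names, running IDs are assumed to be issued sequentially from 0x001, any spare is assumed to be able to duplicate any primary block, and the error signal is modeled as a list entry.

```python
ILLEGAL_ID = 0xFFF


def assign_addresses(primaries, spares, is_defective):
    """Return ({block: id}, errors) following the flow of process 2100."""
    ids = {block: ILLEGAL_ID for block in primaries + spares}  # step 2102
    free_spares = list(spares)
    next_id = 0x001
    errors = []
    for block in primaries:                       # step 2104: test the block
        if not is_defective(block):               # step 2106: no
            ids[block] = next_id                  # step 2108: running ID
            next_id += 1
            continue
        # Step 2110: the defective block keeps its illegal ID.
        failed_spares = []
        while free_spares:                        # step 2120: spare available?
            spare = free_spares.pop(0)            # step 2112: select a spare
            if not is_defective(spare):           # steps 2114 and 2116
                ids[spare] = next_id              # step 2118: running ID
                next_id += 1
                break
            failed_spares.append(spare)
        else:
            errors.append((block, failed_spares))  # step 2122: error signal
    return ids, errors


primaries = ["p0", "p1", "p2"]
spares = ["s0", "s1"]

ids_ok, errors_ok = assign_addresses(primaries, spares, {"p1", "s0"}.__contains__)
ids_bad, errors_bad = assign_addresses(
    primaries, spares, {"p1", "p2", "s0"}.__contains__
)
```

In the first run, spare `s1` takes over for `p1` after `s0` fails its own test; in the second run no spare is left for `p2`, so an error is recorded for it.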

Coupled Memory Banks

Embodiments disclosed herein also include distributed high-performance processors. The processor may include a memory controller that interfaces memory banks and processing units. The processor may be configured so that data is delivered quickly to the processing units for calculations. For example, if a processing unit requires two data instances to perform a task, the memory controller may be configured so that communication lines independently provide access to information from the two data instances. The disclosed memory architecture seeks to minimize the hardware requirements associated with complex cache memories and complex register file schemes. Processor chips normally include cache hierarchies that allow cores to work directly with registers. Cache operations, however, require significant die area and consume additional power. The disclosed memory architecture avoids the use of a cache hierarchy by adding logic components to the memory.

The disclosed architecture also enables strategic (or even optimized) placement of data in the memory banks. Even if a memory bank has only a single port and high latency, the disclosed memory architecture may enable high performance and avoid memory bottlenecks by strategically placing data in different blocks of the memory banks. With the goal of providing a continuous stream of data to the processing units, a compilation optimization step may determine how data should be stored in the memory banks for specific or generic tasks. Then, the memory controller, which interfaces the processing units and the memory banks, may be configured to grant access to specific processing units when they require data to perform operations.

The configuration of the memory chip may be performed by a processing unit (e.g., a configuration manager) or by an external interface. The configuration may also be written by a compiler or other software tool. In addition, the configuration of the memory controller may be based on the ports available in the memory banks and the organization of data in the memory banks. Accordingly, the disclosed architecture may provide the processing units with a constant flow of data or simultaneous information from different memory banks. In this way, computation tasks within the memory may be processed quickly by avoiding latency bottlenecks or cache memory requirements.

Moreover, data stored in the memory chip may be arranged based on the compilation optimization step. The compilation may allow building processing routines in which the processor efficiently assigns tasks to processing units without memory latency delays. The compilation may be performed by a compiler and transmitted to a host connected to an external interface on the substrate. Normally, high latency and/or a small number of ports for certain access patterns would result in data bottlenecks for the processing units that require the data. The disclosed compilation, however, may place data in the memory banks so that processing units can receive data continuously even with disadvantageous memory types.

Furthermore, in some embodiments, a configuration manager may signal the required processing units based on the computations that a task requires. Different processing units or logic blocks on the chip may have specialized hardware or architectures for different tasks. Therefore, depending on the task to be performed, a processing unit, or a group of processing units, may be selected to perform the task. The memory controller on the substrate may route data, or grant access, according to the selection of processor subunits in order to improve data transfer rates. For example, based on the compilation optimization and the memory architecture, processing units may be granted access to memory banks when they are required to perform a task.

Moreover, the chip architecture may include on-chip components that facilitate the transfer of data by reducing the time required to access data in the memory banks. Therefore, the present disclosure describes chip architectures, along with a compilation optimization step, for a high-performance processor capable of performing specific or generic tasks using simple memory instances. The memory instances may have a small number of ports and/or high latency, such as those used in DRAM devices or other memory-oriented technologies, but the disclosed architecture may overcome these shortcomings by enabling a continuous (or almost continuous) flow of data from the memory banks to the processing units.

In this application, simultaneous communication may refer to communication within a clock cycle. Alternatively, simultaneous communication may refer to sending information within a predetermined amount of time. For example, simultaneous communication may refer to communication within a few nanoseconds.

FIG. 22 is a block diagram of an exemplary processing device, consistent with disclosed embodiments. Part A) of FIG. 22 shows a first embodiment of a processing device 2200 in which a memory controller 2210 connects a first memory block 2202 and a second memory block 2204 using multiplexers. Memory controller 2210 may also connect at least a configuration manager 2212, a logic block 2214, and multiple accelerators 2216(a)-(n). Part B) of FIG. 22 shows a second embodiment of processing device 2200 in which memory controller 2210 connects memory blocks 2202 and 2204 using a bus that connects memory controller 2210 with at least configuration manager 2212, logic block 2214, and multiple accelerators 2216(a)-(n). In addition, a host 2230 may be located externally and connected to processing device 2200 through, for example, an external interface.

Memory blocks 2202 and 2204 may include a DRAM mat or group of mats, DRAM banks, MRAM, PRAM, ReRAM, or SRAM units, flash mats, or other memory technologies. Memory blocks 2202 and 2204 may alternatively include non-volatile memories, a flash memory device, a resistive random-access memory (ReRAM) device, or a magnetoresistive random-access memory (MRAM) device.

Memory blocks 2202 and 2204 may additionally include a plurality of memory cells arranged in rows and columns between a plurality of word lines (not shown) and a plurality of bit lines (not shown). The gates of each row of memory cells may be connected to a respective one of the plurality of word lines. Each column of memory cells may be connected to a respective one of the plurality of bit lines.

In other embodiments, a memory area (including memory blocks 2202 and 2204) may be built from a single memory instance. In this application, the term "memory instance" may be used interchangeably with the term "memory block." The memory instances (or blocks) may have poor characteristics. For example, the memories may have only one port and high random-access latency. Alternatively or additionally, the memories may be inaccessible during column and line changes and may face data access problems related to, for example, capacitance charging and/or circuitry setup. Nonetheless, the architecture presented in FIG. 22 facilitates parallel processing in the memory device by allowing dedicated connections between memory instances and processing units and by arranging the data in a way that takes the characteristics of the blocks into account.

In some device architectures, memory instances may include several ports, which facilitates parallel operations. Even in such embodiments, however, the chip may achieve improved performance when data is compiled and organized based on the chip architecture. For example, a compiler may improve the efficiency of access to the memory area by providing instructions and organizing data placement so that the data can be readily accessed even with single-port memories.

Furthermore, memory blocks 2202 and 2204 may be of multiple memory types within a single chip. For example, memory blocks 2202 and 2204 may be eFlash and eDRAM. Also, the memory blocks may include DRAM with instances of ROM.

Memory controller 2210 may include a logic circuit that handles memory accesses and returns the results to the rest of the modules. For example, memory controller 2210 may include an address manager and selection devices, such as multiplexers, to route data between the memory blocks and the processing units or to grant access to the memory blocks. Alternatively, memory controller 2210 may include a double data rate (DDR) memory controller used to drive DDR SDRAM, in which data is transferred on both the rising and falling edges of the system's memory clock.

In addition, memory controller 2210 may constitute a dual-channel memory controller. Incorporating dual-channel memory may facilitate control of parallel access lines by memory controller 2210. The parallel access lines may be configured to have identical lengths to facilitate synchronization of data when multiple lines are used together. Alternatively or additionally, the parallel access lines may allow access to multiple memory ports of the memory banks.

In some embodiments, processing device 2200 may include one or more multiplexers that may be connected to processing units. The processing units may include configuration manager 2212, logic block 2214, and accelerators 2216, which may be connected directly to the multiplexers. Also, memory controller 2210 may include at least one data input from a plurality of memory banks or blocks 2202 and at least one data output connected to each of the plurality of processing units. With this arrangement, memory controller 2210 may simultaneously receive data from memory banks or blocks 2202 and 2204 via two data inputs, and simultaneously transmit the received data to at least one selected processing unit via two data outputs. In some embodiments, however, the at least one data input and the at least one data output may be implemented in a single port that allows only read or write operations. In such embodiments, the single port may be implemented as a data bus that includes data, address, and command lines.

Memory controller 2210 may be connected to each of the plurality of memory blocks 2202 and 2204, and may also be connected to the processing units via, for example, a selection switch. Also, the processing units on the substrate, including configuration manager 2212, logic block 2214, and accelerators 2216, may be independently connected to memory controller 2210. In some embodiments, configuration manager 2212 may receive an indication of a task to be performed and, in response, configure memory controller 2210, accelerators 2216, and/or logic block 2214 according to a configuration stored in memory or supplied externally. Alternatively, memory controller 2210 may be configured by an external interface. The task may require at least one computation that may be used to select at least one processing unit from the plurality of processing units. Alternatively or additionally, the selection may be based at least in part on the capability of the selected processing unit to perform the at least one computation. In response, memory controller 2210 may grant access to the memory banks, or route data between the at least one selected processing unit and at least two memory banks, using dedicated buses and/or a pipelined memory access.
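As a minimal sketch of capability-based selection, the following assumes that each processing unit advertises a set of operation names and that a task lists the operations it requires. The class, the function, and the operation names are hypothetical and only illustrate the selection criterion described above.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class ProcessingUnit:
    """Hypothetical description of one processing unit on the substrate."""
    name: str
    operations: FrozenSet[str]


def select_units(units: Iterable[ProcessingUnit],
                 required_ops: Iterable[str]) -> List[ProcessingUnit]:
    """Return the units able to perform every operation the task requires."""
    required = frozenset(required_ops)
    return [u for u in units if required <= u.operations]


units = [
    ProcessingUnit("accelerator_a", frozenset({"vector_mac"})),
    ProcessingUnit("accelerator_b", frozenset({"dma"})),
    ProcessingUnit("logic_block", frozenset({"vector_mac", "string_compare"})),
]
selected = select_units(units, ["vector_mac"])
```

In this sketch the memory controller would then grant memory-bank access only to the units in `selected`.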

In some embodiments, first memory block 2202 of the at least two memory blocks may be arranged on a first side of the plurality of processing units, and second memory block 2204 of the at least two memory blocks may be arranged on a second side of the plurality of processing units, opposite the first side. Further, a processing unit selected to perform the task, for instance accelerator 2216(n), may be configured to access second memory bank 2204 during a clock cycle in which a communication line is opened to the first memory bank or first memory block 2202. Alternatively, the selected processing unit may be configured to transfer data to second memory block 2204 during a clock cycle in which a communication line is opened to first memory block 2202.

In some embodiments, memory controller 2210 may be implemented as an independent element, as shown in FIG. 22. In other embodiments, however, memory controller 2210 may be embedded in the memory area or may be disposed alongside accelerators 2216(a)-(n).

The processing area of processing device 2200 may include configuration manager 2212, logic block 2214, and accelerators 2216(a)-(n). Accelerators 2216 may include multiple processing circuits with predefined functions and may be defined by a specific application. For example, an accelerator may be a vector MAC unit or a direct memory access (DMA) unit that handles memory movement between modules. Accelerators 2216 may also calculate their own addresses and request data from, or write data to, memory controller 2210. For example, configuration manager 2212 may signal at least one of accelerators 2216 that it may access the memory bank. Accelerators 2216 may then configure memory controller 2210 to route data or grant access to themselves. Furthermore, accelerators 2216 may include at least one arithmetic logic unit, at least one vector-handling logic unit, at least one string-compare logic unit, at least one register, and at least one direct memory access (DMA).

Configuration manager 2212 may include digital processing circuitry that configures accelerators 2216 and directs the execution of tasks. For example, configuration manager 2212 may be connected to memory controller 2210 and to each of the plurality of accelerators 2216. Configuration manager 2212 may have its own dedicated memory to hold the configurations of accelerators 2216. Configuration manager 2212 may use the memory banks to fetch commands and configurations via memory controller 2210. Alternatively, configuration manager 2212 may be programmed through an external interface. In some embodiments, configuration manager 2212 may be implemented with an on-chip reduced instruction set computer (RISC) or an on-chip complex CPU with its own cache hierarchy. In some embodiments, configuration manager 2212 may be omitted, and the accelerators may be configured through an external interface.

Processing device 2200 may also include an external interface (not shown). The external interface allows access to the memory from an upper level, such as a memory bank controller that receives commands from external host 2230 or an on-chip main processor, or directly from external host 2230 or the on-chip main processor. The external interface may allow programming of configuration manager 2212 and accelerators 2216 by writing configurations or code to the memory via memory controller 2210, to be used later by configuration manager 2212 or by units 2214 and 2216 themselves. The external interface may, however, also program the processing units directly without routing through memory controller 2210. If configuration manager 2212 is a microcontroller, configuration manager 2212 may allow code to be loaded from a main memory into the controller's local memory via the external interface. Memory controller 2210 may be configured to interrupt the task in response to receiving a request from the external interface.

The external interface may include multiple connectors associated with logic circuits that provide a glueless interface to a variety of elements on the processing device. The external interface may include data I/O inputs for data reads and outputs for data writes, external address outputs, an external CE0 chip select pin, active-low chip selects, byte enable pins, a pin for wait states on the memory cycle, a write enable pin, an output enable pin, and a read-write enable pin. Therefore, the external interface has the inputs and outputs required to control processes and obtain information from the processing device. For example, the external device may conform to the JEDEC DDR standard. Alternatively or additionally, the external device may conform to other standards, such as SPI/OSPI or UART.

In some embodiments, the external interface may be disposed on the chip substrate and may be connected to external host 2230. The external host may gain access to memory blocks 2202 and 2204, memory controller 2210, and the processing units through the external interface. Additionally or alternatively, external host 2230 may read from and write to the memory, or may signal configuration manager 2212, through read and write commands, to perform operations such as starting and/or stopping a process. In addition, external host 2230 may configure accelerators 2216 directly. In some embodiments, external host 2230 may perform read/write operations directly on memory blocks 2202 and 2204.

In some embodiments, configuration manager 2212 and accelerators 2216 may be configured to connect the device area with the memory area using direct buses, depending on the target task. For example, a subset of accelerators 2216 may connect with memory instance 2204 when that subset has the capability to perform the computations required to execute the task. Such a separation may ensure that the dedicated accelerators obtain the bandwidth (BW) they need to memory blocks 2202 and 2204. Moreover, because connecting the memory instances to memory controller 2210 allows quick access to data in different memories even when row latency is high, this configuration with dedicated buses may allow a large memory to be split into smaller instances or blocks. To parallelize the connections, memory controller 2210 may be connected to each of the memory instances with data, address, and/or control buses.

Including memory controller 2210 as described above may eliminate the need for a cache hierarchy or complex register file in the processing device. Although a cache hierarchy could be added to give the system more capability, the architecture of processing device 2200 allows a designer to add enough memory blocks or instances based on the processing operations and to manage the instances accordingly without a cache hierarchy. For example, the architecture of processing device 2200 may remove the need for a cache hierarchy by enabling a pipelined memory access. In a pipelined memory access, the processing units may receive a sustained flow of data in every cycle, because certain data lines may be opened (or activated) while other data lines receive or transmit data. The sustained flow of data over independent communication lines may improve execution speed and minimize latency due to line changes.
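The benefit of overlapping line opening with data transfer can be illustrated with a simple cycle-count model. The model is an assumption made for illustration only: opening a line costs a fixed `open_latency` cycles, an open line delivers one word per cycle, and consecutive lines alternate between two banks so that the next line can be opened while the current one streams.

```python
def serial_cycles(num_lines: int, words_per_line: int, open_latency: int) -> int:
    """Single bank: every line must be opened before its words can stream."""
    return num_lines * (open_latency + words_per_line)


def pipelined_cycles(num_lines: int, words_per_line: int, open_latency: int) -> int:
    """Two alternating banks: the next line opens while the current one streams."""
    if num_lines == 0:
        return 0
    stream_start = open_latency  # the first line cannot be hidden
    for _ in range(num_lines - 1):
        stream_end = stream_start + words_per_line
        next_line_ready = stream_start + open_latency
        # The next line streams once the current one is done and it is open.
        stream_start = max(stream_end, next_line_ready)
    return stream_start + words_per_line


print(serial_cycles(4, 8, 5), pipelined_cycles(4, 8, 5))
```

Under these assumptions, when the open latency is shorter than the streaming time of a line, only the first opening is visible to the processing unit and the data flow is otherwise continuous.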

Furthermore, the architecture disclosed in FIG. 22 enables a pipelined memory access in which it may be possible to organize data in a small number of memory blocks and save the power lost to line switching. For example, in some embodiments, a compiler may communicate to host 2230 the organization of data in the memory banks, or a method for organizing that data, to facilitate access to the data during a given task. Configuration manager 2212 may then define which memory banks, and in some cases which ports of the memory banks, may be accessed by the accelerators. This synchronization between the location of data in the memory banks and the method of accessing the data improves computing tasks by feeding data to the accelerators with minimum latency. For example, in embodiments in which configuration manager 2212 includes a RISC/CPU, the method may be implemented in offline software, and configuration manager 2212 may then be programmed to execute the method. The method may be written in any language executable by RISC/CPU computers and may run on any platform. The inputs of the method may include the configuration of the memories behind the memory controller and the data itself, along with the memory access patterns. In addition, the method may be implemented in a language or machine language specific to the embodiment, and may also be simply a series of configuration values in binary or text form.

As discussed above, in some embodiments, a compiler may provide instructions to host 2230 for organizing data in memory blocks 2202 and 2204 in preparation for a pipelined memory access. A pipelined memory access may generally include the steps of: receiving a plurality of addresses of a plurality of memory banks or memory blocks 2202 and 2204; accessing the plurality of memory banks according to the received addresses using independent data lines; supplying data from a first address in a first memory bank of the plurality of memory banks, through a first communication line, to at least one of the plurality of processing units, and opening a second communication line to a second address in a second memory bank 2204 of the plurality of memory banks; and supplying data from the second address, through the second communication line, to the at least one of the plurality of processing units, and opening a third communication line to a third address in the first memory bank on the first communication line within a second clock cycle. In some embodiments, the pipelined memory access may be executed with two memory blocks connected to a single port. In such embodiments, memory controller 2210 may hide the two memory blocks behind a single port while still transmitting data to the processing units with the pipelined memory access approach.
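The alternating "supply from one bank, open the line to the next" pattern of a pipelined memory access can be sketched as a schedule generator. The sketch is hypothetical: accesses are modeled as `(bank, address)` tuples, one step corresponds to one clock cycle, and consecutive accesses are required to target different banks so that supplying and opening can overlap.

```python
from typing import Dict, List, Optional, Tuple

Access = Tuple[int, int]  # (bank index, address within the bank)


def pipelined_schedule(accesses: List[Access]) -> List[Dict[str, Optional[Access]]]:
    """List, per step, which access supplies data and which line is opened."""
    if not accesses:
        return []
    for (bank_a, _), (bank_b, _) in zip(accesses, accesses[1:]):
        if bank_a == bank_b:
            raise ValueError("consecutive accesses must target different banks")
    # Step 0 only opens the first line; nothing can be supplied yet.
    steps: List[Dict[str, Optional[Access]]] = [
        {"supply": None, "open": accesses[0]}
    ]
    for i, current in enumerate(accesses):
        upcoming = accesses[i + 1] if i + 1 < len(accesses) else None
        # Supply from the line opened in the previous step while opening the next.
        steps.append({"supply": current, "open": upcoming})
    return steps


schedule = pipelined_schedule([(0, 0x10), (1, 0x20), (0, 0x11)])
```

Here the second step supplies data from bank 0 while the line to bank 1 is opened, and the third step supplies from bank 1 while a new line to bank 0 is opened, matching the order of steps described above.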

In some embodiments, the compiler may run on host 2230 before a task is executed. In such embodiments, the compiler may be able to determine the configuration of the data flow based on the architecture of the memory device, since that configuration would already be known to the compiler.

In other embodiments, if the configuration of memory blocks 2204 and 2202 is unknown at offline time, the pipelined method may run on host 2230, which may arrange the data in the memory blocks before starting calculations. For example, host 2230 may write data directly to memory blocks 2204 and 2202. In such embodiments, processing units, such as configuration manager 2212, and memory controller 2210 may have no information about the required hardware until runtime. It may then be necessary to delay the selection of an accelerator 2216 until the task starts running. In these situations, the processing units or memory controller 2210 may randomly select an accelerator 2216 and create a test data access pattern, which may be modified as the task is executed.

If the task is known in advance, however, the compiler may organize data and instructions in the memory banks so that host 2230 can have a processing unit, such as configuration manager 2212, set signal connections that minimize access latency. For example, in some cases an accelerator 2216 may require n words at the same time, while each memory instance supports retrieving only m words at a time, where m and n are integers and m < n. The compiler may therefore place the required data across different memory instances or blocks to enable the access. In addition, to avoid line-miss latency, the host may split data across different lines of different memory instances when processing device 2200 includes multiple memory instances. Splitting the data makes it possible to access the next line of data in the next instance while still using the data from the current instance.
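The placement rule in the preceding paragraph can be illustrated with a short sketch. The Python below is not the disclosed compiler: the names `min_instances` and `stripe_words` and the round-robin policy are assumptions introduced only for illustration. It computes how many memory instances are needed to deliver n words when each instance supplies m words per access, then spreads the words so that no instance holds more than m of them.

```python
import math


def min_instances(n_words: int, m_per_fetch: int) -> int:
    """Smallest number of instances that can deliver n words in one access."""
    if n_words <= 0 or m_per_fetch <= 0:
        raise ValueError("word counts must be positive")
    return math.ceil(n_words / m_per_fetch)


def stripe_words(words, m_per_fetch, num_instances):
    """Round-robin the words over the instances (hypothetical policy).

    Raises ValueError when the device has too few instances to serve
    all words in a single access.
    """
    words = list(words)
    if min_instances(len(words), m_per_fetch) > num_instances:
        raise ValueError("not enough memory instances for a single access")
    placement = [[] for _ in range(num_instances)]
    for index, word in enumerate(words):
        placement[index % num_instances].append(word)
    return placement


if __name__ == "__main__":
    # n = 8 words, m = 2 words per instance per access, 4 instances available
    print(stripe_words(range(8), 2, 4))
```

With at least ceil(n / m) instances, round-robin placement leaves at most m words in any one instance, so all n words can be fetched together.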

For example, accelerator 2216(a) may be configured to multiply two vectors. Each vector may be stored in a separate memory block, such as memory blocks 2202 and 2204, and each vector may include multiple words. Completing a task that requires multiplication by accelerator 2216(a) may therefore require accessing both memory blocks and retrieving multiple words. In some embodiments, however, a memory block allows access to only one word per clock cycle; for example, the memory block may have a single port. In such cases, to speed up data transfer during the operation, the compiler may place the words that make up the vectors in different memory blocks so that the words can be read in parallel and/or simultaneously. In these situations, the compiler may store the words in memory blocks that have a dedicated line. For example, if each vector includes two words and the memory controller has direct access to four memory blocks, the compiler may place the data in the four memory blocks so that each memory block transmits one word, speeding up data delivery. Moreover, in embodiments in which memory controller 2210 has more than a single connection to each memory block, the compiler may instruct configuration manager 2212 (or another processing unit) to access a specific port. In this way, processing device 2200 may perform pipelined memory access, continuously providing data to the processing units by loading words on some lines while transmitting data on other lines.

FIG. 23 is a block diagram of an exemplary processing device 2300 consistent with the disclosed embodiments. The block diagram shows a simplified processing device 2300 with a single accelerator in the form of a MAC unit 2302, a configuration manager 2304 (equivalent or similar to configuration manager 2212), a memory controller 2306 (equivalent or similar to memory controller 2210), and a plurality of memory blocks 2308(a)-(d).

In some embodiments, MAC unit 2302 may be a specific accelerator for processing a specific task. For example, processing device 2300 may be assigned a 2D-convolution task. Configuration manager 2304 may then signal an accelerator that has the appropriate hardware to perform the calculations associated with the task. For instance, MAC unit 2302 may have four internal incrementing counters (logical adders and registers that manage the four loops required by a convolution calculation) and a multiply-accumulate unit. Configuration manager 2304 may signal MAC unit 2302 to process the incoming data and may send it an indication to execute the task. MAC unit 2302 may then iterate over the calculated addresses, multiply the numbers, and accumulate them in an internal register.
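As a software analogy for the four counters and the accumulation register described above, the following Python sketch runs a 2D sliding-window multiply-accumulate with four nested loops. It is only an illustration under stated assumptions: the function name `mac_conv2d` is hypothetical, the kernel is applied without flipping (as is common in machine-learning usage), and only fully overlapping window positions are computed.

```python
def mac_conv2d(image, kernel):
    """Four-loop multiply-accumulate over a 2D input (no padding, stride 1)."""
    image_h, image_w = len(image), len(image[0])
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h, out_w = image_h - kernel_h + 1, image_w - kernel_w + 1
    output = [[0] * out_w for _ in range(out_h)]
    for out_y in range(out_h):              # counter 1: output row
        for out_x in range(out_w):          # counter 2: output column
            acc = 0                         # models the internal register
            for k_y in range(kernel_h):     # counter 3: kernel row
                for k_x in range(kernel_w): # counter 4: kernel column
                    acc += image[out_y + k_y][out_x + k_x] * kernel[k_y][k_x]
            output[out_y][out_x] = acc
    return output


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    kernel = [[1, 0], [0, 1]]
    print(mac_conv2d(image, kernel))
```

Each inner iteration consumes one image word and one kernel word, which is why delivering two words per clock from separate memory blocks keeps such a unit busy.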

In some embodiments, configuration manager 2304 may configure the accelerator while memory controller 2306 grants access to block 2308 and MAC unit 2302 using dedicated buses. In other embodiments, however, memory controller 2306 may configure the accelerator directly, based on instructions received from configuration manager 2304 or from an external interface. Alternatively or additionally, configuration manager 2304 may preload several configurations and allow the accelerator to run iteratively on different addresses with different sizes. In such embodiments, configuration manager 2304 may include a cache memory that stores a command before it is transmitted to at least one of the plurality of processing units, such as accelerators 2216. In other embodiments, configuration manager 2304 may not include a cache.

In some embodiments, configuration manager 2304 or memory controller 2306 may receive an address that needs to be accessed for a task. Configuration manager 2304 or memory controller 2306 may check a register to determine whether the address is already in a line loaded in one of memory blocks 2308. If so, memory controller 2306 may read the word from memory block 2308 and pass it to MAC unit 2302. If the address is not in a loaded line, configuration manager 2304 may request memory controller 2306 to load the line and signal MAC unit 2302 to wait until the address is retrieved.
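The register check described above amounts to a small lookup. The sketch below is a simplified model, not the disclosed hardware: the class name `OpenLineRegisters`, the `WORDS_PER_LINE` value, and the "hit"/"wait" return strings are all assumptions made for illustration.

```python
WORDS_PER_LINE = 8  # assumed line width; the description gives no figure


class OpenLineRegisters:
    """Tracks which line is currently loaded in each memory block."""

    def __init__(self):
        self.open_line = {}  # block id -> index of the loaded line

    def lookup(self, block, address):
        """Return "hit" if the address lies in the loaded line, else load it.

        On a miss the register is updated and "wait" is returned, modelling
        the signal that tells the MAC unit to wait for the line.
        """
        line = address // WORDS_PER_LINE
        if self.open_line.get(block) == line:
            return "hit"
        self.open_line[block] = line
        return "wait"


if __name__ == "__main__":
    registers = OpenLineRegisters()
    for address in (3, 5, 9):
        print(address, registers.lookup("2308(a)", address))
```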

In some embodiments, as shown in FIG. 23, memory controller 2306 may include two inputs from two separate addresses. However, if two or more addresses must be accessed simultaneously and those addresses reside in a single memory block (for example, only in memory block 2308(a)), memory controller 2306 or configuration manager 2304 may raise an exception. Alternatively, configuration manager 2304 may return an invalid data signal when the two addresses can be accessed only through a single line. In other embodiments, the unit may delay process execution until all the required data can be retrieved, which may degrade overall performance. The compiler, however, may be able to find a configuration and a data placement that prevent such delays.

In some embodiments, the compiler may generate a configuration or instruction set for processing device 2300 that configures configuration manager 2304, memory controller 2306, and accelerator 2302 to handle situations in which multiple addresses must be accessed from a single memory block that has only one port. For example, the compiler may rearrange data in memory blocks 2308 so that the processing units can access multiple lines in memory blocks 2308.

In addition, memory controller 2306 may operate on more than one input at the same time. For example, memory controller 2306 may allow access to one of memory blocks 2308 through one port and supply its data while receiving, on another input, a request for a different memory block. As a result, an accelerator 2216 tasked with the exemplary 2D convolution may receive data from dedicated lines of communication with the relevant memory blocks.

Additionally or alternatively, memory controller 2306 or a logic block may hold a refresh counter for every memory block and handle the refresh of all lines. With such a counter, memory controller 2306 can insert refresh cycles between dead access times from the devices.
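One way to picture a per-block refresh counter is the sketch below. It is an assumption-laden illustration rather than the disclosed logic: the class name `RefreshCounter`, the one-line-per-idle-cycle policy, and the cycle-level interface are all hypothetical.

```python
class RefreshCounter:
    """Per-block counter that refreshes one line on each idle cycle."""

    def __init__(self, num_lines):
        if num_lines <= 0:
            raise ValueError("a block needs at least one line")
        self.num_lines = num_lines
        self.next_line = 0

    def on_cycle(self, block_busy):
        """Return the line refreshed this cycle, or None if the block is busy."""
        if block_busy:
            return None
        line = self.next_line
        self.next_line = (self.next_line + 1) % self.num_lines
        return line


if __name__ == "__main__":
    counter = RefreshCounter(num_lines=3)
    busy_pattern = [True, False, False, True, False, False]
    print([counter.on_cycle(busy) for busy in busy_pattern])
```

The counter advances only when the block is not being accessed, so refreshes never steal a cycle from a pending read in this simplified model. A real controller would also have to force a refresh when a retention deadline approaches, which the sketch omits.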

Furthermore, memory controller 2306 may be configurable to perform pipelined memory access, receiving addresses and opening lines in memory blocks before supplying the data. Pipelined memory access can provide data to the processing units without interrupted or delayed clock cycles. For example, while memory controller 2306 or one of the logic blocks accesses data on the right line in FIG. 23, data may be transmitted on the left line. This method is described in detail with reference to FIG. 26.

Depending on the required data, processing device 2300 may use multiplexers and/or other switching devices to choose which device performs a given task. For example, configuration manager 2304 may configure the multiplexers so that at least two data lines reach MAC unit 2302. A task that requires data from multiple addresses, such as 2D convolution, can then be performed faster, because the vectors or words to be multiplied during the convolution can reach the processing unit simultaneously, in a single clock. This data transfer method may allow processing units such as accelerators 2216 to output results quickly.

In some embodiments, configuration manager 2304 may be configurable to execute processes according to task priority. For example, configuration manager 2304 may be configured to let a running process finish without interruption. In that case, configuration manager 2304 may provide an instruction or task configuration to accelerators 2216, let them run uninterrupted, and switch the multiplexers only when the task is finished. In other embodiments, however, configuration manager 2304 may interrupt a task and reconfigure data routing when it receives a priority task, such as a request from an external interface. With enough memory blocks 2308, though, memory controller 2306 may be configured to route data, or grant access, to processing units over dedicated lines that do not have to change until the task is completed. Moreover, in some embodiments all devices may be connected by buses to the inputs of configuration manager 2304, and the devices may manage access between themselves and the buses (for example, using the same logic as a multiplexer). Memory controller 2306 may therefore be connected directly to a number of memory instances or memory blocks.

Alternatively, memory controller 2306 may be connected directly to memory sub-instances. In some embodiments, each memory instance or block may be built from sub-instances (for example, a DRAM may be built from mats with independent data lines arranged in multiple sub-blocks). Instances may include DRAM mats, DRAMs, banks, flash mats, SRAM mats, or any other type of memory. Memory controller 2306 may then include dedicated lines for addressing sub-instances directly, minimizing latency during pipelined memory access.

In some embodiments, memory controller 2306 may also hold the logic needed for a specific memory instance (such as row/column decoders and refresh logic), and memory blocks 2308 may handle their own logic. Memory blocks 2308 may therefore receive an address and generate commands for returning or writing data.

FIG. 24 shows exemplary memory configuration diagrams consistent with the disclosed embodiments. In some embodiments, a compiler that generates code or a configuration for processing device 2200 may perform a method for configuring loading from memory blocks 2202 and 2204 by pre-arranging the data in each block. For example, the compiler may pre-arrange data so that each word required for a task is correlated with a line of a memory instance or memory block. For tasks that require more memory blocks than are available in processing device 2200, however, the compiler may implement methods of fitting data into more than one memory location of each memory block. The compiler may also store data in sequence and evaluate the latency of each memory block to avoid line-miss latency. In some embodiments, the host may be part of a processing unit, such as configuration manager 2212; in other embodiments, the compiler host may be connected to processing device 2200 through an external interface. In such embodiments, the host may run compiling functions such as those described for the compiler.

In some embodiments, configuration manager 2212 may be a CPU or a microcontroller (uC). In such embodiments, configuration manager 2212 may have to access memory to fetch the commands or instructions placed there. A specific compiler may generate the code and place it in memory so that consecutive instructions are stored in the same memory line and across a number of memory banks, allowing pipelined memory access for the fetched instructions as well. In these embodiments, configuration manager 2212 and memory controller 2210 may be able to avoid row latency in linear execution by enabling pipelined memory access.

The previous case, linear execution of a program, described a method in which the compiler recognizes and places the instructions to allow pipelined memory execution. Other software structures may be more complex, however, and may require the compiler to recognize them and act accordingly. For example, if a task requires loops and branches, the compiler may place all the loop code inside a single line so that the loop can iterate over that line without line-opening latency. Memory controller 2210 may then not need to change lines during execution.

In some embodiments, configuration manager 2212 may include an internal cache or a small memory. The internal cache may store commands executed by configuration manager 2212 in order to handle branches and loops. For example, the commands in the internal cache may include instructions that configure accelerators for accessing memory blocks.

FIG. 25 is an exemplary flowchart illustrating a memory configuration process 2500 consistent with the disclosed embodiments. For convenience, the description of memory configuration process 2500 may refer to the elements shown in FIG. 22 and described above. In some embodiments, process 2500 may be executed by a compiler that provides instructions to a host connected through an external interface. In other embodiments, process 2500 may be executed by a component of processing device 2200, such as configuration manager 2212.

In general, process 2500 may include determining the number of words required simultaneously to perform a task; determining the number of words that can be accessed simultaneously from each of a plurality of memory banks; and, when the number of words required simultaneously is greater than the number of words that can be accessed simultaneously, dividing the words required simultaneously among multiple memory banks. Dividing the words required simultaneously may include arranging the words cyclically and assigning one word per memory bank in sequence.

More specifically, process 2500 may begin with step 2502, in which a compiler may receive a task specification. The specification may include the required computations and/or a priority level.

In step 2504, the compiler may identify an accelerator, or a group of accelerators, that can perform the task. Alternatively, the compiler may generate instructions so that a processing unit, such as configuration manager 2212, can identify an accelerator to perform the task. For example, using the required computations, the configuration manager may identify, from the group of accelerators 2216, those that can process the task.

In step 2506, the compiler may determine the number of words that need to be accessed simultaneously to execute the task. For example, multiplying two vectors requires access to at least two vectors, so the compiler may determine that vector words must be accessed simultaneously to perform the operation.

In step 2508, the compiler may determine the number of cycles needed to execute the task. For example, if the task requires a convolution operation over four by-products, the compiler may determine that at least four cycles will be needed to perform the task.

In step 2510, the compiler may place the words that need to be accessed simultaneously in different memory banks. Memory controller 2210 can then be configured to open lines to the different memory instances and access the required memory blocks within a clock cycle, without needing any cached data.

In step 2512, the compiler may place words that are accessed sequentially in the same memory bank. For example, if four cycles of operations are required, the compiler may generate instructions to write the words needed in sequential cycles into a single memory block, avoiding line changes between different memory blocks during execution.

In step 2514, the compiler may generate instructions for programming processing units such as configuration manager 2212. The instructions may specify the conditions for operating a switching device (for example, a multiplexer) or for configuring a data bus. With these instructions, configuration manager 2212 may configure memory controller 2210 to route data from memory blocks to processing units, or to grant access to memory blocks, using dedicated lines of communication according to the task.
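The placement decisions of steps 2506 through 2512 can be summarized in a brief sketch. The Python below is illustrative only: the function name `place_words` and the representation of a task as a per-cycle list of word groups are assumptions, not part of the disclosure. Words needed in the same cycle go to different banks, and the word in a given position of each successive cycle goes to the same bank.

```python
def place_words(schedule, num_banks):
    """Map a per-cycle schedule of words onto memory banks.

    schedule[c] lists the words needed simultaneously in cycle c.
    Word i of every cycle is stored in bank i, so simultaneous words sit
    in different banks and sequential words share a bank.
    """
    width = max((len(group) for group in schedule), default=0)
    if width > num_banks:
        raise ValueError("more simultaneous words than memory banks")
    banks = [[] for _ in range(num_banks)]
    for group in schedule:
        for lane, word in enumerate(group):
            banks[lane].append(word)
    return banks


if __name__ == "__main__":
    # Two vectors of four words each, multiplied one word pair per cycle.
    schedule = [["a0", "b0"], ["a1", "b1"], ["a2", "b2"], ["a3", "b3"]]
    print(place_words(schedule, num_banks=4))
```

In this example the two operands of each cycle can be read in parallel from banks 0 and 1, while consecutive cycles stay within one bank and avoid line changes between blocks.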

FIG. 26 is an exemplary flowchart illustrating a memory read process 2600 consistent with the disclosed embodiments. For convenience, the description of memory read process 2600 may refer to the elements shown in FIG. 22 and described above. In some embodiments, as described below, process 2600 may be implemented by memory controller 2210. In other embodiments, however, process 2600 may be implemented by other components of processing device 2200, such as configuration manager 2212.

In step 2602, memory controller 2210, configuration manager 2212, or another processing unit may receive a request to route data to, or grant access to, a memory bank. The request may specify an address and a memory block.

In some embodiments, the request may be received over a data bus that specifies a read command on line 2218 and an address on line 2220. In other embodiments, the request may be received through a demultiplexer connected to memory controller 2210.

In step 2604, configuration manager 2212, a host, or another processing unit may query an internal register. The internal register may hold information about the lines opened to memory banks, the opened addresses, the opened memory blocks, and/or upcoming tasks. Based on the information in the internal register, it may be determined whether a line is open to the memory bank and/or whether the memory block received the request in step 2602. Alternatively or additionally, memory controller 2210 may query the internal register directly.

If the internal register indicates that the memory bank is not loaded in an opened line (that is, "no" in step 2606), process 2600 may continue to step 2616, in which a line may be loaded to the memory bank associated with the received address. In addition, memory controller 2210 or a processing unit, such as configuration manager 2212, may in step 2616 signal a delay to the element requesting information from the memory address. For example, if accelerator 2216 requests memory information located in a memory block that is already in use, memory controller 2210 may send a delay signal to the accelerator in step 2618. In step 2620, configuration manager 2212 or memory controller 2210 may update the internal register to indicate that a line has been opened to a new memory bank or a new memory block.

If the internal register indicates that the memory bank is loaded in an opened line (that is, "yes" in step 2606), process 2600 may continue to step 2608. In step 2608, it may be determined whether the line loaded in the memory bank is being used for a different address. If it is (that is, "yes" in step 2608), this may indicate that two instances reside in a single block, which therefore cannot be accessed simultaneously. An error or exception signal may then be sent, in step 2616, to the element requesting information from the memory address. If the line is not being used for a different address (that is, "no" in step 2608), the line may be opened for the address, data may be retrieved from the target memory bank, and the process may continue to step 2614 to transmit the data to the element requesting information from the memory address.
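The two decisions of process 2600 (step 2606 and step 2608) can be traced with a small model. This Python sketch is a simplification under stated assumptions: the function name `handle_read`, the dictionaries standing in for the internal register, the `WORDS_PER_LINE` value, and the three result strings are all hypothetical, and a bank counts as loaded only when its open line contains the requested address.

```python
WORDS_PER_LINE = 8  # assumed line width; not given in the description


def handle_read(open_lines, busy, bank, address):
    """Return "delay", "exception", or "data" for one read request.

    open_lines: bank -> index of the line currently open (register model)
    busy:       bank -> address the open line is currently serving
    """
    line = address // WORDS_PER_LINE
    if open_lines.get(bank) != line:
        # Step 2606 "no": load the line, update the register, signal a delay.
        open_lines[bank] = line
        busy.pop(bank, None)
        return "delay"
    serving = busy.get(bank)
    if serving is not None and serving != address:
        # Step 2608 "yes": the line is already serving another address.
        return "exception"
    # Step 2608 "no": fetch the word and send it to the requester.
    busy[bank] = address
    return "data"


if __name__ == "__main__":
    open_lines, busy = {}, {}
    for request in (3, 3, 5):
        print(request, handle_read(open_lines, busy, "bank0", request))
```

The third request in the example collides with the address already being served in the same single-port block, which is the case the compiler's data placement is meant to avoid.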

With process 2600, processing device 2200 can establish direct connections between processing units and the memory blocks or memory instances that contain the information required to perform a task. This organization of data makes it possible to read information from vectors organized in different memory instances, and also to retrieve information from different memory blocks simultaneously when a device requests several such addresses.

FIG. 27 is an exemplary flowchart illustrating an execution process 2700 consistent with the disclosed embodiments. For convenience, the description of execution process 2700 may refer to the elements shown in FIG. 22 and described above.

In step 2702, a compiler or a local unit, such as configuration manager 2212, may receive an indication of a task that needs to be performed. The task may include a single operation (for example, multiplication) or a more complex operation (for example, convolution between matrices). The task may also indicate the required computation.

In step 2704, the compiler or configuration manager 2212 may determine the number of words required simultaneously to perform the task. For example, the configuration manager or compiler may determine that two words are required simultaneously to perform a multiplication between vectors. As another example, for a 2D-convolution task, configuration manager 2212 may determine that n times m words are required for convolution between matrices, where n and m are the matrix dimensions. In step 2704, configuration manager 2212 may also determine the number of cycles needed to perform the task.

In step 2706, depending on the determinations in step 2704, the compiler may write the words that need to be accessed simultaneously to a plurality of memory banks disposed on the substrate. For example, when the number of words that can be accessed simultaneously from a single memory bank of the plurality of memory banks is less than the number of words required simultaneously, the compiler may distribute the data across multiple memory banks so that the different required words can be accessed within a single clock. In addition, when configuration manager 2212 or the compiler determines that the task requires multiple cycles, the compiler may write the words needed in sequential cycles to a single memory bank of the plurality of memory banks, to prevent line changes between memory banks.
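The placement rule of step 2706 can be illustrated with a minimal Python sketch. It is not the compiler of the disclosure: the function name `place_words`, the `(bank, line)` tuple, and the assumption that the input list is ordered in groups of words needed in the same clock are all hypothetical choices made for illustration.

```python
def place_words(words, num_banks, simultaneous):
    """Assign each word a hypothetical (bank, line) slot.

    `words` is assumed to be ordered so that each consecutive group of
    `simultaneous` words is needed in the same clock. Words of one group
    go to different banks; the word at a given position in successive
    groups stays in the same bank, on consecutive lines.
    """
    if simultaneous > num_banks:
        raise ValueError("more simultaneous words than available banks")
    placement = {}
    for i, word in enumerate(words):
        group, offset = divmod(i, simultaneous)
        placement[word] = (offset, group)  # bank index, line index
    return placement


# Two operands per clock (e.g., a vector multiplication), two cycles.
layout = place_words(["a0", "b0", "a1", "b1"], num_banks=4, simultaneous=2)
```

In this sketch, `a0` and `b0` land in different banks so both can be read in one clock, while `a0` and `a1` share a bank so that successive cycles do not require moving to another bank.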

In step 2708, memory controller 2210 may be configured to read at least one first word from, or grant access to at least one first word in, a first memory bank of the plurality of memory banks or blocks using a first memory line.

In step 2710, a processing unit, such as one of accelerators 2216, may process the task using the at least one first word.

In step 2712, memory controller 2210 may be configured to open a second memory line in a second memory bank. For example, based on the task and using a pipelined memory access approach, memory controller 2210 may be configured to open a second memory line in the second memory block, to which information required for the task was written in step 2706. In some embodiments, the second memory line may be opened when the task of step 2710 is about to be completed. For example, if a task requires 100 clocks, the second memory line may be opened at the 90th clock.
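The timing of this pipelined access can be sketched as follows. This is only an illustration: the function `pipeline_schedule`, the event labels, and the `open_latency` parameter (the number of clocks assumed necessary to open a line) are hypothetical and do not come from the disclosure.

```python
def pipeline_schedule(banks, task_clocks, open_latency):
    """Return (clock, event, bank) tuples for a pipelined access pattern.

    While a word from one bank is being processed, the line in the next
    bank is opened `open_latency` clocks before the current task ends,
    so that the next word is ready when processing finishes.
    """
    events = []
    start = 0
    for i, bank in enumerate(banks):
        events.append((start, "process", bank))
        if i + 1 < len(banks):
            open_at = start + max(0, task_clocks - open_latency)
            events.append((open_at, "open", banks[i + 1]))
        start += task_clocks
    return events


# A 100-clock task with an assumed 10-clock line-open latency.
schedule = pipeline_schedule(["bank1", "bank2"], task_clocks=100, open_latency=10)
```

With these assumed values, the line in `bank2` is opened at clock 90, matching the 100-clock example above.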

In some embodiments, steps 2708 through 2712 may be executed within one line access cycle.

In step 2714, memory controller 2210 may be configured to grant access to data of at least one second word in the second memory bank using the second memory line opened in step 2712.

In step 2716, a processing unit, such as one of accelerators 2216, may process the task using the at least one second word.

In step 2718, memory controller 2210 may be configured to open a second memory line in the first memory bank. For example, based on the task and using a pipelined memory access approach, memory controller 2210 may be configured to open a second memory line to the first memory block. In some embodiments, the second memory line to the first block may be opened when the task of step 2716 is about to be completed.

In some embodiments, steps 2714 through 2718 may be executed within one line access cycle.

In step 2720, the memory controller may read at least one third word from, or grant access to at least one third word in, the first memory bank of the plurality of memory banks, using the second memory line of the first bank or a first line of a third bank, and may continue in this manner through other memory banks.

Partial Refresh

Some memory chips, such as dynamic random access memory (DRAM) chips, use refreshes to keep stored data (e.g., data held as capacitance) from being lost due to voltage decay in the chip's capacitors or other electrical components. For example, in DRAM, each cell has to be refreshed from time to time (depending on the particular process or design) to restore the charge in its capacitor so that data is not lost or corrupted. As the memory capacity of a DRAM chip increases, the amount of time required to refresh the memory becomes significant. While a particular line of memory is being refreshed, the bank containing that line cannot be accessed, which can reduce performance. The power associated with the refresh process may also be significant. Prior efforts have attempted to reduce the rate at which refreshes are performed in order to limit these adverse effects, but most of these efforts have focused only on the physical layers of the DRAM.

A refresh operation is similar to reading a row of memory and writing it back. Using this principle and focusing on the access pattern to the memory, embodiments of the present disclosure include software and hardware techniques, as well as modifications to the memory chips, that use less power for refreshing and reduce the amount of time during which memory is being refreshed. For example, as an overview, some embodiments may use hardware and/or software to track line access timing and skip recently accessed rows within a refresh cycle (e.g., based on a timing threshold). In another example, some embodiments may rely on software executed by the memory chip's refresh controller to assign reads and writes such that access to the memory is non-random. Accordingly, the software may control the refresh more precisely to avoid wasted refresh cycles and/or lines. These techniques may be used alone or combined with a compiler that encodes commands for the refresh controller along with machine code for a processor, such that access to the memory is again non-random. Using any combination of the techniques and configurations described in detail below, the disclosed embodiments may reduce memory refresh power requirements and/or increase system performance by reducing the amount of time during which a memory unit is being refreshed.

FIG. 28 depicts an exemplary memory chip 2800 with a refresh controller 2803, consistent with the present disclosure. For example, memory chip 2800 may include a plurality of memory banks (e.g., memory bank 2801a and the like) on a substrate. In the example of FIG. 28, the substrate includes four memory banks, each with four lines. A line may refer to a wordline within one or more memory banks of memory chip 2800 or to any other collection of memory cells within memory chip 2800, such as a portion of, or an entire row along, a memory bank or a group of memory banks.

In other embodiments, the substrate may include any number of memory banks, and each memory bank may include any number of lines. Some memory banks may include the same number of lines (as in FIG. 28), while other memory banks may include different numbers of lines. As further depicted in FIG. 28, memory chip 2800 may include a controller 2805 to receive input to memory chip 2800 and transmit output from memory chip 2800 (e.g., as described above in "Division of Code").

In some embodiments, the plurality of memory banks may comprise dynamic random access memory (DRAM). However, the plurality of memory banks may comprise any volatile memory that stores data requiring periodic refreshes.

As will be discussed in more detail below, the disclosed embodiments may use counters or resistor-capacitor circuits to time refresh cycles. For example, a counter or timer may be used to count time since the last full refresh cycle, and then, when the counter reaches its target value, another counter may be used to iterate over all rows. Embodiments of the present disclosure may additionally track accesses to segments of memory chip 2800 and reduce the refresh power required. For example, although not depicted in FIG. 28, memory chip 2800 may further include a data storage configured to store access information indicative of access operations for one or more segments of the plurality of memory banks. The one or more segments may comprise any portion of lines, columns, or any other groupings of memory cells within memory chip 2800. In one particular example, the one or more segments may include at least one row of memory structures within the plurality of memory banks. Refresh controller 2803 may be configured to perform a refresh operation on the one or more segments based, at least in part, on the stored access information.

For example, the data storage may comprise one or more registers, static random access memory (SRAM) cells, or the like associated with segments of memory chip 2800 (e.g., lines, columns, or any other groupings of memory cells within memory chip 2800). Further, the data storage may be configured to store bits indicative of whether the associated segment was accessed in one or more previous cycles. A "bit" may comprise any data structure storing at least one bit, such as a register, an SRAM cell, a nonvolatile memory, or the like. A bit may be set by setting a corresponding switch (or a switching element, such as a transistor) of the data structure to ON (which may be equivalent to "1" or "true"). Additionally or alternatively, a bit may be set by modifying any other property within the data structure (e.g., charging a floating gate of a flash memory, modifying the state of one or more flip-flops in an SRAM, or the like) so as to write a "1" (or any other value indicating that the bit is set) to the data structure. If, as part of the memory controller's refresh operation, a bit is determined to be set, refresh controller 2803 may skip the refresh cycle for the associated segment and clear the register associated with that segment.
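The bit-based skipping described above can be sketched in Python. This is a behavioral illustration only, not the circuit of the disclosure: the class name `AccessBitRefresher` and its methods are hypothetical, and one bit per row is assumed.

```python
class AccessBitRefresher:
    """Behavioral model with one access bit per row (an assumed granularity)."""

    def __init__(self, num_rows):
        self.accessed = [False] * num_rows

    def access(self, row):
        # A read or write restores the row's charge, so mark it as accessed.
        self.accessed[row] = True

    def refresh_cycle(self):
        """Return the rows refreshed in this cycle.

        Rows whose bit is set are skipped and their bit is cleared,
        so they are refreshed in the next cycle unless accessed again.
        """
        refreshed = []
        for row, bit in enumerate(self.accessed):
            if bit:
                self.accessed[row] = False
            else:
                refreshed.append(row)
        return refreshed


controller = AccessBitRefresher(num_rows=4)
controller.access(1)
first = controller.refresh_cycle()   # row 1 is skipped
second = controller.refresh_cycle()  # row 1 is refreshed again
```

Clearing the bit when a row is skipped is what keeps a row from being skipped indefinitely after a single access.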

In another example, the data storage may comprise one or more nonvolatile memories (e.g., a flash memory or the like) associated with segments of memory chip 2800 (e.g., lines, columns, or any other groupings of memory cells within memory chip 2800). The nonvolatile memory may be configured to store bits indicative of whether the associated segment was accessed in one or more previous cycles.

Some embodiments may additionally or alternatively add, on each row or group of rows (or other segment of memory chip 2800), a timestamp register that holds the last tick within the current refresh cycle at which the line was accessed. This means that, with each row access, the refresh controller may update the row's timestamp register. Thus, the next time a refresh occurs (e.g., at the end of a refresh cycle), the refresh controller may compare the stored timestamp, and if the associated segment was previously accessed within a certain period of time (e.g., within a certain threshold applied to the stored timestamp), the refresh controller may skip to the next segment. This keeps the system from expending refresh power on segments that were recently accessed. Moreover, the refresh controller may continue to track accesses to ensure that each segment is accessed or refreshed in the next cycle.

Accordingly, in yet another example, the data storage may comprise one or more registers or nonvolatile memories associated with segments of memory chip 2800 (e.g., lines, columns, or any other groupings of memory cells within memory chip 2800). Rather than using bits to indicate whether the associated segment has been accessed, the registers or nonvolatile memories may be configured to store timestamps or other information indicative of the most recent access of the associated segments. In such an example, refresh controller 2803 may determine whether to refresh or access an associated segment based on whether the amount of time between the timestamp stored in the associated register or memory and the current time (e.g., from a timer, as explained with reference to FIGS. 29A and 29B below) exceeds a predetermined threshold (e.g., 8 ms, 16 ms, 32 ms, 64 ms, or the like).

Accordingly, the predetermined threshold may comprise an amount of time for a refresh cycle that ensures the associated segment is refreshed (if not accessed) at least once within each refresh cycle. Alternatively, the predetermined threshold may comprise an amount of time shorter than that required for a refresh cycle (e.g., to ensure that any required refresh or access signals reach the associated segment before the refresh cycle is complete). For example, the predetermined threshold may be 7 ms for a memory chip with an 8 ms refresh period, such that if a segment has not been accessed within 7 ms, the refresh controller sends a refresh or access signal that reaches the segment before the end of the 8 ms refresh period. In some embodiments, the predetermined threshold may depend on the size of the associated segment. For example, the predetermined threshold may be smaller for smaller segments of memory chip 2800.
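The threshold comparison can be illustrated with a short Python sketch using the 7 ms and 8 ms figures of the example above. The function name `rows_to_refresh`, the millisecond units, and the choice to refresh when the elapsed time is greater than or equal to the threshold are assumptions made for illustration.

```python
def rows_to_refresh(last_access_ms, now_ms, threshold_ms=7.0):
    """Return indices of segments whose last access is too old.

    A segment is skipped when it was accessed less than `threshold_ms`
    ago; otherwise it is selected for a refresh (or access) signal.
    """
    return [
        segment
        for segment, stamp in enumerate(last_access_ms)
        if now_ms - stamp >= threshold_ms
    ]


# Timestamps (ms) of the most recent access for three segments.
stale = rows_to_refresh([0.0, 5.0, 1.5], now_ms=8.0, threshold_ms=7.0)
```

Choosing a threshold shorter than the refresh period, as in this sketch, leaves a margin for the refresh signal to reach the segment before the period ends.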

Although described above with reference to a memory chip, the refresh controllers of the present disclosure may also be used in distributed processor architectures, such as those described above and throughout the present disclosure. One example of such an architecture is depicted in FIG. 7A. In such embodiments, the same substrate as memory chip 2800 may include a plurality of processing groups disposed thereon, as depicted in FIG. 7A. As explained above with reference to FIG. 3A, a "processing group" may refer to two or more processor subunits and their corresponding memory banks on the substrate. The group may represent a spatial distribution on the substrate and/or a logical grouping for the purposes of compiling code for execution on memory chip 2800. Accordingly, the substrate may include a memory array comprising a plurality of banks, such as bank 2801a and the other banks shown in FIG. 28. Furthermore, the substrate may include a processing array that may comprise a plurality of processor subunits (e.g., subunits 730a, 730b, 730c, 730d, 730e, 730f, 730g, and 730h shown in FIG. 7A).

As further explained above with reference to FIG. 7A, each processing group may include a processor subunit and one or more corresponding memory banks dedicated to the processor subunit. Moreover, to allow each processor subunit to communicate with its corresponding dedicated memory bank(s), the substrate may include a first plurality of buses, each connecting one of the processor subunits to its corresponding dedicated memory bank(s).

In such embodiments, as shown in FIG. 7A, the substrate may include a second plurality of buses to connect each processor subunit to at least one other processor subunit (e.g., an adjacent subunit in the same row, an adjacent processor subunit in the same column, or any other processor subunit on the substrate). As explained above in the section "Synchronization Using Software," the first and/or second plurality of buses may be free of timing hardware logic components, such that data transfers between processor subunits and across corresponding ones of the buses are not controlled by timing hardware logic components.

In embodiments in which the same substrate as memory chip 2800 includes a plurality of processing groups disposed thereon (e.g., as depicted in FIG. 7A), the processor subunits may further include address generators (e.g., address generator 450 depicted in FIG. 4). Moreover, each processing group may include a processor subunit and one or more corresponding memory banks dedicated to the processor subunit. Accordingly, each of the address generators may be associated with a corresponding dedicated memory bank of the plurality of memory banks. In addition, the substrate may include a plurality of buses, each connecting one of the address generators to its corresponding dedicated memory bank.

FIG. 29A depicts an exemplary refresh controller 2900 consistent with the present disclosure. Refresh controller 2900 may be incorporated in a memory chip of the present disclosure, such as memory chip 2800 of FIG. 28. As depicted in FIG. 29A, refresh controller 2900 may include a timer 2901, which may comprise an on-chip oscillator or any other timing circuit for refresh controller 2900. In the configuration depicted in FIG. 29A, timer 2901 may trigger a refresh cycle periodically (e.g., every 8 ms, 16 ms, 32 ms, 64 ms, or the like). The refresh cycle may use a row counter 2903 to cycle through all rows of the corresponding memory chip and may generate a refresh signal for each row using an adder 2907 combined with an active bit 2905. As shown in FIG. 29A, bit 2905 may be fixed at 1 ("true") to ensure that each row is refreshed during a cycle.
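The baseline behavior of FIG. 29A, in which every row is refreshed in every cycle, can be modeled with a brief Python sketch. The function `run_refresh` and its integer millisecond parameters are hypothetical; the sketch models only the roles of the timer, the row counter, and the active bit, not the adder circuit itself.

```python
def run_refresh(total_ms, period_ms, num_rows, active_bit=True):
    """Return (cycle_start_ms, row) pairs for every refresh signal issued.

    The outer loop stands in for the timer firing once per period, the
    inner loop for the row counter, and `active_bit` for the bit that
    gates the refresh signal (fixed to True in the baseline design).
    Times are integers because `range` is used to model the timer.
    """
    signals = []
    for cycle_start in range(0, total_ms, period_ms):
        for row in range(num_rows):
            if active_bit:
                signals.append((cycle_start, row))
    return signals


# Two 8 ms refresh cycles over a two-row memory.
signals = run_refresh(total_ms=16, period_ms=8, num_rows=2)
```

Because the active bit is always true in this baseline, the number of refresh signals per cycle equals the number of rows, which is the cost the skipping schemes of the disclosure aim to reduce.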

In embodiments of the present disclosure, refresh controller 2900 may include a data storage. As described above, the data storage may comprise one or more registers or nonvolatile memories associated with segments of memory chip 2800 (e.g., lines, columns, or any other groupings of memory cells within memory chip 2800). The registers or nonvolatile memories may be configured to store timestamps or other information indicative of the most recent access of the associated segments.

Refresh controller 2900 may use the stored information to skip refreshes for segments of memory chip 2800. For example, refresh controller 2900 may skip a segment in the current refresh cycle if the stored information indicates that the segment was refreshed during one or more previous refresh cycles. In another example, refresh controller 2900 may skip a segment in the current refresh cycle if the difference between the stored timestamp for the segment and the current time is below a threshold. Refresh controller 2900 may further continue to track accesses and refreshes of the segments of memory chip 2800 over multiple refresh cycles. For example, refresh controller 2900 may update the stored timestamps using timer 2901. In such embodiments, refresh controller 2900 may be configured to use an output of the timer to clear the access information stored in the data storage after a threshold time interval. For example, in embodiments where the data storage stores timestamps of the most recent access or refresh for an associated segment, refresh controller 2900 may store a new timestamp in the data storage whenever an access command or refresh signal is sent to the segment. If the data storage stores bits rather than timestamps, timer 2901 may be configured to clear bits that have been set for longer than a threshold period of time. For example, in embodiments where the data storage stores bits indicating that the associated segment was accessed in one or more previous cycles, refresh controller 2900 may clear a bit (e.g., set it to 0) in the data storage whenever timer 2901 triggers a new refresh cycle that is a threshold number of cycles (e.g., one, two, or the like) after the associated bit was set (e.g., set to 1).

Refresh controller 2900 may track accesses to the segments of memory chip 2800 in cooperation with other hardware of memory chip 2800. For example, a memory chip uses sense amplifiers to perform read operations (e.g., as shown in FIGS. 9 and 10). A sense amplifier may comprise a plurality of transistors configured to sense low-power signals from a segment of memory chip 2800 that stores data in one or more memory cells and to amplify the small voltage swing to a higher voltage level, such that the data can be interpreted by logic, such as an external CPU or GPU or the integrated processor subunits described above. Although not depicted in FIG. 29A, refresh controller 2900 may further communicate with a sense amplifier configured to access the one or more segments and change the state of at least one bit register. For example, when the sense amplifier accesses the one or more segments, it may set (e.g., set to 1) the bits indicating that the associated segments were accessed in a previous cycle. In embodiments where the data storage stores timestamps of the most recent access or refresh for an associated segment, when the sense amplifier accesses the one or more segments, it may trigger a write of a timestamp from timer 2901 to the register, memory, or other element comprising the data storage.

In any of the embodiments described above, refresh controller 2900 may be integrated with a memory controller for the plurality of memory banks. For example, similar to the embodiments depicted in FIG. 3A, refresh controller 2900 may be incorporated into a logic and control subunit associated with a memory bank or other segment of memory chip 2800.

FIG. 29B depicts another exemplary refresh controller 2900' consistent with the present disclosure. Refresh controller 2900' may be incorporated in a memory chip of the present disclosure, such as memory chip 2800 of FIG. 28. Similar to refresh controller 2900, refresh controller 2900' includes timer 2901, row counter 2903, active bit 2905, and adder 2907. Additionally, refresh controller 2900' may include a data storage 2909. As shown in FIG. 29B, data storage 2909 may comprise one or more registers or nonvolatile memories associated with segments of memory chip 2800 (e.g., lines, columns, or any other groupings of memory cells within memory chip 2800), and states within the data storage may be configured to be changed (e.g., by a sense amplifier and/or other elements of refresh controller 2900', as described above) in response to the one or more associated segments being accessed. Accordingly, refresh controller 2900' may be configured to skip a refresh of the one or more segments based on the states within the data storage. For example, if a state associated with a segment is activated (e.g., set to 1 by being switched on, having a property altered in order to store a "1," or the like), refresh controller 2900' may skip the refresh cycle for the associated segment and clear the state associated with that segment. The state may be stored in a register of at least one bit or in any other memory structure configured to store at least one bit of data.

To ensure that segments of the memory chip are refreshed or accessed during each refresh cycle, refresh controller 2900' may reset or otherwise clear the states in order to trigger a refresh signal during the next refresh cycle. In some embodiments, after a segment is skipped, refresh controller 2900' may clear the associated state to ensure that the segment is refreshed on the next refresh cycle. In other embodiments, refresh controller 2900' may be configured to reset the states within the data storage after a threshold time interval. For example, refresh controller 2900' may clear (e.g., set to 0) a state in the data storage whenever timer 2901 exceeds a threshold time since the associated state was set (e.g., set to 1 by switching on, having a property altered in order to store a '1', or the like). In some embodiments, refresh controller 2900' may use a threshold number of refresh cycles (e.g., one, two, or the like) or a threshold number of clock cycles (e.g., two, four, or the like) in place of a threshold time.
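
The skip-then-clear behavior described above can be sketched in software. The following is only an illustrative model, not the disclosed circuit: the class name `FlagRefreshController` and its methods are hypothetical, and a Python list stands in for the one-bit states held in data storage 2909.

```python
class FlagRefreshController:
    """Toy model (hypothetical): one 'recently accessed' bit per segment."""

    def __init__(self, num_segments):
        # Stand-in for the per-segment states in the data storage.
        self.accessed = [False] * num_segments

    def note_access(self, segment):
        # A read or write already restores the cells, so mark the segment.
        self.accessed[segment] = True

    def refresh_cycle(self):
        """Return the segments that receive a refresh signal this cycle."""
        refreshed = []
        for seg, flag in enumerate(self.accessed):
            if flag:
                # Skip the refresh, then clear the bit so the segment is
                # refreshed next cycle unless it is accessed again.
                self.accessed[seg] = False
            else:
                refreshed.append(seg)
        return refreshed
```

With four segments, accessing segments 1 and 3 causes the next cycle to refresh only segments 0 and 2; the cycle after that refreshes all four, because the skipped segments had their states cleared.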

In other embodiments, the state may comprise a timestamp of the most recent refresh or access of the associated segment, such that if the amount of time between the timestamp and the current time (e.g., from timer 2901 of FIGS. 29A and 29B) exceeds a predetermined threshold (e.g., 8 ms, 16 ms, 32 ms, 64 ms, or the like), refresh controller 2900' may send an access command or a refresh signal to the associated segment and update the timestamp associated with that segment (e.g., using timer 2901). Additionally or alternatively, refresh controller 2900' may be configured to skip a refresh operation for one or more segments of the plurality of memory banks if the refresh time indicator indicates a last refresh time within a predetermined time threshold. In such embodiments, after skipping a refresh operation for the one or more segments, refresh controller 2900' may be configured to alter the stored refresh time indicator associated with the one or more segments such that the one or more segments will be refreshed during the next operation cycle. For example, as described above, refresh controller 2900' may use timer 2901 to update the stored refresh time indicator.

Accordingly, the data storage may include a timestamp register configured to store a refresh time indicator indicative of a time at which the one or more segments of the plurality of memory banks were last refreshed. Moreover, refresh controller 2900' may use an output of the timer to clear access information stored in the data storage after a threshold time interval.
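
The timestamp variant can be modeled in the same spirit. This is a minimal sketch under stated assumptions: `TimestampRefreshController`, `note_access`, and `tick` are hypothetical names, time is passed in explicitly as milliseconds instead of being read from timer 2901, and all segments are assumed to have been touched at time 0.

```python
class TimestampRefreshController:
    """Toy model (hypothetical): one last-touch timestamp per segment."""

    def __init__(self, num_segments, threshold_ms=64):
        self.threshold_ms = threshold_ms
        # Stand-in for the timestamp register; assumes a touch at t = 0.
        self.last_touch_ms = [0] * num_segments

    def note_access(self, segment, now_ms):
        # A read or write counts as a touch and updates the timestamp.
        self.last_touch_ms[segment] = now_ms

    def tick(self, now_ms):
        """Refresh only segments whose last touch is older than the threshold."""
        refreshed = []
        for seg, stamp in enumerate(self.last_touch_ms):
            if now_ms - stamp > self.threshold_ms:
                self.last_touch_ms[seg] = now_ms  # the refresh is also a touch
                refreshed.append(seg)
        return refreshed
```

A segment accessed 30 ms ago is skipped under a 64 ms threshold, while untouched segments are refreshed and their timestamps updated.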

In any of the embodiments described above, the access to the one or more segments may include a write operation associated with the one or more segments. Additionally or alternatively, the access to the one or more segments may include a read operation associated with the one or more segments.

Moreover, as depicted in FIG. 29B, refresh controller 2900' may include row counter 2903 and adder 2907, which are configured to assist in updating data storage 2909 based, at least in part, on the states within the data storage. Data storage 2909 may comprise a bit table associated with the plurality of memory banks. For example, the bit table may comprise an array of switches (or switching elements, such as transistors) or registers (e.g., SRAM or the like) configured to hold bits for the associated segments. Additionally or alternatively, data storage 2909 may store timestamps associated with the plurality of memory banks.

In addition, refresh controller 2900' may include a refresh gate 2911 configured to control whether a refresh of the one or more segments occurs based on a corresponding value stored in the bit table. For example, refresh gate 2911 may comprise a logic gate (e.g., an 'and' gate) configured to nullify a refresh signal from row counter 2903 if the corresponding state in data storage 2909 indicates that the associated segment was refreshed or accessed during one or more previous clock cycles. In other embodiments, refresh gate 2911 may comprise a microprocessor or other circuit configured to nullify a refresh signal from row counter 2903 if the corresponding timestamp from data storage 2909 indicates that the associated segment was refreshed or accessed within a predetermined threshold time value.
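
The gating logic can be written as a Boolean expression. The sketch below assumes that the stored bit is inverted before it reaches the 'and' gate, which the text does not state explicitly; the function names `refresh_gate` and `timestamp_gate` are hypothetical.

```python
def refresh_gate(refresh_request: bool, recently_touched: bool) -> bool:
    """Pass the row counter's refresh request only if the segment's bit is clear.

    Assumption: the gate computes request AND (NOT bit).
    """
    return refresh_request and not recently_touched


def timestamp_gate(refresh_request: bool, last_touch_ms: int,
                   now_ms: int, threshold_ms: int) -> bool:
    """Timestamp variant: nullify the request if the touch is recent enough."""
    recently_touched = (now_ms - last_touch_ms) <= threshold_ms
    return refresh_gate(refresh_request, recently_touched)
```

A request for a recently accessed segment is nullified, while a request for a stale segment passes through unchanged.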

FIG. 30 is an exemplary flowchart of a process 3000 for partial refreshes in a memory chip (e.g., memory chip 2800 of FIG. 28). Process 3000 may be executed by a refresh controller consistent with the present disclosure, such as refresh controller 2900 of FIG. 29A or refresh controller 2900' of FIG. 29B.

At step 3010, the refresh controller may access information indicative of access operations for one or more segments of a plurality of memory banks. For example, as explained above with respect to FIGS. 29A and 29B, the refresh controller may include a data storage that is associated with segments of memory chip 2800 (e.g., lines, columns, or any other groupings of memory cells within memory chip 2800) and is configured to store timestamps or other information indicative of the most recent access of the associated segments.

At step 3020, the refresh controller may generate refresh and/or access commands based, at least in part, on the accessed information. For example, as explained above with respect to FIGS. 29A and 29B, the refresh controller may skip a refresh operation for one or more segments of the plurality of memory banks if the accessed information indicates a last refresh or access time within a predetermined time threshold and/or indicates that the last refresh or access occurred during one or more previous clock cycles. Additionally or alternatively, the refresh controller may generate a command to refresh or access the associated segment if the most recent refresh or access time indicated by the accessed information exceeds a predetermined threshold and/or if the most recent refresh or access did not occur during one or more previous clock cycles.

At step 3030, the refresh controller may alter the stored refresh time indicator associated with the one or more segments such that the one or more segments will be refreshed during the next operation cycle. For example, after skipping a refresh operation for the one or more segments, the refresh controller may alter the information indicative of access operations for the one or more segments such that the one or more segments will be refreshed during the next clock cycle. Accordingly, the refresh controller may clear (e.g., set to 0) the state for a segment after skipping a refresh cycle. Additionally or alternatively, the refresh controller may set (e.g., set to 1) the state for segments that are refreshed and/or accessed during the current cycle. In embodiments where the information indicative of access operations for the one or more segments includes timestamps, the refresh controller may update any stored timestamps associated with segments that are refreshed and/or accessed during the current cycle.

Method 3000 may further include additional steps. For example, in addition to or as an alternative to step 3030, a sense amplifier may access the one or more segments and change the information associated with the one or more segments. Additionally or alternatively, the sense amplifier may signal the refresh controller when an access has occurred such that the refresh controller updates the information associated with the one or more segments. As explained above, a sense amplifier may comprise a plurality of transistors configured to sense low-power signals from a segment of the memory chip storing data in one or more memory cells and to amplify the small voltage swing to higher voltage levels such that the data can be interpreted by logic, such as an external CPU or GPU or an integrated processor subunit as explained above. In such an example, whenever the sense amplifier accesses the one or more segments, it may set (e.g., set to 1) the bits associated with the segments to indicate that the associated segments were accessed in a previous cycle. In embodiments where the information indicative of access operations for the one or more segments includes timestamps, whenever the sense amplifier accesses the one or more segments, it may trigger a write of a timestamp from the timer of the refresh controller to the data storage in order to update any stored timestamps associated with the segments.

FIG. 31 is an exemplary flowchart of a process 3100 for determining refreshes for a memory chip (e.g., memory chip 2800 of FIG. 28). Process 3100 may be implemented within a compiler consistent with the present disclosure. As explained above, a 'compiler' refers to any computer program that converts a higher-level language (e.g., a procedural language, such as C, FORTRAN, BASIC, or the like; an object-oriented language, such as Java, C++, Pascal, Python, or the like; etc.) to a lower-level language (e.g., assembly code, object code, machine code, or the like). The compiler may allow a human to program a series of instructions in a human-readable language, which is then converted to a machine-executable language. The compiler may comprise software instructions executed by one or more processors.

At step 3110, the one or more processors may receive higher-level computer code. For example, the higher-level computer code may be encoded in one or more files on a memory (e.g., a non-volatile memory such as a hard disk drive, a volatile memory such as DRAM, or the like) or received over a network (e.g., the Internet or the like). Additionally or alternatively, the higher-level computer code may be received from a user (e.g., using an input device such as a keyboard).

At step 3120, the one or more processors may identify a plurality of memory segments, distributed over a plurality of memory banks associated with a memory chip, that are to be accessed by the higher-level computer code. For example, the one or more processors may access a data structure defining the plurality of memory banks and a corresponding structure of the memory chip. The one or more processors may access the data structure from a memory (e.g., a non-volatile memory such as a hard disk drive, a volatile memory such as DRAM, or the like) or receive the data structure over a network (e.g., the Internet or the like). In such embodiments, the data structure may be included in one or more libraries accessible by the compiler, allowing the compiler to generate instructions for the particular memory chip to be accessed.

At step 3130, the one or more processors may assess the higher-level computer code to identify a plurality of memory read commands to occur over a plurality of memory access cycles. For example, the one or more processors may identify each operation within the higher-level computer code that requires one or more read commands from memory and/or one or more write commands to memory. Such instructions may include variable initialization, variable re-assignment, logic operations on variables, input-output operations, or the like.

At step 3140, the one or more processors may cause a distribution of data associated with the plurality of memory access commands across each of the plurality of memory segments such that each of the plurality of memory segments is accessed during each of the plurality of memory access cycles. For example, the one or more processors may identify the memory segments from the data structure defining the structure of the memory chip and then assign variables from the higher-level code to various ones of the memory segments such that each memory segment is accessed (e.g., via a write or a read) at least once during each refresh cycle (which may comprise a particular number of clock cycles). In such an example, the one or more processors may access information indicative of how many clock cycles each line of higher-level code requires, in order to assign variables from the lines of higher-level code such that each memory segment is accessed (e.g., via a write or a read) at least once during the particular number of clock cycles.
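
One way to picture step 3140 is a placement pass followed by a coverage check. The sketch below is purely illustrative and is not the disclosed compiler: `assign_round_robin` and `uncovered_segments` are hypothetical helpers, round-robin placement is a swapped-in simplification, and the access trace is assumed to contain exactly one variable access per clock cycle.

```python
def assign_round_robin(variables, num_segments):
    """Place variables on segments in rotation (a simplifying assumption)."""
    return {var: i % num_segments for i, var in enumerate(variables)}


def uncovered_segments(trace, placement, num_segments, window):
    """Report, per refresh window, the segments that no access touches.

    trace: list of variable names, one access per clock cycle (assumed).
    window: number of clock cycles in one refresh cycle.
    """
    gaps = {}
    for start in range(0, len(trace), window):
        touched = {placement[v] for v in trace[start:start + window]}
        missing = [s for s in range(num_segments) if s not in touched]
        if missing:
            gaps[start // window] = missing
    return gaps
```

An empty result means every segment is accessed at least once in every window, so no explicit refresh would be needed for that trace; a non-empty result lists the windows and segments that would still need a refresh or a different placement.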

In another example, the one or more processors may first generate machine code or other lower-level code from the higher-level code. The one or more processors may then assign variables from the lower-level code to various ones of the memory segments such that each memory segment is accessed (e.g., via a write or a read) at least once during each refresh cycle (which may comprise a particular number of clock cycles). In such an example, each line of lower-level code may require a single clock cycle.

In any of the examples given above, the one or more processors may further assign logic operations or other commands that use temporary output to various ones of the memory segments. Such temporary outputs may still result in read and/or write commands, such that the assigned memory segment is still accessed during the refresh cycle even though no named variable has been assigned to that memory segment.

Method 3100 may further include additional steps. For example, in embodiments where the variables are assigned prior to compiling, the one or more processors may generate machine code or other lower-level code from the higher-level code. Moreover, the one or more processors may transmit the compiled code for execution by the memory chip and corresponding logic circuits. The logic circuits may comprise conventional circuits such as GPUs or CPUs, or may comprise processing groups on the same substrate as the memory chip, e.g., as depicted in FIG. 7A. Accordingly, as described above, the substrate may include a memory array that includes a plurality of banks, such as bank 2801a and the other banks shown in FIG. 28. Furthermore, the substrate may include a processing array that includes a plurality of processor subunits (e.g., subunits 730a, 730b, 730c, 730d, 730e, 730f, 730g, and 730h shown in FIG. 7A).

FIG. 32 is another exemplary flowchart of a process 3200 for determining refreshes for a memory chip (e.g., memory chip 2800 of FIG. 28). Process 3200 may be implemented within a compiler consistent with the present disclosure. Process 3200 may be executed by one or more processors executing software instructions comprising the compiler. Process 3200 may be implemented separately from or in combination with process 3100 of FIG. 31.

At step 3210, similar to step 3110, the one or more processors may receive higher-level computer code. At step 3220, similar to step 3120, the one or more processors may identify a plurality of memory segments, distributed over a plurality of memory banks associated with a memory chip, that are to be accessed by the higher-level computer code.

At step 3230, the one or more processors may assess the higher-level computer code to identify a plurality of memory read commands, each implicating one or more of the plurality of memory segments. For example, the one or more processors may identify each operation within the higher-level computer code that requires one or more read commands from memory and/or one or more write commands to memory. Such instructions may include variable initialization, variable re-assignment, logic operations on variables, input-output operations, or the like.

In some embodiments, the one or more processors may simulate an execution of the higher-level code using logic circuits and a plurality of memory segments. For example, the simulation may comprise a line-by-line step-through of the higher-level code, similar to that of a debugger or other instruction set simulator (ISS). The simulation may further maintain internal variables that represent the addresses of the plurality of memory segments, similar to how a debugger maintains internal variables that represent the registers of a processor.

At step 3240, the one or more processors may, based on analysis of the memory access commands and for each memory segment among the plurality of memory segments, track an amount of time that would accrue from the last access to the memory segment. For example, using the simulation described above, the one or more processors may determine the lengths of time between each access (e.g., a read or a write) to one or more addresses within each of the plurality of memory segments. The lengths of time may be measured in absolute time, clock cycles, or refresh cycles (e.g., as determined by a known refresh rate of the memory chip).

At step 3250, in response to a determination that the amount of time since the last access for any particular memory segment would exceed a predetermined threshold, the one or more processors may introduce into the higher-level computer code at least one of a memory refresh command or a memory access command configured to cause an access to the particular memory segment. For example, the one or more processors may include a refresh command for execution by a refresh controller (e.g., refresh controller 2900 of FIG. 29A or refresh controller 2900' of FIG. 29B). In embodiments where the logic circuits are not embedded on the same substrate as the memory chip, the one or more processors may generate the refresh commands for sending to the memory chip separately from the lower-level code for sending to the logic circuits.

Additionally or alternatively, the one or more processors may include an access command for execution by a memory controller (which may be separate from or integrated with the refresh controller). The access command may comprise a dummy command configured to trigger a read operation on the memory segment without causing the logic circuits to perform any further operation on the variable read from or written to the memory segment.
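
Steps 3240 and 3250 can be illustrated with a small pass over a simulated instruction stream. This is a hedged sketch, not the disclosed compiler: `insert_refreshes` is a hypothetical function, each original instruction is assumed to take one clock cycle and to touch one segment, inserted commands are assumed to consume no cycles, and all segments are assumed fresh at cycle 0.

```python
def insert_refreshes(program, num_segments, threshold, filler="refresh"):
    """Insert a filler command before any segment goes stale.

    program: list of (op, segment) tuples, one per clock cycle (assumed).
    threshold: maximum number of cycles a segment may go untouched.
    filler: "refresh" for a refresh command, or e.g. "dummy_read" for a
            dummy access command (both labels are illustrative).
    """
    last_access = [0] * num_segments  # cycle of each segment's last touch
    out = []
    for cycle, (op, seg) in enumerate(program):
        for s in range(num_segments):
            # The current instruction will touch `seg` itself.
            if s != seg and cycle - last_access[s] >= threshold:
                out.append((filler, s))
                last_access[s] = cycle
        out.append((op, seg))
        last_access[seg] = cycle
    return out
```

For a program that touches only segment 0 for three cycles with a threshold of two cycles, the pass inserts one command for segment 1 before the third instruction and none afterward once the program itself reads segment 1.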

In some embodiments, the compiler may include a combination of steps from process 3100 and process 3200. For example, the compiler may assign variables according to step 3140 and then run the simulation described above to add any additional memory refresh commands or memory access commands according to step 3250. With this combination, the compiler may distribute the variables across as many memory segments as possible and generate refresh or access commands for any memory segments that cannot be accessed within the predetermined threshold amount of time. In another combinatory example, the compiler may simulate the code according to step 3230 and assign variables according to step 3140 based on any memory segments that the simulation indicates will not be accessed within the predetermined threshold amount of time. In some embodiments, this combination may further include step 3250, such that the compiler generates refresh or access commands for any memory segments that cannot be accessed within the predetermined threshold amount of time even after the assignments according to step 3140 are complete.

Refresh controllers of the present disclosure may allow software executed by logic circuits (whether conventional logic circuits such as CPUs and GPUs or processing groups on the same substrate as the memory chip, e.g., as depicted in FIG. 7A) to disable the automatic refresh executed by the refresh controller and to control the refresh via the executed software instead. Accordingly, some embodiments of the present disclosure may provide software with a known access pattern to a memory chip (e.g., if the compiler has access to a data structure defining a plurality of memory banks and a corresponding structure of the memory chip). In such embodiments, a post-compiling optimizer may disable the automatic refresh and manually set refresh control only for segments of the memory chip that are not accessed within a threshold amount of time. Thus, similar to step 3250 described above but after compilation, the post-compiling optimizer may generate refresh commands to ensure that each memory segment is accessed or refreshed within the predetermined threshold amount of time.

Another example of reducing refresh cycles may use predefined patterns of access to the memory chip. For example, if software executed by the logic circuits can control its access pattern for the memory chip, some embodiments may create access patterns for refresh that go beyond a conventional linear line refresh. For example, if a controller determines that the software executed by the logic circuits regularly accesses every second row of memory, a refresh controller of the present disclosure may use an access pattern that does not refresh every second line, in order to increase the speed of the memory chip and reduce its power consumption.

An example of such a refresh controller is shown in FIG. 33. FIG. 33 illustrates an exemplary refresh controller 3300 configured by stored patterns, consistent with the present disclosure. Refresh controller 3300 may be included in a memory chip of the present disclosure, e.g., a memory chip having a plurality of memory banks and a plurality of memory segments included in each of the plurality of memory banks, such as memory chip 2800 of FIG. 28.

Refresh controller 3300 includes a timer 3301 (similar to timer 2901 of FIGS. 29A and 29B), a row counter 3303 (similar to row counter 2903 of FIGS. 29A and 29B), and a summer 3305 (similar to summer 2907 of FIGS. 29A and 29B). Refresh controller 3300 further includes a data storage 3307. Unlike data storage 2909 of FIG. 29B, data storage 3307 may hold at least one memory refresh pattern to be implemented in refreshing the plurality of memory segments included in each of the plurality of memory banks. For example, as shown in FIG. 33, data storage 3307 may include Li values (e.g., L1, L2, L3, and L4 in the example of FIG. 33) and Hi values (e.g., H1, H2, H3, and H4 in the example of FIG. 33) that delimit the segments of a memory bank by row and/or column. In addition, each segment may be associated with an Inci variable (e.g., Inc1, Inc2, Inc3, and Inc4 in the example of FIG. 33) that defines how the rows associated with the segment are incremented (e.g., whether every row is accessed or refreshed, whether every other row is accessed or refreshed, and so on). Thus, as shown in FIG. 33, the refresh pattern may comprise a table that includes a plurality of memory segment identifiers assigned by software to identify the ranges of memory segments in a particular memory bank that are to be refreshed during a refresh cycle and the ranges of memory segments in that memory bank that are not to be refreshed during the refresh cycle.
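The table described above can be illustrated with a minimal sketch. It assumes that Li and Hi are inclusive row bounds and that Inci is a row step; the names `SegmentEntry` and `rows_to_refresh` and the sample values are hypothetical and do not appear in FIG. 33.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentEntry:
    low: int   # L_i: first row of the segment
    high: int  # H_i: last row of the segment (treated as inclusive here, an assumption)
    inc: int   # Inc_i: row step (1 = every row, 2 = every other row)


def rows_to_refresh(pattern):
    """Return, in ascending order, the rows one refresh cycle would touch."""
    rows = set()
    for entry in pattern:
        rows.update(range(entry.low, entry.high + 1, entry.inc))
    return sorted(rows)


# Hypothetical two-entry table; rows 4-7 fall in no entry and are not refreshed.
pattern = [
    SegmentEntry(low=0, high=3, inc=1),
    SegmentEntry(low=8, high=14, inc=2),
]
print(rows_to_refresh(pattern))
```

Under these assumptions, rows outside every entry are simply never visited, which is one way the table could mark ranges that are not to be refreshed.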

Accordingly, data storage 3307 may define refresh patterns that software executed by logic circuits (either conventional logic circuits such as CPUs and GPUs, or processing groups on the same substrate as the memory chip, as depicted in FIG. 7A) may select for use. A memory refresh pattern may be configurable using software to identify which of the memory segments in a particular memory bank are to be refreshed during a refresh cycle and which are not. Refresh controller 3300 may thus refresh, in accordance with Inci, some or all rows within the defined segments that are not accessed during the current cycle, and may skip other rows of the defined segments that are set for access during the current cycle.

In embodiments where data storage 3307 of refresh controller 3300 includes a plurality of memory refresh patterns, each memory refresh pattern may represent a different pattern for refreshing the plurality of memory segments included in each of the plurality of memory banks. The memory refresh patterns may be selectable for use on the plurality of memory segments. Accordingly, refresh controller 3300 may be configured to allow selection of which of the memory refresh patterns to implement during a particular refresh cycle. For example, software executed by logic circuits (either conventional logic circuits such as CPUs and GPUs, or processing groups on the same substrate as the memory chip, as depicted in FIG. 7A) may select different memory refresh patterns for use during one or more different refresh cycles. Alternatively, the software may select a single memory refresh pattern for use throughout some or all of the refresh cycles.
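Selection among several stored patterns might be sketched as follows. The class `PatternStore`, its methods, and the pattern names are hypothetical; each pattern is modeled, by assumption, as the set of segment indices to refresh.

```python
class PatternStore:
    """Holds named refresh patterns; each pattern is a set of segment indices to refresh."""

    def __init__(self, patterns):
        if not patterns:
            raise ValueError("at least one pattern is required")
        self._patterns = {name: frozenset(segs) for name, segs in patterns.items()}
        self._selected = next(iter(self._patterns))  # default: first stored pattern

    def select(self, name):
        """Called by software to choose the pattern for upcoming refresh cycles."""
        if name not in self._patterns:
            raise KeyError(name)
        self._selected = name

    def active(self):
        """Return the pattern the controller would apply in the current cycle."""
        return self._patterns[self._selected]


store = PatternStore({"all_segments": {0, 1, 2, 3}, "upper_half": {2, 3}})
first_cycle = store.active()   # pattern in effect before any selection
store.select("upper_half")     # software picks a different pattern
later_cycle = store.active()
```

Leaving the selection unchanged across cycles corresponds to the alternative in which a single pattern is used throughout.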

The memory refresh patterns may be encoded using one or more variables stored in data storage 3307. For example, in embodiments where the memory segments are arranged in rows, each memory segment identifier may be configured to identify a particular location within a row of memory where a memory refresh should begin or end. For example, in addition to Li and Hi, one or more additional variables may define which portions of the rows delimited by Li and Hi fall within the segment.

FIG. 34 is an exemplary flowchart of a process 3400 for determining refreshes for a memory chip (e.g., memory chip 2800 of FIG. 28). Process 3400 may be implemented by software within a refresh controller consistent with the present disclosure (e.g., refresh controller 3300 of FIG. 33).

At step 3410, the refresh controller may store at least one memory refresh pattern to be implemented in refreshing the plurality of memory segments included in each of the plurality of memory banks. For example, as explained above with respect to FIG. 33, the refresh pattern may comprise a table that includes a plurality of memory segment identifiers assigned by software to identify the ranges of memory segments in a particular memory bank that are to be refreshed during a refresh cycle and the ranges of memory segments in that memory bank that are not to be refreshed during the refresh cycle.

In some embodiments, the at least one refresh pattern may be encoded onto the refresh controller at manufacture (e.g., onto a read-only memory (ROM) associated with, or at least accessible by, the refresh controller). In such embodiments, the refresh controller may access the at least one refresh pattern without storing it.

At steps 3420 and 3430, the refresh controller may use software to identify which of the memory segments in a particular memory bank are to be refreshed during a refresh cycle and which are not. For example, as explained above with respect to FIG. 33, software executed by logic circuits (either conventional logic circuits such as CPUs and GPUs, or processing groups on the same substrate as the memory chip, as depicted in FIG. 7A) may select at least one memory refresh pattern. The refresh controller may then access the selected memory refresh pattern to generate corresponding refresh signals during each refresh cycle. In accordance with the at least one refresh pattern, the refresh controller may refresh some or all portions of the defined segments that are not accessed during the current refresh cycle and may skip other portions of the defined segments that are set for access during the current cycle.

At step 3440, the refresh controller may generate corresponding refresh commands. For example, as depicted in FIG. 33, summer 3305 may include logic circuitry configured to nullify refresh signals for particular segments that are not to be refreshed according to the at least one memory refresh pattern in data storage 3307. Additionally or alternatively, a microprocessor (not shown in FIG. 33) may generate particular refresh signals based on which segments are to be refreshed according to the at least one memory refresh pattern in data storage 3307.
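The nullifying of refresh signals can be pictured with a short behavioral sketch. It models only the gating decision, not the circuit: the function `refresh_signals` and the representation of skipped segments as inclusive row ranges are assumptions made for illustration.

```python
def refresh_signals(num_rows, skip_ranges):
    """Yield (row, refresh_enabled) for one full pass of the row counter.

    skip_ranges: iterable of (low, high) inclusive row ranges whose refresh
    signal is nullified in this cycle (hypothetical representation).
    """
    skip_ranges = list(skip_ranges)
    for row in range(num_rows):
        suppressed = any(low <= row <= high for low, high in skip_ranges)
        yield row, not suppressed


# Rows 2-3 belong to a segment that is not to be refreshed in this cycle.
signals = list(refresh_signals(6, [(2, 3)]))
print(signals)
```

In this sketch the row counter still advances through every row, and only the enable bit changes for rows in a skipped range.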

Method 3400 may further include additional steps. For example, in embodiments where the at least one memory refresh pattern is configured to change (e.g., moving from L1, H1, and Inc1 to L2, H2, and Inc2, as shown in FIG. 33) every one, two, or other number of refresh cycles, the refresh controller may access a different portion of the data storage for the next determination of refresh signals in accordance with steps 3430 and 3440. Similarly, if software executed by logic circuits (either conventional logic circuits such as CPUs and GPUs, or processing groups on the same substrate as the memory chip, as depicted in FIG. 7A) selects a new memory refresh pattern from the data storage for use in one or more future refresh cycles, the refresh controller may access a different portion of the data storage for the next determination of refresh signals in accordance with steps 3430 and 3440.
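A pattern that changes every fixed number of refresh cycles reduces to simple index arithmetic. The sketch below assumes the entries are visited in order and wrap around after the last one; the wrap-around and the name `pattern_index` are illustrative assumptions, not details taken from FIG. 33 or FIG. 34.

```python
def pattern_index(cycle, num_patterns, cycles_per_pattern):
    """Index of the table entry to use in refresh cycle `cycle` (0-based)."""
    if num_patterns < 1 or cycles_per_pattern < 1:
        raise ValueError("both counts must be positive")
    return (cycle // cycles_per_pattern) % num_patterns


# Four entries (e.g., L1/H1/Inc1 through L4/H4/Inc4), each held for two cycles.
schedule = [pattern_index(c, num_patterns=4, cycles_per_pattern=2) for c in range(10)]
print(schedule)
```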

Selectable Sized Memory Chips

When a memory chip is designed for a particular memory capacity, changing that capacity, whether up or down, may require redesigning the product and an entire mask set. Often, product design proceeds in parallel with market research, and in some cases the design is completed before market research is available. Thus, there may be a disconnect between the product design and actual market demand. The present disclosure proposes a way to flexibly provide memory chips with memory capacities commensurate with market demand. The design method may include designing dies on a wafer along with appropriate interconnect circuitry, such that memory chips containing one or more dies can be selectively cut from the wafer, providing an opportunity to produce memory chips of variable memory capacity from a single wafer.

The present disclosure relates to systems and methods for fabricating memory chips by cutting them from a wafer. The method may be used to produce selectable sized memory chips from the wafer. An example embodiment of a wafer 3501 containing dies 3503 is shown in FIG. 35A. Wafer 3501 may be formed from a semiconductor material (e.g., silicon (Si), silicon germanium (SiGe), silicon on insulator (SOI), gallium nitride (GaN), aluminum nitride (AlN), aluminum gallium nitride (AlGaN), boron nitride (BN), gallium arsenide (GaAs), aluminum gallium arsenide (AlGaAs), indium nitride (InN), combinations thereof, and the like). Dies 3503 may include any suitable circuit elements (e.g., transistors, capacitors, resistors, and the like), which may include any suitable semiconductor, dielectric, or metallic components. Dies 3503 may be formed from a semiconductor material that may be the same as or different from the material of wafer 3501. In addition to dies 3503, wafer 3501 may include other structures and/or circuitry. In some embodiments, one or more coupling circuits may be provided to couple together one or more of the dies. In an example embodiment, such a coupling circuit may include a bus shared by two or more dies 3503. Additionally, the coupling circuit may include one or more logic circuits designed to control circuitry associated with dies 3503 and/or to direct information to and from dies 3503. In some cases, the coupling circuit may include memory access management logic. Such logic may translate logical memory addresses into physical addresses associated with dies 3503. As used herein, the term "fabrication" may refer collectively to any of the steps for building the disclosed wafers, dies, and/or chips. For example, fabrication may refer to the simultaneous layout and formation of the various dies (and any other circuitry) included on the wafer. Fabrication may also refer to cutting selectable sized memory chips from the wafer to include one die in some cases or multiple dies in others. Of course, the term fabrication is not limited to these examples and may include other aspects associated with producing some or all of the disclosed memory chips and their intermediate structures.
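One way memory access management logic could translate a logical address into a physical location is sketched below. The linear layout (dies in order, banks in order within each die, equal bank sizes) and the function name `translate` are assumptions made for illustration; the disclosure does not specify a mapping.

```python
def translate(logical_addr, dies, banks_per_die, bank_bytes):
    """Map a flat logical address to a (die, bank, offset) tuple.

    Assumes a simple linear layout; all parameters are hypothetical.
    """
    die_bytes = banks_per_die * bank_bytes
    if not 0 <= logical_addr < dies * die_bytes:
        raise ValueError("logical address out of range")
    die, within_die = divmod(logical_addr, die_bytes)
    bank, offset = divmod(within_die, bank_bytes)
    return die, bank, offset


# Two dies of four 1 KiB banks each: address 5127 lands in die 1, bank 1, offset 7.
location = translate(5 * 1024 + 7, dies=2, banks_per_die=4, bank_bytes=1024)
print(location)
```

Because the mapping depends only on the die count passed in, the same logic would serve a single-die chip and a multi-die chip cut from the same wafer.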

A die 3503 or a group of dies may be used to fabricate a memory chip. The memory chip may include a distributed processor, as described in other sections of this disclosure. As shown in FIG. 35B, die 3503 may include a substrate 3507 and a memory array disposed on the substrate. The memory array may include one or more memory units, such as memory banks 3511A-3511D, designed to store data. In various embodiments, the memory banks may include semiconductor-based circuit elements such as transistors, capacitors, and the like. In an example embodiment, a memory bank may include multiple rows and columns of storage units. In some cases, such a memory bank may have a capacity of one megabyte or more. The memory banks may include dynamic or static random-access memory (DRAM or SRAM).

Die 3503 may further include a processing array disposed on the substrate, the processing array including a plurality of processor subunits 3515A-3515D, as shown in FIG. 35B. As described above, each memory bank may include a dedicated processor subunit connected by a dedicated bus. For example, processor subunit 3515A is associated with memory bank 3511A via a bus or connection 3512. It should be understood that various connections between memory banks 3511A-3511D and processor subunits 3515A-3515D are possible, and only some of them are shown in FIG. 35B. In an example embodiment, a processor subunit may perform read/write operations on its associated memory bank and may further perform refresh operations or any other suitable operations on the data stored in the various memory banks.

Die 3503 may include a first group of buses configured to connect the processor subunits with their corresponding memory banks. An example bus may include a set of wires or conductors that connect electrical components and allow transfers of data and addresses between each memory bank and its associated processor subunit. In an example embodiment, connection 3512 may serve as a dedicated bus connecting processor subunit 3515A to memory bank 3511A. Die 3503 may include a group of such buses, each connecting a processor subunit to a corresponding dedicated memory bank. Additionally, die 3503 may include another group of buses, each connecting processor subunits (e.g., subunits 3515A-3515D) to each other. For example, such buses may include connections 3516A-3516D. In various embodiments, data for memory banks 3511A-3511D may be delivered via an input-output bus 3530. In an example embodiment, input-output bus 3530 may carry data-related information and command-related information for controlling the operation of the memory units of die 3503. The data information may include data for storing in the memory banks, data read from the memory banks, processing results from one or more of the processor subunits based on operations performed on data stored in the corresponding memory banks, command-related information, various codes, and the like.

In various cases, the data and commands transmitted by input-output bus 3530 may be controlled by an input-output (IO) controller 3521. In an example embodiment, IO controller 3521 may control the flow of data between bus 3530 and processor subunits 3515A-3515D. IO controller 3521 may determine from which of processor subunits 3515A-3515D information is retrieved. In various embodiments, IO controller 3521 may include a fuse 3554 configured to deactivate IO controller 3521. Fuse 3554 may be used when multiple dies are combined to form a larger memory chip (referred to as a multi-die memory chip, in contrast to a single-die memory chip that contains only one die). The multi-die memory chip may then use the IO controller of one of its die units and disable the IO controllers associated with the other die units by using the fuses corresponding to those IO controllers.

Each memory chip or processor die, or group of dies, may include distributed processors associated with corresponding memory banks. In some embodiments, these distributed processors may be arranged in a processing array disposed on the same substrate as a plurality of memory banks. In addition, the processing array may include one or more logic portions, each including an address generator (also referred to as an address generation unit (AGU)). In some cases, the address generator may be part of at least one processor subunit. The address generator may generate the memory addresses required for fetching data from one or more memory banks associated with the memory chip. Address generation calculations may involve integer arithmetic operations such as addition, subtraction, modulo operations, or bit shifts. The address generator may be configured to operate on multiple operands at a time. Furthermore, multiple address generators may perform more than one address calculation operation simultaneously. In various embodiments, an address generator may be associated with a corresponding memory bank. The address generator may be connected to its corresponding memory bank by a corresponding bus line.
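The integer operations listed above (addition, modulo, bit shift) are enough to express a common addressing mode. The sketch below generates byte addresses for consecutive words in a circular buffer; the function `generate_addresses`, its parameters, and the circular-buffer use case are hypothetical illustrations rather than features of the disclosed address generator.

```python
def generate_addresses(base, count, word_shift, buffer_bytes):
    """Byte addresses for `count` consecutive words in a circular buffer.

    word_shift: log2 of the word size in bytes (2 means 4-byte words).
    """
    if buffer_bytes <= 0:
        raise ValueError("buffer size must be positive")
    addresses = []
    for index in range(count):
        byte_offset = (index << word_shift) % buffer_bytes  # bit shift, then modulo wrap
        addresses.append(base + byte_offset)                # addition of the base address
    return addresses


# Five 4-byte words in a 16-byte buffer at 0x100: the fifth access wraps to the start.
addrs = generate_addresses(base=0x100, count=5, word_shift=2, buffer_bytes=16)
print([hex(a) for a in addrs])
```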

In various embodiments, a selectable sized memory chip may be formed from wafer 3501 by selectively cutting different regions of the wafer. As noted, the wafer may include a group of dies 3503, the group including any set of two or more dies (e.g., 2, 3, 4, 5, 10, or more dies) included on the wafer. As discussed further below, in some cases a single memory chip may be formed by cutting a portion of the wafer that contains just one die of the group. In such cases, the resulting memory chip would include the memory units associated with that one die. In other cases, however, a selectable sized memory chip may be formed to include more than one die. Such a memory chip may be formed by cutting a region of the wafer that contains two or more dies of the group included on the wafer. In such cases, the dies, together with the coupling circuit that couples them to each other, provide a multi-die memory chip. Additional circuit elements, such as clock elements, data buses, or any suitable logic circuits, may also be wired between the chips.

In some cases, at least one controller associated with the group of dies may be configured to control the operation of the group as a single memory chip (e.g., a multiple-memory-unit memory chip). The controller may include one or more circuits that manage the flow of data going to and from the memory chip. A memory controller may be part of the memory chip, or it may be part of a separate chip not directly associated with the memory chip. In an example embodiment, the controller may be configured to facilitate read and write requests or other commands associated with the distributed processors of the memory chip, and may be configured to control any other suitable aspect of the memory chip (e.g., refreshing the memory chip, interacting with the distributed processors, and the like). In some cases, the controller may be part of die 3503, and in other cases, the controller may be laid out adjacent to die 3503. In various embodiments, the controller may also include at least one memory controller for at least one of the memory units included on the memory chip. In some cases, the protocol used to access information on the memory chip may be agnostic to duplicated logic and memory units (e.g., memory banks) that may be present on the memory chip. The protocol may be configured to use different IDs or address ranges so that data on the memory chip is accessed properly. Examples of chips with such protocols include chips with a Joint Electron Device Engineering Council (JEDEC) double data rate (DDR) controller, in which different memory banks have different address ranges, and chips with a serial peripheral interface (SPI) connection, in which different memory units (e.g., memory banks) have different identifications (IDs).

In various embodiments, multiple regions may be cut from the wafer, and the various regions may each include one or more dies. In some cases, each separate region may be used to build a multi-die memory chip. In some cases, two or more of the regions may have the same shape and the same number of dies coupled to the coupling circuit in the same way. Alternatively, in an example embodiment, a first group of regions may be used to form a first type of memory chip, and a second group of regions may be used to form a second type of memory chip. For example, wafer 3501, as shown in FIG. 35C, may include a region 3505 that contains a single die and a second region 3504 that contains a group of two dies. When region 3505 is cut from wafer 3501, a single-die memory chip is provided. When region 3504 is cut from wafer 3501, a multi-die memory chip is provided. The groups shown in FIG. 35C are only illustrative, and various other regions and groups of dies may be cut from wafer 3501.
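The effect of the cut on capacity can be shown with simple arithmetic. The defaults below (four banks per die, as in the banks 3511A-3511D of FIG. 35B, and one megabyte per bank, the lower bound mentioned earlier) are illustrative assumptions, and the function name `chip_capacity_bytes` is hypothetical.

```python
def chip_capacity_bytes(num_dies, banks_per_die=4, bank_bytes=1 << 20):
    """Total capacity of a chip cut from a region containing `num_dies` dies."""
    if num_dies < 1:
        raise ValueError("a memory chip needs at least one die")
    return num_dies * banks_per_die * bank_bytes


single_die = chip_capacity_bytes(1)  # e.g., a region like 3505 with one die
two_dies = chip_capacity_bytes(2)    # e.g., a region like 3504 with two dies
print(single_die >> 20, "MiB;", two_dies >> 20, "MiB")
```

Under these assumptions the same mask set yields a 4 MiB or an 8 MiB part depending only on where the wafer is cut.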

In various embodiments, the dies may be formed on wafer 3501 such that they are arranged along one or more rows of the wafer, as shown, for example, in FIG. 35C. The dies may share an input-output bus 3530 corresponding to the one or more rows. In an example embodiment, a group of dies may be cut from wafer 3501 using various cutting shapes, and when cutting a group of dies to be used in forming a memory chip, at least a portion of the shared input-output bus 3530 may be excluded (e.g., only a portion of input-output bus 3530 may be included as part of the memory chip formed from the group of dies).

As previously discussed, when multiple dies (e.g., dies 3506A and 3506B, as shown in FIG. 35C) are used to form a memory chip 3517, one IO controller corresponding to one of the dies may be enabled and configured to control the flow of data to all of the processor subunits of dies 3506A and 3506B. For example, FIG. 35D shows dies 3506A and 3506B combined to form memory chip 3517, which includes memory banks 3511A-3511H, processor subunits 3515A-3515H, IO controllers 3521A and 3521B, and fuses 3554A and 3554B. Note that memory chip 3517 corresponds to a region 3517 of wafer 3501 before the memory chip is removed from the wafer. In other words, as used in this disclosure, regions 3504, 3505, 3517, and so on of wafer 3501, once cut from wafer 3501, become memory chips 3504, 3505, 3517, and so on. Additionally, fuses are also referred to herein as disabling elements. In an example embodiment, fuse 3554B may be used to deactivate IO controller 3521B, and IO controller 3521A may be used to control the flow of data to all memory banks 3511A-3511H by communicating data with processor subunits 3515A-3515H. In an example embodiment, IO controller 3521A may be connected to the various processor subunits using any suitable connection. In some embodiments, as further described below, processor subunits 3515A-3515H may be interconnected, and IO controller 3521A may be configured to control the flow of data to processor subunits 3515A-3515H, which form the processing logic of memory chip 3517.

In an example embodiment, IO controllers, such as controllers 3521A and 3521B, and their corresponding fuses 3554A and 3554B may be formed on wafer 3501 together with memory banks 3511A-3511H and processor subunits 3515A-3515H. In various embodiments, when memory chip 3517 is formed, one of the fuses (e.g., fuse 3554B) may be activated so that dies 3506A and 3506B form a memory chip 3517 that functions as a single chip and is controlled by a single input-output controller (e.g., controller 3521A). In an example embodiment, activating a fuse may include applying a current to trigger the fuse. In various embodiments, when more than one die is used to form a memory chip, all but one of the IO controllers may be disabled via their corresponding fuses.
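The rule that all but one IO controller is disabled can be modeled in a few lines. The class `IOController`, the function `form_multi_die_chip`, and the choice of which controller to keep are hypothetical; the fuse is modeled as a one-way flag, on the assumption that a triggered fuse cannot be restored.

```python
from dataclasses import dataclass


@dataclass
class IOController:
    name: str
    fuse_blown: bool = False  # a triggered fuse permanently disables this controller

    @property
    def enabled(self):
        return not self.fuse_blown


def form_multi_die_chip(controllers, keep_index=0):
    """Trigger the fuse of every controller except the one at `keep_index`."""
    if not 0 <= keep_index < len(controllers):
        raise IndexError("keep_index out of range")
    for index, controller in enumerate(controllers):
        if index != keep_index:
            controller.fuse_blown = True  # one-way: never reset to False
    return [c.name for c in controllers if c.enabled]


chip = [IOController("3521A"), IOController("3521B")]
active = form_multi_die_chip(chip)  # keep the first controller, disable the rest
print(active)
```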

In various embodiments, as shown in FIG. 35C, multiple dies are formed on wafer 3501 together with a set of input-output buses and/or control buses. An exemplary input-output bus 3530 is shown in FIG. 35C. In one exemplary embodiment, one of the input-output buses (e.g., input-output bus 3530) may be connected to multiple dies. FIG. 35C shows an exemplary embodiment in which input-output bus 3530 passes next to dies 3506A and 3506B. The configuration of dies 3506A and 3506B and input-output bus 3530 shown in FIG. 35C is merely illustrative, and various other configurations may be used. For example, FIG. 35E shows dies 3540 formed on wafer 3501 and arranged in a hexagonal formation. In one exemplary embodiment, memory chip 3532 containing four dies 3540 may be cut from wafer 3501. In one exemplary embodiment, memory chip 3532 may include a portion of input-output bus 3530 connected to the four dies by suitable bus lines (e.g., line 3533 shown in FIG. 35E). To route information to the appropriate memory unit of memory chip 3532, memory chip 3532 may include input/output controllers 3542A and 3542B placed at branch points of input-output bus 3530. Controllers 3542A and 3542B may receive command data via input-output bus 3530 and select the branch of bus 3530 over which to transmit information to the appropriate memory unit. For example, if the command data includes read/write information for memory units associated with die 3546, controller 3542A may receive the command request and transmit data to branch 3531A of bus 3530, as shown in FIG. 35D, while controller 3542B may receive the command request and transmit data to branch 3531B. In FIG. 35E, dashed lines represent cut lines, indicating that various cuts of different regions may be made.

In one exemplary embodiment, a group of dies and interconnect circuitry may be designed for inclusion in memory chip 3506, as shown in FIG. 36A. Such an embodiment may include processor subunits (for in-memory processing) that may be configured to communicate with one another. For example, each die to be included in memory chip 3506 may include various memory units, such as memory banks 3511A-3511D, processor subunits 3515A-3515D, and IO controllers 3521 and 3522. IO controllers 3521 and 3522 may be connected in parallel to input-output bus 3530. IO controller 3521 may have fuse 3554, and IO controller 3522 may have fuse 3555. In one exemplary embodiment, processor subunits 3515A-3515D may be connected, for example, via bus 3613. In some cases, one of the IO controllers may be disabled using the corresponding fuse. For example, IO controller 3522 may be disabled using fuse 3555, and IO controller 3521 may control data flow to memory banks 3511A-3511D via processor subunits 3515A-3515D, which are connected to each other via bus 3613.

The configuration of memory units shown in FIG. 36A is merely illustrative, and various other configurations may be formed by cutting different regions of wafer 3501. For example, FIG. 36B shows a configuration with three regions 3601-3603 that contain memory units and are connected to input-output bus 3530. In one exemplary embodiment, regions 3601-3603 are connected to input-output bus 3530 using IO control modules 3521-3523, which may be disabled by corresponding fuses 3554-3556. Another example of arranging regions containing memory units is shown in FIG. 36C, where three regions 3601, 3602, and 3603 are connected to input-output bus 3530 using bus lines 3611, 3612, and 3613. FIG. 36D shows another exemplary embodiment, in which memory chips 3506A-3506D are connected to input-output buses 3530A and 3530B via IO controllers 3521-3524. In one exemplary embodiment, the IO controllers may be disabled using corresponding fuses, as shown in FIG. 36D.

FIG. 37 shows various groups of dies 3503, such as group 3713 and group 3715, each of which may include one or more dies 3503. In one exemplary embodiment, in addition to dies 3503 formed on wafer 3501, wafer 3501 may also include logic circuits 3711, referred to as glue logic 3711. Glue logic 3711 occupies some space on wafer 3501, so fewer dies may be fabricated per wafer 3501 than would be possible without glue logic 3711. However, the presence of glue logic 3711 may allow multiple dies to be configured to function together as a single memory chip. For example, the glue logic may connect multiple dies without requiring configuration changes to the dies and without designating area inside any of the dies for circuitry that is used only to connect dies to each other. In various embodiments, glue logic 3711 provides an interface with other memory controllers, so that a multi-die memory chip functions as a single memory chip. Glue logic 3711 may be cut together with a group of dies, as shown, for example, by group 3713. Alternatively, if only one die is needed for a memory chip, as for group 3715, the glue logic may not be cut. For example, the glue logic may be selectively eliminated where it is not needed to enable cooperation between different dies. In FIG. 37, various cuts of different regions may be made as shown, for example, by the dashed-line regions. In various embodiments, as shown in FIG. 37, one glue logic element 3711 may be laid out on the wafer for every two dies 3506. In some cases, one glue logic element 3711 may be used for any suitable number of dies 3506 forming a group of dies. Glue logic 3711 may be configured to be connected to all the dies in a group of dies. In various embodiments, dies connected to glue logic 3711 may be configured to form a multi-die memory chip, and may be configured to form separate single-die memory chips when they are not connected to glue logic 3711. In various embodiments, dies that are connected to glue logic 3711 and designed to function together may be cut from wafer 3501 as a group that includes glue logic 3711, as indicated, for example, by group 3713. Dies not connected to glue logic 3711 may be cut from wafer 3501 without glue logic 3711, as indicated, for example, by group 3715, to form single-die memory chips.

In some embodiments, when manufacturing multi-die memory chips from wafer 3501, one or more cutting shapes (e.g., the shapes forming groups 3713 and 3715) may be determined for creating the desired set of multi-die memory chips. In some cases, as shown by group 3715, a cutting shape may exclude glue logic 3711.

In various embodiments, glue logic 3711 may be a controller for controlling multiple memory units of a multi-die memory chip. In some cases, glue logic 3711 may include parameters that can be modified by various other controllers. For example, a coupling circuit for a multi-die memory chip may include circuitry for configuring parameters of glue logic 3711 or parameters of memory controllers (e.g., processor subunits 3515A-3515D shown in FIG. 35B). Glue logic 3711 may be configured to perform a variety of tasks. For example, glue logic 3711 may be configured to determine which die is to be addressed. In some cases, glue logic 3711 may be used to synchronize multiple memory units. In various embodiments, glue logic 3711 may be configured to control various memory units such that the memory units operate as a single chip. In some cases, amplifiers may be added between an input-output bus (e.g., bus 3530 shown in FIG. 35C) and processor subunits 3515A-3515D to amplify data signals from bus 3530.
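One of the glue logic tasks named above, determining which die is to be addressed, can be sketched in Python as a simple address decode. The sketch assumes that all dies have equal capacity and that chip-level addresses are mapped contiguously, die after die. The disclosure does not specify the mapping, so the function `decode_address` and its parameters are hypothetical.

```python
def decode_address(address, num_dies, words_per_die):
    """Map a chip-level word address to (die_index, local_address).

    Assumes a contiguous mapping: die 0 holds the first `words_per_die`
    words, die 1 the next `words_per_die` words, and so on.
    """
    if num_dies < 1 or words_per_die < 1:
        raise ValueError("num_dies and words_per_die must be positive")
    if not 0 <= address < num_dies * words_per_die:
        raise ValueError(f"address {address} is outside the chip")
    # Quotient selects the die; remainder is the address inside that die.
    return divmod(address, words_per_die)


# A two-die chip with 1024 words per die presents one flat address space.
first_word_of_second_die = decode_address(1024, num_dies=2, words_per_die=1024)
```

Under this assumed mapping, a host sees a single address range even though the words reside on separate dies, which is one way a multi-die chip could present itself as a single chip.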

In various embodiments, cutting complex shapes from wafer 3501 may be technologically difficult or expensive, and a simpler cutting approach may be adopted provided that the dies are aligned on wafer 3501. FIG. 38A shows dies 3506 aligned to form a rectangular grid. In one exemplary embodiment, vertical cuts 3803 and horizontal cuts 3801 may be made across the entire wafer 3501 to separate groups of dies. In one exemplary embodiment, vertical and horizontal cuts 3803 and 3801 may produce groups containing a selected number of dies. For instance, cuts 3803 and 3801 may result in regions containing a single die (e.g., region 3811A), regions containing two dies (e.g., region 3811B), and regions containing four dies (e.g., region 3811C). The regions formed by cuts 3803 and 3801 are merely illustrative, and any other suitable regions may be formed. In various embodiments, various cuts may be made depending on the alignment of the dies. For example, if the dies are arranged in a triangular grid, as shown in FIG. 38B, cut lines such as lines 3802, 3804, and 3806 may be used to form multi-die memory chips. For example, some regions may include six dies, five dies, three dies, two dies, one die, or any other suitable number of dies.
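The effect of full-width cuts on a rectangular grid of dies can be worked through with a short Python sketch. It assumes an idealized grid of `rows` by `cols` dies in which every cut runs across the whole grid and lies on a boundary between adjacent rows or columns. The function names and the cut-position encoding are illustrative assumptions, not part of the disclosure.

```python
def segment_lengths(total, cuts):
    """Split `total` grid lines into segments at the given interior cut positions."""
    for c in cuts:
        if not 0 < c < total:
            raise ValueError(f"cut position {c} is not an interior boundary")
    bounds = [0] + sorted(set(cuts)) + [total]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def region_sizes(rows, cols, horizontal_cuts, vertical_cuts):
    """Return the number of dies in each region left by full-width cuts.

    A horizontal cut at position r separates row r-1 from row r; a vertical
    cut at position c separates column c-1 from column c.
    """
    heights = segment_lengths(rows, horizontal_cuts)
    widths = segment_lengths(cols, vertical_cuts)
    return [h * w for h in heights for w in widths]


# A 3x3 grid with one horizontal and one vertical cut yields regions of
# one, two, two, and four dies.
sizes = region_sizes(3, 3, horizontal_cuts=[1], vertical_cuts=[1])
```

With a single cut in each direction, the same grid yields single-die, two-die, and four-die regions at once, analogous to regions 3811A, 3811B, and 3811C.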

FIG. 38C shows bus lines 3530 arranged in a triangular grid, with dies 3503 aligned at the centers of the triangles formed by intersecting bus lines 3530. Dies 3503 may be connected via bus lines 3820 to all of the neighboring bus lines. By cutting a region containing two or more adjacent dies (e.g., region 3822 shown in FIG. 38C), at least one bus line (e.g., line 3824) remains within region 3822, and bus line 3824 may be used to supply data and commands to the multi-die memory chip formed from region 3822.

FIG. 39 shows that various connections may be formed between processor subunits 3515A-3515P to allow a group of memory units to function as a single memory chip. For example, group 3901 of various memory units may include connection 3905 between processor subunit 3515B and processor subunit 3515E. Connection 3905 may serve as a bus line for transmitting, from subunit 3515B to subunit 3515E, data and commands that may be used to control memory bank 3511E. In various embodiments, connections between processor subunits may be implemented during the formation of dies on wafer 3501. In some cases, additional connections may be fabricated during the packaging stage of a memory chip formed from several dies.

As shown in FIG. 39, processor subunits 3515A-3515P may be connected to each other using various buses (e.g., connection 3905). Connection 3905 may be free of timing hardware logic components, such that data transfers between processor subunits and across connection 3905 are not controlled by timing hardware logic components. In various embodiments, the buses connecting processor subunits 3515A-3515P may be laid out on wafer 3501 prior to fabricating the various circuits on wafer 3501.

In various embodiments, processor subunits (e.g., subunits 3515A-3515P) may be interconnected. For example, subunits 3515A-3515P may be connected by suitable buses (e.g., connection 3905). Connection 3905 may connect any one or more of subunits 3515A-3515P with any other one or more of subunits 3515A-3515P. In one exemplary embodiment, connected subunits may be on the same die (e.g., subunits 3515A and 3515B), and in other cases connected subunits may be on different dies (e.g., subunits 3515B and 3515E). Connection 3905 may include a dedicated bus for connecting subunits and may be configured to efficiently transmit data between subunits 3515A-3515P.

Various aspects of the present disclosure relate to methods for producing memory chips of selectable size from a wafer. In one exemplary embodiment, a memory chip of selectable size may be formed from one or more dies. As previously described, the dies may be arranged along one or more rows, as shown, for example, in FIG. 35C. In some cases, at least one shared input-output bus corresponding to the one or more rows may be laid out on wafer 3501. For example, bus 3530 may be laid out as shown in FIG. 35C. In various embodiments, bus 3530 may be electrically connected to memory units of at least two of the dies, and the connected dies may be used to form a multi-die memory chip. In one exemplary embodiment, one or more controllers (e.g., input-output controllers 3521 and 3522 shown in FIG. 35B) may be configured to control the memory units of the at least two dies used to form the multi-die memory chip. In various embodiments, the dies whose memory units are connected to bus 3530 may be cut from the wafer together with at least one corresponding portion of the shared input-output bus (e.g., bus 3530 shown in FIG. 35B) that transmits information to at least one controller (e.g., controllers 3521 and 3522), so that the controller can control the memory units of the connected dies to function together as a single chip.

In some cases, the memory units located on wafer 3501 may be tested before memory chips are manufactured by cutting regions of wafer 3501. The testing may be performed using at least one shared input-output bus (e.g., bus 3530 shown in FIG. 35C). A memory chip may be formed from a group of dies containing memory units when those memory units pass the testing. Memory units that fail the testing may be discarded and not used for manufacturing a memory chip.

FIG. 40 shows an exemplary process 4000 for building memory chips from a group of dies. At step 4011 of process 4000, dies may be laid out on semiconductor wafer 3501. At step 4015, the dies may be fabricated on wafer 3501 using any suitable approach. For example, the dies may be fabricated by etching wafer 3501, depositing various dielectric, metallic, or semiconductor layers, further etching the deposited layers, and so on. For example, multiple layers may be deposited and etched. In various embodiments, layers may be n-type doped or p-type doped using any suitable doping elements. For example, semiconductor layers may be n-type doped with phosphorus and p-type doped with boron. Dies 3503, as shown in FIG. 35A, may be separated from each other by space that may be used to cut dies 3503 from wafer 3501. For example, dies 3503 may be spaced apart from each other by spacing regions, where the width of the spacing regions may be selected to allow wafer cuts within them.

At step 4017, dies 3503 may be cut from wafer 3501 using any suitable approach. In one exemplary embodiment, dies 3503 may be cut using a laser. In one exemplary embodiment, wafer 3501 may first be scribed and then mechanically diced. Alternatively, a mechanical dicing saw may be used. In some cases, a stealth dicing process may be used. During dicing, wafer 3501 may be mounted on dicing tape that holds the dies once they are cut. In various embodiments, large cuts may be made, as shown by cuts 3801 and 3803 in FIG. 38A or by cuts 3802, 3804, and 3806 in FIG. 38B. Once dies 3503 are cut out individually or in groups, such as group 3504 in FIG. 35C, dies 3503 may be packaged. Packaging of the dies may include forming contacts to dies 3503, depositing protective layers over the contacts, attaching heat-management devices (e.g., heat sinks), and encapsulating dies 3503. In various embodiments, an appropriate configuration of contacts and buses may be used depending on the number of dies selected to form a memory chip. In one exemplary embodiment, some of the contacts between the dies forming a memory chip may be made during packaging of the memory chip.

FIG. 41A shows an exemplary process 4100 for manufacturing memory chips containing multiple dies. Step 4011 of process 4100 may be the same as step 4011 of process 4000. At step 4111, glue logic 3711, as shown in FIG. 37, may be laid out on wafer 3501. Glue logic 3711 may be any suitable logic for controlling the operation of dies 3506, as shown in FIG. 37. As described before, the presence of glue logic 3711 may allow multiple dies to function as a single memory chip. Glue logic 3711 may provide an interface with other memory controllers, such that a memory chip formed from multiple dies functions as a single memory chip.

At step 4113 of process 4100, buses (e.g., input-output buses and control buses) may be laid out on wafer 3501. The buses may be laid out such that they connect to the various dies and to logic circuits such as glue logic 3711. In some cases, the buses may connect memory units. For example, the buses may be configured to connect processor subunits of different dies. At step 4115, the dies, glue logic, and buses may be fabricated using any suitable approach. For example, logic elements may be fabricated by etching wafer 3501, depositing various dielectric, metallic, or semiconductor layers, further etching the deposited layers, and so on. The buses may be fabricated using, for example, metal deposition.

At step 4140, cutting shapes may be used to cut groups of dies connected to a single glue logic 3711, as shown, for example, in FIG. 37. The cutting shapes may be determined from the memory requirements for a memory chip containing multiple dies 3503. For instance, FIG. 41B shows process 4101, which may be a variant of process 4100, in which step 4140 may be preceded by steps 4117 and 4119. At step 4117, a system for cutting wafer 3501 may receive instructions describing the requirements for a memory chip. For example, the requirements may include forming a memory chip that includes four dies 3503. In some cases, at step 4119, program software may determine a periodic pattern for the groups of dies and glue logic 3711. For instance, the periodic pattern may include two glue logic elements 3711 and four dies 3503, with every two dies connected to one glue logic element 3711. Alternatively, at step 4119, the pattern may be provided by a designer of the memory chips.

In some cases, the pattern may be selected to maximize the yield of memory chips from wafer 3501. In one exemplary embodiment, the memory units of dies 3503 may be tested to identify dies with faulty memory units (referred to as faulty or failed dies), and, based on the locations of the faulty dies, groups of dies 3503 whose memory units pass the test may be identified and an appropriate cutting pattern determined. For example, if a large number of dies 3503 fail at the edges of wafer 3501, a cutting pattern may be determined that avoids the dies at the edges of wafer 3501. Other steps of process 4101, such as steps 4011, 4111, 4113, 4115, and 4140, may be the same as the identically numbered steps of process 4100.
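The idea of choosing groups of passing dies around faulty ones can be sketched for a single row of dies. The following Python example is a simplified illustration only. It assumes a one-dimensional row, a fixed number of dies per chip, and a greedy left-to-right scan; the disclosure does not prescribe any particular algorithm, and the name `plan_row` is hypothetical.

```python
def plan_row(passed, dies_per_chip):
    """Group adjacent passing dies in one row into chips of a fixed size.

    `passed[i]` is True when die i passed testing. Returns a list of
    (start, end) half-open index ranges, each covering `dies_per_chip`
    consecutive passing dies. Windows containing a faulty die are skipped.
    """
    if dies_per_chip < 1:
        raise ValueError("dies_per_chip must be positive")
    groups = []
    i = 0
    while i + dies_per_chip <= len(passed):
        if all(passed[i:i + dies_per_chip]):
            groups.append((i, i + dies_per_chip))
            i += dies_per_chip  # continue after the group just formed
        else:
            i += 1  # slide past the window that contains a faulty die
    return groups


# Row of eight dies; dies 2 and 7 failed testing.
row = [True, True, False, True, True, True, True, False]
two_die_chips = plan_row(row, dies_per_chip=2)
```

For a single row and a single chip size, scanning greedily from one end forms as many chips as each run of consecutive passing dies allows. A real wafer plan would also have to account for two-dimensional layouts, glue logic placement, and mixed chip sizes.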

FIG. 41C shows an exemplary process 4102, which may be a variant of process 4101. Steps 4011, 4111, 4113, 4115, and 4140 of process 4102 may be the same as the identically numbered steps of process 4101; step 4131 of process 4102 may replace step 4117 of process 4101, and step 4133 of process 4102 may replace step 4119 of process 4101. At step 4131, a system for cutting wafer 3501 may receive instructions describing the requirements for a first set of memory chips and a second set of memory chips. For example, the requirements may include forming the first set from memory chips containing four dies 3503 and forming the second set from memory chips containing two dies 3503. In some cases, more than two sets of memory chips may need to be formed from wafer 3501. For instance, a third set of memory chips may include memory chips containing only one die 3503. In some cases, at step 4133, program software may determine, for each set of memory chips, a periodic pattern of groups of dies and glue logic 3711 for forming the memory chips. For example, the first set of memory chips may include memory chips with a pattern of two glue logic elements 3711 and four dies 3503, with every two dies connected to one glue logic element 3711. In various embodiments, the glue logic elements 3711 of the same memory chip may be linked together to act as a single glue logic. For example, bus lines linking glue logic elements 3711 to each other may be formed during fabrication of glue logic 3711.

The second set of memory chips may include memory chips with a pattern of one glue logic element 3711 and two dies 3503, with dies 3503 connected to glue logic 3711. In some cases, when a third set of memory chips is selected and that set includes memory chips consisting of a single die 3503, no glue logic 3711 may be needed for those memory chips.
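The die and glue logic counts for the example sets (four-die, two-die, and single-die chips) can be tallied with a short Python sketch. It assumes the pattern described in the text, one glue logic element per two dies and none for a single-die chip. The function names and the order format (a mapping from dies per chip to number of chips) are hypothetical.

```python
def glue_elements_per_chip(dies_per_chip, dies_per_glue=2):
    """Glue logic elements needed by one chip, assuming one element per
    `dies_per_glue` dies and none for a single-die chip."""
    if dies_per_chip < 1 or dies_per_glue < 1:
        raise ValueError("counts must be positive")
    if dies_per_chip == 1:
        return 0  # a single die needs no glue logic to cooperate with others
    return -(-dies_per_chip // dies_per_glue)  # ceiling division


def wafer_budget(order):
    """Total dies and glue logic elements for an order.

    `order` maps dies-per-chip to the number of chips requested in that set.
    """
    total_dies = sum(size * count for size, count in order.items())
    total_glue = sum(glue_elements_per_chip(size) * count
                     for size, count in order.items())
    return total_dies, total_glue


# Three four-die chips, five two-die chips, and two single-die chips.
budget = wafer_budget({4: 3, 2: 5, 1: 2})
```

Such a tally gives a rough sense of how much wafer area a mix of chip sizes would consume, since each glue logic element occupies space that could otherwise hold dies.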

Dual-port functionality

When designing a memory chip or a memory instance within a chip, one important characteristic is the number of words that can be accessed simultaneously during a single clock cycle. The more addresses that can be accessed simultaneously for reading and/or writing (e.g., addresses along rows, also called words or wordlines, and along columns, also called bits or bitlines), the faster the memory chip. Efforts have been made to develop memories with multi-way ports that allow simultaneous access to multiple addresses, for example to build register files, caches, or shared memories, but most such instances use a memory mat that is larger in size and supports multiple address accesses. DRAM chips, however, typically include a single bitline and a single row line connected to each capacitor of each memory cell. Accordingly, embodiments of the present disclosure seek to provide multi-port access on existing DRAM chips without modifying this conventional single-port memory structure of DRAM arrays.

In embodiments of the present disclosure, a memory instance or chip may be clocked at twice the speed of the logic circuit that uses the memory. Any logic circuit that uses the memory may therefore 'correspond' to the memory and its components. Accordingly, embodiments of the present disclosure may read from or write to two addresses in two memory array clock cycles, which are equivalent to a single processing clock cycle for the logic circuit. The logic circuit may include circuitry such as a controller, an accelerator, a GPU, or a CPU, or may include a processing group on the same substrate as the memory chip, as shown in FIG. 7A. As explained above with reference to FIG. 3A, a 'processing group' may refer to two or more processor subunits and their corresponding memory banks on the substrate. A group may represent a spatial distribution on the substrate and/or a logical grouping for purposes of compiling code for execution on memory chip 2800. Accordingly, as described above with reference to FIG. 7A, a substrate with the memory chip may include a memory array with a plurality of banks, such as bank 2801a and the other banks shown in FIG. 28. The substrate may also include a processing array that may include a plurality of processor subunits (e.g., subunits 730a, 730b, 730c, 730d, 730e, 730f, 730g, and 730h shown in FIG. 7A).

Accordingly, in embodiments of the present disclosure, data may be read from the array in each of two consecutive memory cycles in order to handle two addresses for each logic cycle and provide the logic with two results, as though the single-port memory array were a two-port memory chip. With additional clocking, the memory chips of the present disclosure may allow a single-port array to function as a two-port memory instance, a three-port memory instance, a four-port memory instance, or any other multi-port memory instance.
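The time-multiplexing idea in the preceding paragraphs can be sketched as a toy Python model. This is a minimal behavioral sketch under stated assumptions, not a description of any circuit in the disclosure: the class `SinglePortArray`, the function `logic_cycle`, and the parameter `clock_ratio` (memory clock cycles per logic clock cycle) are hypothetical names introduced here for illustration.

```python
class SinglePortArray:
    """Toy single-port memory: exactly one address is served per memory clock cycle."""

    def __init__(self, rows, cols):
        self.cells = [[0] * cols for _ in range(rows)]
        self.memory_cycles = 0  # counts memory clock cycles consumed so far

    def access(self, row, col, write_value=None):
        """Perform one read (write_value is None) or one write in one memory cycle."""
        self.memory_cycles += 1
        if write_value is not None:
            self.cells[row][col] = write_value
        return self.cells[row][col]


def logic_cycle(array, requests, clock_ratio=2):
    """Serve up to clock_ratio requests within one logic clock cycle.

    Each request is a (row, col) read or a (row, col, value) write. The
    array is still accessed one address at a time; it only looks
    multi-ported to logic that runs clock_ratio times slower.
    """
    if len(requests) > clock_ratio:
        raise ValueError("more requests than memory cycles per logic cycle")
    return [array.access(*request) for request in requests]
```

With `clock_ratio=2` the array appears to the logic as a two-port memory; raising the ratio to 3 or 4 models the three-port and four-port cases mentioned above.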

FIG. 42 depicts an example circuit 4200 that provides dual-port access along columns of a memory chip in which circuit 4200 is used, consistent with the present disclosure. The embodiment depicted in FIG. 42 may use one memory array 4201 with two column multiplexers ('muxes') 4205a and 4205b to access two words on the same row during the same clock cycle for the logic circuit. For example, during a memory clock cycle, RowAddrA is used in row decoder 4203 and ColAddrA is used in multiplexer 4205a to buffer data from the memory cell with address (RowAddrA, ColAddrA). During the same memory clock cycle, ColAddrB is used in multiplexer 4205b to buffer data from the memory cell with address (RowAddrA, ColAddrB). Circuit 4200 may thus allow dual-port access to data (e.g., DataA and DataB) stored on memory cells at two different addresses along the same row or wordline. The two addresses may share a row, so that row decoder 4203 activates the same wordline for both reads. Moreover, embodiments like the example in FIG. 42 may use column multiplexers so that the two addresses can be accessed during the same memory clock cycle.
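The same-row access of FIG. 42 can be mimicked with a few lines of Python. The sketch below is an assumption-laden illustration: the function name `read_same_row` and the list-of-lists array are hypothetical, and the code models only which cells are selected, not any electrical behavior.

```python
def read_same_row(array, row_addr_a, col_addr_a, col_addr_b):
    """Model one memory clock cycle of a FIG. 42 style access.

    The row decoder activates a single wordline, and two column
    multiplexers each pick one word from that activated row.
    """
    active_row = array[row_addr_a]    # one wordline activated by the row decoder
    data_a = active_row[col_addr_a]   # word selected by the first column multiplexer
    data_b = active_row[col_addr_b]   # word selected by the second column multiplexer
    return data_a, data_b
```

Because only one wordline is activated, both words are available in the same memory clock cycle, which is why this arrangement does not by itself require a faster memory clock.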

Similarly, FIG. 43 depicts an example circuit 4300 that provides dual-port access along rows of a memory chip in which circuit 4300 is used, consistent with the present disclosure. The embodiment depicted in FIG. 43 may use one memory array 4301 with a row decoder 4303 coupled to a multiplexer ('mux') to access two words on the same column during the same clock cycle for the logic circuit. For example, in the first of two memory clock cycles, RowAddrA is used in row decoder 4303 and ColAddrA is used in column multiplexer 4305 to buffer data from the memory cell with address (RowAddrA, ColAddrA) (e.g., into the 'Buffered Word' buffer of FIG. 43). In the second of the two memory clock cycles, RowAddrB is used in row decoder 4303 and ColAddrA is used in column multiplexer 4305 to buffer data from the memory cell with address (RowAddrB, ColAddrA). Circuit 4300 may thus allow dual-port access to data (e.g., DataA and DataB) stored on memory cells at two different addresses along the same column or bitline. The two addresses may share a column, so that the column decoder (which may be separate from or merged with one or more column multiplexers, as shown in FIG. 43) activates the same bitline for both reads. Because row decoder 4303 may need one memory clock cycle to activate each wordline, embodiments like the example in FIG. 43 may use two memory clock cycles. Accordingly, a memory chip using circuit 4300 may function as a dual-port memory if it is clocked at least twice as fast as the corresponding logic circuit.
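The two-cycle, same-column access of FIG. 43 can be sketched in Python as follows. This is a behavioral illustration only: the function name `read_same_column` and the returned cycle count are hypothetical, and the local variable `buffered_word` merely stands in for the 'Buffered Word' buffer named in the figure.

```python
def read_same_column(array, row_addr_a, row_addr_b, col_addr_a):
    """Model a FIG. 43 style access spread over two memory clock cycles.

    Returns (data_a, data_b, memory_cycles_used).
    """
    memory_cycles_used = 0

    # Memory cycle 1: activate the wordline for RowAddrA and buffer the word.
    buffered_word = array[row_addr_a][col_addr_a]
    memory_cycles_used += 1

    # Memory cycle 2: activate the wordline for RowAddrB on the same column.
    data_b = array[row_addr_b][col_addr_a]
    memory_cycles_used += 1

    # Both words are handed to the logic together at the end of its cycle.
    return buffered_word, data_b, memory_cycles_used
```

The two memory cycles fit inside one logic cycle only when the memory is clocked at least twice as fast as the logic, which is the condition stated above.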

Accordingly, as explained above, the circuit of FIG. 43 may retrieve DataA and DataB over two memory clock cycles, each of which is shorter than the clock cycle of the corresponding logic circuit. For example, the row decoder (e.g., row decoder 4303 of FIG. 43) and the column decoder (which may be separate from or merged with one or more column multiplexers, as shown in FIG. 43) may be configured to be clocked at a rate at least twice that of the corresponding logic circuit that generates the two addresses. For example, a clock circuit for circuit 4300 (not shown in FIG. 43) may clock circuit 4300 at a rate at least twice that of the corresponding logic circuit that generates the two addresses.

The embodiments of FIGS. 42 and 43 may be used separately or together. Accordingly, a circuit that provides dual-port functionality on a single-port memory array or mat (e.g., circuit 4200 or circuit 4300) may include a plurality of memory banks arranged along at least one row and at least one column. The plurality of memory banks is shown as memory array 4201 in FIG. 42 and as memory array 4301 in FIG. 43. The embodiments may further use at least one row multiplexer (as shown in FIG. 43) or at least one column multiplexer (as shown in FIG. 42) configured to receive two addresses for reading or writing during a single clock cycle. The embodiments may also use a row decoder (e.g., row decoder 4203 of FIG. 42 or row decoder 4303 of FIG. 43) and a column decoder (which may be separate from or merged with one or more column multiplexers, as shown in FIGS. 42 and 43) to read from or write to the two addresses. For example, during a first cycle, the row decoder and the column decoder may retrieve the first of the two addresses from the at least one row multiplexer or the at least one column multiplexer and decode the wordline and bitline corresponding to the first address. During a second cycle, the row decoder and the column decoder may retrieve the second of the two addresses from the at least one row multiplexer or the at least one column multiplexer and decode the wordline and bitline corresponding to the second address. Each retrieval may be performed by activating the wordline corresponding to the address using the row decoder and activating, using the column decoder, the bitline on the activated wordline that corresponds to the address.

Although described above for reads, the embodiments of FIGS. 42 and 43, whether implemented separately or in combination, may also handle write commands. For example, during the first cycle, the row decoder and the column decoder may write first data, retrieved from the at least one row multiplexer or the at least one column multiplexer, to the first of the two addresses. During the second cycle, the row decoder and the column decoder may write second data, retrieved from the at least one row multiplexer or the at least one column multiplexer, to the second of the two addresses.

The example of FIG. 42 shows this process when the first and second addresses share a wordline address, and the example of FIG. 43 shows this process when the first and second addresses share a column address. As described below with reference to FIG. 47, the same process may be implemented when the first and second addresses share neither a wordline address nor a column address.

Accordingly, although the examples above provide dual-port access along at least one of rows or columns, additional embodiments may provide dual-port access along both rows and columns. FIG. 44 depicts an example circuit 4400 that provides dual-port access along both rows and columns of a memory chip in which circuit 4400 is used, consistent with the present disclosure. Circuit 4400 may thus represent a combination of circuit 4200 of FIG. 42 and circuit 4300 of FIG. 43.

The embodiment depicted in FIG. 44 may use one memory array 4401 with a row decoder 4403 coupled to a multiplexer ("mux") to access two rows during the same clock cycle for the logic circuit. The embodiment depicted in FIG. 44 may also use memory array 4401 with a column decoder (or multiplexer) 4405 coupled to a multiplexer ("mux") to access two columns during the same clock cycle. For example, in the first of two memory clock cycles, RowAddrA is used in row decoder 4403 and ColAddrA is used in column multiplexer 4405 to buffer data from the memory cell with address (RowAddrA, ColAddrA) (e.g., into the 'Buffered Word' buffer of FIG. 44). In the second of the two memory clock cycles, RowAddrB is used in row decoder 4403 and ColAddrA is used in column multiplexer 4405 to buffer data from the memory cell with address (RowAddrB, ColAddrA). Circuit 4400 may thus allow dual-port access to data (e.g., DataA and DataB) stored on memory cells at two different addresses. Because row decoder 4403 may need one memory clock cycle to activate each wordline, embodiments like the example in FIG. 44 may use an additional buffer. Accordingly, a memory chip using circuit 4400 may function as a dual-port memory if it is clocked at least twice as fast as the corresponding logic circuit.

Although not depicted in FIG. 44, circuit 4400 may further include the additional circuitry of FIG. 46 (described below) along the rows or wordlines and/or similar additional circuitry along the columns or bitlines. Accordingly, circuit 4400 may activate the corresponding circuitry (e.g., by opening one or more switching elements, such as one or more of switching elements 4613a, 4613b, and the like of FIG. 46) to activate the disconnected portion that includes an address (e.g., by connecting a voltage to, or allowing a current to flow through, the disconnected portion). Circuitry may 'correspond' in this sense when an element of the circuitry (e.g., a line) includes the location identified by the address and/or when an element of the circuitry (e.g., a switching element) controls the supply of voltage and/or the flow of current to the memory cell identified by the address. Circuit 4400 may then use row decoder 4403 and column multiplexer 4405 to decode the corresponding wordline and bitline and to read data from, or write data to, the address located in the activated disconnected portion.

As further depicted in FIG. 44, circuit 4400 may further use at least one row multiplexer (shown as separate from row decoder 4403 but possibly incorporated therein) and/or at least one column multiplexer (shown as separate from column multiplexer 4405 but possibly incorporated therein) configured to retrieve two addresses for reading or writing during a single clock cycle. The embodiments may accordingly use a row decoder (e.g., row decoder 4403) and a column decoder (which may be separate from or merged with column multiplexer 4405) to read from or write to the two addresses. For example, during a memory clock cycle, the row decoder and the column decoder may retrieve the first of the two addresses from the at least one row multiplexer or the at least one column multiplexer and decode the wordline and bitline corresponding to the first address. During the same memory cycle, the row decoder and the column decoder may retrieve the second of the two addresses from the at least one row multiplexer or the at least one column multiplexer and decode the wordline and bitline corresponding to the second address.

FIGS. 45A and 45B depict existing duplication techniques for providing dual-port functionality on a single-port memory array or mat. As shown in FIG. 45A, dual-port reads may be provided by keeping duplicate copies of the data synchronized across memory arrays or mats. Reads may thus be performed from both copies of the memory instance, as shown in FIG. 45A. As shown in FIG. 45B, dual-port writes may be provided by duplicating every write across the memory arrays or mats. For example, the memory chip may require the logic circuits that use it to send duplicate write commands, one for each duplicate copy of the data. Alternatively, in some embodiments, as shown in FIG. 45A, additional circuitry may allow the logic circuit using the memory instance to send a single write command, which the additional circuitry automatically duplicates to generate duplicate copies of the written data across the memory arrays or mats and so keep the copies synchronized. The embodiments of FIGS. 42, 43, and 44 may reduce the redundancy of these existing duplication techniques by using multiplexers to access two bitlines in a single memory clock cycle (e.g., as shown in FIG. 42) and/or by clocking the memory faster than the corresponding logic circuit (e.g., as shown in FIG. 43) and providing additional multiplexers to handle the additional addresses, rather than duplicating all data in the memory.

In addition to the faster clocking and/or additional multiplexers described above, embodiments of the present disclosure may use circuitry that disconnects the bitlines and/or wordlines at certain points within the memory array. Such embodiments allow multiple simultaneous accesses to the array as long as the row decoder and column decoder access different locations that are not coupled to the same portion of the disconnection circuitry. For example, the disconnection circuitry may allow the row decoder and column decoder to access different addresses without electrical interference, so that locations with different wordlines and bitlines can be accessed simultaneously. The granularity of the disconnected regions within the memory may be weighed against the additional area required by the disconnection circuitry during design of the memory chip.

An architecture for implementing such simultaneous access is depicted in FIG. 46. Specifically, FIG. 46 depicts an example circuit 4600 that provides dual-port functionality on a single-port memory array or mat. As shown in FIG. 46, circuit 4600 may include a plurality of memory mats (e.g., 4609a, 4609b, and the like) arranged along at least one row and at least one column. The layout of circuit 4600 may further include a plurality of wordlines, such as wordlines 4611a and 4611b corresponding to rows, and bitlines 4615a and 4615b corresponding to columns.

The example of FIG. 46 includes twelve memory mats, each with two lines and eight columns. In other embodiments, the substrate may include any number of memory mats, and each memory mat may include any number of lines and any number of columns. Some memory mats may include the same number of lines and columns (as in the example of FIG. 46), while other memory mats may include different numbers of lines and/or columns.

Although not depicted in FIG. 46, circuit 4600 may further use at least one row multiplexer (separate from or integral with row decoders 4601a and/or 4601b) or at least one column multiplexer (e.g., column multiplexers 4603a and/or 4603b) configured to receive two (or three, or any plurality of) addresses for reading or writing during a single clock cycle. The embodiments may also use a row decoder (e.g., row decoders 4601a and/or 4601b) and a column decoder (which may be separate from or integral with column multiplexers 4603a and/or 4603b) to read from or write to the two (or more) addresses. For example, during a memory clock cycle, the row decoder and the column decoder may retrieve the first of the two addresses from the at least one row multiplexer or the at least one column multiplexer and decode the wordline or bitline corresponding to the first address. During the same memory cycle, the row decoder and the column decoder may retrieve the second of the two addresses from the at least one row multiplexer or the at least one column multiplexer and decode the wordline or bitline corresponding to the second address. As explained above, the accesses may occur during the same memory clock cycle as long as the two addresses are at different locations that are not coupled to the same portion of the disconnection circuitry (e.g., switching elements such as 4613a, 4613b, and the like). Furthermore, circuit 4600 may access a first pair of addresses simultaneously during a first memory clock cycle and then access a second pair of addresses simultaneously during a second memory clock cycle. In such embodiments, a memory chip using circuit 4600 may function as a four-port memory if it is clocked at least twice as fast as the corresponding logic circuit.
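The four-port figure follows from simple arithmetic: two simultaneous accesses per memory clock cycle, times two memory clock cycles per logic clock cycle. The Python sketch below spells this out. The function names and parameters are hypothetical, and the grouping helper deliberately ignores the requirement that simultaneous addresses fall in different disconnected portions.

```python
def ports_per_logic_cycle(accesses_per_memory_cycle, clock_ratio):
    """Apparent port count seen by logic clocked clock_ratio times slower than the memory."""
    return accesses_per_memory_cycle * clock_ratio


def split_into_memory_cycles(addresses, accesses_per_memory_cycle=2):
    """Group addresses into batches, one batch per memory clock cycle.

    Simplification: this does not check that the addresses in a batch
    avoid sharing a portion of the disconnection circuitry.
    """
    n = accesses_per_memory_cycle
    return [addresses[i:i + n] for i in range(0, len(addresses), n)]
```

For example, `ports_per_logic_cycle(2, 2)` gives 4, and four addresses issued in one logic cycle are split into two batches of two, one batch per memory clock cycle.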

FIG. 46 further includes at least one row circuit and at least one column circuit configured to function as switches. For example, the corresponding switching elements, such as 4613a, 4613b, and the like, may include transistors or any other electrical elements configured to allow or block the flow of current, and/or to connect or disconnect a voltage, from the wordlines or bitlines connected to those switching elements. The corresponding switching elements may thus divide circuit 4600 into disconnected portions. Although depicted as comprising a single row and sixteen columns of each row, the disconnected portions within circuit 4600 may have different levels of granularity depending on the design of circuit 4600.

Circuit 4600 may use a controller (e.g., row control 4607) to activate the corresponding ones of the at least one row circuit and the at least one column circuit in order to activate the corresponding disconnected regions during the address operations described above. For example, circuit 4600 may send one or more control signals to the corresponding ones of the switching elements (e.g., 4613a, 4613b, and the like). In embodiments where switching elements 4613a, 4613b, and the like include transistors, the control signals may include voltages to open the transistors.

Depending on the disconnected region that contains an address, one or more of the switching elements may be activated by circuit 4600. For example, to reach an address within memory mat 4609b of FIG. 46, the switching element allowing access to memory mat 4609a must be opened along with the switching element allowing access to memory mat 4609b. Row control 4607 may determine which switching elements to activate, according to the particular address, in order to retrieve that address within circuit 4600.
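The switch-selection rule in this example can be expressed as a small Python sketch. It rests on assumptions that go beyond the text: mats along one wordline are numbered from 0, each mat has one switching element that grants access to it, and row decoders sit at both ends of the wordline so that one access can arrive from the left and another from the right. The function names `switches_for` and `can_share_cycle` are hypothetical.

```python
def switches_for(mat_index, num_mats, from_left=True):
    """Return the set of switching elements (by mat index) to activate to reach a mat.

    Reaching a mat requires activating its own switching element and
    every switching element between it and the driving decoder.
    """
    if not 0 <= mat_index < num_mats:
        raise IndexError("mat index out of range")
    if from_left:
        return set(range(0, mat_index + 1))
    return set(range(mat_index, num_mats))


def can_share_cycle(left_target, right_target, num_mats):
    """Check whether two accesses, one driven from each end, use disjoint switches."""
    left = switches_for(left_target, num_mats, from_left=True)
    right = switches_for(right_target, num_mats, from_left=False)
    return left.isdisjoint(right)
```

In this model, reaching the second mat from the left activates the switching elements for the first and second mats, mirroring the 4609a and 4609b example, and two accesses can share a memory clock cycle only if their activated wordline segments do not overlap.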

FIG. 46 represents an example of circuit 4600 used to divide the wordlines of a memory array (e.g., one comprising memory mats 4609a, 4609b, and the like). Other embodiments, however, may use similar circuitry (e.g., switching elements dividing memory chip 4600 into disconnected regions) to divide the bitlines of the memory array. Accordingly, the architecture of circuit 4600 may be used for dual-column access like that depicted in FIG. 42 or FIG. 44 as well as for dual-row access like that depicted in FIG. 43 or FIG. 44.

A process for multi-cycle access to memory arrays or mats is depicted in FIG. 47A. Specifically, FIG. 47A is an example flowchart of a process 4700 for providing dual-port access on a single-port memory array or mat (e.g., using circuit 4300 of FIG. 43 or circuit 4400 of FIG. 44). Process 4700 may be executed using row decoders and column decoders consistent with the present disclosure, such as row decoder 4303 of FIG. 43 or row decoder 4403 of FIG. 44 and a column decoder (which may be separate from or integral with one or more column multiplexers, such as column multiplexer 4305 of FIG. 43 or column multiplexer 4405 of FIG. 44).

At step 4710, during a first memory clock cycle, the circuit may use at least one row multiplexer and at least one column multiplexer to decode a wordline and a bitline corresponding to a first of two addresses. For example, the at least one row decoder may activate the wordline, and the at least one column multiplexer may amplify a voltage from a memory cell that lies along the activated wordline and corresponds to the first address. The amplified voltage may be provided to a logic circuit that uses the memory chip including the circuit, or may be buffered according to step 4720 described below. The logic circuit may include circuitry such as a GPU or a CPU, or may include a processing group on the same substrate as the memory chip, e.g., as shown in FIG. 7A.

Although described above as a read operation, process 4700 may similarly handle a write operation. For example, the at least one row decoder may activate the wordline, and the at least one column multiplexer may apply a voltage to the memory cell that lies along the activated wordline and corresponds to the first address, thereby writing new data to the memory cell. In some embodiments, the circuit may provide confirmation of the write to the logic circuit that uses the memory chip including the circuit, or may buffer the confirmation according to step 4720 below.

At step 4720, the circuit may buffer the data retrieved from the first address. For example, as shown in FIG. 43 and FIG. 44, the buffer allows the circuit to retrieve the second of the two addresses (as described in step 4730 below) and to output the results of both retrievals together. The buffer may include a register, an SRAM, a nonvolatile memory, or any other data storage device.

At step 4730, during a second memory clock cycle, the circuit may use the at least one row multiplexer and the at least one column multiplexer to decode a wordline and a bitline corresponding to the second of the two addresses. For example, the at least one row decoder may activate the wordline, and the at least one column multiplexer may amplify a voltage from a memory cell that lies along the activated wordline and corresponds to the second address. The amplified voltage may be provided to the logic circuit that uses the memory chip including the circuit, either individually or together with, e.g., the voltage buffered at step 4720. The logic circuit may include circuitry such as a GPU or a CPU, or may include a processing group on the same substrate as the memory chip, e.g., as shown in FIG. 7A.

Although described above as a read operation, process 4700 may similarly handle a write operation. For example, the at least one row decoder may activate the wordline, and the at least one column multiplexer may apply a voltage to the memory cell that lies along the activated wordline and corresponds to the second address, thereby writing new data to the memory cell. In some embodiments, the circuit may provide confirmation of the write to the logic circuit that uses the memory chip including the circuit, either individually or together with, e.g., the voltage buffered at step 4720.

At step 4740, the circuit may output the data retrieved from the second address together with the buffered data from the first address. For example, as shown in FIG. 43 and FIG. 44, the circuit may output the results of both retrievals (e.g., from step 4710 and step 4730) together. The circuit may output the results to the logic circuit that uses the memory chip including the circuit. The logic circuit may include circuitry such as a GPU or a CPU, or may include a processing group on the same substrate as the memory chip, e.g., as shown in FIG. 7A.

Although described above with reference to multiple cycles, process 4700 may also allow single-cycle access to the two addresses when the two addresses share a wordline, as shown in FIG. 42. For example, because multiple column multiplexers may decode different bitlines on the same wordline during the same memory clock cycle, steps 4710 and 4730 may be performed during the same memory clock cycle. In such embodiments, the buffering of step 4720 may be skipped.
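The read flow of steps 4710 through 4740, including the single-cycle case for a shared wordline, may be sketched as a behavioral model. This is only a sketch under stated assumptions: the mat is modeled as a nested Python list, an address is a hypothetical (wordline, bitline) pair, and the name `dual_port_read` is illustrative rather than taken from the disclosure.

```python
from typing import List, Tuple

Address = Tuple[int, int]  # hypothetical (wordline index, bitline index)


def dual_port_read(
    mat: List[List[int]], first: Address, second: Address
) -> Tuple[Tuple[int, int], int]:
    """Return ((first_value, second_value), memory_clock_cycles_used)."""
    row1, col1 = first
    row2, col2 = second
    if row1 == row2:
        # Shared wordline: both bitlines are decoded in the same cycle,
        # so the buffering of step 4720 is skipped.
        active_row = mat[row1]
        return (active_row[col1], active_row[col2]), 1
    buffered = mat[row1][col1]      # steps 4710 and 4720: read, then buffer
    second_value = mat[row2][col2]  # step 4730: read the second address
    return (buffered, second_value), 2  # step 4740: output both together


mat = [[10, 11], [20, 21]]
print(dual_port_read(mat, (0, 0), (1, 1)))  # different wordlines
print(dual_port_read(mat, (0, 0), (0, 1)))  # shared wordline
```

The model returns the cycle count alongside the data so that the difference between the two-cycle and single-cycle cases is visible to the caller.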

A process for concurrent access (e.g., using circuit 4600 described above) is shown in FIG. 47B. Accordingly, although the steps are shown sequentially, the steps of FIG. 47B may be performed during the same memory clock cycle, and at least some of the steps (e.g., steps 4760 and 4780, or steps 4770 and 4790) may be executed concurrently. Specifically, FIG. 47B is an exemplary flowchart of a process 4750 for providing dual-port access on a single-port memory array or mat (e.g., using circuit 4200 of FIG. 42 or circuit 4600 of FIG. 46). Process 4750 may be executed using row decoders and column decoders consistent with the present disclosure, such as row decoder 4203 or 4601a and 4601b of FIG. 42 or FIG. 46, respectively, and a column decoder (which may be separate from or integrated with one or more column multiplexers, such as column multiplexers 4205a and 4205b or 4603a and 4603b of FIG. 42 or FIG. 46, respectively).

At step 4760, during a memory clock cycle, the circuit may activate corresponding ones of at least one row circuit and at least one column circuit based on a first of two addresses. For example, the circuit may transmit one or more control signals to the corresponding switching elements comprising the at least one row circuit and the at least one column circuit. The circuit may thereby access the corresponding disconnected region that contains the first of the two addresses.

At step 4770, during the memory clock cycle, the circuit may use at least one row multiplexer and at least one column multiplexer to decode a wordline and a bitline corresponding to the first address. For example, the at least one row decoder may activate the wordline, and the at least one column multiplexer may amplify a voltage from a memory cell that lies along the activated wordline and corresponds to the first address. The amplified voltage may be provided to a logic circuit that uses the memory chip including the circuit. As described above, the logic circuit may include circuitry such as a GPU or a CPU, or may include a processing group on the same substrate as the memory chip, e.g., as shown in FIG. 7A.

Although described above as a read operation, process 4750 may similarly handle a write operation. For example, the at least one row decoder may activate the wordline, and the at least one column multiplexer may apply a voltage to the memory cell that lies along the activated wordline and corresponds to the first address, thereby writing new data to the memory cell. In some embodiments, the circuit may provide confirmation of the write to the logic circuit that uses the memory chip including the circuit.

At step 4780, during the same cycle, the circuit may activate corresponding ones of the at least one row circuit and the at least one column circuit based on the second of the two addresses. For example, the circuit may transmit one or more control signals to the corresponding switching elements comprising the at least one row circuit and the at least one column circuit. The circuit may thereby access the corresponding disconnected region that contains the second of the two addresses.

At step 4790, during the same cycle, the circuit may use the at least one row multiplexer and the at least one column multiplexer to decode a wordline and a bitline corresponding to the second address. For example, the at least one row decoder may activate the wordline, and the at least one column multiplexer may amplify a voltage from a memory cell that lies along the activated wordline and corresponds to the second address. The amplified voltage may be provided to the logic circuit that uses the memory chip including the circuit. As described above, the logic circuit may include conventional circuitry such as a GPU or a CPU, or may include a processing group on the same substrate as the memory chip, as shown in FIG. 7A.

Although described above as a read operation, process 4750 may similarly handle a write operation. For example, the at least one row decoder may activate the wordline, and the at least one column multiplexer may apply a voltage to the memory cell that lies along the activated wordline and corresponds to the second address, thereby writing new data to the memory cell. In some embodiments, the circuit may provide confirmation of the write to the logic circuit that uses the memory chip including the circuit.

Although described above with reference to a single cycle, process 4750 may allow multi-cycle access to the two addresses when the two addresses lie in disconnected regions that share wordlines and bitlines (or that share switching elements in the at least one row circuit and the at least one column circuit). For example, steps 4760 and 4770 may be performed during a first memory clock cycle, in which a first row decoder and a first column multiplexer may decode the wordline and bitline corresponding to the first address, and steps 4780 and 4790 may be performed during a second memory clock cycle, in which a second row decoder and a second column multiplexer may decode the wordline and bitline corresponding to the second address.
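The choice between concurrent and multi-cycle access in process 4750 may be illustrated with a small scheduling sketch. As a simplifying assumption not drawn from the disclosure, the array is tiled into fixed-size rectangular regions, and two addresses are treated as sharing wordlines, bitlines, or switching elements exactly when they fall in the same region. The names `region_of` and `schedule_accesses` and the region sizes are hypothetical.

```python
from typing import List, Tuple

Address = Tuple[int, int]  # hypothetical (row index, column index)


def region_of(address: Address, rows_per_region: int = 4,
              cols_per_region: int = 4) -> Tuple[int, int]:
    """Map an address to the (assumed rectangular) region containing it."""
    row, col = address
    return (row // rows_per_region, col // cols_per_region)


def schedule_accesses(first: Address, second: Address,
                      rows_per_region: int = 4,
                      cols_per_region: int = 4) -> List[List[Address]]:
    """Return one list of addresses per memory clock cycle."""
    same_region = (region_of(first, rows_per_region, cols_per_region)
                   == region_of(second, rows_per_region, cols_per_region))
    if same_region:
        # Shared resources: steps 4760/4770 run in a first cycle and
        # steps 4780/4790 in a second cycle.
        return [[first], [second]]
    # Separate regions: all four steps run within the same cycle.
    return [[first, second]]


print(schedule_accesses((0, 0), (5, 5)))  # different regions
print(schedule_accesses((0, 0), (1, 1)))  # same region
```

A real implementation would derive the conflict condition from the actual arrangement of switching elements rather than from a fixed tiling.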

Another example of an architecture for dual-port access along both rows and columns is shown in FIG. 48. Specifically, FIG. 48 depicts an exemplary circuit 4800 that uses multiple row decoders in combination with multiple column multiplexers to provide dual-port access along both rows and columns. In FIG. 48, row decoder 4801a may access a first wordline, and column multiplexer 4803a may decode data from one or more memory cells along the first wordline, while row decoder 4801b may access a second wordline, and column multiplexer 4803b may decode data from one or more memory cells along the second wordline.

As described with reference to FIG. 47B, this access may be concurrent during a single memory clock cycle. Accordingly, similar to the architecture of FIG. 46, the architecture of FIG. 48 (including the memory mats described with reference to FIG. 49 below) may allow multiple addresses to be accessed in the same clock cycle. For example, the architecture of FIG. 48 may include any number of row decoders and any number of column multiplexers, such that a number of addresses corresponding to the number of row decoders and column multiplexers may all be accessed within a single memory clock cycle.
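The relationship between the number of decoder and multiplexer pairs and the number of addresses served per cycle can be expressed as simple arithmetic. The sketch below goes one step beyond the text by assuming that requests in excess of the available pairs simply spill into further cycles and that no two addresses conflict. The function name `cycles_for` is hypothetical.

```python
def cycles_for(num_addresses: int, num_port_pairs: int) -> int:
    """Memory clock cycles needed if each pair serves one address per cycle.

    Assumption (illustrative only): addresses never conflict, and excess
    requests are deferred to later cycles.
    """
    if num_addresses < 0 or num_port_pairs <= 0:
        raise ValueError("need num_addresses >= 0 and num_port_pairs > 0")
    # Ceiling division without floating point.
    return -(-num_addresses // num_port_pairs)


print(cycles_for(4, 4))  # as many pairs as addresses
print(cycles_for(4, 2))  # fewer pairs than addresses
```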

In other embodiments, this access may be sequential across two memory clock cycles. By clocking memory chip 4800 faster than the corresponding logic circuit, two memory clock cycles may be equivalent to one clock cycle of the logic circuit that uses the memory. As described above, the logic circuit may include conventional circuitry such as a GPU or a CPU, or may include a processing group on the same substrate as the memory chip, as shown in FIG. 7A.
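The clock relationship described above reduces to a ratio of frequencies. In the following sketch the frequencies are hypothetical example values, not figures from the disclosure, and the function name is illustrative.

```python
from fractions import Fraction


def memory_cycles_per_logic_cycle(memory_clock_hz: int,
                                  logic_clock_hz: int) -> Fraction:
    """Number of memory clock cycles that fit in one logic clock cycle."""
    if memory_clock_hz <= 0 or logic_clock_hz <= 0:
        raise ValueError("clock frequencies must be positive")
    return Fraction(memory_clock_hz, logic_clock_hz)


# Hypothetical example: a memory clocked at twice the rate of the logic.
ratio = memory_cycles_per_logic_cycle(2_000_000_000, 1_000_000_000)
print(ratio)
```

When the ratio is at least two, a two-cycle sequential access completes within a single clock cycle of the logic circuit and so appears to it as a single access.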

Other embodiments may allow concurrent access. For example, as described with reference to FIG. 42, multiple column decoders (which may include column multiplexers such as 4803a and 4803b shown in FIG. 48) may read multiple bitlines along the same wordline during a single memory clock cycle. Additionally or alternatively, as described with reference to FIG. 46, circuit 4800 may include additional circuitry so that this access may be concurrent. For example, row decoder 4801a may access a first wordline, and column multiplexer 4803a may decode data from memory cells along the first wordline, during the same memory clock cycle in which row decoder 4801b accesses a second wordline and column multiplexer 4803b decodes data from memory cells along the second wordline.

The architecture of FIG. 48 may be used with modified memory mats forming the memory banks, as shown in FIG. 49. In FIG. 49, each memory cell (depicted as a capacitor, similar to DRAM, but which may also include a number of transistors arranged in a manner similar to SRAM or any other memory cell) is accessed by two wordlines and two bitlines. Accordingly, memory mat 4900 of FIG. 49 allows two different bits to be accessed concurrently, or even allows the same bit to be accessed by two different logic circuits. However, the embodiment of FIG. 49 relies on a modification to the memory mats rather than implementing a dual-port solution on standard DRAM memory mats, which are wired for single-port access, as in the embodiments above.

Although described above with two ports, any of the embodiments described above may be extended to more than two ports. For example, the embodiments of FIGS. 42, 46, 48, and 49 may each include additional column or row multiplexers to provide access to additional columns or rows, respectively, during a single clock cycle. As another example, the embodiments of FIGS. 43 and 44 may include additional row decoders and/or column multiplexers to provide access to additional rows or columns, respectively, during a single clock cycle.

Variable Word Length Access in Memory

As used above and below, the term "coupled" may encompass a direct connection, an indirect connection, an electrical connection, and the like.

Furthermore, terms such as "first" and "second" are used only to distinguish between elements or method steps having the same or similar names or titles, and do not necessarily indicate a spatial or temporal order.

Generally, a memory chip includes memory banks. A memory bank may be coupled to a row decoder and a column decoder configured to select a specific word (or other fixed-size data unit) to be read or written. Each memory bank may include memory cells for storing the data units, sense amplifiers for amplifying voltages from the memory cells selected by the row decoder and the column decoder, and any other appropriate circuits.

Each memory bank usually has a specific I/O width. For example, the I/O width may comprise a word.

Some processes executed by logic circuits that use the memory chip may benefit from very long words, while other processes may require only part of a word.

Indeed, in-memory computing units (e.g., processor subunits disposed on the same substrate as the memory chip, as shown and described in FIG. 7A) frequently perform memory access operations that require only part of a word.

To reduce the latency associated with accessing an entire word when only part of the word is used, embodiments of the present disclosure provide methods and systems for fetching only one or more parts of a word, thereby reducing the data losses associated with transferring unnecessary parts of the word and reducing the power consumption of the memory device.

Furthermore, embodiments of the present disclosure may also reduce the power consumption of the interaction between the memory chip and other entities that access the memory chip and read or write only part of a word (e.g., logic circuits, whether separate, such as a CPU or GPU, or included on the same substrate as the memory chip, such as the processor subunits shown and described in FIG. 7A).

A memory access command (e.g., from a logic circuit that uses the memory) may include an address in the memory. For example, the address may include a row address and a column address, or may be translated into a row address and a column address, e.g., by a memory controller of the memory.

In many volatile memories, such as DRAM, the row address is sent (e.g., directly by the logic circuit or by way of the memory controller) to the row decoder, which activates the entire row (also called a wordline) and loads all of the bitlines included in that row.

The column address identifies the bitline(s) on the activated row that are transferred outside the memory bank containing the bitlines and on to next-level circuitry. For example, the next-level circuitry may include an I/O bus of the memory chip. In embodiments using in-memory processing, the next-level circuitry may include a processor subunit of the memory chip (e.g., as shown in FIG. 7A).
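The conventional access sequence just described, in which an entire row is activated and the column address then selects which bits leave the bank, may be modeled as follows. The sketch assumes a bank represented as a list of rows of bits and a hypothetical fixed word width. The name `read_word` and its parameters are illustrative only.

```python
from typing import List


def read_word(bank: List[List[int]], row_address: int,
              column_address: int, word_bits: int = 8) -> List[int]:
    """Activate a whole row, then select one word from it."""
    # Row activation: every bitline of the selected wordline is loaded,
    # regardless of how many bits are ultimately requested.
    row_buffer = list(bank[row_address])
    # Column selection: only this slice is passed to next-level circuitry.
    start = column_address * word_bits
    return row_buffer[start:start + word_bits]


bank = [[0] * 16, [1] * 8 + [0] * 8]  # two rows of sixteen bits each
print(read_word(bank, 1, 0))
print(read_word(bank, 1, 1))
```

The model makes the cost visible: the full row is copied into `row_buffer` even though only `word_bits` bits are returned, which is the overhead that partial-word access aims to reduce.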

Accordingly, the memory chip described below may be included in, or may otherwise comprise, a memory chip as illustrated in one or more of FIGS. 3A, 3B, 4 to 6, 7A to 7D, 11 to 13, 16 to 19, 22, and 23.

The memory chip may be manufactured by a first manufacturing process optimized for memory cells rather than logic cells. For example, a memory cell manufactured by the first manufacturing process may have a critical dimension that is smaller (e.g., by a factor of 2, 3, 4, 5, 6, 7, 8, 9, 10, or the like) than the critical dimension of a logic circuit manufactured by the first manufacturing process. For example, the first manufacturing process may include an analog manufacturing process, a DRAM manufacturing process, or the like.

Such a memory chip may include an integrated circuit, which may include a memory unit. The memory unit may include memory cells, an output port, and read circuitry. In some embodiments, the memory unit may further include a processing unit, such as the processor subunit described above.

For example, the read circuitry may include a reduction unit and a first group of memory read paths for outputting up to a first number of bits through the output port. The output port may be connected to an off-chip logic circuit (e.g., an accelerator, a CPU, a GPU, or the like) or to an on-chip processor subunit as described above.

In some embodiments, the processing unit may include the reduction unit, may be part of the reduction unit, or may be distinct from the reduction unit.

An in-memory read path may be included in the integrated circuit (e.g., in the memory unit) and may include any circuits and/or links configured to read from and/or write to a memory cell. For example, an in-memory read path may include a sense amplifier, a conductor coupled to the memory cell, a multiplexer, and the like.

The processing unit may be configured to send to the memory unit a read request for reading a second number of bits from the memory unit. Additionally or alternatively, the read request may originate from an off-chip logic circuit (e.g., an accelerator, a CPU, a GPU, or the like).

The reduction unit may be configured to help reduce the power consumption associated with an access request (e.g., by using one or more of the partial word accesses described herein).

The reduction unit may be configured to control the memory read paths, during a read operation triggered by the read request, based on the first number of bits and the second number of bits. For example, a control signal from the reduction unit may affect the read paths so as to reduce the energy consumption of memory read paths that are irrelevant to the requested second number of bits. For instance, the reduction unit may be configured to control the irrelevant memory read paths when the second number is smaller than the first number.

As described above, the integrated circuit may be included in, or may otherwise comprise, a memory chip as illustrated in one or more of FIGS. 3A, 3B, 4 to 6, 7A to 7D, 11 to 13, 16 to 19, 22, and 23.

The irrelevant in-memory read paths may be associated with irrelevant bits of the first number of bits, such as bits of the first number of bits that are not included in the second number of bits.

FIG. 50 illustrates an exemplary integrated circuit 5000 that includes memory cells 5001-5008 of a memory cell array 5050, an output port 5020 that includes bits 5021-5028, read circuitry 5040 that includes memory read paths 5011-5018, and a reduction unit 5030.

When the second number of bits is read using the corresponding memory read paths, the irrelevant bits of the first number of bits may correspond to bits that should not be read (e.g., bits not included in the second number of bits).

During the read operation, reduction unit 5030 may be configured to activate the memory read paths corresponding to the second number of bits so that the activated memory read paths convey the second number of bits. In such embodiments, only the memory read paths corresponding to the second number of bits may be activated.

During the read operation, reduction unit 5030 may be configured to shut down at least a portion of each irrelevant memory read path. For example, the irrelevant memory read paths may correspond to the irrelevant bits of the first number of bits.

Note that, instead of shutting down at least a portion of an irrelevant memory read path, reduction unit 5030 may ensure that the irrelevant memory read path is not activated in the first place.

Additionally or alternatively, during the read operation, reduction unit 5030 may be configured to keep the irrelevant memory read paths in a low-power mode. For example, the low-power mode may include a mode in which the irrelevant memory read paths are supplied with a voltage or current lower than the normal operating voltage or current.
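The control decision made by the reduction unit may be sketched as a mapping from requested bit positions to per-path states. The sketch assumes, for illustration only, that the read request identifies the requested bits by their positions within the output port and that each irrelevant path is held in a low-power mode (one of the options described above). The name `control_read_paths` and the state strings are hypothetical.

```python
from typing import List, Set


def control_read_paths(first_number: int,
                       requested_bits: Set[int]) -> List[str]:
    """Return one state per memory read path of the output port.

    first_number: width of the output port (the first number of bits).
    requested_bits: positions of the second number of bits to be read.
    """
    if any(bit < 0 or bit >= first_number for bit in requested_bits):
        raise ValueError("requested bit outside the output port width")
    return ["active" if index in requested_bits else "low_power"
            for index in range(first_number)]


# With eight read paths, as in FIG. 50, a request for the lower four bits
# leaves the other four paths in the low-power mode.
print(control_read_paths(8, {0, 1, 2, 3}))
```

When the second number equals the first number, every path is active and the reduction unit has nothing to suppress.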

Reduction unit 5030 may be further configured to control the bitlines of the irrelevant memory read paths.

Accordingly, reduction unit 5030 may be configured to load the bitlines of the relevant memory read paths and keep the bitlines of the irrelevant memory read paths in a low-power mode. For example, only the bitlines of the relevant memory read paths may be loaded.

Additionally or alternatively, reduction unit 5030 may be configured to load the bitlines of the relevant memory read paths while keeping the irrelevant memory read paths deactivated.

일부 실시예에서, 리덕션 유닛(5030)은 읽기 동작 동안에 관련 있는 메모리 읽기 경로의 일부를 활용하고 무관한 각 메모리 읽기 경로의 일부(비트라인과 다름)를 저전력 모드로 유지하도록 구성될 수 있다.In some embodiments, reduction unit 5030 may be configured to utilize a portion of the memory read path that is relevant during a read operation and maintain a portion of each memory read path that is not relevant (different from the bitline) in a low power mode.

As described above, a memory chip may use sense amplifiers to amplify voltages from the memory cells it contains. Accordingly, the reduction unit 5030 may be configured, during a read operation, to utilize portions of the relevant memory read paths and to keep the sense amplifiers associated with at least some of the irrelevant memory read paths in a low power mode.

In such embodiments, the reduction unit 5030 may be configured, during a read operation, to utilize portions of the relevant memory read paths and to keep one or more sense amplifiers associated with all of the irrelevant memory read paths in a low power mode.

Additionally or alternatively, the reduction unit 5030 may be configured, during a read operation, to utilize portions of the relevant memory read paths and to keep in a low power mode those portions of the irrelevant memory read paths that follow (e.g., spatially and/or temporally) the one or more sense amplifiers associated with the irrelevant memory read paths.
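The path-control options described above can be illustrated with a minimal software sketch. It only models which read paths would be active and which would be held in a low power mode; the names `PathState` and `plan_read_paths` are hypothetical and do not appear in the disclosure, and a real reduction unit would be implemented in circuitry rather than software.

```python
from enum import Enum


class PathState(Enum):
    ACTIVE = "active"
    LOW_POWER = "low_power"


def plan_read_paths(first_number, relevant_bits):
    """Return one PathState per read path; the list index is the bit position.

    first_number:  width of the read circuitry (the first number of bits)
    relevant_bits: bit positions actually requested (the second number of bits)
    """
    relevant = set(relevant_bits)
    if any(bit < 0 or bit >= first_number for bit in relevant):
        raise ValueError("relevant bit outside the read width")
    # Paths carrying requested bits are activated; all others stay in low power.
    return [
        PathState.ACTIVE if index in relevant else PathState.LOW_POWER
        for index in range(first_number)
    ]


# Hypothetical example: an 8-bit-wide read circuit, 4 bits requested.
states = plan_read_paths(8, [0, 1, 2, 3])
print([state.value for state in states])
```

Whether "low power" here means a shut-down path, a path that is never activated, or a path fed with a reduced voltage or current is left open, since the text above allows each of these.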

In any of the embodiments described above, the memory unit may include a column multiplexer (not shown).

In such embodiments, the reduction unit 5030 may be coupled between the column multiplexer and the output port.

Additionally or alternatively, the reduction unit 5030 may be embedded in the column multiplexer.

Additionally or alternatively, the reduction unit 5030 may be coupled between the memory cells and the column multiplexer.

The reduction unit 5030 may include reduction subunits that may be independently controllable. For example, different reduction subunits may be associated with different columns of the memory unit.

Although the above description refers to read operations and read circuits, the above embodiments may be applied similarly to write operations and write circuits.

For example, an integrated circuit according to the present disclosure may include a memory unit that includes memory cells, an output port, and a write circuit. In some embodiments, the memory unit may further include a processing unit, such as the processor subunit described above. The write circuit may include a reduction unit and a first group of memory write paths for outputting up to a first number of bits through the output port. The processing unit may be configured to send to the memory unit a write request for writing a second number of bits from the memory unit. Additionally or alternatively, the write request may originate from an off-chip logic circuit (e.g., an accelerator, CPU, GPU, or the like). The reduction unit 5030 may be configured to control the memory write paths, during a write operation triggered by the write request, based on the first number of bits and the second number of bits.

FIG. 51 shows a memory bank 5100 that includes an array 5111 of memory cells addressed using row and column addresses (e.g., from an on-chip processor subunit or from an off-chip logic circuit such as an accelerator, CPU, GPU, or the like). As shown in FIG. 51, the memory cells are connected by bitlines (vertical) and wordlines (horizontal; most are omitted for simplicity). In addition, a row address may be supplied to the row decoder 5112 (e.g., from an on-chip processor subunit, an off-chip logic circuit, or a memory controller not shown in FIG. 51), and a column address may be supplied to the column multiplexer 5113 (e.g., from an on-chip processor subunit, an off-chip logic circuit, or a memory controller not shown in FIG. 51). The column multiplexer 5113 may receive outputs from up to an entire line and may output up to a word over the output bus 5115. In FIG. 51, the output bus 5115 of the column multiplexer 5113 is coupled to the main I/O bus 5114. In other embodiments, the output bus 5115 may be coupled to a processor subunit of the memory chip (e.g., as shown in FIG. 7A) that sends the row and column addresses. For simplicity, the division of the memory bank into memory mats is not shown.

FIG. 52 shows a memory bank 5101. In FIG. 52, the memory bank is shown as also including processing-in-memory (PIM) logic 5116, whose input is coupled to the output bus 5115. The PIM logic 5116 may generate addresses (e.g., including row addresses and column addresses) and output them over the PIM address bus 5118 to access the memory bank. The PIM logic 5116 is an example of a reduction unit (e.g., 5030) that includes a processing unit. The PIM logic 5116 may control other circuits, not shown in FIG. 52, that assist in reducing power. The PIM logic 5116 may also include the memory paths of a memory unit that includes the memory bank 5101.

As described above, the word length (e.g., the number of bitlines selected to be transferred at one time) may be large in some cases.

In such cases, each word to be read and/or written may be associated with a memory path that may consume power at various stages of the read and/or write operation, for example:

a. Bitline loading. To avoid loading a bitline to the required value (either from the capacitor on the bitline in a read cycle, or to the new value to be written to the capacitor in a write cycle), it is necessary to disable the sense amplifier located at the end of the memory array and to ensure that the capacitor holding the data is neither discharged nor charged (otherwise the stored data would be destroyed).

b. Moving the data from the sense amplifier, through the column multiplexer that selects the bitlines, to the rest of the chip (either to the I/O bus that transfers data into and out of the chip, or to embedded logic that uses the data, such as a processor subunit on the same substrate as the memory).

To enable power saving, an integrated circuit of the present disclosure may determine, at row activation time, that some portions of a word are irrelevant, and may then send a disable signal to one or more sense amplifiers for the irrelevant portions of the word.

FIG. 53 shows a memory unit 5102 that includes an array 5111 of memory cells, a row decoder 5112, a column multiplexer 5113 coupled to an output bus 5115, and PIM logic 5116.

The memory unit 5102 also includes switches 5201 that enable or disable the passage of bits to the column multiplexer 5113. The switches 5201 may include analog switches, transistors configured to function as switches, or any other circuit configured to control the supply of voltage and/or the flow of current to a part of the memory unit 5102. The sense amplifiers (not shown) may be located at the end of the memory cell array, for example before (spatially and/or temporally) the switches 5201.

The switches 5201 may be controlled by enable signals sent from the PIM logic 5116 over the bus 5117. A switch may be configured, when disconnected, to disconnect a sense amplifier (not shown) of the memory unit 5102 and therefore not to discharge or charge the bitline that is disconnected from the sense amplifier.

The switches 5201 and the PIM logic 5116 may form a reduction unit (e.g., 5030).

In yet another example, the PIM logic 5116 may send the enable signals to the sense amplifiers instead of to the switches 5201 (e.g., when the sense amplifiers have an enable input).

The bitlines may additionally or alternatively be disconnected at other points, for example at a point other than the end of the bitline and after the sense amplifier. For example, a bitline may be disconnected before it enters the array 5111.

In such embodiments, power may also be saved in the data transfer from the sense amplifiers and in the forwarding hardware (e.g., the output bus 5115).

Other embodiments (which may save less power but may be easier to implement) focus on reducing the power of the column multiplexer 5113 and the transfer losses from the column multiplexer 5113 to the next-level circuitry. For example, as described above, the next-level circuitry may include an I/O bus of the memory chip (e.g., the bus 5115). In embodiments that use in-memory processing, the next-level circuitry may additionally or alternatively include a processor subunit of the memory chip (e.g., the PIM logic 5116).

FIG. 54A shows a column multiplexer 5113 divided into segments 5202. Each segment 5202 of the column multiplexer 5113 may be individually enabled or disabled by enable and/or disable signals sent from the PIM logic 5116 over the bus 5119. The column multiplexer 5113 may also be fed by the address column bus 5118.

The embodiment of FIG. 54A may provide better control over different portions of the output from the column multiplexer 5113.

Here, the control of different memory paths may have different bit resolutions. For example, the bit resolution may range from single-bit resolution to multi-bit resolution. The former may be more effective at reducing power, while the latter may be simpler to implement and may require fewer control signals.
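The trade-off between bit resolution and the number of control signals can be made concrete with a short sketch. It assumes, purely for illustration, that paths are grouped into equal blocks that each share one enable signal, and that a block is enabled whenever any of its bits is needed; the function name `control_cost` and the 64-path example are hypothetical.

```python
def control_cost(path_count, resolution, needed_bits):
    """Count enable signals and activated paths for one access.

    path_count:  total number of memory paths (the first number of bits)
    resolution:  paths sharing one enable signal (1 = single-bit resolution)
    needed_bits: bit positions that the access actually requires
    """
    if resolution < 1 or path_count % resolution != 0:
        raise ValueError("resolution must be positive and divide path_count")
    needed = set(needed_bits)
    if any(bit < 0 or bit >= path_count for bit in needed):
        raise ValueError("needed bit outside the path range")
    # A whole block of paths is enabled if any bit inside it is needed.
    enabled_blocks = {bit // resolution for bit in needed}
    return {
        "control_signals": path_count // resolution,
        "active_paths": len(enabled_blocks) * resolution,
    }


# Hypothetical example: 64 paths, of which only bits 0..9 are needed.
for resolution in (1, 8, 64):
    print(resolution, control_cost(64, resolution, range(10)))
```

Under these assumptions, single-bit resolution activates only the 10 needed paths but requires 64 enable signals, while a single shared enable needs one signal and activates all 64 paths.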

FIG. 54B shows an example method 5130. For example, the method 5130 may be implemented using any of the memory units described above with reference to FIG. 50, FIG. 51, FIG. 52, FIG. 53, or FIG. 54A.

The method 5130 may include steps 5132 and 5134.

In step 5132, a processing unit of an integrated circuit (e.g., the PIM logic 5116) may send to a memory unit of the integrated circuit an access request to read a second number of bits from the memory unit. The memory unit may include memory cells (e.g., the memory cells of the array 5111), an output port (e.g., the output bus 5115), and read/write circuitry that may include a reduction unit (e.g., the reduction unit 5030) and a first group of memory read/write paths for outputting and/or inputting up to a first number of bits through the output port.

The access request may include a read request and/or a write request.

The memory input/output paths may include memory read paths, memory write paths, and/or paths used for both reading and writing.

Step 5134 may include responding to the access request.

For example, step 5134 may include controlling, by the reduction unit (e.g., the reduction unit 5030), the memory read/write paths during an access operation triggered by the access request, based on the first number of bits and the second number of bits.

Step 5134 may further include any one of the following and/or any combination of more than one of the following. Any of the operations listed below may be executed while responding to the access request, but may also be executed before and/or after responding to the access request.

Accordingly, step 5134 may include at least one of the following:

a. controlling the irrelevant memory read paths when the second number is smaller than the first number, wherein the irrelevant memory read paths are associated with bits of the first number of bits that are not included in the second number of bits;

b. activating the relevant memory read paths during a read operation, wherein the relevant memory read paths are configured to convey the second number of bits;

c. shutting down at least a portion of each of the irrelevant memory read paths during a read operation;

d. keeping the irrelevant memory read paths in a low power mode during a read operation;

e. controlling the bitlines of the irrelevant memory read paths;

f. loading the bitlines of the relevant memory read paths and keeping the bitlines of the irrelevant memory read paths in a low power mode;

g. loading the bitlines of the relevant memory read paths while keeping the bitlines of the irrelevant memory read paths deactivated;

h. utilizing portions of the relevant memory read paths during a read operation and keeping a portion of each of the irrelevant memory read paths in a low power mode, wherein the portion differs from a bitline;

i. utilizing portions of the relevant memory read paths during a read operation and keeping the sense amplifiers for at least some of the irrelevant memory read paths in a low power mode;

j. utilizing portions of the relevant memory read paths during a read operation and keeping at least the sense amplifiers of the irrelevant memory read paths in a low power mode; and

k. utilizing portions of the relevant memory read paths during a read operation and keeping in a low power mode the portions of the irrelevant memory read paths that follow the sense amplifiers of the irrelevant memory read paths.

A low power mode or idle mode may include a mode in which the power consumption of a memory access path is lower than its power consumption when the memory access path is used for an access operation. In some embodiments, a low power mode may include shutting down the memory access path. A low power mode may additionally or alternatively include not activating the memory access path.

Here, for power reduction to occur during the bitline stage, the relevance or irrelevance of the memory access paths may need to be known before the wordline is opened. Power reduction that occurs elsewhere (e.g., in the column multiplexer) may instead allow the relevance or irrelevance of the memory access paths to be decided on each access.
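The timing constraint just described can be summarized in a small decision helper. This is an illustrative sketch only; the function name `gateable_stages` and the stage labels are hypothetical and are not taken from the disclosure.

```python
def gateable_stages(known_before_wordline_open):
    """List the stages whose irrelevant paths can be kept in a low power mode.

    known_before_wordline_open: True if the relevant/irrelevant split is
    available before the wordline is opened, False if it only becomes
    known when each access is issued.
    """
    # Column-multiplexer gating can be decided separately for every access.
    stages = ["column_multiplexer"]
    if known_before_wordline_open:
        # Bitline-stage gating needs the decision before the wordline opens.
        stages.insert(0, "bitline")
    return stages


print(gateable_stages(True))
print(gateable_stages(False))
```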

Fast, Low-Power Activation and Fast-Access Memory

DRAM and other memory types (such as SRAM, flash memory, and the like) are often built from memory banks, which are generally organized around a row and column access scheme.

FIG. 55 shows an example of a memory chip 5140 that includes multiple memory mats and associated logic (e.g., the row decoders and column decoders shown as RD and COL, respectively, in FIG. 55). In the example of FIG. 55, the mats are grouped into banks and have wordlines and bitlines running through them. The memory banks and associated logic are denoted 5141, 5142, 5143, 5144, 5145, and 5146 in FIG. 55 and share at least one bus 5147.

The memory chip 5140 may be included in, or may include, a memory chip as shown in one or more of FIGS. 3A, 3B, 4 to 6, 7A to 7D, 11 to 13, 16 to 19, 22, and 23.

In DRAM, for example, there is a large overhead associated with activating a new row (e.g., preparing a new line for access). Once a line is activated (also referred to as opened), the data within that row may be available for much faster access. In DRAM, this access may occur in a random manner.

Two issues associated with activating a new line are power and time:

a. Power rises because of the rush of current caused by accessing all of the capacitors on the line together and having to load the line (e.g., the current may reach several amperes when opening a line with just a few memory banks).

b. The time delay is mostly associated with the time it takes to load the row (wordline) and then the columns (bitlines).

Some embodiments of the present disclosure may include systems and methods that reduce the peak power consumption during activation of a line and reduce the activation time of the line. Some embodiments may reduce these power and time costs by sacrificing, to some extent, fully random access.

For example, in one embodiment, a memory unit may include a first memory mat, a second memory mat, and an activation unit configured to activate a first group of memory cells included in the first memory mat without activating a second group of memory cells included in the second memory mat. The first group of memory cells and the second group of memory cells may both belong to a single row of the memory unit.

Alternatively, the activation unit may be configured to activate the second group of memory cells included in the second memory mat without activating the first group of memory cells.

In some embodiments, the activation unit may be configured to activate the second group of memory cells after activation of the first group of memory cells.

For example, the activation unit may be configured to activate the second group of memory cells following the expiration of a delay period that starts after activation of the first group of memory cells has completed.

Additionally or alternatively, the activation unit may be configured to activate the second group of memory cells based on the value of a signal generated on a first wordline coupled to the first group of memory cells.
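One way to picture activation that depends on the first wordline signal is a simple threshold check over sampled signal levels. The sketch below is an assumption-laden illustration: the disclosure does not state which signal value triggers the second group, so the threshold, the sample values, and the name `second_group_enable_index` are all hypothetical.

```python
def second_group_enable_index(wordline_samples, threshold):
    """Return the index of the first sample at which the first wordline
    signal reaches the threshold, or None if it never does."""
    for index, level in enumerate(wordline_samples):
        if level >= threshold:
            return index
    return None


# Hypothetical sampled levels of the first wordline as it charges.
samples = [0.0, 0.3, 0.7, 1.0, 1.1]
print(second_group_enable_index(samples, 1.0))  # prints 3
```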

In any of the embodiments described above, the activation unit may include an intermediate circuit disposed between a first wordline segment and a second wordline segment. The first wordline segment may be coupled to the first memory cells, and the second wordline segment may be coupled to the second memory cells. Examples of the intermediate circuit include, but are not limited to, switches, flip-flops, buffers, inverters, and the like, some of which are shown in FIGS. 56 to 61.

In some embodiments, the second memory cells may be coupled to a second wordline segment. In such embodiments, the second wordline segment may be coupled to a bypass wordline path that passes through at least the first memory mat. An example of such a bypass path is shown in FIG. 61.

The activation unit may include a control unit configured to control the supply of voltage (and/or the flow of current) to the first group of memory cells and to the second group of memory cells based on an activation signal from a wordline associated with the single row.

In another example embodiment, a memory unit may include a first memory mat, a second memory mat, and an activation unit configured to supply an activation signal to a first group of memory cells of the first memory mat and to delay supplying the activation signal to a second group of memory cells of the second memory mat at least until activation of the first group of memory cells has completed. The first group of memory cells and the second group of memory cells may belong to a single row of the memory unit.

For example, the activation unit may include a delay unit configured to delay the supply of the activation signal.

Additionally or alternatively, the activation unit may include a comparator configured to receive the activation signal at its input and to control the delay unit based on at least one characteristic of the activation signal.
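Delaying the second group until the first has been activated means that fewer cells are charged at the same moment. The sketch below counts cells per activation step under the simplifying assumption that peak current scales with the number of cells activated together; the function name `activation_schedule` and the group sizes are hypothetical.

```python
def activation_schedule(group_sizes, groups_per_step):
    """Split the groups on one row into activation steps.

    Returns (peak_cells, step_count), where peak_cells is the largest
    number of cells activated together in any single step.
    """
    if not group_sizes or groups_per_step < 1:
        raise ValueError("need at least one group and one group per step")
    steps = [
        group_sizes[start:start + groups_per_step]
        for start in range(0, len(group_sizes), groups_per_step)
    ]
    peak_cells = max(sum(step) for step in steps)
    return peak_cells, len(steps)


# Hypothetical row made of three groups of 1024 cells each.
row = [1024, 1024, 1024]
print(activation_schedule(row, 3))  # whole row at once
print(activation_schedule(row, 1))  # one group per step
```

In this toy model, activating one group per step lowers the peak from 3072 to 1024 cells at the cost of three steps instead of one, which mirrors the trade against fully random access noted earlier.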

In another example embodiment, a memory unit may include a first memory mat, a second memory mat, and an isolation unit configured to isolate first memory cells of the first memory mat from second memory cells of the second memory mat during an initial activation period in which the first memory cells are activated, and to couple the first memory cells to the second memory cells after the initial activation period. The first and second memory cells may belong to a single row of the memory unit.

In the examples below, no modifications to the memory mats themselves are required. In some examples, embodiments may rely on minor modifications to the memory bank.

The figures described below show a mechanism that shortens the word signal added to the memory banks, thereby dividing a wordline into several shorter portions.

In the following figures, various memory bank components are omitted for simplicity of explanation.

FIGS. 56 to 61 show portions of memory banks (denoted 5140(1), 5140(2), 5140(3), 5140(4), 5140(5), and 5140(6), respectively) that include a row decoder and multiple memory mats (5150(1), 5150(2), 5150(3), 5150(4), 5150(5), 5150(6), 5151(1), 5151(2), 5151(3), 5151(4), 5151(5), 5151(6), 5152(1), 5152(2), 5152(3), 5152(4), 5152(5), and 5152(6)) that are grouped into different groups.

Memory mats arranged in a row may include different groups.

FIGS. 56 to 59 and 61 show nine groups of memory mats, each group including a pair of memory mats. Any number of groups, each including any number of memory mats, may be used.

The memory mats 5150(1), 5150(2), 5150(3), 5150(4), 5150(5), and 5150(6) are arranged in a row, share multiple memory lines, and are divided into three groups: a first upper group that includes the memory mats 5150(1) and 5150(2), a second upper group that includes the memory mats 5150(3) and 5150(4), and a third upper group that includes the memory mats 5150(5) and 5150(6).

Similarly, the memory mats 5151(1), 5151(2), 5151(3), 5151(4), 5151(5), and 5151(6) are arranged in a row, share multiple memory lines, and are divided into three groups: a first middle group that includes the memory mats 5151(1) and 5151(2), a second middle group that includes the memory mats 5151(3) and 5151(4), and a third middle group that includes the memory mats 5151(5) and 5151(6).

In addition, the memory mats 5152(1), 5152(2), 5152(3), 5152(4), 5152(5), and 5152(6) are arranged in a row, share multiple memory lines, and are divided into three groups: a first lower group that includes the memory mats 5152(1) and 5152(2), a second lower group that includes the memory mats 5152(3) and 5152(4), and a third lower group that includes the memory mats 5152(5) and 5152(6). Any number of memory mats may be arranged in a row, share memory lines, and be divided into any number of groups.

For example, the number of memory mats in each group may be one, two, or more.
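The grouping of mats along a row can be expressed as a simple partition of an ordered list. The following sketch reproduces the three pairs of the upper row; the helper name `group_mats` is hypothetical, and equal-sized groups are an assumption made only to keep the example short.

```python
def group_mats(mat_labels, mats_per_group):
    """Partition the mats of one row, in order, into consecutive groups."""
    if mats_per_group < 1:
        raise ValueError("each group needs at least one mat")
    return [
        mat_labels[start:start + mats_per_group]
        for start in range(0, len(mat_labels), mats_per_group)
    ]


upper_row = [f"5150({n})" for n in range(1, 7)]
for number, group in enumerate(group_mats(upper_row, 2), start=1):
    print(f"upper group {number}: {group}")
```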

앞서 설명한 바와 같이, 활성화 회로는 동일 메모리 라인을 공유하는 다른 그룹의 메모리 매트를 활성화하지 않고 한 그룹의 메모리 매트를 활성화하도록 구성될 수 있다. 또는, 라인 어드레스가 동일한 상이한 메모리 라인 세그먼트에 적어도 결합될 수 있다.As described above, the activation circuit may be configured to activate one group of memory mats without activating another group of memory mats that share the same memory line. Alternatively, the line addresses may be coupled at least to different memory line segments that are the same.

도 56 내지 도 61은 활성화 회로의 상이한 예를 도시한 것이다. 일부 실시예에서, 활성화 회로의 적어도 일부분(예, 중간 회로)은 메모리 매트의 그룹 사이에 위치하여 한 그룹의 메모리 매트가 활성화되게 하는 반면에 동일한 행의 다른 그룹의 메모리 매트는 활성화되지 않게 할 수 있다.FIGS. 56 to 61 show different examples of the activation circuit. In some embodiments, at least a portion of the activation circuit (e.g., an intermediate circuit) may be positioned between groups of memory mats so that one group of memory mats is activated while another group of memory mats in the same row is not.

도 56은 제1 상부 그룹의 메모리 매트와 제2 상부 그룹의 메모리 매트의 상이한 라인 사이에 위치한 지연 또는 격리 회로(5153(1) - 5153(3))와 같은 중간 회로를 도시하고 있다.Figure 56 shows an intermediate circuit, such as delay or isolation circuitry 5153(1) - 5153(3), positioned between different lines of a first upper group of memory mats and a second upper group of memory mats.

도 56은 또한 제2 상부 그룹의 메모리 매트와 제3 상부 그룹의 메모리 매트의 상이한 라인 사이에 위치한 지연 또는 격리 회로(5154(1) - 5154(3))와 같은 중간 회로를 도시하고 있다. 추가적으로, 일부 지연 또는 격리 회로는 중간 그룹의 메모리 매트로부터 형성된 그룹 사이에 위치한다. 또한, 일부 지연 또는 격리 회로는 하부 그룹의 메모리 매트로부터 형성된 그룹 사이에 위치한다.FIG. 56 also shows intermediate circuits, such as delay or isolation circuits 5154(1) - 5154(3), positioned between different lines of the second upper group of memory mats and the third upper group of memory mats. Additionally, some delay or isolation circuits are positioned between the groups formed from the memory mats of the intermediate groups. Likewise, some delay or isolation circuits are positioned between the groups formed from the memory mats of the lower groups.

지연 또는 격리 회로는 로우 디코더(5112)로부터의 워드라인 신호가 다른 그룹의 행을 따라 전파되는 것을 지연 또는 중지할 수 있다.The delay or isolation circuit may delay or stop a wordline signal from the row decoder 5112 from propagating along the row to another group.

도 57은 플립플롭(5155(1) - 5155(3) and 5156(1)-5156(3))을 포함하는 지연 또는 격리 회로와 같은 중간회로를 도시하고 있다.57 shows an intermediate circuit, such as a delay or isolation circuit, comprising flip-flops 5155(1) - 5155(3) and 5156(1)-5156(3).

활성화 신호가 워드라인으로 주입되는 경우, 제1 그룹의 매트(워드라인에 따라)가 활성화되는 반면에 워드라인 상의 다른 그룹은 비활성화 상태로 유지된다. 다른 그룹은 다음 클럭 사이클에 활성화될 수 있다. 예를 들어, 다른 그룹의 제2 그룹은 다음 클럭 사이클에서 활성화될 수 있고, 다른 그룹의 제3 그룹은 또 다른 클럭 사이클 이후에 활성화될 수 있다.When an activation signal is injected into a wordline, the first group of mats (along the wordline) is activated while the other groups on the wordline remain inactive. The other groups may be activated on subsequent clock cycles. For example, a second one of the other groups may be activated on the next clock cycle, and a third one of the other groups may be activated one clock cycle after that.

플립플롭은 D형 플립플롭 또는 모든 임의의 다른 유형의 플립플롭을 포함할 수 있다. D형 플립플롭으로 공급되는 클럭은 단순화를 위해서 도면에서 생략되었다.A flip-flop may include a D-type flip-flop or any other type of flip-flop. The clock supplied to the D-type flip-flop is omitted from the drawing for simplicity.

따라서, 제1 그룹으로의 접근은 전력을 사용하여 제1 그룹과 연관된 워드라인의 부분만을 충전할 수 있고, 이는 전체 워드라인을 충전하는 것보다 빠르고 전류도 덜 필요하다.Thus, access to the first group can use power to only charge a portion of the wordline associated with the first group, which is faster and requires less current than charging the entire wordline.

하나 이상의 플립플롭이 메모리 매트의 그룹 사이에 사용될 수 있으므로, 개방 부분 사이의 지연을 증가시킬 수 있다. 추가적으로 또는 대안적으로, 실시예들은 클럭을 느리게 하여 지연을 증가시킬 수 있다.More than one flip-flop may be used between groups of memory mats, thereby increasing the delay between open portions. Additionally or alternatively, embodiments may slow the clock to increase delay.
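The clocked, group-by-group activation described above can be sketched as a small shift-register simulation. This is only an illustrative model, not the circuit of FIG. 57: the function name `simulate_chain` and the parameter `flops_between_groups` are assumptions introduced here, and the wordline is assumed to stay asserted for the whole simulation.

```python
def simulate_chain(num_groups, num_cycles, flops_between_groups=1):
    """Clock a flip-flop chain and record which groups are driven on each cycle.

    Group 0 is driven directly by the row decoder. Each later group sits behind
    `flops_between_groups` additional flip-flops (hypothetical parameter).
    """
    stages = [0] * ((num_groups - 1) * flops_between_groups)
    history = []
    for _ in range(num_cycles):
        # Group g > 0 taps the output of the last flip-flop in front of it.
        taps = [1] + [stages[g * flops_between_groups - 1]
                      for g in range(1, num_groups)]
        history.append(taps)
        # On the clock edge every flip-flop captures the value upstream of it.
        if stages:
            stages = [1] + stages[:-1]
    return history


if __name__ == "__main__":
    for cycle, taps in enumerate(simulate_chain(3, 3)):
        print(cycle, taps)
```

With one flip-flop between groups, each group becomes active one cycle after its neighbor. Raising `flops_between_groups` (or slowing the clock) stretches the interval between successive groups, which is the trade-off the paragraph above describes.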

또한, 활성화되는 그룹들은 이전에 사용된 라인 값으로부터의 그룹들을 여전히 포함할 수 있다. 예를 들어, 방법은 이전 라인의 데이터에 여전히 접근하면서 새로운 라인 세그먼트를 활성화하게 할 수 있어서, 새로운 라인의 활성화와 연관된 페널티를 감소시킬 수 있다.Also, the groups being activated may still contain groups from previously used line values. For example, the method may allow activating a new line segment while still accessing the data of the previous line, thereby reducing the penalty associated with activation of a new line.

이에 따라, 일부 실시예들은 활성화된 제1 그룹이 있을 수 있고, 이전에 활성화된 라인의 다른 그룹들을 서로 간섭하지 않는 비트라인의 신호를 가지고 활성화 상태로 유지되도록 할 수 있다.Accordingly, some embodiments may activate a first group while allowing other groups of the previously activated line to remain active, with the bitline signals of the groups not interfering with each other.

추가적으로, 일부 실시예들은 스위치와 제어 신호를 포함할 수 있다. 제어 신호는 뱅크 컨트롤러에 의해 제어되거나 제어 신호 사이에 플립플롭을 추가하여(예, 상기에 설명한 메커니즘과 동일한 타이밍 효과를 생성하여) 제어될 수 있다.Additionally, some embodiments may include switches and control signals. The control signal may be controlled by the bank controller or may be controlled by adding flip-flops between the control signals (eg, creating the same timing effect as the mechanism described above).

도 58은 스위치(5157(1) - 5157(3) 및 5158(1)-5158(3) 등)이고 한 그룹과 다른 그룹 사이에 위치하는 지연 또는 격리 회로와 같은 중간 회로를 도시하고 있다. 그룹 사이에 위치한 한 세트의 스위치는 전용 제어 신호에 의해 제어될 수 있다. 도 58에서, 제어 신호는 행 제어부(5160(1))에 의해 전송되고 상이한 세트의 스위치 사이의 하나 이상의 연속된 지연부(예, 5160(2) 및 5160(3))에 의해 지연될 수 있다.FIG. 58 shows intermediate circuits, such as delay or isolation circuits, that are switches (such as 5157(1) - 5157(3) and 5158(1) - 5158(3)) positioned between one group and another. A set of switches positioned between groups may be controlled by a dedicated control signal. In FIG. 58, the control signal may be sent by a row control unit 5160(1) and delayed by a sequence of one or more delay units (e.g., 5160(2) and 5160(3)) between the different sets of switches.

도 59는 연속된 인버터 게이트 또는 버퍼(예, 5159(1) - 5159(3) 및 5159'(1) - 5159'(3))이고 메모리 매트의 그룹들 사이에 위치한 지연 또는 격리 회로와 같은 중간 회로를 도시하고 있다.FIG. 59 shows intermediate circuits, such as delay or isolation circuits, that are sequences of inverter gates or buffers (e.g., 5159(1) - 5159(3) and 5159'(1) - 5159'(3)) positioned between groups of memory mats.

스위치 대신에, 버퍼는 메모리 매트의 그룹들 사이에 사용될 수 있다. 버퍼는 스위치와 스위치 간에, 단일 트랜지스터 구조를 사용하는 경우에 때때로 발생하는 효과인, 워드라인을 따른 전압 강하를 방지할 수 있다.Instead of switches, buffers may be used between groups of memory mats. Buffers may prevent the voltage from dropping along the wordline from switch to switch, an effect that sometimes occurs when a single-transistor structure is used.

다른 실시예들은 메모리 뱅크에 추가된 영역을 활용함으로써 더욱 무작위한 접근을 허용하면서 매우 낮은 활성화 전력과 시간을 제공할 수 있다.Other embodiments may provide very low activation power and time while allowing more random access by utilizing the added area to the memory bank.

메모리 매트에 근접하여 위치한 워드라인(예, 5152(1) - 5152(8))을 사용하는 일례가 도 60에 도시되어 있다. 이러한 워드라인들은 메모리 매트를 통과할 수도 또는 통과하지 않을 수도 있고 스위치(예, 5157(1) - 5157(8))와 같은 중간 회로를 통해 메모리 매트 내의 워드라인들에 결합할 수 있다. 스위치는 어느 메모리 매트가 활성화될지를 제어할 수 있고 메모리 컨트롤러가 각 시점에 관련이 있는 라인 부분만 활성화하도록 할 수 있다. 상기에 설명한 라인 부분의 연속 활성화를 사용하는 실시예와 달리, 도 60의 예는 더 강한 제어를 제공할 수 있다.An example that uses wordlines (e.g., 5152(1) - 5152(8)) located close to the memory mats is shown in FIG. 60. These wordlines may or may not pass through the memory mats, and may be coupled to the wordlines within the memory mats through intermediate circuits such as switches (e.g., 5157(1) - 5157(8)). The switches may control which memory mats are activated and allow the memory controller to activate only the relevant portion of the line at each point in time. Unlike the embodiments described above, which activate the line portions sequentially, the example of FIG. 60 may provide greater control.

로우 파트 인에이블 신호(5170(1) 및 5170(2))와 같은 인에이블 신호는 도시되지 않은 메모리 컨트롤러와 같은 로직에서 유래할 수 있다.Enable signals, such as row part enable signals 5170(1) and 5170(2), may originate from logic such as a memory controller (not shown).
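As a rough illustration of how a controller might derive per-mat enable signals so that only the relevant part of a row is activated, consider the sketch below. The mapping from column addresses to mats, and the names `row_part_enables`, `cols_per_mat`, and `num_mats`, are assumptions made for this example and do not come from the disclosure.

```python
def row_part_enables(first_col, last_col, cols_per_mat, num_mats):
    """Return one enable bit per mat switch for an access to [first_col, last_col].

    Assumes (hypothetically) that mat m holds columns
    m * cols_per_mat .. (m + 1) * cols_per_mat - 1 of the row.
    """
    if not (0 <= first_col <= last_col < cols_per_mat * num_mats):
        raise ValueError("column range lies outside the row")
    first_mat = first_col // cols_per_mat
    last_mat = last_col // cols_per_mat
    # Only the switches of mats that overlap the requested range are closed.
    return [first_mat <= m <= last_mat for m in range(num_mats)]


if __name__ == "__main__":
    print(row_part_enables(300, 520, cols_per_mat=256, num_mats=8))
```

Because every switch has its own enable bit, any contiguous (or, with a small change, non-contiguous) subset of mats can be opened in a single step, in contrast to the purely sequential schemes of FIGS. 56 to 59.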

도 61은 글로벌 워드라인(5180)이 메모리 매트를 통과하고, 메모리 매트 외부로 보내질 필요가 없을 수 있는 워드라인 신호에 대한 바이패스 경로를 형성하는 것을 도시하고 있다. 이에 따라, 도 61에 도시된 실시예는 메모리 밀도에 약간의 손실을 감수하고 메모리 뱅크의 면적을 줄일 수 있다.Figure 61 shows that global wordline 5180 passes through the memory mat and forms a bypass path for wordline signals that may not need to be routed out of the memory mat. Accordingly, the embodiment shown in FIG. 61 can reduce the area of the memory bank with a slight loss in memory density.

도 61에서, 글로벌 워드라인은 중단 없이 메모리 매트를 통과할 수 있고 메모리 셀에 연결되지 않을 수 있다. 로컬 워드라인 세그먼트는 스위치 중의 하나에 의해 제어되고 메모리 매트 내에서 메모리 셀에 연결될 수 있다.In FIG. 61, the global wordline may pass through the memory mats without interruption and may not be connected to memory cells. A local wordline segment may be controlled by one of the switches and may be connected to memory cells within a memory mat.

메모리 매트의 그룹이 워드라인의 상당한 파티션을 제공하는 경우, 메모리 뱅크는 완전한 무작위 접근을 사실상 지원할 수 있다.If a group of memory mats provide significant partitioning of wordlines, the memory bank can virtually support completely random access.

일부 배선과 로직을 또한 절약할 수 있고 워드라인을 따라 활성화 신호의 전파 속도를 감소시키는 다른 실시예는 전용 인에이블 신호 및 인에이블 신호를 전달하는 전용 라인을 사용하지 않고 메모리 매트 사이에 다른 버퍼링 또는 격리 회로 및/또는 스위치를 사용한다.Another embodiment, which may also save some wiring and logic and which slows the propagation of the activation signal along the wordline, uses switches and/or other buffering or isolation circuits between memory mats, without dedicated enable signals or dedicated lines for conveying the enable signals.

예를 들어, 비교기를 사용하여 스위치 또는 기타 버퍼링 또는 격리 회로를 제어할 수 있다. 비교기는 비교기에 의해 모니터링 된 워드라인 세그먼트 상의 신호의 레벨이 특정 레벨에 도달하는 경우에 스위치 또는 기타 버퍼링 또는 격리 회로를 활성화할 수 있다. 예를 들어, 특정 레벨은 이전 워드라인 세그먼트가 완전히 로드되었음을 나타낼 수 있다.For example, a comparator may be used to control a switch or other buffering or isolation circuit. The comparator may activate a switch or other buffering or isolation circuitry when the level of the signal on the wordline segment monitored by the comparator reaches a certain level. For example, a certain level may indicate that the previous wordline segment was fully loaded.
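The comparator-gated scheme can be sketched with a simple discrete-time model in which each connected wordline segment moves a fixed fraction closer to its full level on every step, and a comparator connects the next segment once the previous one crosses a threshold. The charging model and the names `comparator_chain`, `threshold`, and `charge_rate` are illustrative assumptions, not values from the disclosure.

```python
def comparator_chain(num_segments, threshold=0.9, charge_rate=0.5, max_steps=100):
    """Return the time step at which each wordline segment starts charging."""
    level = [0.0] * num_segments      # normalized segment voltage, 0.0 .. 1.0
    start = [None] * num_segments
    start[0] = 0                      # the first segment is driven by the decoder
    for step in range(max_steps):
        # Each comparator samples the previous segment; once that segment has
        # reached the threshold, the switch to the next segment is closed.
        for i in range(1, num_segments):
            if start[i] is None and level[i - 1] >= threshold:
                start[i] = step
        # Every connected segment moves a fixed fraction closer to full level.
        for i in range(num_segments):
            if start[i] is not None:
                level[i] += charge_rate * (1.0 - level[i])
        if all(s is not None for s in start):
            break
    return start


if __name__ == "__main__":
    print(comparator_chain(3))
```

In this model the spacing between segments follows from the charging behavior itself rather than from a clock, so no dedicated enable lines or delay units are needed.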

도 62는 메모리 유닛의 동작을 위한 방법(5190)을 도시한 것이다. 예컨대, 방법(5190)은 도 56 내지 도 61을 참조하여 앞서 설명한 메모리 뱅크의 하나 이상을 사용하여 이행될 수 있다.62 illustrates a method 5190 for operation of a memory unit. For example, method 5190 may be implemented using one or more of the memory banks described above with reference to FIGS. 56-61 .

방법(5190)은 단계 5192와 단계 5194를 포함할 수 있다.Method 5190 may include steps 5192 and 5194 .

단계 5192는 활성화부가 메모리 유닛의 제2 메모리 매트에 포함된 제2 그룹의 메모리 셀을 활성화하지 않고 메모리 유닛의 제1 메모리 매트에 포함된 제1 그룹의 메모리 셀을 활성화하는 단계를 포함할 수 있다. 제1 그룹의 메모리 셀과 제2 그룹의 메모리 셀은 모두 메모리 유닛의 단일 행에 속할 수 있다.Step 5192 may include, by the activation unit, activating the first group of memory cells included in the first memory mat of the memory unit without activating the second group of memory cells included in the second memory mat of the memory unit . Both the first group of memory cells and the second group of memory cells may belong to a single row of the memory unit.

단계 5194는 활성화부가 단계 5192 이후에 제2 그룹의 메모리 셀을 활성화하는 단계를 포함할 수 있다.Step 5194 may include activating, by the activation unit, the second group of memory cells after step 5192.

단계 5194는 제1 그룹의 메모리 셀이 활성화되는 동안, 제1 그룹의 메모리 셀의 완전 활성화 이후, 제1 그룹의 메모리 셀의 활성화가 완료된 이후에 개시된 지연 기간의 만료 이후, 제1 그룹의 메모리 셀이 비활성화된 이후 등에 실행될 수 있다.Step 5194 may be executed while the first group of memory cells is being activated, after full activation of the first group of memory cells, after expiration of a delay period that begins once activation of the first group of memory cells is completed, after the first group of memory cells has been deactivated, and so on.

지연 기간은 고정 또는 조정될 수 있다. 예를 들어, 지연 기간의 길이는 메모리 유닛의 예상 접근 패턴에 의거할 수 있거나 예상 접근 패턴과 무관하게 설정될 수 있다. 지연 기간은 1000분의 1초 내지 1초 이상의 범위일 수 있다.The delay period may be fixed or adjustable. For example, the length of the delay period may be based on an expected access pattern of the memory unit or may be set irrespective of the expected access pattern. The delay period may range from one thousandths of a second to one second or more.

일부 실시예에서, 단계 5194는 제1 그룹의 메모리 셀에 결합된 제1 워드라인 세그먼트 상에 생성된 신호의 값에 의거하여 개시될 수 있다. 예컨대, 신호의 값이 제1 임계값을 초과하는 경우, 제1 그룹의 메모리 셀이 완전히 활성화되었음을 나타내는 것일 수 있다.In some embodiments, step 5194 may be initiated based on a value of a signal generated on a first wordline segment coupled to a first group of memory cells. For example, when the value of the signal exceeds the first threshold value, it may indicate that the memory cells of the first group are fully activated.

단계 5192와 단계 5194의 하나는 제1 워드라인 세그먼트와 제2 워드라인 세그먼트 사이에 배치된 중간 회로(예, 활성화부의 중간 회로)를 사용할 수 있다. 제1 워드라인 세그먼트는 제1 메모리 셀에 결합될 수 있고, 제2 워드라인 세그먼트는 제2 메모리 셀에 결합될 수 있다.Either of steps 5192 and 5194 may use an intermediate circuit (e.g., an intermediate circuit of the activation unit) disposed between a first wordline segment and a second wordline segment. The first wordline segment may be coupled to the first memory cells, and the second wordline segment may be coupled to the second memory cells.

중간 회로의 예들은 도 56 내지 도 61에 도시되어 있다.Examples of intermediate circuits are shown in FIGS. 56-61.

단계 5192와 단계 5194는 단일 행과 연관된 워드라인으로부터의 활성화 신호의 제1 그룹의 메모리 셀과 제2 그룹의 메모리 셀로의 인가를 제어부가 제어하는 단계를 더 포함할 수 있다.Steps 5192 and 5194 may further include controlling, by the controller, application of an activation signal from a word line associated with a single row to the memory cells of the first group and the memory cells of the second group.
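The two steps of method 5190 can be summarized in a toy model. The class `MemoryRow`, the function `method_5190`, and the fixed `delay` argument are hypothetical names chosen for this sketch; the disclosure leaves the triggering condition for step 5194 open (a delay, a threshold on the first segment, and so on).

```python
class MemoryRow:
    """One row whose wordline is split into per-group segments (toy model)."""

    def __init__(self, num_groups):
        self.active = [False] * num_groups
        self.log = []  # (time, group) pairs in activation order

    def activate(self, group, time):
        self.active[group] = True
        self.log.append((time, group))


def method_5190(row, first=0, second=1, delay=1):
    # Step 5192: activate the first group while the second group stays inactive.
    row.activate(first, time=0)
    if row.active[second]:
        raise RuntimeError("second group must not be active during step 5192")
    # Step 5194: activate the second group afterwards (here, after a fixed delay).
    row.activate(second, time=delay)
    return row.log


if __name__ == "__main__":
    print(method_5190(MemoryRow(3), delay=2))
```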

검사 시간 단축을 위한 메모리 병렬화(parallelism) 및 벡터를 활용하는 메모리의 검사 로직Using memory parallelism to reduce test time, and testing logic in memory using vectors

본 개시의 일부 실시예들은 인칩 검사부(in chip testing unit)를 활용하여 검사의 속도를 빠르게 할 수 있다.Some embodiments of the present disclosure may increase the speed of testing by using an in-chip testing unit.

일반적으로, 메모리 칩의 검사에는 상당한 검사 시간이 필요하다. 검사 시간을 줄이면 생산 비용을 줄일 수 있고 또한 더 많은 검사가 가능해지므로 제품의 신뢰성이 높아질 수 있다.In general, the inspection of a memory chip requires a considerable inspection time. Reducing inspection time can reduce production costs and increase product reliability by allowing more inspections.

도 63과 도 64는 검사기(5200) 및 칩(또는 칩의 웨이퍼)(5210)을 도시한 것이다. 검사기(5200)는 검사를 관리하는 소프트웨어를 포함할 수 있다. 검사기(5200)는 메모리(5210)의 전체에 대해 데이터의 상이한 시퀀스를 실행한 후에 이 시퀀스를 다시 읽어 메모리(5210)의 불량 비트가 위치한 장소를 식별할 수 있다. 일단 인지가 되면, 검사기(5200)는 비트를 수리하기 위한 명령을 생성하고, 문제가 해결되는 경우에, 검사기(5200)는 메모리(5210)의 검사 합격을 표명할 수 있다. 그렇지 않은 경우에, 일부 칩은 불량으로 표명될 수 있다.FIGS. 63 and 64 show a tester 5200 and a chip (or a wafer of chips) 5210. The tester 5200 may include software that manages the testing. The tester 5200 may run different sequences of data over all of the memory 5210 and then read the sequences back to identify where the failing bits of the memory 5210 are located. Once the failing bits are identified, the tester 5200 may issue commands to repair them, and if the problem is resolved, the tester 5200 may declare that the memory 5210 has passed the test. Otherwise, some chips may be declared defective.

검사기(5200)는 검사 시퀀스를 기록한 후에 데이터를 다시 읽어와서 예상 결과와 비교할 수 있다.The tester 5200 may write test sequences and then read the data back and compare it with the expected results.

도 64는 검사기(5200) 및 병렬로 검사되는 칩(예, 5210)의 전체 웨이퍼(5202)가 있는 검사 시스템을 도시한 것이다. 예컨대, 검사기(5200)는 배선의 버스로 칩의 각각에 연결될 수 있다.FIG. 64 shows a test system with a tester 5200 and a full wafer 5202 of chips (e.g., 5210) that are tested in parallel. For example, the tester 5200 may be connected to each of the chips by a bus of wires.

도 64에 도시된 바와 같이, 검사기(5200)는 메모리 칩 모두에 읽기 및 쓰기를 몇 차례 해야 하고, 데이터는 외부 칩 인터페이스를 통해 통과해야 한다.As shown in FIG. 64, the tester 5200 has to read from and write to all of the memory chips several times, and the data must pass through the external chip interface.

또한, 일반적인 I/O 동작을 활용하여 제공될 수 있는 프로그램 가능한 구성 정보 등을 활용하여 집적회로의 로직과 메모리 뱅크 모두를 검사하는 것이 이로울 수 있다.It may also be beneficial to test both the logic and memory banks of the integrated circuit using programmable configuration information, etc. that may be provided utilizing normal I/O operations.

또한, 집적회로 이내에 검사부가 있는 것이 검사에 유리할 수 있다.Also, it may be advantageous for inspection to have an inspection unit within the integrated circuit.

검사부는 집적회로에 속해 있을 수 있고 검사 결과를 분석하고 로직(예, 도 7a에 도시되고 설명한 프로세서 서브유닛) 및/또는 메모리(예, 복수의 메모리 뱅크 전체)의 불량 등을 발견할 수 있다.The inspection unit may belong to the integrated circuit and may analyze the inspection result and detect failures in logic (eg, the processor subunit illustrated and described in FIG. 7A ) and/or memory (eg, an entire plurality of memory banks).

메모리 검사기는 보통 매우 단순하고 단순 형식에 따라 집적회로와 검사 벡터를 교환한다. 예를 들어, 기록될 메모리 엔트리의 어드레스의 쌍 및 메모리 엔트리에 기록될 값을 포함하는 쓰기 벡터가 있을 수 있다. 또한, 읽을 메모리 엔트리의 어드레스를 포함하는 읽기 벡터도 있을 수 있다. 쓰기 벡터의 어드레스의 적어도 일부는 읽기 벡터의 어드레스의 적어도 일부와 동일할 수 있다. 쓰기 벡터의 적어도 일부 다른 어드레스는 읽기 벡터의 적어도 일부 다른 어드레스와 상이할 수 있다. 프로그램이 되면, 메모리 검사기는 또한 읽을 메모리 엔트리의 어드레스와 읽을 예상 값을 포함할 수 있는 예상 결과 벡터를 수신할 수 있다. 메모리 검사기는 읽는 값에 예상 값을 비교할 수 있다.Memory testers are usually very simple and exchange test vectors with the integrated circuit according to a simple format. For example, there may be write vectors that include pairs of an address of a memory entry to be written and a value to be written to that entry. There may also be read vectors that include addresses of memory entries to be read. At least some of the addresses of the write vectors may be the same as at least some of the addresses of the read vectors. At least some other addresses of the write vectors may differ from at least some other addresses of the read vectors. When programmed, the memory tester may also receive expected result vectors that may include the addresses of the memory entries to be read and the values expected to be read. The memory tester may compare the values it reads with the expected values.
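The vector exchange described above can be sketched as follows. The data layout (a list of address/value pairs, a list of addresses, and a dictionary of expected values) and the names `run_memory_test` and `StuckAtZeroMemory` are assumptions made for illustration; the disclosure does not fix a concrete encoding for the vectors.

```python
class StuckAtZeroMemory(dict):
    """Dict-backed memory in which chosen addresses always store 0 (fault model)."""

    def __init__(self, size, stuck=()):
        super().__init__((address, 0) for address in range(size))
        self.stuck = set(stuck)

    def __setitem__(self, address, value):
        super().__setitem__(address, 0 if address in self.stuck else value)


def run_memory_test(memory, write_vector, read_vector, expected_vector):
    """Apply the write vector, read back, and return the mismatching addresses.

    write_vector:    list of (address, value) pairs to write
    read_vector:     list of addresses to read
    expected_vector: dict mapping each read address to its expected value
    """
    for address, value in write_vector:
        memory[address] = value
    return [address for address in read_vector
            if memory[address] != expected_vector[address]]


if __name__ == "__main__":
    mem = StuckAtZeroMemory(8, stuck=[5])
    writes = [(a, 0xFF) for a in range(8)]
    print(run_memory_test(mem, writes, list(range(8)), {a: 0xFF for a in range(8)}))
```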

일 실시예에 따르면, 집적회로(집적회로의 메모리 포함 또는 미포함)의 로직(예, 프로세서 서브유닛)은 동일한 프로토콜/형식을 사용하여 메모리 검사기에 의해 검사될 수 있다. 예를 들어, 쓰기 벡터의 값의 일부는 집적회로의 로직에 의해 실행될 명령일 수(및 계산 및/또는 메모리 접근을 포함할 수) 있다. 메모리 검사기는 메모리 엔트리 어드레스(적어도 일부는 계산의 예상 값을 저장)를 포함할 수 있는 예상 결과 벡터 및 읽기 벡터로 프로그램 될 수 있다. 따라서, 메모리 검사기는 로직뿐만 아니라 메모리의 검사에 사용될 수 있다. 메모리 검사기는 일반적으로 로직 검사기보다 훨씬 단순하고 저렴하며, 제시된 방법을 통해 단순한 메모리 검사기를 활용하여 복잡한 로직 검사를 수행할 수 있다.According to an embodiment, the logic (e.g., the processor subunits) of the integrated circuit (with or without the memory of the integrated circuit) may be tested by the memory tester using the same protocol/format. For example, some of the values in the write vectors may be commands to be executed by the logic of the integrated circuit (and may involve calculations and/or memory accesses). The memory tester may be programmed with read vectors and expected result vectors that may include memory entry addresses, at least some of which store the expected values of the calculations. Thus, the memory tester may be used to test the logic as well as the memory. Memory testers are typically much simpler and cheaper than logic testers, and the proposed method allows complex logic tests to be performed using a simple memory tester.
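One way to picture logic testing through the memory-tester format is a memory in which values written to a reserved address are interpreted as commands by the on-chip logic. Everything specific in this sketch is an assumption introduced for illustration: the reserved `COMMAND_ADDRESS`, the tuple encoding `(op, a, b, dst)`, the two operations, and the names `ComputeMemory` and `check_with_vectors`.

```python
class ComputeMemory:
    """Memory whose logic executes tuples written to COMMAND_ADDRESS (toy model)."""

    COMMAND_ADDRESS = 0  # hypothetical reserved entry

    def __init__(self, size):
        self.cells = [0] * size

    def write(self, address, value):
        if address == self.COMMAND_ADDRESS:
            op, a, b, dst = value  # the command is carried as the written value
            if op == "ADD":
                self.cells[dst] = self.cells[a] + self.cells[b]
            elif op == "XOR":
                self.cells[dst] = self.cells[a] ^ self.cells[b]
            else:
                raise ValueError(f"unknown op {op!r}")
        else:
            self.cells[address] = value

    def read(self, address):
        return self.cells[address]


def check_with_vectors(chip, write_vector, read_vector, expected_vector):
    """Same write/read/compare flow a plain memory tester would run."""
    for address, value in write_vector:
        chip.write(address, value)
    return [a for a in read_vector if chip.read(a) != expected_vector[a]]


if __name__ == "__main__":
    chip = ComputeMemory(8)
    writes = [(1, 3), (2, 4), (0, ("ADD", 1, 2, 3))]
    print(check_with_vectors(chip, writes, [3], {3: 7}))
```

The tester side is unchanged from a pure memory test: it only writes values, reads addresses, and compares against expected values, while the computation happens inside the chip.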

일부 실시예에서, 메모리 이내의 로직은 벡터(또는 기타 데이터 구조)만을 활용하고 로직 검사에 흔히 사용되는 더욱 복잡한 메커니즘(예, 인터페이스 등을 통해 컨트롤러와 통신하여 어느 회로를 검사할지 로직에 지시)을 활용하지 않고 메모리 이내의 로직의 검사를 가능하게 한다.In some embodiments, the logic within the memory may be tested using only vectors (or other data structures), without the more complex mechanisms commonly used in logic testing (e.g., communicating with a controller through an interface to tell the logic which circuit to test).

검사부를 사용하는 대신에, 메모리 컨트롤러는 구성 정보에 포함된 메모리 엔트리에 접근하라는 명령을 수신하고 접근 명령을 실행하고 결과를 출력하도록 구성될 수 있다.Instead of using the check unit, the memory controller may be configured to receive a command to access a memory entry included in the configuration information, execute the access command, and output a result.

도 65 내지 도 69에 도시된 집적회로의 하나 이상은 검사부가 없이도 또는 검사를 수행할 능력이 없는 검사기가 있어도 검사를 실행할 수 있다.One or more of the integrated circuits shown in FIGS. 65-69 may perform inspection without an inspection unit or even with an inspection unit incapable of performing the inspection.

본 개시의 실시예들은 메모리의 병렬화(parallelism) 및 내부 칩 대역폭을 활용하여 검사 시간을 단축하고 향상하는 방법 및 시스템을 포함할 수 있다.Embodiments of the present disclosure may include a method and system for shortening and improving test time by utilizing parallelism of memory and internal chip bandwidth.

방법과 시스템은 메모리 칩이 그 자체를 검사하고(검사기가 검사를 실행하고, 검사 결과를 읽고, 결과를 분석하는 것이 아닌), 결과를 저장하고, 및 궁극적으로 검사기가 결과를 읽도록 하는(및, 필요한 경우, 메모리 칩을 다시 프로그램하여 중복 메커니즘을 활성화하는 등) 것에 기반할 수 있다. 검사는 메모리의 검사 또는 메모리 뱅크와 로직의 검사(앞서 도 7a를 참조하여 설명한 바와 같은 검사할 기능적 논리부가 있는 계산 메모리의 경우)를 포함할 수 있다.The method and system may be based on having the memory chip test itself (rather than having the tester run the test, read the test results, and analyze them), save the results, and eventually allow the tester to read the results (and, if needed, to program the memory chip again, for example to activate redundancy mechanisms). The testing may include testing the memory, or testing the memory banks and the logic (in the case of a computational memory that has functional logic portions to test, as described above with reference to FIG. 7A).

일 실시예에서, 방법은 외부 대역폭이 검사를 제한하지 않도록 칩 이내에서 데이터를 읽기 및 쓰기 하는 단계를 포함할 수 있다.In one embodiment, the method may include reading and writing data within the chip such that external bandwidth does not limit the inspection.

메모리 칩이 프로세서 서브유닛을 포함하는 실시예에서, 각 프로세서 서브유닛은 검사 코드 또는 구성으로 프로그램 될 수 있다.In embodiments where the memory chip includes processor subunits, each processor subunit may be programmed with a test code or configuration.

메모리 칩이 검사 코드를 실행할 수 없는 프로세서 서브유닛을 포함하거나 프로세서 서브유닛이 없지만 메모리 컨트롤러를 포함하는 실시예에서, 메모리 컨트롤러는 패턴을 읽고 쓰고(예, 외부적으로 컨트롤러에 프로그램) 추가적인 분석을 위해 불량의 위치를 표시(예, 메모리 엔트리에 값을 기록하고, 엔트리를 읽고, 기록된 값과 다른 값을 수신)하도록 구성될 수 있다.In embodiments in which the memory chip includes processor subunits that cannot execute test code, or has no processor subunits but has memory controllers, the memory controllers may be configured to read and write patterns (e.g., patterns programmed into the controllers externally) and to mark the locations of faults (e.g., a value is written to a memory entry, the entry is read, and a value different from the written value is received) for further analysis.
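A minimal sketch of the write, read back, and mark flow that a memory controller might run inside the chip is shown below. The byte patterns, the closure-based bank model `make_bank`, and the function name `self_test_bank` are assumptions for this example only.

```python
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)  # illustrative byte patterns


def make_bank(size, stuck_high_bit_at=()):
    """Return (write, read) callables for a bank; bit 0 is stuck at 1 at the
    listed addresses (hypothetical fault model)."""
    cells = [0] * size
    faulty = set(stuck_high_bit_at)

    def write(address, value):
        cells[address] = (value | 1) if address in faulty else value

    def read(address):
        return cells[address]

    return write, read


def self_test_bank(write, read, size, patterns=PATTERNS):
    """Write each pattern to every entry, read it back, and collect bad addresses."""
    bad = set()
    for pattern in patterns:
        for address in range(size):
            write(address, pattern)
        for address in range(size):
            if read(address) != pattern:
                bad.add(address)  # mark the location for later analysis
    return sorted(bad)


if __name__ == "__main__":
    write, read = make_bank(16, stuck_high_bit_at={3})
    print(self_test_bank(write, read, 16))
```

Only the short list of failing addresses has to cross the external interface, which is the point of running the test inside the chip.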

여기서, 메모리의 검사에는 방대한 수의 비트의 검사, 예를 들어, 메모리의 각 비트를 검사하고 검사된 비트가 기능함을 확인하는 것이 필요할 수 있다. 또한, 메모리 검사는 상이한 전압 및 온도 조건 하에서 때때로 반복될 수 있다.Here, the inspection of the memory may require inspection of a vast number of bits, for example, inspecting each bit of the memory and verifying that the inspected bits function. Also, the memory test may be repeated from time to time under different voltage and temperature conditions.

일부 불량에 대해, 하나 이상의 중복 메커니즘이 활성화될 수(예, 플래시 또는 OTP를 프로그램 하거나 퓨즈를 태워서) 있다. 또한, 메모리 칩의 논리 및 아날로그 회로(예, 컨트롤러, 레귤레이터, I/O)의 검사도 필요할 수 있다.For some faults, more than one redundancy mechanism may be activated (eg by programming flash or OTP or burning a fuse). It may also be necessary to inspect the logic and analog circuits of the memory chip (eg, controllers, regulators, I/Os).

일 실시예에서, 집적회로는 기판, 기판 상에 배치된 메모리 어레이, 기판 상에 배치된 프로세싱 어레이, 및 기판 상에 배치된 인터페이스를 포함할 수 있다.In one embodiment, an integrated circuit may include a substrate, a memory array disposed on the substrate, a processing array disposed on the substrate, and an interface disposed on the substrate.

여기서 설명하는 집적회로는 도 3a, 도 3b, 도 4 내지 도 6, 도 7a 내지 도 7d, 도 11 내지 도 13, 도 16 내지 도 19, 도 22, 및 도 23의 하나 이상에 도시된 바와 같은 메모리 칩에 포함되거나 메모리 칩을 포함할 수 있다.The integrated circuit described herein may be a circuit as shown in one or more of FIGS. 3A, 3B, 4-6, 7A-7D, 11-13, 16-19, 22, and 23 . It may be included in a memory chip or may include a memory chip.

도 65 내지 도 69는 다양한 집적회로(5210) 및 검사기(5200)를 도시하고 있다.65 to 69 show various integrated circuits 5210 and testers 5200 .

집적회로는 메모리 뱅크(5212), 칩 인터페이스(5211)(예, 메모리 뱅크가 공유하는 버스(5213) 및 I/O 컨트롤러(5214)), 및 논리부(이하 "로직"으로 칭함)(5215)를 포함하는 것으로 도시되어 있다. 도 66은 퓨즈 인터페이스(5216) 및 퓨즈 인터페이스와 상이한 메모리 뱅크에 결합된 버스(5217)를 도시하고 있다.The integrated circuit is shown as including memory banks 5212, a chip interface 5211 (e.g., an I/O controller 5214 and a bus 5213 shared by the memory banks), and logic units (hereinafter referred to as "logic") 5215. FIG. 66 shows a fuse interface 5216 and a bus 5217 coupled to the fuse interface and to the different memory banks.

도 65 내지 도 70은 또한 다음과 같은 검사 프로세스의 다양한 단계를 도시하고 있다.FIGS. 65 to 70 also illustrate the following steps of the test process:

a. 검사 시퀀스를 쓰는 단계(5221) (도 65, 도 67, 도 68, 도 69); a. writing the check sequence 5221 (FIGS. 65, 67, 68, 69);

b. 검사 결과를 읽어오는 단계(5222) (도 67, 도 68, 도 69); b. reading the test result ( 5222 ) ( FIGS. 67 , 68 , and 69 );

c. 예상 결과 시퀀스를 쓰는 단계(5223) (도 65);c. writing the expected result sequence 5223 (FIG. 65);

d. 수정할 불량 어드레스를 읽어오는 단계(5224) (도 66); 및 d. reading a bad address to be corrected (5224) (FIG. 66); and

e. 퓨즈를 프로그램 하는 단계(5225) (도 66). e. Programming the fuse 5225 (FIG. 66).

각 메모리 뱅크는 그 자체의 논리부(5215)에 결합 및/또는 논리부(5215)에 의해 제어될 수 있다. 그러나 앞서 설명한 바와 같이, 논리부(5215)로의 모든 임의의 할당이 제공될 수 있다. 따라서, 논리부(5215)의 수는 메모리 뱅크의 수와 다를 수 있고, 논리부는 하나 이상의 메모리 뱅크 또는 한 메모리 뱅크의 일부 등을 제어할 수 있다.Each memory bank may be coupled to its own logic 5215 and/or controlled by logic 5215 . However, as previously described, any arbitrary assignment to logic 5215 may be provided. Accordingly, the number of logic units 5215 may be different from the number of memory banks, and the logic units may control one or more memory banks or portions of one memory bank, and the like.

논리부(5215)는 하나 이상의 검사부를 포함할 수 있다. 도 65는 논리부(5215) 내의 검사부(TU)(5218)를 도시하고 있다. TU는 논리부(5215)의 전체 또는 일부에 포함될 수 있다. 여기서, 검사부는 논리부와 분리되어 있거나 논리부와 일체일 수 있다.The logic 5215 may include one or more testing units. FIG. 65 shows a testing unit (TU) 5218 within the logic 5215. A TU may be included in all or some of the logic units 5215. Here, a testing unit may be separate from the logic unit or integrated with it.

도 65는 또한 TU(5218) 내의 검사 패턴 생성기(GEN으로 표시)(5219)를 도시하고 있다.65 also shows an inspection pattern generator (denoted GEN) 5219 within the TU 5218 .

검사 패턴 생성기는 검사부의 전체 또는 일부에 포함될 수 있다. 간략한 도시를 위해, 검사 패턴 생성기와 검사부는 도 66 내지 도 70에는 도시되어 있지 않지만 이러한 실시예들에 포함될 수 있다.The inspection pattern generator may be included in all or part of the inspection unit. For simplicity of illustration, the inspection pattern generator and the inspection unit are not shown in FIGS. 66 to 70 , but may be included in these embodiments.

메모리 어레이는 다중 메모리 뱅크를 포함할 수 있다. 또한, 프로세싱 어레이는 복수의 검사부를 포함할 수 있다. 복수의 검사부는 다중 메모리 뱅크를 검사하여 검사 결과를 제공하도록 구성될 수 있다. 인터페이스는 검사 결과를 나타내는 정보를 집적회로 외부의 장치로 출력하도록 구성될 수 있다.A memory array may include multiple memory banks. In addition, the processing array may include a plurality of inspection units. The plurality of test units may be configured to test multiple memory banks and provide test results. The interface may be configured to output information indicative of the test result to a device external to the integrated circuit.

복수의 검사부는 다중 메모리 뱅크의 하나 이상의 검사에 사용할 적어도 하나의 검사 패턴을 생성하도록 구성된 적어도 하나의 검사 패턴 생성기를 포함할 수 있다. 일부 실시예에서, 앞서 설명한 바와 같이, 복수의 검사부 각각은 복수의 검사부의 특정 검사부에 의해 사용될 검사 패턴을 생성하여 다중 메모리 뱅크의 적어도 하나를 검사하도록 구성된 검사 패턴 생성기를 포함할 수 있다. 앞서 설명한 바와 같이, 도 65는 검사부 내의 검사 패턴 생성기(GEN)(5219)를 도시하고 있다. 하나 이상의 또는 모든 논리부가 검사 패턴 생성기를 포함할 수 있다.The plurality of inspection units may include at least one inspection pattern generator configured to generate at least one inspection pattern for use in one or more inspections of multiple memory banks. In some embodiments, as described above, each of the plurality of inspection units may include an inspection pattern generator configured to generate an inspection pattern to be used by a specific inspection unit of the plurality of inspection units to test at least one of the multiple memory banks. As described above, FIG. 65 shows an inspection pattern generator (GEN) 5219 in the inspection unit. One or more or all of the logic may include a check pattern generator.

적어도 하나의 검사 패턴 생성기는 적어도 하나의 검사 패턴을 생성하기 위한 명령을 인터페이스로부터 수신하도록 구성될 수 있다. 검사 패턴은 검사 중에 접근되어야 할(예, 읽기 및/또는 쓰기 할) 메모리 엔트리 및/또는 엔트리에 기록될 값 등을 포함할 수 있다.The at least one inspection pattern generator may be configured to receive a command from the interface to generate the at least one inspection pattern. The test pattern may include memory entries to be accessed (eg, read and/or written) and/or values to be written to the entries during the test.
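To illustrate how a test pattern generator might expand a compact command received through the interface into the entries to access and the values to write, consider the sketch below. The command fields (`kind`, `start`, `count`), the two pattern kinds, and the name `generate_pattern` are hypothetical and are not taken from the disclosure.

```python
def generate_pattern(command):
    """Expand a compact command into (address, value) pairs.

    `command` is assumed to be a dict such as
    {"kind": "checkerboard", "start": 0, "count": 8}.
    """
    kind = command["kind"]
    start, count = command["start"], command["count"]
    for offset in range(count):
        address = start + offset
        if kind == "checkerboard":
            # Alternate complementary bytes on even and odd addresses.
            value = 0x55 if address % 2 == 0 else 0xAA
        elif kind == "address":
            # Each entry stores the low byte of its own address.
            value = address & 0xFF
        else:
            raise ValueError(f"unknown pattern kind {kind!r}")
        yield address, value


if __name__ == "__main__":
    print(list(generate_pattern({"kind": "checkerboard", "start": 4, "count": 3})))
```

Sending a short command instead of the full pattern keeps the traffic over the external interface small, since the pattern itself is produced inside the chip.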

인터페이스는 집적회로 외부에 있을 수 있는 외부 장치로부터 적어도 하나의 검사 패턴 생성을 위한 명령을 포함하는 구성 정보를 수신하도록 구성될 수 있다.The interface may be configured to receive configuration information including commands for generating at least one test pattern from an external device that may be external to the integrated circuit.

적어도 하나의 검사 패턴 생성기는 메모리 어레이로부터 적어도 하나의 검사 패턴을 생성하기 위한 명령을 포함하는 구성 정보를 읽어오도록 구성될 수 있다.The at least one test pattern generator may be configured to read configuration information including a command for generating the at least one test pattern from the memory array.

일부 실시예에서, 구성 정보는 벡터를 포함할 수 있다.In some embodiments, the configuration information may include a vector.

인터페이스는 적어도 하나의 검사 패턴일 수 있는 명령을 포함하는 구성 정보를 집적회로의 외부에 있을 수 있는 장치로부터 수신하도록 구성될 수 있다.The interface may be configured to receive configuration information comprising a command, which may be at least one test pattern, from a device that may be external to the integrated circuit.

예를 들어, 적어도 하나의 검사 패턴은 메모리 어레이의 검사 동안에 접근되어야 할 메모리 어레이 엔트리를 포함할 수 있다.For example, the at least one test pattern may include a memory array entry to be accessed during a test of the memory array.

적어도 하나의 검사 패턴은 메모리 어레이의 검사 동안에 접근된 메모리 어레이에 기록될 입력 데이터를 더 포함할 수 있다.The at least one test pattern may further include input data to be written to the memory array accessed during the test of the memory array.

추가적으로 또는 대안적으로, 적어도 하나의 검사 패턴은 메모리 어레이의 검사 동안에 접근된 메모리 어레이 엔트리에 기록될 입력 데이터 및 메모리 어레이의 검사 동안에 접근된 메모리 어레이 엔트리에서 읽어올 출력 데이터의 예상 값을 더 포함할 수 있다.Additionally or alternatively, the at least one check pattern may further include an expected value of input data to be written to a memory array entry accessed during testing of the memory array and output data to be read from a memory array entry accessed during testing of the memory array. can

일부 실시예에서, 복수의 검사부는 복수의 검사부에 의해 실행되면 복수의 검사부가 메모리 어레이를 검사하도록 유발하는 검사 명령을 메모리 어레이로부터 가져오도록 구성될 수 있다.In some embodiments, the plurality of inspection units may be configured to obtain a test command from the memory array that, when executed by the plurality of inspection units, causes the plurality of inspection units to inspect the memory array.

예를 들어, 검사 명령은 구성 정보에 포함될 수 있다.For example, the inspection command may be included in the configuration information.

구성 정보는 메모리 어레이 검사의 예상 결과를 포함할 수 있다.The configuration information may include expected results of memory array tests.

추가적으로 또는 대안적으로, 구성 정보는 메모리 어레이의 검사 동안에 접근된 메모리 어레이 엔트리로부터 읽어올 출력 데이터의 값을 포함할 수 있다.Additionally or alternatively, the configuration information may include values of output data to be read from memory array entries accessed during inspection of the memory array.

추가적으로 또는 대안적으로, 구성 정보는 벡터를 포함할 수 있다.Additionally or alternatively, the configuration information may include a vector.

일부 실시예에서, 복수의 검사부는 복수의 검사부에 의해 실행되면 복수의 검사부가 메모리 어레이를 검사하고 프로세싱 어레이를 검사하도록 유발하는 검사 명령을 메모리 어레이로부터 가져오도록 구성될 수 있다.In some embodiments, the plurality of inspection units may be configured to obtain a test command from the memory array that, when executed by the plurality of inspection units, causes the plurality of inspection units to inspect the memory array and to inspect the processing array.

구성 정보는 벡터를 포함할 수 있다.The configuration information may include a vector.

추가적으로 또는 대안적으로, 구성 정보는 메모리 어레이 검사와 프로세싱 어레이 검사의 예상 결과를 포함할 수 있다.Additionally or alternatively, the configuration information may include expected results of memory array tests and processing array tests.

일부 실시예에서, 앞서 설명한 바와 같이, 복수의 검사부는 다중 메모리 뱅크의 검사 동안에 사용되는 검사 패턴의 생성을 위한 검사 패턴 생성기가 없을 수 있다.In some embodiments, as described above, the plurality of inspection units may not have an inspection pattern generator for generating inspection patterns used during inspection of multiple memory banks.

이러한 실시예에서, 복수의 검사부의 적어도 2개의 검사부는 다중 메모리 뱅크의 적어도 2개의 메모리 뱅크를 병렬로 검사하도록 구성될 수 있다.In such an embodiment, at least two test units of the plurality of test units may be configured to test at least two memory banks of a multiple memory bank in parallel.

대안적으로, 복수의 검사부의 적어도 2개의 검사부는 다중 메모리 뱅크의 적어도 2개의 메모리 뱅크를 직렬로 검사하도록 구성될 수 있다. Alternatively, at least two test sections of the plurality of test sections may be configured to test at least two memory banks of the multiple memory bank in series.

In some embodiments, the information indicative of the test results may include identifiers of faulty memory array entries.

In some embodiments, the interface may be configured to retrieve, multiple times during the testing of the memory array, partial test results obtained by the plurality of test circuits.

In some embodiments, the integrated circuit may include an error correction unit configured to correct at least one error detected during the testing of the memory array. For example, the error correction unit may be configured to correct memory errors using any suitable method, such as disabling some memory words and replacing them with redundant words.

In any of the above embodiments, the integrated circuit may be a memory chip.

For example, the integrated circuit may include a distributed processor, wherein the processing array may include a plurality of subunits of the distributed processor, as shown in FIG. 7A.

In such embodiments, each of the processor subunits may be associated with a corresponding dedicated memory bank of the multiple memory banks.

In any of the embodiments described above, the information indicative of the test results may indicate a status of at least one memory bank. The status of a memory bank may be provided at one or more granularities: per memory word, per group of entries, or per entire memory bank.

FIGS. 65 and 66 illustrate four stages of a tester test phase.

In the first stage, the tester writes the test sequence (5221), and the logic units of the memory banks write the data to the memory. The logic may also be complex enough to receive a command from the tester and generate the test sequence on its own (as explained below).

In the second stage, the tester writes the expected results to the tested memory (5223), and the logic units compare the expected results with the data read from the memory banks and store a list of errors. Writing the expected results may be simplified if the logic is complex enough to generate the sequence of expected results on its own (as explained below).

In the third stage, the tester reads the faulty addresses from the logic units (5224).

In the fourth stage, the tester acts on the results (5225) and may fix the errors. For example, the tester may connect to a dedicated interface to program fuses in the memory, but it may also use any other mechanism that allows an error correction mechanism to be programmed within the memory.
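The four stages above may be illustrated with a minimal Python sketch. It is not the disclosed implementation: the class `BankLogic`, the stuck-at-zero fault model, and the `spare_N` repair labels are assumptions introduced only to make the flow concrete.

```python
class BankLogic:
    """Hypothetical model of the logic unit attached to one memory bank."""

    def __init__(self, size, stuck_at_zero=()):
        self.cells = [0] * size
        self.stuck = set(stuck_at_zero)  # simulated defective addresses
        self.faulty = []                 # error list kept by the logic unit

    def write(self, addr, value):
        # A stuck-at-zero cell ignores whatever is written to it.
        self.cells[addr] = 0 if addr in self.stuck else value

    def compare(self, addr, expected):
        # Compare stored data with the expected value and log mismatches.
        if self.cells[addr] != expected:
            self.faulty.append(addr)


def run_four_stages(logic, sequence):
    for addr, value in sequence:        # stage 1: write the test sequence (5221)
        logic.write(addr, value)
    for addr, expected in sequence:     # stage 2: write expected results (5223)
        logic.compare(addr, expected)
    faulty = list(logic.faulty)         # stage 3: read faulty addresses (5224)
    # Stage 4 (5225): act on the results; here each faulty address is simply
    # paired with a hypothetical spare resource.
    repairs = {addr: f"spare_{i}" for i, addr in enumerate(faulty)}
    return faulty, repairs
```

For instance, an eight-entry bank with a defect at address 3, tested with a non-zero pattern, would report only that address.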

In such embodiments, the memory tester may use vectors to test the memory.

For example, each vector may be constructed from an input series and an output series.

The input series may include pairs of an address and data to be written to the memory (in many embodiments, this series may be modeled as a formula that allows a program, such as one executed by the logic units, to generate it when needed).

In some embodiments, the test pattern generator may generate such vectors.

A vector is an example of a data structure, and other embodiments may use other data structures. The data structure may comply with other test data structures generated by a tester located outside the integrated circuit.

The output series may include pairs of an address and data, the data being the value expected to be read back from the memory (in some embodiments, this series may be generated by a program at runtime, for example by the logic units).

Memory testing generally includes executing a list of vectors, where each vector writes data to the memory according to the input series and then reads the data back according to the output series and compares it with the expected data.

In the case of a mismatch, the memory may be classified as faulty or, if the memory includes a redundancy mechanism, the redundancy mechanism may be activated so that the vectors are tested again on the activated redundancy mechanism.
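The vector-based procedure, including the retest after redundancy is activated, may be sketched as follows. This is a simplified illustration under stated assumptions: `RedundantMemory`, its spare-entry remapping, and the `"pass"`/`"faulty"` labels are hypothetical and are not taken from the disclosure.

```python
class RedundantMemory:
    """Hypothetical memory with a limited number of spare entries."""

    def __init__(self, size, bad=(), spares=1):
        self.cells = [0] * size
        self.bad = set(bad)      # simulated defective addresses (writes are lost)
        self.remap = {}          # address -> spare storage once redundancy is active
        self.spares_left = spares

    def write(self, addr, data):
        if addr in self.remap:
            self.remap[addr] = data
        elif addr not in self.bad:
            self.cells[addr] = data

    def read(self, addr):
        return self.remap[addr] if addr in self.remap else self.cells[addr]

    def activate_redundancy(self, addr):
        if self.spares_left == 0:
            return False
        self.spares_left -= 1
        self.remap[addr] = 0
        return True


def run_vector(memory, input_series, output_series):
    """Write the input series, then return the addresses whose read-back mismatches."""
    for addr, data in input_series:
        memory.write(addr, data)
    return [addr for addr, expected in output_series if memory.read(addr) != expected]


def check_with_redundancy(memory, vectors):
    for input_series, output_series in vectors:
        mismatches = run_vector(memory, input_series, output_series)
        if mismatches:
            # Try to repair every mismatching address, then rerun the same vector.
            if not all(memory.activate_redundancy(a) for a in mismatches):
                return "faulty"
            if run_vector(memory, input_series, output_series):
                return "faulty"
    return "pass"
```

With one spare entry, a memory with a single defective address passes after the retest, while a memory with two defective addresses is classified as faulty.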

In embodiments in which the memory includes processor subunits (as described with reference to FIG. 7A) or multiple memory controllers, the entire test may be handled by the logic units of the memory banks. Thus, a memory controller or a processor subunit may perform the test.

The memory controller may be programmed from the tester, and the test results may be saved in the memory controller itself for later reading by the tester.

To configure and test the operation of the logic units, the tester may configure the logic units for memory access and confirm that the test results can be read by memory access.

For example, the input vector may include a programming sequence for the logic unit, and the output vector may include the expected results of such a test. For example, if a logic unit such as a processor subunit includes a multiplier or an adder configured to perform computations on two addresses in the memory, the input vector may include a set of commands that write data to the memory and a set of commands that write to the adder/multiplier logic. As long as the adder/multiplier result can be read back as an output vector, the result can be sent to the tester.
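As an illustration of testing an adder or multiplier through memory accesses alone, the following sketch applies an input vector of commands and then checks an output vector. The command encoding (`"write"`, `"add"`, `"mul"`) and the dictionary used as memory are assumptions made for this example only.

```python
def execute_input_vector(memory, commands):
    """Apply memory-write commands and adder/multiplier commands in order."""
    for op, *args in commands:
        if op == "write":
            addr, data = args
            memory[addr] = data
        elif op == "add":
            src_a, src_b, dst = args
            memory[dst] = memory[src_a] + memory[src_b]
        elif op == "mul":
            src_a, src_b, dst = args
            memory[dst] = memory[src_a] * memory[src_b]
        else:
            raise ValueError(f"unknown command: {op}")


def read_output_vector(memory, output_vector):
    """Return the addresses whose stored value differs from the expected value."""
    return [addr for addr, expected in output_vector if memory.get(addr) != expected]
```

An empty list returned by `read_output_vector` indicates that the logic produced every expected result.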

The testing may further include loading the logic configuration from the memory and having the logic output sent to the memory.

In embodiments in which the logic unit loads its configuration from the memory (e.g., when the logic is a memory controller), the logic unit may run code from the memory itself.

Accordingly, the input vector may include a program for the logic unit, and the program itself may test various circuits of the logic unit.

Thus, the testing need not be limited to receiving vectors in a format used by an external tester.

If the instructions loaded into the logic unit direct the logic unit to write the results back to the memory bank, the tester may read those results and compare them with the expected output series.

For example, the vector written to the memory may be or may include a test program for the logic unit (e.g., the test may assume that the memory is valid; even if it is not, the written test program will not work and the test will fail, which is an acceptable result because the chip is invalid anyway) and/or instructions for how the logic unit runs the code and writes the results back to the memory. Because all testing of the logic unit may be performed through the memory (e.g., writing logic test inputs to the memory and writing test results back to the memory), the tester may run a simple vector test with an input sequence and an expected output sequence.

The logic configuration and results may be accessed using read and/or write commands.

FIG. 68 shows a tester 5200 that sends a write test sequence 5221, which is a vector.

Portions of the vector include test code 5232 that is split among the memory banks 5212 coupled to the logic 5215 of the processing array.

Each logic 5215 may execute the code 5232 stored in its associated memory bank, and the execution may include accessing one or more memory banks, performing computations, and storing the results (e.g., test results 5231) in the memory banks 5212.

The test results may be read back by the tester 5200 (e.g., read results 5222).

In this way, the logic 5215 may be controlled by commands received by the I/O controller 5214.

In FIG. 68, the I/O controller 5214 is connected to the memory banks and to the logic. In other embodiments, the logic may be connected between the I/O controller 5214 and the memory banks.

FIG. 70 illustrates a method 5300 for testing memory banks. For example, method 5300 may be implemented using one or more of the memory banks described above with reference to FIGS. 65 to 69.

Method 5300 may include steps 5302, 5310, and 5320. In step 5302, a request to test memory banks of an integrated circuit may be received. The integrated circuit may include a substrate, a memory array disposed on the substrate and including the memory banks, a processing array disposed on the substrate, and an interface disposed on the substrate. The processing array may include a plurality of test units, as described above.

In some embodiments, the request may include configuration information, one or more vectors, commands, and the like.

In such embodiments, the configuration information may include expected results of the memory array test, instructions, data, values of output data to be read from the memory array entries accessed during the testing of the memory array, test patterns, and the like.

A test pattern may include at least one of (i) the memory array entries to be accessed during the testing of the memory array, (ii) input data to be written to the memory array entries accessed during the testing of the memory array, and (iii) expected values of output data to be read from the memory array entries accessed during the testing of the memory array.

Step 5302 may include and/or may be followed by at least one of the following steps:

a. receiving, by at least one test pattern generator and from the interface, instructions for generating at least one test pattern;

b. receiving, by the interface and from an external device outside the integrated circuit, configuration information that includes instructions for generating at least one test pattern;

c. reading, by at least one test pattern generator and from the memory array, configuration information that includes instructions for generating at least one test pattern;

d. receiving, by the interface and from an external device outside the integrated circuit, configuration information that includes instructions that are at least one test pattern;

e. fetching, by the plurality of test units and from the memory array, test instructions that, when executed by the plurality of test units, cause the plurality of test units to test the memory array; and

f. receiving, by the plurality of test units and from the memory array, test instructions that, when executed by the plurality of test units, cause the plurality of test units to test the memory array and to test the processing array.

Step 5302 may be followed by step 5310. In step 5310, the plurality of test units may test the multiple memory banks in response to the request and provide test results.

Method 5300 may further include receiving, by the interface and multiple times during the testing of the memory array, partial test results obtained by the plurality of test units.

Step 5310 may include and/or may be followed by at least one of the following steps:

a. generating, by one or more test pattern generators (e.g., test pattern generators included in one, some, or all of the plurality of test units), test patterns to be used by the test units for testing at least one of the multiple memory banks;

b. testing, by at least two test units of the plurality of test units, at least two memory banks of the multiple memory banks in parallel;

c. testing, by at least two test units of the plurality of test units, at least two memory banks of the multiple memory banks serially;

d. writing values to memory entries, reading the memory entries, and comparing the results; and

e. correcting, by an error correction unit, at least one error detected during the testing of the memory array.
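The parallel and serial options in the list above may be contrasted with a short sketch. It is only a software analogy: Python threads stand in for independent on-chip test units, and the bank representation (a dictionary with `cells` and `bad` keys) is an assumption of this example.

```python
from concurrent.futures import ThreadPoolExecutor


def check_bank(bank, pattern):
    """Write each value, read it back, and return the addresses that mismatch."""
    faulty = []
    for addr, value in pattern:
        # Simulated defect: a bad cell always holds zero.
        bank["cells"][addr] = 0 if addr in bank["bad"] else value
        if bank["cells"][addr] != value:
            faulty.append(addr)
    return faulty


def check_serial(banks, pattern):
    # One bank after another.
    return [check_bank(bank, pattern) for bank in banks]


def check_parallel(banks, pattern):
    # One worker per bank; results come back in bank order.
    with ThreadPoolExecutor(max_workers=len(banks)) as pool:
        return list(pool.map(lambda bank: check_bank(bank, pattern), banks))
```

Both functions return the same per-bank fault lists; only the scheduling differs.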

Step 5310 may be followed by step 5320. In step 5320, the interface may output information indicative of the test results outside the integrated circuit.

The information indicative of the test results may include identifiers of faulty memory array entries. This may save time because the read data for each memory entry need not be transmitted.

Additionally or alternatively, the information indicative of the test results may indicate a status of at least one memory bank.

Accordingly, in some embodiments, the information indicative of the test results may be much smaller than the aggregate size of the data units written to or read from the memory banks during the test, and much smaller than the input data that would be sent by a tester that tests the memory without the assistance of the test units.
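To make the size argument concrete, the sketch below condenses the faulty addresses of one bank into a report at three granularities (per entry, per group of entries, per bank). The function name, the report keys, and the fixed `group_size` are illustrative assumptions rather than a format defined by the disclosure.

```python
def summarize_bank(faulty_entries, bank_size, group_size):
    """Condense raw faulty addresses into a compact multi-granularity report."""
    faulty = sorted(set(faulty_entries))
    if any(addr < 0 or addr >= bank_size for addr in faulty):
        raise ValueError("faulty address outside the bank")
    return {
        "bank_ok": not faulty,                                          # per-bank status
        "faulty_groups": sorted({addr // group_size for addr in faulty}),  # per-group status
        "faulty_entries": faulty,                                       # per-entry identifiers
    }
```

For a 1024-entry bank with two faulty entries, the report carries a handful of integers instead of 1024 read-back words.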

The tested integrated circuit may include a memory chip and/or a distributed processor as illustrated in one or more of the figures described above. For example, the integrated circuit described here may be included in, or may include, a memory chip as illustrated in one or more of FIGS. 3A, 3B, 4 to 6, 7A to 7D, 11 to 13, 16 to 19, 22, and 23.

FIG. 71 illustrates an example of a method 5350 for testing memory banks of an integrated circuit. For example, method 5350 may be implemented using one or more of the memory banks described above with reference to FIGS. 65 to 69.

Method 5350 may include steps 5352, 5355, and 5358. In step 5352, an interface of the integrated circuit may receive configuration information that includes instructions. The integrated circuit that includes the interface may also include a substrate, a memory array that includes memory banks and is disposed on the substrate, and a processing array disposed on the substrate; the interface is also disposed on the substrate.

The configuration information may include expected results of the memory array test, instructions, data, values of output data to be read from the memory array entries accessed during the testing of the memory array, test patterns, and the like.

Additionally or alternatively, the configuration information may include the instructions, addresses of the memory entries in which to write the instructions, and input data, and may also include addresses of memory entries for receiving output values calculated during execution of the instructions.

A test pattern may include at least one of (i) the memory array entries to be accessed during the testing of the memory array, (ii) input data to be written to the memory array entries accessed during the testing of the memory array, and (iii) expected values of output data to be read from the memory array entries accessed during the testing of the memory array.

Step 5352 may be followed by step 5355. In step 5355, the processing array may execute the instructions by accessing the memory array, performing computational operations, and providing results.

Step 5355 may be followed by step 5358. In step 5358, the interface may output information indicative of the results outside the integrated circuit.
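Steps 5352, 5355, and 5358 may be pictured with the following sketch, in which the configuration information carries instructions, the address at which to store them, input data, and an output address. The `ConfigInfo` fields, the accumulator-style instruction encoding, and the dictionary used as the memory array are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ConfigInfo:
    instructions: list     # hypothetical encoding: ("load" | "add" | "mul", address)
    instruction_addr: int  # memory entry in which the instructions are written
    input_data: dict       # address -> input value
    output_addr: int       # memory entry that receives the computed output


def step_5352(memory, cfg):
    """Interface receives the configuration and places it in the memory array."""
    memory[cfg.instruction_addr] = list(cfg.instructions)
    memory.update(cfg.input_data)


def step_5355(memory, cfg):
    """Processing array fetches the instructions from memory and executes them."""
    acc = 0
    for op, addr in memory[cfg.instruction_addr]:
        if op == "load":
            acc = memory[addr]
        elif op == "add":
            acc += memory[addr]
        elif op == "mul":
            acc *= memory[addr]
        else:
            raise ValueError(f"unknown instruction: {op}")
    memory[cfg.output_addr] = acc


def step_5358(memory, cfg):
    """Interface outputs information indicative of the result."""
    return memory[cfg.output_addr]
```

Running the three steps in order on a small program that computes (2 + 3) * 4 would leave the value 20 at the output address.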

Cybersecurity and Tamper Detection Methods

Memory chips and/or processors may be targeted by malicious actors and may be subjected to various types of cyber attacks. In some cases, such attacks may attempt to alter data and/or code stored in one or more memory resources. Cyber attacks may be especially problematic for trained neural networks or other types of artificial intelligence (AI) models that depend on significant amounts of data stored in memory. If the stored data is manipulated or even obscured, the consequences can be harmful. For example, an autonomous vehicle system that relies on data-intensive AI models to identify other vehicles, pedestrians, and the like may misjudge the surroundings of the host vehicle if the data it relies on is corrupted or obscured. As a result, an accident may occur. As AI models are applied in a wider range of technical fields, cyber attacks on the data associated with such models pose a risk of major disruption.

In other cases, a cyber attack may involve one or more actors tampering with, or attempting to tamper with, operational parameters associated with a processor or another type of integrated-circuit-based logic element. For example, processors are often designed to operate within certain operational specifications. A cyber attack involving tampering may attempt to change one or more of the operational parameters of a processor, memory unit, or other circuit so that it exceeds its designed operational specifications (e.g., clock speed, bandwidth specifications, temperature limits, operation rates). Such tampering may lead to a failure of the targeted hardware.

Conventional techniques for defending against cyber attacks may include computer programs that operate at the processor level (e.g., anti-virus or anti-malware software). Other techniques may include the use of software-based firewalls associated with routers or other hardware. While these techniques counter cyber attacks using software programs executed outside the memory unit, there remains a need for additional or alternative techniques that efficiently protect the data stored in the memory unit, especially where the accuracy and availability of that data is critical to the operation of memory-intensive applications such as neural networks. Embodiments of the present disclosure provide various integrated circuit designs that include memory resistant to cyber attacks against the memory.

Bringing sensitive information and instructions into the integrated circuit in a secure manner (e.g., during a boot process, when interfaces to the outside of the chip/integrated circuit are not yet active), then keeping them within the integrated circuit without exposing them outside it, and completing the computations within the integrated circuit, may improve the security of the sensitive information and instructions. CPUs and other types of processing units are vulnerable to cyber attacks, especially when they operate together with external memory. The disclosed embodiments, which include distributed processor subunits disposed on a memory chip amongst a memory array that includes a plurality of memory banks, may be less vulnerable to cyber attacks and tampering (e.g., because the processing takes place within the memory chip). Any combination of the disclosed safety measures, described in more detail below, may further reduce the vulnerability of the disclosed embodiments to cyber attacks and/or tampering.

FIG. 72A illustrates an integrated circuit 7200 that includes a memory array and a processing array, consistent with embodiments of the present disclosure. For example, integrated circuit 7200 may include any of the distributed processor-on-a-memory-chip architectures (and features) described above and throughout the disclosure. The memory array and the processing array may be formed on a common substrate, and in some embodiments of the present disclosure, integrated circuit 7200 may constitute a memory chip. For example, as described above, integrated circuit 7200 may include a memory chip that includes a plurality of memory banks and a plurality of processor subunits spatially distributed on the memory chip, where each of the plurality of memory banks is associated with one or more dedicated processor subunits of the plurality of processor subunits. In some cases, each processor subunit may be dedicated to one or more memory banks.

In some embodiments, as shown in FIG. 72A, the memory array may include a plurality of discrete memory banks 7210_1, 7210_2, ... 7210_J1, 7210_Jn. Consistent with embodiments of the present disclosure, memory array 7210 may include one or more types of memory, including, for example, volatile memory (RAM, DRAM, SRAM, phase-change RAM (PRAM), magnetoresistive RAM (MRAM), resistive RAM (ReRAM), etc.) or non-volatile memory (flash memory or ROM). According to some embodiments of the present disclosure, memory banks 7210_1 to 7210_Jn may include a plurality of MOS memory structures.

As mentioned above, the processing array may include a plurality of processor subunits 7220_1 to 7220_K. In some embodiments, each of processor subunits 7220_1 to 7220_K may be associated with one or more discrete memory banks among the plurality of discrete memory banks 7210_1 to 7210_Jn. While the example embodiment of FIG. 72A shows each processor subunit associated with two discrete memory banks 7210, it should be appreciated that each processor subunit may be associated with any number of dedicated discrete memory banks. Conversely, each memory bank may be associated with any number of processor subunits. Consistent with embodiments of the present disclosure, the number of discrete memory banks included in the memory array of integrated circuit 7200 may be equal to, less than, or greater than the number of processor subunits included in the processing array of integrated circuit 7200.

Integrated circuit 7200 may further include a plurality of first buses 7260, consistent with embodiments of the present disclosure (and as described above). Each bus 7260 may connect a processor subunit 7220_k to a corresponding dedicated memory bank 7210_j. According to some embodiments of the present disclosure, integrated circuit 7200 may further include a plurality of second buses 7261. Each bus 7261 may connect a processor subunit 7220_k to another processor subunit 7220_k+1. As shown in FIG. 72A, the plurality of processor subunits 7220_1 to 7220_K may be connected to one another via buses 7261. While FIG. 72A shows the plurality of processor subunits 7220_1 to 7220_K connected in series via buses 7261 to form a loop, it should be appreciated that processor subunits 7220 may be connected in any other manner. For example, in some cases, a particular processor subunit may not be connected to other processor subunits via bus 7261. In other cases, a particular processor subunit may be connected to only one other processor subunit, and in still other cases, a particular processor subunit may be connected to two or more other processor subunits via one or more buses 7261 (e.g., forming series connections, parallel connections, branched connections, and so on). The embodiments of integrated circuit 7200 described here are examples only. In some cases, integrated circuit 7200 may have different internal components and connections, and in other cases, one or more of the internal components and described connections may be omitted (e.g., depending on the needs of a particular application).
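The loop arrangement of FIG. 72A, with each processor subunit tied to its dedicated banks by first buses and to its neighbor by a second bus, may be modeled as a small data structure. The function below is a hypothetical sketch: zero-based indices and a fixed number of banks per subunit are simplifying assumptions, and other topologies are equally possible.

```python
def build_topology(num_subunits, banks_per_subunit=2):
    """Return (first_buses, second_buses) for a looped subunit arrangement."""
    # First buses: subunit k -> indices of its dedicated memory banks.
    first_buses = {
        k: [k * banks_per_subunit + j for j in range(banks_per_subunit)]
        for k in range(num_subunits)
    }
    # Second buses: subunit k -> subunit k+1, with the last wrapping to the first.
    second_buses = [(k, (k + 1) % num_subunits) for k in range(num_subunits)]
    return first_buses, second_buses
```

With four subunits and two banks each, subunit 1 owns banks 2 and 3, and the second buses close the loop from subunit 3 back to subunit 0.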

도 72a를 다시 참조하면, 집적회로(7200)는 집적회로(7200)에 대한 적어도 하나의 보안 조치를 이행하기 위한 하나 이상의 구조를 포함할 수 있다. 일부 경우에서, 이러한 구조는 하나 이상의 메모리 뱅크에 저장된 데이터를 조작 또는 불명확하게 하는(또는 조작 또는 불명확하게 하려 시도하는) 사이버 공경을 검출하도록 구성될 수 있다. 다른 경우에서, 이러한 구조는 집적회로(7200)와 연관된 동작 파라미터의 위조 또는 집적회로(7200)와 연관된 하나 이상의 동작에 직접적으로 또는 간접적으로 영향을 미치는 하나 이상의 하드웨어 요소(집적회로(7200) 내부에 포함되거나 집적회로(7200) 외부의 하드웨어 요소)의 위조를 검출하도록 구성될 수 있다. Referring back to FIG. 72A , the integrated circuit 7200 may include one or more structures for implementing at least one security measure for the integrated circuit 7200 . In some cases, such structures may be configured to detect cyber hoaxes that manipulate or obscure (or attempt to manipulate or obscure) data stored in one or more memory banks. In other instances, such a structure may include falsification of operating parameters associated with integrated circuit 7200 or one or more hardware elements that directly or indirectly affect one or more operations associated with integrated circuit 7200 (within integrated circuit 7200 ). hardware elements included or external to the integrated circuit 7200).

In some cases, a controller 7240 may be included in integrated circuit 7200. Controller 7240 may be connected, for example, to one or more of processor subunits 7220_1 ... 7220_k via one or more buses 7250. Controller 7240 may also be connected to one or more of memory banks 7210_1 ... 7210_Jn. Although a single controller 7240 is shown in FIG. 72A, it will be appreciated that controller 7240 may include multiple processor elements and/or logic circuits. In the disclosed embodiments, controller 7240 may be configured to implement at least one security measure for at least one operation of integrated circuit 7200. Further, in the disclosed embodiments, controller 7240 may be configured to take (or cause) one or more remedial actions when the at least one security measure is triggered.

According to some embodiments of the present disclosure, the at least one security measure may include a process, implemented by the controller, for blocking access to particular aspects of integrated circuit 7200. Blocking access involves having the controller prevent access (for reading and/or writing) to particular regions of memory from outside the chip. Access control may be applied at the resolution of an address, a portion of a memory bank, an entire memory bank, and so on. In some cases, one or more physical locations of memory associated with integrated circuit 7200 (e.g., one or more memory banks of integrated circuit 7200, or any portion of one or more memory banks) may be blocked. In some embodiments, controller 7240 may block access to particular portions of integrated circuit 7200 associated with the execution of an artificial intelligence model (or another type of software-based system). For example, in some embodiments, controller 7240 may block access to the weights of a neural network model stored in memory associated with integrated circuit 7200. Here, a software program (i.e., a model) may include three components: input data to the program, code data of the program, and output data from execution of the program. These components also apply to neural network models. During operation of such a model, input data may be generated and fed to the model, and executing the model may produce output data to be read. The program code and data values associated with executing the model on the received input data, however, may remain fixed.

As used herein, the term 'blocking' may refer to an operation of the controller that disallows, for example, read or write operations to a particular region of memory that are initiated from outside the chip/integrated circuit. The controller, through which the I/O of the chip/integrated circuit may pass, may block an entire memory bank, but it may also block any range of memory addresses within a memory bank, from a single memory address up to a range covering every address of the available memory bank (or any address range in between).

Because the memory locations associated with receiving input data and storing output data hold changing values and are involved in interactions with components external to integrated circuit 7200 (components that supply the input data or receive the output data), blocking access to these memory locations may be impractical in some cases. By contrast, blocking access to the memory locations associated with model code and fixed data values can be effective against certain types of cyberattacks. Accordingly, in some embodiments, memory associated with program code and data values (e.g., memory not used for writing/receiving input data or for reading/providing output data) may be blocked as a security measure. The access restriction may include blocking particular memory locations so that particular program code and/or data values (e.g., values associated with executing a model on received input data) cannot be changed. Memory regions associated with intermediate data (e.g., data generated during execution of the model) may also be blocked from external access. Thus, while various computational logic, whether located on integrated circuit 7200 or external to it, may provide data to, or receive data from, the memory locations associated with receiving input data or retrieving generated output data, such computational logic has no ability to access or modify the memory locations that store the program code and data values associated with executing the program on the received input data.

In addition to blocking memory locations on integrated circuit 7200 to provide a security measure, other security measures may be implemented by restricting access to particular computational logic elements (and the memory regions they access) that are configured to execute code associated with a particular program or model. In some cases, such access restrictions may be applied to computational logic (and associated memory regions) located on integrated circuit 7200 (e.g., computational memory, that is, memory with computational capability such as the distributed processors on a memory chip of the present disclosure). Access to computational logic (and associated memory regions) associated with any execution of code stored in a blocked memory portion of integrated circuit 7200, or with any access to data values stored in a blocked memory portion of integrated circuit 7200, may also be blocked or restricted, regardless of whether that computational logic is located on integrated circuit 7200. Restricting access to the computational logic responsible for executing the program/model can further ensure that the code and data values associated with operating on received input data remain protected from manipulation, obscuring, and the like.

The security measures implemented by the controller, including blocking or restricting access to hardware-based regions associated with particular portions of the memory array of integrated circuit 7200, may be implemented in any suitable manner. In some embodiments, such blocking may be implemented by adding or providing instructions to controller 7240 that are configured to cause controller 7240 to block particular memory portions. In some embodiments, the hardware-based memory portions to be blocked may be designated by particular memory addresses (e.g., addresses associated with any memory elements of memory banks 7210_1 ... 7210_J2). In some embodiments, the blocked region of memory may remain fixed throughout execution of a program or model. In other cases, the blocked region may be configurable. That is, in some cases, controller 7240 may be provided with instructions such that the blocked region can change during execution of a program or model. For example, at a particular time, particular memory locations may be added to the blocked region of memory. Alternatively, at a particular time, particular memory locations (e.g., previously blocked memory locations) may be removed from the blocked region of memory.

Blocking of particular memory locations may be accomplished in any manner. In some cases, a record of the blocked memory locations (e.g., a file, database, or data structure that stores and identifies the blocked memory addresses) may be accessible to controller 7240, so that controller 7240 can determine whether a particular memory request relates to a blocked memory location. In some cases, controller 7240 maintains a database of blocked addresses and uses it to control access to particular memory locations. In other cases, the controller may have a table or a set of one or more registers that are configurable until locked and that may contain fixed, predetermined values identifying the memory locations to be blocked (e.g., memory locations to which memory access from outside the chip is restricted). For example, when a memory access is requested, controller 7240 may compare the memory address associated with the request against the blocked memory addresses. If the memory address associated with the request is determined to be in the list of blocked memory addresses, the memory access request may be denied (regardless of whether it is a read or a write operation).
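
The address comparison described above can be illustrated with a minimal software sketch. It is not the disclosed hardware: the names `BlockTable` and `handle_external_request`, the inclusive (start, end) range representation, and the example addresses are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class BlockTable:
    """Hypothetical record of blocked address ranges kept by the controller."""
    ranges: list = field(default_factory=list)  # (start, end) pairs, inclusive

    def block(self, start: int, end: int) -> None:
        self.ranges.append((start, end))

    def is_blocked(self, address: int) -> bool:
        return any(start <= address <= end for start, end in self.ranges)


def handle_external_request(table: BlockTable, address: int, op: str) -> bool:
    """Return True if an external access is allowed, False if it is denied.

    The operation type does not affect the decision: reads and writes to a
    blocked address are both denied.
    """
    if op not in ("read", "write"):
        raise ValueError(f"unknown operation: {op}")
    return not table.is_blocked(address)


table = BlockTable()
table.block(0x1000, 0x1FFF)  # assumed region holding fixed model data
table.block(0x4000, 0x4000)  # a single blocked address

print(handle_external_request(table, 0x0800, "write"))  # True: not blocked
print(handle_external_request(table, 0x1800, "read"))   # False: blocked
```

A range list keeps the sketch short; a register set holding fixed start and end values, as the paragraph mentions, would support the same comparison.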

As described above, the at least one security measure may include blocking access to particular memory portions of memory array 7210 that are not used for receiving input data or for providing access to generated output data. In some cases, the memory portions within the blocked region may be adjusted. For example, blocked memory portions may be unblocked, and unblocked memory portions may be blocked. Any suitable method may be used to unblock a blocked memory portion. For example, the implemented security measure may require a password in order to unblock one or more portions of the blocked memory region.

An implemented security measure may be triggered upon detection of any action that runs counter to it. For example, an attempt to access a blocked memory portion (whether a read or a write request) may trigger the security measure. The security measure may also be triggered if an entered password (e.g., one entered in an attempt to unblock a blocked memory portion) does not match a predetermined password. In some cases, the security measure may be triggered if the correct password is not entered within an allowed threshold number of password entry attempts (e.g., 1, 2, 3, etc.).
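
The password-attempt threshold described above might be sketched as follows. The class name `UnblockGuard`, its fields, and the choice to refuse all further attempts once the measure has triggered are illustrative assumptions; the disclosure does not specify how the password is stored or compared.

```python
import hmac


class UnblockGuard:
    """Hypothetical guard that counts failed attempts to unblock memory."""

    def __init__(self, password: bytes, max_attempts: int = 3):
        self._password = password
        self._max_attempts = max_attempts
        self._failed = 0
        self.triggered = False   # True once the security measure has fired
        self.unblocked = False

    def try_unblock(self, candidate: bytes) -> bool:
        # Assumption: once triggered, no further attempts are accepted.
        if self.triggered:
            return False
        # Constant-time comparison avoids leaking how many bytes matched.
        if hmac.compare_digest(candidate, self._password):
            self.unblocked = True
            self._failed = 0
            return True
        self._failed += 1
        if self._failed >= self._max_attempts:
            self.triggered = True
        return False


guard = UnblockGuard(b"example-password", max_attempts=2)
guard.try_unblock(b"wrong-1")
guard.try_unblock(b"wrong-2")
print(guard.triggered)  # True: the allowed number of attempts was used up
```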

Memory portions may be blocked at any time. For example, in some cases, memory portions may be blocked at various times during program execution. In other cases, memory portions may be blocked at and/or before the start of program/model execution. For example, the memory addresses to be blocked may be determined and identified together with the programming of the program/model code, or as the data to be accessed by the program/model is generated and stored. In this way, the vulnerability of memory array 7210 to attack may be reduced or eliminated at or after the time program/model execution begins, after the data to be used by the program/model has been generated and stored, and so on.

Blocked memory may be unblocked by any suitable method and at any suitable time. As described above, a blocked memory portion may be unblocked after the correct password is received. In other cases, blocked memory may be unblocked by restarting (by command or by power cycling) or by erasing the entire memory array 7210. Additionally or alternatively, a release command sequence may be executed to unblock one or more memory portions.

Consistent with embodiments of the present disclosure, and as described above, controller 7240 may be configured to control traffic to and from integrated circuit 7200, particularly traffic from sources external to integrated circuit 7200. For example, as shown in FIG. 72A, traffic between components external to integrated circuit 7200 and components internal to it (e.g., memory array 7210 or processor subunits 7220) may be controlled by controller 7240. Such traffic may pass through controller 7240, or through one or more buses (e.g., 7250, 7260, or 7261) that are controlled or monitored by controller 7240.

According to some embodiments of the present disclosure, integrated circuit 7200 may receive immutable data (e.g., fixed data such as model weights and coefficients) or particular instructions (e.g., code identifying the memory portions to be blocked) during a boot process. Here, immutable data refers to data that remains fixed throughout execution of a program or model and stays unchanged until a subsequent boot process. During program execution, integrated circuit 7200 may interact with mutable data, which may include input data to be processed and/or output data generated by processing associated with integrated circuit 7200. As described above, access to memory array 7210 or processing array 7220 may be restricted during execution of a program or model. For example, access may be limited to particular portions of memory array 7210, or to particular processor subunits, that are associated with processing of or interaction with input data to be written or generated output data to be read. During execution of a program or model, the memory portions containing immutable data may be blocked and thereby made inaccessible. In some embodiments, the immutable data and/or the instructions associated with the memory portions to be blocked may be included in any suitable data structure. For example, such data and/or instructions may be made available to controller 7240 through one or more configuration files that are accessible during or after a boot sequence.

Referring again to FIG. 72A, integrated circuit 7200 may further include a communication port 7230. As shown in FIG. 72A, controller 7240 may be coupled between communication port 7230 and bus 7250, which is shared among processing subunits 7220_1 to 7220_K. In some embodiments, communication port 7230 may be coupled, directly or indirectly, to a host computer 7270 associated with a host memory 7280, which may include non-volatile memory or the like. In some embodiments, host computer 7270 may retrieve mutable data 7281 (e.g., input data to be used during execution of a program or model), immutable data 7282, and/or instructions 7283 from its associated host memory 7280. Mutable data 7281, immutable data 7282, and instructions 7283 may be uploaded from host computer 7270 to controller 7240 via communication port 7230 during a boot procedure.

FIG. 72B illustrates memory regions within an integrated circuit consistent with embodiments of the present disclosure. As shown, FIG. 72B depicts an example of the data structures contained in host memory 7280.

FIG. 73A illustrates another example of an integrated circuit consistent with embodiments of the present disclosure. As shown in FIG. 73A, controller 7240 may include a cyberattack detector 7241 and a response module 7242. In some embodiments of the present disclosure, controller 7240 may be configured to store, or to have access to, access control rules 7243. According to some embodiments, access control rules 7243 may be included in a configuration file accessible to controller 7240. In some embodiments, access control rules 7243 may be uploaded to controller 7240 during a boot procedure. Access control rules 7243 may include information indicating the access rules associated with any of mutable data 7281, immutable data 7282, and instructions 7283, and with their corresponding memory locations. As described above, access control rules 7243 or the configuration file may include information identifying particular memory addresses within memory array 7210. In some embodiments, controller 7240 may be configured to provide a blocking mechanism and/or function that blocks various addresses of memory array 7210, for example the addresses used to store instructions or immutable data.

Controller 7240 may be configured to enforce access control rules 7243, for example to prevent unauthorized entities from changing immutable data or instructions. In some embodiments, reading of immutable data or instructions may also be prohibited under access control rules 7243. According to some embodiments of the present disclosure, controller 7240 may be configured to determine whether an attempt has been made to access at least a portion of particular instructions or immutable data. Controller 7240 (e.g., including cyberattack detector 7241) may compare the memory address associated with an access request against the memory addresses of the immutable data and instructions to detect whether an unauthorized attempt has been made to access one or more blocked memory locations. In this way, for example, cyberattack detector 7241 of controller 7240 may be configured to determine whether a suspected cyberattack is occurring, such as a request to change one or more instructions, or to change or obscure immutable data associated with one or more blocked memory portions. Response module 7242 may be configured to determine how to respond to a detected cyberattack and/or how to carry out the response. For example, in some cases, in response to a detected attack on data or instructions in one or more blocked memory locations, response module 7242 of controller 7240 may implement, or cause to be implemented, a response that may include stopping one or more operations, such as the memory access operation associated with the detected attack. The response to a detected attack may also include halting one or more operations associated with execution of a program or model, outputting an alert or other indication of the attempted attack, asserting an indication line to the host, or erasing the entire memory.
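
As a rough illustration of how a detector and a response module could divide the work described above, the sketch below checks external requests against per-region rules and then selects a response. Everything named here is hypothetical: the `Rule` fields, the region boundaries, the default-deny behavior for addresses no rule covers, and the response labels are assumptions, not details taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical access control rule for one address range (inclusive)."""
    start: int
    end: int
    allow_read: bool
    allow_write: bool


# Assumed layout: input buffer, output buffer, then immutable data and code.
RULES = [
    Rule(0x0000, 0x0FFF, allow_read=False, allow_write=True),   # mutable input
    Rule(0x1000, 0x1FFF, allow_read=True, allow_write=False),   # generated output
    Rule(0x2000, 0x7FFF, allow_read=False, allow_write=False),  # immutable data
]


def is_violation(rules, address: int, op: str) -> bool:
    """Detector: True if an external request breaks the access rules."""
    if op not in ("read", "write"):
        raise ValueError(f"unknown operation: {op}")
    for rule in rules:
        if rule.start <= address <= rule.end:
            allowed = rule.allow_read if op == "read" else rule.allow_write
            return not allowed
    return True  # assumption: addresses covered by no rule are denied


def respond(violation: bool) -> list:
    """Response module: choose actions for a detected violation."""
    return ["reject_access", "alert_host"] if violation else []


print(respond(is_violation(RULES, 0x0010, "write")))  # []
print(respond(is_violation(RULES, 0x2000, "read")))   # ['reject_access', 'alert_host']
```

Keeping detection and response as separate functions mirrors the split between detector 7241 and response module 7242, so stronger responses such as halting execution or erasing memory could be added without touching the detection logic.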

In addition to blocking memory portions, other methods of protecting against cyberattacks may be implemented to provide the described security measures associated with integrated circuit 7200. For example, in some embodiments, controller 7240 may be configured to replicate a program or model across different memory locations and processor subunits associated with integrated circuit 7200. The program/model and its replica can then be executed independently of each other, and the results of the separate executions can be compared. For example, a program/model may be replicated in two different memory banks 7210 and executed on different processor subunits 7220 within integrated circuit 7200. In other embodiments, the program/model may be replicated on two different integrated circuits 7200. In either case, the results of the program/model executions may be compared to determine whether any difference exists between the replicated executions. A difference detected in the execution results (e.g., intermediate execution results, final execution results) may indicate a cyberattack that has altered one or more aspects of the program/model or its associated data. In some embodiments, different memory banks 7210 and processor subunits 7220 may be configured to execute two replicated models on the same input data. In some embodiments, intermediate results may be compared while the two replicated models execute on the same input data, and if the two intermediate results at the same stage do not match, execution may be halted as a potential remedial action. Where processor subunits of the same integrated circuit execute the two replicated models, that integrated circuit may also compare the results. This can be done without informing any entity outside the integrated circuit that two replicated models are being executed. In other words, entities outside the chip are unaware that replicated models are running in parallel on the integrated circuit.
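
The stage-by-stage comparison of two replicas can be sketched in software as below. This is a toy model under stated assumptions: the two "banks" are plain nested lists, the layers are small integer ReLU layers chosen so that results compare exactly, and `layer` and `run_replicated` are hypothetical names.

```python
import copy


def layer(weights, inputs):
    """One integer layer with ReLU: out[i] = max(0, sum_j w[i][j] * x[j])."""
    return [max(0, sum(w * x for w, x in zip(row, inputs))) for row in weights]


def run_replicated(bank_a, bank_b, inputs):
    """Run two copies of a model in lockstep, comparing after every layer.

    Returns (output, None) on success, or (None, index) where index is the
    first layer whose intermediate results disagreed (execution is halted).
    """
    act_a, act_b = list(inputs), list(inputs)
    for index, (w_a, w_b) in enumerate(zip(bank_a, bank_b)):
        act_a = layer(w_a, act_a)
        act_b = layer(w_b, act_b)
        if act_a != act_b:
            return None, index  # mismatch: possible tampering, stop here
    return act_a, None


weights = [[[1, -2], [3, 1]], [[2, 1]]]  # two layers of example weights
bank_a = copy.deepcopy(weights)          # copy held in one memory bank
bank_b = copy.deepcopy(weights)          # replica held in another bank

print(run_replicated(bank_a, bank_b, [1, 2]))  # ([5], None)

bank_b[0][0][1] = 2                      # simulate tampering with one replica
print(run_replicated(bank_a, bank_b, [1, 2]))  # (None, 0)
```

Comparing after every layer rather than only at the end lets execution stop at the first divergent stage, which corresponds to the intermediate-result comparison described in the paragraph.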

FIG. 73B illustrates a configuration for concurrently executing replicated models consistent with embodiments of the present disclosure. Although a single program/model replica is described as an example for detecting a possible cyberattack, any number of replicas (e.g., 1, 2, 3, or more) may be used for this purpose. As the number of replicas and separate program/model executions increases, the confidence level of cyberattack detection may also increase. A larger number of replicas may also reduce the likelihood that a cyberattack succeeds, because it may be more difficult for an attacker to affect multiple program/model replicas. The number of program or model replicas may be determined at runtime, further increasing the difficulty for a cyberattacker of successfully affecting program or model execution.

In some embodiments, the replicated models may differ from one another in one or more aspects. In this example, the code associated with the two programs/models may differ, yet both programs/models may be designed to produce the same output results. In this sense, at least, the two programs/models may be considered replicas of each other. For example, two neural network models may order the neurons within a layer differently from each other. Despite this change in the model code, both models can produce the same output results. Replicating a program/model in this way may make it more difficult for a cyberattacker to identify these effective replicas of the program or model to compromise. As a result, the replicated model/program not only provides redundancy that minimizes the impact of a cyberattack but may also improve cyberattack detection (e.g., by highlighting tampering or unauthorized access in which a cyberattacker alters one program/model or its data but fails to make the corresponding change to the program/model replica).
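
The neuron-reordering example can be checked with a small sketch. It assumes a two-layer network with integer weights and a ReLU hidden layer (so equality is exact); `forward` and `permute_hidden` are hypothetical helper names. Reordering the hidden neurons means reordering the rows of the first weight matrix and the matching columns of the second.

```python
def forward(w1, w2, x):
    """Two-layer network: ReLU hidden layer followed by a linear output layer."""
    hidden = [max(0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def permute_hidden(w1, w2, order):
    """Reorder hidden neurons: rows of w1 and the matching columns of w2."""
    new_w1 = [w1[i] for i in order]
    new_w2 = [[row[i] for i in order] for row in w2]
    return new_w1, new_w2


w1 = [[1, -1], [2, 0], [-3, 4]]   # 3 hidden neurons, 2 inputs
w2 = [[1, 2, 3], [0, -1, 1]]      # 2 outputs, 3 hidden neurons
p1, p2 = permute_hidden(w1, w2, [2, 0, 1])

x = [3, 5]
print(forward(w1, w2, x))  # [45, 5]
print(forward(p1, p2, x))  # [45, 5]
print(p1 != w1)            # True: the stored weights differ between replicas
```

With floating-point weights the reordered sums may differ in the last bits, which is one reason a comparison of replica outputs may need a tolerance rather than exact equality.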

In many cases, the replicated programs/models (including, in particular, replicated programs/models that exhibit code differences) may be designed such that their outputs do not match exactly but instead constitute soft values (e.g., approximately equal output values) rather than exact, fixed values. In such embodiments, the output results from two or more effective program/model replicas may be compared (e.g., using a dedicated module or the host processor) to determine whether the difference between the output results (intermediate or final) falls within a predetermined range. A difference between the output soft values that does not exceed a predetermined threshold or range may be considered evidence that no tampering, unauthorized access, or the like has occurred. Conversely, if the difference between the output soft values exceeds the predetermined threshold or range, the difference may be considered evidence that a cyberattack has occurred in the form of tampering with, or unauthorized access to, the memory. In that case, the replicated program/model security measure may be triggered, and one or more remedial actions may be taken (e.g., halting execution of the program or model, stopping one or more operations of integrated circuit 7200, operating in a safe mode with limited functionality).
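
A tolerance-based comparison of soft outputs might look like the following sketch. The absolute tolerance of 1e-3, the function names, and the "continue"/"halt" labels are illustrative assumptions; the disclosure states only that the difference is compared against a predetermined threshold or range.

```python
import math


def outputs_agree(out_a, out_b, abs_tol=1e-3):
    """True if two replica outputs match element-wise within abs_tol."""
    if len(out_a) != len(out_b):
        return False
    return all(
        math.isclose(a, b, rel_tol=0.0, abs_tol=abs_tol)
        for a, b in zip(out_a, out_b)
    )


def check_replicas(out_a, out_b, abs_tol=1e-3):
    """Map the comparison onto an action: keep running or halt execution."""
    return "continue" if outputs_agree(out_a, out_b, abs_tol) else "halt"


# Small numerical differences between replicas are tolerated.
print(check_replicas([0.7312, 0.2688], [0.7310, 0.2690]))  # continue
# A large deviation is treated as evidence of possible tampering.
print(check_replicas([0.7312, 0.2688], [0.1000, 0.9000]))  # halt
```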

Security measures associated with integrated circuit 7200 may also include quantitative analysis of data associated with the execution of a program or model. For example, in some embodiments, controller 7240 may be configured to calculate one or more checksum/hash/cyclic redundancy check (CRC)/parity values for data stored in at least a portion of memory array 7210. The calculated value(s) may be compared to predetermined value(s). If the compared values differ, the difference may be interpreted as evidence of tampering with the data stored in that portion of memory array 7210. In some embodiments, a checksum/hash/CRC/parity value may be calculated over all memory locations associated with memory array 7210 to identify changes in data. In this example, the entire memory (or memory bank) in question may be read by host computer 7270, by a processor associated with integrated circuit 7200, or the like, in order to calculate the checksum/hash/CRC/parity value. In other cases, a checksum/hash/CRC/parity value may be calculated over a predetermined subset of the memory locations associated with memory array 7210 to identify changes in the data associated with that subset. In some embodiments, controller 7240 may be configured to calculate checksum/hash/CRC/parity values associated with a predetermined data path (e.g., associated with a memory access pattern), and the calculated values may be compared to each other or to predetermined values to determine whether tampering or another form of cyber attack has occurred.
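As a non-limiting sketch, a check over a predetermined subset of memory locations might look like the following. CRC-32 from Python's standard `zlib` module stands in for whichever checksum/hash/CRC/parity function an implementation uses, and the names `region_crc` and `region_tampered` are hypothetical.

```python
import zlib


def region_crc(memory: bytes, start: int, length: int) -> int:
    """CRC-32 over one region (a subset of memory locations)."""
    return zlib.crc32(memory[start:start + length])


def region_tampered(memory: bytes, start: int, length: int,
                    expected_crc: int) -> bool:
    """True when the freshly computed CRC differs from the
    predetermined (expected) value for that region."""
    return region_crc(memory, start, length) != expected_crc
```

A change outside the checked region goes unnoticed by this check, which is why the text above also contemplates computing the value over all memory locations.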

Integrated circuit 7200 may be further secured against cyber attacks by protecting one or more predetermined values (e.g., expected checksum/hash/CRC/parity values, expected difference values for intermediate or final output results, expected difference ranges associated with particular values, etc.) stored within integrated circuit 7200 or at a location accessible to integrated circuit 7200. For example, in some embodiments, the one or more predetermined values may be stored in registers of memory array 7210 and used (e.g., by controller 7240 of integrated circuit 7200) to evaluate intermediate or final output values, checksums, etc., during or after each run of the model. In some cases, the register values may be updated using a "save final result data" command so that the predetermined values are computed on the fly, and the computed values may be stored in a register or another memory location. In this way, valid output values may be used to update the predetermined values used for comparison after each program or model execution or partial execution. Such an approach may make it more difficult for a cyber attacker to alter or tamper with the one or more predetermined reference values designed to expose cyber attack activity.

In operation, a CRC calculator may be used to track memory accesses. For example, such calculation circuitry may be placed at the memory bank level, in a processor subunit, or in the controller, and each may be configured to accumulate into the CRC calculator on every memory access.
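One way to picture a CRC calculator that accumulates on every memory access is the software sketch below. The record layout (an 8-byte little-endian address followed by a one-byte read/write flag) and the class name `AccessCrc` are assumptions for illustration; the disclosure does not specify what is fed into the CRC.

```python
import zlib


class AccessCrc:
    """Running CRC-32 over a stream of memory accesses."""

    def __init__(self) -> None:
        self.value = 0  # CRC of the empty access stream

    def record(self, address: int, is_write: bool) -> None:
        # Encode the access as bytes (hypothetical layout) and fold it
        # into the running CRC; zlib.crc32 accepts the previous CRC as
        # its second argument.
        entry = address.to_bytes(8, "little") + bytes([1 if is_write else 0])
        self.value = zlib.crc32(entry, self.value)
```

After a run, `value` can be compared with the value recorded for a known-good run of the same program; an access stream that differs in even one address bit yields a different CRC.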

FIG. 74A illustrates another embodiment of integrated circuit 7200. In the embodiment shown in FIG. 74A, controller 7240 may include a tamper detector 7245 and a response module 7246. Similar to other disclosed embodiments, tamper detector 7245 may be configured to detect evidence of a potential tampering attempt. According to some embodiments of the present disclosure, for example, a security measure associated with integrated circuit 7200 and implemented by controller 7240 may include a comparison of an actual program/model operational pattern with a predetermined/allowed operational pattern. The security measure may be triggered when the actual program/model operational pattern differs from the predetermined/allowed operational pattern in one or more aspects. When the security measure is triggered, response module 7246 of controller 7240 may be configured to respond by implementing one or more remedial actions.

FIG. 74C illustrates detection elements that may be located at various points within a chip, according to an example of the present disclosure. The detection of cyber attacks and tampering described above may be performed using detection elements located at various points within the chip, as shown in the example of FIG. 74C. For example, particular code may be associated with an expected number of processing events within a particular time period. The detector shown in FIG. 74C may count the number of events experienced by the system (monitored by an event counter) during a particular time period (monitored by a time counter). If the number of events exceeds a predetermined threshold (e.g., the number of events expected during the predetermined time period), this may indicate tampering. As shown in FIG. 74C, such detectors may be included at multiple points in the system to monitor various types of events.
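The event-counter/time-counter arrangement described above can be sketched in software as follows. The class name `EventWindowDetector`, the tick-based time base, and the reset-at-window-end policy are assumptions for this sketch rather than details of FIG. 74C.

```python
class EventWindowDetector:
    """Counts events within a fixed window of ticks and flags the
    window once the count exceeds an expected maximum."""

    def __init__(self, window_ticks: int, max_events: int) -> None:
        self.window_ticks = window_ticks  # length of the time period
        self.max_events = max_events      # predetermined threshold
        self.time_count = 0               # time counter
        self.event_count = 0              # event counter

    def tick(self, events_this_tick: int) -> bool:
        """Advance time by one tick; return True if the threshold has
        been exceeded within the current window."""
        self.event_count += events_this_tick
        self.time_count += 1
        exceeded = self.event_count > self.max_events
        if self.time_count >= self.window_ticks:
            # Window finished: restart both counters.
            self.time_count = 0
            self.event_count = 0
        return exceeded
```

One such detector could be instantiated per monitored event type (e.g., one for memory reads and another for inter-chip transfers), each with its own window and threshold.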

More specifically, in some embodiments, controller 7240 may be configured to store or access expected program/model operational patterns 7244. For example, in some cases, the operational patterns may be depicted as a graph 7247 indicating allowed load-over-time patterns and forbidden or illegal load-over-time patterns. A tampering attempt may cause memory array 7210 or processing array 7220 to operate outside of certain operating specifications. This may cause memory array 7210 or processing array 7220 to heat up or malfunction, and may cause data or code associated with memory array 7210 or processing array 7220 to be altered. Such alterations may result in an operational pattern that falls outside the allowed operational patterns, as shown in graph 7247.

According to some embodiments of the present disclosure, controller 7240 may be configured to monitor an operational pattern associated with memory array 7210 or processing array 7220. The operational pattern may be associated with the number of access requests, the type of access requests, the timing of access requests, etc. Controller 7240 may be further configured to detect a tampering attack when the operational pattern differs from an allowable operational pattern.

It should be noted that the disclosed embodiments may be used to protect against non-malicious errors in operation in addition to protecting against cyber attacks. For example, the disclosed embodiments may also be effective in protecting a system such as integrated circuit 7200 against errors resulting from environmental factors such as temperature or voltage changes or levels, particularly where those levels fall outside the operating specifications of integrated circuit 7200.

Any suitable remedial action may be implemented in response to the detection of a suspected cyber attack (e.g., in response to a triggered security measure). For example, remedial actions may include halting one or more operations associated with program/model execution, operating one or more components associated with integrated circuit 7200 in a safe mode, blocking further input or access to one or more components of integrated circuit 7200, etc.

FIG. 74B is a flowchart of a method 7450 of securing an integrated circuit against tampering, according to exemplary disclosed embodiments. For example, in step 7452, at least one security measure may be implemented with respect to an operation of the integrated circuit using a controller associated with the integrated circuit. In step 7454, one or more remedial actions may be taken if the at least one security measure is triggered. The integrated circuit may include a substrate, a memory array disposed on the substrate and including a plurality of discrete memory banks, and a processing array disposed on the substrate and including a plurality of processor subunits, each of the plurality of processor subunits being associated with one or more discrete memory banks among the plurality of discrete memory banks.

In some embodiments, the disclosed security measures may be implemented in multiple memory chips, and at least one of the disclosed security mechanisms may be implemented for each memory chip/integrated circuit. In some cases, each memory chip/integrated circuit may implement the same security measures; in other cases, different memory chips/integrated circuits may implement different security measures (e.g., where different security measures may be better suited to the particular type of operation associated with a particular integrated circuit). In some embodiments, more than one security measure may be implemented by a particular controller of an integrated circuit. For example, a particular integrated circuit may implement any number or type of the disclosed security measures. In addition, a particular integrated circuit controller may be configured to implement multiple different remedial actions in response to a triggered security measure.

It should be noted that two or more of the above security mechanisms may be combined to improve security against cyber attacks or tampering attacks. In addition, security measures may be implemented across different integrated circuits, and such integrated circuits may coordinate with one another in implementing the security measures. For example, model replication may be performed within a single memory chip or across different memory chips. In such an example, results from one memory chip, or results from two or more memory chips, may be compared to detect a potential cyber attack or tampering attack. In some embodiments, replicated security measures applied across multiple integrated circuits may include one or more of the disclosed access locking mechanisms, hash protection mechanisms, model replication, and program/model execution pattern analysis, or any combination of these or other disclosed embodiments.

Multi-Port Processor Subunits in DRAM

As described above, embodiments disclosed herein may include a distributed processor memory chip that includes an array of processor subunits and an array of memory banks, where each of the processor subunits may be dedicated to at least one memory bank of the array of memory banks. As described below, the distributed processor memory chip may serve as the basis for a scalable system. That is, in some cases, a distributed processor memory chip may include one or more communication ports configured to transfer data from one distributed processor memory chip to another. In this way, any desired number of distributed processor memory chips may be connected to one another (e.g., in series, in parallel, in a loop, or in any combination thereof) to form a scalable array of distributed processor memory chips. Such an array may provide a flexible solution for efficiently performing memory intensive operations and for scaling the computing resources associated with performing them. Because distributed processor memory chips may include clocks with differing timing patterns, the embodiments disclosed herein include features for accurately controlling data transfers between distributed processor memory chips even in the presence of clock timing differences. Such embodiments may enable efficient data sharing among different distributed processor memory chips.

FIG. 75A illustrates a scalable processor memory system including a plurality of distributed processor memory chips, according to embodiments of the present disclosure. According to embodiments of the present disclosure, the scalable processor memory system may include a plurality of distributed processor memory chips, such as a first distributed processor memory chip 7500, a second distributed processor memory chip 7500', and a third distributed processor memory chip 7500". Each of first distributed processor memory chip 7500, second distributed processor memory chip 7500', and third distributed processor memory chip 7500" may include any of the configurations and/or features associated with any of the embodiments described in the present disclosure.

In some embodiments, each of first distributed processor memory chip 7500, second distributed processor memory chip 7500', and third distributed processor memory chip 7500" may be implemented similarly to integrated circuit 7200 shown in FIG. 72A. As shown in FIG. 75A, first distributed processor memory chip 7500 may include a memory array 7510, a processing array 7520, and a controller 7540. Memory array 7510, processing array 7520, and controller 7540 may be configured similarly to memory array 7210, processing array 7220, and controller 7240 of FIG. 72A.

According to embodiments of the present disclosure, first distributed processor memory chip 7500 may include a first communication port 7530. In some embodiments, first communication port 7530 may be configured to communicate with one or more external entities. For example, communication port 7530 may be configured to establish a communication connection between distributed processor memory chip 7500 and an external entity other than another distributed processor memory chip such as distributed processor memory chips 7500' and 7500". For example, communication port 7530 may be directly or indirectly coupled to a host computer, as shown in FIG. 72, or to any other computing device, communication module, etc.

According to embodiments of the present disclosure, first distributed processor memory chip 7500 may further include one or more additional communication ports configured to communicate with other distributed processor memory chips (e.g., 7500', 7500"). In some embodiments, the one or more additional communication ports may include a second communication port 7531 and a third communication port 7532, as shown in FIG. 75A. Second communication port 7531 may be configured to communicate with second distributed processor memory chip 7500' and to establish a communication connection between first distributed processor memory chip 7500 and second distributed processor memory chip 7500'. Similarly, third communication port 7532 may be configured to communicate with third distributed processor memory chip 7500" and to establish a communication connection between first distributed processor memory chip 7500 and third distributed processor memory chip 7500". In some embodiments, first distributed processor memory chip 7500 (and any of the memory chips disclosed herein) may include a plurality of communication ports, including any suitable number of communication ports (e.g., 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100, 1000, etc.).

In some embodiments, the first communication port, the second communication port, and the third communication port are each associated with a corresponding bus. The corresponding bus may be a bus common to each of the first communication port, the second communication port, and the third communication port. In some embodiments, the corresponding buses associated with each of the first communication port, the second communication port, and the third communication port are all connected to the plurality of discrete memory banks. In some embodiments, the first communication port is connected to at least one of a main bus internal to the memory chip or at least one processor subunit included in the memory chip. In some embodiments, the second communication port is connected to at least one of a main bus internal to the memory chip or at least one processor subunit included in the memory chip.

While the configuration of the disclosed distributed processor memory chip has been described with respect to first distributed processor memory chip 7500, second distributed processor memory chip 7500' and third distributed processor memory chip 7500" may be configured similarly to first distributed processor memory chip 7500. For example, second distributed processor memory chip 7500' may also include a memory array 7510', a processing array 7520', a controller 7540', and/or a plurality of communication ports such as ports 7530', 7531', and 7532'. Similarly, third distributed processor memory chip 7500" may also include a memory array 7510", a processing array 7520", a controller 7540", and/or a plurality of communication ports such as ports 7530", 7531", and 7532". In some embodiments, second communication port 7531' and third communication port 7532' of second distributed processor memory chip 7500' may be configured to communicate with third distributed processor memory chip 7500" and first distributed processor memory chip 7500, respectively. Similarly, second communication port 7531" and third communication port 7532" of third distributed processor memory chip 7500" may be configured to communicate with first distributed processor memory chip 7500 and second distributed processor memory chip 7500', respectively. This similarity in configuration among the distributed processor memory chips may facilitate the scaling of a computing system based on the disclosed distributed processor memory chips. Furthermore, the disclosed configuration of the communication ports associated with each distributed processor memory chip may allow flexible arrangement of an array of distributed processor memory chips (e.g., connected in series, in parallel, in a loop, in a star, in a web, etc.).

According to embodiments of the present disclosure, distributed processor memory chips such as first to third distributed processor memory chips 7500, 7500', and 7500" may communicate with one another via bus 7533. According to some embodiments, bus 7533 may connect two communication ports of two different distributed processor memory chips. For example, second communication port 7531 of first processor memory chip 7500 may be connected to third communication port 7532' of second processor memory chip 7500' via bus 7533. According to embodiments of the present disclosure, distributed processor memory chips such as first to third distributed processor memory chips 7500, 7500', and 7500" may communicate with an external entity (e.g., a host computer) via a bus such as bus 7534. For example, first communication port 7530 of first distributed processor memory chip 7500 may be connected to one or more external entities via bus 7534. The distributed processor memory chips may be connected to one another in various ways. In some cases, the distributed processor memory chips may exhibit serial connectivity, in which each distributed processor memory chip is connected to a pair of adjacent distributed processor memory chips. In other cases, the distributed processor memory chips may exhibit a higher degree of connectivity, in which at least one distributed processor memory chip is connected to two or more other distributed processor memory chips. In some cases, every distributed processor memory chip in the plurality of memory chips may be connected to all of the other distributed processor memory chips.
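Purely as an illustration of the connectivity options described above, the sketch below enumerates chip-to-chip links for three arrangements. Chips are identified by hypothetical indices 0 to n-1 (n of at least 3 for the ring), each link standing for one bus between two ports; the function names are assumptions for this sketch.

```python
from itertools import combinations
from typing import List, Tuple

Link = Tuple[int, int]


def chain_links(n: int) -> List[Link]:
    """Open chain: chip i is linked to chip i + 1."""
    return [(i, i + 1) for i in range(n - 1)]


def ring_links(n: int) -> List[Link]:
    """Loop: every chip is linked to a pair of adjacent chips."""
    return [(i, (i + 1) % n) for i in range(n)]


def full_links(n: int) -> List[Link]:
    """Every chip is linked to every other chip."""
    return list(combinations(range(n), 2))


def degrees(links: List[Link], n: int) -> List[int]:
    """Number of links (and hence inter-chip ports) used per chip."""
    counts = [0] * n
    for a, b in links:
        counts[a] += 1
        counts[b] += 1
    return counts
```

The per-chip degree indicates how many inter-chip communication ports each arrangement requires: two per chip for a ring, and n - 1 per chip when every chip is connected to every other chip.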

As shown in FIG. 75A, bus 7533 (or any other bus associated with the embodiment of FIG. 75A) may be unidirectional. Although FIG. 75A shows bus 7533 as unidirectional with a particular data transfer flow (indicated by the arrows in FIG. 75A), bus 7533 (or any other bus in FIG. 75A) may be implemented as a bidirectional bus. According to some embodiments of the present disclosure, a bus connected between two distributed processor memory chips may be configured to have a higher communication speed than a bus connected between a distributed processor memory chip and an external entity. In some embodiments, communication between a distributed processor memory chip and an external entity may occur during limited periods, for example, during preparation for execution (loading of program code, input data, weight data, etc., from a host computer) and during periods in which results generated by execution of a neural network model are output to the host computer. During execution of one or more programs associated with distributed processor memory chips 7500, 7500', and 7500" (e.g., during memory intensive operations associated with artificial intelligence applications, etc.), communication between the distributed processor memory chips may occur via buses 7533, 7533', etc. In some embodiments, communication between a distributed processor memory chip and an external entity may occur less frequently than communication between two processor memory chips. Depending on communication requirements and the embodiment, the bus between a distributed processor memory chip and an external entity may be configured to have a communication speed equal to, higher than, or lower than that of the bus between distributed processor memory chips.

In some embodiments, as depicted in FIG. 75A, a plurality of distributed processor memory chips, such as first to third distributed processor memory chips 7500, 7500', and 7500", may be configured to communicate with one another. This capability may facilitate the assembly of a scalable distributed processor memory chip system. For example, memory arrays 7510, 7510', and 7510" and processing arrays 7520, 7520', and 7520" of first to third distributed processor memory chips 7500, 7500', and 7500" may be regarded as effectively belonging to a single distributed processor memory chip when linked by communication channels (e.g., the buses shown in FIG. 75A).

According to embodiments of the present disclosure, communication among the plurality of distributed processor memory chips and/or communication between a distributed processor memory chip and one or more external entities may be managed in any suitable manner. In some embodiments, such communication may be managed by processing resources, such as processing array 7520 of distributed processor memory chip 7500. In some other embodiments, for example to relieve the processing resources provided by the array of distributed processors of the computational load imposed by communication management, a controller of a distributed processor memory chip, such as controller 7540, 7540', or 7540", may be configured to manage communication among distributed processor memory chips and/or communication between the distributed processor memory chip(s) and one or more external entities. For example, each of controllers 7540, 7540', and 7540" of first to third distributed processor memory chips 7500, 7500', and 7500" may be configured to manage the communications of its corresponding distributed processor memory chip with the other distributed processor memory chips. In some embodiments, controllers 7540, 7540', and 7540" may be configured to control such communications through the corresponding communication ports, such as ports 7531, 7531', 7531", 7532, 7532', 7532", etc.

Controllers 7540, 7540', and 7540" may also be configured to manage communication between distributed processor memory chips while taking into account timing differences that may exist between them. For example, a distributed processor memory chip (e.g., 7500) may be fed by an internal clock that may differ (e.g., have a different timing pattern) from the clocks of other distributed processor memory chips (e.g., 7500', 7500"). Accordingly, in some embodiments, controller 7540 may be configured to implement one or more strategies that account for the different timing patterns between distributed processor memory chips and to manage communication between the chips in view of possible time deviations between them.

For example, in some embodiments, controller 7540 of first distributed processor memory chip 7500 may be configured to enable a data transfer from first distributed processor memory chip 7500 to second distributed processor memory chip 7500' under certain conditions. In some cases, controller 7540 may withhold the data transfer when one or more processor subunits of first distributed processor memory chip 7500 are not ready to send the data. Alternatively or additionally, controller 7540 may withhold the data transfer when a receiving processor subunit of second distributed processor memory chip 7500' is not ready to receive the data. In some cases, controller 7540 may initiate the transfer of data from a sending processor subunit (e.g., a subunit in chip 7500) to a receiving processor subunit (e.g., a subunit in chip 7500') after confirming both that the sending processor subunit is ready to send the data and that the receiving processor subunit is ready to receive it. In other embodiments, controller 7540 may initiate the data transfer based solely on whether the sending processor subunit is ready to send the data, especially where the data can be buffered in a controller (e.g., 7540 or 7540') until the receiving processor subunit is ready to receive the transferred data.
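The readiness conditions described above can be illustrated with a minimal Python sketch. It is not taken from the disclosure: the class `TransferController`, its `can_buffer` flag, and the result labels are hypothetical names chosen for illustration, and the model ignores clocks, ports, and bus timing.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class TransferController:
    """Hypothetical model of a chip controller that gates inter-chip transfers."""
    can_buffer: bool = False                      # assumption: optional staging buffer
    buffer: deque = field(default_factory=deque)  # payloads waiting for the receiver

    def try_transfer(self, sender_ready, receiver_ready, payload):
        """Return 'sent', 'buffered', or 'held' for one transfer attempt."""
        if not sender_ready:
            return "held"                 # sender has nothing to send yet
        if receiver_ready:
            return "sent"                 # both sides ready: transfer proceeds
        if self.can_buffer:
            self.buffer.append(payload)   # park the data in the controller
            return "buffered"
        return "held"                     # receiver not ready and no buffer

    def drain(self, receiver_ready):
        """Deliver buffered payloads, in order, once the receiver is ready."""
        delivered = []
        while receiver_ready and self.buffer:
            delivered.append(self.buffer.popleft())
        return delivered
```

With `can_buffer=False` the sketch models the case in which both subunits must be ready; with `can_buffer=True` it models the variant in which only the readiness of the sending subunit is checked before the data leaves it.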

According to embodiments of the present disclosure, controller 7540 may be configured to determine whether one or more other timing constraints are satisfied before enabling a data transfer. Such timing constraints may relate to a time difference between the sending time of the sending processor subunit and the receiving time of the receiving processor subunit, a request from an external entity (e.g., a host computer) to access data that is being processed, a refresh operation performed on memory resources (e.g., a memory array) associated with the sending and receiving processor subunits, and the like.

FIG. 75E illustrates exemplary timing according to embodiments of the present disclosure, and corresponds to the following example.

In some embodiments, controller 7540 and other controllers associated with the distributed processor memory chips may be configured to manage data transfers between chips using a clock enable signal. For example, processing array 7520 may be fed by a clock. In some embodiments, whether one or more processor subunits respond to the supplied clock signal may be controlled, e.g., by controller 7540, using a clock enable signal (labeled "CE" in FIG. 7A). Each processor subunit (e.g., 7520_1 through 7520_K) may execute program code, and the program code may include communication commands. According to some embodiments of the present disclosure, controller 7540 may control the timing of the communication commands by controlling the clock enable signal to processor subunits 7520_1 through 7520_K. For example, when a sending processor subunit (e.g., a processor subunit of first distributed processor memory chip 7500) is programmed to send data at a particular cycle (e.g., the 1000th clock cycle) and a receiving processor subunit (e.g., a processor subunit of second distributed processor memory chip 7500') is programmed to receive the data at a particular cycle (e.g., the 1000th clock cycle), controller 7540 of first distributed processor memory chip 7500 and controller 7540' of second distributed processor memory chip 7500' may not allow the data transfer until both the sending processor subunit and the receiving processor subunit are ready to perform it. For instance, controller 7540 may "hold" the data transfer from the sending processor subunit by supplying it with a particular clock enable signal (e.g., logic low) that prevents the sending processor subunit from sending data in response to the received clock signal. A particular clock enable signal may "freeze" all or part of a distributed processor memory chip. Conversely, controller 7540 may cause the sending processor subunit to initiate the data transfer by supplying it with the opposite clock enable signal (e.g., logic high), which causes the sending processor subunit to respond to the received clock signal. Similar operations, such as whether or not a receiving processor subunit in chip 7500' receives data, may be controlled using a clock enable signal supplied by controller 7540'.

In some embodiments, the clock enable signal may be sent to all processor subunits (e.g., 7520_1 through 7520_K) in a processor memory chip (e.g., 7500). The clock enable signal generally has the effect of causing a processor subunit either to respond to or to ignore each clock signal. For example, in some cases, when the clock enable signal is high (depending on the convention of the particular application), a processor subunit may respond to its clock signal and execute one or more instructions according to the clock signal timing. On the other hand, when the clock enable signal is low, the processor subunit may be prevented from responding to the clock signal, so that it does not execute instructions in response to the clock timing. In other words, when the clock enable signal is low, the processor subunit may ignore the received clock signal.
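The effect of the clock enable signal can be sketched as a small cycle-level simulation. The `Subunit` class, the instruction names, and the `simulate` helper below are illustrative assumptions rather than part of the disclosure; the sketch assumes an active-high clock enable and, for simplicity, a single shared cycle counter for both chips.

```python
class Subunit:
    """Hypothetical processor subunit that executes one instruction per enabled clock."""

    def __init__(self, program):
        self.program = list(program)
        self.pc = 0          # index of the next instruction
        self.trace = []      # instructions actually executed

    def next_op(self):
        return self.program[self.pc] if self.pc < len(self.program) else None

    def clock(self, ce):
        # With CE low the clock edge is ignored and nothing executes.
        if ce and self.pc < len(self.program):
            self.trace.append(self.program[self.pc])
            self.pc += 1


def simulate(sender, receiver, cycles):
    """Freeze a subunit at its communication instruction until its peer is ready."""
    transfer_cycles = []
    for cycle in range(cycles):
        s_op, r_op = sender.next_op(), receiver.next_op()
        matched = s_op == "send" and r_op == "recv"
        ce_sender = s_op != "send" or matched      # CE low while waiting to send
        ce_receiver = r_op != "recv" or matched    # CE low while waiting to receive
        if matched:
            transfer_cycles.append(cycle)
        sender.clock(ce_sender)
        receiver.clock(ce_receiver)
    return transfer_cycles
```

In this sketch a sender that reaches its `send` instruction early simply stops advancing until the receiver reaches `recv`, which mirrors the "hold" behavior described for a logic-low clock enable.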

Referring again to FIG. 75A, any of controllers 7540, 7540', and 7540" may be configured to control the operation of its respective distributed processor memory chip using a clock enable signal, by causing one or more processor subunits in the corresponding array to respond or not respond to the received clock signal. In some embodiments, controllers 7540, 7540', and 7540" may be configured to advance code execution selectively, for example when such code relates to or includes data transfer operations and their timing. In some embodiments, controllers 7540, 7540', and 7540" may be configured to use the clock enable signal to control the timing of data transmission between two different distributed processor memory chips through any of communication ports 7531, 7531', 7531", 7532, 7532', and 7532". In some embodiments, controllers 7540, 7540', and 7540" may be configured to use the clock enable signal to control the time of data reception between two different distributed processor memory chips through any of communication ports 7531, 7531', 7531", 7532, 7532', and 7532".

In some embodiments, the timing of data transfers between two different distributed processor memory chips may be set based on a compilation optimization step. The compilation may allow the construction of processing routines in which tasks can be assigned efficiently to processing subunits without being affected by transfer delays over the bus connecting the two different distributed processor memory chips. The compilation may be performed by a compiler in a host computer or may be transmitted to the host computer. Ordinarily, transfer delays over a bus between two different distributed processor memory chips would create a data bottleneck for the processing subunits that need the data. The disclosed compilation may schedule data transfers in a way that enables the processing units to receive data continuously despite unfavorable transfer delays over the bus.
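One simple way such a compile-time schedule could be derived is to issue each transfer early enough that it arrives in the cycle in which it is consumed. The sketch below assumes a fixed, known bus latency and a bus that can accept a new transfer every cycle; both are assumptions made here for illustration, and `schedule_sends` is a hypothetical helper, not the compiler of the disclosure.

```python
def schedule_sends(need_cycles, bus_latency):
    """Return (send_cycle, need_cycle) pairs so each datum arrives just in time.

    need_cycles: cycles at which the consuming subunit needs each datum.
    bus_latency: assumed fixed number of cycles a datum spends on the bus.
    """
    schedule = []
    for need in sorted(need_cycles):
        send = need - bus_latency            # latest send cycle that still arrives on time
        if send < 0:
            raise ValueError(f"datum needed at cycle {need} cannot arrive in time")
        schedule.append((send, need))
    return schedule
```

For a consumer that needs data at cycles 10, 11, and 12 over a bus with a latency of 4 cycles, the sends are placed at cycles 6, 7, and 8, so the consumer sees a continuous stream even though each individual transfer is delayed.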

Although the embodiment of FIG. 75A includes three ports per distributed processor memory chip (7500, 7500', 7500"), any number of ports may be included in a distributed processor memory chip according to the disclosed embodiments. For example, in some cases, a distributed processor memory chip may include more or fewer ports. In the embodiment of FIG. 75B, each distributed processor memory chip (e.g., 7500A-7500I) may be configured with multiple ports. These ports may be substantially the same as or different from one another. In the illustrated example, each distributed processor memory chip includes five ports: one host communication port 7570 and four chip ports 7572. Host communication port 7570 may be configured to enable communication (via bus 7534) between any of the distributed processor memory chips in the array shown in FIG. 75B and, for example, a host computer located remotely from the array of distributed processor memory chips. Chip ports 7572 may be configured to enable communication between distributed processor memory chips via bus 7535.

Any number of distributed processor memory chips may be connected to one another. In the example shown in FIG. 75B, including four chip ports per distributed processor memory chip enables an array in which each distributed processor memory chip is connected to two or more other distributed processor memory chips and, in some cases, a particular chip is connected to four other distributed processor memory chips. Including more chip ports in the distributed processor memory chips may enable greater interconnectivity between them.
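As an illustration of this connectivity, the following sketch assumes that the nine chips 7500A-7500I are arranged in a 3x3 grid with links only between horizontal and vertical neighbors. The disclosure does not fix this layout, and `mesh_links` is a hypothetical helper.

```python
from string import ascii_uppercase


def mesh_links(rows, cols):
    """Map each chip name to the names of its grid neighbors (up, down, left, right)."""
    names = [[f"7500{ascii_uppercase[r * cols + c]}" for c in range(cols)]
             for r in range(rows)]
    links = {names[r][c]: [] for r in range(rows) for c in range(cols)}
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:   # stay inside the array
                    links[names[r][c]].append(names[rr][cc])
    return links
```

Under this assumed layout, corner chips use two of their four chip ports, edge chips use three, and the center chip uses all four, which is consistent with each chip being connected to at least two others.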

Additionally, while FIG. 75B shows distributed processor memory chips 7500A-7500I with two different types of communication ports 7570 and 7572, in some cases a single type of communication port may be included in each distributed processor memory chip. In other cases, two or more types of communication ports may be included in one or more of the distributed processor memory chips. In the example of FIG. 75C, each of distributed processor memory chips 7500A'-7500C' includes two (or more) communication ports 7570 of the same type. In this embodiment, communication port 7570 may be configured to enable communication with an external entity, such as a host computer, via bus 7534, and may also be configured to enable communication between distributed processor memory chips (e.g., 7500B' and 7500C') via bus 7535.

In some embodiments, the ports provided on one or more distributed processor memory chips may be used to give one or more hosts access to the chips. For example, in the embodiment shown in FIG. 75D, the distributed processor memory chip includes two or more ports 7570. Ports 7570 may constitute host ports, chip ports, or a combination of host and chip ports. In the illustrated example, two ports 7570 and 7570' may provide two different hosts (e.g., host computers or computational elements, or other types of logic units) with access to distributed processor memory chip 7500A via buses 7534 and 7534'. Such an embodiment may give two (or more) different host computers access to distributed processor memory chip 7500A. In other embodiments, however, both buses 7534 and 7534' may be connected to the same host entity, for example where that host entity requires additional bandwidth or parallel access to one or more of the processor subunits/memory banks of distributed processor memory chip 7500A.

In some cases, as shown in FIG. 75D, one or more controllers 7540, 7540' may be used to control access to the distributed processor subunits/memory banks of distributed processor memory chip 7500A. In other cases, a single controller may be used to handle communications from one or more external host entities.

Additionally, one or more buses internal to distributed processor memory chip 7500A may enable parallel access to the distributed processor subunits/memory banks of distributed processor memory chip 7500A. For example, distributed processor memory chip 7500A may include a first bus 7580 and a second bus 7580' that enable parallel access to, e.g., distributed processor subunits 7520_1 through 7520_6 and their corresponding dedicated memory banks 7510_1 through 7510_6. Such an arrangement may allow simultaneous access to two different locations in distributed processor memory chip 7500A. Further, where not all ports are used at the same time, the ports may share hardware resources within distributed processor memory chip 7500A (e.g., a common bus and/or a common controller) and may constitute I/Os multiplexed onto that hardware.

In some embodiments, some of the computational units (e.g., processor subunits 7520_1 through 7520_6) may be connected to the additional port 7570' or controller, while others may not. Nevertheless, data from computational units not connected to the additional port 7570' may be routed through the internal grid of connections to computational units that are connected to port 7570'. In this way, communication may be performed at both ports 7570 and 7570' simultaneously without adding additional buses.

While communication ports (e.g., 7530 through 7532) and controllers (e.g., 7540) have been illustrated as separate elements, it is to be appreciated that, according to embodiments of the present disclosure, the communication ports and controllers (or any other components) may be implemented as integrated units. FIG. 76 illustrates a distributed processor memory chip 7600 having an integrated controller and interface module, according to embodiments of the present disclosure. As shown in FIG. 76, processor memory chip 7600 may be implemented with an integrated controller and interface module 7547 configured to perform the functions of controller 7540 and communication ports 7530, 7531, and 7532 of FIG. 75A. Controller and interface module 7547 may be configured to communicate with multiple different entities, such as an external entity and one or more distributed processor memory chips, through interfaces 7548_1 through 7548_N, which are similar to the communication ports (e.g., 7530, 7531, 7532). Controller and interface module 7547 may further be configured to control communication between distributed processor memory chips or between distributed processor memory chip 7600 and an external entity such as a host computer. In some embodiments, controller and interface module 7547 may include communication interfaces 7548_1 through 7548_N configured to communicate in parallel with one or more other distributed processor memory chips and with an external entity such as a host computer, a communication module, and the like.

FIG. 77 is a flow diagram for transferring data between distributed processor memory chips in the scalable processor memory system shown in FIG. 75A, according to embodiments of the present disclosure. For illustrative purposes, the data transfer flow is described with reference to FIG. 75A, assuming that data is transferred from first processor memory chip 7500 to second processor memory chip 7500'.

At step S7710, a data transfer request may be received. As discussed above, however, in some embodiments a data transfer request may not be necessary. For example, in some cases the timing of a data transfer may be predetermined (e.g., by particular software code). In such cases, the data transfer may proceed without a separate data transfer request. Step S7710 may be performed by, for example, controller 7540. In some embodiments, the data transfer request may include a request to transfer data from one processor subunit of first distributed processor memory chip 7500 to another processor subunit of second distributed processor memory chip 7500'.

At step S7720, the data transfer timing may be determined. As noted, the data transfer timing may be predetermined and may depend on the execution order of a particular software program. Step S7720 may be performed by, for example, controller 7540. In some embodiments, the data transfer timing may be determined by considering (1) whether the sending processor subunit is ready to send the data and/or (2) whether the receiving processor subunit is ready to receive the data. According to embodiments of the present disclosure, whether one or more other timing constraints are satisfied so as to enable such a data transfer may also be considered. The one or more timing constraints may relate to a time difference between the sending time of the sending processor subunit and the receiving time of the receiving processor subunit, a request from an external entity (e.g., a host computer) to access data that is being processed, a refresh operation performed on memory resources (e.g., a memory array) associated with the sending or receiving processor subunit, and the like. According to embodiments of the present disclosure, the processing subunits may be fed by a clock. In some embodiments, the clock supplied to a processing subunit may be controlled, e.g., using a clock enable signal. According to some embodiments of the present disclosure, controller 7540 may control the timing of communication commands by controlling the clock enable signal to processor subunits 7520_1 through 7520_K.

At step S7730, the data transfer may be performed based on the data transfer timing determined at step S7720. Step S7730 may be managed by, for example, controller 7540. For example, the sending processor subunit of first distributed processor memory chip 7500 may transfer data to the receiving processor subunit of second distributed processor memory chip 7500' according to the data transfer timing determined at step S7720.
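The three steps S7710 through S7730 can be summarized in a short sketch. Everything below is an illustrative assumption rather than the disclosed implementation: readiness is modeled as a cycle number, refresh operations are modeled as half-open cycle windows during which no transfer may start, and the names `Request`, `decide_time`, and `transfer` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """S7710: a request to move data from one chip to another."""
    src_chip: str
    dst_chip: str
    payload: bytes


def decide_time(now, sender_ready_at, receiver_ready_at, refresh_windows=()):
    """S7720: earliest cycle at which both sides are ready and no refresh is active."""
    t = max(now, sender_ready_at, receiver_ready_at)
    moved = True
    while moved:                              # repeat in case windows are chained
        moved = False
        for start, end in refresh_windows:    # each window is [start, end)
            if start <= t < end:
                t = end                       # wait until the refresh completes
                moved = True
    return t


def transfer(request, now, sender_ready_at, receiver_ready_at, refresh_windows=()):
    """S7730: perform the transfer at the cycle chosen in S7720."""
    when = decide_time(now, sender_ready_at, receiver_ready_at, refresh_windows)
    return {"cycle": when, "from": request.src_chip,
            "to": request.dst_chip, "bytes": len(request.payload)}
```

Where the transfer time is predetermined by software, as the text allows, `decide_time` would reduce to returning that fixed cycle and no request object would be needed.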

The disclosed architecture may be useful in a variety of applications. For example, in some cases, the architecture may facilitate the sharing of data, such as weights, neuron values, or partial neuron values associated with neural networks (especially large neural networks), among different distributed processor memory chips. In addition, certain operations, such as SUM and AVG, may require data from multiple different distributed processor memory chips. In such cases, the disclosed architecture may facilitate the sharing of this data among the multiple distributed processor memory chips. Further still, the disclosed architecture may facilitate the sharing of records between distributed processor memory chips to support, for example, join operations of queries.
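As a sketch of how SUM and AVG could draw on several chips, each chip can reduce its own data to a partial result, and only these partial results need to cross the inter-chip bus. This decomposition is a common technique assumed here for illustration; the disclosure does not prescribe it, and the function names are hypothetical.

```python
def local_partial(values):
    """Computed on each chip: the local sum and the local element count."""
    return (sum(values), len(values))


def combine(partials):
    """Computed on one chip (or the host): global SUM and AVG from the partials."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    average = total / count if count else None   # AVG is undefined for no data
    return total, average


# Example data held by three chips (values are made up for the illustration).
chip_data = {"7500": [1, 2, 3], "7500'": [4, 5], '7500"': [6]}
total, average = combine([local_partial(v) for v in chip_data.values()])
```

Sending the pair (sum, count) rather than a local average keeps the global AVG exact even when the chips hold different numbers of values.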

It should also be noted that, while the embodiments above are described with respect to distributed processor memory chips, the same principles and techniques may be applied to ordinary memory chips that do not include, e.g., distributed processor subunits. For example, in some cases, multiple memory chips may be combined with one another as multi-port memory chips to form an array of memory chips even without an array of processor subunits. In another embodiment, multiple memory chips may be combined to form an array of connected memories, presenting a host with what is effectively one larger memory made up of the multiple memory chips.

The internal connection of a port may be to a main bus or to one of the internal processor subunits included in the processing array.

In-memory zero value detection

Some embodiments of the present disclosure relate to memory units for detecting a zero value stored at one or more particular addresses of a plurality of memory banks. The zero value detection feature of the disclosed memory units may be useful for reducing the power consumption of a computing system and, additionally or alternatively, may reduce the processing time required to retrieve zero values from memory. This feature is especially relevant in systems where much of the data that is read is in fact zero-valued, and for computational operations such as multiply/add/subtract operations, for which retrieving a zero value from memory may be unnecessary (e.g., the product of a zero value and any other value is zero) and for which the computation circuitry can exploit the fact that one of the operands is zero to compute the result more time- or energy-efficiently. In such cases, detecting the presence of a zero value may be used in place of a memory access and retrieval of the zero value from memory.

Throughout this section, the disclosed embodiments are described with respect to a read function. It should be noted, however, that the disclosed architectures and techniques are equally applicable to zero value write operations, and to operations on other specific, predetermined non-zero values where such values are likely to appear more frequently than other values.

In the disclosed embodiments, rather than retrieving a zero value from memory, when such a value is detected at a particular address, the memory unit may return a zero-value indicator to one or more circuits external to the memory unit (e.g., one or more processors, a CPU, etc., located outside the memory unit). The zero value is a multi-bit zero value (e.g., a zero-value byte, a zero-value word, or a multi-bit zero value that is less than one byte, more than one byte, etc.). Because the zero-value indicator is a 1-bit signal indicating a zero value stored in memory, transferring the 1-bit indicator is advantageous compared with transferring n bits of data stored in the memory. Transmitting the zero indication may reduce the energy consumed by the transfer to roughly 1/n of that of an n-bit transfer and may speed up computations in which multiplication is involved, such as computing inputs weighted by neuron weights, convolutions, applying a kernel to input data, and many other computations associated with trained neural networks, artificial intelligence, and a wide array of other computation types. To provide this functionality, the disclosed memory unit may include one or more zero-value detection logic units that detect the presence of a zero value at a particular location in memory, prevent retrieval of the zero value (e.g., via a read command), and instead cause a zero-value indicator to be transmitted to circuitry external to the memory unit (e.g., using one or more control lines of the memory, one or more buses associated with the memory unit, etc.). Zero-value detection may be performed at the memory mat level, bank level, sub-bank level, chip level, etc.
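By way of illustration only, the transfer savings described above can be estimated with a short Python sketch. It assumes (none of this is stated in the disclosure) that a zero-valued word costs exactly one indicator bit, that a non-zero word costs its full n data bits, and that transfer energy scales with the number of bits driven. The function name and its parameters are hypothetical.

```python
def expected_bits_per_read(zero_fraction: float, word_bits: int) -> float:
    """Average bits driven per read under the simplified model above.

    zero_fraction: assumed share of reads that hit a zero-valued word (0..1).
    word_bits:     assumed word width n.
    """
    if not 0.0 <= zero_fraction <= 1.0:
        raise ValueError("zero_fraction must be between 0 and 1")
    # Zero words send a 1-bit indicator; all other words send n data bits.
    return zero_fraction * 1 + (1.0 - zero_fraction) * word_bits


# Example: 32-bit words, half of the reads are zero-valued.
average_bits = expected_bits_per_read(0.5, 32)
```

Under these assumptions the benefit grows with the share of zero-valued words, which is consistent with the emphasis above on workloads whose data contains many zeros.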

Although the disclosed embodiments are described here with respect to delivering a zero-value indicator to a location external to the memory chip, the disclosed embodiments and features may also provide significant benefits where processing is performed inside the memory chip. For example, in embodiments such as the distributed processor memory chips disclosed herein, processing may be performed on data in the various memory banks by the corresponding processor subunits. In many cases, such as the execution of neural networks or data analytics, where the associated data may include many zeros, the disclosed techniques may speed up the processing performed by the processor subunits of the distributed processor memory chip and/or reduce the power consumed by that processing.

FIG. 78A illustrates a system 7800 for detecting, at the chip level, a zero value stored at one or more particular addresses of a plurality of memory banks implemented in a memory chip 7810, consistent with embodiments of the present disclosure. System 7800 may include memory chip 7810 and a host 7820. Memory chip 7810 may include a plurality of control units, and each control unit may have a dedicated memory bank. For example, a control unit may be operably connected to a dedicated memory bank.

In some cases, for example with the distributed processor memory chips of the present disclosure, which include processor subunits spatially distributed among an array of memory banks, processing within the memory chip may involve memory accesses (whether reads or writes). Even where processing is internal to the memory chip, the disclosed techniques for detecting a zero value associated with a read or write command may allow an internal processor unit or subunit to forgo the transfer of the actual zero value. Instead, in response to detection of a zero value and transmission of a zero-value indicator (e.g., to one or more internal processing subunits), the distributed processor memory chip may save energy that would otherwise have been spent transferring zero-valued data within the memory chip.

In another example, memory chip 7810 and host 7820 may each include an input/output (IO) that enables communication between memory chip 7810 and host 7820. Each IO may be coupled to a zero-value indicator line 7830A and a bus 7840A. Zero-value indicator line 7830A may carry a zero-value indicator from memory chip 7810 to host 7820, where the zero-value indicator may include a 1-bit signal generated by memory chip 7810 upon detecting a zero value stored at a particular address of a memory bank requested by host 7820. Upon receiving the zero-value indicator via zero-value indicator line 7830A, host 7820 may perform one or more predefined actions associated with the zero-value indicator. For example, if host 7820 requested memory chip 7810 to retrieve an operand for a multiplication, host 7820 can compute the multiplication more efficiently, because it learns from the received zero-value indicator (without receiving the actual memory value) that one of the operands is zero. Host 7820 may also provide commands, data, and other input to memory chip 7810 and read output from memory chip 7810 via bus 7840. Upon receiving a communication from host 7820, memory chip 7810 may retrieve data associated with the received communication and transfer the retrieved data to host 7820 via bus 7840.
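As a hedged illustration of how a host might act on the zero-value indicator during multiplication, the Python sketch below accumulates a dot product and skips the multiply whenever an operand arrives as an indicator rather than as data. Representing each read as an `(is_zero, data)` pair and the function name `dot_product_with_indicators` are assumptions made for this example and are not taken from the disclosure.

```python
from typing import List, Optional, Tuple

# Each read is modeled as (is_zero, data); data is None when only the
# 1-bit zero-value indicator was received.
Read = Tuple[bool, Optional[int]]


def dot_product_with_indicators(weights: List[int],
                                reads: List[Read]) -> Tuple[int, int]:
    """Return (result, number_of_multiplications_actually_performed)."""
    accumulator = 0
    multiplies = 0
    for weight, (is_zero, data) in zip(weights, reads):
        if is_zero:
            # The product is known to be zero, so no multiply is issued
            # and no data word had to be transferred.
            continue
        accumulator += weight * data
        multiplies += 1
    return accumulator, multiplies


result, multiplies = dot_product_with_indicators(
    [2, 3, 4],
    [(True, None), (False, 5), (True, None)],
)
```

In this example only one of the three multiplications is performed, while the result equals that of the full computation.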

In some embodiments, the host may send a zero-value indicator to the memory chip instead of a zero data value. In this way, the memory chip (e.g., a controller disposed on the memory chip) may store or refresh a zero value in memory without having to receive the zero data value. Such an update may occur based on receipt of the zero-value indicator (e.g., as part of a write command).

FIG. 78B illustrates memory chip 7810 detecting, at the memory bank level, a zero value stored at one or more particular addresses of a plurality of memory banks 7811A, 7811B, consistent with embodiments of the present disclosure. Memory chip 7810 may include the plurality of memory banks 7811A, 7811B and an IO bus 7812. Although FIG. 78B shows two memory banks 7811A, 7811B implemented in memory chip 7810, memory chip 7810 may include any number of memory banks.

IO bus 7812 may be configured to send data to and receive data from an external chip (e.g., host 7820 of FIG. 78A) via bus 7840B. Bus 7840B may function similarly to bus 7840A of FIG. 78A. IO bus 7812 may also transmit a zero-value indicator via zero-value indicator line 7830B, which may function similarly to zero-value indicator line 7830A of FIG. 78A. IO bus 7812 may further be configured to communicate with memory banks 7811A, 7811B via internal zero-value indicator line 7831 and bus 7841. IO bus 7812 may forward data received from the external chip to one of memory banks 7811A, 7811B. For example, IO bus 7812 may transmit, via bus 7841, data including a command to read the data stored at a particular address of memory bank 7811A. A multiplexer (mux) may be included between IO bus 7812 and memory banks 7811A, 7811B, and may be connected by internal zero-value indicator line 7831 and bus 7841A. The multiplexer may be configured to forward received data from IO bus 7812 to a particular memory bank, and may be further configured to forward received data or a received zero-value indicator from a particular memory bank to IO bus 7812.

In some cases, a host entity may be configured to receive only regular data transfers and may not be equipped to interpret or respond to the disclosed zero-value indicator. In such cases, the disclosed embodiments (e.g., the controller, the chip IO, etc.) may regenerate a zero value on the data lines to the host IO in place of the zero-value indicator signal, so that data transfer power is still saved internally within the chip.
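The fallback for hosts that cannot interpret the indicator can be sketched as follows. This is a simplified model only: the function `drive_host_interface`, the `host_supports_indicator` flag, and the LSB-first list of data-line bits are all assumptions introduced for illustration.

```python
from typing import List, Optional, Tuple


def drive_host_interface(zero_indicator: bool,
                         data: Optional[int],
                         host_supports_indicator: bool,
                         word_bits: int = 8) -> Tuple[int, Optional[List[int]]]:
    """Return (indicator_line, data_lines) as presented to the host IO.

    data_lines is an LSB-first list of bits, or None when the data lines
    are not driven at all.
    """
    if zero_indicator:
        if host_supports_indicator:
            # Host understands the 1-bit indicator: data lines stay idle.
            return 1, None
        # Legacy host: regenerate an all-zero word at the chip IO, so the
        # saving is limited to the transfer inside the chip.
        return 0, [0] * word_bits
    # Non-zero data is transferred normally.
    return 0, [(data >> i) & 1 for i in range(word_bits)]


legacy_view = drive_host_interface(True, None, host_supports_indicator=False,
                                   word_bits=4)
```

The internal path from the bank to the chip IO still carries only the indicator in the legacy case; only the final hop to the host carries a full zero word.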

Memory banks 7811A, 7811B each include a control unit. The control unit may detect a zero value stored at a requested address of the memory bank. Upon detecting a stored zero value, the control unit may generate a zero-value indicator and transmit it to IO bus 7812 via internal zero-value indicator line 7831, and the zero-value indicator may then be transmitted to the external chip via zero-value indicator line 7830B.

FIG. 79 illustrates a memory bank 7911 detecting, at the memory mat level, a zero value stored at one or more particular addresses of a plurality of memory mats, consistent with embodiments of the present disclosure. In some embodiments, memory bank 7911 may be organized into memory mats 7912A, 7912B, each of which may be independently controlled and independently accessed. Memory bank 7911 may include memory mat controllers 7913A, 7913B, which may include zero-value detection logic units 7914A, 7914B. Memory mat controllers 7913A, 7913B may enable reads from and writes to locations on memory mats 7912A, 7912B, respectively. Memory bank 7911 may further include read-disable elements, local sense amplifiers 7915A, 7915B, and/or a global sense amplifier 7916.

Memory mats 7912A, 7912B may each include a plurality of memory cells. Each of the plurality of memory cells may store one bit of binary information. For example, any of the memory cells may individually store a zero value. If all of the memory cells in a particular memory mat store zero values, a zero value may be associated with the entire memory mat.

Memory mat controllers 7913A, 7913B may each be configured to access a dedicated memory mat, and may read data stored in the dedicated memory mat or write data to the dedicated memory mat.

In some embodiments, zero-value detection logic unit 7914A or 7914B may be implemented within memory bank 7911. One or more zero-value detection logic units 7914A, 7914B may be associated with a memory bank, a memory sub-bank, a memory mat, and a set of one or more memory cells. Zero-value detection logic unit 7914A or 7914B may detect that a particular requested address (e.g., in memory mat 7912A or 7912B) stores a zero value. The detection may be performed in several ways.

A first method may include using a digital comparator against zero. The digital comparator may be configured to take two numbers as input in binary form and determine whether the first number (the retrieved data) is equal to the second number (zero). If the digital comparator determines that the two numbers are equal, the zero-value detection logic unit may generate a zero-value indicator. The zero-value indicator may be a 1-bit signal and may disable the amplifiers (e.g., local sense amplifiers 7915A, 7915B), transmitters, and buffers that would otherwise send the data bits to the next level (e.g., IO bus 7812 of FIG. 78B). The zero-value indicator may further be sent via zero-value indicator line 7931A or 7931B to global sense amplifier 7916, although in some cases it may bypass the global sense amplifier.

A second method of zero-value detection may include using an analog comparator. The analog comparator may function similarly to the digital comparator, except that it uses the voltages of two analog inputs for the comparison. For example, all of the bits may be sensed, and the comparator may act as a logical OR function across the signals.
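The comparator-based detection described above reduces, at the logic level, to checking that no sensed bit is high. The following Python sketch models that behavior only; it does not model voltages or circuit timing, and the helper names `any_bit_high` and `zero_value_indicator` are hypothetical.

```python
from typing import Iterable


def any_bit_high(bits: Iterable[int]) -> int:
    """Logical OR across all sensed bits (each bit is 0 or 1)."""
    result = 0
    for bit in bits:
        result |= bit
    return result


def zero_value_indicator(bits: Iterable[int]) -> int:
    """Return 1 when every sensed bit is 0, i.e. the word equals zero."""
    # The indicator is the complement of the OR of the bits (a NOR).
    return 1 - any_bit_high(bits)


indicator_for_zero_word = zero_value_indicator([0, 0, 0, 0])
indicator_for_nonzero_word = zero_value_indicator([0, 1, 0, 0])
```

A digital comparator against zero yields the same truth table; the two methods differ in how the comparison is physically carried out rather than in the logical result.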

A third method of zero-value detection may include using the signals passed from local sense amplifiers 7915A, 7915B into global sense amplifier 7916, where global sense amplifier 7916 is configured to sense whether any of its inputs is high (non-zero) and to use that logic signal to control the next level of amplifiers. Local sense amplifiers 7915A, 7915B and global sense amplifier 7916 may include a plurality of transistors configured to sense low-power signals from the plurality of memory banks and to amplify the small voltage swings to higher voltage levels, so that the data stored in the plurality of memory banks can be interpreted by at least one controller, such as memory mat controller 7913A or 7913B. For example, memory cells may be laid out in rows and columns on memory bank 7911. Each line may be attached to each memory cell in the row. The lines that run along the rows are called wordlines and are activated by selectively applying a voltage to them. The lines that run along the columns are called bitlines, and two complementary bitlines may be attached to a sense amplifier at the edge of the memory array. The number of sense amplifiers may correspond to the number of bitlines (columns) on memory bank 7911. To read a bit from a particular memory cell, the wordline along the cell's row is turned on, activating all of the memory cells in the row. The value (0 or 1) stored in each cell then becomes available on the bitline associated with that cell. The sense amplifier at the end of the two complementary bitlines may amplify the small voltages to normal logic levels. The bit from the desired cell may then be latched from the cell's sense amplifier into a buffer and placed on the output bus.

A fourth method of zero-value detection may include using an extra bit, stored in memory for each word and set at write time if the value is zero, and using this extra bit when the data is read to determine whether or not the data is zero. This method can also avoid writing all of the zeros to memory, thereby saving additional energy.
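A minimal sketch of the extra-bit scheme is given below, assuming one flag bit per word and a memory that starts out cleared. The class `FlaggedMemory`, its fields, and the write counter are hypothetical names introduced for illustration; the disclosure does not specify how the flag is stored.

```python
class FlaggedMemory:
    """Toy word-addressable memory with one zero-flag bit per word."""

    def __init__(self, size: int) -> None:
        self.zero_flag = [True] * size  # assumption: all words start as zero
        self.payload = [0] * size       # the stored data bits
        self.payload_writes = 0         # counts full-word writes performed

    def write(self, address: int, value: int) -> None:
        if value == 0:
            # Only the flag bit is written; the data bits are left untouched.
            self.zero_flag[address] = True
            return
        self.zero_flag[address] = False
        self.payload[address] = value
        self.payload_writes += 1

    def read(self, address: int) -> int:
        # The flag alone decides whether the word is zero; the payload is
        # consulted only for non-zero words.
        if self.zero_flag[address]:
            return 0
        return self.payload[address]


memory = FlaggedMemory(4)
memory.write(0, 9)
memory.write(0, 0)  # overwrites with zero by setting only the flag
```

After the second write, the payload still holds the stale value 9, yet reads return zero because the flag takes precedence. This is how the sketch models avoiding the write of the zero bits themselves.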

As described above and throughout this disclosure, some embodiments may include a memory unit (e.g., memory unit 7800) that includes a plurality of processor subunits. These processor subunits may be spatially distributed on a single substrate (e.g., the substrate of a memory chip such as memory unit 7800). Further, each of the plurality of processor subunits may be dedicated to a corresponding memory bank among the plurality of memory banks of memory unit 7800. The memory banks dedicated to the corresponding processor subunits may likewise be spatially distributed on the substrate. In some embodiments, memory unit 7800 may be associated with a particular task (e.g., performing one or more operations associated with running a neural network), and each of the processor subunits of memory unit 7800 may be responsible for performing a portion of that task. For example, each processor subunit may be provided with instructions that may include data handling and memory operations, arithmetic and logic operations, etc. In some cases, the zero-value detection logic may be configured to provide a zero-value indicator to one or more of the described processor subunits spatially distributed on memory unit 7800.

FIG. 80 is a flowchart of an exemplary method 8000 of detecting a zero value stored at a particular address of a plurality of memory banks, consistent with embodiments of the present disclosure. Method 8000 may be performed by a memory chip (e.g., memory chip 7810 of FIG. 78B). In particular, a controller of the memory unit (e.g., controller 7913A of FIG. 79) and zero-value detection logic (e.g., zero-value detection logic unit 7914A) may perform method 8000.

At step 8010, a read or write operation may be initiated in any suitable manner. In some cases, the controller may receive a request to read data stored at a particular address of a plurality of discrete memory banks (e.g., the memory banks shown in FIG. 78). The controller may be configured to control at least one aspect of read/write operations relative to the plurality of discrete memory banks.

At step 8020, one or more zero-value detection circuits may be used to detect the presence of a zero value associated with the read or write command. For example, zero-value detection logic (e.g., zero-value detection logic 7830 of FIG. 78) may detect a zero value associated with the particular address associated with the read or write.

At step 8030, the controller may transmit a zero-value indicator to one or more circuits external to the memory unit in response to the zero-value detection by the zero-value detection logic at step 8020. For example, the zero-value detection logic may determine that the requested address stores a zero value and may send an indication that the value is zero to an entity (e.g., one or more circuits) outside the memory chip (or inside the memory chip, as in the case of the disclosed distributed processor memory chips, which include processor subunits distributed among an array of memory banks). If no zero value is detected as being associated with the read or write command, the controller may transmit the data value instead of a zero-value indicator. In some embodiments, the one or more circuits to which the zero-value indicator is sent may be internal to the memory unit.
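For a read operation, steps 8010 through 8030 can be summarized in a short Python sketch. It is a behavioral model under stated assumptions: the bank is represented as a plain list, the function name `method_8000_read` and the tagged tuples it returns are hypothetical, and the software model necessarily inspects the value, whereas the disclosed hardware prevents retrieval of the zero value.

```python
from typing import List, Optional, Tuple


def method_8000_read(bank: List[int],
                     address: int) -> Tuple[str, Optional[int]]:
    """Return ("zero_indicator", None) or ("data", value) for one read."""
    # Step 8010: a read of `address` has been initiated.
    value = bank[address]

    # Step 8020: the zero-value detection logic checks the addressed word.
    zero_detected = (value == 0)

    # Step 8030: send the 1-bit indicator if zero was detected;
    # otherwise send the data value itself.
    if zero_detected:
        return "zero_indicator", None
    return "data", value


example_bank = [0, 42]
zero_read = method_8000_read(example_bank, 0)
data_read = method_8000_read(example_bank, 1)
```

The recipient of the returned item may be a circuit outside the memory unit or, in the distributed processor case, a processor subunit inside it; the sketch does not distinguish between the two.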

Although the disclosed embodiments have been described with respect to zero-value detection, the same principles and techniques may be applied to the detection of other memory values (e.g., 1, etc.). In some cases, in addition to a zero-value indicator, the detection logic may transmit one or more indicators of other values (e.g., 1, etc.) associated with a read or write command, and such an indicator may be sent whenever a value corresponding to that value indicator is detected. In some cases, the values may be adjusted by a user (e.g., by updating one or more registers). Such updates may be particularly useful where characteristics of a data set are known and it is understood (e.g., on the part of the user) that certain values are more prevalent in the data than others. In such cases, one, two, three, or more indicators may be associated with the most prevalent data values associated with the data set.
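The generalization to user-adjustable values can be sketched as detection logic backed by a small set of registers. Everything named here (`ValueIndicatorLogic`, `update_register`, `detect`, and returning the index of the matching register) is an assumption made for illustration; the disclosure states only that the values may be adjusted, for example by updating one or more registers.

```python
from typing import List, Optional, Sequence


class ValueIndicatorLogic:
    """Toy detection logic that flags words equal to a registered value."""

    def __init__(self, tracked_values: Sequence[int] = (0,)) -> None:
        # One register per indicator; by default only zero is tracked.
        self.registers: List[int] = list(tracked_values)

    def update_register(self, index: int, value: int) -> None:
        """Model a user retargeting one indicator to another value."""
        self.registers[index] = value

    def detect(self, word: int) -> Optional[int]:
        """Return the index of the matching indicator, or None if the
        word matches no register and must be transferred as data."""
        for index, tracked in enumerate(self.registers):
            if word == tracked:
                return index
        return None


logic = ValueIndicatorLogic((0, 1))
hit_before_update = logic.detect(1)
logic.update_register(1, 255)  # e.g. 255 is known to be common in the data
hit_after_update = logic.detect(255)
```

With k indicators, a matching word could be signaled with far fewer bits than the word itself, but how multiple indicators would be encoded on the indicator lines is not specified in the text and is left open here.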

Compensation for DRAM activation penalty

In certain types of memory (e.g., DRAM), memory cells may be arranged in arrays within memory banks, and the values contained in the memory cells may be accessed and retrieved (read) one line of memory cells in the array at a time. The read process may involve first opening (activating) a line (row) of memory cells to make the data values stored by the memory cells available. Next, the values of the memory cells in the open line may be sensed simultaneously, and column addresses may be used to cycle through individual memory cell values or groups of memory cell values (i.e., words), connecting each to an external data bus in order to read it. These processes take time. In some cases, opening a memory line for reading may require 32 cycles of compute time, and reading the values from the open line may require another 32 cycles. Significant latency may result if the next line to be read is opened only after the read of the currently open line has completed. In this example, no data is read during the 32 cycles required to open the next line, so each line effectively takes a total of 64 cycles to read rather than only the 32 cycles needed to iterate over the line data. Conventional memory systems do not allow a second line in the same bank to be opened while a first line is being read or written. Therefore, to reduce latency, the next line to be opened may be located in a different bank or in a special bank that supports dual line access, as described in more detail below. The current line may also be sampled in its entirety into flip-flops or latches before the next line is opened, so that all processing is performed on the flip-flops/latches while the next line is being opened. If the next predicted line is in the same bank (and none of the above is available), the latency may be unavoidable and the system may need to wait. These mechanisms apply both to standard memories and, in particular, to memory processing devices.

The embodiments disclosed herein may reduce this latency, for example by predicting the next memory line to be opened before the read of the currently open memory line has completed. That is, if the next line to be opened can be predicted, the process of opening the next line can begin before the read of the current line has completed. Depending on when in the process the next-line prediction is made, the latency associated with opening the next line may be reduced from 32 cycles (in the particular example described above) to fewer than 32 cycles. In one particular example, if the next line to open is predicted 20 cycles in advance, the additional latency is only 12 cycles. In another example, if the next line to open is predicted 32 cycles in advance, there is no latency at all. As a result, rather than requiring a total of 64 cycles to open and read each row serially, the effective time to read each row may be reduced by opening the next row while the current row is being read.
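The cycle counts in the examples above follow from simple arithmetic, sketched below in Python. The function names and the parameter `prediction_lead` (the number of cycles before the current read completes at which activation of the next row begins) are hypothetical, and the model assumes that the predicted row is correct and can be activated concurrently with the current read.

```python
def activation_stall(activation_cycles: int, prediction_lead: int) -> int:
    """Cycles spent waiting for the next row after the current read ends."""
    # Activation that overlaps the current read is hidden; only the
    # remainder, if any, is exposed as a stall.
    return max(0, activation_cycles - prediction_lead)


def effective_cycles_per_row(read_cycles: int,
                             activation_cycles: int,
                             prediction_lead: int) -> int:
    """Read time plus any activation time that could not be hidden."""
    return read_cycles + activation_stall(activation_cycles, prediction_lead)


no_prediction = effective_cycles_per_row(32, 32, 0)      # serial open, then read
lead_of_20 = activation_stall(32, 20)                    # predicted 20 cycles early
full_overlap = effective_cycles_per_row(32, 32, 32)      # predicted 32 cycles early
```

A misprediction is not modeled; in that case the stall would presumably revert to the full activation time for the row that is actually needed.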

The mechanisms below are described for a current line and a predicted line located in different banks; however, a bank that supports activating one line while operating on another line at the same time may also be used.

The predictive address generator may include a pattern learning model that observes the rows accessed, identifies one or more patterns associated with the accesses (e.g., sequential line access, access to every even line, access to every line that is a multiple of three, etc.), and estimates the next row to be accessed based on the observed patterns. In another example, the predictive address generator may include a unit that applies a formula or algorithm to predict the next row to be accessed. In still other embodiments, the predictive address generator may include a trained neural network that outputs a predicted next row to be accessed (including one or more addresses associated with the predicted row) based on inputs such as the current address/row being accessed, the previous 2, 3, 4, or more addresses/rows accessed, and so on. Using any of the predictive address generators described above to predict the next memory line to be accessed may significantly reduce the latency associated with memory accesses. The described predictive address/row generators may be useful in any system in which memory is accessed to retrieve data. In some cases, the described predictive address/row generators and the associated methods for predicting the next memory line access may be particularly suitable for systems that run artificial intelligence models, because AI models may be associated with repetitive memory access patterns that facilitate next-row prediction.
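As an illustration of the pattern-based variant described above, the following is a minimal sketch of a predictor that detects a constant stride (sequential lines, every even line, every third line) in recent row accesses. The class name `StridePredictor`, its methods, and the history length are hypothetical and do not come from the disclosure; a learning model or neural network could take its place.

```python
from collections import deque
from typing import Optional


class StridePredictor:
    """Hypothetical pattern-based next-row predictor (illustration only)."""

    def __init__(self, history: int = 4) -> None:
        # Keep only the most recently accessed row numbers.
        self.rows = deque(maxlen=history)

    def observe(self, row: int) -> None:
        """Record a row that was actually accessed."""
        self.rows.append(row)

    def predict(self) -> Optional[int]:
        """Return the predicted next row, or None if no single stride is seen."""
        if len(self.rows) < 3:
            return None  # not enough history to call it a pattern
        rows = list(self.rows)
        strides = {b - a for a, b in zip(rows, rows[1:])}
        if len(strides) != 1:
            return None  # accesses do not follow one constant stride
        return rows[-1] + strides.pop()


predictor = StridePredictor()
for accessed_row in (0, 3, 6, 9):  # access to every third line
    predictor.observe(accessed_row)
print(predictor.predict())  # 12
```

Returning `None` when no stride is found leaves room for the caller to skip early activation instead of acting on a weak guess.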

FIG. 81A illustrates a system that activates a next row associated with memory bank 8100 based on next-row prediction, according to embodiments of the present disclosure. System 8100 may include a current and predictive address generator 8192, a bank controller 8191, and memory banks 8180A and 8180B. The address generator may be an entity that generates addresses for accesses within memory banks 8180A and 8180B, and may be based on any logic circuit, controller, or microprocessor executing a software program. Bank controller 8191 may be configured to access a current row of memory bank 8180A (e.g., using a current row identifier generated by address generator 8192). Bank controller 8191 may also be configured to activate a predicted next row to be accessed within memory bank 8180B based on a predicted row identifier generated by address generator 8192. The following example describes two banks; in other examples, more banks may be used. In some embodiments, a memory bank may allow more than one row to be accessed at a time (described below), so the same process can be performed on a single bank. As described above, activation of the predicted next row to be accessed may begin before completion of the read operation performed on the current row being accessed. Thus, in some cases, address generator 8192 may predict the next row to be accessed and send an identifier of the predicted next row (e.g., one or more addresses) to controller 8191 at any time before access to the current row is complete. With this timing, the bank controller can initiate activation of the predicted next row at any point while the current row is being accessed and before that access is complete. In some cases, bank controller 8191 may initiate activation of the predicted next row of memory bank 8180 at the same time as (or within a few clock cycles of) the completion of activation of the current row being accessed and/or the start of the read operation on the current row.

In some embodiments, the operation on the current row associated with the current address may be a read or a write operation. In some embodiments, the current row and the next row may be in the same memory bank. In some embodiments, the same memory bank may allow the next row to be accessed while the current row is being accessed. The current row and the next row may also be in different memory banks. In some embodiments, the memory unit may include a processor configured to generate the current address and the predicted address. In some embodiments, the memory unit may include a distributed processor. The distributed processor may include a plurality of processor subunits of a processing array that are spatially distributed among a plurality of discrete memory banks of a memory array. In some embodiments, the predicted address may be generated by a series of flip-flops that sample the generated address with a delay. The delay may be set via a multiplexer that selects among the flip-flops storing the sampled addresses.

Here, once it is confirmed that the predicted next row is the actual next row that the executing software requests to access (e.g., after completion of the read operation on the current row), the predicted next row may become the current row to be accessed. In the disclosed embodiments, because the process of activating the predicted next row can be initiated before the current-row read operation completes, the next row to be accessed may already be fully or partially activated by the time the prediction is confirmed to be correct. This can significantly reduce the latency associated with line activation. A power reduction may also be achieved when the next row is activated such that its activation finishes before, or at the same time as, the read of the current row finishes.
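The latency benefit can be made concrete with a small back-of-the-envelope model. The sketch below assumes that every prediction is correct and that activation of the next row starts when the read of the current row starts; the function name `total_cycles` and the cycle counts are made up for illustration and are not taken from the disclosure.

```python
def total_cycles(n_rows: int, t_activate: int, t_read: int, overlap: bool) -> int:
    """Cycles needed to read n_rows rows one after another.

    With overlap=True, every next-row prediction is assumed correct and the
    activation of row i+1 starts when the read of row i starts.
    """
    if n_rows <= 0:
        return 0
    if not overlap:
        # Each row pays its full activation time and then its read time.
        return n_rows * (t_activate + t_read)
    # Only the first activation is fully exposed. After that, each step takes
    # whichever is longer: the current read or the next row's activation.
    return t_activate + (n_rows - 1) * max(t_activate, t_read) + t_read


# Made-up timings: 10 cycles to activate a row, 8 cycles to read it.
print(total_cycles(4, 10, 8, overlap=False))  # 72
print(total_cycles(4, 10, 8, overlap=True))   # 48
```

Under these assumed numbers, overlapping activation with reads hides most of the activation time; a misprediction would add back one exposed activation for the affected row.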

Current and predictive address generator 8192 may include any logic element, computation unit, memory unit, algorithm, learning model, or the like configured to identify a row to be accessed in memory bank 8180 (e.g., based on program execution) and to predict the next row to be accessed (e.g., based on a pattern observed in row accesses, based on a predetermined pattern such as n+1 or n+2, etc.). For example, in some embodiments, current and predictive address generator 8192 may include a counter 8192A, a current address generator 8192B, and a predictive address generator 8192C. Current address generator 8192B may be configured to generate the current address of the current row to be accessed in memory bank 8180 based, for example, on an output of counter 8192A or on a request from a computation unit. The address associated with the current row to be accessed may be provided to bank controller 8191. Predictive address generator 8192C may be configured to determine the predicted address of the next row to be accessed in memory bank 8180 based on an output of counter 8192A, based on a predetermined access pattern (e.g., together with counter 8192A), or based on the output of a trained neural network or another type of pattern prediction algorithm that observes line accesses and predicts the next line to be accessed from patterns associated with the observed accesses. Address generator 8192 may provide the predicted next row address from predictive address generator 8192C to bank controller 8191.

In some embodiments, current address generator 8192B and predictive address generator 8192C may be implemented inside or outside system 8100. An external host may also be implemented outside system 8100 and further connected to system 8100. For example, current address generator 8192B may be software on an external host that executes a program, while predictive address generator 8192C may be implemented inside or outside system 8100.

As described above, the predicted next row address may be determined using a trained neural network that predicts the next row to be accessed based on one or more inputs, which may include one or more previously accessed row addresses. The trained neural network or other type of model may operate within the logic associated with predictive address generator 8192C. In some cases, the trained neural network or the like may be executed by one or more computation units that are external to predictive address generator 8192C but communicate with it.

In some embodiments, predictive address generator 8192C may include a replica or substantial replica of current address generator 8192B. In addition, the operation timing of current address generator 8192B and predictive address generator 8192C may be fixed or adjustable relative to each other. For example, in some cases, predictive address generator 8192C may be configured to output the address identifier associated with the predicted next row at a fixed time (e.g., a fixed number of clock cycles) relative to the time at which current address generator 8192B generates the address identifier associated with the next row to be accessed. In some cases, the predicted next row identifier may be generated before or after activation of the current row to be accessed begins, before or after the read operation associated with the current row begins, or at any time before the read operation associated with the current row being accessed completes. In some cases, the predicted next row identifier may be generated at the same time as the start of activation of the current row to be accessed or the start of the read operation associated with that row.

In other cases, the time between generation of the predicted next row identifier and activation of the current row to be accessed, or initiation of the read operation associated with the current row, may be adjustable. For example, in some cases, this time may be lengthened or shortened during operation of memory unit 8100 based on values associated with one or more operating parameters. In some cases, a temperature (or other parameter value) associated with the memory unit or with another component of the computing system may cause current address generator 8192B and predictive address generator 8192C to change their relative operation timing. In some embodiments, the prediction mechanism may be part of the logic.

Current and predictive address generator 8192 may generate a confidence level associated with its determination of the predicted next row to be accessed. This confidence level (which may be determined by predictive address generator 8192C as part of the prediction process) may be used, for example, to decide whether to initiate activation of the predicted next row during the read operation on the current row (i.e., before the current-row read operation completes and before the identity of the next row to be accessed is confirmed). For example, in some cases, the confidence level associated with the predicted next row may be compared with a threshold level. If the confidence level falls below the threshold level, memory unit 8100 may forgo activation of the predicted next row. If, on the other hand, the confidence level exceeds the threshold level, memory unit 8100 may initiate activation of the predicted next row in memory bank 8180.

Testing the confidence level of the predicted next row against the threshold level, and subsequently initiating or not initiating activation of the predicted next row, may be done in any suitable manner. For example, in some cases, if the confidence level associated with the predicted next row falls below the threshold level, predictive address generator 8192C may forgo outputting the predicted next row result to downstream logic elements. Alternatively, in such cases, current and predictive address generator 8192 may withhold the predicted next row identifier from bank controller 8191, or the bank controller (or another logic unit) may be configured to use the confidence level of the predicted next row to decide whether to begin activating the predicted next row before the read operation associated with the current row being read is complete.

For example, in some cases, the confidence level may depend on whether one or more previous next-row predictions proved to be correct (e.g., a past performance indicator). The confidence level may also be based on one or more characteristics of the input to the algorithm/model. For example, an input containing actual row accesses that follow a pattern may yield a higher confidence level than an input containing actual row accesses that follow the pattern less closely. In some cases, such as when randomness is detected in an input stream containing recent row accesses, the generated confidence level may be low. Furthermore, when randomness is detected, the next-row prediction process may be halted altogether, the next-row prediction may be ignored by one or more components of memory unit 8100, or any other action may be taken to forgo activation of the predicted next row.
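One way to picture the confidence check is a gate that derives a confidence value from recent prediction outcomes and compares it with a threshold. The following is a minimal sketch under that assumption; the class `ConfidenceGate`, the sliding-window estimate, and the default values are hypothetical, and the disclosure leaves the actual confidence computation open.

```python
from collections import deque


class ConfidenceGate:
    """Hypothetical confidence estimate based on recent prediction outcomes."""

    def __init__(self, threshold: float = 0.75, window: int = 8) -> None:
        self.threshold = threshold
        # True means the prediction matched the row that was actually accessed.
        self.outcomes = deque(maxlen=window)

    def record(self, predicted_row: int, actual_row: int) -> None:
        """Store whether a past prediction turned out to be correct."""
        self.outcomes.append(predicted_row == actual_row)

    def confidence(self) -> float:
        """Fraction of recent predictions that were correct."""
        if not self.outcomes:
            return 0.0  # no history yet, so no confidence
        return sum(self.outcomes) / len(self.outcomes)

    def should_activate_early(self) -> bool:
        # Assumption: a confidence exactly equal to the threshold does not
        # trigger early activation; the text only covers "below" and "exceeds".
        return self.confidence() > self.threshold


gate = ConfidenceGate(threshold=0.5, window=4)
for predicted, actual in ((1, 1), (2, 2), (3, 5)):
    gate.record(predicted, actual)
print(gate.should_activate_early())  # True
```

A stream of random accesses would drive the recorded hit rate down, so the same gate also covers the low-confidence case in which early activation is skipped.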

In some cases, a feedback mechanism may be included in the operation of memory 8100. For example, periodically, or even after each next-row prediction, the accuracy with which predictive address generator 8192C predicts the actual next row to be accessed may be determined. In some cases, if the prediction of the next row to be accessed is wrong (or after a predetermined number of errors), the next-row prediction operation of predictive address generator 8192C may be suspended. In other cases, predictive address generator 8192C may include a learning element so that one or more aspects of its prediction operation can be adjusted based on feedback received about the accuracy of its next-row predictions. This capability may improve the operation of predictive address generator 8192C and allow it to adapt to changing access patterns and the like.
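The simplest form of this feedback, suspending prediction after a set number of errors, can be sketched as follows. The class `PredictionFeedback`, its `max_errors` parameter, and the `reset` method are hypothetical names chosen for illustration; the learning-element variant is not shown.

```python
class PredictionFeedback:
    """Hypothetical feedback element that suspends prediction after errors."""

    def __init__(self, max_errors: int = 3) -> None:
        self.max_errors = max_errors
        self.errors = 0
        self.enabled = True

    def report(self, predicted_row: int, actual_row: int) -> None:
        """Compare a prediction with the row that was actually accessed."""
        if predicted_row != actual_row:
            self.errors += 1
            if self.errors >= self.max_errors:
                self.enabled = False  # suspend next-row prediction

    def reset(self) -> None:
        """Re-enable prediction, for example after a reset of the unit."""
        self.errors = 0
        self.enabled = True


feedback = PredictionFeedback(max_errors=2)
feedback.report(predicted_row=4, actual_row=4)   # correct
feedback.report(predicted_row=5, actual_row=9)   # first error
feedback.report(predicted_row=10, actual_row=2)  # second error
print(feedback.enabled)  # False
```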

In some embodiments, the timing of generating the predicted next row and/or activating the predicted next row may depend on the overall operation of memory unit 8100. For example, after power-up or after a reset of memory unit 8100, prediction of the next row to be accessed (or sending the predicted next row to bank controller 8191) may be suspended (e.g., for a predetermined time or number of clock cycles, until a predetermined number of row accesses/reads have completed, until the confidence level of the predicted next row exceeds a predetermined threshold level, or based on any other suitable condition).

FIG. 81B illustrates another configuration of memory unit 8100 according to the disclosed embodiments. In system 8100B of FIG. 81B, a cache 8193 may be associated with bank controller 8191. Cache 8193 may be configured, for example, to store one or more rows of data after they have been accessed so that those rows need not be activated again. Cache 8193 may thus allow bank controller 8191 to access row data from cache 8193 instead of accessing memory bank 8180. For example, cache 8193 may store the last X rows of data (or follow any other caching strategy), and bank controller 8191 may fill cache 8193 according to the predicted rows. In addition, if a predicted row is already in cache 8193, the predicted row need not be opened again, and the bank controller (or a cache controller implemented within cache 8193) may protect the predicted row from being swapped out. Cache 8193 may provide several advantages. First, because cache 8193 loads rows into itself and the bank controller can access cache 8193 to fetch row data, next-row prediction requires neither a special bank nor more than one bank. Second, because the physical distance from bank controller 8191 to cache 8193 is shorter than the physical distance from bank controller 8191 to memory bank 8180, reads from and writes to cache 8193 may save energy. Third, because cache 8193 is smaller and closer to controller 8191, the latency caused by cache 8193 is lower than that of memory bank 8180. In some cases, for example, the identifier of the predicted next row generated by the predictive address generator may be stored in cache 8193 as the predicted next row is activated in memory bank 8180 by bank controller 8191. Based on program execution or the like, current address generator 8192B may identify the actual next row to be accessed in memory bank 8180. The identifier associated with the actual next row to be accessed may be compared with the identifier of the predicted next row stored in cache 8193. If the actual next row and the predicted next row are the same, bank controller 8191 may initiate a read operation on that row once its activation is complete (the row may already be fully or partially activated as a result of the next-row prediction process). If, on the other hand, the actual next row to be accessed (as determined by current address generator 8192B) does not match the predicted next row identifier stored in cache 8193, no read operation is initiated on the fully or partially activated predicted next row, and the system begins activating the actual next row to be accessed.
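The caching behavior described for cache 8193 (keep the last X rows, and protect a predicted row from being swapped out) can be sketched as a small software model. The class `RowCache`, its method names, and the least-recently-used eviction policy are assumptions made for illustration; the disclosure allows any caching strategy.

```python
from collections import OrderedDict
from typing import Optional


class RowCache:
    """Hypothetical model of a row cache that keeps the last `capacity` rows."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.rows = OrderedDict()  # row number -> row data, oldest first
        self.protected = None      # predicted row that must not be evicted

    def protect(self, row: Optional[int]) -> None:
        """Mark the predicted row so that it is not swapped out."""
        self.protected = row

    def fill(self, row: int, data: bytes) -> None:
        """Store a row, evicting the oldest unprotected row if the cache is full."""
        if row in self.rows:
            self.rows.move_to_end(row)
            self.rows[row] = data
            return
        if len(self.rows) >= self.capacity:
            victim = next((r for r in self.rows if r != self.protected), None)
            if victim is None:
                return  # nothing can be evicted, so this row is not cached
            del self.rows[victim]
        self.rows[row] = data

    def read(self, row: int) -> Optional[bytes]:
        """Return cached row data, or None if the bank must be accessed."""
        if row not in self.rows:
            return None
        self.rows.move_to_end(row)
        return self.rows[row]


cache = RowCache(capacity=2)
cache.fill(1, b"row-1")
cache.fill(2, b"row-2")
cache.protect(1)          # row 1 is the predicted next row
cache.fill(3, b"row-3")   # evicts row 2, not the protected row 1
print(cache.read(1), cache.read(2))  # b'row-1' None
```

A `None` result from `read` corresponds to the case where the row must still be activated in the memory bank.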

Dual Activation Bank

As noted above, it is worth describing several mechanisms that allow a bank to be built with the ability to activate one row while other rows are being processed. Several embodiments may be provided for a bank that activates an additional row while another row is being accessed. Although the embodiments are described with only two rows being activated, they can of course be applied to more rows. In a first proposed embodiment, a memory bank may be divided into memory subbanks, and the described embodiments may be used to activate a predicted or required next row in one subbank while a read operation is performed on a line in another subbank. For example, as shown in FIG. 81C, memory bank 8180 may be configured to include multiple memory subbanks 8181. Bank controller 8191 associated with memory bank 8180 may include a plurality of subbank controllers, each associated with a corresponding subbank. A first subbank controller of the plurality of subbank controllers may be configured to enable access to data contained in a current row of a first subbank of the plurality of subbanks while a second subbank controller of the plurality of subbank controllers activates a next row of a second subbank of the plurality of subbanks. A single column decoder may be used when only words of one subbank are accessed at a time. Two banks may be tied to the same output bus so that they appear as a single bank. The input to this single bank may likewise be a single address plus an additional row address for opening a new row.

FIG. 81C illustrates first and second subbank row controllers 8183A and 8183B for each memory subbank 8181. As shown in FIG. 81C, memory bank 8180 may include a plurality of subbanks 8181. In addition, bank controller 8191 may include a plurality of subbank controllers 8183A-B, each associated with a corresponding subbank 8181. First subbank controller 8183A of the plurality of subbank controllers may be configured to enable access to data contained in a current row of a first portion of subbank 8181 while second subbank controller 8183B activates a next row of a second portion of subbank 8181.

Because activating a row immediately adjacent to a row being accessed may distort the accessed row and/or cause errors in the data read from it, the disclosed embodiments may be configured so that the next row predicted for activation is spaced, for example, at least two rows away from the current row of the first subbank from which data is being accessed. In some embodiments, the rows to be activated may be spaced apart by at least one mat, so that the activations take place in different mats. The second subbank controller may be configured to cause access to data contained in a current row of the second subbank while the first subbank controller activates a next row of the first subbank. The activated next row of the first subbank may be spaced at least two rows away from the current row of the second subbank from which data is being accessed.

The predetermined distance between the rows being read/accessed and the rows being activated may be determined by hardware, for example by coupling different portions of the memory bank to different row decoders, and may be maintained by software so that data is not corrupted. The spacing from the current row may be two rows or more (e.g., 3, 4, 5, or more). The distance may change over time, for example based on an evaluation of the distortion that has occurred in the stored data. Distortion may be evaluated in various ways, for example by calculating a signal-to-noise ratio, an error rate, the error codes needed to repair the distortion, and so on. Two rows can actually be activated when they are far enough apart and two bank controllers are implemented on the same bank. The new architecture (two controllers implemented on the same bank) can prevent lines from being opened within the same mat.
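The spacing rules above amount to two simple checks that a controller, or software maintaining the spacing, could apply before activating a second row. The sketch below is illustrative only: the function names, the default minimum of two rows, and the `rows_per_mat` layout parameter are assumptions, and rows are assumed to be numbered in physical order.

```python
def far_enough(current_row: int, next_row: int, min_rows: int = 2) -> bool:
    """True if the row to activate is at least min_rows away from the current row."""
    return abs(next_row - current_row) >= min_rows


def in_different_mats(row_a: int, row_b: int, rows_per_mat: int) -> bool:
    """True if two rows fall in different mats, assuming equally sized mats."""
    return row_a // rows_per_mat != row_b // rows_per_mat


print(far_enough(10, 11))                         # False: adjacent rows
print(far_enough(10, 12))                         # True: two rows apart
print(in_different_mats(10, 12, rows_per_mat=8))  # False: both in mat 1
print(in_different_mats(7, 8, rows_per_mat=8))    # True: mats 0 and 1
```

If the distance is changed over time, as the text allows, only the `min_rows` argument would need to vary.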

FIG. 81D illustrates an embodiment for next-row prediction according to embodiments of the present disclosure. This embodiment may include an additional pipeline of flip-flops (address registers A to C). The pipeline may be implemented with any number of flip-flops (stages), as needed to delay the overall execution by the time required for activation after the address generator. With the addresses delayed in this way, the prediction may be the newly generated address (the start of the pipe, below address register C), while the current address is the end of the pipeline. In this embodiment, no replica address generator is needed. A selector (the multiplexer in FIG. 81D) may be added to configure the delay, while the address registers provide the delay.
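The flip-flop pipeline can be modeled in software as a shift register with a selectable tap. In the sketch below, the class `AddressPipeline`, the three-stage default, and the tap numbering are assumptions for illustration: the newest address at the start of the pipe serves as the prediction, and the multiplexer selects which stage supplies the current address.

```python
from typing import List, Optional


class AddressPipeline:
    """Hypothetical software model of a flip-flop address pipeline with a mux."""

    def __init__(self, stages: int = 3) -> None:
        # regs[0] is the stage nearest the address generator (start of the pipe).
        self.regs: List[Optional[int]] = [None] * stages

    def clock(self, new_address: int) -> None:
        """One clock edge: the new address enters and the oldest one drops out."""
        self.regs = [new_address] + self.regs[:-1]

    def predicted(self) -> Optional[int]:
        """The newest address, used to activate the next row early."""
        return self.regs[0]

    def current(self, select: Optional[int] = None) -> Optional[int]:
        """Multiplexer: choose which stage supplies the current address.

        select=None picks the last stage, which gives the longest delay.
        """
        index = len(self.regs) - 1 if select is None else select
        return self.regs[index]


pipe = AddressPipeline(stages=3)
for address in (10, 11, 12):
    pipe.clock(address)
print(pipe.predicted(), pipe.current(), pipe.current(select=1))  # 12 10 11
```

Changing `select` shortens or lengthens the gap between the early activation and the actual access without a second address generator.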

FIG. 81E illustrates an embodiment of a memory bank according to embodiments of the present disclosure. The memory bank may be implemented so that activating a new line does not damage the current line, provided the newly activated line is far enough from the current line. As shown in FIG. 81E, the memory bank may include an additional memory mat (shown in black) between every two lines of mats. A control unit (e.g., a row decoder) can thus activate lines that are separated by a mat.

In some embodiments, the memory unit may be configured to receive, at a given time, a first address for processing and a second address for activation and access.

FIG. 81F illustrates another embodiment of a memory bank according to embodiments of the present disclosure. The memory bank may be implemented so that activating a new line does not damage the current line, provided the newly activated line is far enough from the current line. The embodiment shown in FIG. 81F may allow the row decoder to open both line n and line n+1 by implementing all even lines in the upper half of the memory bank and all odd lines in the lower half. With this implementation, consecutive lines can always be accessed while remaining far enough apart.
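The even/odd arrangement can be expressed as a mapping from logical line number to physical row. The sketch below assumes a bank with an even number of lines whose upper half occupies physical rows 0 to N/2 - 1; the function name `physical_row` and this physical numbering are illustrative assumptions, not details from the disclosure.

```python
def physical_row(line: int, n_lines: int) -> int:
    """Map a logical line to a physical row: even lines go to the upper half,
    odd lines to the lower half (assumed layout, upper half first)."""
    if n_lines % 2 != 0 or not 0 <= line < n_lines:
        raise ValueError("n_lines must be even and line must be in range")
    half = n_lines // 2
    if line % 2 == 0:
        return line // 2         # upper half
    return half + line // 2      # lower half


n_lines = 8
layout = [physical_row(line, n_lines) for line in range(n_lines)]
print(layout)  # [0, 4, 1, 5, 2, 6, 3, 7]

# Physical distance between each pair of consecutive logical lines.
gaps = [abs(layout[i + 1] - layout[i]) for i in range(n_lines - 1)]
print(min(gaps))  # 3
```

With this mapping, consecutive logical lines are always at least N/2 - 1 physical rows apart, which is why opening line n+1 while line n is in use does not involve adjacent rows once the bank has more than a few lines.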

A dual control memory bank according to the disclosed embodiments may enable access to, and activation of, different portions of a single memory bank, even when the dual control memory bank is configured to output only one data unit at a time. For example, as described, dual control may allow the memory bank to access a first row while activating a second row (e.g., a predicted next row or a predetermined next row to be accessed).

FIG. 82 illustrates a dual control memory bank 8280 for reducing the memory row activation penalty (e.g., latency) according to embodiments of the present disclosure. The dual control memory bank 8280 may include inputs comprising a data input (DIN) 8290, a row address (ROW) 8291, a column address (COLUMN) 8292, a first command input (COMMAND_1) 8293, and a second command input (COMMAND_2) 8294. The memory bank 8280 may also include a data output (Dout) 8295.

It is assumed that an address has a row address and a column address and that there are two row decoders. Other address configurations are possible, there may be more than two row decoders, and there may be more than one column decoder.

The row address (ROW) 8291 may identify the row associated with a command, such as an activate command. Because row activation is followed by reading from or writing to the row, it may not be necessary to send the row address again for writes to, or reads from, the row while it is open (after activation).

The first command input (COMMAND_1) 8293 may be used to send commands (such as an activate command) to a row accessed by the first row decoder. The second command input (COMMAND_2) 8294 may be used to send commands (such as an activate command) to a row accessed by the second row decoder.

The data input (DIN) 8290 may be used to supply data when a write operation is executed.

Because an entire row may not be read at once, individual row segments may be read sequentially, and the column address (COLUMN) 8292 may indicate which segment (i.e., which column) of the row is to be read. For convenience of explanation, it is assumed that there are 2^Q segments, that the column input has Q bits, and that Q is a positive integer greater than 1.
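The relation between the Q-bit column input and the number of segments can be sketched as follows. The function name `split_row` and the equal-width segmentation are assumptions for illustration only; the disclosure does not specify segment widths.

```python
def split_row(row, q_bits):
    """Split an open row into 2**q_bits equally sized segments.

    The segment index corresponds to the value on the Q-bit column input.
    """
    if q_bits <= 1:
        raise ValueError("Q is assumed to be an integer greater than 1")
    num_segments = 1 << q_bits          # 2^Q segments
    if len(row) % num_segments != 0:
        raise ValueError("row length must be a multiple of the segment count")
    width = len(row) // num_segments
    return [row[c * width:(c + 1) * width] for c in range(num_segments)]


# With Q = 5 the column input selects one of 32 segments.
segments = split_row(list(range(64)), q_bits=5)
```

For instance, Q = 5 yields 32 segments, so reading a whole row one segment per cycle would take 32 cycles.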

The dual control memory bank 8280 may operate together with, or separately from, the address prediction described above with reference to FIGS. 81A to 81B. To reduce operating latency, the dual control memory bank may operate together with address prediction according to the present disclosure.

FIGS. 83A, 83B, and 83C illustrate examples of accessing and activating rows of the memory bank 8180. In one example, as mentioned above, it is assumed that reading and activating one row requires 32 cycles (segments). In addition, to reduce the activation penalty (whose length is denoted delta), it may be advantageous to know in advance (at least delta cycles before the next row needs to be accessed) that the next row should be opened. In some cases, delta may be 4 cycles. Each memory bank shown in FIGS. 83A, 83B, and 83C may include two or more subbanks, in each of which, in some embodiments, only one row may be open at a time. In some cases, even rows may be associated with a first subbank and odd rows with a second subbank. In such an example, using the disclosed predicted-address embodiments may make it possible to begin activating a row in one memory subbank before (delta cycles before) the end of a read operation on a row in another memory subbank is reached. In this way, sequential memory accesses (e.g., a predetermined memory access sequence in which rows 1, 2, 3, 4, 5, 6, 7, 8, ... are to be read, where rows 1, 3, 5, etc. are associated with a first memory subbank and rows 2, 4, 6, etc. are associated with a different, second memory subbank) may be performed in a highly efficient manner.

FIG. 83A may illustrate a state for accessing memory rows contained in different memory subbanks. In the state shown in FIG. 83A:

a. Row A may be accessible by the first row decoder. The first segment (the leftmost shaded segment) may be accessed after the first row decoder activates row A.

b. Row B may be accessible by the second row decoder. In the state shown in FIG. 83A, row B is closed and has not yet been activated.

Prior to the state shown in FIG. 83A, an activate command and the address of row A may be sent to the first row decoder.

FIG. 83B illustrates a state for accessing row B after row A has been accessed. In this example, row A may be accessible by the first row decoder. In the state shown in FIG. 83B, the first row decoder has activated row A, and all segments except the four rightmost segments (the four unshaded segments) have been accessed. Because delta (the four white segments in row A) is 4 cycles, the bank controller may cause the second row decoder to activate row B before the rightmost segments of row A are accessed. In some cases, row B may be activated in response to a predetermined access pattern (e.g., sequential row access in which odd rows are assigned to the first subbank and even rows to the second subbank). In other cases, row B may be activated in response to any of the row prediction methods described above. The bank controller may cause the second row decoder to activate row B so that row B is already activated (open) when it is accessed, rather than having to wait for row B to be activated at the time of access.

The following operations may take place before the state shown in FIG. 83B:

a. Sending an activate command and the address of row A to the first row decoder.

b. Reading or writing the first 28 segments of row A.

c. After the first 28 segments of row A have been read or written, sending an activate command with the address of row B to the second row decoder.
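The effect of issuing the next activation delta cycles early can be estimated with a simple cycle count. This is a rough sketch under stated assumptions, not a timing model of the disclosed hardware: the function name `total_cycles` is hypothetical, each segment is assumed to take one cycle, activation is assumed to take exactly `delta` cycles, and consecutive rows are assumed to fall in different subbanks so that the activation can overlap the ongoing access.

```python
def total_cycles(num_rows, segments=32, delta=4, overlap=True):
    """Count cycles needed to read `num_rows` consecutive rows.

    overlap=True : the activate command for the next row is issued once
                   (segments - delta) segments of the current row are done.
    overlap=False: the next row is activated only after the current row
                   has been fully read.
    """
    cycle = 0
    ready_at = delta                      # first activation is issued at cycle 0
    for _ in range(num_rows):
        start = max(cycle, ready_at)      # stall until the row is open
        end = start + segments            # one segment per cycle
        if overlap:
            issue = start + max(segments - delta, 0)
        else:
            issue = end
        ready_at = issue + delta          # when the next row will be open
        cycle = end
    return cycle


with_overlap = total_cycles(3)                    # 4 + 3 * 32
without_overlap = total_cycles(3, overlap=False)  # 3 * (4 + 32)
```

Under these assumptions, only the first row pays the activation penalty when activations are overlapped; every later row is already open by the time its first segment is needed.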

In some embodiments, the even rows are located in one half of one or more memory banks. In some embodiments, the odd rows are located in one half of one or more memory banks.

In some embodiments, a line of extra redundant mats may be placed between every two mat lines to create the distance needed to allow activation. In some embodiments, lines that are close to each other may not be activated at the same time.

FIG. 83C may illustrate a state for accessing row C (e.g., the next odd row contained in the first subbank) after row A has been accessed. As shown in FIG. 83C, row B may be accessible by the second row decoder. As shown, the second row decoder has activated row B, and all segments except the four rightmost segments (the four unshaded segments) have been accessed. In this example, delta is 4 cycles, so the bank controller may cause the first row decoder to activate row C before the rightmost segments of row B are accessed. The bank controller may cause the first row decoder to activate row C so that row C is already activated (open) when it is accessed, rather than having to wait for row C to be activated at the time of access. Operating in this manner may reduce or entirely eliminate the latency associated with memory read operations.

Memory Mat as Register File

In computer architecture, processor registers are storage locations that are quickly accessible to a computer's processor (e.g., a CPU). Registers generally comprise the memory units closest to the processor core (L0). Registers may provide the fastest way to access certain types of data. A computer may include several types of registers, each classified according to the type of information it stores or the type of instructions that operate on the information in that type of register. For example, a computer may include data registers that hold numeric information, operands, intermediate results, and settings; address registers that store address information used by instructions to access main memory; general-purpose registers that store both data and address information; status registers; and so on. A register file comprises a logical group of registers available for use by a computer processing unit.

In many cases, a computer's register file is located within the processing unit (e.g., the CPU) and is implemented with logic transistors. In the disclosed embodiments, however, the computational processing units may not reside in a conventional CPU. Instead, such processing elements (e.g., processor subunits) may be spatially distributed within a memory chip as a processing array (as described above). Each processor subunit may be associated with one or more corresponding dedicated memory units (e.g., memory banks). With this architecture, each processor subunit may be located spatially close to the one or more memory elements that store the data on which that particular processor subunit operates. As described in this disclosure, such an architecture may significantly speed up certain memory-intensive operations, for example by eliminating the memory access bottleneck experienced by typical CPU and external-memory architectures.

The distributed processor memory chip architecture described herein may nevertheless still benefit from a register file containing various types of registers for operating on data from the memory elements dedicated to the corresponding processor subunit. Because the processor subunits may be distributed among the memory elements of the memory chip, it may be possible to add one or more memory elements to a corresponding processor subunit to function as a register file or cache for that processor subunit, rather than as main memory storage (which may be advantageous in a given fabrication process compared with logic elements built in that process).

Such an architecture may provide several advantages. For example, because the register file is part of the corresponding processor subunit, the processor subunit may be located spatially close to its register file. This arrangement may significantly increase operating efficiency. A conventional register file is implemented with logic transistors. For example, each bit of a conventional register file consists of about 12 logic transistors, so a 16-bit register file consists of 192 logic transistors. Such a register file may require a large number of logic components to access the logic transistors and may therefore occupy considerable area. Compared with a register file implemented with logic transistors, the register file of embodiments of the present disclosure may require significantly less area. This size reduction may be achieved by implementing the register file with a memory mat containing memory cells fabricated by a process optimized for manufacturing memory structures rather than logic structures.

In some embodiments, a distributed processor memory chip may be provided. The distributed processor memory chip may include a substrate, a memory array disposed on the substrate and including a plurality of discrete memory banks, and a processing array disposed on the substrate and including a plurality of processor subunits. Each processor subunit may be associated with a corresponding dedicated discrete memory bank among the plurality of discrete memory banks. The distributed processor memory chip may also include a first plurality of buses and a second plurality of buses. Each bus of the first plurality of buses may connect one processor subunit of the plurality of processor subunits to its corresponding dedicated memory bank. Each bus of the second plurality of buses may connect one processor subunit of the plurality of processor subunits to another processor subunit of the plurality of processor subunits. In some cases, the second plurality of buses may connect one or more processor subunits of the plurality of processor subunits to two or more other processor subunits of the plurality of processor subunits. One or more processor subunits of the plurality of processor subunits may also include at least one memory mat disposed on the substrate. The at least one memory mat may be configured to serve as at least one register of a register file for one or more processor subunits of the plurality of processor subunits.

In some cases, the register file may be associated with one or more logic components that enable the memory mat to function as one or more registers of the register file. For example, such logic components may include switches, amplifiers, inverters, sense amplifiers, and the like. In an example where the register file is implemented by a DRAM mat, logic components may be included to perform refresh operations that prevent loss of the stored data. Such logic components may include row and column multiplexers ("muxes"). In addition, a register file implemented by a DRAM mat may include a redundancy mechanism to counter yield loss.

FIG. 84 illustrates a conventional computer architecture 8400 that includes a CPU 8402 and an external memory 8406. During operation, values from the memory 8406 may be loaded into registers associated with a register file 8404 included in the CPU 8402.

FIG. 85A illustrates an exemplary distributed processor memory chip 8500a according to the disclosed embodiments. Unlike the architecture of FIG. 84, the distributed processor memory chip 8500a includes memory elements and processor elements disposed on the same substrate. That is, the chip 8500a may include a memory array and a processing array including a plurality of processor subunits, each associated with one or more dedicated memory banks included in the memory array. In the architecture of FIG. 85A, the registers used by the processor subunits are provided by one or more memory mats disposed on the same substrate on which the memory array and the processing array are formed.

As shown in FIG. 85A, the distributed processor memory chip 8500a may be formed by a plurality of processing groups 8510a, 8510b, and 8510c disposed on a substrate 8502. More specifically, the distributed processor memory chip 8500a may include a memory array 8520 and a processing array 8530 disposed on the substrate 8502. The memory array 8520 may include a plurality of memory banks, such as memory banks 8520a, 8520b, and 8520c. The processing array 8530 may include a plurality of processor subunits, such as processor subunits 8530a, 8530b, and 8530c.

In addition, each processing group 8510a, 8510b, and 8510c may include a processor subunit and one or more corresponding memory banks dedicated to that processor subunit. In the embodiment shown in FIG. 85A, each processor subunit 8530a, 8530b, and 8530c may be associated with a corresponding dedicated memory bank 8520a, 8520b, and 8520c. That is, processor subunit 8530a may be associated with memory bank 8520a, processor subunit 8530b may be associated with memory bank 8520b, and processor subunit 8530c may be associated with memory bank 8520c.

To allow each processor subunit to communicate with its corresponding dedicated memory bank(s), the distributed processor memory chip 8500a may include a first plurality of buses 8540a, 8540b, and 8540c, each connecting one of the processor subunits to its corresponding dedicated memory bank(s). In the embodiment shown in FIG. 85A, bus 8540a may connect processor subunit 8530a to memory bank 8520a, bus 8540b may connect processor subunit 8530b to memory bank 8520b, and bus 8540c may connect processor subunit 8530c to memory bank 8520c.

In addition, to allow each processor subunit to communicate with other processor subunits, the distributed processor memory chip 8500a may include a second plurality of buses 8550a and 8550b, each connecting one of the processor subunits to at least one other processor subunit. In the embodiment shown in FIG. 85A, bus 8550a may connect processor subunit 8530a to processor subunit 8530b, and bus 8550b may connect processor subunit 8530b to processor subunit 8530c.
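The two sets of buses can be summarized as a small connectivity model. The sketch below is illustrative only: the `ProcessingGroup` type and `build_buses` function are hypothetical names, the reference numerals are used merely as string labels, and the chain-style linking of neighboring subunits reflects just the example connections described for FIG. 85A.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessingGroup:
    """One processor subunit paired with its dedicated memory bank."""
    subunit: str
    bank: str


def build_buses(groups):
    """Return (memory_buses, link_buses) for a list of processing groups.

    memory_buses: one bus per group, subunit <-> its dedicated bank.
    link_buses:   one bus between each pair of neighboring subunits.
    """
    memory_buses = [(g.subunit, g.bank) for g in groups]
    link_buses = [(a.subunit, b.subunit) for a, b in zip(groups, groups[1:])]
    return memory_buses, link_buses


groups = [
    ProcessingGroup("8530a", "8520a"),
    ProcessingGroup("8530b", "8520b"),
    ProcessingGroup("8530c", "8520c"),
]
memory_buses, link_buses = build_buses(groups)
```

Other topologies are possible; for example, a subunit could be linked to two or more other subunits, which this chain model does not attempt to capture.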

Each discrete memory bank 8520a, 8520b, and 8520c may include a plurality of memory mats. In the embodiment shown in FIG. 85A, memory bank 8520a may include memory mats 8522a, 8524a, and 8526a; memory bank 8520b may include memory mats 8522b, 8524b, and 8526b; and memory bank 8520c may include memory mats 8522c, 8524c, and 8526c. As described above with reference to FIG. 10, a memory mat may include a plurality of memory cells, and each cell may include a capacitor, a transistor, or other circuitry that stores at least one bit of data. A conventional memory mat may include, for example, 512 bits by 512 bits, but the embodiments disclosed herein are not limited thereto.

At least one of the processor subunits 8530a, 8530b, and 8530c may include at least one memory mat, such as memory mat 8532a, 8532b, or 8532c, configured to serve as a register file for the corresponding processor subunit 8530a, 8530b, or 8530c. That is, the at least one memory mat 8532a, 8532b, or 8532c provides at least one register of a register file used by one or more of the processor subunits 8530a, 8530b, and 8530c. A register file may include one or more registers. In the embodiment shown in FIG. 85A, memory mat 8532a within processor subunit 8530a may serve as a register file (also referred to as "register file 8532a") for processor subunit 8530a (and/or any other processor subunit included in the distributed processor memory chip 8500a), memory mat 8532b within processor subunit 8530b may serve as a register file for processor subunit 8530b, and memory mat 8532c within processor subunit 8530c may serve as a register file for processor subunit 8530c.

At least one of the processor subunits 8530a, 8530b, and 8530c may also include at least one logic component, such as logic component 8534a, 8534b, or 8534c. Each logic component 8534a, 8534b, or 8534c may be configured to enable the corresponding memory mat 8532a, 8532b, or 8532c to serve as a register file for the corresponding processor subunit 8530a, 8530b, or 8530c.

In some embodiments, at least one memory mat may be disposed on the substrate, and the at least one memory mat may include at least one redundant memory bit configured to provide at least one redundant register for one or more processor subunits of the plurality of processor subunits. In some embodiments, at least one processor subunit may include a mechanism that halts the current task and triggers a memory refresh operation at a particular time to refresh the memory mat.

FIG. 85B illustrates an exemplary distributed processor memory chip 8500b according to the disclosed embodiments. The memory chip 8500b shown in FIG. 85B is substantially the same as the memory chip 8500a shown in FIG. 85A, except that the memory mats 8532a, 8532b, and 8532c are not included in the corresponding processor subunits 8530a, 8530b, and 8530c. Instead, the memory mats 8532a, 8532b, and 8532c of FIG. 85B are disposed outside of, but spatially near, the corresponding processor subunits 8530a, 8530b, and 8530c. In this way, the memory mats 8532a, 8532b, and 8532c may still serve as register files for the corresponding processor subunits 8530a, 8530b, and 8530c.

FIG. 85C illustrates a device 8500c according to a disclosed embodiment. The device 8500c includes a substrate 8560, a first memory bank 8570, a second memory bank 8572, and a processing unit 8580. The first memory bank 8570, the second memory bank 8572, and the processing unit 8580 are disposed on the substrate 8560. The processing unit 8580 includes a processor 8584 and a register file 8582 implemented by a memory mat. During operation of the processing unit 8580, the processor 8584 may access the register file 8582 to read or write data.

The distributed processor memory chip 8500a or 8500b, or the device 8500c, may provide various functions based on the processor subunits' access to the registers provided by the memory mats. For example, in some embodiments, the distributed processor memory chip 8500a or 8500b may include a processor subunit that functions as an accelerator coupled to memory so that greater memory bandwidth can be used. In the embodiment shown in FIG. 85A, processor subunit 8530a may function as an accelerator (also referred to as "accelerator 8530a"). Accelerator 8530a may use memory mat 8532a, disposed within accelerator 8530a, to provide one or more registers of a register file. Alternatively, in the embodiment shown in FIG. 85B, accelerator 8530a may use memory mat 8532a, disposed outside accelerator 8530a, as a register file. In addition, accelerator 8530a may use any one of memory mats 8522b, 8524b, and 8526b in memory bank 8520b, or any one of memory mats 8522c, 8524c, and 8526c in memory bank 8520c, to provide one or more registers.

The disclosed embodiments may be particularly useful for certain types of image processing, neural networks, database analysis, compression and decompression, and the like. For example, in the embodiment of FIG. 85A or FIG. 85B, a memory mat may provide one or more registers of a register file for one or more processor subunits included on the same chip as the memory mat. The one or more registers may be used to store data that is frequently accessed by the processor subunit(s). For example, during convolutional image processing, a convolution accelerator may reuse the same coefficients many times over an entire image held in memory. A proposed implementation of such a convolution accelerator may keep all of these coefficients in a "nearby" register file, that is, in one or more registers contained in a memory mat dedicated to one or more processor subunits located on the same chip as the register file memory mat. Such an architecture may place the registers (and the stored coefficient values) close to the processor subunit that operates on the coefficient values. Because a register file implemented by a memory mat can act as a spatially close, efficient cache, losses in data transfer and access latency may be significantly reduced.

In another example, the disclosed embodiments may include an accelerator that can load words into registers provided by a memory mat. The accelerator may treat the registers as a cyclic buffer in order to multiply a vector in a single cycle. For example, in device 8500c shown in FIG. 85C, processor 8584 in processing unit 8580 functions as an accelerator that uses register file 8582, implemented by a memory mat, as a cyclic buffer to store data (A1, A2, A3, ...). First memory bank 8570 stores data (B1, B2, B3, ...) that is to be multiplied by data (A1, A2, A3, ...). Second memory bank 8572 stores the multiplication results (C1, C2, C3, ...). That is, Ci = Ai × Bi. If processing unit 8580 had no register file, processor 8584 would need more memory bandwidth and more cycles to read both data (A1, A2, A3, ...) and data (B1, B2, B3, ...) from an external memory bank such as memory bank 8570 or 8572, which may cause significant delay. In the present embodiment, by contrast, data (A1, A2, A3, ...) is stored in register file 8582 formed within processing unit 8580. Processor 8584 therefore only needs to read data (B1, B2, B3, ...) from external memory bank 8570, and the required memory bandwidth may be significantly reduced.
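The bandwidth argument above can be illustrated with a minimal sketch. It assumes that the A operands fit in the register file and are reused cyclically when the B stream is longer, and it counts one external read per word fetched from a memory bank. The function names and the read-counting model are hypothetical and are not taken from the disclosure.

```python
from itertools import cycle


def multiply_with_register_file(a_registers, b_bank):
    """Compute C[i] = A[i mod len(A)] * B[i] with A held in a local register file.

    Assumption: reading the local register file costs no external bandwidth,
    so only the B operands are counted as external reads.
    """
    external_reads = 0
    c_bank = []
    a_ring = cycle(a_registers)  # registers treated as a cyclic buffer
    for b in b_bank:
        external_reads += 1  # fetch B[i] from the external bank
        c_bank.append(next(a_ring) * b)
    return c_bank, external_reads


def multiply_without_register_file(a_bank, b_bank):
    """Same computation, but both operands are fetched from external banks."""
    external_reads = 0
    c_bank = []
    for i, b in enumerate(b_bank):
        a = a_bank[i % len(a_bank)]
        external_reads += 2  # fetch A[i] and B[i] from external banks
        c_bank.append(a * b)
    return c_bank, external_reads
```

Under these assumptions both functions return the same products, while the register-file variant issues half as many external reads.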

In memory processes, a memory mat usually allows only one-way access (i.e., single access). With one-way access, there is one port into the memory. As a result, only one access operation, e.g., a read or a write, at one address can be performed at a given time. However, if the memory mat itself allows two-way access, two-way access may be a natural choice. With two-way access, two different addresses may be accessed at a given time. The method of accessing the memory mat may be determined based on area and requirements. In some cases, a register file implemented by a memory mat may allow four-way access when it is connected to a processor that must read two sources and has one destination register. In some cases, when the register file is implemented by a DRAM mat to store configuration or cache data, the register file may allow only one-way access. A standard CPU may include multi-way access mats, whereas a one-way access mat may be preferable where DRAM is used.

When a controller or accelerator is designed such that it requires only a single access to its registers, registers implemented by a memory mat may be used in place of a conventional register file. With single access, only one word can be accessed at a time. For example, a processing unit may access two words from two register files at a given time. Each of the two register files may be implemented by a memory mat (e.g., a DRAM mat) that allows only single access.

In most technologies, a memory mat IP, which is a closed block (IP) obtained from the manufacturer, comes with wiring such as word lines and row lines for row and column access. However, the memory mat IP does not include the surrounding logic components. Accordingly, a register file implemented by a memory mat as disclosed in the present embodiments may include logic components. The size of the memory mat may be selected based on the required size of the register file.

Certain issues may arise when a memory mat is used to provide the registers of a register file, and these issues may depend on the particular memory technology used to form that memory mat. For example, in memory production, not all manufactured memories necessarily function properly after production. This is a known issue, especially when there is a high density of SRAM or DRAM on the chip. In memory technology, one or more redundancy mechanisms may be used in response to keep yield at a reasonable level. In the disclosed embodiments, because the number of memory instances (e.g., memory banks) used to provide the registers of a register file is fairly small, redundancy mechanisms may not be as important as in typical memory applications. On the other hand, the same production issues that affect memory functionality may also affect whether a particular memory mat functions properly when providing one or more registers. As a result, the disclosed embodiments may include redundant elements. For example, at least one redundant memory mat may be disposed on the substrate of a distributed processor memory chip. The at least one redundant memory mat may be configured to provide at least one redundant register for one or more processor subunits of the plurality of processor subunits. In another example, a mat may be larger than the required size (e.g., 620x620 rather than 512x512), and the redundancy mechanism may be built into a region of the memory mat outside the 512x512 region or its equivalent.

Another issue may relate to timing. In general, the timing of loading the word lines and bit lines is determined by the size of the memory. Because the register file may be implemented by a single, fairly small memory mat (e.g., 512 x 512 bits), little time may be needed to load a word from the memory mat, and the timing may be sufficient to run fairly fast relative to the logic.

Refresh - Some memory types, such as DRAM, require periodic refresh. The refresh may be performed while the processor or accelerator is paused. For a small memory mat, the time spent refreshing may be a small percentage of the total time. Thus, even if the system halts for a short time, the gain from using a memory mat as registers may be worth the downtime in terms of overall performance. In one embodiment, the processing unit may include a counter that counts down from a predetermined number. When the counter reaches '0', the processing unit may pause the current task being performed by the processor (e.g., the accelerator) and trigger a refresh operation in which the memory mat is refreshed line by line. When the refresh operation ends, the processor may resume its task, and the counter may be reset to count down from the predetermined number.
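The countdown-triggered refresh described above can be sketched as a small simulation. This is only an illustration under stated assumptions: each work step and each refreshed line is modeled as one cycle, and the class name, method names, and parameters are hypothetical rather than part of the disclosure.

```python
class RefreshingProcessingUnit:
    """Toy model of a processing unit that pauses work to refresh a memory mat."""

    def __init__(self, refresh_interval, mat_rows):
        self.refresh_interval = refresh_interval  # the predetermined number
        self.mat_rows = mat_rows                  # lines refreshed one by one
        self.counter = refresh_interval
        self.work_cycles = 0
        self.refresh_cycles = 0

    def tick(self):
        """Advance by one step: either do useful work or run a full refresh."""
        if self.counter == 0:
            # Work is paused while every line of the mat is refreshed
            # (assumed cost: one cycle per line).
            self.refresh_cycles += self.mat_rows
            self.counter = self.refresh_interval  # reset the countdown
            return "refresh"
        self.counter -= 1
        self.work_cycles += 1
        return "work"

    def refresh_overhead(self):
        """Fraction of all cycles spent refreshing."""
        total = self.work_cycles + self.refresh_cycles
        return self.refresh_cycles / total if total else 0.0
```

With a long countdown relative to the number of lines in the mat, `refresh_overhead()` stays small, which matches the observation that refreshing a small mat consumes only a small percentage of the time.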

FIG. 86 shows a flowchart 8600 of an exemplary method of executing at least one instruction in a distributed processor memory chip, consistent with the disclosed embodiments. For example, at step 8602, at least one data value may be retrieved from a memory array on a substrate of the distributed processor memory chip. At step 8604, the retrieved data value may be stored in a register provided by a memory mat of the memory array on the substrate of the distributed processor memory chip. At step 8606, a processor element, such as one or more of the distributed processor subunits on the distributed processor memory chip, may operate on the stored data value from the memory mat register.

본 명세서 전반에 걸쳐, 레지스터 파일은 최하위 레벨 캐시일 수 있으므로, 레지스터 파일에 관한 모든 언급은 캐시에도 동일하게 적용될 수 있음은 당연하다 할 것이다. Throughout this specification, since a register file may be a lowest level cache, it will be understood that all references to a register file apply equally to a cache.

Processing bottleneck

'제1', '제2', '제3' 등의 용어는 서로 상이한 용어들 사이의 차이를 나타내는 용도로만 사용된다. 이러한 용어는 요소의 순서 및/또는 시기 및/또는 중요도를 나타내는 것이 아니다. 예를 들어, 제1 프로세스에 앞서 제2 프로세스가 있는 등이 가능할 수 있다. Terms such as 'first', 'second', and 'third' are used only to indicate a difference between different terms. These terms do not indicate the order and/or timing and/or importance of the elements. For example, it may be possible that the second process precedes the first process, and so on.

'결합'이라는 용어는 직접 및/또는 간접으로 연결됨을 의미할 수 있다. The term 'coupled' may mean directly and/or indirectly connected.

'메모리/프로세싱', '메모리 및 프로세싱', 및 '메모리 프로세싱'은 상호 교환적인 방식으로 사용된다. 'Memory/processing', 'memory and processing', and 'memory processing' are used in an interchangeable manner.

메모리/프로세싱 유닛일 수 있는 다중 방법, 컴퓨터 판독가능 매체, 메모리/프로세싱 유닛, 및/또는 시스템이 있을 수 있다. There may be multiple methods, computer readable media, memory/processing units, and/or systems that may be memory/processing units.

메모리/프로세싱 유닛은 메모리와 프로세싱 능력이 있는 하드웨어 유닛이다. A memory/processing unit is a hardware unit with memory and processing capabilities.

메모리/프로세싱 유닛은 메모리 프로세싱 집적회로일 수 있거나, 메모리 프로세싱 집적회로에 포함될 수 있거나, 하나 이상의 메모리 프로세싱 집적회로를 포함할 수 있다. The memory/processing unit may be a memory processing integrated circuit, may be included in a memory processing integrated circuit, or may include one or more memory processing integrated circuits.

메모리/프로세싱 유닛은 PCT 특허 출원 공개공보 WO2019025892에 도시된 바와 같은 분산 프로세서일 수 있다. The memory/processing unit may be a distributed processor as shown in PCT Patent Application Publication No. WO2019025892.

메모리/프로세싱 유닛은 PCT 특허 출원 공개공보 WO2019025892에 도시된 바와 같은 분산 프로세서를 포함할 수 있다.The memory/processing unit may include a distributed processor as shown in PCT Patent Application Publication No. WO2019025892.

메모리/프로세싱 유닛은 PCT 특허 출원 공개공보 WO2019025892에 도시된 바와 같은 분산 프로세서에 속할 수 있다.The memory/processing unit may belong to a distributed processor as shown in PCT Patent Application Publication No. WO2019025892.

메모리/프로세싱 유닛은 PCT 특허 출원 공개공보 WO2019025892에 도시된 바와 같은 메모리 칩일 수 있다.The memory/processing unit may be a memory chip as shown in PCT Patent Application Publication No. WO2019025892.

메모리/프로세싱 유닛은 PCT 특허 출원 공개공보 WO2019025892에 도시된 바와 같은 메모리 칩을 포함할 수 있다.The memory/processing unit may include a memory chip as shown in PCT Patent Application Publication No. WO2019025892.

메모리/프로세싱 유닛은 PCT 특허 출원 공개공보 WO2019025892에 도시된 바와 같은 메모리 칩에 속할 수 있다.The memory/processing unit may belong to a memory chip as shown in PCT Patent Application Publication No. WO2019025892.

메모리/프로세싱 유닛은 PCT 특허 출원 번호 PCT/IB2019/001005에 도시된 바와 같은 분산 프로세서일 수 있다.The memory/processing unit may be a distributed processor as shown in PCT Patent Application No. PCT/IB2019/001005.

메모리/프로세싱 유닛은 PCT 특허 출원 번호 PCT/IB2019/001005에 도시된 바와 같은 분산 프로세서를 포함할 수 있다.The memory/processing unit may include a distributed processor as shown in PCT Patent Application No. PCT/IB2019/001005.

메모리/프로세싱 유닛은 PCT 특허 출원 번호 PCT/IB2019/001005에 도시된 바와 같은 분산 프로세서에 속할 수 있다.The memory/processing unit may belong to a distributed processor as shown in PCT Patent Application No. PCT/IB2019/001005.

메모리/프로세싱 유닛은 PCT 특허 출원 번호 PCT/IB2019/001005에 도시된 바와 같은 메모리 칩일 수 있다.The memory/processing unit may be a memory chip as shown in PCT Patent Application No. PCT/IB2019/001005.

메모리/프로세싱 유닛은 PCT 특허 출원 번호 PCT/IB2019/001005에 도시된 바와 같은 메모리 칩을 포함할 수 있다.The memory/processing unit may include a memory chip as shown in PCT Patent Application No. PCT/IB2019/001005.

메모리/프로세싱 유닛은 PCT 특허 출원 번호 PCT/IB2019/001005에 도시된 바와 같은 메모리 칩에 속할 수 있다.The memory/processing unit may belong to a memory chip as shown in PCT Patent Application No. PCT/IB2019/001005.

메모리/프로세싱 유닛은 웨이퍼 대 웨이퍼 접합(wafer to wafer bond) 및 다중 도체(multiple conductors)를 활용하여 서로 연결된 집적회로일 수 있다. The memory/processing unit may be an integrated circuit interconnected utilizing wafer to wafer bonds and multiple conductors.

Anything described with respect to a distributed processor memory chip, a distributed memory processing integrated circuit, a memory chip, or a distributed processor may be implemented as a pair of integrated circuits connected to each other using wafer-to-wafer bonding and multiple conductors.

메모리/프로세싱 유닛은 로직 셀보다 메모리 셀에 더 적합한 제1 제조 프로세스에 의해 제조될 수 있다. 따라서, 제1 제조 프로세스는 메모리 가미(memory flavored) 제조 프로세스로 여겨질 수 있다. 메모리 셀은 하나 이상의 트랜지스터를 포함할 수 있다. 로직 셀은 하나 이상의 트랜지스터를 포함할 수 있다. 제1 제조 프로세스는 메모리 뱅크의 제조에 적용될 수 있다. 로직 셀은 함께 논리 기능을 구현하는 하나 이상의 트랜지스터를 포함할 수 있고 더 큰 논리 회로의 블록의 기본 구성 블록으로 사용될 수 있다. 메모리 셀은 함께 메모리 기능을 구현하는 하나 이상의 트랜지스터를 포함할 수 있고 더 큰 논리 회로의 블록의 기본 구성 블록으로 사용될 수 있다. 상응하는 로직 셀들은 동일한 논리 기능을 구현할 수 있다. The memory/processing unit may be manufactured by a first manufacturing process that is more suitable for memory cells than logic cells. Accordingly, the first manufacturing process may be considered a memory flavored manufacturing process. A memory cell may include one or more transistors. A logic cell may include one or more transistors. A first manufacturing process may be applied to manufacturing the memory bank. A logic cell may include one or more transistors that together implement logic functions and may be used as the basic building blocks of blocks of larger logic circuits. A memory cell may include one or more transistors that together implement a memory function and may be used as the basic building block of a larger block of logic circuitry. Corresponding logic cells may implement the same logic function.

메모리/프로세싱 유닛은 메모리 셀보다 로직 셀에 더 적합한 제2 제조 프로세스에 의해 제조된 프로세서, 프로세싱 집적회로, 및/또는 프로세싱 유닛과 다를 수 있다. 따라서, 제2 제조 프로세스는 로직 가미(logic flavored) 제조 프로세스로 여겨질 수 있다. 제2 제조 프로세스는 CPU, GPU 등의 제조에 사용될 수 있다. The memory/processing unit may be different from a processor, processing integrated circuit, and/or processing unit manufactured by a second manufacturing process that is more suitable for logic cells than memory cells. Accordingly, the second manufacturing process may be considered a logic flavored manufacturing process. The second manufacturing process may be used for manufacturing a CPU, GPU, or the like.

메모리/프로세싱 유닛은 프로세서, 프로세싱 집적회로, 및/또는 프로세싱 유닛보다 덜 연산 집약적인 동작의 수행에 더 적합할 수 있다. A memory/processing unit may be more suitable for performing less computationally intensive operations than a processor, processing integrated circuit, and/or processing unit.

For example, a memory cell manufactured by the first manufacturing process may exhibit a critical dimension that exceeds, and even greatly exceeds (e.g., by a factor exceeding 2, 3, 4, 5, 9, 7, 8, 9, 10, and the like), the critical dimension of a logic circuit manufactured by the first manufacturing process.

제1 제조 프로세스는 아날로그 제조 프로세스, DRAM 제조 프로세스 등일 수 있다. The first manufacturing process may be an analog manufacturing process, a DRAM manufacturing process, or the like.

The size of a logic cell manufactured by the first manufacturing process may exceed the size of a corresponding logic cell manufactured by the second manufacturing process by a factor of at least two. The corresponding logic cell may have the same functionality as the logic cell manufactured by the first manufacturing process.

제2 제조 프로세스는 디지털 제조 프로세스일 수 있다. The second manufacturing process may be a digital manufacturing process.

The second manufacturing process may be any one of a complementary metal-oxide-semiconductor (CMOS) process, a bipolar process, a bipolar-CMOS (BiCMOS) process, a double-diffused metal-oxide-semiconductor (DMOS) process, a silicon-on-oxide manufacturing process, and the like.

메모리/프로세싱 유닛은 다중 프로세서 서브유닛을 포함할 수 있다. A memory/processing unit may include multiple processor subunits.

하나 이상의 메모리/프로세싱 유닛의 프로세서 서브유닛들은 서로 개별적으로 및/또는 서로 협조하여 동작 및/또는 분산 프로세싱을 수행할 수 있다. 분산 프로세싱은 다양한 방식으로, 예컨대, 플랫(flat) 방식 또는 계층 방식으로 실행될 수 있다. The processor subunits of one or more memory/processing units may perform operations and/or distributed processing individually and/or cooperatively with each other. Distributed processing may be performed in a variety of ways, for example, in a flat manner or in a hierarchical manner.

플랫 방식은 프로세서 서브유닛들이 동일한 동작들을 수행하도록(및 그 사이 프로세싱의 결과를 출력하거나 출력하지 않도록) 한다. The flat approach allows the processor subunits to perform the same operations (and output or not output the result of the processing in between).

계층 방식은 상이한 계층의 프로세싱 동작의 시퀀스의 실행을 포함할 수 있다. 여기서, 특정 층의 프로세싱 동작은 또 다른 계층의 프로세싱 동작을 뒤따른다. 프로세서 서브유닛들은 상이한 층에 할당(동적 또는 정적)되고 계층적 프로세싱에 가담할 수 있다. A hierarchical approach may include execution of a sequence of processing operations in different layers. Here, the processing operation of a specific layer follows the processing operation of another layer. Processor subunits can be assigned (dynamic or static) to different layers and participate in hierarchical processing.

Distributed processing may also involve other units, such as a controller of a memory/processing unit and/or units that do not belong to a memory/processing unit.

The terms logic and processor subunit may be used interchangeably.

Any processing mentioned in this application may be executed in any manner (e.g., distributed, non-distributed, and the like).

In the following, various references and/or incorporations by reference are made to PCT Patent Application Publication No. WO2019025892 and PCT Patent Application No. PCT/IB2019/001005 (September 9, 2019). PCT Patent Application Publication No. WO2019025892 and/or PCT Patent Application No. PCT/IB2019/001005 provide non-limiting examples of various methods, systems, processors, memory chips, and the like; other methods, systems, processors, and the like may be provided.

프로세서 전에 하나 이상의 메모리/프로세싱 유닛이 있는 프로세싱 시스템이 제공될 수 있고, 각 메모리 및 프로세싱 유닛(메모리/프로세싱 유닛)에는 프로세싱 리소스 및 스토리지 리소스가 있다. A processing system may be provided with one or more memory/processing units prior to the processor, each memory and processing unit (memory/processing unit) having a processing resource and a storage resource.

프로세서는 하나 이상의 메모리/프로세싱 유닛이 다양한 프로세싱 작업을 수행하도록 요청 또는 지시할 수 있다. 다양한 프로세싱 작업의 실행은 프로세서의 로드를 덜고, 지연을 감소시키고, 일부 경우에서는 하나 이상의 메모리/프로세싱 유닛과 프로세서 사이의 정보의 총 대역폭을 감소시킨다. The processor may request or direct one or more memory/processing units to perform various processing tasks. Execution of various processing tasks offloads the processor, reduces latency, and in some cases reduces the total bandwidth of information between one or more memory/processing units and the processor.

The processor may provide instructions and/or requests at different granularities. For example, the processor may send instructions aimed at specific processing resources, or it may send higher-level instructions aimed at a memory/processing unit without specifying any processing resource.

A memory/processing unit may manage its processing and/or memory resources in any manner (e.g., dynamic, static, distributed, centralized, offline, online, and the like). The management of resources may be executed autonomously, under the control of the processor, following configuration by the processor, and the like.

For example, a task may be divided into sub-tasks, each of which may require one or more instructions or execution by one or more processing resources and/or memory resources of one or more memory/processing units. Each processing resource may be configured to execute (e.g., independently or not) at least one instruction. See, for example, the execution of a sub-series of instructions by a processing resource such as a processor subunit of PCT Patent Application Publication No. WO2019025892.

The allocation of at least the memory resources may also be provided by an entity other than the one or more memory/processing units, for example, a direct memory access (DMA) unit that may be coupled to the one or more memory/processing units.

A compiler may prepare a configuration file for each type of task executed by a memory/processing unit. The configuration file may include the memory allocations and the processing resource allocations associated with the task type. The configuration file may include instructions that may be executed by different processing resources and/or may define memory allocations.

예를 들어, 행렬 곱셈 작업(행렬 A를 행렬 B로 곱하는 작업, 즉 A*B = C)에 관한 설정 파일은 행렬 A의 요소들을 어디에 저장할지, 행렬 B의 요소들을 어디에 저장할지, 행렬 C의 요소들을 어디에 저장할지, 행렬 곱셈 중에 생성된 중간 결과를 어디에 저장할지를 지시할 수 있고, 행렬 곱셈에 관한 임의의 모든 수학적 연산의 수행을 위한 프로세싱 리소스를 겨냥한 지시를 포함할 수 있다. 설정 파일은 데이터 구조의 일례이고, 다른 데이터 구조들이 제공될 수 있다. For example, a configuration file for a matrix multiplication operation (multiplying matrix A by matrix B, i.e. A*B = C) tells where to store the elements of matrix A, where to store the elements of matrix B, and where to store the elements of matrix C. It may indicate where to store the elements, where to store intermediate results generated during matrix multiplication, and may include instructions directed to processing resources for performing any and all mathematical operations related to matrix multiplication. The configuration file is an example of a data structure, and other data structures may be provided.
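As a rough illustration of the kind of information such a configuration file might carry for the A*B = C task, the following sketch models it as a small data structure. Every field name, location label, resource label, and instruction string here is hypothetical; the disclosure does not specify a concrete format.

```python
from dataclasses import dataclass, field


@dataclass
class MatMulConfig:
    """Hypothetical per-task configuration for a matrix multiplication A*B = C."""

    a_location: str             # where the elements of matrix A are stored
    b_location: str             # where the elements of matrix B are stored
    c_location: str             # where the elements of matrix C are stored
    intermediate_location: str  # where intermediate results are stored
    # Instructions aimed at individual processing resources.
    instructions: dict = field(default_factory=dict)


config = MatMulConfig(
    a_location="memory_resource_1",
    b_location="memory_resource_2",
    c_location="memory_resource_3",
    intermediate_location="memory_resource_4",
    instructions={
        "processing_resource_1": ["multiply", "accumulate"],
        "processing_resource_2": ["multiply", "accumulate"],
    },
)


def resources_used(cfg):
    """Return the processing resources that receive instructions, in sorted order."""
    return sorted(cfg.instructions)
```

A compiler could emit one such record per task type, and other data structures could serve the same purpose.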

The matrix multiplication may be executed in any manner by the one or more memory/processing units.

The one or more memory/processing units may multiply a matrix A by a vector V. This may be done in any manner. For example, it may involve keeping a row or column of the matrix per processing resource (a different row or column per processing resource) and circulating, between the different processing resources, the result of multiplying the rows or columns of the matrix by the vector (during the first iteration) and the result of the previous multiplication (during the second through the last iteration).

Assume that matrix A is a 4x4 matrix, that vector V is a 1x4 vector, and that there are four processing resources. The first row of matrix A is stored at the first processor subunit, the second row at the second processor subunit, the third row at the third processor subunit, and the fourth row at the fourth processor subunit. The multiplication starts by sending the first through fourth elements of vector V to the first through fourth processing resources, respectively, and multiplying those elements by the different rows of A to provide first intermediate results. The multiplication continues by circulating the first intermediate results, i.e., each processing resource sends the first intermediate result it calculated to its neighboring processing resource. Each processing resource then multiplies the first intermediate result it received by a vector to provide a second intermediate result. This is repeated multiple times until the multiplication of matrix A by vector V is complete.
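One plausible reading of this circulating scheme can be sketched as follows. The sketch assumes that the product being computed is the row vector V times matrix A, that processing resource i holds row i of A together with element i of V, and that the circulated intermediate results are partial sums to which each resource adds its own contribution. The passage does not pin down the exact dataflow, so this is an interpretation, and the function name is hypothetical.

```python
def ring_vector_matrix_multiply(v, a):
    """Compute y = v * a (row vector times matrix) on a simulated ring.

    Resource i owns row a[i] and element v[i]. Partial sums travel around the
    ring; each resource adds v[i] * a[i][c] to the partial sum for column c
    that it currently holds, then forwards it to its neighbor.
    """
    n = len(v)
    partial = [0] * n        # partial[i]: partial sum currently held by resource i
    column = list(range(n))  # column[i]: output column that partial[i] belongs to
    for _ in range(n):
        # Every resource contributes to the partial sum it currently holds.
        for i in range(n):
            partial[i] += v[i] * a[i][column[i]]
        # Circulate: resource i receives what resource i-1 was holding.
        partial = [partial[(i - 1) % n] for i in range(n)]
        column = [column[(i - 1) % n] for i in range(n)]
    # After n steps each partial sum has visited every resource exactly once.
    y = [0] * n
    for i in range(n):
        y[column[i]] = partial[i]
    return y
```

With four resources, the loop runs four times, so each of the four partial sums collects one contribution from every row of A before the result is read out.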

FIG. 90A is an example of a system 10900 that includes one or more memory/processing units (collectively denoted 10910) and a processor 10920. As described above, processor 10920 may send requests or instructions (via link 10931) to the one or more memory/processing units 10910, which may in turn fulfill (or selectively fulfill) the requests and/or instructions and send results (via link 10932) to processor 10920. Processor 10920 may further process the results to provide one or more outputs (via link 10933).

하나 이상의 메모리/프로세싱 유닛은 J개의 메모리 리소스(10912(1,1) - 10912(1,J))를 포함할 수 있고(J는 양의 정수), K개의 프로세싱 리소스(10911(1,1) - 10911(1,K))를 포함할 수 있다(K는 양의 정수). The one or more memory/processing units may include J memory resources 10912(1,1) - 10912(1,J), where J is a positive integer, and K processing resources 10911(1,1) - 10911(1,K)) (where K is a positive integer).

J는 K와 동일하거나 상이할 수 있다. J may be the same as or different from K.

프로세싱 리소스(10911(1,1) - 10911(1,K))는 PCT 특허 출원 공개공보 WO2019025892에 도시된 바와 같은 프로세싱 그룹 또는 프로세서 서브그룹 등일 수 있다. The processing resources 10911(1,1) - 10911(1,K) may be a processing group or processor subgroup, etc. as shown in PCT Patent Application Publication No. WO2019025892.

메모리 리소스(10912(1,1) - 10912(1,J))는 PCT 특허 출원 공개공보 WO2019025892에 도시된 바와 같은 메모리 인스턴스, 메모리 매트, 메모리 뱅크 등일 수 있다.Memory resources 10912(1,1) - 10912(1,J) may be memory instances, memory mats, memory banks, etc. as shown in PCT Patent Application Publication No. WO2019025892.

하나 이상의 메모리/프로세싱 유닛의 임의의 모든 리소스(메모리 리소스 또는 프로세싱 리소스) 사이에는 임의의 모든 연결성 및/또는 기능성 관계가 있을 수 있다. There may be any and all connectivity and/or functional relationships between any and all resources (memory resources or processing resources) of one or more memory/processing units.

도 90b는 메모리/프로세싱 유닛(10910(1))의 일례이다. 90B is an example of the memory/processing unit 10910(1).

In FIG. 90B, K processing resources 10911(1,1) - 10911(1,K) (K being a positive integer) are connected to each other in series (see link 10915), forming a loop. Each processing resource is also coupled to a corresponding pair of dedicated memory resources. For example, processing resource 10911(1) is coupled to memory resources 10912(1) and 10912(2), and processing resource 10911(K) is coupled to memory resources 10912(J-1) and 10912(J). The processing resources may be connected to each other in any other manner. The number of memory resources allocated per processing resource may differ from two. Examples of connectivity between different resources are illustrated in PCT Patent Application Publication No. WO2019025892.
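The loop arrangement of FIG. 90B can be sketched as a small topology table. The sketch assumes one-based indices, a fixed number of dedicated memory resources per processing resource (two by default, so J = 2K), and a single forward link per processing resource. The function name and dictionary keys are hypothetical.

```python
def build_ring(k, memories_per_resource=2):
    """Return a ring of k processing resources, each with dedicated memories.

    Keys are one-based processing resource indices. Each entry names the next
    processing resource in the loop and the indices of its memory resources.
    """
    ring = {}
    for p in range(1, k + 1):
        first_memory = (p - 1) * memories_per_resource + 1
        ring[p] = {
            "next": p % k + 1,  # the last resource links back to the first
            "memories": list(range(first_memory, first_memory + memories_per_resource)),
        }
    return ring
```

For k = 4, the first processing resource is paired with memory resources 1 and 2, and the fourth with memory resources 7 and 8 (that is, J-1 and J), and its link closes the loop back to the first.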

FIG. 90C is an example of a system 10901 that includes N memory/processing units 10910(1) - 10910(N) (N being a positive integer) and a processor 10920. As described above, processor 10920 may send requests or instructions (via links 10931(1) - 10931(N)) to memory/processing units 10910(1) - 10910(N), which may in turn fulfill the requests and/or instructions and send results (via links 10932(1) - 10932(N)) to processor 10920. Processor 10920 may further process the results to provide one or more outputs (via link 10933).

FIG. 90D is an example of a system 10902 that includes N memory/processing units 10910(1) to 10910(N) (N being a positive integer) and a processor 10920. FIG. 90D shows a preprocessor 10909 that precedes memory/processing units 10910(1) to 10910(N). The preprocessor may perform various preprocessing operations such as frame extraction, header detection, and the like.

FIG. 90E is an example of a system 10903 that includes one or more memory/processing units 10910 and a processor 10920. FIG. 90E shows a preprocessor 10909 and a DMA controller 10908 that precede the one or more memory/processing units 10910.

FIG. 90F illustrates a method 10800 for distributed processing of at least one information stream.

Method 10800 may begin with step 10810, in which one or more memory processing integrated circuits receive at least one information stream over a first communication channel, wherein each memory processing integrated circuit includes a controller, multiple processor subunits, and multiple memory units.

Step 10810 may be followed by steps 10820 and 10830.

Step 10820 may include buffering the information stream by the one or more memory processing integrated circuits.

Step 10830 may include performing, by the one or more memory processing integrated circuits, a first processing operation on the at least one information stream to provide first processing results.

Step 10830 may include compression or decompression.

Accordingly, the total size of the information stream may exceed the total size of the first processing results. The total size of the information stream may reflect the amount of information received during a period of a given duration. The total size of the first processing results may reflect the amount of first processing results output during any period of the same given duration.

Alternatively, the total size of the information stream (or of any other information entity mentioned in this specification) may be smaller than the total size of the first processing results. In this case, compression is obtained.

Step 10830 may be followed by step 10840 of sending the first processing results to one or more processing integrated circuits.

The one or more memory processing integrated circuits may be manufactured by a memory-flavored manufacturing process.

The one or more processing integrated circuits may be manufactured by a logic-flavored manufacturing process.

In a memory processing integrated circuit, each of the memory units may be coupled to a processor subunit.

Step 10840 may be followed by step 10850, in which the one or more processing integrated circuits perform a second processing operation on the first processing results to provide second processing results.

Step 10820 and/or step 10830 may be instructed by the one or more processing integrated circuits, requested by the one or more processing integrated circuits, executed following a configuration by the one or more processing integrated circuits, or executed independently without intervention of the one or more processing integrated circuits.

The first processing operation may be less computationally intensive than the second processing operation.

Step 10830 and/or step 10850 may be at least one of (a) a cellular network processing operation, (b) another network-related processing operation (processing of a network other than a cellular network), (c) a database processing operation, (d) a database analytics processing operation, (e) an artificial intelligence processing operation, and any other processing operation.
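The flow of method 10800 can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the disclosed hardware: the function names are hypothetical, compression with Python's standard `zlib` module stands in for the first processing operation, and counting newline-terminated records stands in for the second processing operation.

```python
import zlib
from collections import deque


def memory_processing_stage(stream_chunks):
    """Steps 10810 to 10840: receive, buffer, run a first operation, forward."""
    buffer = deque()
    for chunk in stream_chunks:        # step 10810: receive the information stream
        buffer.append(chunk)           # step 10820: buffer it
    first_results = []
    while buffer:
        chunk = buffer.popleft()
        # step 10830: first processing operation (assumed here: compression)
        first_results.append(zlib.compress(chunk))
    return first_results               # step 10840: send on to the processing ICs


def processing_stage(first_results):
    """Step 10850: second operation (assumed here: count newline-terminated records)."""
    return sum(zlib.decompress(r).count(b"\n") for r in first_results)
```

For a highly repetitive stream, the first processing results are smaller than the stream itself, which corresponds to the case in which the total size of the information stream exceeds the total size of the first processing results.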

Disaggregated system, memory/processing unit, and distributed processing method

There may be provided a disaggregated system, a distributed processing method, a processing/memory unit, a method for operating a disaggregated system, a method for operating a processing/memory unit, and a non-transitory computer-readable medium that stores instructions for executing any of these methods. A disaggregated system assigns different subsystems to perform different functions. For example, storage may be implemented mainly in one or more storage subsystems, while computation may be implemented mainly in one or more computing subsystems.

A disaggregated system may be a disaggregated server, one or more disaggregated servers, or something other than one or more servers.

A disaggregated system may include one or more switching subsystems, one or more computing subsystems, one or more storage subsystems, and one or more processing/memory subsystems.

The one or more processing/memory subsystems, the one or more computing subsystems, and the one or more storage subsystems may be coupled to one another via the one or more switching subsystems.

The one or more processing/memory subsystems may be included in one or more subsystems of the disaggregated system.

FIG. 87A illustrates various examples of disaggregated systems.

There may be any number of subsystems of any type. A disaggregated system may include one or more additional subsystems of types not shown in FIG. 87A, may include fewer types of subsystems, and the like.

Disaggregated system 7101 includes two storage subsystems 7130, a computing subsystem 7120, a switching subsystem 7140, and a processing/memory subsystem 7110.

Disaggregated system 7102 includes two storage subsystems 7130, a computing subsystem 7120, a switching subsystem 7140, a processing/memory subsystem 7110, and an accelerator subsystem 7150.

Disaggregated system 7103 includes two storage subsystems 7130, a computing subsystem 7120, and a switching subsystem 7140 that includes a processing/memory subsystem 7110.

Disaggregated system 7104 includes two storage subsystems 7130, a computing subsystem 7120, a switching subsystem 7140 that includes a processing/memory subsystem 7110, and an accelerator subsystem 7150.

Including processing/memory subsystem 7110 within switching subsystem 7140 may reduce traffic within disaggregated systems 7101 and 7102, may reduce switching latency, and the like.

The different subsystems of a disaggregated system may communicate with one another using various communication protocols. It has been found that using Ethernet, or even RDMA over Ethernet, may increase throughput and may even reduce the complexity of various control and/or storage operations related to the exchange of information units between the elements of the disaggregated system.

A disaggregated system may perform distributed processing by allowing the processing/memory subsystems to participate in the computation, in particular by executing memory-intensive computations.

For example, assume that N computing units must share information units among themselves (all-to-all sharing). Then (a) the N information units may be sent to one or more processing/memory units of one or more processing/memory subsystems, (b) the one or more processing/memory units may perform the computation that required the all-to-all sharing, and (c) N updated information units may be sent to the N computing units. This requires on the order of N transfer operations.
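The transfer count in the all-to-all example can be checked with a short sketch. The function names are hypothetical, the combining step is a placeholder, and the pairwise figure N(N-1) is a point of comparison added here rather than a number stated in the disclosure.

```python
def hub_all_to_all(units):
    """Route an all-to-all exchange through one processing/memory unit.

    `units` holds one information unit per computing unit. Returns the
    updated information units and the number of transfer operations used.
    """
    transfers = 0
    gathered = []
    for unit in units:             # (a) each computing unit sends its unit in
        gathered.append(unit)
        transfers += 1
    combined = tuple(gathered)     # (b) placeholder for the shared computation
    updated = []
    for _ in units:                # (c) one updated unit goes back to each sender
        updated.append(combined)
        transfers += 1
    return updated, transfers


def pairwise_transfer_count(n):
    """Transfers if every computing unit sent its unit directly to every other."""
    return n * (n - 1)
```

For N = 4 the hub arrangement uses 2N = 8 transfers, which grows linearly with N, whereas direct pairwise exchange would use 12 and grows quadratically.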

For example, FIG. 87B illustrates distributed processing that updates a model of a neural network (the model including the weights assigned to the nodes of the neural network).

Each of the N computing units PU(1) to PU(N) (7120(1) to 7120(N)) may belong to the computing subsystem 7120 of any one of disaggregated systems 7101, 7102, 7103, and 7104.

The N computing units compute N partial model updates 7121(1) to 7121(N) (N different updated parts) and send them (via switching subsystem 7140) to processing/memory subsystem 7110.

Processing/memory subsystem 7110 computes the updated model 7122 and sends it (via switching subsystem 7140) to the N computing units PU(1) to PU(N) (7120(1) to 7120(N)).
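The model update flow of FIG. 87B can be sketched as follows. The sketch assumes, for illustration only, that the model is a flat list of weights, that each computing unit owns the weights whose index matches its unit number modulo N, and that an update subtracts a fixed step; none of these details come from the disclosure, and all function names are hypothetical.

```python
def partial_update(unit, model, n_units, step=1):
    """Computing unit PU(unit): update only the weights this unit owns."""
    return {i: w - step for i, w in enumerate(model) if i % n_units == unit}


def merge_updates(model, partial_updates):
    """Processing/memory subsystem: assemble the updated model from the parts."""
    updated = list(model)
    for part in partial_updates:
        for i, w in part.items():
            updated[i] = w
    return updated


def training_round(model, n_units):
    # N partial model updates (7121(1) to 7121(N)) are sent to the subsystem
    parts = [partial_update(u, model, n_units) for u in range(n_units)]
    # the subsystem computes the updated model (7122)
    updated = merge_updates(model, parts)
    # the updated model is sent back to each of the N computing units
    return [list(updated) for _ in range(n_units)]
```

Each computing unit sends one partial update and receives one copy of the updated model, so the round uses 2N transfers regardless of how the parts are distributed.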

FIGS. 87C, 87D, and 87E illustrate examples of memory/processing units 7011, 7012, and 7013, respectively, and FIGS. 87F and 87G illustrate integrated circuits 7014 and 7015 that include a memory/processing unit 9010 and one or more communication modules, such as an Ethernet module and an RDMA over Ethernet module 9022.

The memory/processing unit includes a controller 9020, an internal bus 9021, and multiple pairs of logic 9030 and memory banks 9040. The controller may be configured to operate as a communication module or may be coupled to a communication module.

The connectivity between controller 9020 and the multiple pairs of logic 9030 and memory banks 9040 may be implemented in other ways. The memory banks and the logic may be arranged in other ways (not in pairs).

One or more memory/processing units 9010 of processing/memory subsystem 7110 may process the model update in parallel (using different logic and retrieving different parts of the model from different memory banks in parallel), and may perform this computation very efficiently owing to the large amount of memory resources and the very high bandwidth of the connections between the memory banks and the logic.

Memory/processing units 7011, 7012, and 7013 of FIGS. 87C to 87E and integrated circuits 7014 and 7015 of FIGS. 87F and 87G include one or more communication modules, such as an Ethernet module (FIGS. 87C to 87G) and an RDMA over Ethernet module 7022 (FIGS. 87E and 87G).

Providing such RDMA and/or Ethernet modules (within the memory/processing unit or within the same integrated circuit as the memory/processing unit) greatly speeds up communication between the different elements of the disaggregated system and, in the case of RDMA, considerably simplifies that communication.

A memory/processing unit that includes RDMA and/or Ethernet modules may also be advantageous in other environments, for example even when the memory/processing unit is not included in a disaggregated system.

In addition, the RDMA and/or Ethernet modules may be allocated per group of memory/processing units, for example for cost-saving reasons.

A memory/processing unit, a group of memory/processing units, and a processing/memory subsystem may include other communication ports, for example a PCIe communication port.

Using RDMA and/or Ethernet modules may be cost-effective because there is no need to connect the memory/processing unit to a bridge that is connected to a network interface card (NIC) that may have an Ethernet port.

Using RDMA and/or Ethernet modules may make Ethernet (or RDMA over Ethernet) native within the memory/processing unit.

Ethernet is merely an example of a local area network (LAN) protocol, and PCIe is merely an example of another communication protocol that may be used over greater distances than Ethernet.

FIG. 87H illustrates a method 7000 for distributed processing.

Method 7000 may include one or more processing iterations.

A processing iteration may be executed by one or more memory processing integrated circuits of a disaggregated system.

A processing iteration may be executed by one or more processing integrated circuits of the disaggregated system.

A processing iteration executed by the one or more processing integrated circuits may be followed by a processing iteration executed by the one or more processing integrated circuits.

A processing iteration executed by the one or more processing integrated circuits may be preceded by a processing iteration executed by the one or more processing integrated circuits.

Yet another processing iteration may be executed by other circuits of the disaggregated system. For example, one or more preprocessing circuits may perform any type of preprocessing, including preparing information units for a processing iteration by the one or more memory processing integrated circuits.

Method 7000 may include step 7020, in which one or more memory processing integrated circuits of the disaggregated system receive information units.

Each memory processing integrated circuit may include a controller, multiple processor subunits, and multiple memory units.

The information units may convey parts of a neural network model.

The information units may convey partial results of at least one database query.

The information units may convey partial results of at least one aggregate database query.

Step 7020 may include receiving the information units from one or more storage subsystems of the disaggregated system.

Step 7020 may include receiving the information units from one or more computing subsystems of the disaggregated system, wherein the one or more computing subsystems may include multiple processing integrated circuits manufactured by a logic-flavored manufacturing process.

Step 7020 may be followed by step 7030, in which the one or more memory processing integrated circuits perform processing operations on the information units to provide processing results.

The total size of the information units may be larger than, equal to, or smaller than the total size of the processing results.

Step 7030 may be followed by step 7040, in which the one or more memory processing integrated circuits output the processing results.

Step 7040 may include outputting the processing results to one or more computing subsystems of the disaggregated system, wherein the one or more computing subsystems may include multiple processing integrated circuits manufactured by a logic-flavored manufacturing process.

Step 7040 may include outputting the processing results to one or more storage subsystems of the disaggregated system.

The information units may be sent from different groups of processing units of the multiple processing integrated circuits, and may be different parts of an intermediate result of a process executed in a distributed manner by the multiple processing integrated circuits. A group of processing units may include at least one processing integrated circuit.

Step 7030 may include processing the information units to provide a result of the entire process.

Step 7040 may include sending the result of the entire process to each of the multiple processing integrated circuits.

The different parts of the intermediate result may be different parts of an updated neural network model, in which case the result of the entire process is the updated neural network model.

Step 7040 may include sending the updated neural network model to each of the multiple processing integrated circuits.

Step 7040 may be followed by step 7050, in which the multiple processing integrated circuits perform further processing based at least in part on the processing results sent to the multiple processing integrated circuits.

Step 7040 may include outputting the processing results using a switching subunit of the disaggregated system.

Step 7020 may include receiving information units that are preprocessed information units.

FIG. 87I illustrates a method 7001 for distributed processing.

Method 7001 differs from method 7000 in that it includes step 7010, in which the multiple processing integrated circuits preprocess information to provide preprocessed information units.

Step 7010 may be followed by steps 7020, 7030, and 7040.
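Steps 7010 to 7040 of method 7001 can be sketched for the case in which the information units convey partial results of an aggregate database query. The query (a count of rows per country), the row layout, and all function names are hypothetical and serve only to make the sequence of steps concrete.

```python
from collections import Counter


def step_7010_preprocess(raw_rows_per_group):
    """Each group of processing ICs reduces its raw rows to one information unit."""
    return [Counter(row["country"] for row in rows) for rows in raw_rows_per_group]


def step_7030_process(information_units):
    """The memory processing ICs merge the partial counts into one result."""
    total = Counter()
    for unit in information_units:
        total.update(unit)   # Counter.update adds counts rather than replacing them
    return total


def method_7001(raw_rows_per_group):
    units = step_7010_preprocess(raw_rows_per_group)   # step 7010
    received = list(units)                             # step 7020: receive the units
    result = step_7030_process(received)               # step 7030
    return [result for _ in raw_rows_per_group]        # step 7040: one copy per group
```

In this sketch the preprocessed information units are much smaller than the raw rows, so most of the data reduction happens before the units reach the memory processing integrated circuits.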

Database analytics acceleration

There are provided an apparatus, a method, and a computer-readable medium that stores instructions for performing at least filtering by a filtering unit that belongs to the same integrated circuit as a memory unit. The filter indicates which entries are relevant to a given database query. An arbiter (arbitrator) or any other flow control manager may send the relevant entries to the processor and refrain from sending the irrelevant entries, thereby saving practically most of the traffic to and from the processor.

For example, FIG. 91A shows an integrated circuit that includes a processor (CPU 9240) and a memory and filtering system 9220. Memory and filtering system 9220 may include filtering units 9224 that are coupled to memory unit entries 9222 and to one or more arbiters, such as arbiter 9229, for sending the relevant entries to the processor. Any arbitration process may be applied. There may be any relationship between the number of entries, the number of filtering units, and the number of arbiters.

The arbiter may be replaced by any unit capable of controlling the flow of information, such as a communication interface, a flow controller, and the like.

The filtering is based on one or more relevancy/filtering criteria.

Relevancy may be set per database query and may be indicated in any manner. For example, the memory unit may store relevancy flags 9224' that indicate which entries are relevant. There is also a storage device 9210 that stores K database segments 9220(k), where k ranges from 1 to K. The entire database may be stored in the memory unit and not in the storage device (such a solution is also referred to as a volatile-memory-stored database).

The memory unit entries may be too small to store the entire database and may therefore receive one segment at a time.

The filtering unit may perform filtering operations such as comparing the value of a field to a threshold, comparing the value of a field to a predefined value, determining whether the value of a field is within a predefined range, and the like.
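The three filtering operations listed above can be sketched as simple predicates. The dictionary-based entry layout and the function names are assumptions made for illustration; an actual filtering unit would be a small hardware comparator rather than software.

```python
import operator


def threshold_filter(field, threshold, compare=operator.gt):
    """Compare a field value to a threshold (greater-than by default)."""
    return lambda entry: compare(entry[field], threshold)


def equals_filter(field, value):
    """Compare a field value to a predefined value."""
    return lambda entry: entry[field] == value


def range_filter(field, low, high):
    """Check whether a field value lies within a predefined inclusive range."""
    return lambda entry: low <= entry[field] <= high


def relevance_flags(entries, predicate):
    """One flag per memory unit entry, marking the entries relevant to the query."""
    return [predicate(entry) for entry in entries]
```

For example, `relevance_flags(entries, range_filter("age", 20, 50))` yields one Boolean per entry, which plays the role of the relevancy flags stored alongside the memory unit entries.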

Accordingly, the filtering unit may perform well-known database filtering operations and may be a compact and inexpensive circuit.

The output 9101 of the filtering operation (for example, the content of the relevant database entries) is sent to CPU 9240 for processing.
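The segment-by-segment flow, in which only relevant entries reach the processor, can be sketched as follows. The function names are hypothetical, the database is modeled as a plain Python list, and the simple in-order forwarding loop stands in for whatever arbitration process is actually used.

```python
def load_segments(database, segment_size):
    """Yield the database one segment at a time, as read from the storage device."""
    for start in range(0, len(database), segment_size):
        yield database[start:start + segment_size]


def filter_and_forward(database, segment_size, is_relevant):
    """Return the entries that the arbiter would forward to the processor."""
    forwarded = []
    for segment in load_segments(database, segment_size):
        memory_entries = list(segment)                       # segment held in memory unit entries
        flags = [is_relevant(e) for e in memory_entries]     # relevancy flags set by filtering units
        for entry, flag in zip(memory_entries, flags):
            if flag:                                         # arbiter forwards relevant entries only
                forwarded.append(entry)
    return forwarded
```

With ten entries, a segment size of four, and a predicate that keeps multiples of three, only four of the ten entries are forwarded, so the processor sees less than half of the original traffic.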

Memory and filtering system 9220 may be replaced by a memory and processing system, as illustrated in FIG. 91B.

Memory and processing system 9229 includes processing units 9225 that are coupled to memory unit entries 9222. Processing units 9225 may perform the filtering operations and may participate, at least in part, in executing one or more additional operations on the relevant records.

A processing unit may be tailored to perform specific operations and/or may be a programmable unit configured to perform multiple operations. For example, a processing unit may be a pipelined processing unit, may include an ALU, may include multiple ALUs, and the like.

Processing units 9225 may perform the one or more additional operations in their entirety.

Alternatively, a part of the one or more additional operations may be executed by the processing units, and the processor (CPU 9240) may execute another part of the one or more additional operations.

The output of the processing operations (for example, a partial response 9102 to the database query or a full response 9103) is sent to CPU 9240.

A partial response requires further processing.
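The split between a partial response and its further processing can be sketched for an averaging query. The choice of query, the (sum, count) form of the partial response, and the function names are assumptions made for illustration only.

```python
def processing_unit_partial(entries, is_relevant, field):
    """A processing unit filters its entries and returns a partial response."""
    values = [e[field] for e in entries if is_relevant(e)]
    return (sum(values), len(values))    # partial response: (sum, count)


def cpu_finalize(partials):
    """The processor completes the query by combining the partial responses."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None   # None when no entry was relevant
```

Each processing unit returns two numbers regardless of how many entries it holds, so the processor receives a fixed amount of data per unit instead of every relevant record.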

FIG. 92A illustrates a memory/processing system 9228 that includes memory/processing units 9227 configured to perform the filtering and the additional processing.

Memory/processing system 9228 implements the processing units and the memory units of FIG. 91 by means of memory/processing units 9227.

The role of the processor may include controlling the processing units, executing at least a part of the one or more additional operations, and the like.

The combination of memory entries and processing units may be implemented, at least in part, by one or more memory/processing units.

FIG. 92B illustrates an example of a memory/processing unit 9010.

Memory/processing unit 9010 includes a controller 9020, an internal bus 9021, and multiple pairs of logic 9030 and memory banks 9040. The controller may be configured to operate as a communication module or may be coupled to a communication module.

The connectivity between controller 9020 and the multiple pairs of logic 9030 and memory banks 9040 may be implemented in other ways. The memory banks and the logic may be arranged in other ways (not in pairs). Multiple memory banks may be coupled to and/or managed by a single logic.

A database query 9100 is received by the memory/processing system via interface 9211. Interface 9211 may be a bus, a port, an input/output interface, and the like.

The response to the database query may be generated by at least one (or a combination) of one or more memory/processing systems, one or more memory and processing systems, one or more memory and filtering systems, one or more processors located outside these systems, and the like.

The response to the database query may be generated by at least one (or a combination) of one or more filtering units, one or more memory/processing units, one or more processing units, one or more other processors (for example, one or more other CPUs), and the like.

Any process may include finding the relevant database entries and then processing the relevant database entries. The processing may be executed by one or more processing entities.

A processing entity may be at least one of a processing unit of a memory and processing system (for example, processing unit 9225 of memory and processing system 9229), a processor subunit (or logic) of a memory/processing unit (for example, CPU 9240 of FIGS. 91A, 91B, and 74), and the like.

The processing involved in generating a response to a database query may be performed by any one or a combination of the following.

a. A processing unit 9225 of memory and processing system 9229.

b. Processing units 9225 of different memory and processing systems 9229.

c. Processor subunits (or logic 9030) of one or more memory/processing units 9227 of memory/processing system 9228.

d. Processor subunits (or logic 9030) of memory/processing units 9227 of different memory/processing systems 9228.

e. Controllers of one or more memory/processing units 9227 of memory/processing system 9228.

f. Controllers of one or more memory/processing units 9227 of different memory/processing systems 9228.

Accordingly, the processing involved in responding to a database query may be executed by any combination or sub-combination of (a) one or more controllers of one or more memory/processing units, (b) one or more processing units of memory/processing systems, (c) one or more processor subunits of one or more memory/processing units, (d) one or more other processors, and the like.

Processing executed by more than one processing entity may be referred to as distributed processing.

Here, the filtering may be executed by filtering entities among one or more filtering units and/or one or more processing units and/or one or more processing subunits. In this sense, a processing unit and/or processing subunit that performs a filtering operation may be referred to as a filtering unit.

A processing entity may be a filtering entity or may differ from a filtering unit.

A processing entity may perform processing operations on database entries that another filtering entity has deemed relevant.

A processing entity may also perform filtering operations.

A response to a database query may utilize one or more filtering entities and one or more processing entities.

The one or more filtering entities and the one or more processing entities may belong to the same system (e.g., memory/processing system 9228, memory and processing system 9229, or memory and filtering system 9220) or to different systems.

A memory/processing unit may include multiple processor subunits. The processor subunits may operate independently of one another, may partially cooperate with one another, may participate in distributed processing, and the like.

FIG. 92C illustrates multiple memory and filtering systems 9220, multiple other processors (e.g., CPU 9240), and a storage device 9210.

The multiple memory and filtering systems 9220 may participate (concurrently or not) in filtering one or more database entries based on one or more filtering criteria in one or more database queries.

FIG. 92D illustrates multiple memory and processing systems 9229, multiple other processors (e.g., CPU 9240), and a storage device 9210.

The multiple memory and processing systems 9229 may participate (concurrently or not) in the filtering and, at least partially, in the processing involved in responding to one or more database queries.

FIG. 92E illustrates multiple memory/processing systems 9228, multiple other processors (e.g., CPU 9240), and a storage device 9210.

The multiple memory/processing systems 9228 may participate (concurrently or not) in the filtering and, at least partially, in the processing involved in responding to one or more database queries.

FIG. 92F illustrates a method 9300 for accelerating database analysis.

Method 9300 may begin with step 9310, in which a memory processing integrated circuit receives a database query that includes at least one relevance criterion indicating which database entries of a database are relevant to the database query.

The database entries that are relevant to the database query may be all, some, one, or none of the database entries of the database.

The memory processing integrated circuit may include a controller, multiple processor subunits, and multiple memory units.

Step 9310 may be followed by step 9320, in which the memory processing integrated circuit determines, based on the at least one relevance criterion, a group of relevant database entries stored in the memory processing integrated circuit.

Step 9320 may be followed by step 9330 of sending the group of relevant database entries to one or more processing entities for further processing, substantially without sending irrelevant data entries stored in the memory processing integrated circuit to the one or more processing entities.

The phrase "substantially without sending" means either sending no irrelevant entries at all (while responding to the database query) or sending only an insignificant number of irrelevant entries. Insignificant may mean up to 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 percent, or an amount whose transfer has no significant effect on bandwidth.

Step 9330 may be followed by step 9340 of processing the group of relevant database entries to provide a response to the database query.
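The flow of steps 9310 to 9340 can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the `Entry` record, the function names, and the sum used as the "processing" are hypothetical and do not come from the disclosure; the relevance criterion is modeled as a plain predicate.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Entry:
    """Hypothetical database entry stored in the memory processing IC."""
    key: int
    value: float


def determine_relevant(stored: List[Entry],
                       criterion: Callable[[Entry], bool]) -> List[Entry]:
    """Step 9320 (sketch): select the group of relevant entries in place."""
    return [e for e in stored if criterion(e)]


def process(group: List[Entry]) -> float:
    """Step 9340 (sketch): the processing here is an assumed sum of values."""
    return sum(e.value for e in group)


# Step 9310 (sketch): the query carries the relevance criterion.
stored = [Entry(k, float(k)) for k in range(10)]
criterion = lambda e: e.key % 2 == 0

relevant = determine_relevant(stored, criterion)
# Step 9330 (sketch): only the relevant group crosses to the processing entity.
transferred = relevant
response = process(transferred)
```

In this sketch half of the stored entries are never transferred, which is the point of filtering before the transfer of step 9330.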

FIG. 92G illustrates a method 9301 for accelerating database analysis.

It is assumed that the filtering and all of the processing required to respond to the database query are executed by the memory processing integrated circuit.

Method 9301 may begin with step 9310, in which the memory processing integrated circuit receives a database query that includes at least one relevance criterion indicating which database entries of a database are relevant to the database query.

Step 9310 may be followed by step 9320, in which the memory processing integrated circuit determines, based on the at least one relevance criterion, a group of relevant database entries stored in the memory processing integrated circuit.

Step 9320 may be followed by step 9331 of sending the group of relevant database entries to one or more processing entities for full processing, substantially without sending irrelevant data entries stored in the memory processing integrated circuit to the one or more processing entities.

Step 9331 may be followed by step 9341 of fully processing the group of relevant database entries to provide a response to the database query.

Step 9341 may be followed by step 9351 of outputting the response to the database query from the memory processing integrated circuit.

FIG. 92H illustrates a method 9302 for accelerating database analysis.

It is assumed that the filtering and only part of the processing required to respond to the database query are executed by the memory processing integrated circuit. The memory processing integrated circuit outputs partial results to be processed by one or more other processing entities located outside the memory processing integrated circuit.

Step 9320 may be followed by step 9332 of sending the group of relevant database entries to one or more processing entities for partial processing, substantially without sending irrelevant data entries stored in the memory processing integrated circuit to the one or more processing entities.

Step 9332 may be followed by step 9342 of partially processing the group of relevant database entries to provide an intermediate response to the database query.

Step 9342 may be followed by step 9352 of outputting the intermediate response to the database query from the memory processing integrated circuit.

Step 9352 may be followed by step 9390 of further processing the intermediate response to provide the response to the database query.
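The split between partial processing inside the integrated circuit (step 9342) and completion outside it (step 9390) can be sketched as follows. This is a hedged illustration only: the choice of an average as the query, the `(sum, count)` form of the intermediate response, and all function names are assumptions made for the example, not details from the disclosure.

```python
from typing import Callable, List, Optional, Tuple

Partial = Tuple[float, int]  # assumed intermediate response: (sum, count)


def partial_aggregate(values: List[float],
                      criterion: Callable[[float], bool]) -> Partial:
    """Steps 9320-9342 (sketch): filter locally, then reduce to (sum, count)."""
    selected = [v for v in values if criterion(v)]
    return (sum(selected), len(selected))


def finalize(partials: List[Partial]) -> Optional[float]:
    """Step 9390 (sketch): an external entity merges the intermediate responses."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


# Three hypothetical integrated circuits, each holding part of the database.
chips = [[1.0, 5.0, 9.0], [2.0, 8.0], [10.0, 3.0]]
partials = [partial_aggregate(chip, lambda v: v >= 5.0) for chip in chips]
mean_of_relevant = finalize(partials)
```

Each circuit outputs two numbers regardless of how many entries it holds, so the external entity receives a small intermediate response rather than the entries themselves.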

FIG. 92I illustrates a method 9303 for accelerating database analysis.

It is assumed that the filtering is executed by the memory processing integrated circuit but that the processing of the relevant database entries is not. The memory processing integrated circuit outputs the group of relevant database entries to be fully processed by one or more other processing entities located outside the memory processing integrated circuit.

Step 9320 may be followed by step 9333 of sending the group of relevant database entries to one or more processing entities located outside the memory processing integrated circuit, substantially without sending irrelevant data entries stored in the memory processing integrated circuit to the one or more processing entities.

Step 9333 may be followed by step 9391 of fully processing the intermediate response to provide the response to the database query.

FIG. 92J illustrates a method 9304 for accelerating database analysis.

Method 9304 may begin with step 9315, in which an integrated circuit receives a database query that includes at least one relevance criterion indicating which database entries of a database are relevant to the database query, the integrated circuit including a controller, a filtering unit, and multiple memory units.

Step 9315 may be followed by step 9325, in which the filtering unit determines, based on the at least one relevance criterion, a group of relevant database entries stored in the integrated circuit.

Step 9325 may be followed by step 9335 of sending the group of relevant database entries to one or more processing entities located outside the integrated circuit for further processing, substantially without sending irrelevant data entries stored in the integrated circuit to the one or more processing entities.

Step 9335 may be followed by step 9391.

FIG. 92K illustrates a method 9305 for accelerating database analysis.

Method 9305 may begin with step 9314, in which an integrated circuit receives a database query that includes at least one relevance criterion indicating which database entries of a database are relevant to the database query, the integrated circuit including a controller, a processing unit, and multiple memory units.

Step 9314 may be followed by step 9324, in which the processing unit determines, based on the at least one relevance criterion, a group of relevant database entries stored in the integrated circuit.

Step 9324 may be followed by step 9334, in which the processing unit processes the group of relevant database entries, without processing irrelevant data entries stored in the integrated circuit, to provide processing results.

Step 9334 may be followed by step 9344 of outputting the processing results from the integrated circuit.

In any of methods 9300, 9301, 9302, 9304, and 9305, the memory processing integrated circuit generates an output. The output may be the group of relevant database entries, one or more intermediate results, or one or more (full) results.

The output may be preceded by retrieving one or more relevant database entries and/or one or more results (full or intermediate) from the filtering entities and/or processing entities of the memory processing integrated circuit.

The retrieval may be controlled in one or more ways and may be controlled by one or more controllers and/or arbiters of the memory processing integrated circuit.

The output and/or retrieval may include controlling one or more parameters of the retrieval and/or output. The parameters may include retrieval timing, retrieval rate, retrieval source, retrieval bandwidth, retrieval order, output timing, output rate, output source, output bandwidth, output order, type of retrieval method, type of arbitration method, and the like.

The output and/or retrieval may apply a flow control process.

The output and/or retrieval (e.g., the application of the flow control process) may be responsive to indicators, output from the one or more processing entities, regarding completion of the processing of the database entries of the group. An indicator may indicate whether an intermediate result is ready to be retrieved from a processing entity.

The output may include attempting to match the bandwidth used during the output to the maximum allowed bandwidth of a link that couples the memory processing integrated circuit to a requester unit. The link may be a link to a recipient of the output of the memory processing integrated circuit. The maximum allowed bandwidth may depend on the capacity and/or availability of the link and on the capacity and/or availability of the recipient of the output content.

The output may include attempting to output the content in an optimal or suboptimal manner.

The output of the content may include attempting to keep fluctuations in the output traffic rate below a threshold.

Any of methods 9300, 9301, 9302, and 9305 may include generating, by the one or more processing entities, processing status indicators that may indicate the progress of the further processing of the group of relevant database entries.

The processing included in any of the methods described above may be executed by more than a single processing entity. In that case the processing is executed in a distributed manner and may be regarded as distributed processing.

As indicated above, the processing may be executed in a hierarchical manner or in a flat manner.

Any of methods 9300 to 9305 may be executed by multiple systems that can respond to multiple database queries concurrently or sequentially.

Word embedding

As described above, word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) in which words or phrases from a vocabulary are mapped to vectors of elements. Conceptually, it involves a mathematical mapping from a space with many dimensions per word to a continuous vector space with far fewer dimensions.

The vectors may be processed mathematically. For example, the vectors belonging to a matrix may be summed to provide a summed vector.

As another example, the covariance of the matrix (of a sentence) may be calculated. This may include multiplying the matrix by its transposed matrix.
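The two operations mentioned above, summing the vectors of a sentence matrix and multiplying the matrix by its transpose, can be sketched in plain Python. The sketch assumes a sentence matrix with one row per word and made-up values; the helper names are hypothetical. The product shown is only the matrix-by-transpose step that a covariance calculation may include, not a complete covariance (no mean-centering or normalization is applied).

```python
from typing import List

Matrix = List[List[float]]


def sum_vectors(matrix: Matrix) -> List[float]:
    """Element-wise sum of all row vectors of the sentence matrix."""
    return [sum(column) for column in zip(*matrix)]


def transpose(matrix: Matrix) -> Matrix:
    """Swap rows and columns."""
    return [list(column) for column in zip(*matrix)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product; the inner dimensions are assumed to agree."""
    b_columns = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_columns]
            for row in a]


# Hypothetical sentence of three words, each embedded in two dimensions.
sentence = [[1.0, 2.0],
            [3.0, 4.0],
            [5.0, 6.0]]

summed = sum_vectors(sentence)
product = matmul(transpose(sentence), sentence)  # (2x3) times (3x2) gives 2x2
```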

The memory/processing unit may store the vocabulary. In particular, parts of the vocabulary may be stored in multiple memory banks of the memory/processing unit.

Thus, the memory/processing unit may be accessed with access information (e.g., search keys) representing the words or phrases of a sentence, so that the vectors representing those words or phrases are retrieved from at least some of the memory banks of the memory/processing unit.

Different memory banks of the memory/processing unit may store different parts of the vocabulary and may be accessed in parallel (depending on the distribution of the indexes of the sentence). Even when more than a single line of a memory bank must be accessed sequentially, prediction may reduce the penalty.

The allocation of the words of the vocabulary among the different memory banks of the memory/processing unit may be optimal or highly beneficial in the sense that it increases the chance of accessing different memory banks in parallel for each sentence. The allocation may be learned per user, for the general population, or per group of people.
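One way to reason about an allocation is to count how many sequential access rounds a sentence needs. The following sketch assumes a simplified model in which each bank serves one lookup per round, so the busiest bank determines the number of rounds; the model, the `access_rounds` helper, and the example allocations are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List


def access_rounds(sentence_keys: List[int], bank_of: Dict[int, int]) -> int:
    """Sequential rounds needed if each bank serves one lookup per round."""
    lookups_per_bank = Counter(bank_of[key] for key in sentence_keys)
    return max(lookups_per_bank.values()) if lookups_per_bank else 0


# Hypothetical vocabulary of eight search keys spread over four banks.
modulo_allocation = {key: key % 4 for key in range(8)}

# Assumed alternative in which keys 0 and 4 (which co-occur) sit in different banks.
learned_allocation = {0: 0, 4: 3, 1: 1, 2: 2, 3: 3, 5: 1, 6: 2, 7: 0}

sentence = [0, 4, 1, 2]
rounds_modulo = access_rounds(sentence, modulo_allocation)
rounds_learned = access_rounds(sentence, learned_allocation)
```

Under this model the second allocation serves the example sentence in a single round, while the modulo allocation needs two because keys 0 and 4 collide in the same bank.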

In addition, the memory/processing unit may also be used to perform (by its logic) at least some of the processing operations. This reduces the bandwidth required from buses outside the memory/processing unit and allows multiple operations to be computed efficiently (even in parallel) by utilizing the multiple processors of the memory/processing unit in parallel.

A memory bank may be associated with logic.

At least some of the processing operations may be executed by one or more additional processors (e.g., a vector processor, including a vector adder and the like).

The memory/processing unit may include one or more additional processors that may be allocated to some or all of the memory bank and logic pairs.

Thus, a single additional processor may be allocated to all or some of the memory bank and logic pairs. As another example, the additional processors may be arranged hierarchically so that an additional processor of one level processes the outputs of additional processors of a lower level.

Here, the processing operations may be executed by the logic of the memory/processing unit without using any additional processor.

FIGS. 89A, 89B, 89C, 89D, 89E, 89F, and 89G illustrate examples of memory/processing units 9010, 9011, 9012, 9013, 9014, 9015, and 9019, respectively. Memory/processing unit 9010 includes a controller 9020, an internal bus 9021, and multiple pairs of logic 9030 and memory banks 9040.

Here, logic 9030 and memory banks 9040 may be coupled to the controller and/or to each other in other ways. For example, multiple buses may be provided between the controller and the logic, the logic may be arranged in multiple layers, a single logic may be shared by multiple memory banks (see the example of FIG. 89E), and the like.

The page length of each memory bank in memory/processing unit 9010 may be defined in any manner. For example, the page length may be small enough, and the number of memory banks large enough, to allow a large number of vectors to be output in parallel without wasting many bits on irrelevant information.

Logic 9030 may include a full ALU, a partial ALU, a memory controller, a partial memory controller, and the like. A partial ALU (or partial memory controller) unit is capable of executing only some of the functions executable by a full ALU (or full memory controller). Any logic or sub-processor illustrated in this application may include a full ALU, a partial ALU, a memory controller, a partial memory controller, and the like.

The connectivity between controller 9020 and the multiple pairs of logic 9030 and memory banks 9040 may be implemented in other ways. The memory banks and the logic may be arranged in other ways (e.g., not in pairs).

Memory/processing unit 9010 may have no additional vector processor, in which case the processing of the vectors (from the memory banks) is performed by logic 9030.

FIG. 89B illustrates an additional processor, such as vector processor 9050, coupled to internal bus 9021.

FIG. 89C illustrates an additional processor, such as vector processor 9050, coupled to internal bus 9021. The one or more additional processors execute the processing operations (alone or together with the logic).

FIG. 89D illustrates a host 9018 coupled to memory/processing unit 9010 via bus 9022.

FIG. 89D also illustrates a vocabulary 9070 that maps words/phrases 9072 to vectors 9073. The memory/processing unit is accessed using search keys 9071, each of which represents a previously recognized word or phrase. Host 9018 sends multiple search keys 9071 representing a sentence to the memory/processing unit, and the memory/processing unit may output vectors 9073 or the result of a processing operation applied to the vectors associated with the sentence. The words/phrases themselves are typically not stored in memory/processing unit 9010.

The memory controller functionality for controlling the memory banks may be included (fully or partially) in the logic, may be included (fully or partially) in controller 9020, and/or may be included (fully or partially) in one or more memory controllers (not shown) within memory/processing unit 9010.

The memory/processing unit may be configured to maximize the throughput of vectors/results sent to host 9018, or may apply any process for controlling traffic inside the memory/processing unit and/or traffic between the memory/processing unit and the host computer (or any other entity outside the memory/processing unit).

Different logics 9030 may be coupled to the memory banks 9040 of the memory/processing unit and may perform mathematical operations on the vectors (preferably in parallel) to generate processed vectors. One logic 9030 may send a vector to another logic (see example line 38 of FIG. 89G), and the other logic may apply a mathematical operation to the received vector and to a vector that it calculated itself. The logics may be arranged in levels, and a logic of a given level may process vectors or intermediate results (produced by applying mathematical operations) from a logic of a previous level.

When the total size of the processed vectors exceeds the total size of the results, a reduction of the output bandwidth (out of the memory/processing unit) is obtained. For example, when K vectors are summed by the memory/processing unit to provide a single output vector, a K:1 reduction in bandwidth is obtained.
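The K:1 figure follows from simple arithmetic, which the short sketch below makes explicit. The vector dimension, element width, and value of K are arbitrary assumptions chosen for the example, and `output_bytes` is a hypothetical helper.

```python
def output_bytes(num_vectors: int, dim: int, bytes_per_element: int,
                 summed_in_memory: bool) -> int:
    """Bytes leaving the memory/processing unit for one sentence (sketch)."""
    vectors_out = 1 if summed_in_memory else num_vectors
    return vectors_out * dim * bytes_per_element


K = 16             # assumed number of vectors in the sentence
DIM = 300          # assumed number of elements per vector
ELEMENT_BYTES = 4  # assumed 32-bit elements

raw = output_bytes(K, DIM, ELEMENT_BYTES, summed_in_memory=False)
reduced = output_bytes(K, DIM, ELEMENT_BYTES, summed_in_memory=True)
reduction = raw // reduced
```

With these assumed numbers, 19,200 bytes would leave the unit if every vector were output, against 1,200 bytes for the single summed vector, a ratio equal to K.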

Controller 9020 may be configured to open multiple memory banks in parallel by broadcasting the addresses of the different vectors to be accessed.

The controller may be configured to control the order in which the different vectors are retrieved from the multiple memory banks (or from any intermediate buffer or storage circuit that stores the different vectors; see buffer 9033 of FIG. 89D), based at least in part on the order of the words or phrases in the sentence.

Controller 9020 may be configured to manage the retrieval of the different vectors based on one or more parameters related to outputting the vectors out of memory/processing unit 9010. For example, the rate at which the different vectors are retrieved from the memory banks may be set to be substantially equal to the allowable rate of outputting the different vectors from memory/processing unit 9010.

The controller may apply any traffic shaping process when outputting the different vectors out of memory/processing unit 9010. For example, controller 9020 may aim to output the different vectors at a rate as close as possible to the maximum rate allowed by the host computer or by the link that couples memory/processing unit 9010 to the host computer. As another example, the controller may output the different vectors while minimizing, or at least substantially reducing, fluctuations in the traffic rate over time.
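As one possible reading of the traffic shaping described above, the sketch below paces bursty vector arrivals so that no more than a fixed number of vectors leaves the unit per time tick. The tick-based model, the fixed `max_per_tick` cap, and the `shape` function are assumptions for illustration; the disclosure does not specify a particular shaping algorithm.

```python
from collections import deque
from typing import Deque, List


def shape(arrivals: List[List[str]], max_per_tick: int) -> List[List[str]]:
    """Emit at most max_per_tick vectors per tick, preserving arrival order."""
    if max_per_tick <= 0:
        raise ValueError("max_per_tick must be positive")
    queue: Deque[str] = deque()
    schedule: List[List[str]] = []
    tick = 0
    # Keep ticking until every arrival was queued and the queue has drained.
    while tick < len(arrivals) or queue:
        if tick < len(arrivals):
            queue.extend(arrivals[tick])  # vectors that became ready this tick
        burst = [queue.popleft() for _ in range(min(max_per_tick, len(queue)))]
        schedule.append(burst)
        tick += 1
    return schedule


# Hypothetical bursty readiness: four vectors at tick 0, none at 1, one at 2.
arrivals = [["v0", "v1", "v2", "v3"], [], ["v4"]]
schedule = shape(arrivals, max_per_tick=2)
```

The burst of four vectors is spread over two ticks, so the output rate stays at or below the assumed link limit and varies less over time than the arrival rate.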

Controller 9020 may belong to the same integrated circuit as memory banks 9040 and logic 9030, and can therefore easily receive feedback from the different logics/memory banks regarding the retrieval status of the different vectors (e.g., whether a vector is ready, or whether a vector is ready but another vector is being retrieved, or is about to be retrieved, from the same memory bank). The feedback may be provided in any manner, for example over dedicated control lines, over shared control lines, or using status bits (see status line 9039 of FIG. 89F).

Controller 9020 may independently control the retrieval and output of the different vectors, thereby reducing the involvement of the host computer. Alternatively, the host computer may be unaware of the management capabilities of the controller and may continue to send detailed instructions. In that case, memory/processing unit 9010 may ignore the detailed instructions, may hide the management capabilities of the controller, and the like. The solution described above may be used under a protocol that is manageable by the host computer.

It has been found that performing processing operations in the memory/processing unit is highly beneficial in terms of energy, even when such operations consume more power than processing operations in the host, and even when such operations consume more power than transfer operations between the host and the memory/processing unit. For example, assuming that the vectors are large enough, that transferring a data unit consumes 4 pJ, and that a processing operation on the data unit by the host consumes 0.1 pJ, processing the data unit by the memory/processing unit is more effective when the energy consumed by that processing is less than 5 pJ.

Each vector (a vector of the matrix that represents a sentence) may be represented by a sequence of words (or other multi-bit segments). For simplicity of explanation, the multi-bit segments are assumed to be words.

A further power reduction may be obtained when a vector includes zero-valued words. Instead of outputting an entire zero-valued word, a zero-value flag that is shorter than a word (e.g., one bit), and that may even be conveyed over a dedicated control line, may be output. Flags may also be allocated to other values (e.g., a word having a value of one).
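The zero-value flag can be illustrated with a minimal Python sketch. The sketch assumes a 16-bit word and one flag bit per word position; these values, and the helper names `encode_with_zero_flags`, `decode_with_zero_flags`, and `bits_on_link`, are hypothetical and are not taken from the disclosure.

```python
from typing import List, Tuple

WORD_BITS = 16  # assumed word width; the disclosure does not fix one


def encode_with_zero_flags(words: List[int]) -> Tuple[List[int], List[int]]:
    """Split a vector into one zero flag per word and the non-zero payload words."""
    flags = [1 if w == 0 else 0 for w in words]
    payload = [w for w in words if w != 0]
    return flags, payload


def decode_with_zero_flags(flags: List[int], payload: List[int]) -> List[int]:
    """Rebuild the original vector from the flags and the non-zero words."""
    remaining = iter(payload)
    return [0 if flag else next(remaining) for flag in flags]


def bits_on_link(words: List[int]) -> int:
    """Bits output when each zero word is replaced by its 1-bit flag."""
    flags, payload = encode_with_zero_flags(words)
    return len(flags) + len(payload) * WORD_BITS
```

Under these assumptions, a sparse vector costs one bit per word position plus a full word for each non-zero value, so the saving grows with the fraction of zero-valued words.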

FIG. 88A illustrates a method 9400 for embedding, which may also be a method for retrieving feature vector related information. The feature vector related information may include the feature vectors and/or a result of processing the feature vectors.

Method 9400 may begin with step 9410, in which a memory processing integrated circuit receives retrieval information for retrieving multiple requested feature vectors that may be mapped to multiple sentence segments.

The memory processing unit may include a controller, multiple processor subunits, and multiple memory units. Each memory unit may be coupled to a processor subunit.

Step 9410 may be followed by step 9420 of retrieving the multiple requested feature vectors from at least some of the multiple memory units.

The retrieving may include requesting, from two or more memory units, requested feature vectors that are stored in the two or more memory units.

The requesting may be executed based on a known mapping between the sentence segments and the locations of the feature vectors mapped to the sentence segments.

The mapping may be uploaded during a boot process of the memory processing integrated circuit.

It may be beneficial to retrieve as many requested feature vectors as possible at once, but this depends on where the requested feature vectors are stored and on the number of different memory units.
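The dependence on storage location can be sketched in Python. The sketch assumes the mapping is a dictionary from a sentence segment to a hypothetical `(memory_unit, address)` pair; the function name `group_requests` and the data layout are illustrative only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Location = Tuple[int, int]  # (memory unit index, address) -- assumed layout


def group_requests(
    segments: List[str], mapping: Dict[str, Location]
) -> Dict[int, List[Tuple[int, int]]]:
    """Group requested feature vectors by the memory unit that stores them.

    Each entry keeps the position of the segment in the sentence so that the
    original order can be restored after retrieval.
    """
    per_unit: Dict[int, List[Tuple[int, int]]] = defaultdict(list)
    for position, segment in enumerate(segments):
        unit, address = mapping[segment]
        per_unit[unit].append((position, address))
    return dict(per_unit)
```

In this model, the number of keys in the returned dictionary is the number of memory units that can be accessed at once, while several entries under one key must be served by the same unit.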

When more than one requested feature vector is stored in the same memory bank, predictive retrieval may be applied to reduce the penalty associated with retrieving information from the memory bank. Various methods for reducing the penalty are described in various parts of this application.

The retrieving may include applying predictive retrieval to at least some requested feature vectors of a set of requested feature vectors stored in a single memory unit.

The requested feature vectors may be distributed among the memory units in an optimal manner.

The requested feature vectors may be distributed among the memory units based on expected retrieval patterns.

The retrieval of the multiple requested feature vectors may be executed according to a certain order, for example according to the order of the sentence segments in one or more sentences.

The retrieval of the multiple requested feature vectors may be executed, at least in part, out of order, in which case the retrieving may further include reordering the multiple requested feature vectors.
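One way to picture out-of-order retrieval followed by reordering is a small reorder buffer, sketched below in Python. The function name `reorder` and the `(sequence_index, vector)` arrival format are assumptions made for illustration, not details of the disclosure.

```python
from typing import Any, Dict, Iterable, List, Tuple


def reorder(arrivals: Iterable[Tuple[int, Any]], count: int) -> List[Any]:
    """Return vectors in sequence order, given arrivals in arbitrary order.

    `arrivals` yields (sequence_index, vector) pairs as the memory units
    deliver them; `count` is the number of requested vectors.
    """
    pending: Dict[int, Any] = {}
    ordered: List[Any] = []
    next_index = 0
    for index, vector in arrivals:
        pending[index] = vector
        # Release every vector that is now contiguous with the output so far.
        while next_index in pending:
            ordered.append(pending.pop(next_index))
            next_index += 1
    if next_index != count:
        raise ValueError("some requested vectors never arrived")
    return ordered
```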

The retrieval of the multiple requested feature vectors may include buffering the multiple requested feature vectors before they are read by the controller.

The retrieval of the multiple requested feature vectors may include generating buffer status indicators that indicate when one or more buffers associated with the multiple memory units store one or more requested feature vectors.

The method may include conveying the buffer status indicators over dedicated control lines.

A dedicated control line may be allocated per memory unit.

The buffer status indicators may be status bits stored in the one or more buffers.

The method may include conveying the buffer status indicators over one or more shared control lines.
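The buffer status indicators can be modeled, as a rough sketch, by one status bit per memory unit packed into a single integer. The packing scheme and the names `status_bits` and `ready_units` are hypothetical; the disclosure leaves the encoding and the choice between dedicated and shared control lines open.

```python
from typing import List, Sequence


def status_bits(buffers: Sequence[Sequence[object]]) -> int:
    """Pack one 'buffer holds at least one vector' bit per memory unit."""
    mask = 0
    for unit, buffered in enumerate(buffers):
        if len(buffered) > 0:
            mask |= 1 << unit
    return mask


def ready_units(mask: int) -> List[int]:
    """List the memory units whose status bit is set."""
    return [unit for unit in range(mask.bit_length()) if (mask >> unit) & 1]
```

A controller in this model would read the mask and fetch only from the units it lists, rather than polling every buffer.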

Step 9420 may be followed by step 9430 of processing the multiple requested feature vectors to provide a processing result.

Additionally or alternatively, step 9420 may be followed by step 9440 of outputting, from the memory processing integrated circuit, an output that may include at least one of (a) the requested feature vectors and (b) a result of processing the requested feature vectors. The at least one of (a) the requested feature vectors and (b) the result of processing the requested feature vectors is also referred to as feature vector related information.

When step 9430 is executed, step 9440 may include outputting (at least) the result of processing the requested feature vectors.

When step 9430 is skipped, step 9440 may include outputting the requested feature vectors and may not include outputting a result of processing the requested feature vectors.
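The control flow of steps 9410 to 9440 can be summarized in a short Python sketch. It assumes that the retrieval information is a list of hypothetical `(memory_unit, address)` pairs and that each memory unit is a dictionary from address to vector; the function name `method_9400` and the output format are illustrative.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Vector = List[float]


def method_9400(
    retrieval_info: Sequence[Tuple[int, int]],
    memory_units: Sequence[Dict[int, Vector]],
    process: Optional[Callable[[List[Vector]], Vector]] = None,
) -> Dict[str, object]:
    """Model of method 9400; `retrieval_info` is what step 9410 receives."""
    # Step 9420: retrieve the requested feature vectors from the memory units.
    vectors = [memory_units[unit][address] for unit, address in retrieval_info]
    if process is None:
        # Step 9430 skipped: step 9440 outputs the feature vectors themselves.
        return {"vectors": vectors}
    # Step 9430: process the vectors; step 9440 outputs the processing result.
    return {"result": process(vectors)}
```

Passing `process=None` corresponds to the variant in which the output contains the feature vectors only; passing a function corresponds to the variant in which the output contains the processing result.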

FIG. 88B illustrates a method 9401 for embedding.

It is assumed that the output includes the requested feature vectors but does not include a result of processing the requested feature vectors.

Method 9401 may begin with step 9410, in which the memory processing integrated circuit receives retrieval information for retrieving multiple requested feature vectors that may be mapped to multiple sentence segments.

Step 9420 may be followed by step 9431 of outputting, from the memory processing integrated circuit, an output that includes the requested feature vectors but does not include a result of processing the requested feature vectors.

FIG. 88C illustrates a method 9402 for embedding.

It is assumed that the output includes a result of processing the requested feature vectors.

Method 9402 may begin with step 9410, in which the memory processing integrated circuit receives retrieval information for retrieving multiple requested feature vectors that may be mapped to multiple sentence segments.

Step 9410 may be followed by step 9420 of retrieving the multiple requested feature vectors from at least some of the multiple memory units.

Step 9420 may be followed by step 9430 of processing the multiple requested feature vectors to provide a processing result.

Step 9430 may be followed by step 9442 of outputting, from the memory processing integrated circuit, an output that may include the result of processing the requested feature vectors.

Outputting the output may include applying traffic shaping to the output.

Outputting the output may include attempting to match the bandwidth used during the outputting to the maximum allowable bandwidth over the link that couples the memory processing integrated circuit to a requester unit.

Outputting the output may include attempting to maintain fluctuations of the output traffic rate below a threshold.
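As an illustration only, the following Python sketch shapes output traffic by greedily packing vectors into fixed time slots without exceeding a per-slot budget. The slot model, the budget parameter `max_per_tick`, and the function name `pace_output` are assumptions; the disclosure allows any traffic shaping process.

```python
from typing import List


def pace_output(sizes: List[int], max_per_tick: int) -> List[List[int]]:
    """Assign vectors (by index, in order) to ticks under a per-tick budget.

    `sizes[i]` is the size of vector i in arbitrary units, and `max_per_tick`
    models the maximum amount the link accepts per tick.
    """
    schedule: List[List[int]] = []
    current: List[int] = []
    used = 0
    for index, size in enumerate(sizes):
        if size > max_per_tick:
            raise ValueError("a single vector exceeds the per-tick budget")
        if used + size > max_per_tick:
            # The current tick is full; start a new one.
            schedule.append(current)
            current, used = [], 0
        current.append(index)
        used += size
    if current:
        schedule.append(current)
    return schedule
```

Keeping each tick at or below the budget bounds the peak rate, and packing ticks as full as the order allows keeps the used bandwidth close to the maximum, which is the trade-off the preceding paragraphs describe.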

Any of the retrieving and the outputting may be executed under the control of the host and/or independently, or partially independently, by the controller.

The host may send retrieval commands of different granularity, ranging from general retrieval information that is independent of the locations of the requested feature vectors within the multiple memory units to detailed retrieval information that is based on the locations of the requested feature vectors within the multiple memory units.

The host may control (or attempt to control) the timing of the different retrieval operations within the memory processing integrated circuit, but may also be indifferent to the timing.

The controller may be controlled by the host at various levels, and may even ignore detailed commands of the host and independently control at least the retrieving and/or the outputting.

The processing of the requested feature vectors may be executed by at least one of (or a combination of) the one or more memory/processing units and one or more processors located outside the one or more memory/processing units, and the like.

It should be noted that the processing of the requested feature vectors may be executed by at least one of (or a combination of) one or more processor subunits, a controller, one or more vector processors, and one or more other processors located outside the one or more memory/processing units, and the like.

The processing of the requested feature vectors may be executed by, and its result generated by, any one or a combination of the following:

a. A processor subunit (or logic 9030) of a memory/processing unit.

b. Processor subunits (or logic 9030) of multiple memory/processing units.

c. A controller of a memory/processing unit.

d. Controllers of multiple memory/processing units.

e. One or more vector processors of a memory/processing unit.

f. One or more vector processors of multiple memory/processing units.

Accordingly, the processing of the requested feature vectors may be executed by any combination or sub-combination of (a) one or more controllers of one or more memory/processing units, (b) one or more processor subunits of one or more memory/processing units, (c) one or more vector processors of one or more memory/processing units, and (d) one or more other processors located outside the one or more memory/processing units.

Processing executed by two or more processing entities may be referred to as distributed processing.

A memory/processing unit may include multiple processor subunits. The processor subunits may operate independently of each other, may partially cooperate with each other, may participate in distributed processing, and the like.

The processing may be executed in a flat manner, in which all processor subunits perform the same operations (and may or may not output results of the processing between operations).

The processing may be executed in a hierarchical manner that involves a sequence of processing operations of different levels, in which a processing operation of a certain level follows a processing operation of another level. The processor subunits may be allocated (dynamically or statically) to different levels and may participate in the hierarchical processing.

The processing of the requested feature vectors may be executed by two or more processing entities (processor subunits, controllers, vector processors, other processors) and may be distributed in any manner (flat, hierarchical, or otherwise). For example, the processor subunits may output their processing results to the controller, and the controller may further process the results. One or more other processors located outside the one or more memory/processing units may further process the output of the memory processing integrated circuit.
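The example in which processor subunits pass partial results to the controller can be sketched as a two-level sum in Python. The function names and the choice of summation as the operation are assumptions made for illustration.

```python
from typing import List, Sequence

Vector = List[float]


def subunit_partial_sum(vectors: Sequence[Vector]) -> Vector:
    """First level: a processor subunit sums the vectors in its own memory."""
    return [sum(column) for column in zip(*vectors)]


def controller_combine(partials: Sequence[Vector]) -> Vector:
    """Second level: the controller combines the partial sums."""
    return [sum(column) for column in zip(*partials)]


def hierarchical_sum(per_subunit_vectors: Sequence[Sequence[Vector]]) -> Vector:
    """Run both levels; subunits that hold no requested vectors are skipped."""
    partials = [subunit_partial_sum(vs) for vs in per_subunit_vectors if vs]
    return controller_combine(partials)
```

In this sketch only one vector per subunit crosses to the controller, regardless of how many feature vectors the subunit holds, which is the kind of traffic reduction that motivates processing close to the memory.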

It should be noted that the retrieval information may also include information for retrieving requested feature vectors that are not mapped to sentence segments. Such feature vectors may be mapped to one or more persons, devices, or any other entities that may be related to the sentence segments, for example a user of a device that sensed the sentence segments, the device that sensed the sentence segments, a user identified as the source of the sentence segments, a web site that was accessed when the sentence was generated, a location at which the sentence was captured, and the like.

Methods 9400, 9401, and 9402 are applicable, mutatis mutandis, to the retrieval and/or processing of requested feature vectors that are not mapped to sentence segments.

Non-limiting examples of processing the feature vectors include summing, weighted summing, averaging, subtracting, or applying any other mathematical function.
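Two of the listed operations, weighted summing and averaging, are shown below as plain element-wise Python functions. The function names and the list-of-lists vector representation are illustrative assumptions.

```python
from typing import List, Sequence

Vector = List[float]


def weighted_sum(vectors: Sequence[Vector], weights: Sequence[float]) -> Vector:
    """Element-wise weighted sum of equally long feature vectors."""
    if len(vectors) != len(weights):
        raise ValueError("one weight is required per vector")
    return [
        sum(weight * value for weight, value in zip(weights, column))
        for column in zip(*vectors)
    ]


def average(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally long feature vectors."""
    if not vectors:
        raise ValueError("at least one vector is required")
    return [sum(column) / len(vectors) for column in zip(*vectors)]
```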

Hybrid Device

As processor speeds and memory sizes both continue to increase, a significant limitation on effective processing speed is the von Neumann bottleneck. The von Neumann bottleneck results from the throughput limitations of conventional computer architectures. In particular, the transfer of data from memory (i.e., memory external to the logic die, such as external DRAM memory) to the processor is often the bottleneck when compared with the actual computations performed by the processor. Accordingly, the number of clock cycles spent reading from and writing to memory increases significantly in memory-intensive processes. Because these clock cycles are consumed by reading and writing and cannot be used to perform operations on data, the effective processing speed is reduced. Moreover, the computational bandwidth of the processor is generally larger than the bandwidth of the buses that the processor uses to access the memory.
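The bottleneck can be made concrete with a simple back-of-the-envelope model in Python. The model assumes that computation and data transfer fully overlap, so the slower of the two sets the run time; this assumption, the function name `limiting_factor`, and the figures in the usage note are illustrative and are not taken from the disclosure.

```python
from typing import Tuple


def limiting_factor(
    operations: float,
    bytes_moved: float,
    ops_per_second: float,
    bytes_per_second: float,
) -> Tuple[str, float]:
    """Return which resource bounds the run time, and the bounded time in seconds."""
    compute_time = operations / ops_per_second
    transfer_time = bytes_moved / bytes_per_second
    if transfer_time > compute_time:
        return "memory", transfer_time
    return "compute", compute_time
```

For instance, a hypothetical workload of 1e9 operations over 4e9 bytes, on a processor capable of 1e12 operations per second behind a 1e11 bytes-per-second bus, is memory bound in this model: the transfer takes about 40 ms while the computation takes 1 ms.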

These bottlenecks are particularly pronounced for memory-intensive processes such as neural networks and other machine learning algorithms, database construction, index searching, querying, and other tasks that include more read and write operations than data processing operations.

The present disclosure describes solutions for mitigating or overcoming one or more of the problems set forth above, among other problems in the prior art.

A hybrid device for memory-intensive processing may be provided. The hybrid device may include a base die, multiple processors, a first memory resource of at least one other die, and a second memory resource of at least one further die.

The base die and the at least one other die are connected to each other by wafer-on-wafer bonding.

The multiple processors are configured to perform processing operations and to retrieve information stored in the first memory resource.

The second memory resource is configured to send additional information from the second memory resource to the first memory resource.

The overall bandwidth of a first path between the base die and the at least one other die is greater than the bandwidth of a second path between the at least one other die and the at least one further die, and the storage capacity of the first memory resource is several times smaller than the storage capacity of the second memory resource.

The second memory resource is a high bandwidth memory (HBM) resource.

The at least one further die is a stack of high bandwidth memory (HBM) chips.

At least a portion of the second memory resource may belong to a further die that is connected to the base die by connectivity that differs from wafer-to-wafer bonding.

At least a portion of the second memory resource may belong to a further die that is connected to the other die by connectivity that differs from wafer-to-wafer bonding.

The first memory resource and the second memory resource are cache memories of different levels.

The first memory resource is located between the base die and the second memory resource.

The first memory resource is located beside the second memory resource.

The other die is configured to perform additional processing, and the other die includes a plurality of processor subunits and the first memory resource.

Each processor subunit is coupled to a unique portion of the first memory resource that is allocated to the processor subunit.

The unique portion of the first memory resource is at least one memory bank.

The multiple processors are a plurality of processor subunits included in a memory processing chip that also includes the first memory resource.

The base die includes the multiple processors, and the multiple processors are a plurality of processor subunits that are coupled via conductors formed in the wafer-to-wafer bonding.

A hybrid integrated circuit may be provided that utilizes wafer-on-wafer (WOW) connectivity to couple at least a portion of a base die to a second memory resource that is included in one or more further dies and is connected using connectivity that differs from WOW connectivity. An example of the second memory resource is a high bandwidth memory (HBM) resource. In various figures, the second memory resource is included in a stack of HBM memory units that may be coupled to a controller using through-silicon via (TSV) connectivity. The controller may be included in the base die or coupled (e.g., via micro-bumps) to at least a portion of the base die.

The base die may be a logic die but may also be a memory/processing unit.

The WOW connectivity may be used to couple one or more portions of the base die to one or more portions of another die (a WOW-connected die), which may be a memory die or a memory/processing unit. WOW connectivity is a very high-throughput connectivity.

A stack of high bandwidth memory (HBM) chips may be coupled to the base die (directly or through the WOW-connected die) and may provide a high-throughput connection and greatly expanded memory resources.

The WOW-connected die may be coupled between the stack of HBM chips and the base die to form an HBM memory chip stack with TSV connectivity and a WOW-connected die at the bottom.

An HBM chip stack with TSV connectivity and a WOW-connected die at the bottom may provide a multi-layer memory hierarchy in which the WOW-connected die serves as a lower-level memory (e.g., a level 3 cache) that can be accessed by the base die, and in which fetch and/or pre-fetch operations from the higher-level HBM memory stack fill the WOW-connected die.
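The hierarchy described above can be modeled, as a rough behavioral sketch, by a small cache (standing in for the WOW-connected die) that is filled from a larger store (standing in for the HBM stack) with sequential pre-fetch. The class name `WowCache`, the least-recently-used eviction policy, and the next-address pre-fetch rule are all assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Dict


class WowCache:
    """Small, fast level filled by fetch and pre-fetch from a larger level."""

    def __init__(self, hbm: Dict[int, int], capacity: int, prefetch: int = 1):
        self.hbm = hbm                # larger, slower level (address -> value)
        self.capacity = capacity      # number of entries the fast level holds
        self.prefetch = prefetch      # how many following addresses to pre-fetch
        self.lines: "OrderedDict[int, int]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def _fill(self, address: int) -> None:
        """Copy one entry from the HBM level, evicting the oldest if full."""
        if address in self.hbm and address not in self.lines:
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)
            self.lines[address] = self.hbm[address]

    def read(self, address: int) -> int:
        """Read through the fast level, then pre-fetch the next addresses."""
        if address in self.lines:
            self.hits += 1
            self.lines.move_to_end(address)
        else:
            self.misses += 1
            self._fill(address)          # fetch on a miss
        value = self.hbm[address]
        for offset in range(1, self.prefetch + 1):
            self._fill(address + offset)  # pre-fetch
        return value
```

With a sequential access pattern, only the first read misses in this model, because each read pre-fetches the address that the next read will ask for.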

The HBM memory chips may be HBM DRAM chips, but any other memory technology may be used.

Using WOW connectivity in combination with HBM chips may provide a multi-level memory structure that may include multiple memory layers and may offer a balance between bandwidth and memory density.

The suggested solution may serve as an additional, entirely new level of the memory hierarchy between conventional DRAM memory/HBM and the internal cache of the logic die.

It may provide a new memory hierarchy level on the DRAM side that better manages memory reads in a faster manner.

FIGS. 93A to 93I illustrate hybrid integrated circuits 11011'-11019', respectively.

FIG. 93A illustrates an HBM DRAM stack with TSV connectivity and micro-bumps at the lowest level (collectively denoted 11030), which includes a stack of HBM DRAM memory chips 11032 that are connected to each other and to a first memory controller 11031 of a base die using TSVs 11039.

FIG. 93A also illustrates a wafer with at least memory resources that is coupled using WOW technology (collectively denoted 11040), which includes a second memory controller 11022 of the base die 11019 that is coupled to a DRAM wafer 11021 via one or more WOW intermediate layers 11023. The one or more WOW intermediate layers may be made of different materials but may differ from pad connectivity and/or TSV connectivity.

Conductors 11022' pass through the one or more WOW intermediate layers and electrically couple the DRAM die to components of the base die.

The base die 11019 is coupled to an interposer 11018, which in turn is coupled to a package substrate 11017 using micro-bumps. The package substrate has an array of micro-bumps on its lower surface.

The micro-bumps may be replaced by other connectivity. The interposer 11018 and the package substrate 11017 may be replaced by other layers.

The first memory controller 11031 and/or the second memory controller 11032 may be located (at least partially) outside the base die 11019, for example in the DRAM wafer, between the DRAM wafer and the base die, between the stack of HBM memory units and the base die, and the like.

The first memory controller 11031 and the second memory controller 11032 may belong to the same controller or to different controllers.

One or more of the HBM memory units may include logic as well as memory, and may be, or may include, a memory/processing unit.

The first and second memory controllers may be coupled to each other by multiple buses 11016 for conveying information between the first memory resource and the second memory resource. FIG. 93A also illustrates a bus 11014 from the second memory controller to components of the base die (e.g., multiple processors). FIG. 93A further illustrates a bus 11015 from the first memory controller to components of the base die (e.g., multiple processors, as shown in FIG. 93C).

FIG. 93B illustrates a hybrid integrated circuit 11012 that differs from the hybrid integrated circuit 11011 of FIG. 93A by having a memory/processing unit 11021' instead of the DRAM die 11021.

FIG. 93C illustrates a hybrid integrated circuit 11013 that differs from the hybrid integrated circuit 11011 of FIG. 93A by having an HBM memory chip stack with TSV connectivity and a WOW-connected die at the bottom (collectively denoted 11040), which includes a DRAM die 11021 between the stack of HBM memory units and the base die 11018.

The DRAM die 11021 is coupled to the first memory controller 11031 of the base die 11019 using WOW technology (see WOW intermediate layers 11023). One or more of the HBM memory dies 11032 may include both logic and memory, and may be, or may include, a memory/processing unit.

The lowermost DRAM die (shown as DRAM die 11021 in FIG. 93C) may be an HBM memory die or may differ from an HBM memory die. As shown in the hybrid integrated circuit 11014 of FIG. 93D, the lowermost DRAM die (DRAM die 11021) may be replaced by a memory/processing unit 11021'.

FIGS. 93E to 93G show hybrid integrated circuits 11015, 11016, and 11016', respectively, in which base die 11019 is coupled to multiple instances of HBM DRAM stack 11030 (with TSV connectivity and micro-bumps at the lowest level) and of wafer 11020 (which has at least memory resources and is coupled using WOW technology), and/or to multiple instances of HBM memory chip stack 11040 (with TSV connectivity and a WOW-connected die at the bottom).

FIG. 93H shows hybrid integrated circuit 11014', which differs from hybrid integrated circuit 11014 of FIG. 93D by showing memory unit 11053, a level 2 cache memory (L2 cache 11052), and multiple processors 11051. Multiple processors 11051 are coupled to L2 cache 11052 and may be fed with coefficients and/or data stored in memory unit 11053 and L2 cache 11052.

Any of the hybrid integrated circuits described above may be used for bandwidth-intensive artificial intelligence (AI) processing.

Memory/processing unit 11021' of FIGS. 93D and 93H, when coupled to the memory controller using WOW technology, may perform AI calculations and may receive both data and coefficients at very high rates from the HBM DRAM stack and/or from the WOW-connected die.

임의의 모든 메모리/프로세싱 유닛은 분산 메모리 어레이 및 프로세서 어레이를 포함할 수 있다. 분산 메모리 어레이 및 프로세서 어레이는 다중 메모리 뱅크와 다중 프로세서를 포함할 수 있다. 다중 프로세서는 프로세싱 어레이를 형성할 수 있다. Any and all memory/processing units may include distributed memory arrays and processor arrays. Distributed memory arrays and processor arrays may include multiple memory banks and multiple processors. Multiple processors may form a processing array.

Referring to FIGS. 93C, 93D, and 93H, assume that hybrid integrated circuit 11013, 11014, or 11014' is required to execute general matrix-vector multiplication (GEMV), that is, a calculation in which a matrix is multiplied by a vector. This type of calculation is bandwidth intensive because retrieved matrix data is not reused: the entire matrix must be retrieved, and it is used only once.

The GEMV may be part of a sequence of mathematical operations that includes (i) multiplying a first matrix (A) by a first vector (V1) to provide a first intermediate vector, and applying a first nonlinear operation (NLO1) to the first intermediate vector to provide a first intermediate result; (ii) multiplying a second matrix (B) by the first intermediate result to provide a second intermediate vector, and applying a second nonlinear operation (NLO2) to the second intermediate vector to provide a second intermediate result; and so on, until an N-th intermediate result is obtained, where N is greater than 2.

각 행렬이 크다고(예: 1 기가비트) 가정하면, 계산에는 1 테라비트의 연산 파워가 필요하고 1 테라비트의 대역폭/스루풋이 필요하다. 연산과 계산은 병렬로 실행될 수 있다. Assuming each matrix is large (eg 1 gigabit), the computation requires 1 terabit of computational power and 1 terabit of bandwidth/throughput. Operations and computations can be executed in parallel.

Assume that the GEMV calculation has N=4 and the following form: Result = NLO4(D*(NLO3(C*(NLO2(B*(NLO1(A*V1))))))).
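The chained calculation above can be sketched in a few lines of Python. This is only an illustration: the ReLU nonlinearity, the tiny matrices, and the names `gemv`, `relu`, and `gemv_chain` are assumptions that do not appear in the disclosure, which leaves the nonlinear operations unspecified.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def gemv(matrix: Matrix, vector: Sequence[float]) -> Vector:
    """Multiply a matrix by a vector; each matrix element is read exactly once."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def relu(vector: Sequence[float]) -> Vector:
    """Illustrative nonlinear operation (assumed; the text only names NLO1..NLO4)."""
    return [max(0.0, x) for x in vector]


def gemv_chain(matrices: Sequence[Matrix],
               nlos: Sequence[Callable[[Sequence[float]], Vector]],
               v1: Sequence[float]) -> Vector:
    """Evaluate NLO_N(M_N * (... NLO_1(M_1 * v1))) from the innermost product outward."""
    result = list(v1)
    for matrix, nlo in zip(matrices, nlos):
        result = nlo(gemv(matrix, result))
    return result


# Hypothetical 2x2 (and 1x2) matrices standing in for the large matrices A..D.
A = [[1.0, -2.0], [0.5, 1.0]]
B = [[1.0, 1.0], [-1.0, 2.0]]
C = [[2.0, 0.0], [0.0, 1.0]]
D = [[1.0, 1.0]]
V1 = [1.0, 2.0]

result = gemv_chain([A, B, C, D], [relu, relu, relu, relu], V1)
```

Each matrix is traversed once per evaluation, which is why the calculation is bound by memory bandwidth rather than by arithmetic.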

Also assume that DRAM die 11021 (or memory/processing unit 11021') does not have enough memory to store A, B, C, and D at the same time, so that at least some of these matrices are stored in HBM DRAM dies 11032.

베이스 다이는 프로세서, ALU 등과 같은 계산 유닛을 포함하는 로직 다이인 것으로 가정한다. It is assumed that the base die is a logic die that includes computation units such as processors, ALUs, and the like.

제1 다이가 A*V1를 계산하는 동안, 제1 메모리 컨트롤러(11031)는 다음 계산을 위해 다른 행렬의 빠진 부분들을 하나 이상의 HBM DRAM 다이(11032)로부터 검색한다. While the first die calculates A*V1 , the first memory controller 11031 retrieves the missing portions of another matrix from the one or more HBM DRAM die 11032 for the next calculation.

도 93h를 참조하고 (a) DRAM 다이(11021)는 대역폭이 2 TB이고 용량이 512 Mb이고, (b) HBM DRAM 다이(11032)는 대역폭이 0.2 TB이고 용량이 8 Gb이고, (c) L2 캐시(11052)는 대역폭이 6 TB이고 용량이 10 Mb인 SRAM인 것으로 가정한다. 93h, (a) DRAM die 11021 has a bandwidth of 2 TB and a capacity of 512 Mb, (b) HBM DRAM die 11032 has a bandwidth of 0.2 TB and a capacity of 8 Gb, and (c) L2 Assume that the cache 11052 is SRAM with a bandwidth of 6 TB and a capacity of 10 Mb.

Matrix multiplication may involve data reuse: a large matrix is divided into segments (e.g., 5 Mb segments that fit into the L2 cache, which may be used in a double-buffer configuration), and a first matrix segment is multiplied by the segments of a second matrix, one second-matrix segment after another.

제1 행렬 세그먼트를 제2 행렬 세그먼트로 곱하는 동안, 다른 제2 행렬 세그먼트가 (메모리 프로세싱 유닛(11021')의) DRAM 다이(11021)로부터 L2 캐시로 페치된다. While multiplying the first matrix segment by the second matrix segment, another second matrix segment is fetched from the DRAM die 11021 (of the memory processing unit 11021 ′) into the L2 cache.
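The double-buffer arrangement described above can be sketched as follows. This is a sequential simulation under stated assumptions: the function name `double_buffered_multiply`, the event log, and the use of plain numbers as "segments" are illustrative only, and real hardware would overlap the fetch with the computation instead of performing them one after the other.

```python
from typing import Callable, List, Sequence, Tuple

L2_CAPACITY_MBIT = 10                    # L2 cache capacity from the example above
SEGMENT_MBIT = L2_CAPACITY_MBIT // 2     # two segments fit: one in use, one being fetched


def double_buffered_multiply(first_segment,
                             second_segments: Sequence,
                             multiply: Callable) -> Tuple[List, List[tuple]]:
    """Multiply first_segment by each second-matrix segment using two buffer slots."""
    buffers = [None, None]
    log: List[tuple] = []
    results: List = []
    if not second_segments:
        return results, log
    buffers[0] = second_segments[0]          # initial fetch into slot 0
    log.append(("fetch", 0, 0))
    for i in range(len(second_segments)):
        active = i % 2                       # slot holding the segment being multiplied
        nxt = i + 1
        if nxt < len(second_segments):
            # In hardware this fetch would run concurrently with the multiply below.
            buffers[1 - active] = second_segments[nxt]
            log.append(("fetch", nxt, 1 - active))
        results.append(multiply(first_segment, buffers[active]))
        log.append(("compute", i, active))
    return results, log


# Toy usage: numbers stand in for matrix segments.
results, log = double_buffered_multiply(2, [1, 2, 3], lambda a, b: a * b)
```

The log shows that segment i+1 is always placed in the idle slot before segment i is consumed, so the compute stage never waits on an empty buffer.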

Assuming that the matrices are 1 Gb each, DRAM die 11021 or memory/processing unit 11021' is fed with matrix segments from HBM DRAM dies 11032 while the fetch operations and the calculations are executed.

DRAM 다이(11021) 또는 메모리/프로세싱 유닛(11021')은 행렬 세그먼트를 합하고, 행렬 세그먼트들은 WOW 중간층(11023)을 통해 베이스 다이(11019)로 공급된다. The DRAM die 11021 or memory/processing unit 11021 ′ sums the matrix segments, and the matrix segments are supplied to the base die 11019 through the WOW intermediate layer 11023 .

Memory/processing unit 11021' may reduce the amount of information sent through WOW intermediate layer 11023 to base die 11019 by performing calculations and sending the results, instead of sending the intermediate values from which the results are calculated. When multiple (Q) intermediate values are processed to provide one result, the compression ratio may be Q:1.
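As a rough illustration of the Q:1 ratio, the sketch below reduces Q intermediate products to a single value before anything is "sent". The choice of a dot-product reduction, the value Q = 8, and the helper name `reduce_near_memory` are assumptions made for the example; it also assumes that a result occupies the same number of bits as one intermediate value.

```python
from typing import List, Sequence


def reduce_near_memory(intermediate_values: Sequence[float]) -> float:
    """Combine the intermediate values where they are stored and return one result."""
    return sum(intermediate_values)


Q = 8
coefficients = list(range(1, Q + 1))      # hypothetical stored coefficients 1..8
data = [2] * Q                            # hypothetical input data

# Q intermediate values (element-wise products) that never leave the memory side.
intermediates: List[int] = [c * d for c, d in zip(coefficients, data)]

values_sent_without_reduction = len(intermediates)   # Q values cross the interface
values_sent_with_reduction = 1                       # only the result crosses
result = reduce_near_memory(intermediates)
compression_ratio = values_sent_without_reduction / values_sent_with_reduction
```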

도 93i는 WOW 기술로 구현되는 메모리 프로세싱 유닛(11019')의 일례를 도시한 것이다. 로직 유닛(9030)(프로세서 서브유닛일 수 있음), 컨트롤러(9020), 및 버스(9021)가 제1 칩(11061) 내에 위치하고 있고, 상이한 로직 유닛에 할당된 메모리 뱅크(9040)가 제2 칩(11062) 내에 위치하고 있다. 여기서, 제1 칩과 제2 칩은 하나 이상의 WOW 중간층을 포함할 수 있는 WOW 접합(11061)을 통과하는 컨덕터(11012')를 활용하여 서로 연결된다. 93I shows an example of a memory processing unit 11019' implemented with WOW technology. A logic unit 9030 (which may be a processor subunit), a controller 9020 , and a bus 9021 are located in a first chip 11061 , and a memory bank 9040 assigned to a different logic unit is located in a second chip It is located in (11062). Here, the first chip and the second chip are connected to each other utilizing a conductor 11012 ′ passing through a WOW junction 11061 which may include one or more WOW interlayers.

FIG. 93J is an example of method 11100 for memory-intensive processing. Memory intensive means that the processing requires, or is associated with, high-bandwidth memory consumption.

방법(11100)은 단계 11110, 단계 11120, 및 단계 11130으로 시작할 수 있다. Method 11100 may begin with steps 11110 , 11120 , and 11130 .

Step 11110 includes performing processing operations by multiple processors of a hybrid device that includes a base die, a first memory resource of at least one other die, and a second memory resource of at least one further die, where the base die and the at least one other die are connected to each other by wafer-on-wafer bonding.

Step 11120 includes retrieving, by the multiple processors, information stored in the first memory resource.

Step 11130 may include sending additional information from the second memory resource to the first memory resource, where the overall bandwidth of a first path between the base die and the at least one other die exceeds the overall bandwidth of a second path between the at least one other die and the at least one further die, and where the storage capacity of the first memory resource is several times smaller than the storage capacity of the second memory resource.

방법 11100은 복수의 프로세서 서브유닛과 제1 메모리 리소스를 포함하는 다른 다이가 추가적인 프로세싱을 수행하는 단계 11140을 더 포함할 수 있다. Method 11100 may further include step 11140 in which another die including the plurality of processor subunits and the first memory resource performs additional processing.

각 프로세서 서브유닛은 해당 프로세서 서브유닛에 할당된 제1 메모리 리소스의 고유 부분에 결합될 수 있다.Each processor subunit may be coupled to a unique portion of a first memory resource allocated to that processor subunit.

제1 메모리 리소스의 고유 부분은 적어도 하나의 메모리 뱅크이다.The unique portion of the first memory resource is at least one memory bank.

단계 11110, 단계 11120, 단계 11130, 및 단계 11140은 동시, 부분적으로 중첩되는 방식 등으로 실행될 수 있다. Step 11110, step 11120, step 11130, and step 11140 may be executed concurrently, in a partially overlapping manner, or the like.

제2 메모리 리소스는 고대역폭 메모리(HBM) 리소스이거나 HBM 메모리 리소스와 다를 수 있다. The second memory resource may be a high bandwidth memory (HBM) resource or may be different from the HBM memory resource.

The at least one further die may be a stack of HBM memory chips.

Communication chip

데이터베이스는 여러 필드를 포함하는 많은 엔트리를 포함한다. 데이터베이스 프로세싱은 하나 이상의 필터링 파라미터(예: 하나 이상의 관련 있는 필드의 식별자 또는 하나 이상의 관련 있는 필드의 값)를 포함하고 또한 실행될 동작의 유형, 동작을 적용할 때에 사용될 변수 또는 상수 등을 판단할 수 있는 하나 이상의 동작 파라미터를 포함하는 하나 이상의 쿼리의 실행을 포함하는 것이 일반적이다. 데이터 프로세싱은 데이터베이스 분석 또는 기타 데이터베이스 프로세스를 포함할 수 있다. A database contains many entries containing several fields. Database processing may include one or more filtering parameters (eg, identifiers of one or more relevant fields or values of one or more relevant fields) and may also determine the type of operation to be executed, a variable or constant to be used when applying the operation, etc. It typically involves the execution of one or more queries that include one or more operational parameters. Data processing may include database analysis or other database processes.

For example, a database query may request that a statistical operation (the operation parameter) be performed on all records of the database in which a certain field has a value within a predefined range (the filtering parameter). As another example, a database query may request deletion (the operation parameter) of records in which a certain field is smaller than a threshold (the filtering parameter).
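The split between filtering parameters and operation parameters can be made concrete with a small sketch. The `Query` class, the field names `age` and `salary`, and the sample records are hypothetical and serve only to illustrate the two examples above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Record = Dict[str, float]


@dataclass
class Query:
    filter_field: str                              # filtering parameter: which field
    low: float                                     # filtering parameter: range start
    high: float                                    # filtering parameter: range end
    operation: Callable[[List[Record]], object]    # operation parameter


def mean_salary(records: List[Record]) -> Optional[float]:
    """Statistical operation applied to the relevant records only."""
    if not records:
        return None
    return sum(r["salary"] for r in records) / len(records)


def run_query(records: List[Record], query: Query) -> object:
    """Filter first, then apply the operation to the relevant records."""
    relevant = [r for r in records
                if query.low <= r[query.filter_field] <= query.high]
    return query.operation(relevant)


def delete_below(records: List[Record], field_name: str, threshold: float) -> List[Record]:
    """Second example: drop records whose field is smaller than a threshold."""
    return [r for r in records if not r[field_name] < threshold]


records = [
    {"age": 25, "salary": 100.0},
    {"age": 40, "salary": 200.0},
    {"age": 35, "salary": 300.0},
]
answer = run_query(records, Query("age", 30, 45, mean_salary))
remaining = delete_below(records, "age", 30)
```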

대량의 데이터베이스는 일반적으로 스토리지 장치에 저장된다. 쿼리에 응답하기 위하여, 데이터베이스는, 하나의 데이터베이스 세그먼트씩, 메모리 유닛으로 전송된다. Bulk databases are typically stored on storage devices. To respond to a query, the database is transferred to a memory unit, one database segment at a time.

데이터베이스 세그먼트의 엔트리는 메모리 유닛으로부터 메모리 유닛과 동일한 집적회로에 속하지 않는 프로세서로 전송된다. 이후에 엔트리는 프로세서에 의해 처리된다. Entries in the database segment are transferred from the memory unit to a processor that does not belong to the same integrated circuit as the memory unit. The entry is then processed by the processor.

For each database segment of the database stored in the memory unit, the processing includes the following steps: (i) selecting a record of the database segment; (ii) sending the record from the memory unit to the processor; (iii) filtering the record by the processor to determine whether the record is relevant; and (iv) performing one or more additional operations (summing, or applying any other mathematical and/or statistical operation) on the relevant records.

The filtering process ends once all the records have been sent to the processor and the processor has determined which records are relevant.

If the relevant entries of the database are not stored in the processor, these relevant records must be sent to the processor so that they can be further processed after the filtering step (i.e., so that the operation that follows the filtering can be applied).

다중 프로세싱 연산이 단일 필터링에 후속하는 경우, 각 연산의 결과는 메모리 유닛으로 전송될 수 있고, 이후에 다시 프로세서로 전송될 수 있다. When multiple processing operations follow a single filtering, the result of each operation may be sent to a memory unit, and then sent back to the processor.

이러한 프로세스는 대역폭과 시간이 많이 든다. This process is bandwidth and time consuming.

데이터베이스 프로세싱을 수행하는 효율적인 방식을 제공할 필요가 증가하고 있다. There is a growing need to provide an efficient way to perform database processing.

데이터베이스 가속 집적회로를 포함할 수 있는 장치가 제공될 수 있다. An apparatus may be provided that may include a database acceleration integrated circuit.

A device may be provided that includes one or more groups of database acceleration integrated circuits, which may be configured to exchange information and/or acceleration results (results of the processing performed by the database acceleration integrated circuits) among the database acceleration integrated circuits of the one or more groups.

데이터베이스 가속 집적회로 그룹의 데이터베이스 가속 집적회로들은 동일한 인쇄회로기판에 연결될 수 있다. The database acceleration integrated circuits of the database acceleration integrated circuit group may be connected to the same printed circuit board.

데이터베이스 가속 집적회로 그룹의 데이터베이스 가속 집적회로들은 전산 시스템의 모듈형 유닛에 속할 수 있다. The database acceleration integrated circuits of the database acceleration integrated circuit group may belong to a modular unit of the computing system.

상이한 그룹의 데이터베이스 가속 집적회로들은 상이한 인쇄회로기판에 연결될 수 있다. Different groups of database acceleration integrated circuits may be connected to different printed circuit boards.

상이한 그룹의 데이터베이스 가속 집적회로들은 전산 시스템의 상이한 모듈형 유닛에 속할 수 있다. Different groups of database acceleration integrated circuits may belong to different modular units of a computing system.

장치는 하나 이상의 그룹의 데이터베이스 가속 집적회로들에 의해 분산 프로세스를 실행하도록 구성될 수 있다. The apparatus may be configured to execute a distributed process by one or more groups of database acceleration integrated circuits.

The device may be configured to use at least one switch to exchange at least one of (a) information and (b) database acceleration results between database acceleration integrated circuits of different groups of the one or more groups.

The device may be configured to execute a distributed process by some of the database acceleration integrated circuits of some of the one or more groups.

The device may be configured to perform distributed processing of a first data structure and a second data structure, where the aggregate size of the first data structure and the second data structure exceeds the storage capability of the multiple memory processing integrated circuits.

The device may be configured to perform the distributed processing by performing multiple iterations of (a) newly allocating different pairs of first data structure portions and second data structure portions to different database acceleration integrated circuits, and (b) processing the different pairs.
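One way to picture the iterative allocation of portion pairs is the sketch below, which schedules every (first portion, second portion) pair across a fixed number of accelerators and processes one batch of pairs per iteration. The use of a join-like "find common keys" operation, the function names, and the round-by-round batching policy are assumptions for illustration; the disclosure does not prescribe a particular operation or schedule.

```python
from itertools import product
from typing import List, Sequence, Tuple


def schedule_pairs(num_first: int, num_second: int,
                   num_accelerators: int) -> List[List[Tuple[int, int]]]:
    """Split all portion pairs into iterations of at most num_accelerators pairs."""
    pairs = list(product(range(num_first), range(num_second)))
    return [pairs[i:i + num_accelerators]
            for i in range(0, len(pairs), num_accelerators)]


def distributed_common_keys(first_parts: Sequence[Sequence[int]],
                            second_parts: Sequence[Sequence[int]],
                            num_accelerators: int) -> List[int]:
    """Process each pair on its own (simulated) accelerator and merge the results."""
    matches: List[int] = []
    schedule = schedule_pairs(len(first_parts), len(second_parts), num_accelerators)
    for iteration in schedule:
        # Within one iteration, each pair would run on a different accelerator.
        for i, j in iteration:
            matches.extend(set(first_parts[i]) & set(second_parts[j]))
    return sorted(matches)


first_parts = [[1, 2], [3, 4]]       # portions of the first data structure
second_parts = [[2, 3], [4, 5]]      # portions of the second data structure
schedule = schedule_pairs(2, 2, 2)
common = distributed_common_keys(first_parts, second_parts, 2)
```

Only one pair of portions needs to be resident per accelerator at a time, which is what allows the combined data structures to exceed the total storage of the memory processing integrated circuits.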

FIGS. 94A and 94B show examples of storage system 11560, computer system 11510, and one or more devices 11520 for database acceleration. The one or more devices 11520 for database acceleration may monitor the communication between storage system 11560 and computer system 11510 in various ways, for example by sniffing or by being located between storage system 11560 and computer system 11510.

스토리지 시스템(11560)은 많은(예: 20개 이상, 50개 이상, 100개 이상 등) 스토리지 유닛(예: 디스크)을 포함할 수 있고 예컨대 100 TB 이상의 정보를 저장할 수 있다. 컴퓨터 시스템(11510)은 방대한 컴퓨터 시스템일 수 있고 수십, 수백, 수천 개의 프로세싱 유닛을 포함할 수 있다. The storage system 11560 may include many (eg, 20 or more, 50 or more, 100 or more, etc.) storage units (eg, disks) and may store, for example, 100 TB or more of information. The computer system 11510 may be a massive computer system and may include tens, hundreds, or thousands of processing units.

컴퓨터 시스템(11510)은 매니저(11511)에 의해 제어되는 다중 컴퓨트 노드(compute nodes, 11512)를 포함할 수 있다. The computer system 11510 may include multiple compute nodes 11512 controlled by a manager 11511 .

컴퓨트 노드는 데이터베이스 가속을 위한 하나 이상의 장치(11520)를 제어하거나 데이터베이스 가속을 위한 하나 이상의 장치(11520)와 상호 작용할 수 있다. The compute node may control one or more devices 11520 for database acceleration or may interact with one or more devices 11520 for database acceleration.

데이터베이스 가속을 위한 하나 이상의 장치(11520)는 하나 이상의 데이터베이스 가속 집적회로(예: 도 94a와 도 94b의 데이터베이스 가속 집적회로(11530)) 및 메모리 리소스(11550)를 포함할 수 있다. 메모리 리소스는 메모리 전용의 하나 이상의 칩에 속할 수 있지만 메모리/프로세싱 유닛에 속할 수도 있다. One or more devices 11520 for database acceleration may include one or more database acceleration integrated circuits (eg, database acceleration integrated circuits 11530 of FIGS. 94A and 94B ) and memory resources 11550 . A memory resource may belong to one or more chips dedicated to memory, but may also belong to a memory/processing unit.

FIGS. 94C and 94D show examples of computer system 11510 and one or more devices 11520 for database acceleration.

The one or more database acceleration integrated circuits of the one or more devices 11520 for database acceleration may be controlled by management unit 11513, which may be located within the computer system (see FIG. 94C) or within the one or more devices 11520 for database acceleration (see FIG. 94D).

FIG. 94E shows device 11520 for database acceleration, which includes database acceleration integrated circuit 11530 and multiple memory processing integrated circuits 11551. Each memory processing integrated circuit may include a controller, multiple processor subunits, and multiple memory units.

데이터베이스 가속 집적회로(11530)는 네트워크 통신 인터페이스(11531), 제1 프로세싱 유닛(11532), 메모리 컨트롤러(11533), 데이터베이스 가속 유닛(11535), 인터커넥트(11536), 및 관리 유닛(11513)을 포함하는 것으로 도시되어 있다. The database acceleration integrated circuit 11530 includes a network communication interface 11531 , a first processing unit 11532 , a memory controller 11533 , a database acceleration unit 11535 , an interconnect 11536 , and a management unit 11513 . is shown as

Network communication interface 11531 may be configured to receive a massive amount of information from a large number of storage units (e.g., via first ports 11531(1) of the network communication interface). Each storage unit may output information at a rate of tens to hundreds of megabytes per second, and data transfer rates are expected to keep increasing (e.g., doubling every two to three years). The number of storage units may be 10, 50, 100, 200, or more. The massive amount of information may be tens to hundreds of gigabytes per second or more, and may even reach the range of terabytes or petabytes per second.
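The aggregate rates quoted above follow from simple multiplication, as the sketch below shows. The per-unit rates of 100 MB/s and 500 MB/s are illustrative picks from the "tens to hundreds of megabytes per second" range, and the conversion assumes 1 GB = 1000 MB.

```python
def aggregate_throughput_gb_per_s(num_units: int, mb_per_s_per_unit: float) -> float:
    """Total input rate when all storage units stream at the same per-unit rate."""
    return num_units * mb_per_s_per_unit / 1000.0


low_estimate = aggregate_throughput_gb_per_s(100, 100)    # 100 units at 100 MB/s each
high_estimate = aggregate_throughput_gb_per_s(200, 500)   # 200 units at 500 MB/s each
```

With these assumed figures the interface must absorb roughly 10 GB/s to 100 GB/s, consistent with the "tens to hundreds of gigabytes per second" stated in the text.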

First processing unit 11532 may be configured to first process (pre-process) the massive amount of information to provide first processed information.

Memory controller 11533 may be configured to send the first processed information to the multiple memory processing integrated circuits via massive throughput interface 11534.

Multiple memory processing integrated circuits 11551 may be configured to second process (process) at least a portion of the first processed information to provide second processed information.

Memory controller 11533 may be configured to retrieve information from the multiple memory processing integrated circuits. The retrieved information may include at least one of (a) at least a portion of the first processed information and (b) at least a portion of the second processed information.

Database acceleration unit 11535 may be configured to perform database processing operations on the retrieved information to provide database acceleration results.

데이터베이스 가속 집적회로는 예컨대 네트워크 통신 인터페이스의 하나 이상의 제2 포트(11531(2))를 통하여 데이터베이스 가속 결과를 출력하도록 구성될 수 있다. The database acceleration integrated circuit may be configured to output the database acceleration results, for example, via the one or more second ports 11531( 2 ) of the network communication interface.
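The flow from received information to database acceleration result can be sketched as a chain of three stages. Everything concrete here is an assumption made for illustration: decompression plus JSON parsing as the pre-processing, a threshold filter as the processing done in the memory processing integrated circuits, and a sum as the database processing operation. The function names are hypothetical.

```python
import json
import zlib
from typing import Dict, List, Sequence

Record = Dict[str, int]


def first_process(raw: bytes) -> List[Record]:
    """Pre-process: decompress and parse one incoming unit of information (assumed)."""
    return json.loads(zlib.decompress(raw).decode("utf-8"))


def second_process(records: List[Record], field_name: str, threshold: int) -> List[Record]:
    """Process near memory: keep only the relevant records (assumed filter)."""
    return [r for r in records if r[field_name] >= threshold]


def database_acceleration(records: List[Record], field_name: str) -> int:
    """Database processing on the retrieved information (assumed aggregation)."""
    return sum(r[field_name] for r in records)


def accelerate(streams: Sequence[bytes], field_name: str, threshold: int) -> int:
    """Run every incoming stream through the three stages and return the result."""
    retrieved: List[Record] = []
    for raw in streams:
        first_processed = first_process(raw)
        retrieved.extend(second_process(first_processed, field_name, threshold))
    return database_acceleration(retrieved, field_name)


def pack(records: List[Record]) -> bytes:
    """Helper that builds a compressed stream for the example."""
    return zlib.compress(json.dumps(records).encode("utf-8"))


streams = [pack([{"amount": 5}, {"amount": 20}]),
           pack([{"amount": 30}, {"amount": 1}])]
total = accelerate(streams, "amount", 10)
```

Because the filter runs before the aggregation, only the relevant records reach the final stage, which is the traffic reduction the architecture aims for.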

FIG. 94E also shows management unit 11513, which is configured to manage at least one of the retrieving of the retrieved information, the first processing (pre-processing), the second processing (processing), and the third processing (database processing). Management unit 11513 may be located outside the database acceleration integrated circuit.

관리 유닛은 실행 계획에 의거하여 상기 관리를 수행하도록 구성될 수 있다. 실행 계획은 관리 유닛에 의해 생성되거나 데이터베이스 가속 집적회로 외부에 위치한 엔티티에 의해 생성될 수 있다. 실행 계획은 (a) 데이터베이스 가속 집적회로의 다양한 컴포넌트에 의해 실행될 지시, (b) 실행 계획의 이행을 위해 필요한 데이터 및/또는 계수, 및 (c) 지시 및/또는 데이터의 메모리 할당 중의 적어도 하나를 포함할 수 있다. The management unit may be configured to perform the management according to the execution plan. The execution plan may be generated by the management unit or may be generated by an entity located external to the database acceleration integrated circuit. The execution plan includes at least one of (a) instructions to be executed by the various components of the database acceleration integrated circuit, (b) data and/or coefficients required for implementation of the execution plan, and (c) memory allocation of instructions and/or data. may include
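A minimal data structure for such an execution plan might look like the sketch below. The class name `ExecutionPlan`, its field layout, the component names, and the overlap check are all hypothetical; the disclosure only states which kinds of content a plan may include.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ExecutionPlan:
    # (a) instructions per component of the database acceleration integrated circuit
    instructions: Dict[str, List[str]] = field(default_factory=dict)
    # (b) data and/or coefficients needed to carry out the plan
    data: Dict[str, float] = field(default_factory=dict)
    # (c) memory allocation: region name -> (offset, size), in arbitrary units
    memory_allocation: Dict[str, Tuple[int, int]] = field(default_factory=dict)


def allocations_overlap(plan: ExecutionPlan) -> bool:
    """Return True if any two allocated memory regions overlap."""
    regions = sorted(plan.memory_allocation.values())
    return any(prev_offset + prev_size > offset
               for (prev_offset, prev_size), (offset, _) in zip(regions, regions[1:]))


plan = ExecutionPlan(
    instructions={
        "first_processing_unit": ["decompress"],
        "database_acceleration_unit": ["filter", "sum"],
    },
    data={"threshold": 10.0},
    memory_allocation={"instructions": (0, 256), "coefficients": (256, 128)},
)
bad_plan = ExecutionPlan(memory_allocation={"a": (0, 256), "b": (128, 64)})
```

A management unit could run a check such as `allocations_overlap` before dispatching a plan, although the source does not describe any particular validation step.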

The management unit may be configured to perform the managing by allocating at least some of (a) network communication interface resources, (b) decompression unit resources, (c) memory controller resources, (d) multiple memory processing integrated circuit resources, and (e) database acceleration unit resources.

As shown in FIGS. 94E and 94G, the network communication interface may include different types of network communication ports.

The different types of network communication ports may include storage interface protocol ports (e.g., SATA ports, ATA ports, iSCSI ports, network file ports, Fibre Channel ports) and ports for storage interface protocols carried over general-purpose networks (e.g., ATA over Ethernet, Fibre Channel over Ethernet, NVMe, RoCE, and the like).

상이한 유형의 네트워크 통신 포트는 스토리지 인터페이스 프로토콜 포트 및 PCIe 포트를 포함할 수 있다. Different types of network communication ports may include storage interface protocol ports and PCIe ports.

FIG. 94F includes dashed lines that show the flow of the massive information, the first processed information, the retrieved information, and the database acceleration results. FIG. 94F shows database acceleration integrated circuit 11530 coupled to multiple memory resources 11550. Multiple memory resources 11550 need not belong to a memory processing integrated circuit.

데이터베이스 가속을 위한 장치(11520)는 데이터베이스 가속 집적회로(11530)가 다중 작업을 동시에 실행하도록 구성될 수 있다. 즉, 네트워크 통신 인터페이스(11531)가 정보의 다중 스트림을 (동시에) 수신할 수 있음에 따라, 제1 프로세싱 유닛(11532)은 다중 정보 단위에 동시에 1차 처리를 수행할 수 있고, 메모리 컨트롤러(11533)는 다중 1차 처리 정보 단위를 동시에 다중 메모리 프로세싱 집적회로(11551)로 전송할 수 있고, 데이터베이스 가속 유닛(11535)은 다중 검색된 정보 단위를 동시에 처리할 수 있다. The apparatus 11520 for database acceleration may be configured such that the database acceleration integrated circuit 11530 executes multiple tasks simultaneously. That is, as the network communication interface 11531 can receive (simultaneously) multiple streams of information, the first processing unit 11532 can perform primary processing on multiple information units simultaneously, and the memory controller 11533 ) may transmit multiple primary processing information units to multiple memory processing integrated circuits 11551 at the same time, and database acceleration unit 11535 may process multiple retrieved information units simultaneously.

데이터베이스 가속을 위한 장치(11520)는 방대한 컴퓨터 시스템의 컴퓨트 노드에 의해 데이터베이스 가속 집적회로로 전송된 실행 계획에 의거하여 검색, 1차 처리, 전송, 및 3차 처리 중의 적어도 하나를 실행하도록 구성될 수 있다. Apparatus for database acceleration 11520 may be configured to execute at least one of retrieval, primary processing, transmission, and tertiary processing based on an execution plan transmitted to the database acceleration integrated circuit by the compute node of the massive computer system. can

Device 11520 for database acceleration may be configured to manage at least one of the retrieving, first processing, sending, and third processing in a manner that substantially optimizes the utilization of the database acceleration integrated circuit. The optimization takes into account latency, throughput, and any other timing, storage, or processing considerations, and attempts to keep all components along the flow path busy, without idling and without bottlenecks.

데이터베이스 가속 집적회로는, 예컨대 네트워크 통신 인터페이스의 하나 이상의 제2 포트(11531(2))를 통해, 데이터베이스 가속 결과를 출력하도록 구성될 수 있다. The database acceleration integrated circuit may be configured to output a database acceleration result, eg, via the one or more second ports 11531(2) of the network communication interface.

데이터베이스 가속을 위한 장치(11520)는 네트워크 통신 네트워크 인터페이스에 의해 교환되는 트래픽의 대역폭을 실질적으로 최적화하도록 구성될 수 있다. Apparatus for database acceleration 11520 may be configured to substantially optimize bandwidth of traffic exchanged by a network communication network interface.

Device 11520 for database acceleration may be configured to substantially prevent bottlenecks in at least one of the retrieving, first processing, sending, and third processing, in a manner that substantially optimizes the utilization of the database acceleration integrated circuit.

데이터베이스 가속을 위한 장치(11520)는 시간 I/O 대역폭(temporal I/O bandwidth)에 따라 데이터베이스 가속 집적회로의 리소스를 할당하도록 구성될 수 있다. The apparatus for database acceleration 11520 may be configured to allocate resources of the database acceleration integrated circuit according to a temporal I/O bandwidth.

FIG. 94G shows device 11520 for database acceleration, which includes database acceleration integrated circuit 11530 and multiple memory processing integrated circuits 11551. FIG. 94G also shows various units coupled to database acceleration integrated circuit 11530, such as remote RAM 11546, Ethernet memory DIMMs 11547, storage system 11560, local storage unit 11561, and non-volatile memory (NVM) 11563 (the non-volatile memory may be an NVM Express (NVMe) unit).

데이터베이스 가속 집적회로(11530)는 이더넷 포트(11531(4)), RDMA 유닛(11545), 직렬 스케일업 포트(11531(5)), SATA 컨트롤러(11540), PCIe 포트(11531(9)), 제1 프로세싱 유닛(11532), 메모리 컨트롤러(11533), 데이터베이스 가속 유닛(11535), 인터커넥트(11536), 관리 유닛(11513), 암호화 연산을 실행하기 위한 암호화 엔진(11537), 및 레벨 2 SRAM(L2 SRAM)(11538)을 포함하는 것으로 도시되어 있다. The database acceleration integrated circuit 11530 includes an Ethernet port 11531(4), an RDMA unit 11545, a serial scale-up port 11531(5), a SATA controller 11540, a PCIe port 11531(9), a first 1 processing unit 11532 , memory controller 11533 , database acceleration unit 11535 , interconnect 11536 , management unit 11513 , encryption engine 11537 for executing cryptographic operations, and level 2 SRAM (L2 SRAM) ) 11538 .

데이터베이스 가속 유닛은 DMA 엔진(11549), 레벨 3(L3) 메모리(11548), 및 데이터베이스 가속기 서브유닛(11547)을 포함하는 것으로 도시되어 있다. 데이터베이스 가속기 서브유닛(11547)은 설정 가능한 유닛일 수 있다. The database acceleration unit is shown comprising a DMA engine 11549 , a level 3 (L3) memory 11548 , and a database accelerator subunit 11547 . The database accelerator subunit 11547 may be a configurable unit.

The Ethernet port 11531(4), the RDMA unit 11545, the serial scale-up port 11531(5), the SATA controller 11540, and the PCIe port 11531(9) may each be regarded as part of a network communication interface 11531.

The remote RAM 11546, the Ethernet memory DIMMs 11547, and the storage system 11560 are coupled to the Ethernet port 11531(4), which in turn is coupled to the RDMA unit 11545.

The local storage unit 11561 is coupled to the SATA controller 11540.

The PCIe port 11531(9) is coupled to the NVM 11563. The PCIe port may also be used to exchange commands, for example for management purposes.

FIG. 94H is an example of the database acceleration unit 11535.

The database acceleration unit 11535 may be configured so that database processing subunits 11573 execute database processing instructions concurrently, where the database acceleration unit may include a group of database accelerator subunits that share a shared memory unit 11575.

Different combinations of database acceleration units 11535 may be dynamically linked to one another (via configurable links or an interconnect 11576) to provide the execution pipeline required to execute a database processing operation that may include multiple instructions.

Each database processor subunit may be configured to execute a specific type of database processing instruction (e.g., filter, merge, accumulate).
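The dynamic linking of single-purpose subunits into an execution pipeline can be sketched in software as follows. This is an illustration only: the names `filter_rows`, `merge_rows`, `accumulate`, and `build_pipeline` are hypothetical, the sample rows are invented, and actual subunits would be hardware blocks joined through the configurable links rather than Python functions.

```python
from functools import reduce
from typing import Callable, Dict, List

Rows = List[Dict]
Stage = Callable[[Rows], Rows]

def filter_rows(predicate: Callable[[Dict], bool]) -> Stage:
    """Hypothetical subunit configured as a filter."""
    return lambda rows: [r for r in rows if predicate(r)]

def merge_rows(other: Rows, key: str) -> Stage:
    """Hypothetical subunit configured to merge incoming rows with `other` on `key`."""
    index: Dict = {}
    for o in other:
        index.setdefault(o[key], []).append(o)
    return lambda rows: [{**r, **o} for r in rows for o in index.get(r[key], [])]

def accumulate(field: str) -> Stage:
    """Hypothetical subunit configured to sum one field over all rows."""
    return lambda rows: [{"sum_" + field: sum(r[field] for r in rows)}]

def build_pipeline(*stages: Stage) -> Stage:
    """Link stages so that each stage's output feeds the next stage."""
    return lambda rows: reduce(lambda acc, stage: stage(acc), stages, rows)

# Invented sample data, used only to exercise the pipeline.
orders = [
    {"cust": 1, "amount": 30},
    {"cust": 2, "amount": 5},
    {"cust": 1, "amount": 12},
]
customers = [{"cust": 1, "region": "EU"}, {"cust": 2, "region": "US"}]

pipeline = build_pipeline(
    filter_rows(lambda r: r["amount"] >= 10),  # filter subunit
    merge_rows(customers, "cust"),             # merge subunit
    accumulate("amount"),                      # accumulate subunit
)
result = pipeline(orders)
```

Relinking the same stages in a different order, or omitting one, yields a different pipeline without changing any stage, which is the property the configurable interconnect is described as providing.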

FIG. 94H also shows a separate database processing unit 11572 coupled to a cache 11571. The database processing unit 11572 and the cache 11571 may be provided instead of, or in addition to, the reconfigurable DB accelerator array 11574.

The apparatus may facilitate scale-in and/or scale-out, allowing multiple database acceleration integrated circuits 11530 (and their associated memory resources 11550 or multiple memory processing integrated circuits 11551) to cooperate with one another, for example by participating in distributed processing of database operations.

FIG. 94I illustrates a modular unit, such as a blade 11580, that includes two database acceleration integrated circuits 11530 (and their associated memory resources 11550). A blade may include one, two, or three or more memory processing integrated circuits 11551 and their associated memory resources 11550.

The blade may also include one or more non-volatile memory units, an Ethernet switch, and a PCIe switch.

Multiple blades may communicate with one another using any communication method, communication protocol, and connectivity.

FIG. 94I shows four database acceleration integrated circuits 11530 (and their associated memory resources 11550) that are fully connected to one another: each database acceleration integrated circuit 11530 is connected to all three of the others. Such connectivity may be achieved using any communication protocol, for example an RDMA protocol over the Internet.

FIG. 94I also shows a database acceleration integrated circuit 11530 connected to its associated memory resources 11550 and to a unit 11531 that includes RAM and an Ethernet port.

FIGS. 94J, 94K, 94L, and 94M show four groups 11580 of database acceleration integrated circuits, each group including four database acceleration integrated circuits 11530 (fully connected to one another) and their associated memory resources 11550. The different groups are connected to one another through a switch 11590.

The number of groups may be two, three, four, or more. The number of database acceleration integrated circuits per group may be two, three, four, or more. The number of groups may be the same as (or different from) the number of database acceleration integrated circuits per group.

FIG. 94K shows two tables, A and B, that are too large (e.g., 1 TB) to be joined with each other efficiently all at once.

The tables are virtually partitioned into patches, and the join operation is applied to pairs, each consisting of a patch of table A and a patch of table B.

The groups of database acceleration integrated circuits may process the patches in various ways.

For example, the apparatus may be configured to perform distributed processing by:

g. Allocating different first data structure portions (patches of table A, e.g., the first patch A0 through the sixteenth patch A15) to different database acceleration integrated circuits of one or more groups.

h. Performing multiple iterations of (i) a new allocation of different second data structure portions (patches of table B, e.g., the first patch B0 through the sixteenth patch B15) to the different database acceleration integrated circuits of the one or more groups, and (ii) processing of the first data structure portions and the second data structure portions by the database acceleration integrated circuits.

The apparatus may be configured to execute the new allocation of the next iteration in a manner that at least partially overlaps in time with the processing of the current iteration.

The apparatus may be configured to execute the new allocation by exchanging second data structure portions between different database acceleration integrated circuits.

The exchange may be executed in a manner that at least partially overlaps in time with the processing.

The apparatus may be configured to execute the new allocation by exchanging second data structure portions between the different database acceleration integrated circuits of a group and, once no second data structure portions remain to be exchanged within the group, exchanging second data structure portions between database acceleration integrated circuits of different groups.

FIG. 94K shows four cycles of partial join computations. For example, for a database acceleration integrated circuit 11530 of the upper-left group, the four cycles include computing Join(A0, B0), Join(A0, B3), Join(A0, B2), and Join(A0, B1). During these four cycles, A0 remains in the same database acceleration integrated circuit 11530, while the patches of table B (B0, B1, B2, B3) are rotated among the members of the same group of database acceleration integrated circuits 11530.

In FIG. 94L, the patches of the second table are rotated between different groups: (a) patches B0, B1, B2, and B3 (previously processed by the upper-left group) are sent from the upper-left group to the lower-left group; (b) patches B4, B5, B6, and B7 (previously processed by the lower-left group) are sent from the lower-left group to the upper-right group; (c) patches B8, B9, B10, and B11 (previously processed by the upper-right group) are sent from the upper-right group to the lower-right group; and (d) patches B12, B13, B14, and B15 (previously processed by the lower-right group) are sent from the lower-right group to the upper-left group.
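The rotation scheme of FIGS. 94K and 94L can be simulated with a short script. The sketch below assumes four groups of four integrated circuits, numbers the groups 0 to 3 in the order upper-left, lower-left, upper-right, lower-right, and assumes each circuit hands its B patch to the next circuit in its group; the function name `rotation_schedule` and these orderings are illustrative assumptions, not taken from the figures.

```python
GROUPS, PER_GROUP = 4, 4  # assumed sizes, matching the four-by-four example

def rotation_schedule():
    """Yield (cycle, a_patch, b_patch) for every partial join performed.

    Circuit k of group g permanently holds A patch g * PER_GROUP + k.
    """
    # b_held[g][k] is the index of the B patch currently held by circuit k of group g.
    b_held = [[g * PER_GROUP + k for k in range(PER_GROUP)] for g in range(GROUPS)]
    cycle = 0
    for _round in range(GROUPS):
        for _step in range(PER_GROUP):
            for g in range(GROUPS):
                for k in range(PER_GROUP):
                    yield cycle, g * PER_GROUP + k, b_held[g][k]
            cycle += 1
            # Intra-group rotation: circuit k receives the patch of circuit k - 1.
            # After PER_GROUP rotations each group is back to its starting layout.
            b_held = [[row[(k - 1) % PER_GROUP] for k in range(PER_GROUP)]
                      for row in b_held]
        # Inter-group rotation: group g receives the patches of group g - 1.
        b_held = [b_held[(g - 1) % GROUPS] for g in range(GROUPS)]

joins = list(rotation_schedule())
# B patches seen by the circuit holding A0 during the first four cycles.
first_four_for_a0 = [b for _, a, b in joins if a == 0][:4]
# Every (A patch, B patch) pair should be covered exactly once.
covered_pairs = {(a, b) for _, a, b in joins}
```

Under these assumptions the circuit holding A0 sees B0, B3, B2, and B1 in its first four cycles, as in the text, and sixteen cycles cover all 256 patch pairs while every A patch stays in place.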

FIG. 94N is an example of a system that includes multiple blades 11580, a SATA controller 11540, a local storage unit 11561, an NVME 11563, a PCIe switch 11601, Ethernet memory DIMMs 11547, and an Ethernet port 11531(4).

The blades 11580 may be coupled to each of the PCIe switch 11601, the Ethernet port 11531, and the SATA controller 11540.

FIG. 94O shows two systems, 11621 and 11622.

System 11621 may include one or more apparatuses 11520 for database acceleration, a switching system 11611, a storage system 11612, and a compute system 11613. The switching system 11611 provides connectivity among the one or more apparatuses 11520 for database acceleration, the storage system 11612, and the compute system 11613.

System 11622 may include one or more apparatuses 11615 for storage and database acceleration, a switching system 11611, and a compute system 11613. The switching system 11611 provides connectivity between the one or more apparatuses 11615 for storage and database acceleration and the compute system 11613.

FIG. 95A illustrates a method 11200 for database acceleration.

Method 11200 may begin with step 11210, in which a network communication interface of a database acceleration integrated circuit retrieves a massive amount of information from a massive number of storage units.

Connecting to a massive number of storage units (for example, over multiple different buses) allows the network communication interface to receive a massive amount of information even when the throughput of any single storage unit is limited.
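As a rough illustration of why fanning out to many storage units helps, assume (hypothetically; these symbols do not appear in the disclosure) that storage unit $i$ sustains a throughput $b_i$ and that the network communication interface can absorb at most $B_{\text{if}}$. The aggregate rate available to the interface is then bounded by:

```latex
B_{\text{total}} \;\le\; \min\!\Bigl( B_{\text{if}},\; \sum_{i=1}^{N} b_i \Bigr)
```

With illustrative numbers only, $N = 64$ units at $b_i = 0.5$ GB/s each could supply up to 32 GB/s, provided the interface itself is not the limiting term.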

Step 11210 may be followed by primary processing of the massive amount of information to provide primary-processed information. The primary processing may include buffering, extracting information from payloads, removing headers, decompressing, compressing, decrypting, filtering database queries, or performing any other processing operation. The primary processing may also be limited to buffering.

Step 11210 may be followed by step 11220, in which a memory controller of the database acceleration integrated circuit sends the primary-processed information over a massive-throughput interface to multiple memory processing integrated circuits, where each memory processing integrated circuit may include a controller, multiple processor subunits, and multiple memory units. A memory processing integrated circuit may be a memory/processing unit, a distributed processor, or a memory chip as illustrated elsewhere in this application.

Step 11220 may be followed by step 11230, in which the multiple memory processing integrated circuits perform secondary processing of at least part of the primary-processed information to provide secondary-processed information.

Step 11230 may include concurrent execution of multiple tasks by the database acceleration integrated circuit.

Step 11230 may include concurrent execution of database processing instructions by data processing subunits, where the database acceleration unit may include a group of database accelerator subunits that share a shared memory unit.

Step 11230 may be followed by step 11240, in which the memory controller of the database acceleration integrated circuit retrieves retrieved information from the multiple memory processing integrated circuits, where the retrieved information may include at least one of (a) at least part of the primary-processed information and (b) at least part of the secondary-processed information.

Step 11240 may be followed by step 11250, in which the database acceleration unit of the database acceleration integrated circuit performs database processing operations on the retrieved information to provide a database acceleration result.

Step 11250 may include allocating resources of the database acceleration integrated circuit according to temporal I/O bandwidth.

Step 11250 may be followed by step 11260 of outputting the database acceleration result.

Step 11260 may include dynamically linking database processing subunits to provide the execution pipeline required to execute a database processing operation that may include multiple instructions.

Step 11260 may include outputting the database acceleration result to local storage and retrieving the database acceleration result from the local storage.

Steps 11210, 11220, 11230, 11240, 11250, and 11260, and any other step of method 11200, may be executed in a pipelined manner. These steps may be executed concurrently or in an order different from the one described above.

For example, step 11250 may follow step 1120, so that the primary-processed information is further processed by the database acceleration unit.

As another example, the primary-processed information may be sent to the multiple memory processing integrated circuits and then forwarded to the database acceleration unit without being processed by the multiple memory processing integrated circuits.

As yet another example, the primary-processed information and/or the secondary-processed information may be output from the database acceleration integrated circuit without undergoing database processing in the database acceleration unit.
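The pipelined flow and its optional stages can be sketched with chained generators, so that each chunk moves through the stages as soon as it is retrieved. Everything below is a toy model under stated assumptions: the function names, the one-element "header", the negative-value filter, and the per-chunk sum are invented stand-ins for the retrieval, primary processing, secondary processing, and database processing described above.

```python
from typing import Iterable, Iterator, List

Chunk = List[int]

def retrieve(storage_units: List[List[Chunk]]) -> Iterator[Chunk]:
    """Retrieval (sketch): pull chunks from many storage units, round-robin."""
    iters = [iter(unit) for unit in storage_units]
    while iters:
        for it in list(iters):  # iterate over a copy so removal is safe
            try:
                yield next(it)
            except StopIteration:
                iters.remove(it)

def primary_process(chunks: Iterable[Chunk]) -> Iterator[Chunk]:
    """Primary processing (sketch): strip an assumed one-element header."""
    for chunk in chunks:
        yield chunk[1:]

def secondary_process(chunks: Iterable[Chunk]) -> Iterator[Chunk]:
    """Secondary processing (sketch): memory-side filter dropping negatives."""
    for chunk in chunks:
        yield [v for v in chunk if v >= 0]

def db_process(chunks: Iterable[Chunk]) -> Iterator[int]:
    """Database processing (sketch): reduce each chunk to its sum."""
    for chunk in chunks:
        yield sum(chunk)

def run(storage_units, use_secondary=True, use_db=True):
    """Chain the stages; either optional stage can be bypassed."""
    stream = primary_process(retrieve(storage_units))
    if use_secondary:
        stream = secondary_process(stream)
    if use_db:
        stream = db_process(stream)
    return list(stream)

# Two hypothetical storage units; 9 plays the role of a header value.
units = [[[9, 1, -2], [9, 3, 4]], [[9, -5, 6]]]
full = run(units)
bypass_memory_processing = run(units, use_secondary=False)
bypass_db_unit = run(units, use_db=False)
```

Passing `use_secondary=False` models information that reaches the database acceleration unit without memory-side processing, and `use_db=False` models information output without database processing.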

The method may include executing at least one of the retrieval, primary processing, transmission, and tertiary processing based on an execution plan sent to the database acceleration integrated circuit by a compute node of a massive compute system.

The method may include managing at least one of the retrieval, primary processing, transmission, and tertiary processing in a manner that substantially optimizes utilization of the database acceleration integrated circuit.

The method may include substantially optimizing the bandwidth of traffic exchanged by the network communication interface.

The method may include substantially preventing bottlenecks in at least one of the retrieval, primary processing, transmission, and tertiary processing, in a manner that substantially optimizes utilization of the database acceleration integrated circuit.

Method 11200 may also include at least one of the following steps.

Step 11270 may include managing, by a management unit of the database acceleration integrated circuit, at least one of the retrieval, primary processing, transmission, and tertiary processing.

The managing may be executed based on an execution plan generated by the management unit of the database acceleration integrated circuit.

The managing may be executed based on an execution plan that is received by the database acceleration integrated circuit rather than generated within it.

The managing may include allocating at least some of (a) network communication interface resources, (b) decompression unit resources, (c) memory controller resources, (d) multiple memory processing integrated circuit resources, and (e) database acceleration unit resources.

Step 11271 may include controlling, by a compute node of a massive compute system, at least one of the retrieval, primary processing, transmission, and tertiary processing.

Step 11272 may include managing, by a management unit located outside the database acceleration integrated circuit, at least one of the retrieval, primary processing, transmission, and tertiary processing.

FIG. 95B illustrates a method 11300 for operating a group of database acceleration integrated circuits.

Method 11300 may begin with step 11310, in which the database acceleration integrated circuits perform database acceleration operations. Step 11310 may include executing one or more steps of method 11200.

Method 11300 may also include step 11320 of exchanging at least one of (a) information and (b) database acceleration results between database acceleration integrated circuits of one or more groups of database acceleration integrated circuits.

The combination of steps 11310 and 11320 may amount to executing distributed processing by the database acceleration integrated circuits of the one or more groups.

The exchange may be executed using the network communication interfaces of the database acceleration integrated circuits of the one or more groups.

The exchange may be executed across multiple groups that may be connected to one another by a star connection.

Step 11320 may include using at least one switch to exchange at least one of (a) information and (b) database acceleration results between database acceleration integrated circuits of different groups of the one or more groups.

Step 11310 may include step 11311, in which some of the database acceleration integrated circuits of some of the one or more groups execute distributed processing.

Step 11311 may include performing distributed processing of a first data structure and a second data structure, where the aggregate size of the first data structure and the second data structure exceeds the storage capacity of the multiple memory processing integrated circuits.

Performing the distributed processing may include performing multiple iterations of (a) a new allocation of different pairs of first data structure portions and second data structure portions to different database acceleration integrated circuits, and (b) processing of the different pairs.

Performing the distributed processing may include executing a database join operation.

Step 11310 may include (a) step 11312 of allocating different first data structure portions to different database acceleration integrated circuits of the one or more groups, and (b) performing multiple iterations of step 11314, in which different second data structure portions are newly allocated to the different database acceleration integrated circuits of the one or more groups, and step 11316, in which the database acceleration integrated circuits process the first data structure portions and the second data structure portions.

Step 11314 may be executed in a manner that at least partially overlaps in time with the processing of the current iteration.

Step 11314 may include exchanging second data structure portions between different database acceleration integrated circuits.

Step 11320 may be executed in a manner that at least partially overlaps in time with step 11310.

Step 11314 may include exchanging second data structure portions between the different database acceleration integrated circuits of a group and, once no second data structure portions remain to be exchanged within the group, exchanging second data structure portions between database acceleration integrated circuits of different groups.
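The time overlap between the new allocation (step 11314) and the processing of the current iteration (step 11316) resembles double buffering. The sketch below uses a background thread to prepare the next allocation while the current one is processed. It is only a software analogy under stated assumptions: the helper names `exchange`, `process`, and `run_iterations`, the set-intersection "join", and the requirement that there are as many second portions as first portions are all illustrative choices.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

def exchange(b_parts: List[list]) -> List[list]:
    """Step 11314 (sketch): rotate the second-structure portions by one position."""
    return b_parts[-1:] + b_parts[:-1]

def process(a_part: list, b_part: list) -> int:
    """Step 11316 (sketch): a toy join that counts values present in both portions."""
    return len(set(a_part) & set(b_part))

def run_iterations(a_parts: List[list], b_parts: List[list]) -> List[int]:
    """Each circuit i keeps a_parts[i]; the b portions rotate once per iteration.

    Assumes len(a_parts) == len(b_parts).
    """
    totals = [0] * len(a_parts)
    current = list(b_parts)
    with ThreadPoolExecutor(max_workers=1) as pool:
        for _ in range(len(b_parts)):
            # Start preparing the next allocation in the background.
            # (The exchange submitted in the last iteration is simply unused.)
            pending = pool.submit(exchange, current)
            # Meanwhile, process the current allocation; `current` is only read.
            for i, a_part in enumerate(a_parts):
                totals[i] += process(a_part, current[i])
            current = pending.result()
    return totals

# Invented portions, used only to exercise the loop.
matches = run_iterations([[1, 2], [3, 4]], [[2, 3], [4, 5]])
```

Because `exchange` returns a new list and never mutates `current`, the background rotation and the foreground processing can safely read the same data at the same time.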

FIG. 95C illustrates a method 11350 for database acceleration.

Method 11350 may begin with step 11352, in which a network communication interface of a database acceleration integrated circuit retrieves a massive amount of information from a massive number of storage units.

Step 11352 may be followed by step 11354 of primary processing of the massive amount of information to provide primary-processed information.

Step 11352 may be followed by step 11354, in which a memory controller of the database acceleration integrated circuit sends the primary-processed information over a massive-throughput interface to multiple memory resources.

Step 11354 may be followed by step 11356 of retrieving retrieved information from the multiple memory resources.

Step 11356 may be followed by step 11358, in which the database acceleration unit of the database acceleration integrated circuit performs database processing operations on the retrieved information to provide a database acceleration result.

Step 11358 may be followed by step 11359 of outputting the database acceleration result.

The method may also include step 11355 of secondary processing of the primary-processed information to provide secondary-processed information. The secondary processing may be executed by multiple processors located within one or more memory processing integrated circuits that also include the multiple memory resources. Steps 11354 and 11356 are executed after step 11355.

The aggregate size of the secondary-processed information may be smaller than that of the primary-processed information.

The aggregate size of the primary-processed information may be smaller than that of the massive amount of information.

The primary processing may include filtering database entries. Filtering out database entries that are irrelevant to a query before further processing, and/or before the irrelevant entries are stored in the multiple memory resources, saves bandwidth, storage resources, and processing resources.

The secondary processing may include filtering database entries. Such filtering may be applied when the filtering condition is complex (for example, includes multiple conditions) and multiple database entry fields must be received before the filtering can be performed. An example is a search for (a) people who are at or above a certain age and like bananas and (b) people who are at or above another age and like apples.
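The division of labor between a cheap primary filter and a compound secondary filter can be sketched as follows. The `Person` record, the thresholds, and the sample entries are hypothetical; the point is only that the primary filter needs a single field, whereas the secondary filter needs several fields of each entry before it can decide.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Person:
    name: str
    age: int
    likes: str

def primary_filter(entries: List[Person], min_age: int) -> List[Person]:
    """Single-field filter, cheap enough to apply as entries stream in."""
    return [e for e in entries if e.age >= min_age]

def secondary_filter(entries: List[Person], banana_age: int, apple_age: int) -> List[Person]:
    """Compound filter that needs both the age and the preference of each entry."""
    return [
        e for e in entries
        if (e.age >= banana_age and e.likes == "banana")
        or (e.age >= apple_age and e.likes == "apple")
    ]

# Invented entries and thresholds, for illustration only.
people = [
    Person("Ana", 25, "banana"),
    Person("Ben", 35, "apple"),
    Person("Cy", 45, "apple"),
    Person("Di", 15, "banana"),
    Person("Ed", 50, "cherry"),
]
BANANA_AGE, APPLE_AGE = 20, 40

# The primary stage can safely drop anyone below the lower of the two ages.
survivors = primary_filter(people, min(BANANA_AGE, APPLE_AGE))
matches = secondary_filter(survivors, BANANA_AGE, APPLE_AGE)
names = [p.name for p in matches]
```

In this toy run the primary stage discards one entry before it would be stored, and the secondary stage resolves the compound condition on the remainder.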

Database

The following examples relate to databases. A database may be a data center, may be part of a data center, or may not belong to a data center.

A database may be coupled to multiple users over one or more networks. The database may be a cloud database.

A database may be provided that includes one or more management units and multiple database accelerator boards, the boards including one or more memory/processing units.

FIG. 96B illustrates a database 12020 that includes a management unit 12021 and multiple DB accelerator boards 12022, each of which includes a communication/management processor (processor 12024) and multiple memory/processing units 12026.

The processor 12024 may support various communication protocols, such as PCIe and RoCE-like protocols.

Database commands may be executed by the memory/processing units 12026, and the processor may route traffic between different memory/processing units 12026, between different DB accelerator boards 12022, and to and from the management unit 12021.

Using multiple memory/processing units 12026, particularly ones that include many internal memory banks, can dramatically accelerate the execution of database commands and avoid communication bottlenecks.

FIG. 96C illustrates a DB accelerator board 12022 that includes a processor 12024 and multiple memory/processing units 12026. The processor 12024 includes multiple dedicated communication components, such as a DDR controller 12033 for communicating with the memory/processing units 12026, an RDMA engine 12031, and a DB query database engine 12034. The DDR controller is an example of a communication controller, and the RDMA engine is an example of a communication engine.

A method may be provided for operating the system of FIGS. 96B, 96C, and 96D (or any part of the system).

It should be noted that the database acceleration integrated circuit 11530 may be associated with multiple memory resources that are not included in multiple memory processing integrated circuits, or that are otherwise not associated with processing units. In this case, the processing is performed primarily, or even exclusively, by the database acceleration integrated circuit.

FIG. 94P illustrates a method 11700 for database acceleration.

Method 11700 may include step 11710, in which a network communication interface of a database acceleration integrated circuit retrieves information from a storage unit.

Step 11710 may be followed by step 11720 of first processing the amount of information to provide first processed information.

Step 11720 may be followed by step 11730, in which a memory controller of the database acceleration integrated circuit sends the first processed information over a throughput interface to multiple memory resources.

Step 11730 may be followed by step 11740 of retrieving information from the multiple memory resources.

Step 11740 may be followed by step 11750, in which a database acceleration unit of the database acceleration integrated circuit performs a database processing operation on the retrieved information to provide a database acceleration result.

Step 11750 may be followed by step 11760 of outputting the database acceleration result.

The first processing and/or the second processing may include filtering database entries to determine which database entries should be further processed.

The second processing includes filtering database entries.
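By way of a non-limiting illustration, the sequence of steps 11710 through 11760 may be sketched in software as follows. The function name `method_11700`, the modelling of a database entry as a Python dictionary, and the round-robin placement of entries across memory resources are assumptions of this sketch and are not taken from the disclosure; the actual steps are performed by hardware components of the database acceleration integrated circuit.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, int]  # hypothetical model of one database entry


def method_11700(storage_units: Iterable[List[Row]],
                 first_filter: Callable[[Row], bool],
                 db_operation: Callable[[List[Row]], object],
                 num_memory_resources: int = 4) -> object:
    # Step 11710: retrieve information from the storage units.
    retrieved = [row for unit in storage_units for row in unit]

    # Step 11720: first processing; here, a filter decides which entries continue.
    first_processed = [row for row in retrieved if first_filter(row)]

    # Step 11730: send the first processed information to multiple memory
    # resources (round-robin placement is an assumption of this sketch).
    memory_resources: List[List[Row]] = [[] for _ in range(num_memory_resources)]
    for i, row in enumerate(first_processed):
        memory_resources[i % num_memory_resources].append(row)

    # Step 11740: retrieve the information from the memory resources.
    fetched = [row for resource in memory_resources for row in resource]

    # Step 11750: perform the database processing operation.
    result = db_operation(fetched)

    # Step 11760: output the database acceleration result.
    return result
```

In this sketch the filter of step 11720 reduces the number of entries that reach the memory resources, which is the purpose the description attributes to filtering.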

Hybrid System

Memory/processing units can be highly effective when executing calculations that are memory intensive and/or when the bottleneck is related to retrieval operations. Processor units that are more processing oriented (and less memory oriented), such as GPUs and CPUs, can be more effective when the bottleneck is related to computing operations.

A hybrid system may include both one or more processor units and one or more memory/processing units, which may be fully or partially connected to each other.

A memory/processing unit (MPU) may be manufactured by a first manufacturing process that is better suited to memory cells than to logic cells. For example, the critical dimension of memory cells manufactured by the first manufacturing process may be smaller, and even much smaller (for example, by a factor that exceeds 2, 3, 4, 5, 6, 7, 8, 9, 10, and the like), than the critical dimension of logic cells manufactured by the first manufacturing process. For example, the first manufacturing process may be an analog manufacturing process, a DRAM manufacturing process, and the like.

The processor may be manufactured by a second manufacturing process that is better suited to logic. For example, the critical dimension of logic circuits manufactured by the second manufacturing process may be smaller, and even much smaller, than the critical dimension of logic circuits manufactured by the first manufacturing process. As another example, the critical dimension of logic circuits manufactured by the second manufacturing process may be smaller, and even much smaller, than the critical dimension of memory cells manufactured by the first manufacturing process. For example, the second manufacturing process may be a digital manufacturing process, a CMOS manufacturing process, and the like.

Tasks may be allocated among the different units in a static or dynamic manner, taking into account the benefits of each unit and the penalties associated with transferring data between the units.

For example, a memory-intensive process may be allocated to a memory/processing unit, and a processing-intensive process that makes light use of memory may be allocated to a processing unit.
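One possible way to weigh the benefit of each unit against the data transfer penalty is sketched below. The cost model, the class `Task`, the function `assign_unit`, and all default throughput figures are hypothetical and chosen only to make the trade-off concrete; the disclosure does not specify an allocation formula.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    bytes_accessed: int  # data the task must read or write
    operations: int      # arithmetic operations the task performs


def assign_unit(task: Task,
                mpu_ops_per_s: float = 1e9,
                cpu_ops_per_s: float = 1e11,
                link_bytes_per_s: float = 1e10) -> str:
    """Return "MPU" or "processor", whichever has the lower estimated time."""
    # On the MPU the data is assumed to be local, so only compute time counts.
    mpu_time = task.operations / mpu_ops_per_s
    # On the processor, add the penalty of moving the data between the units.
    cpu_time = (task.operations / cpu_ops_per_s
                + task.bytes_accessed / link_bytes_per_s)
    return "MPU" if mpu_time <= cpu_time else "processor"
```

Under these assumed figures, a scan that touches a large amount of data with few operations is placed on the MPU, while a compute-heavy task over a small data set is placed on the processor. A dynamic allocator could re-evaluate the same estimate at run time with measured rates.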

The processor may request or instruct one or more memory/processing units to perform various processing tasks. Executing these processing tasks may reduce the load on the processor, reduce latency, and in some cases reduce the overall bandwidth of information exchanged between the one or more memory/processing units and the processor, and the like.

The processor may provide instructions and/or requests of different granularity. For example, the processor may send instructions aimed at specific processing resources, or may send higher-level instructions aimed at a memory/processing unit without specifying any processing resource.

FIG. 96D is an example of a hybrid system 12040 that includes one or more memory/processing units (MPUs) 12043 and a processor 12042. As shown, the processor 12042 may send requests or instructions to the one or more MPUs 12043, which in turn perform (or selectively perform) the requests and/or instructions and send the results to the processor 12042.

The processor 12042 may further process the results to provide one or more outputs.
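The request and result flow of FIG. 96D, together with the two instruction granularities mentioned above, may be illustrated by the following minimal sketch. The class `MPU`, its `execute` method, and the function `processor_sum` are hypothetical names introduced for illustration; each bank is assumed to have its own processing resource, and a higher-level instruction is assumed to run on all banks.

```python
from typing import Callable, List, Optional


class MPU:
    """Toy memory/processing unit: each bank has its own processing resource."""

    def __init__(self, banks: List[List[int]]):
        self.banks = banks

    def execute(self, op: Callable[[List[int]], int],
                resource: Optional[int] = None) -> List[int]:
        # Fine-grained instruction: run on one named processing resource.
        if resource is not None:
            return [op(self.banks[resource])]
        # Higher-level instruction: the MPU picks the resources itself
        # (here, every bank).
        return [op(bank) for bank in self.banks]


def processor_sum(mpus: List[MPU]) -> int:
    # The processor sends the request to each MPU, receives only partial
    # results, and then further processes them into one output.
    partials = [p for mpu in mpus for p in mpu.execute(sum)]
    return sum(partials)
```

Only one value per bank crosses from an MPU to the processor in this sketch, rather than the bank contents, which reflects the bandwidth reduction described above.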

Each MPU includes memory resources, processing resources (for example, compact microcontrollers 12044), and cache memories 12049. The microcontrollers may have limited computational capability (for example, they may consist mainly of multiply-accumulate units).

The microcontroller 12044 may apply a process for in-memory acceleration purposes, and may be a CPU, a full DB processing engine, or a subset thereof.

The MPU 12043 may include packet processing units and microprocessors that may be connected in a mesh, ring, or other topology for fast communication between banks.

There may be more than one DDR controller for fast communication between DIMMs.

The goal of the in-memory packet processors is to reduce bandwidth, data movement, and power consumption, and to improve performance. Using them can dramatically improve performance/TCO compared with standard solutions.

It should be noted that the management unit is optional.

Each MPU may operate as an AI memory/processing unit: it may perform artificial intelligence (AI) calculations and pass only the results to the processor, which reduces the amount of traffic. This applies especially when the MPU receives and stores neural network coefficients to be used in multiple calculations, so that the coefficients need not be received from an external chip each time a portion of the neural network is used to process new data.

The MPU may determine when a coefficient is zero and inform the processor that there is no need to perform multiplications that involve the zero-valued coefficient.
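As a hedged illustration of the zero-coefficient behavior, the sketch below keeps coefficients resident in a toy MPU and returns a zero flag with each read so that the caller can skip the multiplication. The class `CoefficientStore`, its `read` method, and the function `dot` are hypothetical and are not part of the disclosure.

```python
from typing import List, Tuple


class CoefficientStore:
    """Toy MPU that keeps neural network coefficients resident across inputs."""

    def __init__(self, coefficients: List[float]):
        # Received once and then reused for many inputs.
        self.coefficients = coefficients

    def read(self, index: int) -> Tuple[bool, float]:
        """Return (is_zero, value); the flag tells the caller to skip the multiply."""
        value = self.coefficients[index]
        return (value == 0.0, value)


def dot(store: CoefficientStore, inputs: List[float]) -> Tuple[float, int]:
    """Dot product that skips zero-valued coefficients.

    Returns the result and the number of multiplications actually performed.
    """
    total, multiplies = 0.0, 0
    for i, x in enumerate(inputs):
        is_zero, w = store.read(i)
        if is_zero:
            continue  # no multiplication is needed for a zero coefficient
        total += w * x
        multiplies += 1
    return total, multiplies
```

With sparse coefficient sets, the number of multiplications performed by the processor falls in proportion to the number of zero coefficients flagged by the MPU.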

It should be noted that the first processing and the second processing may include filtering database entries.

The MPU may be any of the memory processing units illustrated in this specification, in PCT Application Publication WO2019025862, and in PCT Patent Application No. PCT/IB2019/001005.

An AI computing system (and a method executable by the system) may be provided in which a network interface card has AI processing capabilities and is configured to perform some AI processing tasks in order to reduce the amount of traffic to be sent over a network that couples multiple AI acceleration servers.

For example, in some inference systems the input arrives from the network (for example, multiple streams of IP cameras connected to an AI server). In such cases, leveraging RDMA + AI on the processing and networking unit can reduce the load on the CPU and the PCIe bus, and the processing can be provided on the processing and networking unit rather than by a GPU that is not included in the processing and networking unit.

For example, instead of calculating initial results and sending them to target AI acceleration servers (servers that apply one or more AI processing operations), the processing and networking unit may perform preprocessing that reduces the number of values sent to the target AI acceleration servers. The target AI computing server is an AI computing server allocated to perform calculations on values provided by other AI acceleration servers. This reduces the bandwidth of the traffic exchanged between the AI acceleration servers and also reduces the load on the target AI acceleration servers.

The target AI acceleration servers may be allocated in a dynamic or static manner, using load balancing or other allocation algorithms. There may be more than a single target AI acceleration server.

For example, if the target AI acceleration server is to add multiple losses, the processing and networking unit may add the losses generated by its AI acceleration server and send the sum of the losses to the target AI acceleration server, thereby reducing bandwidth. The same benefit may be obtained when performing other preprocessing operations, such as derivative calculation, aggregation, and the like.
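The loss-summing example may be sketched as follows. The function names `local_unit_preprocess` and `target_total` are assumptions of this sketch; the point illustrated is only that each server sends one value (its local sum) instead of every individual loss.

```python
from typing import Dict, List


def local_unit_preprocess(local_losses: List[float]) -> float:
    """Run on a processing and networking unit: sum the losses of its own server."""
    return sum(local_losses)


def target_total(partial_sums: List[float]) -> float:
    """Run on the target AI acceleration server: combine the received sums."""
    return sum(partial_sums)


def values_sent(losses_by_server: Dict[str, List[float]], preprocess: bool) -> int:
    """Count how many values cross the network with and without preprocessing."""
    if preprocess:
        return len(losses_by_server)  # one partial sum per server
    return sum(len(losses) for losses in losses_by_server.values())
```

Because addition is associative, the target server obtains the same total from the partial sums as it would from the individual losses, up to floating-point rounding order.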

FIG. 97B shows a system 12060 that includes subsystems, each of which includes a switch 12061 for connecting AI processing and networking units 12063 having server motherboards 12064 to one another. The server motherboard includes one or more AI processing and networking units 12063 that have network capabilities and AI processing capabilities. The AI processing and networking unit 12063 may include one or more NICs, and an ALU or other calculation circuits for performing the preprocessing.

The AI processing and networking unit 12063 may be a chip or may include more than a single chip. It may be beneficial for the AI processing and networking unit 12063 to be a single chip.

The AI processing and networking unit 12063 may include (solely or mainly) processing resources. The AI processing and networking unit 12063 may include in-memory computing circuits, may not include in-memory computing circuits, or may not include significant in-memory computing circuits.

The AI processing and networking unit 12063 may be an integrated circuit, may include more than one integrated circuit, may be a part of an integrated circuit, and the like.

The AI processing and networking unit 12063 may convey traffic (for example, using DDR channels, network channels, and/or PCIe channels) between the AI acceleration server that includes the AI processing and networking unit 12063 and other AI acceleration servers (see FIG. 97C). The AI processing and networking unit 12063 may also be coupled to external memories such as DDR memories. The processing and networking unit may include memories and/or may include memory/processing units.

In FIG. 97C, the AI processing and networking unit 12063 is shown as including local DDR connections, DDR channels, AI accelerators, RAM memory, encryption/decryption engines, PCIe switches, PCIe interfaces, a multi-core processing array, high-speed networking connections, and the like.

A method for operating the system of any of FIGS. 97B and 97C (or any portion of the system) may be provided.

Any combination of any steps of any of the methods mentioned in this application may be provided.

Any combination of any units, integrated circuits, memory resources, logic, processing subunits, controllers, and components mentioned in this application may be provided.

Any reference to "including" and/or "comprising" may be applied mutatis mutandis to "consisting of" and "substantially consisting of".

The foregoing description has been presented for purposes of illustration. It is not exhaustive and is not limited to the precise forms or embodiments disclosed. Modifications and adaptations will be apparent to those skilled in the art from consideration of the specification and practice of the disclosed embodiments. Additionally, although aspects of the disclosed embodiments are described as being stored in memory, one skilled in the art will appreciate that these aspects can also be stored on other types of computer readable media, such as secondary storage devices, for example, hard disks or CD ROM, or other forms of RAM or ROM, USB media, DVD, Blu-ray, UHD Blu-ray, or other optical drive media.

Computer programs based on the written description and disclosed methods are within the skill of an experienced developer. The various programs or program modules can be created using any of the techniques known to one skilled in the art or can be designed in connection with existing software. For example, program sections or program modules can be designed in or by means of .Net Framework, .Net Compact Framework (and related languages, such as Visual Basic, C, etc.), Java, C++, Objective-C, HTML, HTML/AJAX combinations, XML, or HTML with included Java applets.

Moreover, while illustrative embodiments have been described herein, the scope of any and all embodiments includes equivalent elements, modifications, omissions, combinations (for example, of aspects across various embodiments), adaptations, and/or alterations, as would be appreciated by those skilled in the art based on the present disclosure. The limitations in the claims are to be interpreted broadly based on the language employed in the claims and are not limited to examples described in the present specification or during the prosecution of the application. The examples are to be construed as non-exclusive. Furthermore, the steps of the disclosed methods may be modified in any manner, including by reordering steps and/or inserting or deleting steps. It is intended, therefore, that the specification and examples be considered as illustrative only, with the true scope and spirit being indicated by the following claims and their full scope of equivalents.

Claims

An integrated circuit, comprising:
a substrate;
a memory array disposed on the substrate and comprising a plurality of discrete memory banks;
a processing array disposed on the substrate and comprising a plurality of processor subunits, each associated with one or more of the plurality of discrete memory banks; and
a controller,
wherein the controller is configured to implement at least one security measure for operation of the integrated circuit.

The integrated circuit of claim 1,
wherein the controller is configured to take one or more remedial actions when the at least one security measure is triggered.

The integrated circuit of claim 1,
wherein the controller is configured to implement at least one security measure on at least one memory location.

The integrated circuit of claim 2,
wherein the data includes weight data of a neural network model.

The integrated circuit of claim 1,
wherein the controller is configured to implement at least one security measure comprising blocking access to one or more memory portions of the memory array that are not used for input data or output data operations.

The integrated circuit of claim 1,
wherein the controller is configured to implement at least one security measure comprising blocking only a subset of the memory array.

The integrated circuit of claim 6,
wherein the subset of the memory array is designated by a particular memory address.

The integrated circuit of claim 6,
wherein the subset of the memory array is configurable.

The integrated circuit of claim 1,
wherein the controller is configured to implement at least one security measure comprising controlling traffic to or from the integrated circuit.

The integrated circuit of claim 1,
wherein the controller is configured to implement at least one security measure comprising uploading of mutable data, code, or fixed data.

The integrated circuit of claim 1,
wherein the uploading of the mutable data, code, or fixed data occurs during a boot process.

The integrated circuit of claim 1,
wherein the controller is configured to implement at least one security measure comprising uploading, during a boot process, a configuration file identifying specific memory addresses for at least a portion of the memory array to be blocked upon completion of the boot process.

The integrated circuit of claim 1,
wherein the controller is further configured to request a password to unblock access to a memory portion of the memory array associated with one or more memory addresses.

The integrated circuit of claim 1,
wherein the at least one security measure is triggered when an attempt to access at least one blocked memory address is detected.

The integrated circuit of claim 1,
wherein the controller is configured to calculate a checksum, a hash, a cyclic redundancy check (CRC), or a parity for at least a portion of the memory array, and to compare the calculated checksum, hash, CRC, or parity to a predetermined value.

The integrated circuit of claim 15,
wherein the controller is configured to determine, as part of the at least one security measure, whether the calculated checksum, hash, CRC, or parity matches the predetermined value.

The integrated circuit of claim 1,
wherein the at least one security measure comprises copying program code in at least two different memory portions.

The integrated circuit of claim 17,
wherein the at least one security measure comprises determining whether output results from execution of the program code in the at least two different memory portions differ from one another.

The integrated circuit of claim 18,
wherein the output results comprise intermediate output results or final output results.

The integrated circuit of claim 17,
wherein the at least two different memory portions are included within the integrated circuit.

The integrated circuit of claim 1,
wherein the at least one security measure comprises determining whether an operating pattern differs from one or more predetermined operating patterns.

The integrated circuit of claim 2,
wherein the one or more remedial actions include stopping execution of an operation.

A method of protecting an integrated circuit from counterfeiting, the method comprising:
implementing, using a controller associated with the integrated circuit, at least one security measure for operation of the integrated circuit,
wherein the integrated circuit comprises:
a substrate;
a memory array disposed on the substrate and comprising a plurality of discrete memory banks; and
a processing array disposed on the substrate and comprising a plurality of processor subunits, each associated with one or more of the plurality of discrete memory banks.

The method of claim 23,
further comprising taking one or more remedial actions if the at least one security measure is triggered.

An integrated circuit, comprising:
a substrate;
a memory array disposed on the substrate and comprising a plurality of discrete memory banks;
a processing array disposed on the substrate and comprising a plurality of processor subunits, each associated with one or more of the plurality of discrete memory banks; and
a controller,
wherein the controller is configured to implement at least one security measure for operation of the integrated circuit, and the at least one security measure comprises copying program code in at least two different memory portions.

An integrated circuit, comprising:
a substrate;
a memory array disposed on the substrate and comprising a plurality of discrete memory banks;
a processing array disposed on the substrate and comprising a plurality of processor subunits, each associated with one or more of the plurality of discrete memory banks; and
a controller configured to implement at least one security measure for operation of the integrated circuit.

The integrated circuit of claim 26,
wherein the controller is further configured to take one or more remedial actions when the at least one security measure is triggered.

A distributed processor memory chip, comprising:
a substrate;
a memory array disposed on the substrate and comprising a plurality of discrete memory banks;
a processing array disposed on the substrate and comprising a plurality of processor subunits, each associated with one or more of the plurality of discrete memory banks;
a first communication port configured to establish a communication connection between the distributed processor memory chip and an external entity other than another distributed processor memory chip; and
a second communication port configured to establish a communication connection between the distributed processor memory chip and a first additional distributed processor memory chip.

The distributed processor memory chip of claim 28,
further comprising a third communication port configured to establish a communication connection between the distributed processor memory chip and a second additional distributed processor memory chip.

The distributed processor memory chip of claim 29,
further comprising a controller configured to control communication through at least one of the first communication port, the second communication port, and the third communication port.

The distributed processor memory chip of claim 29,
wherein the first communication port, the second communication port, and the third communication port are each associated with a corresponding bus.

The distributed processor memory chip of claim 31,
wherein the corresponding bus is a bus common to each of the first communication port, the second communication port, and the third communication port.

The distributed processor memory chip of claim 31,
wherein the corresponding buses associated with each of the first communication port, the second communication port, and the third communication port are all coupled to the plurality of discrete memory banks.

The distributed processor memory chip of claim 31,
wherein at least one bus associated with the first communication port, the second communication port, and the third communication port is unidirectional.

The distributed processor memory chip of claim 31,
wherein at least one bus associated with the first communication port, the second communication port, and the third communication port is bidirectional.

The distributed processor memory chip of claim 30,
wherein the controller is configured to schedule a data transfer between the distributed processor memory chip and the first additional distributed processor memory chip such that a receiving processor subunit of the first additional distributed processor memory chip executes its associated program code based on the data transfer and during a time period in which the data transfer is received.

The distributed processor memory chip of claim 30,
wherein the controller is configured to send a clock enable signal to at least one processor subunit of the plurality of processor subunits of the distributed processor memory chip to control one or more operational aspects of the at least one processor subunit.

The distributed processor memory chip of claim 37,
wherein the controller is configured to control the timing of one or more communication commands associated with the at least one processor subunit by controlling the clock enable signal sent to the at least one processor subunit.

The distributed processor memory chip of claim 30,
wherein the controller is configured to selectively initiate execution of program code by one or more processor subunits of the plurality of processor subunits of the distributed processor memory chip.

The distributed processor memory chip of claim 30,
wherein the controller is configured to use a clock enable signal to control the timing of data transmission from one or more processor subunits of the plurality of processor subunits to at least one of the second communication port and the third communication port.

The distributed processor memory chip of claim 28,
wherein a communication rate associated with the first communication port is lower than a communication rate associated with the second communication port.

The distributed processor memory chip of claim 30,
wherein the controller is configured to determine whether a first processor subunit of the plurality of processor subunits is ready to transmit data to a second processor subunit included in the first additional distributed processor memory chip, and to use a clock enable signal to initiate transmission of the data from the first processor subunit to the second processor subunit after determining that the first processor subunit is ready to transmit the data to the second processor subunit.

The distributed processor memory chip of claim 42,
wherein the controller is further configured to determine whether the second processor subunit is ready to receive the data, and to use the clock enable signal to initiate the transmission of the data from the first processor subunit to the second processor subunit after determining that the second processor subunit is ready to receive the data.

The distributed processor memory chip of claim 42,
wherein the controller is further configured to determine whether the second processor subunit is ready to receive the data, and to buffer the data included in the transmission until after determining that the second processor subunit of the first additional distributed processor memory chip is ready to receive the data.

A method for transferring data between a first distributed processor memory chip and a second distributed processor memory chip, the method comprising:
determining, using a controller associated with at least one of the first distributed processor memory chip and the second distributed processor memory chip, whether a first processor subunit of a plurality of processor subunits disposed on the first distributed processor memory chip is ready to transmit data to a second processor subunit included in the second distributed processor memory chip; and
initiating, using a clock enable signal controlled by the controller, transmission of the data from the first processor subunit to the second processor subunit after determining that the first processor subunit is ready to transmit the data to the second processor subunit.

The method of claim 45, further comprising:
determining, using the controller, whether the second processor subunit is ready to receive the data; and
initiating, using the clock enable signal, the transmission of the data from the first processor subunit to the second processor subunit after determining that the second processor subunit is ready to receive the data.

The method of claim 45,
further comprising determining whether the second processor subunit is ready to receive the data, and buffering the data included in the transmission until after determining that the second processor subunit of the first additional distributed processor memory chip is ready to receive the data.

a substrate;
a memory array disposed on the substrate and comprising a plurality of discrete memory banks;
a first communication port configured to establish a communication connection between the memory chip and an external entity other than the memory chip; and
a second communication port configured to establish a communication connection between the memory chip and a first additional memory chip.

49. The method of claim 48,
and the first communication port is connected to at least one of a main bus inside the memory chip and at least one processor subunit included in the memory chip.

49. The method of claim 48,
and the second communication port is connected to at least one of a main bus inside the memory chip and at least one processor subunit included in the memory chip.

a memory array including a plurality of memory banks;
at least one controller configured to control at least one aspect of a read operation for the plurality of memory banks; and
at least one zero value detection logic configured to detect a multi-bit zero value associated with data stored at a particular address of the plurality of memory banks;
wherein the at least one controller is configured to send a zero-value indicator to one or more circuits in response to detection of a zero value by the at least one zero-value detection logic.

52. The method of claim 51,
and the one or more circuits to which the zero-value indicator is sent are external to the memory unit.

52. The method of claim 51,
and the one or more circuits to which the zero-value indicator is sent are internal to the memory unit.

52. The method of claim 51,
further comprising at least one read disable element configured to abort a read command associated with the particular address when the at least one zero-value detection logic detects a zero value associated with the particular address.

52. The method of claim 51,
and the at least one controller is configured to transmit the zero-value indicator to the one or more circuits instead of transmitting the zero-valued data stored at the specific address.

52. The method of claim 51,
wherein the size of the zero-value indicator is smaller than the size of the zero-valued data.

52. The method of claim 51,
wherein energy consumed by a first process comprising (a) detecting the zero value, (b) generating the zero-value indicator, and (c) transmitting the zero-value indicator to the one or more circuits is less than the energy consumed by transmitting the zero-valued data to the one or more circuits.

58. The memory unit of claim 57, wherein the energy consumed by the first process is less than half of the energy consumed by transferring the zero-valued data to the one or more circuits.

52. The method of claim 51,
further comprising at least one sense amplifier configured to prevent activation of at least one memory bank of the plurality of memory banks after detection of a zero value by the at least one zero-value detection logic.

60. The method of claim 59,
wherein the at least one sense amplifier comprises a plurality of transistors configured to sense a low-power signal from the plurality of memory banks and amplify a small voltage swing to a higher voltage level such that data stored in the plurality of memory banks can be interpreted by the at least one controller.

52. The method of claim 51,
wherein each memory bank of the plurality of memory banks is further organized into subbanks, wherein the at least one controller includes a subbank controller, and wherein the at least one zero-value detection logic includes zero-value detection logic associated with the subbanks.

62. The method of claim 61,
further comprising at least one read disable element comprising a sense amplifier associated with each of the subbanks.

52. The method of claim 51,
further comprising a plurality of processor subunits spatially distributed within the memory unit, wherein each processor subunit of the plurality of processor subunits is associated with at least one dedicated memory bank of the plurality of memory banks, and wherein each processor subunit of the plurality of processor subunits is configured to access and operate on data stored in a corresponding memory bank.

64. The method of claim 63,
and the one or more circuits include one or more of the processor subunits.

64. The method of claim 63,
and each processor subunit of the plurality of processor subunits is connected to two or more other processor subunits of the plurality of processor subunits by one or more buses.

52. The method of claim 51,
further comprising a plurality of buses.

67. The method of claim 66,
and said plurality of buses are configured to transfer data between said plurality of memory banks.

68. The method of claim 67,
wherein at least one of the plurality of buses is configured to transmit the zero-value indicator to the one or more circuits.

receiving a read request of data stored in addresses of a plurality of memory banks from a circuit external to the memory unit;
activating a zero-value detection logic in response to the received request, so that the controller detects a zero value at the received address; and
sending, by the controller, a zero-value indicator to the circuit in response to the zero-value detection by the zero-value detection logic.
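
For illustration only, the three steps recited above (receive a read request, run zero-value detection on the requested address, answer with an indicator instead of the data) can be sketched in software. This is a minimal behavioral model, not the claimed circuit: the names `MemoryUnitModel`, `ZeroIndicator`, and `full_reads` are hypothetical, and treating an unwritten address as zero is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class ZeroIndicator:
    """Short flag sent in place of an all-zero data word (hypothetical type)."""
    address: int


class MemoryUnitModel:
    """Behavioral stand-in for a memory unit with zero-value detection."""

    def __init__(self, contents: Dict[int, int]) -> None:
        self.contents = contents
        self.full_reads = 0  # reads that transferred a whole data word

    def _zero_detect(self, address: int) -> bool:
        # Stand-in for the zero-value detection logic.
        # Assumption: an address that was never written holds zero.
        return self.contents.get(address, 0) == 0

    def read(self, address: int) -> Union[int, ZeroIndicator]:
        # Detection is activated in response to the read request.
        if self._zero_detect(address):
            # Only the indicator leaves the memory unit; no full read occurs.
            return ZeroIndicator(address)
        self.full_reads += 1
        return self.contents[address]
```

A requester would check whether the reply is a `ZeroIndicator` before using it as data; the `full_reads` counter only exists to make the avoided transfers visible in the model.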

70. The method of claim 69,
further comprising configuring a read disable element such that, when the zero-value detection logic detects a zero value associated with the requested address, the controller aborts a read command associated with the requested address.

70. The method of claim 69,
further comprising configuring a sense amplifier such that, when the zero-value detection logic detects a zero value, the controller prevents activation of at least one memory bank of the plurality of memory banks.

A non-transitory computer-readable medium storing a set of instructions executable by a controller of a memory unit to cause the memory unit to perform a method for detecting a zero value at a particular address in a plurality of memory banks, the method comprising:
receiving a read request of data stored in addresses of a plurality of memory banks from a circuit external to the memory unit;
activating a zero-value detection logic in response to the received request, so that the controller detects a zero value at the received address; and
sending, by the controller, a zero-value indicator to the circuit in response to detection of the zero value by the zero-value detection logic.

73. The method of claim 72,
wherein the method further comprises configuring a read disable element such that, when the zero-value detection logic detects a zero value associated with the requested address, the controller aborts a read command associated with the requested address.

73. The method of claim 72,
wherein the method further comprises configuring a sense amplifier such that, when the zero-value detection logic detects a zero value, the controller prevents activation of at least one memory bank of the plurality of memory banks.

a memory unit including a plurality of memory banks, at least one controller configured to control at least one aspect of a read operation for the plurality of memory banks, and at least one zero-value detection logic configured to detect a multi-bit zero value associated with data stored at a specific address of the plurality of memory banks; and
a processing unit configured to send a read request to the memory unit to read data from the memory bank;
wherein the at least one controller and the at least one zero-value detection logic are configured to send a zero-value indicator to one or more circuits in response to detection of a zero value by the at least one zero-value detection logic.

a memory array including a plurality of memory banks;
at least one controller configured to control at least one aspect of a read operation for the plurality of memory banks; and
at least one detection logic configured to detect a predetermined multi-bit value associated with data stored at a specific address of the plurality of memory banks;
and the at least one controller is configured to send a value indicator to one or more circuits in response to detection of the predetermined multi-bit value by the at least one detection logic unit.

77. The method of claim 76,
wherein the predetermined multi-bit value is selectable by a user.

a memory array including a plurality of memory banks;
at least one controller configured to control at least one aspect of a write operation to the plurality of memory banks; and
at least one detection logic configured to detect a predetermined multi-bit value associated with data to be written to a particular address of the plurality of memory banks;
and the at least one controller is configured to send a value indicator to one or more circuits in response to detection of the predetermined multi-bit value by the at least one detection logic unit.

a substrate;
a memory array including a plurality of memory banks disposed on the substrate;
a plurality of processor subunits disposed on the substrate;
at least one controller configured to control at least one aspect of a read operation for the plurality of memory banks; and
at least one detection logic configured to detect a predetermined multi-bit value associated with data stored at a specific address of the plurality of memory banks;
wherein the at least one controller is configured to send a value indicator to one or more of the plurality of processor subunits in response to detection of the predetermined multi-bit value by the at least one detection logic.

one or more memory banks;
a bank controller; and
an address generator,
wherein the address generator is configured to:
provide to the bank controller a current address of a current row to be accessed within an associated memory bank of the one or more memory banks;
determine a predicted address of a next row to be accessed within the associated memory bank; and
provide the predicted address to the bank controller before the operation on the current row associated with the current address is complete.
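
As a rough illustration of the interplay between the address generator and the bank controller described above, the following sketch hands the controller a predicted next row in the same step in which the current row is issued. The constant-stride guess is an assumption chosen for brevity and is not taken from the claims, and `BankControllerModel` and `AddressGeneratorModel` are hypothetical names.

```python
from typing import List, Optional, Tuple


class BankControllerModel:
    """Records what the bank controller is asked to do, in order."""

    def __init__(self) -> None:
        self.log: List[Tuple[str, int]] = []

    def access(self, row: int) -> None:
        self.log.append(("access", row))

    def pre_activate(self, row: int) -> None:
        self.log.append(("activate", row))


class AddressGeneratorModel:
    """Supplies the current row and a predicted next row.

    Assumption: the next row is guessed by repeating the last stride.
    """

    def __init__(self, controller: BankControllerModel) -> None:
        self.controller = controller
        self.previous: Optional[int] = None

    def issue(self, current_row: int) -> Optional[int]:
        # Current address goes to the bank controller first.
        self.controller.access(current_row)
        predicted = None
        if self.previous is not None:
            predicted = current_row + (current_row - self.previous)
            # The prediction is delivered within the same step, i.e. before
            # the model considers the current-row operation finished.
            self.controller.pre_activate(predicted)
        self.previous = current_row
        return predicted
```

No prediction is produced for the very first access because the model has no stride to extrapolate from.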

81. The method of claim 80,
and the operation on the current row associated with the current address is a read operation or a write operation.

81. The method of claim 80,
and the current row and the next row are in the same memory bank.

83. The method of claim 82,
and the same memory bank allows the next row to be accessed while the current row is accessed.

81. The method of claim 80,
and the current row and the next row are in different memory banks.

81. The method of claim 80,
further comprising a distributed processor,
wherein said distributed processor comprises a plurality of processor subunits of a processing array spatially distributed among a plurality of discrete memory banks of said memory array.

81. The method of claim 80,
and the bank controller is configured to access the current row and activate the next row prior to completion of the operation on the current row.

81. The method of claim 80,
wherein the one or more memory banks each include at least a first subbank and a second subbank, and wherein the bank controller associated with each of the one or more memory banks includes a first subbank controller associated with the first subbank and a second subbank controller associated with the second subbank.

88. The method of claim 87,
wherein the first subbank controller is configured to enable access to data included in a current row of the first subbank while the second subbank controller activates a next row in the second subbank.

89. The method of claim 88,
and the activated next row of a second subbank is spaced apart by at least two rows from the current row of the first subbank from which data is being accessed.

88. The method of claim 87,
wherein the second subbank controller is configured to enable access to data included in a current row of the second subbank while the first subbank controller activates a next row in the first subbank.

91. The method of claim 90,
and the activated next row of a first subbank is spaced apart by at least two rows from the current row of the second subbank from which data is being accessed.

81. The method of claim 80,
wherein the predicted address is determined using a trained neural network.

81. The method of claim 80,
wherein the predicted address is determined based on a determined line access pattern.

81. The method of claim 80,
wherein the address generator comprises a first address generator configured to generate the current address and a second address generator configured to generate the predicted address.

95. The method of claim 94,
and the second address generator is configured to calculate the predicted address within a predetermined period of time after the first address generator generates the current address.

96. The method of claim 95,
and the predetermined time period is adjustable.

97. The method of claim 96,
and the predetermined period of time is adjusted based on a value of at least one operating parameter associated with the memory unit.

98. The method of claim 97,
and the at least one operating parameter comprises a temperature of the memory unit.

81. The method of claim 80,
wherein the address generator is further configured to generate a confidence level associated with the predicted address and to cause the bank controller to forgo accessing the next row at the predicted address if the confidence level falls below a predetermined threshold.
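
To make the confidence gating concrete, here is a small sketch in which a prediction is withheld whenever its confidence is below a threshold. How confidence is actually derived is not stated in the claims; the share of observed strides that agree with the most common stride is used here purely as an assumed stand-in, and `ConfidencePredictor` and its default threshold of 0.6 are hypothetical.

```python
from collections import Counter
from typing import Optional, Tuple


class ConfidencePredictor:
    """Predicts the next row and reports how much it trusts the guess."""

    def __init__(self, threshold: float = 0.6) -> None:
        self.threshold = threshold
        self.strides: Counter = Counter()
        self.last_row: Optional[int] = None

    def observe(self, row: int) -> None:
        # Track the distance between consecutive row accesses.
        if self.last_row is not None:
            self.strides[row - self.last_row] += 1
        self.last_row = row

    def predict(self) -> Tuple[Optional[int], float]:
        """Return (predicted_row or None, confidence)."""
        if self.last_row is None or not self.strides:
            return None, 0.0
        stride, count = self.strides.most_common(1)[0]
        confidence = count / sum(self.strides.values())
        if confidence < self.threshold:
            # Below threshold: no row is handed over for early activation.
            return None, confidence
        return self.last_row + stride, confidence
```

A regular access sequence yields a prediction, while an irregular one yields `None`, so the bank controller in this model never opens a row on a weak guess.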

81. The method of claim 80,
wherein the predicted address is generated by a chain of flip-flops that samples the generated address with a delay.

101. The method of claim 100,
wherein the delay is configurable through a multiplexer that selects between the flip-flops storing the sampled address.
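
The flip-flop chain and multiplexer can be pictured as a shift register with a selectable tap. The sketch below models only that mechanism: each clock shifts a newly generated address into the chain, and the `select` value plays the role of the multiplexer by choosing which stage is read out. The class name `DelayChain` is hypothetical, and the sketch deliberately does not model which tap serves as the current address and which as the predicted address.

```python
from collections import deque
from typing import Deque, Optional


class DelayChain:
    """Shift register of sampled addresses with a multiplexer-selected tap."""

    def __init__(self, depth: int, select: int) -> None:
        if not 0 <= select < depth:
            raise ValueError("select must index a flip-flop in the chain")
        # One slot per flip-flop; None means the stage has not been loaded yet.
        self.stages: Deque[Optional[int]] = deque([None] * depth, maxlen=depth)
        self.select = select

    def clock(self, generated_address: int) -> Optional[int]:
        # The new sample enters stage 0; the oldest sample falls off the end.
        self.stages.appendleft(generated_address)
        # Tap k holds the address that was sampled k clocks earlier.
        return self.stages[self.select]
```

Changing `select` changes the delay without altering the chain itself, which mirrors the role the multiplexer plays in the claim.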

81. The method of claim 80,
and the bank controller is configured to ignore the predicted address received from the address generator for a predetermined period after the reset of the memory unit.

81. The method of claim 80,
and the address generator is configured to abandon providing the predicted address to the bank controller after detecting a random pattern in row accesses associated with the associated memory bank.

one or more memory banks, wherein each memory bank of the one or more memory banks comprises:
a plurality of rows;
a first row controller configured to control a first subset of the plurality of rows;
a second row controller configured to control a second subset of the plurality of rows;
a single data input for receiving data to be stored in the plurality of rows; and
a single data output for providing data retrieved from the plurality of rows.

105. The method of claim 104,
and the memory unit is configured to receive a first address for processing and a second address for activation and access at a predetermined time.

105. The method of claim 104,
and the first subset of the plurality of rows is comprised of even numbered rows.

107. The method of claim 106,
and the even numbered rows are arranged in half of the one or more memory banks.

107. The method of claim 106,
and odd numbered rows are arranged in half of the one or more memory banks.

105. The method of claim 104,
and the second subset of the plurality of rows consists of odd numbered rows.

105. The method of claim 104,
and the first subset of the plurality of rows is included in a first subbank of the memory bank adjacent a second subbank of the memory bank that includes the second subset of the plurality of rows.

105. The method of claim 104,
wherein the first row controller is configured to cause access of data contained in a row in the first subset of the plurality of rows while the second row controller activates a row in the second subset of the plurality of rows.
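
The overlap between accessing a row under one row controller and activating a row under the other can be sketched as follows. The model assumes the even/odd split of rows recited in the related claims and uses hypothetical names (`TwoControllerBank`, `open_row`, `step`); physical spacing between rows is not modeled.

```python
from typing import Dict, Optional


class TwoControllerBank:
    """Toy bank: controller 0 owns even rows, controller 1 owns odd rows."""

    def __init__(self, rows: Dict[int, str]) -> None:
        self.rows = rows
        # Row currently held open by each row controller (None = nothing open).
        self.open_row: Dict[int, Optional[int]] = {0: None, 1: None}

    def activate(self, row: int) -> None:
        self.open_row[row % 2] = row

    def step(self, access_row: int, activate_row: int) -> str:
        """Read access_row while the other controller opens activate_row."""
        if access_row % 2 == activate_row % 2:
            raise ValueError("both rows belong to the same row controller")
        if self.open_row[access_row % 2] != access_row:
            raise RuntimeError("row must be activated before it is accessed")
        # The other controller opens its row during the same step.
        self.activate(activate_row)
        return self.rows[access_row]
```

In this model a sequential sweep alternates between the two controllers, so each row has already been opened by the time it is read.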

112. The method of claim 111,
and the activated row of the second subset of the plurality of rows is spaced apart by at least two rows from the row of the first subset of the plurality of rows from which data is being accessed.

105. The method of claim 104,
wherein the second row controller is configured to cause access of data contained in a row in the second subset of the plurality of rows while the first row controller activates a row in the first subset of the plurality of rows.

114. The method of claim 113,
and the activated row of the first subset of the plurality of rows is spaced apart by at least two rows from the row of the second subset of the plurality of rows from which data is being accessed.

105. The method of claim 104,
wherein each memory bank of said one or more memory banks includes a column input for receiving a column identifier indicating a portion of a row to be accessed.

105. The method of claim 104,
wherein one line of extra overlapping mats is disposed between each two lines of mats to create a distance that allows activation.

105. The method of claim 104,
wherein lines adjacent to each other are not activated at the same time.

a substrate;
a memory array disposed on the substrate and comprising a plurality of discrete memory banks;
a processing array disposed on the substrate and comprising a plurality of processor subunits each associated with a corresponding dedicated discrete memory bank of the plurality of discrete memory banks; and
at least one memory mat disposed on the substrate and configured to serve as at least one register in a register file for one or more processor subunits of the plurality of processor subunits.

119. The method of claim 118,
and the at least one memory mat is included in at least one processor subunit of the plurality of processor subunits of the processing array.

119. The method of claim 118,
wherein the register file is configured as a data register file.

119. The method of claim 118,
wherein the register file is configured as an address register file.

119. The method of claim 118,
wherein the at least one memory mat provides at least one register of a register file for one or more processor subunits of the plurality of processor subunits, the at least one register being configured to store data to be accessed by the one or more processor subunits.

119. The method of claim 118,
wherein the at least one memory mat is configured to provide at least one register of a register file for one or more processor subunits of the plurality of processor subunits, and wherein the at least one register of the register file is configured to store coefficients used by the plurality of processor subunits during execution of convolution accelerator operations by the plurality of processor subunits.

119. The method of claim 118,
wherein the at least one memory mat is a DRAM memory mat.

119. The method of claim 118,
and the at least one memory mat is configured to communicate via a one-way access.

119. The method of claim 118,
and said at least one memory mat allows bidirectional access.

119. The method of claim 118,
further comprising at least one redundant memory mat disposed on the substrate, wherein the at least one redundant memory mat is configured to provide at least one redundant register for one or more processor subunits of the plurality of processor subunits.

119. The method of claim 118,
at least one redundant memory mat disposed on the substrate, wherein the at least one redundant memory mat is configured to provide at least one redundant register for one or more processor subunits of the plurality of processor subunits. A memory chip comprising redundant memory bits of

119. The method of claim 118,
a first plurality of buses each coupling one processor subunit of the plurality of processor subunits to a corresponding dedicated memory bank; and
and a second plurality of buses each connecting one processor subunit of the plurality of processor subunits to another processor subunit of the plurality of processor subunits.

119. The method of claim 118,
wherein at least one processor subunit of the plurality of processor subunits comprises a counter configured to count backwards from a predetermined number, and wherein the at least one processor subunit is configured to suspend a current task and trigger a memory refresh operation when the counter reaches a value of zero.
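
The countdown-driven refresh can be illustrated with a few lines of code. In this sketch the counter is decremented once per cycle, and reaching zero marks a cycle in which the current task would be suspended and a refresh triggered. Reloading the counter after each refresh is an assumption of the sketch, and `RefreshCountdown` is a hypothetical name.

```python
class RefreshCountdown:
    """Counts down from a preset value and signals when a refresh is due."""

    def __init__(self, reload: int) -> None:
        if reload <= 0:
            raise ValueError("reload must be a positive number of cycles")
        self.reload = reload
        self.count = reload
        self.refreshes = 0

    def tick(self) -> bool:
        """Advance one cycle; return True if this cycle is spent on refresh."""
        self.count -= 1
        if self.count == 0:
            # Current task is suspended here and the refresh is triggered.
            self.refreshes += 1
            # Assumption: the counter restarts from the preset value.
            self.count = self.reload
            return True
        return False
```

With a reload value of 3, every third call to `tick()` reports a refresh cycle.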

119. The method of claim 118,
and at least one processor subunit of the plurality of processor subunits comprises a mechanism to refresh the memory mat by suspending a current task and triggering a memory refresh operation at a specific time.

119. The method of claim 118,
and the register file is configured to be used as a cache.

retrieving one or more data values from a memory array of a distributed processor memory chip;
storing the one or more data values in a register formed in a memory mat of the distributed processor memory chip; and
accessing the one or more data values stored in the register according to at least one instruction executed by a processor element;
wherein the memory array includes a plurality of discrete memory banks disposed on a substrate;
wherein the processor element is a processor subunit of a plurality of processor subunits included in a processing array disposed on the substrate, each processor subunit associated with a corresponding dedicated discrete memory bank of the plurality of discrete memory banks;
wherein the register is provided by a memory mat disposed on the substrate.

134. The method of claim 133,
wherein the processor element is configured to function as an accelerator, the method further comprising:
accessing first data stored in the register;
accessing second data from the memory array; and
performing an operation on the first data and the second data.

134. The method of claim 133,
wherein the memory mat comprises a plurality of word lines and bit lines,
the method further comprising determining timing of loading the word lines and the bit lines according to the size of the memory mat.

134. The method of claim 133,
further comprising periodically refreshing the register.

134. The method of claim 133,
wherein the memory mat comprises a DRAM memory mat.

134. The method of claim 133,
wherein said memory mat is included in said plurality of discrete memory banks of said memory array.

a substrate;
a processing unit disposed on the substrate; and
a memory unit disposed on the substrate;
the memory unit is configured to store data to be accessed by the processing unit;
wherein the processing unit comprises a memory mat configured to serve as a cache for the processing unit.

A method for distributed processing of at least one information stream, the method comprising:
receiving, by one or more memory processing integrated circuits, at least one information stream via a first communication channel, wherein each memory processing integrated circuit includes a controller, multiple processor subunits, and multiple memory units;
buffering, by the one or more memory processing integrated circuits, the at least one information stream;
performing, by the one or more memory processing integrated circuits, a primary processing operation on the at least one information stream to provide a primary processing result;
transmitting the primary processing result to a processing integrated circuit; and
performing, by the processing integrated circuit, a secondary processing operation on the primary processing result to provide a secondary processing result;
wherein a size of a logic cell of the processing integrated circuit is smaller than a size of a logic cell of the one or more memory processing integrated circuits.

140. The method of claim 140,
and each of said multiple memory units is coupled to at least one of said multiple processor subunits.

140. The method of claim 140,
wherein a total size of information units of the at least one information stream received during a specific period of time is greater than a total size of primary processing results output during the specific period of time.

140. The method of claim 140,
and a total size of the at least one information stream is smaller than a total size of the primary processing result.

140. The method of claim 140,
wherein the memory additive manufacturing process is a DRAM manufacturing process.

140. The method of claim 140,
wherein the one or more memory processing integrated circuits are manufactured by a memory additive manufacturing process,
and wherein the processing integrated circuit is manufactured by a logic additive manufacturing process.

140. The method of claim 140,
and a size of a logic cell of the one or more memory processing integrated circuits is at least twice a size of a corresponding logic cell of the processing integrated circuit.

140. The method of claim 140,
and a critical dimension of a logic cell of the one or more memory processing integrated circuits is at least twice a critical dimension of a corresponding logic cell of the processing integrated circuit.

140. The method of claim 140,
and a critical dimension of a memory cell of the one or more memory processing integrated circuits is at least twice a critical dimension of a corresponding logic cell of the processing integrated circuit.

140. The method of claim 140,
and requesting, by the processing integrated circuit, the one or more memory processing integrated circuits to perform the primary processing operation.

140. The method of claim 140,
further comprising instructing, by the processing integrated circuit, the one or more memory processing integrated circuits to perform the primary processing operation.

140. The method of claim 140,
further comprising configuring, by the processing integrated circuit, the one or more memory processing integrated circuits to perform the primary processing operation.

140. The method of claim 140,
and the one or more memory processing integrated circuits executing the primary processing operation without intervention of the processing integrated circuit.

140. The method of claim 140,
wherein said primary processing operation is less memory intensive than said secondary processing operation.

140. The method of claim 140,
and a total throughput of the primary processing operation is greater than a total throughput of the secondary processing operation.

140. The method of claim 140,
wherein the at least one information stream comprises one or more preprocessed information streams.

156. The method of claim 155,
wherein the one or more preprocessed information streams are data extracted from network conveyed units.

140. The method of claim 140,
wherein a portion of the primary processing operation is executed by one processor subunit of the multiple processor subunits, and another portion of the primary processing operation is executed by another processor subunit of the multiple processor subunits.

140. The method of claim 140,
wherein the primary processing operation and the secondary processing operation include a mobile communication network processing operation.

140. The method of claim 140,
wherein the primary processing operation and the secondary processing operation comprise a database processing operation.

140. The method of claim 140,
wherein the primary processing operation and the secondary processing operation comprise a database analysis processing operation.

140. The method of claim 140,
wherein the primary processing operation and the secondary processing operation include an artificial intelligence processing operation.

receiving, by one or more memory processing integrated circuits of a separate system comprising one or more computing subsystems separate from one or more storage subsystems, an information unit, wherein each of the one or more memory processing integrated circuits comprises a controller, multiple processor subunits, and multiple memory units, wherein the one or more computing subsystems include multiple processing integrated circuits, and wherein a size of a logic cell of the one or more memory processing integrated circuits is at least twice a size of a corresponding logic cell of the multiple processing integrated circuits;
performing, by the one or more memory processing integrated circuits, a processing operation on the information unit to provide a processing result; and
outputting the processing result from the one or more memory processing integrated circuits.

163. The method of claim 162,
outputting the processing results to the one or more computing subsystems of the separate system.

163. The method of claim 162,
further comprising receiving the information unit from the one or more storage subsystems of the separate system.

163. The method of claim 162,
outputting the processing result to the one or more storage subsystems of the separate system.

163. The method of claim 162,
further comprising receiving the information unit from the one or more computing subsystems of the separate system.

166. The method of claim 166,
wherein information units transmitted from different groups of processing units of the multiple processing integrated circuits include different portions of an intermediate result of a process executed by the multiple processing integrated circuits, and wherein each group of processing units includes at least one processing integrated circuit.

168. The method of claim 167,
further comprising outputting, by the one or more memory processing integrated circuits, a result of the overall process.

169. The method of claim 168,
and transmitting the result of the overall process to each of the multiple processing integrated circuits.

169. The method of claim 168,
wherein the different parts of the intermediate result are different parts of an updated neural network model, and the result of the overall process is the updated neural network model.

169. The method of claim 168,
further comprising transmitting the updated neural network model to each of the multiple processing integrated circuits.

163. The method of claim 162,
and outputting the processing result by utilizing a switching subunit of the separate system.

163. The method of claim 162,
wherein the one or more memory processing integrated circuits are included in a memory processing subunit of the separate system.

163. The method of claim 162,
wherein at least one of the one or more memory processing integrated circuits is included in one or more computing subsystems of the separate system.

163. The method of claim 162,
wherein at least one of the one or more memory processing integrated circuits is included in one or more memory subsystems of the separate system.

163. The method of claim 162,
wherein at least one of (a) the information unit is received from at least one processing integrated circuit of the multiple processing integrated circuits and (b) the processing result is transmitted to one or more processing integrated circuits of the multiple processing integrated circuits.

178. The method of claim 176,
and a critical dimension of a logic cell of the one or more memory processing integrated circuits is at least twice a critical dimension of a corresponding logic cell of the multiple processing integrated circuit.

178. The method of claim 176,
and a critical dimension of a memory cell of the one or more memory processing integrated circuits is at least twice a critical dimension of a corresponding logic cell of the multiple processing integrated circuit.

163. The method of claim 162,
wherein the information unit includes a preprocessed information unit.

180. The method of claim 179,
further comprising providing, by the multiple processing integrated circuits, the preprocessed information unit.

163. The method of claim 162,
wherein the information unit carries parts of a model of a neural network.

163. The method of claim 162,
wherein the information unit carries a partial result of at least one database query.

163. The method of claim 162,
wherein the information unit carries partial results of at least one aggregate database query.

receiving, by a memory processing integrated circuit, a database query including at least one relevance criterion indicative of database entries of a database that are relevant to the database query, wherein the memory processing integrated circuit comprises a controller, multiple processor subunits, and multiple memory units;
determining, by the memory processing integrated circuit and based on the at least one relevance criterion, a group of relevant database entries stored in the memory processing integrated circuit; and
transferring the group of relevant database entries to one or more processing entities for further processing, without substantially transferring irrelevant database entries stored in the memory processing integrated circuit to the one or more processing entities, wherein the irrelevant database entries differ from the relevant database entries.
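The filtering step of this claim can be illustrated with a minimal Python sketch. It is not the claimed implementation: the `MemoryProcessingIC` class, the `select_relevant` and `answer_query` names, and the dictionary layout of an entry are assumptions made for illustration only. The sketch shows each memory unit being scanned locally so that only the relevant group is handed to a processing entity.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Entry = Dict[str, int]  # hypothetical layout of one database entry


@dataclass
class MemoryProcessingIC:
    """Toy model: each inner list stands in for one memory unit."""
    memory_units: List[List[Entry]] = field(default_factory=list)

    def select_relevant(self, criterion: Callable[[Entry], bool]) -> List[Entry]:
        # Each processor subunit scans only its own memory unit. Real hardware
        # could run these scans in parallel; here they run one after another.
        relevant: List[Entry] = []
        for unit in self.memory_units:
            relevant.extend(entry for entry in unit if criterion(entry))
        return relevant


def answer_query(
    ic: MemoryProcessingIC,
    criterion: Callable[[Entry], bool],
    processing_entity: Callable[[List[Entry]], int],
) -> int:
    # Only the relevant group leaves the chip; irrelevant entries stay put.
    group = ic.select_relevant(criterion)
    return processing_entity(group)
```

In this sketch the relevance criterion is an arbitrary predicate, so the same structure covers equality filters, range filters, or any other test that can be evaluated next to the data.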

185. The method of claim 184,
wherein the one or more processing entities are included in the multi-processor subunit of the memory processing integrated circuit.

185. The method of claim 185,
further comprising further processing, by the memory processing integrated circuit, the group of relevant database entries to complete the response to the database query.

187. The method of claim 186,
further comprising outputting the response to the database query from the memory processing integrated circuit.

187. The method of claim 187,
wherein said outputting comprises applying a flow control process.

190. The method of claim 188,
wherein applying the flow control process is responsive to an indicator output from the one or more processing entities regarding completion of processing of one or more database entries of the group.

185. The method of claim 185,
further comprising further processing, by the memory processing integrated circuit, the group of relevant database entries to provide an intermediate response to the database query.

190. The method of claim 190,
further comprising outputting the intermediate response to the database query from the memory processing integrated circuit.

192. The method of claim 191,
wherein said outputting comprises applying a flow control process.

193. The method of claim 192,
wherein applying the flow control process is responsive to an indicator output from the one or more processing entities regarding completion of partial processing of a database entry of the group.

185. The method of claim 185,
further comprising generating, by the one or more processing entities, a processing status indicator indicating progress of further processing of the group of relevant database entries.

185. The method of claim 185,
further comprising further processing the group of relevant database entries by the memory processing integrated circuit.

195. The method of claim 195,
wherein said processing is executed by the multiple processor subunits.

195. The method of claim 195,
wherein the processing comprises calculating an intermediate result by one processor subunit of the multiple processor subunits, sending the intermediate result to another processor subunit of the multiple processor subunits, and performing an additional calculation by the other processor subunit.

195. The method of claim 195,
wherein the processing is executed by the controller.

195. The method of claim 195,
wherein the processing is executed by the multiple processor subunits and the controller.

185. The method of claim 184,
wherein the one or more processing entities are located external to the memory processing integrated circuit.

200. The method of claim 200,
further comprising outputting the group of relevant database entries from the memory processing integrated circuit.

201. The method of claim 201,
wherein said outputting comprises applying a flow control process.

202. The method of claim 202,
wherein applying the flow control process is responsive to an indicator that is output from the one or more processing entities and relates to the relevance of a database entry associated with the one or more processing entities.

185. The method of claim 184,
wherein the multiprocessor subunit comprises full arithmetic logic units.

185. The method of claim 184,
wherein the multiprocessor subunit comprises partial arithmetic logic units.

185. The method of claim 184,
wherein the multi-processor subunit comprises a memory controller.

185. The method of claim 184,
wherein the multiprocessor subunit comprises a partial memory controller.

185. The method of claim 184,
further comprising outputting at least one of (i) the group of relevant database entries, (ii) a response to the database query, and (iii) an intermediate response to the database query.

212. The method of claim 212,
wherein said outputting comprises applying traffic shaping.

212. The method of claim 212,
wherein the outputting includes attempting to match the bandwidth used during the outputting to a maximum allowed bandwidth of a link coupling the memory processing integrated circuit to a requester unit.

212. The method of claim 212,
wherein outputting the output comprises maintaining a variation in output traffic rate below a threshold value.
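The traffic shaping described in the preceding claims can be sketched as follows. This is an illustrative Python model only: the fixed time slots, the `shape_output` and `rate_variation` names, and the byte counts are assumptions, not details from the disclosure. Bursty output is spread over slots so that each slot uses at most the maximum allowed bandwidth and the slot-to-slot variation stays small.

```python
from typing import List


def shape_output(payload_sizes: List[int], max_per_slot: int) -> List[int]:
    """Return the bytes sent in each time slot, never exceeding max_per_slot."""
    if max_per_slot <= 0:
        raise ValueError("max_per_slot must be positive")
    remaining = sum(payload_sizes)
    slots: List[int] = []
    while remaining > 0:
        # Fill each slot up to the link limit, which keeps the link busy
        # while avoiding bursts above the allowed bandwidth.
        sent = min(remaining, max_per_slot)
        slots.append(sent)
        remaining -= sent
    return slots


def rate_variation(slots: List[int]) -> int:
    """Spread between the busiest and the quietest slot."""
    return max(slots) - min(slots) if slots else 0
```

A caller could compare `rate_variation(...)` against a threshold to check that the shaped traffic satisfies a variation bound.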

185. The method of claim 184,
wherein the one or more processing entities include multiple processing entities, wherein at least one processing entity of the multiple processing entities belongs to the memory processing integrated circuit and at least another processing entity of the multiple processing entities does not belong to the memory processing integrated circuit.

185. The method of claim 184,
wherein the one or more processing entities belong to other memory processing integrated circuits.

receiving, by multiple memory processing integrated circuits, a database query comprising at least one relevance criterion indicative of database entries of a database that are relevant to the database query, wherein each of the multiple memory processing integrated circuits comprises a controller, multiple processor subunits, and multiple memory units;
determining, by each of the multiple memory processing integrated circuits and based on the at least one relevance criterion, a group of relevant database entries stored in that memory processing integrated circuit; and
transferring, by each of the multiple memory processing integrated circuits, the group of relevant database entries stored in that memory processing integrated circuit to one or more processing entities for further processing, without substantially transferring irrelevant database entries stored in that memory processing integrated circuit to the one or more processing entities, wherein the irrelevant database entries differ from the relevant database entries.

A method for accelerating database analysis, the method comprising:
receiving, by an integrated circuit, a database query comprising at least one relevance criterion indicative of database entries of a database that are relevant to the database query, wherein the integrated circuit includes a controller, a filtering unit, and multiple memory units;
determining, by the filtering unit and based on the at least one relevance criterion, a group of relevant database entries stored in the integrated circuit; and
transferring the group of relevant database entries to one or more processing entities located external to the integrated circuit for further processing, without substantially transferring irrelevant data entries stored in the integrated circuit to the one or more processing entities.

receiving, by an integrated circuit, the database query comprising at least one relevance criterion indicative of a database entry in a database that relates to the database query, wherein the integrated circuit includes a controller, a processing unit, and multiple memory units;
determining, by the processing unit, a group of relevant database entries stored in the integrated circuit based on the at least one relevance criterion;
processing, by the processing unit, the group of relevant database entries to provide a processing result, without the integrated circuit processing irrelevant data entries stored in the integrated circuit, wherein the irrelevant data entries differ from the relevant database entries; and
outputting the processing result from the integrated circuit.

receiving, by a memory processing integrated circuit, retrieval information for retrieval of multiple requested feature vectors mapped to multiple sentence segments, wherein the memory processing integrated circuit comprises a controller, multiple processor subunits, and multiple memory units, each memory unit being coupled to a processor subunit;
retrieving the multiple requested feature vectors from at least some memory units of the multiple memory units, wherein the retrieving comprises concurrently requesting, from two or more memory units, the requested feature vectors stored in the two or more memory units; and
outputting from the memory processing integrated circuit an output comprising at least one of (a) the requested feature vectors and (b) a result of processing of the requested feature vectors.
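As a rough illustration of the concurrent retrieval in this claim, the following Python sketch groups requests by memory unit, reads the units in parallel, and returns the vectors in the order of the sentence segments. The `retrieve_feature_vectors` function, the `(unit, slot)` mapping format, and the use of a thread pool to stand in for independent memory units are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Sequence, Tuple

Vector = Tuple[float, ...]
Location = Tuple[int, int]  # (memory unit index, slot inside that unit)


def retrieve_feature_vectors(
    segments: Sequence[str],
    mapping: Dict[str, Location],
    memory_units: List[List[Vector]],
) -> List[Vector]:
    """Fetch one vector per segment, reading the memory units concurrently."""
    # Group the requests by memory unit so that each unit serves its own batch.
    per_unit: Dict[int, List[Tuple[int, int]]] = {}
    for position, segment in enumerate(segments):
        unit, slot = mapping[segment]
        per_unit.setdefault(unit, []).append((position, slot))

    def read_unit(unit: int) -> List[Tuple[int, Vector]]:
        return [(pos, memory_units[unit][slot]) for pos, slot in per_unit[unit]]

    results: List[Vector] = [()] * len(segments)
    with ThreadPoolExecutor(max_workers=max(1, len(per_unit))) as pool:
        for batch in pool.map(read_unit, per_unit):
            for pos, vector in batch:
                # Batches may complete out of order; writing by position
                # restores the order of the sentence segments.
                results[pos] = vector
    return results
```

The known mapping between a sentence segment and the location of its feature vector is modeled here as a plain dictionary that would be loaded ahead of time.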

218. The method of claim 217,
wherein the output includes the requested feature vectors.

218. The method of claim 217,
wherein the output includes the result of the processing of the requested feature vectors.

219. The method of claim 219,
wherein the processing is executed by the multiple processor subunits.

223. The method of claim 220,
wherein the processing comprises transmitting the requested feature vector from one processing subunit to another processing subunit.

223. The method of claim 220,
wherein said processing comprises calculating an intermediate result by one processing subunit, sending said intermediate result to another processing subunit, and calculating another intermediate result or processing result by said other processing subunit.

219. The method of claim 219,
wherein the processing is performed by the controller.

219. The method of claim 219,
wherein the processing is executed by the multiple processor subunits and the controller.

219. The method of claim 219,
wherein the processing is performed by a vector processor of the memory processing integrated circuit.

218. The method of claim 217,
wherein the controller is configured to concurrently request the requested feature vectors based on a known mapping between sentence segments and the locations of the feature vectors mapped to the sentence segments.

12. The method of claim 11,
wherein the mapping is uploaded during a boot process of the memory processing integrated circuit.

218. The method of claim 217,
wherein the controller is configured to manage the retrieval of the multiple requested feature vectors.

218. The method of claim 217,
wherein the multiple sentence segments are in a specific order, and the output of the requested feature vector is performed according to the specific order.

229. The method of claim 229,
wherein the retrieval of the multiple requested feature vectors is performed according to the specific order.

229. The method of claim 229,
wherein the retrieval of the multiple requested feature vectors is performed at least in part out of order, and wherein the retrieval further comprises reordering the multiple requested feature vectors.

218. The method of claim 217,
wherein the retrieval of the multiple requested feature vectors comprises buffering the multiple requested feature vectors before they are read by the controller.

232. The method of claim 232,
wherein the retrieval of the multiple requested feature vectors comprises generating a buffer status indicator indicating when one or more buffers associated with the multiple memory units store one or more requested feature vectors.

234. The method of claim 233,
further comprising conveying the buffer status indicator over a dedicated control line.

234. The method of claim 234,
wherein one dedicated control line is allocated for each memory unit.

234. The method of claim 234,
wherein the buffer status indicator includes one or more status bits stored in one or more of the buffers.

234. The method of claim 234,
further comprising conveying the buffer status indicator over one or more shared control lines.

218. The method of claim 217,
wherein the retrieval information is included in one or more retrieval commands of a first resolution representing a specified number of bits.

238. The method of claim 238,
further comprising managing the retrieval, by the controller, at a higher resolution representing a smaller number of bits than the specified number of bits.

238. The method of claim 238,
wherein the controller is configured to manage the retrieval at a feature vector resolution.

238. The method of claim 238,
further comprising independently managing the retrieval by the controller.

218. The method of claim 217,
wherein the multiprocessor subunit comprises full arithmetic logic units.

218. The method of claim 217,
wherein the multiprocessor subunit comprises partial arithmetic logic units.

218. The method of claim 217,
wherein the multi-processor subunit comprises a memory controller.

218. The method of claim 217,
wherein the multiprocessor subunit comprises a partial memory controller.

218. The method of claim 217,
wherein outputting the output comprises applying traffic shaping to the output.

217. The method of claim 217,
wherein outputting the output includes matching the bandwidth used during the outputting to a maximum allowed bandwidth of a link coupling the memory processing integrated circuit to a requester unit.

218. The method of claim 217,
wherein outputting the output comprises maintaining a variation in output traffic rate below a threshold value.

218. The method of claim 217,
wherein said retrieving comprises applying a predictive search of at least some requested feature vectors of said requested feature vectors from a set of requested feature vectors stored in a single memory unit.

217. The method of claim 217,
wherein the requested feature vectors are distributed among the memory units.

218. The method of claim 217,
wherein the requested feature vectors are distributed among the memory units based on an expected retrieval pattern.

performing processing operations by multiple processors included in a hybrid device comprising a base die, a first memory resource associated with at least one second die, and a second memory resource associated with at least one third die, wherein the base die and the at least one second die are connected to each other by wafer-to-wafer bonding;
retrieving information stored in the first memory resource by utilizing the multiple processors; and
transmitting additional information from the second memory resource to the first memory resource, wherein an overall bandwidth of a first path between the base die and the at least one second die is greater than an overall bandwidth of a second path between the at least one second die and the at least one third die, and a storage capacity of the first memory resource is less than a storage capacity of the second memory resource.

252. The method of claim 252,
wherein the second memory resource comprises a high bandwidth memory (HBM) resource.

252. The method of claim 252,
wherein said at least one third die comprises a stack of high bandwidth memory (HBM) chips.

252. The method of claim 252,
wherein at least a portion of the second memory resource belongs to a third die of the at least one third die that is coupled to the base die without wafer-to-wafer bonding.

252. The method of claim 252,
wherein at least a portion of the second memory resource belongs to a third die of the at least one third die that is coupled to a second die of the at least one second die without wafer-to-wafer bonding.

252. The method of claim 252,
wherein the first memory resource and the second memory resource include different levels of cache memory.

252. The method of claim 252,
wherein the first memory resource is located between the base die and the second memory resource.

252. The method of claim 252,
wherein the first memory resource is not located above the second memory resource.

252. The method of claim 252,
further comprising performing additional processing by a second die of the at least one second die, the second die comprising a plurality of processor subunits and the first memory resource.

260. The method of claim 260,
wherein at least one processor subunit is coupled to a dedicated portion of the first memory resource allocated to the processor subunit.

261. The method of claim 261,
wherein the dedicated portion of the first memory resource comprises at least one memory bank.

252. The method of claim 252,
wherein the multiple processors belong to a memory processing chip that also includes the first memory resource.

252. The method of claim 252,
wherein the base die includes the multiple processors, the multiple processors including a plurality of processor subunits coupled to the first memory resource via conductors formed of wafer-to-wafer junctions.

265. The method of claim 264,
wherein each processor subunit is coupled to a dedicated portion of the first memory resource allocated to the processor subunit.

A hybrid device for memory-intensive processing, the hybrid device comprising:
a base die;
multiple processors;
a first memory resource of at least one second die; and
a second memory resource of at least one third die;
wherein the base die and the at least one second die are connected to each other by wafer-to-wafer bonding;
wherein the multiple processors are configured to perform processing operations and to retrieve information stored in the first memory resource;
wherein the second memory resource is configured to transmit additional information from the second memory resource to the first memory resource;
wherein an overall bandwidth of a first path between the base die and the at least one second die is greater than an overall bandwidth of a second path between the at least one second die and the at least one third die; and
wherein the storage capacity of the first memory resource is smaller than the storage capacity of the second memory resource.

267. The hybrid device of claim 266,
wherein the second memory resource comprises a high bandwidth memory (HBM) resource.

267. The hybrid device of claim 266,
wherein the at least one third die is a stack of HBM memory chips.

267. The hybrid device of claim 266,
wherein at least a portion of the second memory resource belongs to a third die of the at least one third die that is coupled to the base die without wafer-to-wafer bonding.

267. The hybrid device of claim 266,
wherein at least a portion of the second memory resource belongs to a third die of the at least one third die that is coupled to a second die of the at least one second die without wafer-to-wafer bonding.

267. The hybrid device of claim 266,
wherein the first memory resource and the second memory resource include cache memories of different levels.

267. The hybrid device of claim 266,
wherein the first memory resource is located between the base die and the second memory resource.

267. The hybrid device of claim 266,
wherein the first memory resource is located next to the second memory resource.

267. The hybrid device of claim 266,
wherein a second die of the at least one second die is configured to perform additional processing, the second die comprising a plurality of processor subunits and the first memory resource.

274. The hybrid device of claim 274,
wherein each processor subunit is coupled to a dedicated portion of the first memory resource allocated to the processor subunit.

275. The hybrid device of claim 275,
wherein the dedicated portion of the first memory resource comprises at least one memory bank.

267. The hybrid device of claim 266,
wherein the multiple processors include a plurality of processor subunits of a memory processing chip that also includes the first memory resource.

267. The hybrid device of claim 266,
wherein the base die includes the multiple processors, the multiple processors including a plurality of processor subunits coupled to the first memory resource via conductors formed of wafer-to-wafer junctions.

287. The hybrid device of claim 278,
wherein each processor subunit is coupled to a dedicated portion of the first memory resource allocated to the processor subunit.

retrieving, by a network communication interface of a database acceleration integrated circuit, an amount of information from a storage unit;
providing primary processed information by primary processing the amount of information;
sending, by a memory controller of the database acceleration integrated circuit and via an interface, the primary processed information to multiple memory processing integrated circuits, each memory processing integrated circuit comprising a controller, multiple processor subunits, and multiple memory units;
providing secondary processed information by secondary processing at least a portion of the primary processed information by utilizing the multiple memory processing integrated circuits;
retrieving, by the memory controller of the database acceleration integrated circuit, retrieved information from the multiple memory processing integrated circuits, wherein the retrieved information comprises at least one of (a) at least a portion of the primary processed information and (b) at least a portion of the secondary processed information;
providing a database acceleration result by performing a database processing operation on the retrieved information by utilizing a database acceleration unit of the database acceleration integrated circuit; and
outputting the database acceleration result.
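The sequence of steps in this claim can be pictured with a small Python sketch. Every concrete choice here is an assumption for illustration: primary processing is modeled as zlib decompression plus JSON decoding, secondary processing as a per-chip filter, and the database processing operation as a sum. The names `primary_process`, `secondary_process`, and `accelerate` are hypothetical.

```python
import json
import zlib
from typing import Dict, List

Row = Dict[str, int]  # hypothetical row layout


def primary_process(raw: bytes) -> List[Row]:
    # Assumed primary processing: decompress and decode the stored block.
    return json.loads(zlib.decompress(raw))


def secondary_process(shard: List[Row], min_amount: int) -> List[Row]:
    # Assumed secondary processing inside one memory processing IC: filtering.
    return [row for row in shard if row["amount"] >= min_amount]


def accelerate(stored_block: bytes, num_ics: int, min_amount: int) -> int:
    rows = primary_process(stored_block)
    # Send the primary processed information to the memory processing ICs.
    shards = [rows[i::num_ics] for i in range(num_ics)]
    filtered = [secondary_process(shard, min_amount) for shard in shards]
    # Retrieve the secondary processed information back from the ICs.
    retrieved = [row for shard in filtered for row in shard]
    # Assumed database processing operation: an aggregate sum.
    return sum(row["amount"] for row in retrieved)
```

The point of the sketch is the ordering of the stages, not the specific operations, which the claim leaves open.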

280. The method of claim 280,
further comprising utilizing a management unit of the database acceleration integrated circuit to manage at least one of the retrieving, the primary processing, and the processing.

282. The method of claim 281,
wherein the managing is executed according to an execution plan generated by the management unit of the database acceleration integrated circuit.

282. The method of claim 281,
wherein said managing is executed according to an execution plan that is received by, rather than generated by, said management unit of said database acceleration integrated circuit.

282. The method of claim 281,
wherein the managing comprises allocating at least one of (a) network communication interface resources, (b) decompression unit resources, (c) memory controller resources, (d) multiple memory processing integrated circuit resources, and (e) database acceleration unit resources.

280. The method of claim 280,
wherein the network communication interface comprises at least two different types of network communication ports.

285. The method of claim 285,
wherein the at least two different types of network communication ports include a storage interface protocol port and a storage interface over general network protocol port.

285. The method of claim 285,
wherein the at least two different types of network communication ports include a storage interface protocol port and a storage interface over Ethernet protocol port.

285. The method of claim 285,
wherein the at least two different types of network communication ports include a storage interface protocol port and a PCIe port.

280. The method of claim 280,
wherein a management unit comprises a compute node of a compute system and is controlled by a manager of the compute system.

280. The method of claim 280,
further comprising controlling, by a compute node of a compute system, at least one of the retrieving, the primary processing, the transmitting, and the tertiary processing.

280. The method of claim 280,
further comprising executing multiple tasks concurrently utilizing the database acceleration integrated circuit.

280. The method of claim 280,
further comprising managing, by a management unit located outside the database acceleration integrated circuit, at least one of the retrieving, the primary processing, the transmitting, and the tertiary processing.

280. The method of claim 280,
wherein the database acceleration integrated circuit belongs to a compute system.

280. The method of claim 280,
wherein said database acceleration integrated circuit does not belong to a compute system.

280. The method of claim 280,
further comprising executing at least one of the retrieving, the primary processing, the transmitting, and the tertiary processing based on an execution plan transmitted to the database acceleration integrated circuit by a compute node of a compute system.

280. The method of claim 280,
wherein performing the database processing operation comprises concurrently executing database processing instructions by database processing subunits, wherein the database acceleration unit comprises a group of database accelerator subunits sharing a shared memory unit.

297. The method of claim 296,
wherein each database subunit is configured to execute a specific type of database processing instruction.

297. The method of claim 297,
further comprising dynamically coupling database processing subunits to provide an execution pipeline needed to execute a database processing operation comprising multiple instructions.

280. The method of claim 280,
wherein performing the database processing operation comprises allocating resources of the database acceleration integrated circuit according to temporal I/O bandwidth.

280. The method of claim 280,
The method further comprising outputting the database acceleration result to a local storage and retrieving the database acceleration result from the local storage.

280. The method of claim 280,
wherein the network communication interface comprises an RDMA unit.

280. The method of claim 280,
The method further comprising exchanging information between database acceleration integrated circuits of one or more groups of database acceleration integrated circuits.

280. The method of claim 280,
The method further comprising exchanging database acceleration results between database acceleration integrated circuits of one or more groups of database acceleration integrated circuits.

280. The method of claim 280,
The method further comprising exchanging at least one of (a) information and (b) database acceleration results between database acceleration integrated circuits of one or more groups of database acceleration integrated circuits.

305. The method of claim 304,
wherein a group of database acceleration integrated circuits is connected to a common printed circuit board.

305. The method of claim 304,
wherein a group of database acceleration integrated circuits belongs to a modular unit of a computing system.

305. The method of claim 304,
wherein different groups of database acceleration integrated circuits are connected to different printed circuit boards.

305. The method of claim 304,
wherein different groups of database acceleration integrated circuits belong to a modular unit of a computing system.

305. The method of claim 304,
further comprising performing distributed processing by the one or more groups of database acceleration integrated circuits.

305. The method of claim 304,
wherein the exchanging is performed utilizing a network communication interface of at least one group of said database acceleration integrated circuits.

305. The method of claim 304,
wherein the step of exchanging is carried out over multiple groups connected to each other by a star-connection.

305. The method of claim 304,
further comprising using at least one switch to exchange at least one of (a) information and (b) database acceleration results between different groups of the one or more groups of database acceleration integrated circuits.

305. The method of claim 304,
further comprising performing distributed processing by at least some of the one or more groups of database acceleration integrated circuits.

305. The method of claim 304,
further comprising performing distributed processing of a first data structure and a second data structure, wherein a total size of the first data structure and the second data structure is greater than a storage capacity of the multiple memory processing integrated circuits.

314. The method of claim 314,
wherein performing the distributed processing comprises performing multiple iterations of (a) newly allocating different pairs of first data structure portions and second data structure portions to different database acceleration integrated circuits and (b) processing the different pairs.

315. The method of claim 315,
wherein performing the distributed processing comprises a database join operation.

315. The method of claim 315,
wherein performing the distributed processing comprises:
allocating different first data structure portions to different database acceleration integrated circuits of the one or more groups; and
performing multiple iterations of newly allocating different second data structure portions to the different database acceleration integrated circuits of the one or more groups, and processing the first data structure portions and the second data structure portions by the database acceleration integrated circuits.
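One way to picture the repeated allocation in this claim is a ring rotation, sketched below in Python. This is an assumed scheme, not one stated in the claims: each accelerator keeps its first data structure portion fixed, joins it against the second data structure portion it currently holds, and then passes that portion to its neighbor. The `distributed_join` name and the `(key, payload)` row layout are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, str]  # (join key, payload); a hypothetical row layout


def distributed_join(
    first_parts: List[List[Row]], second_parts: List[List[Row]]
) -> List[Tuple[int, str, str]]:
    """Join two partitioned tables by rotating the second table's portions."""
    n = len(first_parts)
    if len(second_parts) != n:
        raise ValueError("expected one portion of each structure per accelerator")
    held = list(second_parts)  # held[i] is the portion now at accelerator i
    joined: List[Tuple[int, str, str]] = []
    for _ in range(n):
        for node in range(n):
            # Local hash join between the fixed first portion and the
            # second portion currently allocated to this accelerator.
            index: Dict[int, List[str]] = {}
            for key, payload in held[node]:
                index.setdefault(key, []).append(payload)
            for key, payload in first_parts[node]:
                for other in index.get(key, []):
                    joined.append((key, payload, other))
        # New allocation: every accelerator passes its portion to a neighbor.
        held = held[-1:] + held[:-1]
    return joined
```

After as many iterations as there are accelerators, every first portion has met every second portion exactly once, so no accelerator ever needs to hold the whole second data structure.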

317. The method of claim 317,
wherein the new allocating of a next iteration is performed in a manner that at least partially overlaps in time with the processing of the current iteration.

317. The method of claim 317,
wherein said new allocating comprises exchanging portions of a second data structure between said different database acceleration integrated circuits.

319. The method of claim 319,
wherein said exchanging step is performed in a manner that overlaps at least partially in time with said processing step.

317. The method of claim 317,
wherein the new allocating comprises exchanging second data structure portions between the different database acceleration integrated circuits of a group and, upon completion of the exchanging, exchanging second data structure portions between different groups of database acceleration integrated circuits.

280. The method of claim 280,
wherein the database acceleration integrated circuit is included in a blade comprising multiple database acceleration integrated circuits, one or more non-volatile memory units, an Ethernet switch, a PCIe switch, and the multiple memory processing integrated circuits.

database acceleration integrated circuits; and
each memory processing integrated circuit comprising a controller, multiple processor subunits, and multiple memory processing integrated circuits comprising multiple memory units;
wherein the network communication interface of the database acceleration integrated circuit is configured to receive an amount of information from a storage unit;
the database acceleration integrated circuit is configured to perform primary processing of the amount of information to provide primary processed information;
the memory controller of the database acceleration integrated circuit is configured to transmit the primary processed information to the multiple memory processing integrated circuits through an interface;
the multiple memory processing integrated circuits are configured to perform secondary processing of at least portions of the primary processed information to provide secondary processed information;
the memory controller of the database acceleration integrated circuit is configured to retrieve information from the multiple memory processing integrated circuits, wherein the retrieved information comprises at least one of (a) at least a portion of the primary processed information and (b) at least a portion of the secondary processed information;
the database acceleration unit of the database acceleration integrated circuit is configured to perform a database processing operation on the retrieved information to provide a database acceleration result;
and the database acceleration integrated circuit is configured to output the database acceleration result.
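
The data flow recited above (receive, primary processing, transmit to memory processing circuits, secondary processing, retrieve, database operation, output) can be illustrated with a minimal Python sketch. This is not the claimed hardware: the class `MemoryProcessingIC`, the functions `primary_process` and `run_pipeline`, the row fields `id` and `qty`, and the choice of filtering, sorting, and summing as the three processing stages are all assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, int]


def primary_process(rows: List[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Hypothetical primary processing on the accelerator: filter rows."""
    return [r for r in rows if predicate(r)]


@dataclass
class MemoryProcessingIC:
    """Hypothetical stand-in for a memory processing integrated circuit."""
    stored: List[Row] = field(default_factory=list)

    def write(self, rows: List[Row]) -> None:
        # The memory controller transmits primary processed information here.
        self.stored = list(rows)

    def secondary_process(self, key: str) -> None:
        # Hypothetical secondary processing near memory: sort in place.
        self.stored.sort(key=lambda r: r[key])

    def read(self) -> List[Row]:
        # The memory controller retrieves the processed information.
        return list(self.stored)


def run_pipeline(storage_rows: List[Row], memory_ic: MemoryProcessingIC) -> int:
    """Receive -> primary -> transmit -> secondary -> retrieve -> DB operation."""
    primary = primary_process(storage_rows, lambda r: r["qty"] > 0)
    memory_ic.write(primary)
    memory_ic.secondary_process("id")
    retrieved = memory_ic.read()
    # Hypothetical database processing operation: an aggregate sum.
    return sum(r["qty"] for r in retrieved)
```

For example, calling `run_pipeline` on three rows whose `qty` values are 5, 0, and 7 filters out the zero-quantity row and returns the sum of the remaining quantities.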

323. The method of claim 323,
wherein the apparatus is configured to use a management unit of the database acceleration integrated circuit to manage at least one of the retrieval, the primary processing, and the secondary processing of the retrieved information.

325. The method of claim 324,
wherein the management unit is configured to manage according to an execution plan generated by the management unit of the database acceleration integrated circuit.

325. The method of claim 324,
wherein the management unit is configured to manage according to an execution plan that is received by, rather than generated by, the management unit of the database acceleration integrated circuit.

325. The method of claim 324,
wherein the management unit is configured to allocate and manage one or more of (a) a network communication interface resource, (b) a decompression unit resource, (c) a memory controller resource, (d) a multiple memory processing integrated circuit resource, and (e) a database acceleration unit resource.

323. The method of claim 323,
wherein the network communication interface comprises different types of network communication ports.

339. The method of claim 328,
wherein the different types of network communication ports include a storage interface protocol port and a storage interface protocol port over a general network.

339. The method of claim 328,
wherein the different types of network communication ports include a storage interface protocol port and a storage interface protocol port over Ethernet.

339. The method of claim 328,
wherein the different types of network communication ports include a storage interface protocol port and a PCIe port.

323. The method of claim 323,
wherein the device is coupled to a management unit comprising a compute node of a compute system and controlled by a manager of the compute system.

323. The method of claim 323,
wherein the device is configured to be controlled by a compute node of a compute system.

323. The method of claim 323,
wherein the apparatus is configured to execute multiple tasks concurrently by the database acceleration integrated circuit.

323. The method of claim 323,
wherein the database acceleration integrated circuit belongs to a computing system.

323. The method of claim 323,
wherein the database acceleration integrated circuit does not belong to a computing system.

323. The method of claim 323,
wherein the apparatus is configured to execute at least one of retrieval, primary processing, forwarding, and tertiary processing based on an execution plan transmitted by a compute node of a computer system to the database acceleration integrated circuit.

323. The method of claim 323,
wherein the database acceleration unit is configured to execute database processing instructions concurrently by database processing subunits, and the database acceleration unit comprises a group of database accelerator subunits that share a shared memory unit.

339. The method of claim 338,
wherein each database processing subunit is configured to execute a specific type of database process instruction.

329. The method of claim 329,
wherein the apparatus is configured to dynamically couple database processing subunits to provide an execution pipeline necessary to execute a database process operation comprising multiple instructions.
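
As a rough software analogy for dynamically coupling subunits into an execution pipeline, the Python sketch below chains stage functions in the order given by a list of instructions. The `SUBUNITS` table, the instruction names `filter`, `project`, and `sum`, and the function `build_pipeline` are hypothetical and are not taken from the claims; each entry merely plays the role of a subunit that executes one specific type of instruction.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, int]
Stage = Callable[[List[Row], Any], List[Row]]

# Hypothetical subunits, each handling one specific instruction type.
SUBUNITS: Dict[str, Stage] = {
    # arg = (column, minimum value)
    "filter": lambda rows, arg: [r for r in rows if r[arg[0]] >= arg[1]],
    # arg = list of columns to keep
    "project": lambda rows, arg: [{k: r[k] for k in arg} for r in rows],
    # arg = column to aggregate
    "sum": lambda rows, arg: [{arg: sum(r[arg] for r in rows)}],
}


def build_pipeline(
    instructions: List[Tuple[str, Any]]
) -> Callable[[List[Row]], List[Row]]:
    """Couple the subunits named by the instructions into one pipeline."""
    stages = [(SUBUNITS[op], arg) for op, arg in instructions]

    def run(rows: List[Row]) -> List[Row]:
        # The output of each stage feeds the next stage.
        for subunit, arg in stages:
            rows = subunit(rows, arg)
        return rows

    return run
```

A different multi-instruction operation only needs a different instruction list; the subunits themselves are unchanged, which is the point of coupling them at run time.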

323. The method of claim 323,
wherein the apparatus is configured to allocate resources of the database acceleration integrated circuit according to temporal I/O bandwidth.

323. The method of claim 323,
wherein the device comprises local storage accessible by the database acceleration integrated circuit.

323. The method of claim 323,
wherein the network communication interface comprises an RDMA unit.

323. The method of claim 323,
wherein the apparatus comprises one or more groups of database acceleration integrated circuits configured to exchange information between database acceleration integrated circuits of one or more groups of database acceleration integrated circuits.

323. The method of claim 323,
wherein the apparatus comprises one or more groups of database acceleration integrated circuits configured to exchange acceleration results between database acceleration integrated circuits of one or more groups of database acceleration integrated circuits.

323. The method of claim 323,
wherein the apparatus comprises one or more groups of database acceleration integrated circuits configured to exchange at least one of (a) information and (b) database acceleration results between database acceleration integrated circuits of the one or more groups of database acceleration integrated circuits.

346. The method of claim 346,
wherein the database acceleration integrated circuits of a group are connected to the same printed circuit board.

346. The method of claim 346,
wherein the database acceleration integrated circuits of a group belong to a modular unit of a computing system.

346. The method of claim 346,
wherein different groups of database acceleration integrated circuits are connected to different printed circuit boards.

346. The method of claim 346,
wherein different groups of database acceleration integrated circuits belong to a modular unit of a computing system.

346. The method of claim 346,
wherein said exchange is effected utilizing a network communication interface of said database acceleration integrated circuit of one or more groups.

346. The method of claim 346,
wherein the exchange is carried out over multiple groups connected to each other by a star connection.

346. The method of claim 346,
wherein the apparatus is configured to use at least one switch to exchange at least one of (a) information and (b) database acceleration results between different groups of database acceleration integrated circuits in the one or more groups.

346. The method of claim 346,
wherein the apparatus is configured to execute a distributed process by a portion of the database acceleration integrated circuits in a subset of the one or more groups.

346. The method of claim 346,
wherein the apparatus is configured to perform distributed processing using a first data structure and a second data structure, wherein a total size of the first data structure and the second data structure is greater than a storage capacity of the multiple memory processing integrated circuits.

355. The method of claim 355,
wherein the apparatus is configured to perform the distributed processing by executing multiple iterations of (a) newly allocating different pairs of first data structure portions and second data structure portions to different database acceleration integrated circuits and (b) processing the different pairs.

355. The method of claim 355,
wherein the distributed processing comprises a database join operation.

355. The method of claim 355,
wherein the device is configured to perform the distributed processing by:
allocating different first data structure portions to different database acceleration integrated circuits of the one or more groups; and
executing multiple iterations of newly allocating different second data structure portions to the different database acceleration integrated circuits of the one or more groups and processing the first data structure portions and the second data structure portions by the database acceleration integrated circuits.
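
One way to read the iteration scheme above is as a rotating join: each circuit keeps its first data structure portion fixed while the second data structure portions are reallocated on every iteration, so that every pair of portions meets exactly once. The Python sketch below simulates that reading in a single process. The functions `split`, `local_join`, and `distributed_join`, the round-robin partitioning, and the rotate-by-one reallocation are assumptions chosen for illustration; the claims do not specify them.

```python
from typing import Dict, List

Row = Dict[str, int]


def split(rows: List[Row], n: int) -> List[List[Row]]:
    """Partition rows round-robin into n portions (a hypothetical choice)."""
    return [rows[i::n] for i in range(n)]


def local_join(a_part: List[Row], b_part: List[Row], key: str) -> List[Row]:
    """Hash join of one first-structure portion with one second-structure portion."""
    index: Dict[int, List[Row]] = {}
    for b in b_part:
        index.setdefault(b[key], []).append(b)
    return [{**a, **b} for a in a_part for b in index.get(a[key], [])]


def distributed_join(a_rows: List[Row], b_rows: List[Row], key: str, n_chips: int) -> List[Row]:
    """Simulate n_chips circuits joining two structures over n_chips iterations."""
    a_parts = split(a_rows, n_chips)  # allocated once, never moved
    b_parts = split(b_rows, n_chips)  # reallocated on every iteration
    results: List[Row] = []
    for _ in range(n_chips):
        for chip in range(n_chips):
            results.extend(local_join(a_parts[chip], b_parts[chip], key))
        # New allocation: rotate the second-structure portions by one chip.
        b_parts = b_parts[1:] + b_parts[:1]
    return results
```

Because only one portion of each structure is resident on a circuit at a time, the two structures together may exceed what the circuits can hold at once, at the cost of running as many iterations as there are portions.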

358. The method of claim 358,
wherein the apparatus is configured to execute the new allocation of the next iteration in a manner that at least partially overlaps in time with the processing of the current iteration.

358. The method of claim 358,
wherein the apparatus is configured to effect the new allocation by exchanging second data structure portions between the different database acceleration integrated circuits.

360. The method of claim 360,
wherein said exchange is performed in a manner that at least partially overlaps in time with said processing by said database acceleration integrated circuit.
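
The time overlap between exchanging and processing resembles double buffering: the transfer for the next iteration is started before the current portion is processed. The Python sketch below shows the pattern with a single background worker from the standard library. The function `process_with_overlap` and its `fetch` and `process` callbacks are hypothetical stand-ins for the exchange and the processing; nothing here reflects how the claimed circuits schedule transfers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

P = TypeVar("P")  # portion identifier
D = TypeVar("D")  # exchanged data
R = TypeVar("R")  # processing result


def process_with_overlap(
    portions: Sequence[P],
    fetch: Callable[[P], D],
    process: Callable[[D], R],
) -> List[R]:
    """Process portions in order while prefetching the next one in the background."""
    results: List[R] = []
    if not portions:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, portions[0])
        for nxt in portions[1:]:
            current = pending.result()         # wait for the current exchange
            pending = pool.submit(fetch, nxt)  # start the next exchange
            results.append(process(current))   # runs while that exchange proceeds
        results.append(process(pending.result()))
    return results
```

When the exchange takes no longer than the processing, its latency is hidden behind the processing of the previous portion, so only the first exchange is fully exposed.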

358. The method of claim 358,
wherein the device is configured to execute the new allocation by exchanging second data structure portions between the different database acceleration integrated circuits of a group and, when there are no more second data structure portions to exchange within the group, exchanging second data structure portions between different groups of database acceleration integrated circuits.

323. The method of claim 323,
wherein the database acceleration integrated circuit is included in a blade comprising multiple database acceleration integrated circuits, one or more non-volatile memory units, an Ethernet switch, a PCIe switch, and the multiple memory processing integrated circuits.

retrieving, by the network communication interface of the database acceleration integrated circuit, an amount of information from the storage unit;
performing primary processing of the amount of information to provide primary processed information;
transmitting, by the memory controller of the database acceleration integrated circuit, the primary processed information to multiple memory resources through an interface;
retrieving information from the multiple memory resources;
performing, by the database acceleration unit of the database acceleration integrated circuit, a database processing operation on the retrieved information to provide a database acceleration result; and
outputting the database acceleration result.

364. The method of claim 364,
further comprising secondary processing the primary processed information to provide secondary processed information, wherein the secondary processing is executed by multiple processors located within one or more memory processing integrated circuits that further comprise the multiple memory resources.

364. The method of claim 364,
wherein said primary processing comprises filtering of database entries.

364. The method of claim 364,
wherein said secondary processing comprises filtering of database entries.

364. The method of claim 364,
wherein said primary processing and said secondary processing comprise filtering of database entries.