KR100344065B1

KR100344065B1 - Shared memory multiprocessor system based on multi-level cache

Info

Publication number: KR100344065B1
Application number: KR1020000006953A
Authority: KR
Inventors: 장성태; 전주식; 서효중; 엄성용
Original assignee: 전주식; 장성태; 서효중; 엄성용
Priority date: 2000-02-15
Filing date: 2000-02-15
Publication date: 2002-07-24
Also published as: KR20010083446A

Abstract

The present invention relates to a shared-memory multiprocessor device that uses a snooping interconnection network. To improve performance, multiple levels of cache are placed between the processors and the shared memory, and an access list is provided as a structure that removes the performance degradation caused by the multi-level cache inclusion property, under which a larger-level cache must contain the valid contents of the smaller-level caches. The access list indicates which blocks are stored in the lower-level caches, so the higher-level cache does not need to store duplicate copies of those blocks.

Description

SHARED MEMORY MULTIPROCESSOR SYSTEM BASED ON MULTI-LEVEL CACHE

The present invention relates to a shared-memory multiprocessor device, and in particular to a device that can manage, level by level, the several caches on the connection path between processors in a shared-memory multiprocessor device that includes a snooping interconnection network.

In general, a large-scale shared memory multiprocessor system with a single address space and coherent caches provides a flexible and powerful computing environment. A single address space and coherent caches simplify data partitioning and dynamic load balancing, and provide a better environment for parallelizing compilers, standard operating systems, and multiprogramming, so that the machine can be used more flexibly and effectively.

FIG. 1 shows a Uniform Memory Access (UMA) architecture as one example of such a shared memory multiprocessor system. As shown, the UMA architecture has a single shared memory (30), which is connected to a plurality of processor nodes (100A to 100C) through a global bus (21). Each processor node (100A to 100C) contains a secondary cache (13A to 13C), and the secondary caches (13A to 13C) are connected to processor modules (10A to 10F) through local buses (20A to 20C). Each processor module (10A to 10F) includes a primary cache (12A to 12F) and a processor (11A to 11F), as shown.

In the UMA architecture configured as above, the primary and secondary caches (12A to 12F, 13A to 13C), which are smaller than the shared memory (30) but offer much faster access times, reduce the number of requests and responses on the inter-processor interconnection network (the local buses (20A to 20C) and the global bus (21)) and provide low latency for memory access requests from the processors (11A to 11F).

However, the use of caches (12A to 12F, 13A to 13C) raises the so-called cache coherence problem: when a processor (11A to 11F) writes to its own primary cache (12A to 12F), the result of that write must be reflected in the corresponding data block of every cache (12A to 12F, 13A to 13C) in the system. Snooping schemes and directory schemes are the two approaches widely used to maintain cache coherence in shared memory multiprocessors.

In a snooping coherence scheme, when a memory access request is issued by a processor (11A to 11F), the caches (12A to 12F, 13A to 13C) directly connected to the transfer path (the local buses (20A to 20C) and the global bus (21)) check the state of their own block for the requested address and perform the necessary actions, such as responding, accordingly. Each cache (12A to 12F, 13A to 13C) stores state information for the blocks it holds, and a block's state is one of shared, modified, or invalid. Shared means that the block is also stored in another cache (12A to 12F, 13A to 13C); modified means that the block in the cache has been modified; invalid means that the block has been invalidated, that is, no meaningful block contents are stored.
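The state check described above can be sketched in a few lines of Python. This is only an illustration, not the patent's implementation: the names `BlockState`, `SnoopingCache`, and `snoop_bus` are hypothetical, and a real snooping cache would also handle state transitions and data transfer.

```python
from enum import Enum


class BlockState(Enum):
    """The three block states named in the description."""
    SHARED = "shared"
    MODIFIED = "modified"
    INVALID = "invalid"


class SnoopingCache:
    """Hypothetical cache that only tracks per-block state."""

    def __init__(self, name):
        self.name = name
        self.states = {}  # block address -> BlockState

    def state_of(self, addr):
        # A block that was never stored is treated as invalid.
        return self.states.get(addr, BlockState.INVALID)

    def snoop(self, addr):
        """Return True if this cache holds the block in a valid state."""
        return self.state_of(addr) is not BlockState.INVALID


def snoop_bus(caches, addr):
    """Every cache on the bus checks its own state for the address."""
    return [c.name for c in caches if c.snoop(addr)]


cache_12a = SnoopingCache("12A")
cache_12b = SnoopingCache("12B")
cache_12b.states[0x40] = BlockState.MODIFIED

responders = snoop_bus([cache_12a, cache_12b], 0x40)
print(responders)
```

Here only cache 12B reports a valid copy of address 0x40, so it is the only candidate to respond.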

In this configuration, when a particular processor (for example, 11A) issues a memory access request, the request is first presented to the primary cache (12A). If the requested memory block is not present in a valid state in the primary cache (12A) (that is, the block is in the invalid state or is absent), the memory access request is placed on the local bus (20A). In response to the request on the local bus (20A), the secondary cache (13A) and the other primary cache (12B) on the local bus (20A) are searched. If neither the secondary cache (13A) nor the other primary cache (12B) holds the block in a valid state, the memory access request is forwarded over the global bus (21) to the shared memory (30) and to the other secondary caches (13B and 13C) directly connected to the global bus (21).

In this way, a memory access request from the processor (11A) passes through hierarchical stages and is propagated progressively across the whole system through the interconnection network (20A to 20C, 21), and the caches (12A to 12F, 13A to 13C) on the transfer path are searched for a block matching the requested address. Because all caches (12A to 12F, 13A to 13C) in a shared memory multiprocessor architecture are connected by the interconnection network, the state of every cache (whether it stores the block) must be checked for each memory access request, and the corresponding actions must follow.

The caches (12A to 12F, 13A to 13C) store fixed-size blocks of the shared memory (30) and are configured to supply a block's contents when a memory access address issued by a processor (11A to 11F) matches the address of a block stored in the cache. Therefore, the larger the capacity of the caches (12A to 12F, 13A to 13C), the more often the processors (11A to 11F) can obtain block data without accessing the shared memory (30).

The primary way to raise system performance is to increase the number of processors the system can accommodate. However, the bandwidth a single interconnection network can provide is limited, so building a system with many processors requires a multi-level interconnection network (buses) and multi-level caches, as in FIG. 1.

The primary caches (12A to 12F) are configured to supply memory blocks for memory access requests from the directly connected processors (11A to 11F) without, as far as possible, issuing requests to the upper-level buses (20A to 20C). This first allows memory access requests from the processors (11A to 11F) to complete quickly, and second reduces the number of memory access requests placed on the local buses (20A to 20C), easing the bottleneck those requests cause. Likewise, the secondary caches (13A to 13C) serve requests for memory blocks that are not valid in the primary caches (12A to 12F), which speeds up block delivery and reduces the bottleneck on the global bus (21). High-performance architectures therefore use multi-level caches (12A to 12F, 13A to 13C): on the access path from the processors (11A to 11F) to the shared memory (30), the caches (12A to 12F) nearer the processors are faster and smaller, and the caches (13A to 13C) nearer the shared memory (30) are progressively larger. For this reason, the caches (12A to 12F) placed close to the processors (11A to 11F) are referred to as smaller-level caches (primary caches), and the caches (13A to 13C) closer to the shared memory (30) are referred to as larger-level caches (secondary caches).

However, providing coherence for each memory access request requires searching every cache (12A to 12F, 13A to 13C) in the system, which lengthens the delay needed to maintain coherence. To address this, a basic technique is to enforce an inclusion property between the cache levels, under which the larger-level secondary caches (13A to 13C) keep the valid memory blocks stored in the smaller-level primary caches (12A to 12F). This narrows the range of caches that must be searched and thereby shortens the processing delay caused by snooping.

The inclusion property restricts a larger-level cache (13A to 13C) so that it always contains every valid block present in the smaller-level caches (12A to 12F) directly connected to it through the interconnection network, that is, the local bus (20A to 20C). In FIG. 1, the multi-level cache inclusion property is said to hold when cache (13A) contains every valid block in caches (12A and 12B), cache (13B) contains every valid block in caches (12C and 12D), and cache (13C) contains every valid block in caches (12E and 12F).

The multi-level cache inclusion property simplifies the cache protocol. A block that is not valid in cache (13A) is guaranteed not to be valid in caches (12A and 12B); a block that is not valid in cache (13B) is guaranteed not to be valid in caches (12C and 12D); and a block that is not valid in cache (13C) is guaranteed not to be valid in caches (12E and 12F). Therefore, if a search of a larger-level cache (13A to 13C) does not find the block for some address, it is known, without searching the smaller-level caches (12A to 12F) directly connected to that cache, that they do not hold the block either. Using the inclusion property, cache coherence can thus be maintained without delivering every request that appears on the local buses (20A to 20C) or the global bus (21) to all caches (12A to 12F, 13A to 13C).
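The filtering effect of inclusion can be illustrated with plain sets. This is a minimal sketch under simplifying assumptions: each cache is modeled as the set of block addresses it holds in a valid state, and the helper names `inclusion_holds` and `l1_caches_to_snoop` are hypothetical.

```python
def inclusion_holds(l2_blocks, l1_caches):
    """True if every valid L1 block is also valid in the L2 cache."""
    return all(blocks <= l2_blocks for blocks in l1_caches.values())


def l1_caches_to_snoop(l2_blocks, l1_caches, addr):
    """Under inclusion, an L2 miss rules out every attached L1 cache.

    On an L2 hit the attached L1 caches may still need to be searched,
    so all of them are returned.
    """
    if addr not in l2_blocks:
        return []
    return sorted(l1_caches)


# Node 100A from FIG. 1: L2 cache 13A above L1 caches 12A and 12B.
cache_13a = {0x100, 0x200, 0x300}
node_l1 = {"12A": {0x100}, "12B": {0x200}}

print(inclusion_holds(cache_13a, node_l1))
print(l1_caches_to_snoop(cache_13a, node_l1, 0x400))
print(l1_caches_to_snoop(cache_13a, node_l1, 0x100))
```

A request for 0x400 misses in 13A, so neither 12A nor 12B has to be searched; a request for 0x100 hits in 13A, so the L1 caches remain candidates.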

For example, suppose the processor (11A) of FIG. 1 issues a memory read request and the requested block is not valid in the primary cache (12A). The read request is then presented over the local bus (20A) to cache (12B) and cache (13A). In response, cache (12B) and cache (13A) each check whether the requested block is present in a valid state, and if so, the cache that is responsible for responding, as determined by the block's state, supplies the requested block to the processor (11A). Response responsibility is determined by ownership of the requested block. To maintain inclusion, cache (13A) must hold everything stored in cache (12B), but with a write-back policy cache (13A) may not hold the correct contents of a block in cache (12B). Write-back means that, for a write request from a processor (for example, 11B), the written data is stored in cache (12B) and the update of the shared memory (30) is deferred until the block is later evicted from cache (12B). With write-back, the latest block contents stored in the smaller-level cache (12B) are not known to the larger-level cache (13A). Because the larger-level cache (13A) must nevertheless keep a block for that address to maintain inclusion, a block in cache (13A) is allocated to the address (the address of the written block) but does not contain the data most recently written by the processor (11B). In this case cache (13A) satisfies inclusion with respect to cache (12B) but does not hold the correct block, so the responsibility for responding for that block lies with cache (12B).

Cache (13A) has response responsibility for a requested block in the following case: the processor (11B) wrote to the block in the past, so cache (12B) held response responsibility, and then a request for a new block from the processor (11B) caused that block to be selected as the victim and replaced in cache (12B). Block replacement is the process of removing a block already in a cache (for example, 12B) in order to allocate a new block. For example, if cache (12B) can store only ten blocks (B1 to B10) and a new block (B11) is needed, the least-used of the blocks B1 to B10 (for example, B2) is removed and block (B11) is stored in its place.

After this replacement, cache (12B) no longer stores block (B2) but cache (13A) still does, so cache (13A) is responsible for responding to requests for block (B2).
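The ten-block replacement example can be reproduced with a small sketch. The text only says that the least-used block is removed; the least-recently-used (LRU) policy below is an assumption chosen for illustration, and the class name `SmallCache` is hypothetical.

```python
from collections import OrderedDict


class SmallCache:
    """Hypothetical fixed-capacity cache with LRU replacement (assumed policy)."""

    def __init__(self, capacity=10):
        self.capacity = capacity
        self.blocks = OrderedDict()  # least recently used block first

    def access(self, block):
        """Touch a block; return the evicted block name, or None."""
        if block in self.blocks:
            self.blocks.move_to_end(block)  # mark as most recently used
            return None
        victim = None
        if len(self.blocks) >= self.capacity:
            victim, _ = self.blocks.popitem(last=False)  # drop the LRU block
        self.blocks[block] = True
        return victim


cache_12b = SmallCache(capacity=10)
for i in range(1, 11):
    cache_12b.access(f"B{i}")   # fill with B1..B10
cache_12b.access("B1")          # B1 is reused, so B2 becomes the LRU block

evicted = cache_12b.access("B11")
print(evicted)
```

With this access pattern B2 is the block that makes room for B11, matching the example in the text. If cache 13A still holds B2 at that point, it becomes the cache that answers later requests for B2.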

However, if the block for the requested address is not valid in cache (13A), the multi-level cache inclusion property guarantees that it is not valid in cache (12B) either, so cache (13A) immediately forwards the block requested by cache (12A) to the global bus (21), and the block is searched for in caches (13B, 13C) and the shared memory (30). If the block is not valid in caches (13B and 13C), it is guaranteed not to be valid in caches (12C, 12D) below cache (13B) or in caches (12E, 12F) below cache (13C), so the shared memory (30) responds. If instead the block is valid and in the modified state in cache (13B) and is not valid in cache (13C), cache (13B) issues a request for the block on the local bus (20B) to search caches (12C, 12D), so that whichever of them holds the block in the modified state responds. (In this situation the latest block contents are stored in cache (12C or 12D); the contents in cache (13B) are not correct, but cache (13B) keeps the address in a valid, modified state to indicate that the block is validly stored in cache (12C or 12D).) Caches (12E, 12F), which are directly connected to cache (13C), are guaranteed not to hold the block in a valid state, so they need not be searched.

Thus, the multi-level cache inclusion property means that, when several levels of cache (12A to 12F, 13A to 13C) lie on the access path from the processors (11A to 11F) to the shared memory (30), the larger-level caches (13A to 13C) contain, in a valid state, all blocks held by the smaller-level caches (12A to 12F). Under this property, when a lookup for a particular address is requested on the global bus (21) and a larger-level cache (13A to 13C) does not hold the block for that address, it is guaranteed, without searching the smaller-level caches (12A to 12F) that stand in an inclusion relationship with that cache, that they do not hold the block either.

The advantage of the multi-level cache inclusion property is that a memory access request from a processor (11A to 11F) does not require searching all caches (12A to 12F, 13A to 13C), which reduces the delay spent on searching.

Consider the conventional case in which inclusion is maintained between the secondary caches (13A, 13B, 13C) and the primary caches (12A to 12F) in the processor nodes (100A, 100B, 100C). When the processor (11A) issues a memory request that misses in the primary cache (12A) of its processor node, the request is passed to the secondary cache (13A). If the secondary cache (13A) holds a valid copy of the block, it supplies the block; if not, it requests the block on the global bus (21), receives the response, and delivers the block contents over the local bus (20A) to the processor module (10A) that issued the request. To maintain inclusion, the block must be allocated in both the primary cache (12A) and the secondary cache (13A), so it is duplicated in the two caches. If allocating the block delivered over the global bus (21) in the secondary cache (13A) requires evicting a block already in the secondary cache (13A) (that is, if block replacement is needed), the evicted block must first be evicted from every primary cache (12A, 12B) of the same processor node. This eviction takes place regardless of which block the primary cache's own replacement policy would choose. Consequently, when a processor (11A, 11B) later accesses the block that was forcibly evicted, the access misses in all primary caches (12A, 12B) and in the secondary cache (13A) of the node and generates a block request on the global bus (21).
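The forced eviction described above can be simulated with a toy model. This is a sketch under stated assumptions, not the conventional hardware itself: the L2 cache uses LRU replacement, the L1 caches are unbounded sets, L1 hits are invisible to the L2 cache, and the class name `InclusiveNode` is hypothetical.

```python
from collections import OrderedDict


class InclusiveNode:
    """Toy processor node that enforces inclusion between L2 and its L1 caches."""

    def __init__(self, l2_capacity, l1_names):
        self.l2 = OrderedDict()          # LRU order, oldest block first
        self.l2_capacity = l2_capacity
        self.l1 = {name: set() for name in l1_names}
        self.global_requests = 0

    def read(self, l1_name, addr):
        if addr in self.l1[l1_name]:
            return "L1 hit"              # L2 does not see this access
        if addr in self.l2:
            self.l2.move_to_end(addr)
            self.l1[l1_name].add(addr)
            return "L2 hit"
        # Miss in the whole node: fetch the block over the global bus.
        self.global_requests += 1
        if len(self.l2) >= self.l2_capacity:
            victim, _ = self.l2.popitem(last=False)
            for blocks in self.l1.values():
                blocks.discard(victim)   # forced eviction keeps inclusion
        self.l2[addr] = True
        self.l1[l1_name].add(addr)       # block is duplicated in L1 and L2
        return "global bus"


node = InclusiveNode(l2_capacity=2, l1_names=["12A", "12B"])
trace = [
    node.read("12A", 0x100),
    node.read("12A", 0x200),
    node.read("12A", 0x100),   # hot block, served by L1 only
    node.read("12B", 0x300),   # L2 evicts 0x100 and removes it from 12A too
    node.read("12A", 0x100),   # misses everywhere, goes to the global bus
]
print(trace, node.global_requests)
```

Block 0x100 is in active use by cache 12A, yet the L2 replacement removes it from 12A as well, so the final read has to go back to the global bus.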

However, conventional multi-level cache inclusion can lower the hit rate of the larger-level caches (13A to 13C) for memory blocks that are requested from them because they are not valid in the smaller-level caches (12A to 12F). The larger-level cache (13A to 13C) must maintain inclusion over the smaller-level caches (12A to 12F), and an address requested by a smaller-level cache is by definition not present in that smaller-level cache (12A). The block can therefore reside only in the part of the larger-level cache (13A to 13C) that remains after excluding the space occupied by blocks that are valid in the smaller-level caches (12A to 12F).

Multi-level cache inclusion also lowers the probability that a memory block accessed by a processor (11A to 11F) is present in the smaller-level caches (12A to 12F). To preserve inclusion, whenever a block is replaced in a larger-level cache (13A to 13C), the same block must be forcibly evicted from the smaller-level caches (12A to 12F).

Furthermore, in a write-back cache, where data written by a processor (11A to 11F) is stored in the cache and the update of the shared memory (30) can be deferred until the block is later evicted from the cache, the latest block contents stored in the smaller-level caches (12A to 12F) are not known to the larger-level caches (13A to 13C). Because a larger-level cache (13A to 13C) must nevertheless keep a block for that address to maintain inclusion, it can end up with a block that is allocated to the address but does not contain the data most recently written by the processor (11A to 11F). In that case the contents of the block in the larger-level cache are not the latest and hence not valid, yet the block's state must be kept valid for the sake of inclusion, so that the lower-level caches (12A to 12F) can be searched when a request for the block later arrives from the global bus (21). The larger-level cache (13A to 13C) thus wastes a block allocation on contents that are not valid.

Accordingly, an object of the present invention is to provide a device that eliminates multi-level cache inclusion by constructing an access list and storing in it information about the blocks that are valid in the smaller-level caches.

To achieve this object, the present invention provides a shared memory multiprocessor device with a multi-level cache structure in which upper-level and lower-level caches are arranged in processor nodes, characterized in that each processor node contains an access list, and the access list stores state information indicating the blocks stored in the lower-level caches.

In a shared memory multiprocessor architecture according to the present invention, multiple levels of cache are arranged between the processors and the shared memory, and an access list is provided together with the larger-level cache. For every request and response that passes through the smaller-level caches and the upper-level interconnection network as a result of memory access requests from the processors, the address and command of the memory block concerned are monitored to determine whether the block is present in a smaller-level cache, and the state of the access list is set according to whether the block is present in a smaller-level cache.

The access list records, for every block address, whether the block is present in the lower-level caches directly connected to the higher-level cache, so that the possibility of a block's presence can be checked from the access list without searching the contents of the directly connected lower-level caches.

Consequently, between a higher-level cache and the lower-level caches directly connected to it, the access list makes it possible to decide whether snooping must be performed on the lower-level caches without maintaining multi-level cache inclusion. Eliminating inclusion removes the problems mentioned above: the reduced hit ratio of the higher-level cache, the reduced hit ratio of the lower-level caches, and the retention in the higher-level cache of blocks with stale contents.

As a result, the present invention eliminates, in a snooping system, the multi-level cache inclusion under which the higher-level cache must contain every valid block present in the lower-level caches. This removes the need to evict the block of the same address from the lower-level caches whenever a block is evicted from the higher-level cache, and it reduces the duplication between blocks held in the higher-level cache and blocks held in the lower-level caches, improving the hit ratio of both. At the same time, keeping information about the blocks present in the lower-level caches in the access list minimizes unnecessary searches of the lower-level cache contents.

FIG. 1 is a block diagram of a conventional uniform memory access shared memory multiprocessor apparatus with a two-level bus structure;

FIG. 2 is a block diagram of a shared memory multiprocessor apparatus with a multi-level cache structure according to the present invention;

FIG. 3 is a diagram showing the configuration of the secondary cache and the access list in the arrangement of FIG. 2;

FIG. 4 is a diagram showing another embodiment of the access list;

FIG. 5 is a schematic block diagram of a conventional non-uniform memory access shared memory multiprocessor with a two-level bus structure;

FIG. 6 is a diagram showing another embodiment of the shared memory multiprocessor apparatus with a multi-level cache structure according to the present invention;

FIGS. 7 and 8 are diagrams showing still another embodiment of the shared memory multiprocessor apparatus with a multi-level cache structure according to the present invention;

FIG. 9 is a diagram showing another embodiment of the access list;

FIG. 10 is a diagram showing still another embodiment of the shared memory multiprocessor apparatus with a multi-level cache structure according to the present invention.

<Description of reference numerals for the main parts of the drawings>

10A-10F : processor module
11A-11F : processor
12A-12F : primary cache
13A-13C : secondary cache
20A-20C : local bus
21 : global bus
30 : shared memory
40A-40C : node controller
50A-50C : access list
100A-100C : processor node

Hereinafter, the present invention is described in detail with reference to the accompanying drawings.

FIG. 2 shows a snooping-bus-based multiprocessor system with a two-level cache according to an embodiment of the present invention.

In FIG. 2, each processor node (100A to 100C) is connected to the shared memory (30) through a global bus (21) that uses snooping. Each processor node (100A to 100C) contains a plurality of processor modules (10A to 10F), each consisting of a processor (11A to 11F) and a primary cache (12A to 12F), and these processor modules (10A to 10F) are connected through a local bus (20A to 20C) to a secondary cache (13A to 13C), a node controller (40A to 40C), and an access list (50A to 50C).

A node controller (40A to 40C; for example, 40A) checks whether the data block corresponding to a memory access request from the processor modules (10A, 10B) of its processor node (100A) is stored in a valid state in the secondary cache (13A). If it is, the node controller provides the data block to the processor modules (10A, 10B); if it is not, the node controller sends a request over the global bus (21) to the other processor nodes (100B, 100C) and the shared memory (30). In addition, when a memory block request arrives from another processor node (100B, 100C) over the global bus (21), the node controller (40A) checks whether the corresponding data block is stored in a valid state in its own secondary cache (13A) and, at the same time, looks up the state of the block in the access list (50A). Based on the states of the secondary cache (13A) and the access list (50A), it decides how to respond and whether to request snooping of the primary caches (12A, 12B).

An access list (50A to 50C; for example, 50A) has two states. 'Included' indicates that the block at a given shared memory (30) address may exist in a valid state in the primary caches (processor caches) (12A, 12B) of the processor node (100A) containing the access list. 'Excluded' guarantees that the block at that shared memory (30) address does not exist in a valid state in the processor caches (12A, 12B). FIG. 3 shows one embodiment of the access list (50A): a memory that has a dedicated location for each block-sized unit of the entire shared memory (30) address range, in one-to-one correspondence. One bit is statically allocated per memory block address, and that bit indicates whether the block at that address is 'Excluded' from or 'Included' in the processor caches (12A, 12B).
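The one-bit-per-block layout of FIG. 3 can be pictured with a minimal Python sketch. The class name `AccessList`, its method names, and the byte-packed storage are illustrative assumptions rather than anything specified in the patent; only the idea of a statically allocated bit per shared-memory block comes from the text.

```python
class AccessList:
    """Hypothetical model: one bit per shared-memory block.

    Bit value 1 stands for 'Included', 0 for 'Excluded'.
    """

    def __init__(self, memory_bytes: int, block_bytes: int) -> None:
        self.block_bytes = block_bytes
        num_blocks = memory_bytes // block_bytes
        # Every bit starts at 0, so every block starts out 'Excluded'.
        self.bits = bytearray((num_blocks + 7) // 8)

    def _locate(self, address: int) -> tuple:
        # Map a byte address to (byte index, bit index) of its block's bit.
        block = address // self.block_bytes
        return block // 8, block % 8

    def set_included(self, address: int) -> None:
        byte, bit = self._locate(address)
        self.bits[byte] |= 1 << bit

    def set_excluded(self, address: int) -> None:
        byte, bit = self._locate(address)
        self.bits[byte] &= ~(1 << bit) & 0xFF

    def is_included(self, address: int) -> bool:
        byte, bit = self._locate(address)
        return bool((self.bits[byte] >> bit) & 1)
```

Because the bit position is a pure function of the block address, a lookup needs no tag comparison, which is what distinguishes this structure from a cache directory.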

The secondary cache (13A), by contrast, must store dynamic rather than static addresses together with block contents and states. As shown in FIG. 3, it therefore requires a tag (131A) for storing the address, a space (132A) for storing the block contents, and a space (133A) for storing the block state. As in conventional designs, the block state can be shared, modified, or invalid (or, where needed, as in the Berkeley protocol, modified-and-shared).
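For contrast with the access list, the tag, contents, and state fields of a secondary cache entry can be sketched as follows. This is an illustration only: the direct-mapped organization and the names `BlockState`, `CacheLine`, and `SecondaryCache` are assumptions, since the patent does not specify the associativity or any programming interface.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class BlockState(Enum):
    INVALID = "invalid"
    SHARED = "shared"
    MODIFIED = "modified"
    MODIFIED_SHARED = "modified-and-shared"  # Berkeley-style protocols only


@dataclass
class CacheLine:
    tag: int = 0                            # plays the role of the tag (131A)
    data: bytes = b""                       # plays the role of the contents space (132A)
    state: BlockState = BlockState.INVALID  # plays the role of the state space (133A)


class SecondaryCache:
    """Direct-mapped for brevity (an assumption, not taken from the patent)."""

    def __init__(self, num_lines: int, block_bytes: int) -> None:
        self.block_bytes = block_bytes
        self.lines: List[CacheLine] = [CacheLine() for _ in range(num_lines)]

    def _split(self, address: int):
        # Return (line index, tag) for the block containing the address.
        block = address // self.block_bytes
        return block % len(self.lines), block // len(self.lines)

    def lookup(self, address: int) -> Optional[CacheLine]:
        index, tag = self._split(address)
        line = self.lines[index]
        if line.state is not BlockState.INVALID and line.tag == tag:
            return line
        return None

    def fill(self, address: int, data: bytes, state: BlockState) -> None:
        index, tag = self._split(address)
        self.lines[index] = CacheLine(tag, data, state)
```

The tag comparison in `lookup` is the cost that the access list avoids by giving every block address its own fixed bit.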

The initial state of the access lists (50A to 50C) is 'Excluded'. When a valid memory block is allocated in a primary cache (12A to 12F), the state for the address of that block is switched to 'Included'. When a new block must be allocated in a primary cache (12A to 12F) in response to a memory block request from a processor (11A to 11F), a block that exists in a valid state in the primary cache (12A to 12F) may be evicted to make room. This is the block replacement described above. If the block to be evicted has already been written by the processor (11A to 11F) (that is, it is in the modified state) and carries the responsibility for being written back to the shared memory (30), it must be stored in the secondary cache (13A to 13C) or the shared memory (30) as it is evicted from the primary cache (12A to 12F). This operation is called a write-back. When a write-back occurs, the state bit for the block in the access list (50A to 50C) (for example, the state at address 3 if block address 3 is written back) must be switched to 'Excluded'.

When some processor node (for example, 100B or 100C) requests a block on the global bus (21), the processor node (100A) must snoop. If the access list (50A) state for the block (for example, the state at address 2 in the access list (50A) if block address 2 is requested) is 'Included', the request must be forwarded over the local bus (20A) to the primary caches (12A, 12B).

However, if the block is not present in the primary caches (12A, 12B) even though the state at address 2 of the access list (50A) is 'Included', the access list state for that block (for example, the state at address 2 if block address 2 was requested) is switched to 'Excluded' so that later requests for the same block address are not forwarded to the primary caches (12A, 12B).

Describing the operation of the secondary caches (13A to 13C) and access lists (50A to 50C) in more detail, accesses reach them by two paths. The first is a memory access from the processor modules (10A, 10B) within a processor node (for example, 100A); the second is an access arriving from another processor node (100B, 100C) over the global bus (21).

Within the processor node (100A), accesses to the secondary cache (13A) or the access list (50A) are classified as reads, writes, or write-backs according to the type of memory access issued by the processors (11A, 11B) in the node.

A read request arises when the block at the address a processor (11A) requests misses in the primary cache (12A) of its processor module (10A), that is, when the block is absent from the primary cache (12A) or has been invalidated. If the requested block exists in a valid state in the secondary cache (13A), the state for that block address in the access list (50A) (for example, address 2 if the block address is 2) is switched to 'Included', and the secondary cache (13A) responds by providing the block to the processor (11A) and the primary cache (12A).

If the block does not exist in a valid state in the secondary cache (13A), the node controller (40A) issues a read request for the block on the global bus (21). The read request is broadcast to the shared memory (30) and the other processor nodes (100B, 100C). If the node controllers (40B, 40C) in the other processor nodes (100B, 100C) find that the requested block is not valid in their secondary caches (13B, 13C) and is 'Excluded' in their access lists (50B, 50C), the shared memory (30) responds by providing the block.

A block exists in a valid, modified state in another processor node (100B or 100C) when either that node's secondary cache (13B or 13C) or one of its primary caches (one of 12C, 12D, 12E, 12F) holds the block in a valid, modified state; the cache holding it in that state provides the block and handles the response. Once the response is made and the requested block is delivered over the global bus (21) to the secondary cache (13A) under the control of the node controller (40A), the secondary cache (13A) allocates the block (131A, 132A, 133A) and passes it on to the primary cache (12A) that requested it. As the block is provided to the primary cache (12A), the state for that block address in the access list (50A) is switched to 'Included'.
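The local read-miss path described in the preceding paragraphs can be summarized in a short sketch. Dictionaries and a set stand in for the hardware structures, and the function name `local_read_miss` and its parameters are hypothetical; the ordering of the checks follows the text.

```python
def local_read_miss(block, l2, access_list, remote_owners, memory):
    """Serve a primary-cache read miss; return (data, source).

    l2            : dict block -> data, valid blocks of the local secondary cache
    access_list   : set of blocks currently marked 'Included' in this node
    remote_owners : dict block -> data, modified copies held by other nodes
    memory        : dict block -> data, the shared memory
    """
    if block in l2:
        data, source = l2[block], "secondary cache"
    else:
        # Read request on the global bus: a remote cache holding a modified
        # copy responds; otherwise the shared memory provides the block.
        if block in remote_owners:
            data, source = remote_owners[block], "remote cache"
        else:
            data, source = memory[block], "shared memory"
        l2[block] = data  # the secondary cache allocates the received block
    # The block is now handed to a primary cache, so it becomes 'Included'.
    access_list.add(block)
    return data, source
```

In every branch the access list ends up 'Included', because in each case the block is delivered to the requesting primary cache.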

When a read request is placed on the global bus (21), the processor nodes (100B, 100C) other than the node (100A) that issued it receive the request from the global bus (21). The node controllers (40B, 40C) look up the requested block address in the secondary caches (13B, 13C) and access lists (50B, 50C) and handle the request according to their states. If a valid block exists in a secondary cache (13B, 13C) and that cache is responsible for responding, the primary caches (12C, 12D, 12E, 12F) are not searched over the local bus (20B, 20C), regardless of the access list (50B, 50C) state, and the secondary cache (whichever of 13B or 13C holds the block in the modified state) responds under the control of its node controller (40B, 40C). If no valid block exists in the secondary cache (13B, 13C) and the access list (50B, 50C) state is 'Excluded', the node controller (40B, 40C) takes no part in the response. If the access list (50B, 50C) state is 'Included', the request is forwarded over the local bus (20B, 20C) to the primary caches (12C, 12D, 12E, 12F), which are searched for the block. If the block state in one of the primary caches (12C, 12D, 12E, 12F) makes it responsible for responding, the requested block is read from that primary cache and the response is sent over the local bus (20B, 20C) to the global bus (21) under the control of the node controller (40B, 40C). If no primary cache (12C, 12D, 12E, 12F) holds a block responsible for responding, no response is made.

If, in the course of this search, the requested block is not present in the primary caches (12C, 12D, 12E, 12F), the access list (50B, 50C) state is switched to 'Excluded' so that further requests for the same block address are no longer forwarded.
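The handling of a read request arriving from the global bus amounts to a small decision procedure, sketched below. The function `handle_remote_read`, the `Action` enum, and the use of sets are illustrative assumptions. One case the text leaves open, a secondary cache that holds the block without being responsible for it, is treated here like a secondary-cache miss; that is an interpretation, not a statement of the patent.

```python
from enum import Enum


class Action(Enum):
    RESPOND_FROM_L2 = 1
    RESPOND_FROM_L1 = 2
    NO_RESPONSE = 3


def handle_remote_read(block, l2_owned, included, l1_blocks, l1_owned):
    """Return (action, local_bus_used) for a read seen on the global bus.

    l2_owned  : blocks the secondary cache is responsible for answering
    included  : access list, the set of blocks marked 'Included'
    l1_blocks : blocks valid in any primary cache of this node
    l1_owned  : blocks some primary cache is responsible for answering
    """
    if block in l2_owned:
        # The secondary cache answers; primary caches are never searched.
        return Action.RESPOND_FROM_L2, False
    if block not in included:
        # 'Excluded' guarantees absence, so the local bus stays idle.
        return Action.NO_RESPONSE, False
    # 'Included': forward the request over the local bus.
    if block not in l1_blocks:
        # Stale entry: correct it so later requests are filtered out.
        included.discard(block)
        return Action.NO_RESPONSE, True
    if block in l1_owned:
        return Action.RESPOND_FROM_L1, True
    return Action.NO_RESPONSE, True
```

The second element of the result shows when the local bus is disturbed: only while the access list says 'Included', and at most once for a stale entry.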

Write requests are further divided according to whether or not the block at the written address is in a valid state in the primary cache (12A) of the processor module (10A).

If the block is valid in the primary cache (12A), there are again two cases: the block may be shared by several caches, or the primary cache (12A) is guaranteed to hold the only valid copy. If the copy is guaranteed to be unique, the write completes without any request to the secondary cache (13A). If the block may be shared, the write cannot complete until the shared block has been invalidated system-wide. In this case an invalidation request is placed on the local bus (20A); the secondary cache (13A) invalidates its copy of the block in response, the access list (50A) state is kept 'Included', and the node controller (40A) issues the invalidation request on the global bus (21). When the invalidation request appears on the global bus (21), every processor node (100B, 100C), excluding the shared memory (30) and the requesting processor node (100A), invalidates the block in its secondary cache (13B, 13C) and primary caches (12C, 12D, 12E, 12F) and switches its access list (50B, 50C) state to 'Excluded'.
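The write-hit case can be illustrated with a toy model. The `Node` class, the single primary cache per node, and the use of the string "modified" to mean "guaranteed unique copy" are simplifying assumptions made for this sketch; the sequence of invalidations and access list updates follows the paragraph above.

```python
class Node:
    """Toy processor node with one primary cache (an assumption for brevity)."""

    def __init__(self, name):
        self.name = name
        self.l1 = {}           # block -> "shared" or "modified"
        self.l2 = set()        # blocks valid in the secondary cache
        self.included = set()  # access list: blocks marked 'Included'


def write_hit(writer, others, block):
    """Handle a write that hits in the writer's primary cache.

    Returns the list of bus transactions that were needed.
    """
    if writer.l1[block] == "modified":
        # Unique copy: the write completes with no bus traffic.
        return []
    bus = ["local invalidate", "global invalidate"]
    writer.l2.discard(block)     # the local secondary cache drops its copy
    writer.included.add(block)   # the writer's access list stays 'Included'
    for node in others:
        node.l2.discard(block)
        node.l1.pop(block, None)
        node.included.discard(block)  # remote access lists become 'Excluded'
    writer.l1[block] = "modified"
    return bus
```

After the first write the block is unique to the writer, so a second call for the same block returns an empty transaction list.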

If, on a write request, the block at the written address is not valid in the primary cache (12A) of the processor module (10A), a valid block must first be allocated in the primary cache (12A). A write-allocation request is therefore issued over the local bus (20A) and delivered to the secondary cache (13A). There are two cases: the block exists in the secondary cache (13A) or it does not. If it exists, there are again two sub-cases: the block may be shared by several caches, or the secondary cache (13A) is guaranteed to hold the only copy. In the latter sub-case, the secondary cache (13A) provides the block to the primary cache (12A) over the local bus (20A) and invalidates its own copy, and the access list (50A) state is set to 'Included', which completes the write request. In the former sub-case, where the block exists in the secondary cache (13A) but may be shared by several caches, the node controller (40A) issues an invalidation request on the global bus (21). When the invalidation request appears on the global bus (21), every processor node (100B, 100C), excluding the shared memory (30) and the requesting processor node (100A), invalidates the block in its secondary cache (13B, 13C) and primary caches (12C, 12D, 12E, 12F) and switches its access list (50B, 50C) state to 'Excluded'. The secondary cache (13A) of the requesting processor node (100A) then provides the block to the primary cache (12A) over the local bus (20A) and invalidates its own copy, and the access list (50A) state is set to 'Included', which completes the write request.

If, on a write request, the block is not present in the primary cache (12A) of the requesting processor module (10A), and the write-allocation request issued over the local bus (20A) finds that the block is not present in the secondary cache (13A) either, the node controller (40A) issues the write-allocation request on the global bus (21). Among the shared memory (30) and the processor nodes (100B, 100C) other than the requesting node (100A), which receive the request over the global bus (21), the single unit responsible for responding, whether the shared memory (30), a secondary cache (13B, 13C), or a primary cache (12C, 12D, 12E, 12F), handles the response and transfers the block. At the same time, the secondary caches (13B, 13C) and primary caches (12C, 12D, 12E, 12F) of every processor node (100B, 100C) other than the requesting node (100A) invalidate the block, and the access lists (50B, 50C) are set to 'Excluded'. When the node (100A) that issued the write-allocation request receives the block over the global bus (21), the node controller (40A) allocates it in the primary cache (12A) over the local bus (20A) and sets the access list (50A) state to 'Included'.

As described above, the commands that a write request causes the global bus (21) to deliver to the processor nodes (100A, 100B, 100C) are the invalidation request and the write-allocation request. For an invalidation request, the node controller (40A, 40B, 40C) invalidates the block address in the secondary cache (13A, 13B, 13C) and looks up the access list (50A, 50B, 50C) state. If the state is 'Included', the node controller (40A, 40B, 40C) issues an invalidation request over the local bus (20A, 20B, 20C) to invalidate the block in the primary caches (12A to 12F) and switches the access list (50A, 50B, 50C) state to 'Excluded'.

When the command that a write request causes the global bus (21) to deliver to the processor nodes (100A, 100B, 100C) is a write-allocation request, the node controller (40A, 40B, 40C) looks up the block in the secondary cache (13A, 13B, 13C). If the block is in a state that makes the secondary cache responsible for responding, the block is transferred to the global bus (21). At the same time the access list (50A, 50B, 50C) state is looked up; if it is 'Included', an invalidation request is issued over the local bus (20A, 20B, 20C) and the access list (50A, 50B, 50C) state is switched to 'Excluded'.

When the command is a write-allocation request and the node controller (40A, 40B, 40C) finds the block in the secondary cache (13A, 13B, 13C) in a state that does not make the secondary cache responsible for responding, the block in the secondary cache is invalidated. At the same time the access list (50A, 50B, 50C) state is looked up; if it is 'Included', a write-allocation request is issued over the local bus (20A, 20B, 20C). In response, the primary caches (12A to 12F) are searched for the block. If a primary cache holds the block in a state that makes it responsible for responding, it responds over the local bus (20A, 20B, 20C) and its copy of the block is invalidated; the node controller (40A, 40B, 40C) then switches the access list (50A, 50B, 50C) state to 'Excluded' and transfers the block to the global bus (21). If the primary cache (12A to 12F) block found by the write-allocation request on the local bus (20A, 20B, 20C) is in a state that carries no responsibility for responding, the primary cache (12A to 12F) block is invalidated and the access list (50A, 50B, 50C) state is switched to 'Excluded'.

When the command is a write-allocation request and the node controller (40A, 40B, 40C) finds that the block does not exist in the secondary cache (13A, 13B, 13C), the access list (50A, 50B, 50C) state is looked up at the same time; if it is 'Included', a write-allocation request is issued over the local bus (20A, 20B, 20C). In response, the primary caches (12A to 12F) are searched for the block. If a primary cache holds the block in a state that makes it responsible for responding, it responds over the local bus (20A, 20B, 20C) and its copy of the block is invalidated; the node controller (40A, 40B, 40C) then switches the access list state to 'Excluded' and transfers the block to the global bus (21). If the primary cache block found by the write-allocation request on the local bus (20A, 20B, 20C) is in a state that carries no responsibility for responding, the primary cache block is invalidated and the access list (50A, 50B, 50C) state is switched to 'Excluded'.

When the command that a write request causes the global bus (21) to deliver to the processor nodes (100A, 100B, 100C) is an invalidation request, the block is invalidated if it exists in the secondary cache (13A, 13B, 13C). At the same time the access list (50A, 50B, 50C) state is looked up; if it is 'Included', an invalidation request is issued on the local bus (20A, 20B, 20C) to invalidate the block in the primary caches (12A to 12F), and the access list (50A, 50B, 50C) state is switched to 'Excluded'.
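The cases above for write-related commands arriving from the global bus share a common shape, which the following sketch tries to capture. The function `handle_remote_write` and its dictionary encoding (block mapped to a flag saying whether that cache is responsible for responding) are assumptions, and the primary caches of a node are collapsed into one dictionary for brevity.

```python
def handle_remote_write(command, block, l2, l1, included):
    """Process an 'invalidate' or 'write-allocate' command from the global bus.

    l2, l1   : dict block -> True if that cache level is responsible for
               responding (l1 merges all primary caches of the node)
    included : access list, the set of blocks marked 'Included'
    Returns which level supplies the block: 'L2', 'L1' or None.
    """
    responder = None
    # The secondary-cache copy, if any, is invalidated in every case.
    l2_responsible = l2.pop(block, False)
    if command == "write-allocate" and l2_responsible:
        responder = "L2"
    if block in included:
        # Only an 'Included' entry causes a request on the local bus,
        # which invalidates the primary-cache copy.
        l1_responsible = l1.pop(block, False)
        if command == "write-allocate" and responder is None and l1_responsible:
            responder = "L1"
        included.discard(block)  # the entry becomes 'Excluded'
    return responder
```

In all four cases the node ends with no copy of the block and an 'Excluded' entry; the cases differ only in which level, if any, supplies the data.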

When a write-back request originates from primary cache 12A, the written-back block is allocated in secondary cache 13A, which takes over responsibility for the write-back, and the state in access list 50A is changed to 'not included'. That is, when an ordinary block replacement occurs, the least recently used block in primary cache 12A is evicted; if that block is in the modified state it cannot simply be discarded, so it is written to secondary cache 13A, removed from primary cache 12A, and the state in access list 50A is then changed to 'not included'. Under the Berkeley protocol, however, a cache block can be in a modified-and-shared state. A block to be written back in that state may also be held by other caches, so in this case the block is removed from primary cache 12A and the state in access list 50A is kept as 'included'. The reason is that, in the Berkeley modified-and-shared state, another primary cache 12B in the same processor node 100A may hold the block in the shared state; if the access list moved to 'not included', primary cache 12B could still hold the block, contradicting the meaning of that state. If, on the other hand, no other primary cache 12B in processor node 100A holds the block in a valid state, access list 50A stays 'included' even though no primary cache 12A, 12B in the node holds a valid copy. This does not contradict the meaning of 'included', which is that a valid copy may exist. Finally, when the block is in the modified state under the Berkeley protocol, it is written to secondary cache 13A as in the general case and the state in access list 50A is switched to 'not included', because the modified state guarantees that the block is the only valid copy.
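The write-back rule above can be sketched as a small Python function. This is a hedged illustration: the function name `write_back`, the dictionary-based caches, and the state labels "modified" and "modified-shared" are assumptions made for the example, and the sketch assumes that the block is stored in the secondary cache in both cases.

```python
INCLUDED, NOT_INCLUDED = "included", "not included"


def write_back(l1, l2, access_list, addr):
    """Handle a write-back of `addr` from one primary cache.

    l1:          primary cache, addr -> (state, data)
    l2:          secondary cache, addr -> data
    access_list: addr -> INCLUDED / NOT_INCLUDED
    """
    state, data = l1[addr]
    if state not in ("modified", "modified-shared"):
        raise ValueError("only modified blocks are written back")
    del l1[addr]      # the block leaves the primary cache
    l2[addr] = data   # the secondary cache takes over the write-back responsibility
    if state == "modified":
        # A modified block is the only valid copy, so no primary cache holds it now.
        access_list[addr] = NOT_INCLUDED
    else:
        # Modified-and-shared: a sibling primary cache may still hold a shared
        # copy, so the access list conservatively stays 'included'.
        access_list[addr] = INCLUDED
```

Keeping 'included' in the modified-and-shared case may cause an unnecessary local-bus lookup later, but it never hides a valid primary copy from a snoop.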

When a write-back request is issued on global bus 21, processor nodes 100A, 100B, 100C do not take part in snooping, and the command completes when the block is transferred to shared memory 30.

Memory access requests issued on global bus 21 can be divided into reads and writes. A write must be made known to every cache in the system in order to maintain coherence, whereas for a read, coherence is maintained as long as the read request reaches the cache that holds the block in a valid, modified state and is therefore responsible for responding. As described above, when the state in access lists 50A to 50C is kept as one of two values, 'included' and 'not included', the access list shows whether a valid block exists in primary caches 12A to 12F, but not whether such a block is in a valid, modified state. Consequently, every read request issued on global bus 21 must be looked up in the primary caches 12A to 12F of every processor node 100A to 100C whose access-list state is 'included', and among them the cache holding the block in the valid, modified state responds to the read request.

Separately from the inclusion states described above, a 'modified' state can be added to access lists 50A to 50C, as shown in FIG. 4, to indicate that a valid, modified copy exists in primary caches 12A to 12F. In this case a read request issued on global bus 21 need not be looked up in the primary caches of every processor node 100A to 100C whose access-list state is 'included'; it is looked up only in the primary caches 12A to 12F of the processor node whose access-list state is 'modified', and that cache responds to the read request. Adding 'modified' increases the number of states and therefore the storage needed to hold them, but it reduces the number of primary caches 12A to 12F that must be searched for a read request and thus lightens the load on local buses 20A to 20C.
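The difference between the two-state and three-state access lists for a global read can be illustrated as follows. The function `nodes_to_search` and the dictionary layout are hypothetical; the sketch only shows which nodes must search their primary caches under each scheme.

```python
def nodes_to_search(access_lists, addr, use_modified_state):
    """Return the nodes whose primary caches must be searched for a global read.

    access_lists: node id -> {addr: "included" | "modified"}; a missing
                  address means "not included".
    use_modified_state: True models the three-state access list of FIG. 4,
                  False models the two-state list, which cannot tell a
                  modified copy from any other valid copy.
    """
    if use_modified_state:
        wanted = {"modified"}
    else:
        wanted = {"included", "modified"}  # both collapse to 'included'
    return [node for node, alist in access_lists.items()
            if alist.get(addr, "not included") in wanted]


lists = {
    "100A": {0x40: "included"},   # holds a shared copy
    "100B": {0x40: "modified"},   # holds the modified copy
    "100C": {},                   # holds nothing
}
two_state = nodes_to_search(lists, 0x40, use_modified_state=False)
three_state = nodes_to_search(lists, 0x40, use_modified_state=True)
```

With the two-state list both node 100A and node 100B issue a local-bus lookup; with the added 'modified' state only node 100B does.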

Cache coherence protocols fall broadly into write-through protocols and write-back protocols. In a write-through protocol, shared memory 30 is responsible for responding to every memory access from processors 11A to 11F; when a processor 11A to 11F performs a write, shared memory 30 and every cache holding a block with the same address are immediately updated with the written contents. Write-through protocols therefore generally perform worse than write-back protocols.

Commonly used write-back protocols are the write-once, Synapse, Berkeley, Illinois, and read-broadcast protocols and their modified forms. These protocols and their variants differ in their sets of states and in how a cache block's state changes when a write or a read occurs, but they all express the relationship between write permission and responsibility for responding. The access list proposed in the present invention can therefore be applied with any of them wherever multi-level cache inclusion would otherwise be maintained.

It is also evident that the present invention applies equally not only to uniform memory access shared-memory multiprocessor architectures, in which memory access time is constant, but also to non-uniform memory access (NUMA) shared-memory multiprocessor architectures, because the same cache hierarchy is formed between processor and memory.

FIG. 5 shows a non-uniform memory access shared-memory multiprocessor in which the shared memory is partitioned into local memories 30A, 30B, 30C, one in each processor node 100A, 100B, 100C. In this case, a memory access from a processor 11A is classified by its address range as targeting either the local memory 30A region or the remote memory 30B, 30C region. An access to the local memory 30A region has a shorter access time than an access to the remote memory 30B, 30C region, so remote accesses are relatively slow; for this reason the largest-level cache in the node is kept as a remote cache 13A. To minimize remote memory accesses, which incur a much longer latency than local memory accesses, remote caches 13A to 13C are arranged not to hold local memory 30A to 30C addresses. The multi-level cache inclusion between remote cache 13A and processor caches 12A, 12B in processor module 10A is therefore maintained only for the remote memory address region.

FIG. 6 shows the access lists 50A to 50C of the present invention applied to the non-uniform memory access shared-memory multiprocessor architecture of FIG. 5. An access list 50A, 50B, 50C is added to each processor node 100A, 100B, 100C, and each access list, for example 50A, keeps state only for block addresses in remote memories 30B, 30C, excluding local memory 30A. That is, among the blocks stored in primary caches 12A, 12B, access list 50A indicates whether a block is validly stored there only for blocks that belong to remote memories 30B, 30C. Accordingly, when a snoop request appears on global bus 21, node controller 40A of node 100A examines the address of the request. If the address falls within local memory 30A of node 100A, the controller does not consult access list 50A; it issues the request on local bus 20A, searches processor caches 12A, 12B, and acts on the result. In other words, for requests from global bus 21 to the local memory 30A address region of a node 100A, that node behaves exactly as if remote cache 13A and access list 50A did not exist.
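The address check performed by the node controller in the NUMA case can be sketched as below. The function name, the use of a Python `range` for the local memory region, and the dictionary-based access list are assumptions made for illustration.

```python
def must_search_processor_caches(addr, local_region, access_list):
    """Decide whether a snoop request from the global bus is forwarded to the
    local bus of this node.

    local_region: range of addresses backed by this node's local memory.
    access_list:  addr -> "included" / "not included", kept for remote
                  addresses only; a missing entry means "not included".
    """
    if addr in local_region:
        # Local addresses are not tracked by the access list, so the request
        # always goes to the processor caches without consulting it.
        return True
    return access_list.get(addr, "not included") == "included"


# Hypothetical node 100A owning addresses 0x0000-0x0FFF.
local_30a = range(0x0000, 0x1000)
access_list_50a = {0x2040: "included"}
```

Under these assumptions a request for 0x0100 is always forwarded, a request for 0x2040 is forwarded because the access list reports 'included', and a request for 0x3000 is filtered out.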

FIG. 7 shows a non-uniform memory access shared-memory multiprocessor architecture in which nodes 100A to 100F are interconnected not by a bus but in a ring built from point-to-point links. The present invention is not limited by the interconnection network structure. It applies in the same way as described above whenever a multi-level cache is placed on the access path between processors 10A, 10B and shared memory 30A and, as noted earlier, multi-level cache inclusion would otherwise have to be maintained, that is, the contents of the smaller-level caches 12A, 12B close to processors 11A, 11B would have to be contained in the larger-level caches 13A, 13B.

FIG. 8 shows a non-uniform memory access shared-memory multiprocessor architecture in which the nodes are interconnected not by a bus but in a double ring built from dual point-to-point links. The present invention is not limited by the interconnection network structure. Where a multi-level cache is placed on the access path between processors 11A, 11B and shared memory 30A and, as noted earlier, the contents of the smaller-level caches 12A, 12B close to processors 11A, 11B would otherwise have to be contained in the larger-level cache 13A to maintain multi-level cache inclusion, the present invention using access list 50A applies in the same way as described above.

FIG. 9 extends the storage structure of the access list from the fixed block-address allocation of FIG. 3 to a form in which block addresses can be allocated variably. Space must be reserved to store the block addresses, but because state need not be kept for the whole of shared memory, the access list can be made smaller. Any address not stored in the access list is regarded as 'not included'.

In FIG. 9, the storage structure of the access list is extended so that block addresses can be allocated variably, and an address field and a block-size field are added so that each access-list entry 150A, 151A, 152A, ... can represent the inclusion state of a memory block of variable size. In this case the state of an entry is kept as 'included' whenever the primary caches hold one or more of the memory blocks it covers.
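A variable-address access list of this kind can be sketched in Python as follows. All names (`Entry`, `VariableAccessList`, `default_size`) are hypothetical, the regions are assumed not to overlap, and the policy of creating a default-sized entry on an untracked fill is an assumption added only to make the example complete.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    base: int                                # address field: start of the covered region
    size: int                                # block-size field: bytes covered by this entry
    cached: set = field(default_factory=set) # addresses currently held in primary caches


class VariableAccessList:
    def __init__(self, default_size=64):
        self.default_size = default_size     # assumption: size of auto-created entries
        self.entries = []

    def _find(self, addr):
        for entry in self.entries:
            if entry.base <= addr < entry.base + entry.size:
                return entry
        return None

    def add_region(self, base, size):
        entry = Entry(base, size)
        self.entries.append(entry)
        return entry

    def on_fill(self, addr):
        """A primary cache has allocated the block at `addr`."""
        entry = self._find(addr)
        if entry is None:
            base = addr - addr % self.default_size
            entry = self.add_region(base, self.default_size)
        entry.cached.add(addr)

    def on_evict(self, addr):
        """The primary caches no longer hold the block at `addr`."""
        entry = self._find(addr)
        if entry is not None:
            entry.cached.discard(addr)

    def state(self, addr):
        # An entry is 'included' while at least one covered block is cached;
        # addresses with no entry are treated as 'not included'.
        entry = self._find(addr)
        if entry is not None and entry.cached:
            return "included"
        return "not included"
```

Because one entry covers a whole region, a lookup for any address in that region reports 'included' as soon as a single block in it is cached. This trades some filtering precision for a smaller access list.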

Furthermore, because the access list removes the need for a larger-level cache to include a smaller-level cache, it is evident that it applies equally when a still smaller-level cache is placed below the smaller level or a still larger-level cache is placed above the larger level. FIG. 10 shows an embodiment in which one more cache level is added to processor modules 10A to 10D inside processor nodes 100A to 100E, and a fourth-level cache 15A and a third-level bus 22 are added outside the processor nodes. In this case multi-level cache inclusion is not maintained between secondary cache 13A and tertiary cache 14A, and every memory block present in secondary cache 13A is kept in the 'included' state in access list 50A.

As described above, the present invention is not tied to any particular cache coherence protocol and can be applied to any snooping-based protocol. Multi-level cache inclusion is used in snooping-based protocols to reduce the number of caches that must be snooped; with the present invention, the information in the access list achieves the same reduction without maintaining multi-level cache inclusion.

The present invention also mitigates the drop in hit rate of the larger-level cache that occurs when multi-level cache inclusion forces the larger-level cache to contain every block present in the smaller-level caches, so that blocks are duplicated across the two levels.

In addition, the present invention eliminates the problem that, when a block is evicted from the larger-level cache, the block with the same address must also be evicted from the smaller-level cache, which lowers the hit rate of the smaller-level cache.

Claims

A shared memory multiprocessor device with a multi-level cache structure,

A plurality of processor modules, each comprising a processor that requests data processing on a specific data block stored in the shared memory, and a processor cache, which is a primary cache memory that stores in advance some of the data blocks held in the shared memory so that the processor can process data quickly;

A secondary cache connected to the processor module by a local bus and storing some data blocks stored in the shared memory;

An access list for storing valid state information of data blocks stored in the processor caches;

A node controller that manages the secondary cache and the access list in cooperation, updates the state information in the access list to match the state of the data blocks held in the processor cache, and retrieves a specific data block requested by the processor from the secondary cache, or from a cache memory in another processor module or the shared memory connected to the global bus, and provides it to the processor cache.

The device of claim 1,

wherein the access list has the same number of addresses as there are memory block addresses in the shared memory, and the valid-state information of a data block held in the processor cache is recorded at the address corresponding to that data block.

The device of claim 2,

wherein the access list is composed of a memory having a unique space corresponding one-to-one, in block-size units, to the entire shared memory, and fixedly allocates one bit per memory block address, corresponding to each block address in the shared memory, to indicate whether the data block is included in the processor cache.

The device of claim 1,

wherein the node controller initially records the state information at every address of the access list as 'not included' and, when a valid data block is subsequently allocated in the processor cache, switches the state information at the access-list address corresponding to that data block to 'included', indicating that the data block is stored in the lower-level processor cache.

The device of claim 1,

wherein, when a read request for a specific data block is received from the processor, the node controller checks whether the data block exists in a valid state in the secondary cache connected to the processor by the internal local bus and, if it does, provides the data block to the processor cache and switches the state information at the corresponding data block address in the access list to 'included'.

The device of claim 5,

wherein, when the data block for which a read request is received from the processor does not exist in a valid state in the secondary cache, the node controller issues a read request for the data block on the global bus connecting the plurality of processor modules and the shared memory, provides the processor cache with the data block received in response from another processor module or the shared memory connected to the global bus, and switches the state information at the corresponding data block address in the access list to 'included'.

The device of claim 6,

wherein, when the node controller receives a read request through the global bus, it searches for the requested data block in the secondary cache and in the access list that lists the data blocks stored in the processor cache, reads the data block from the cache memory that stores it, and transmits it to the global bus.

The device of claim 1,

wherein, upon a write request for a specific data block from the processor, the node controller checks whether the requested data block exists in a valid state in the processor cache or the secondary cache and, if it does, provides the data block to the processor cache, invalidates the data block in the secondary cache, issues an invalidation request for the data block on the global bus connecting the plurality of processor modules and the shared memory so that the data block is invalidated in the cache memories of the other processor modules and in the shared memory, and keeps the state information of the data block in the access list as 'included'.

The device of claim 8,

wherein, when the data block for which a write request is received from the processor does not exist in a valid state in the processor cache or the secondary cache, the node controller issues a request for the data block on the global bus, provides the processor cache with the data block received in response to the write request from another processor module or the shared memory, and switches the state information at the corresponding block address in the access list to 'included'.

The device of claim 9,

wherein, when the node controller receives a write request through the global bus, it searches for the requested data block in the secondary cache and in the access list that lists the data blocks stored in the processor cache, reads the data block from the cache memory that stores it and transmits it to the global bus, then invalidates the data block in that cache memory and switches the state information at the corresponding data block address in the access list to 'not included'.

The device of claim 1,

wherein, when the processor requests a write-back of a specific data block, the node controller stores the data block in the secondary cache, deletes it from the corresponding processor cache, and then switches the state information of the data block in the access list to 'not included'.

The device of claim 11,

wherein the access list has the same number of addresses as there are memory block addresses in the shared memory and records, at the address corresponding to a data block held in the processor cache, information on whether the block has been modified together with the valid-state information of the data block.

A shared memory multiprocessor device with a multi-level cache structure,

A plurality of processor modules, each comprising a processor that requests data processing on a specific data block, and a processor cache, which is a primary cache memory that stores in advance some of the data blocks held in the shared memory so that the processor can process data quickly;

A local memory, formed by allocating a portion of the shared memory to each processor module, and connected to that processor module by a local bus;

A remote cache connected to each processor module by a local bus and storing some of the data blocks held in the remaining local memories allocated to the other processor modules;

A node controller that manages the remote cache and the access list in cooperation, updates the state information in the access list to match the state of the data blocks held in the processor cache, and retrieves a specific data block requested by the processor from the remote cache, the local memory, or the cache memory and local memory of another processor module connected to the global bus, and provides it to the corresponding processor cache.

The device of claim 13,

wherein the access list has the same number of addresses as there are memory block addresses in the local memories allocated to the remaining processor modules connected to the processor module by the global bus, and the valid-state information of a data block is recorded at the address corresponding to each data block that is held in the remaining local memories and stored in the processor cache.

The device of any one of claims 1 to 14,

wherein the shared-memory multiprocessor device uses a uniform memory access (UMA) scheme, in which memory access time is uniform.

The device of any one of claims 1 to 14,

wherein the shared-memory multiprocessor device uses a non-uniform memory access (NUMA) shared-memory multiprocessor scheme.

The device of claim 15 or 16,

wherein the node controllers of the processor modules are interconnected in a ring using a point-to-point connection structure.