
TWI317065B - Method of accessing cache memory for parallel processing processors - Google Patents

Method of accessing cache memory for parallel processing processors

Info

Publication number
TWI317065B
Authority
TW
Taiwan
Prior art keywords
cache
processor
accessing
data
parallel processing
Prior art date
Application number
TW95129660A
Other languages
Chinese (zh)
Other versions
TW200809502A (en)
Inventor
Shiwu Lo
Original Assignee
Shiwu Lo
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shiwu Lo
Priority to TW95129660A
Publication of TW200809502A
Application granted
Publication of TWI317065B

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Description

IX. Description of the Invention

[Technical Field]

The present invention relates to a method of designing cache memory, and more particularly to a method of accessing cache memory data that is suitable for parallel processing processors.

[Prior Art]

Over the past few years, continuous advances in processor fabrication technology have markedly increased processor computation speed, while memory access speed has not improved to the same degree, so the gap between processor speed and memory access speed keeps widening.

The difference between processor and memory access speed is commonly addressed, in modern processors for example, by a hierarchical memory design. A hierarchical memory design exploits the temporal locality and spatial locality of the data in memory to raise the speed at which the cache can serve accesses, and thereby further improves processor performance. Because the cache is memory that is allocated dynamically by hardware, program execution time is highly correlated with the cache hit rate.

When SMT (simultaneous multithreading) or CMP (chip multiprocessor) technology is applied to a processor, the execution time of a program running on that processor is affected by the other programs executing together with it. Although cache partitioning can eliminate the mutual interference between programs running in parallel on the same physical processor, it does not allow the cache to be shared, so cache utilization still cannot be raised effectively.
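To make the cache-partitioning baseline concrete, the following C sketch models a set-associative L1 cache whose ways are statically divided between two hardware threads. It is an illustration only; the sizes, the address split and every identifier are assumptions made for this sketch and are not taken from the patent.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model of a statically partitioned, set-associative L1 cache:
 * each hardware thread owns a fixed subset of the ways in every set and may
 * neither hit in nor allocate into another thread's ways. */

#define NUM_SETS        64
#define NUM_WAYS        8
#define NUM_THREADS     2
#define WAYS_PER_THREAD (NUM_WAYS / NUM_THREADS)

struct line {
    bool     valid;
    uint64_t tag;
};

static struct line l1[NUM_SETS][NUM_WAYS];

/* Look up an address on behalf of one thread.  Only the ways statically
 * assigned to that thread are searched. */
static bool partitioned_lookup(unsigned thread, uint64_t addr)
{
    unsigned set   = (unsigned)((addr >> 6) % NUM_SETS);  /* 64-byte lines assumed */
    uint64_t tag   = addr >> 12;                          /* bits above the set index */
    unsigned first = thread * WAYS_PER_THREAD;            /* this thread's ways */

    for (unsigned w = first; w < first + WAYS_PER_THREAD; w++) {
        if (l1[set][w].valid && l1[set][w].tag == tag)
            return true;                                  /* hit in own partition */
    }
    return false;                                         /* miss: go to L2 or memory */
}
```

Because a thread can neither hit in nor allocate into the other thread's ways, capacity left idle by one thread is simply wasted; this is the utilization problem that the method described below is designed to remove.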
Although the above problem can be overcome by dynamically adjusting the size of the sub-cache dedicated to each mini-processor, such dynamic adjustment methods usually work by modifying the replacement algorithm in order to resize the sub-caches. However, when only a small number of cache misses occur in the system, changing the size of a cache partition takes a considerable amount of time, and this produces latency. For programs with quality-of-service or timing constraints, the latency degrades service quality or causes deadline misses, so system performance still cannot be raised effectively.

For a system designer, the effects of locality therefore show up in three ways.

First, the worst-case execution time (WCET) becomes difficult to estimate because it depends on the cache hit rate. In program design the WCET strongly affects predictions of overall system timing, and predicting the cache hit rate accurately is a major challenge for WCET analysis. Moreover, because WCET estimates are the foundation of many embedded and real-time systems, a WCET that is hard to estimate makes software design difficult.

Second, instruction-level parallel execution may cause thrashing: when a large number of misses occur in the processor's level-one cache, that is, when the cache hit rate is low, the number of instructions the processor completes per second drops suddenly. When the working sets of different programs running in parallel map onto the same group of cache lines, the programs keep overwriting each other's working sets, thrashing occurs, and system performance suffers.

Third, designing the operating-system scheduler becomes harder. Operating systems generally assume that processors (including the simple processors and logical processors inside CMP/SMT processors) can use all hardware resources fairly and without affecting one another. Without a restriction on cache usage, the same program behaves differently when it runs in parallel with a memory-intensive program than when it runs in parallel with a CPU-intensive program, and this variability increases the difficulty of designing the operating system.

[Summary of the Invention]

One object of the present invention is to provide a method of accessing cache memory data for parallel processing processors that raises cache utilization effectively.

Another object of the present invention is to provide a method of accessing cache memory data for parallel processing processors that allows the worst-case execution time (WCET) of the processor to be estimated more precisely.
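As a concrete illustration of the WCET difficulty noted above (the numbers are illustrative and not taken from the patent): suppose a routine performs 1,000 memory accesses on a processor where a cache hit costs 1 cycle and a miss costs 100 cycles. If the routine's cache lines cannot be disturbed by co-running threads and a 95% hit rate can be guaranteed, the memory contribution to the WCET bound is 1,000 × (0.95 × 1 + 0.05 × 100) = 5,950 cycles. If interference can, in the worst case, evict every line, the bound must assume 1,000 × 100 = 100,000 cycles, an inflation of more than sixteen times. Guaranteeing each instruction processing element exclusive use of at least its own sub-cache, as the method below does, lets the analysis work from the hit rate of that guaranteed partition.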

The method of the present invention for accessing cache memory data in a parallel processing processor comprises the following steps:

providing a processor and a lower-level memory unit, the processor using a plurality of instruction processing elements and an upper-level memory unit, such as a first-level cache, the upper-level memory unit containing a plurality of sub-caches, each instruction processing element corresponding to one of the sub-caches;

using a first one of the instruction processing elements to access a predetermined datum in the first sub-cache that corresponds to it;

when the predetermined datum is not obtained from the first sub-cache, accessing the other sub-caches, different from the first sub-cache, until the datum is obtained from a second sub-cache different from the first;

when the preceding step does not obtain the predetermined datum from any other sub-cache, accessing the lower-level memory unit until the datum is obtained; and

when a cache miss occurs, loading the predetermined datum from the lower-level memory unit into one of the sub-caches according to a predetermined order, namely:

first, into a dead cache line of the first sub-cache, that is, a line that has been declared unlikely to be used again;

second, into a dead cache line of one of the other sub-caches, different from the first sub-cache; and

third, into a location in the first sub-cache chosen according to a predetermined rule. The predetermined rule may be a first-in-first-out (FIFO) method, a random (RANDOM) method, a least-recently-used (LRU) replacement method, or the like.

The present invention has the following advantages:

1. The worst-case execution time (WCET) can be estimated by restricting each instruction processing element to use only the sub-cache that corresponds to it.

2. The invention does not need to dynamically repartition the capacity of the first-level cache, so no additional latency is introduced; the first-level cache is shared more efficiently and the response time of the instruction processing elements is reduced.

3. The invention keeps the most recently used (MRU) data of each instruction processing element in its corresponding sub-cache, which raises the probability of a cache hit.

4. The invention avoids cache thrashing.

[Embodiments]

Referring to FIG. 1, which is a schematic diagram of a system architecture according to an embodiment of the present invention, the method of this embodiment comprises providing a processor 100 and a lower-level memory unit 200. The processor 100 may be an SMT/CMP (simultaneous multithreading / chip multiprocessor) processor and has a plurality of instruction processing elements 101 and an upper-level memory unit, such as a first-level cache (L1 cache) 102. An instruction processing element 101 may be a mini-processor, and the mini-processors form the cores of the physical processor 100. An instruction processing element 101 may also be a simple processor, a logical processor (a virtual processor), an instruction execution program, or the like.

The lower-level memory unit 200 is defined relative to the upper-level memory unit: when the upper-level memory unit is the first-level cache, the lower-level memory unit 200 may be a second-level cache (L2 cache), a third-level cache (L3 cache), or the like, or the main memory.

The first-level cache 102 is divided by partitioning into a plurality of sub-caches 103, and each instruction processing element 101 corresponds to one of the sub-caches 103; that is, the i-th instruction processing element 101 has a corresponding i-th sub-cache 103 over which it has priority of access.

Referring further to FIG. 2, which is a flow chart of a preferred embodiment of the present invention, the preferred embodiment is described with two instruction processing elements 101 for ease of explanation, although the invention extends to more. When the i-th instruction processing element 101 accesses its corresponding i-th sub-cache 103 (step 300), it is determined whether the i-th sub-cache 103 can supply a predetermined datum (step 301). When the i-th sub-cache 103 holds the predetermined datum, it returns the result of the access (step 302).

When the predetermined datum is not found in the i-th sub-cache 103, the i-th instruction processing element 101 accesses a j-th sub-cache 103a (step 303), and it is determined whether the j-th sub-cache 103a holds the predetermined datum (step 304). When it does, the j-th sub-cache 103a returns the result, where i is not equal to j (step 305); in other words, the i-th instruction processing element 101 searches the sub-caches other than the i-th sub-cache 103 until the predetermined datum is found in some j-th sub-cache 103a.

When the predetermined datum is found in none of the sub-caches, the i-th instruction processing element 101 accesses the lower-level memory unit 200 until the predetermined datum is obtained, and the result is returned (step 307).

When the i-th instruction processing element 101 obtains the predetermined datum from the j-th sub-cache 103a or from the lower-level memory unit 200, a swap step (step 306) may further be used to load the predetermined datum into the first-level cache 102, which raises the probability that the system as a whole hits on the first cache access and improves the access efficiency of the first-level cache 102.

When a cache miss occurs, the other sub-caches or the lower-level memory unit 200 are accessed, as described above, until the predetermined datum is obtained. The predetermined datum newly loaded from the lower-level memory unit 200 is then placed, according to a predetermined order, in a cache line of the corresponding cache set of one of the sub-caches. The predetermined order is:

first, a cache line of the i-th sub-cache 103 that has already been declared unlikely to be used again, called a dead cache line; the determination and declaration of dead cache lines has been discussed extensively in the open literature, its operation will be understood by those of ordinary skill in the art, and it is not itself a technical feature of the present invention, so it is not described further here;

second, a dead cache line in one of the other sub-caches, not including the i-th sub-cache 103; and

third, a location in the i-th sub-cache 103 chosen according to a predetermined rule. The predetermined rule may follow an existing replacement method, for example a first-in-first-out (FIFO) method, a random (RANDOM) method, a least-recently-used (LRU) method, or the like.

In this way, the worst-case execution time (WCET) can be estimated by restricting each instruction processing element 101 to use only the sub-cache 103 that corresponds to it. In practice a system runs real-time and non-real-time applications at the same time, so the slack time gained can be used to improve quality of service (QoS) or to let the processor enter a power-saving mode.
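To make the access and fill sequence just described concrete, the following C sketch models it in software. The partition sizes, the field names and the choice of LRU as the third-choice fallback are assumptions made for illustration, the swap step 306 is omitted for brevity, and the sketch describes the behaviour in software terms rather than implementing the patented hardware.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Software model of the access and fill order described above.  The L1
 * cache 102 is split into one sub-cache 103 per instruction processing
 * element 101. */

#define NUM_ELEMENTS 2            /* instruction processing elements 101 */
#define SETS_PER_SUB 32           /* sets in each sub-cache 103          */
#define WAYS         4            /* associativity of each sub-cache     */

struct line {
    bool     valid;
    bool     dead;                /* declared unlikely to be used again  */
    uint64_t tag;
    uint64_t last_use;            /* timestamp for the LRU fallback      */
};

/* sub[i] is the sub-cache over which element i has priority. */
static struct line sub[NUM_ELEMENTS][SETS_PER_SUB][WAYS];
static uint64_t now_tick;

static struct line *find(unsigned elem, unsigned set, uint64_t tag)
{
    for (unsigned w = 0; w < WAYS; w++) {
        struct line *ln = &sub[elem][set][w];
        if (ln->valid && ln->tag == tag)
            return ln;
    }
    return NULL;
}

/* Victim choice follows the predetermined order: a dead (or empty) line in
 * the requester's own sub-cache, then a dead line in any other sub-cache,
 * then a line of the requester's own sub-cache picked by the fallback
 * policy (LRU here; FIFO or random would equally fit the description). */
static struct line *choose_victim(unsigned elem, unsigned set)
{
    for (unsigned w = 0; w < WAYS; w++)                     /* 1st choice */
        if (!sub[elem][set][w].valid || sub[elem][set][w].dead)
            return &sub[elem][set][w];

    for (unsigned e = 0; e < NUM_ELEMENTS; e++) {           /* 2nd choice */
        if (e == elem)
            continue;
        for (unsigned w = 0; w < WAYS; w++)
            if (!sub[e][set][w].valid || sub[e][set][w].dead)
                return &sub[e][set][w];
    }

    struct line *lru = &sub[elem][set][0];                  /* 3rd choice */
    for (unsigned w = 1; w < WAYS; w++)
        if (sub[elem][set][w].last_use < lru->last_use)
            lru = &sub[elem][set][w];
    return lru;
}

/* Access by element `elem`: own sub-cache first, then the other
 * sub-caches, then the lower-level memory unit 200 (steps 300 to 307). */
static void access_data(unsigned elem, uint64_t addr)
{
    unsigned set = (unsigned)((addr >> 6) % SETS_PER_SUB);
    uint64_t tag = addr >> 6;
    struct line *ln = find(elem, set, tag);                 /* steps 300-302 */

    for (unsigned e = 0; e < NUM_ELEMENTS && ln == NULL; e++)
        if (e != elem)
            ln = find(e, set, tag);                         /* steps 303-305 */

    if (ln == NULL) {
        /* Fetch from the lower-level memory unit (step 307) and place the
         * line according to the predetermined order. */
        ln = choose_victim(elem, set);
        ln->valid = true;
        ln->dead  = false;
        ln->tag   = tag;
    }

    ln->last_use = ++now_tick;    /* the line just used becomes the MRU line */
}
```

The point of the ordering is that a requester consumes dead or empty capacity anywhere in the first-level cache before it evicts live data, and when live data must be evicted it is taken only from the requester's own sub-cache, so the working sets kept by the other instruction processing elements are never displaced.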
From the preferred embodiments described above it can be seen that applying the present invention has the following advantages:

1. The worst-case execution time (WCET) can be estimated by restricting each instruction processing element 101 to use only the sub-cache 103 that corresponds to it.

2. The invention does not need to dynamically repartition the capacity of the first-level cache 102, so additional latency is avoided; the first-level cache 102 is shared more efficiently and the response time of the instruction processing elements 101 is reduced.

3. The invention stores the most recently used (MRU) data of each instruction processing element 101 in the corresponding sub-cache 103, raising the probability of a cache hit.

4. The invention avoids cache thrashing.

In the description and examples of the preferred embodiment of the present invention, the upper-level memory unit is exemplified by the first-level cache, but the scope of application of the invention is not limited to the first-level cache. Other elements with storage characteristics, such as a second-level cache, a third-level cache, or a translation look-aside buffer (TLB), can also use this technique to obtain the effects provided by the invention.

Although the present invention has been disclosed above by way of a preferred embodiment, the embodiment is not intended to limit the invention. Anyone skilled in the art may make various changes and refinements without departing from the spirit and scope of the invention; the scope of protection of the invention is therefore defined by the appended claims.

[Brief Description of the Drawings]

In order to make the above and other objects, features, advantages and embodiments of the present invention more comprehensible, the accompanying drawings are described as follows:

FIG. 1 is a schematic diagram of a system architecture according to an embodiment of the present invention.

FIG. 2 is a flow chart according to a preferred embodiment of the present invention.

[Description of Reference Numerals]

100: processor; 101: instruction processing element; 102: first-level cache; 103: sub-cache; 103a: sub-cache; 200: lower-level memory unit; 300-307: steps.

Claims (11)

X. Claims

1. A method of accessing cache memory data for a parallel processing processor, comprising:

a. providing a processor and a lower-level memory unit, the processor using a plurality of instruction processing elements and an upper-level memory unit, the upper-level memory unit containing a plurality of sub-caches, each instruction processing element corresponding to one of the sub-caches;

b. using a first one of the instruction processing elements to access a predetermined datum in a corresponding first sub-cache;

c. when the predetermined datum is not obtained from the first sub-cache, accessing the other sub-caches, different from the first sub-cache, until the predetermined datum is obtained from a second sub-cache different from the first sub-cache;

d. when, in step c, the predetermined datum is not obtained from the second sub-cache, accessing the lower-level memory unit until the predetermined datum is obtained; and

e. when a cache miss occurs, loading the predetermined datum accessed from the lower-level memory unit into one of the sub-caches and placing it according to a predetermined order, the predetermined order being:

first, in a dead cache line of the first sub-cache;

second, in a dead cache line of another one of the sub-caches, different from the first sub-cache; and

third, in a location in the first sub-cache chosen according to a predetermined rule.

2. The method of accessing cache memory data for a parallel processing processor of claim 1, wherein the predetermined rule uses a first-in-first-out (FIFO) method.

3. The method of accessing cache memory data for a parallel processing processor of claim 1, wherein the predetermined rule uses a random (RANDOM) method.

4. The method of accessing cache memory data for a parallel processing processor of claim 1, wherein the predetermined rule uses a least-recently-used (LRU) replacement method.

5. The method of accessing cache memory data for a parallel processing processor of claim 1, wherein each of the instruction processing elements is a mini-processor.

6. The method of accessing cache memory data for a parallel processing processor of claim 1, wherein the upper-level memory unit is a first-level cache and each of the sub-caches is a partition of the first-level cache.

7. The method of accessing cache memory data for a parallel processing processor of claim 1, wherein the processor is an SMT/CMP (simultaneous multithreading / chip multiprocessor) processor.

8. The method of accessing cache memory data for a parallel processing processor of claim 1, wherein the processor uses a simple processor.

9. The method of accessing cache memory data for a parallel processing processor of claim 1, wherein the processor uses a logical processor.

10. The method of accessing cache memory data for a parallel processing processor of claim 1, wherein the processor uses an instruction execution program.

11. The method of accessing cache memory data for a parallel processing processor of claim 1, wherein the upper-level memory unit is a first-level cache, a second-level cache, a third-level cache, or a translation look-aside buffer (TLB).
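Claims 2 to 4 allow the third-choice placement rule to be a FIFO, random, or LRU method. The following C sketch shows what each of those policies amounts to when a victim way must be picked inside one set of the requester's own sub-cache; the metadata layout and the four-way associativity are assumptions made for the sketch, not details fixed by the claims.

```c
#include <stdint.h>
#include <stdlib.h>

#define WAYS 4

struct way_meta {
    uint64_t filled_at;   /* time the line was filled, used by FIFO   */
    uint64_t last_use;    /* time the line was last used, used by LRU */
};

/* Claim 2: first-in-first-out, the oldest fill is evicted. */
static unsigned victim_fifo(const struct way_meta m[WAYS])
{
    unsigned v = 0;
    for (unsigned w = 1; w < WAYS; w++)
        if (m[w].filled_at < m[v].filled_at)
            v = w;
    return v;
}

/* Claim 3: random replacement, any way may be evicted.
 * (rand() is used purely for illustration; seed it as appropriate.) */
static unsigned victim_random(void)
{
    return (unsigned)(rand() % WAYS);
}

/* Claim 4: least recently used, the way untouched for longest is evicted. */
static unsigned victim_lru(const struct way_meta m[WAYS])
{
    unsigned v = 0;
    for (unsigned w = 1; w < WAYS; w++)
        if (m[w].last_use < m[v].last_use)
            v = w;
    return v;
}
```

Whichever policy is chosen, it selects a victim only inside the requester's own sub-cache, matching the third placement choice of claim 1, under which new data may displace live lines only in the first sub-cache.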
TW95129660A 2006-08-11 2006-08-11 Method of accessing cache memory for parallel processing processors TWI317065B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW95129660A TWI317065B (en) 2006-08-11 2006-08-11 Method of accessing cache memory for parallel processing processors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW95129660A TWI317065B (en) 2006-08-11 2006-08-11 Method of accessing cache memory for parallel processing processors

Publications (2)

Publication Number Publication Date
TW200809502A TW200809502A (en) 2008-02-16
TWI317065B true TWI317065B (en) 2009-11-11

Family

ID=44767163

Family Applications (1)

Application Number Title Priority Date Filing Date
TW95129660A TWI317065B (en) 2006-08-11 2006-08-11 Method of accessing cache memory for parallel processing processors

Country Status (1)

Country Link
TW (1) TWI317065B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8627014B2 (en) * 2008-12-30 2014-01-07 Intel Corporation Memory model for hardware attributes within a transactional memory system
US9921967B2 (en) 2011-07-26 2018-03-20 Intel Corporation Multi-core shared page miss handler

Also Published As

Publication number Publication date
TW200809502A (en) 2008-02-16

Similar Documents

Publication Publication Date Title
Kaseridis et al. Minimalist open-page: A DRAM page-mode scheduling policy for the many-core era
Stuecheli et al. The virtual write queue: Coordinating DRAM and last-level cache policies
Nesbit et al. Fair queuing memory systems
US8443151B2 (en) Prefetch optimization in shared resource multi-core systems
US8621157B2 (en) Cache prefetching from non-uniform memories
US9239798B2 (en) Prefetcher with arbitrary downstream prefetch cancelation
US8904154B2 (en) Execution migration
US20090313435A1 (en) Optimizing concurrent accesses in a directory-based coherency protocol
Yedlapalli et al. Meeting midway: Improving CMP performance with memory-side prefetching
WO2013044829A1 (en) Data readahead method and device for non-uniform memory access
Kim et al. Hybrid DRAM/PRAM-based main memory for single-chip CPU/GPU
Bock et al. Concurrent migration of multiple pages in software-managed hybrid main memory
Racunas et al. Partitioned first-level cache design for clustered microarchitectures
Usui et al. Squash: Simple qos-aware high-performance memory scheduler for heterogeneous systems with hardware accelerators
Pimpalkhute et al. NoC scheduling for improved application-aware and memory-aware transfers in multi-core systems
US20090006777A1 (en) Apparatus for reducing cache latency while preserving cache bandwidth in a cache subsystem of a processor
Herrero et al. Thread row buffers: Improving memory performance isolation and throughput in multiprogrammed environments
TWI317065B (en) Method of accessing cache memory for parallel processing processors
Pimpalkhute et al. An application-aware heterogeneous prioritization framework for NoC based chip multiprocessors
Jahre et al. A high performance adaptive miss handling architecture for chip multiprocessors
Zhang et al. DualStack: A high efficient dynamic page scheduling scheme in hybrid main memory
Zhou et al. Real-time scheduling for phase change main memory systems
Jeon et al. Reducing DRAM row activations with eager read/write clustering
US7434001B2 (en) Method of accessing cache memory for parallel processing processors
Huan et al. Processor directed dynamic page policy

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees