US20080184011A1 - Speculative Throughput Computing - Google Patents
- Publication number
- US20080184011A1 (application US12/022,814)
- Authority
- US
- United States
- Prior art keywords
- speculative
- program
- methods
- execution
- processors
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Links
- 238000000034 method Methods 0.000 claims abstract description 118
- 238000004590 computer program Methods 0.000 claims abstract description 19
- 238000012545 processing Methods 0.000 claims description 6
- 230000000977 initiatory effect Effects 0.000 claims description 2
- 230000015654 memory Effects 0.000 description 63
- 230000008569 process Effects 0.000 description 51
- 230000008901 benefit Effects 0.000 description 7
- 239000011159 matrix material Substances 0.000 description 6
- 238000004458 analytical method Methods 0.000 description 5
- 230000006870 function Effects 0.000 description 4
- 230000007246 mechanism Effects 0.000 description 4
- 230000000644 propagated effect Effects 0.000 description 3
- 238000013519 translation Methods 0.000 description 3
- 230000003190 augmentative effect Effects 0.000 description 2
- 238000004883 computer application Methods 0.000 description 2
- 230000003993 interaction Effects 0.000 description 2
- 238000013507 mapping Methods 0.000 description 2
- 230000003287 optical effect Effects 0.000 description 2
- 238000013515 script Methods 0.000 description 2
- 238000000926 separation method Methods 0.000 description 2
- 230000005540 biological transmission Effects 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 238000005265 energy consumption Methods 0.000 description 1
- 239000004973 liquid crystal related substance Substances 0.000 description 1
- 238000012544 monitoring process Methods 0.000 description 1
- 238000005192 partition Methods 0.000 description 1
- 230000004793 poor memory Effects 0.000 description 1
- 238000012805 post-processing Methods 0.000 description 1
- 239000004065 semiconductor Substances 0.000 description 1
- 230000001953 sensory effect Effects 0.000 description 1
- 239000000758 substrate Substances 0.000 description 1
- 238000012360 testing method Methods 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3842—Speculative instruction execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/45—Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
- G06F8/451—Code distribution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3824—Operand accessing
- G06F9/383—Operand prefetching
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3824—Operand accessing
- G06F9/3834—Maintaining memory consistency
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3838—Dependency mechanisms, e.g. register scoreboarding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3861—Recovery, e.g. branch miss-prediction, exception handling
- G06F9/3863—Recovery, e.g. branch miss-prediction, exception handling using multiple copies of the architectural state, e.g. shadow registers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
Definitions
- This subject matter generally relates to throughput computing.
- a program (e.g., computer application) can be partitioned into a plurality of program segments.
- a program can be partitioned into program segments P 1 , P 2 , . . . , P N , where N is a number of program segments.
- Conventional computing systems can execute the program segments one after another in an enumerated order. For example, a single processor computing system can execute P 1 before executing P 2 , P 2 before executing P 3 , and P N-1 before executing P N .
- Executing program segments in this order respects sequential semantics (e.g., a program segment with a higher enumeration order x reads from a memory location only after a program segment with a lower enumeration order y writes to the memory location, where x>y).
- a first program segment can have an enumeration order i
- a second program segment can have an enumeration order j, where i < j.
- the first program segment and the second program segment can be executed in parallel without violating sequential semantics if the program segments do not access the same memory locations. Furthermore, sequential semantics is not violated if the first program segment does not write to a memory location after the second program segment reads from the memory location.
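The conditions above can be approximated statically by comparing the read and write sets of the two segments. The following Python sketch is illustrative only (the function name and set-based representation are assumptions, not from the patent): segments i < j can be speculatively run in parallel if the writes of i touch nothing j accesses, and the writes of j touch nothing i reads.

```python
def can_run_in_parallel(reads_i, writes_i, reads_j, writes_j):
    """Conservative check for program segments i < j (enumeration order).

    Each argument is a set of memory locations (addresses)."""
    no_flow = writes_i.isdisjoint(reads_j)     # j must not read what i writes
    no_output = writes_i.isdisjoint(writes_j)  # no write-write collision
    no_anti = reads_i.isdisjoint(writes_j)     # j must not overwrite what i reads
    return no_flow and no_output and no_anti
```

A runtime scheme can be less conservative (e.g., the anti-dependence is harmless if i reads before j writes), which is what the speculative methods described later exploit.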
- Multiprocessor, multicore, or multithreading computing systems can execute program segments in parallel (e.g., executing program segments at substantially the same time) on a plurality of processors, processor cores, or threads. Executing program segments in parallel that were not originally designed to execute in parallel can be referred to as “speculative execution.”
- some conventional analysis methods execute write instructions of program segments at temporary memory locations. These conventional analysis methods create execution overhead associated with using the temporary memory locations (e.g., storing and moving data from the temporary memory locations).
- Other conventional analysis methods use centralized data structures to store original data of memory locations where write instructions of program segments write, so that the original data may be restored. Updating the centralized structure can cause excessive overhead, especially if the write log is implemented in software using a dedicated data structure.
- When a miss-speculation occurs (e.g., when a program segment with an enumeration order i has written to a location that a program segment with an enumeration order j has already read, where i < j), program segments with an enumeration order higher (e.g., greater) than j halt and redo their executions. Halting and redoing executions causes execution overhead that can make speculative execution inefficient.
- typical hardware and software implementations of conventional analysis methods use complex mechanisms and are inefficient.
- typical software implementations create execution overhead because they use extra instructions, and cause poor memory system performance by lowering the memory locality, which can result in cache misses.
- Some implementations monitor fixed regions of memory (e.g., a fixed range of one or more consecutive memory locations) to track if the region has been modified.
- a region size that is too large can result in false determinations of violations of sequential semantics. For example, a program segment may be forced to halt and redo its execution if it accesses the same region as another program segment with a lower enumeration order, even if the two program segments do not access any location in common.
- a region size that is too small increases the overhead of monitoring read and write instructions.
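The granularity trade-off can be made concrete with a small sketch (illustrative Python, not from the patent; function names are assumptions). Monitoring fixed-size regions flags a conflict whenever two accesses fall in the same region, so distinct addresses in one region produce a false conflict:

```python
def region_id(addr, region_size):
    # Map an address to its fixed-size monitored region.
    return addr // region_size

def region_conflict(addrs_a, addrs_b, region_size):
    # A conflict is flagged when any two accesses share a region,
    # even if the exact addresses differ (a possible false positive
    # when region_size is large).
    regions_a = {region_id(a, region_size) for a in addrs_a}
    regions_b = {region_id(b, region_size) for b in addrs_b}
    return not regions_a.isdisjoint(regions_b)
```

With a 64-location region, addresses 100 and 101 collide even though no location is shared; with one-location regions they do not, but every access must then be tracked individually.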
- Profiling methods (e.g., test executions of programs to determine properties of a program) can be used to predict gains from speculative execution of the programs.
- Methods for speculative execution are also referred to as “speculative methods” or “processes for speculative throughput computing.”
- dependence analyzers often are not able to determine whether the program segments may be executed in parallel.
- Speculative throughput computing is used to translate a program to execute on a plurality of processors, processor cores, or threads.
- An advantage of speculative throughput computing is that it reduces execution overheads by using speculative read and write computer instructions.
- An additional advantage of speculative throughput computing is that it reduces execution overhead related to commit operations by assuming that a speculation is valid, and by restoring content of speculatively modified memory locations using a decentralized scheme.
- An additional advantage of speculative throughput computing is that it determines speculative parallelism by executing program segments out of sequential order.
- An additional advantage of speculative throughput computing is that it increases the accuracy of determining violations of sequential semantics and reduces execution overhead by maintaining a precise range of locations that have been speculatively accessed.
- An additional advantage of speculative throughput computing is that it increases speedup gains from speculative execution by selecting one of a plurality of speculative methods.
- An additional advantage of speculative throughput computing is that by increasing speedup gains, speculative throughput computing lowers an operating frequency of a system and reduces energy consumption.
- FIG. 1 illustrates an example multiprocessor computing system.
- FIG. 2 shows an example process for translating a program to execute on a plurality of processors, processor cores, or threads.
- FIG. 3 shows an example process for speculative throughput computing.
- FIG. 4 illustrates structures for an example hardware implementation of the process for speculative throughput computing of FIG. 3 .
- FIG. 5 shows an example process for generating the structures of FIG. 4 .
- FIG. 6 shows an example process for a commit operation using the structures of FIG. 4 .
- FIG. 7 shows an example process for a roll-back operation using the structures of FIG. 4 and a back-track stack.
- FIG. 8 illustrates an example back-track stack.
- FIG. 9 illustrates structures for an example software implementation of the process for speculative throughput computing of FIG. 3 .
- FIG. 10 shows an example process for generating the structures of FIG. 9 .
- FIG. 11 shows an example process for a commit operation using the structures of FIG. 9 .
- FIG. 12 shows an example process for a roll-back operation using the structures of FIG. 9 and a back-track stack.
- FIG. 13 illustrates an example data dependence graph.
- FIG. 1 illustrates an example multiprocessor computing system 100 .
- the multiprocessor computing system 100 includes processors (e.g., processors 131 , 132 , and 133 ); cache memory coupled to the processors (e.g., private caches 121 , 122 , and 123 ); an interconnect 140 ; and a memory 110 .
- the processors and the cache memory are coupled to the interconnect 140 (e.g., a bus, or a crossbar switch).
- the interconnect 140 allows a cache memory to send requests for data to the memory 110 or to other cache memory.
- additional hierarchies of memory can be used in the multiprocessor computing system 100 .
- the memory 110 can be a secondary cache that is coupled to additional memory.
- the cache memory includes local memory that can be accessed by a processor coupled to the local memory.
- a read or write instruction that is executed by the processor 131 can access local memory coupled to the processor 132 by invoking a software routine that sends a signal to the processor 132 .
- the signal can invoke a software routine executed by the processor 132 , where the processor 132 accesses the local memory and returns a value to the processor 131 by sending a signal to the processor 131 .
- Coherence (e.g., cache coherence) is maintained in the cache memory.
- a write-invalidate cache coherence mechanism can be used to maintain cache coherence.
- the write-invalidate cache coherence mechanism can invalidate a block of memory (e.g., contiguous locations of memory) in a cache memory when a processor coupled to another cache memory writes to the block of memory.
- a write-update cache coherence mechanism can be used to maintain cache coherence.
- a block of memory is updated when a processor coupled to another cache modifies the block of memory.
- a distribution protocol of invalidate and update requests can be one-to-all (e.g., snoopy cache protocols).
- the distribution protocol of invalidate and update requests can be one-to-one (e.g., directory-based protocols).
- one or more of the processors can include a plurality of independent processor cores (e.g., a multi-core processor).
- a dual-core processor includes two processor cores
- a quad-core processor includes four processor cores.
- the processor cores can execute a plurality of threads (e.g., program segments created by forks or splits of a program, or threads of execution) in parallel.
- FIG. 2 shows an example process 200 for translating a program to execute on a plurality of processors, processor cores, or threads.
- For example, a computer application written in a high-level programming language (e.g., C, C++, Fortran, or Java) can be translated into a machine language program that is executed on a processor.
- the process 200 will be described with respect to a system that performs the process 200 .
- the system analyzes 210 data dependence of program segments of a program.
- the system determines whether program segments of the program, that can be sequentially executed (e.g., one after another) on a single processor computing system, can be executed in parallel (e.g., on a plurality of processors in a multiprocessor computing system, on a plurality of processor cores on a single processor, on a plurality of threads).
- the system determines one or more methods to speculatively execute read and write instructions of program segments in parallel for use in the translation.
- the system profiles 220 the program.
- the system executes the program to collect statistics. Some examples of statistics include but are not limited to: a number of instructions executed in a program segment, an average number of cycles needed by each instruction, and a number of speculative reads and writes in a program segment.
- the system selects 230 one or more speculative methods to use for translation. In some implementations, the system selects a combination of speculative methods.
- FIG. 3 shows an example process 300 for speculative throughput computing.
- the process 300 will be described with respect to a system that performs the process 300 .
- the system generates 310 a precise range of locations that speculative read or write instructions have accessed.
- the range of locations can be precise because the highest location (e.g., a maximum data address in the range) and the lowest location (e.g., a minimum data address in the range) have each been accessed by a speculative read or write instruction.
- the system generates a precise range of locations that speculative read or write instructions have accessed for each processor in a multiprocessor, processor core in a multi-core processor, or thread. Other implementations are possible.
- the system can compare 320 the first precise range of locations with the second precise range of locations. In particular, the system can determine if a speculative execution of the program segments in parallel respects sequential semantics. For example, the first program segment and the second program segment can be executed in parallel. The first program segment can have an enumeration order less than an enumeration order of the second program segment. If the second program segment reads from a memory location before the first program segment writes to the memory location, then sequential semantics has been violated.
- Sequential semantics can be violated if the ranges of locations overlap. For example, if the first precise range of locations overlaps with the second precise range of locations, then the first program segment and second program segment may have accessed the same memory location. If the first precise range of locations overlaps with the second precise range of locations and a location in the first precise range or the second precise range has been modified, the system identifies 330 a miss-speculation (e.g., a speculation that may not conform to sequential semantics).
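The overlap test on two precise ranges can be sketched as follows (illustrative Python; names are assumptions, not from the patent). A miss-speculation is possible only when the inclusive [min, max] ranges overlap and at least one side contains a speculative write:

```python
def ranges_overlap(min_a, max_a, min_b, max_b):
    # Two inclusive address ranges overlap iff each starts at or
    # before the point where the other ends.
    return min_a <= max_b and min_b <= max_a

def possible_miss_speculation(range_a, range_b, a_written, b_written):
    # Overlapping precise ranges alone are not enough: two segments
    # that only read the same locations never violate sequential
    # semantics, so a speculative write on at least one side is required.
    (min_a, max_a), (min_b, max_b) = range_a, range_b
    return ranges_overlap(min_a, max_a, min_b, max_b) and (a_written or b_written)
```

Because the ranges are precise (both endpoints were actually accessed), this test is more accurate than monitoring fixed regions, though overlapping ranges can still be a conservative indication when the accessed locations interleave without colliding.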
- the system restores 340 the memory content of locations speculative write instructions have accessed. Implementations of the process 300 for speculative throughput computing will be described in further detail with respect to FIGS. 4-13 .
- FIG. 4 illustrates structures 400 for an example hardware implementation of the process for speculative throughput computing of FIG. 3 .
- the structures, implemented in hardware, include a data structure 410 (e.g., an access matrix table), a write log 420 , and a write log pointer 430 .
- each processor is extended with the structures (e.g., data structure 410 , write log 420 , and write log pointer 430 ).
- each processor core in a processor or thread is extended with the structures.
- the data structure 410 is a table with a number of entries. Each entry can include fields. Some examples of fields include but are not limited to: an instruction address field 411 , a data validity field 412 , a maximum data address field 413 , a minimum data address field 414 , and an indicator field 415 .
- the instruction address field 411 (e.g., “TAG” field) can store an instruction address of a speculative read or write instruction.
- the data validity field 412 can store, for example, a single bit that indicates whether the data stored in an entry is valid.
- the maximum data address field 413 (e.g., “MAX” field) and the minimum data address field 414 (e.g., “MIN” field) can store data addresses.
- a maximum data address is stored in the maximum data address field 413 and a minimum data address is stored in the minimum data address field 414 .
- the maximum data address field 413 and minimum data address field 414 can be used to define a precise range of locations, as described above with reference to FIG. 3 .
- the indicator field 415 (e.g., “RW” field) can store an indicator (e.g., a single bit) to indicate whether the instruction corresponding to the entry is a read instruction or a write instruction.
- the write log 420 can be a table with entries that can store addresses and data associated with write instructions.
- a register can be used to store the write log pointer 430 .
- the register can store an index of a next free entry in the write log 420 .
- the data associated with the write instructions can include an old value (e.g., a value of a location before a write instruction is executed), a new value (e.g., a value a write instruction will write to the location), and one or more status bits. For example, if a write instruction is executed, a value (e.g., new value) is written to a memory location. The original value (e.g., value in the memory location before the write instruction was executed) is stored as the old value.
- the size of the old values and new values can be selected for different architectures.
- a 32-bit architecture can include old values and new values that are 32-bit values.
- a 64-bit architecture can include old values and new values that are 64-bit values.
- the status bits can indicate whether the data stored in the old value and new value is valid.
- the status bits can also be used when deriving the old value of one or more memory locations.
- the number of status bits can depend on the size of the values (e.g., old values, and new values).
- each entry in the write log can include one status bit per addressable unit (e.g., a status bit for each 8-bit value, or byte).
- unused bits or bytes in a log entry are filled with placeholder bits or bytes.
- the placeholder bits or bytes can be the same bit or byte values used to fill the unused bits or bytes in both the old value and the new value.
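A write log entry as described above can be sketched like this (illustrative Python; the class name, field names, and 32-bit word size are assumptions): old and new values are padded to the word width with the same placeholder bytes, and one status flag per addressable unit (byte) marks which positions hold valid data.

```python
from dataclasses import dataclass

WORD_BYTES = 4  # e.g., a 32-bit architecture


@dataclass
class WriteLogEntry:
    address: int      # effective address of the speculative write
    old_value: bytes  # value of the location before the write executed
    new_value: bytes  # value the speculative write stored
    valid: tuple      # one status flag per addressable unit (byte)


def make_entry(address, old_value, new_value):
    # Pad unused bytes with the same placeholder in both the old and
    # new values, so byte-wise comparisons of the two fields agree on
    # the unused positions.
    pad = b"\x00" * (WORD_BYTES - len(new_value))
    valid = tuple([True] * len(new_value) + [False] * len(pad))
    return WriteLogEntry(address, old_value + pad, new_value + pad, valid)
```

The per-byte status flags let a roll-back operation derive the old value of a single addressable unit without assuming the whole word was written.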
- FIG. 5 illustrates an example process 500 for generating the structures of FIG. 4 .
- the process 500 will be described with respect to a system that performs the process 500 .
- the system can distinguish between two types of memory instructions: regular read and write instructions (e.g., read and write instructions that are not speculative) and speculative read and write instructions.
- the system determines 501 if a read or write instruction is speculative. If the read or write instruction is not speculative (“No” branch of step 501 ), the read or write instruction is executed as usual (e.g., executed as a regular read or write instruction). Then, the system executes 513 a next instruction.
- a speculative read or write instruction is executed.
- the system locates 502 the speculative read or write instruction in the data structure (e.g., data structure 410 of FIG. 4 ) for a corresponding processor, processor core, or thread.
- the data structure can be a hash structure and can use address mapping methods, such as, for example, fully-associative, direct-mapped, or set-associative address mapping methods to locate instructions.
- the system stores 504 the speculative read or write instruction. In particular, the system compares the instruction address of the speculative read or write instruction to the TAG fields of all of the entries in the data structure. If the execution of the speculative read or write instruction is a first execution of the speculative read or write instruction (e.g., the instruction address does not match any of the TAG fields), then the speculative read or write instruction is stored in the data structure. The system can allocate a new entry or use an entry that was evicted earlier.
- the system compares an effective address of the speculative read or write instruction (e.g., a memory address that the speculative read or write instruction will access) with a maximum data address and a minimum data address. In particular, the system retrieves 505 values from the MAX field and MIN field in the data structure. If the effective address (e.g., ADDR of FIG. 5 ) is greater than the maximum data address (“Yes” branch of step 506 ), then the system stores 507 the effective address as the maximum data address. Otherwise (“No” branch of step 506 ), if the effective address is less than the minimum data address (“Yes” branch of step 508 ), then the system stores 509 the effective address as the minimum data address.
- the system determines whether or not the speculative read or write instruction is a speculative write instruction. If the speculative read or write instruction is not a speculative write instruction (“No” branch of step 510 ), then the system performs step 513 . If the speculative read or write instruction is a speculative write instruction (“Yes” branch of step 510 ), then the system sets 511 an indicator (e.g., a bit in the “RW” field) to identify a speculative write instruction. In addition, the system stores 512 the write, or the effective address and the data associated with the write instruction, in the write log (e.g., write log 420 ).
- the effective address and the data associated with the write log can be stored in a write log entry pointed to by the write log pointer (e.g., write log pointer 430 of FIG. 4 ), and the write log pointer is incremented to point to a next free write log entry (e.g., by incrementing the value in the register). Then, the system performs step 513 .
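The handling of one speculative access can be summarized in a short sketch (illustrative Python, not the patent's hardware; the table is keyed by instruction address as a stand-in for the TAG lookup, and names are assumptions): the entry's MIN/MAX range is widened to cover the effective address, the write indicator is set for writes, and writes are appended to the write log.

```python
class AccessEntry:
    # One row of the access table: the precise [min, max] address range
    # touched by a given instruction, plus a read/write indicator.
    def __init__(self, addr, is_write):
        self.min_addr = addr
        self.max_addr = addr
        self.is_write = is_write


def speculative_access(table, write_log, pc, addr, is_write, old=None, new=None):
    entry = table.get(pc)
    if entry is None:
        # First execution of this instruction: allocate a new entry.
        table[pc] = AccessEntry(addr, is_write)
    else:
        # Widen the precise range to cover the new effective address.
        entry.max_addr = max(entry.max_addr, addr)
        entry.min_addr = min(entry.min_addr, addr)
        entry.is_write = entry.is_write or is_write
    if is_write:
        # Record (address, old value, new value) at the write-log pointer;
        # a Python list append models the pointer increment.
        write_log.append((addr, old, new))
```

Regular (non-speculative) reads and writes bypass this path entirely, which keeps the overhead confined to the instructions the translation marked as speculative.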
- In a commit operation (e.g., the process of FIG. 6 ), the system compares a first precise range of locations with a second precise range of locations to determine whether a range of locations accessed by one processor, processor core, or thread overlaps with the ranges of locations accessed by other processors, processor cores, or threads.
- the system determines if there are entries in a first data structure that have not been compared. If there are no more entries (“No” branch of step 602 ), then the system stops 610 .
- If there are more entries (“Yes” branch of step 602 ), the system retrieves 603 a maximum data address (e.g., MAX), a minimum data address (e.g., MIN), and an indicator (e.g., RW) from the data structure for a first processor, processor core, or thread.
- the system determines if there are entries in a second data structure that have not been compared to the entry determined in step 602 . If there are not more entries in the second data structure (“No” branch of step 604 ), the system returns to step 602 . If there are more entries (“Yes” branch of step 604 ), the system retrieves 605 a maximum data address (e.g., N.MAX), a minimum data address (e.g., N.MIN), and an indicator (e.g., N.RW) in the data structure for a second processor, processor core, or thread. The system compares the maximum data address and the minimum data address from the first processor, processor core, or thread with the maximum data address and minimum data address from the second processor, processor core, or thread.
- In step 606 , if the indicator in the data structure for the first processor, processor core, or thread is set, or the indicator in the data structure for the second processor, processor core, or thread is set (“Yes” branch of step 606 ), then a speculative write instruction has accessed the first precise range of locations corresponding to a first program segment, or the second precise range of locations corresponding to a second program segment, respectively. If neither indicator is set (“No” branch of step 606 ), then the system returns to step 604 .
- In step 607 , if the maximum data address from the first processor is less than the maximum data address from the second processor, processor core, or thread, and the maximum data address from the first processor is greater than the minimum data address from the second processor (“Yes” branch of step 607 ), then the first precise range of locations overlaps with the second precise range of locations, and the system identifies 609 a miss-speculation. Otherwise (“No” branch of step 607 ), the system performs step 608 .
- In step 608 , if the minimum data address from the first processor is greater than the minimum data address from the second processor, processor core, or thread, and the minimum data address from the first processor is less than the maximum data address from the second processor (“Yes” branch of step 608 ), then the first precise range of locations overlaps with the second precise range of locations, and the system identifies 609 a miss-speculation. Otherwise (“No” branch of step 608 ), the system returns to step 604 . The system compares each entry for a processor, processor core, or thread with each entry in all of the other processors, processor cores, or threads, in this manner.
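The commit-time scan can be sketched as an all-pairs loop over the per-processor access tables (illustrative Python; entry tuples and the function name are assumptions, not from the patent). Each table entry carries the (MIN, MAX, RW) fields described above:

```python
def detect_miss_speculation(tables):
    # tables: one access table per processor, core, or thread; each
    # table is a list of (min_addr, max_addr, is_write) entries.
    tables = list(tables)
    for i, first in enumerate(tables):
        for second in tables[i + 1:]:
            for (min_a, max_a, rw_a) in first:
                for (min_b, max_b, rw_b) in second:
                    if not (rw_a or rw_b):
                        continue  # two read-only entries never conflict
                    # Inclusive ranges overlap iff each starts at or
                    # before the point where the other ends.
                    if min_a <= max_b and min_b <= max_a:
                        return True  # overlapping ranges plus a write
    return False
```

This single symmetric overlap test covers both conditional tests of steps 607 and 608, which together detect every way two inclusive ranges can intersect.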
- FIG. 7 shows an example process 700 for a roll-back operation using the data structures of FIG. 4 and a back-track stack.
- FIG. 7 shows an example process for restoring memory content of a single location.
- the system can use the process 700 to restore memory content of all locations speculative write instructions have accessed.
- the process 700 is applied to one addressable unit (e.g., a byte) at a time. If the width of the values in the write log is larger than the addressable unit, the process 700 is applied for each addressable unit in the log entry (e.g., one unit at a time).
- the system determines a processor, processor core, or thread that wrote a final value to a location in memory.
- the system stores a value of the location and an address of the location. For example, the system can store 720 the value of the location in an “Actual” register and the address of the location (e.g., current address) in a “Current address” register.
- the value in the Actual register is compared 730 with all of the write log entries in all of the write logs.
- the system determines (“Yes” branch of step 740 ) a matching entry.
- the system stores the matching entry in a data structure to book-keep the matching entries (e.g., back-track stack 800 of FIG. 8 ).
- the back-track stack 800 can include a number of entries. In some implementations, the number of entries can be less than or equal to a total number of entries in all of the write logs in the system.
- the back-track stack can be coupled to an interconnect (e.g., interconnect 140 of FIG. 1 ) and can be accessed by one or more processors, processor cores, or threads in the system.
- each entry in the back-track stack 800 can include a current “Actual” field 810 and a “Path” field 820 .
- the value in the Path field 820 can identify a particular processor, processor core, or thread.
- Each entry in the back-track stack can also include a number of fields 830 equal to the number of processors, processor cores, or threads. Each field can store an entry number that identifies the write log entry corresponding to the processor, processor core, or thread for the matching entries.
- the back-track stack can be a pointer structure. A pointer 840 keeps track of a next free entry in the back-track stack 800 .
- matching entries are stored 780 in a back-track stack (e.g., back-track stack 800 ).
- the value in the Actual register is stored in the Actual field, and write log entry numbers for all matching entries are stored in a corresponding field of the processor, processor core, or thread of a matching entry.
- An identifier of the particular processor, processor core, or thread (e.g., processor number one) is stored 790 in the Path field.
- the matching write log entry for the particular processor, processor core, or thread can be called the current entry.
- a checked bit for the current address in the current entry is marked 790 .
- the system determines a next value.
- the next value is an old value stored in the current entry in the write log (e.g., Actual.old).
- the next value is stored 795 in the Actual register. If there are matching entries remaining (e.g., the system determines another matching entry; “Yes” branch of step 740 ), the system returns to step 780 .
- the system determines if all entries have been checked. In particular, the system checks whether all write log entries containing the current address have the checked bit set, or if the old values and new values for the current address are equal. If all entries have been checked (“Yes” branch of step 750 ), then the system restores 760 a memory value and terminates 710 . In particular, the system stores the Actual value in the location, thereby restoring the original value of the location.
- the system returns, or back-tracks, 770 to the previous entry in the back-track stack that included multiple matching entries (e.g., more than one of the processor, processor core, or thread fields are non-empty).
- the last entry used is pointed to by a pointer (e.g., pointer 840 of FIG. 8 ).
- the system clears the checked bit associated with the current address in the write log entry identified by the processor, processor core, or thread number field for the processor number in the Path field; the system then clears that processor, processor core, or thread number field and the Path field.
- the entry in this field becomes the current entry.
- the checked bit in the current entry is marked 790 , and the corresponding processor, processor core, or thread number is stored in the Path field. If there are no more matching entries in the processor, processor core, or thread number fields, the system returns to the previous entry in the back-track stack (e.g., the system returns to step 770 ).
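The back-track stack of FIG. 8 can be modeled as a small array-based stack. The type and field names below (actual, path, entry) are illustrative stand-ins for the Actual field 810, the Path field 820, and the per-processor fields 830; the capacity and processor count are arbitrary assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define NPROC     4   /* processors, processor cores, or threads (assumed) */
#define STACK_MAX 64  /* bounded by total write log entries in the system */

/* One back-track stack entry: the Actual value being traced, the
 * processor identifier whose write log entry was followed (Path), and
 * one field per processor holding the matching write log entry number
 * (-1 when no entry of that processor matched). */
typedef struct {
    uint8_t actual;        /* Actual field 810 */
    int     path;          /* Path field 820 */
    int     entry[NPROC];  /* fields 830: matching write log entry numbers */
} bt_entry_t;

typedef struct {
    bt_entry_t entries[STACK_MAX];
    int        top;        /* pointer 840: next free entry */
} bt_stack_t;

/* Store a set of matching entries (step 780). */
static void bt_push(bt_stack_t *s, bt_entry_t e)
{
    assert(s->top < STACK_MAX);
    s->entries[s->top++] = e;
}

/* Back-track to the most recent entry (step 770). */
static bt_entry_t *bt_pop(bt_stack_t *s)
{
    assert(s->top > 0);
    return &s->entries[--s->top];
}
```

The pop operation corresponds to returning to the previous entry that still has unexplored matching entries during roll-back.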
- the process for speculative throughput computing can be implemented in software.
- the access matrix table can be implemented as a data structure in virtual memory that can be accessed by regular read and write instructions.
- the speculative read and write instructions can be emulated using a sequence of regular computer instructions that access the data structure in virtual memory (e.g., global variables, and pointer structures).
- FIG. 9 illustrates data structures 900 in virtual memory for an example software implementation of the process for speculative throughput computing.
- FIG. 9 illustrates a matrix with rows that are pointed to by pointer variables (e.g., PTR1, PTR2, . . . PTRM; where M>0).
- a row can be allocated in the data structure 900 at the time a computer program (e.g., a computer program written in C programming language) is compiled.
- Each pointer can be associated with a number of elements equal to a number of program segments (e.g., P 1 , P 2 , . . . PN; where N>0).
- a “SUM” element can also be associated with each pointer.
- Each element can include but is not limited to: a maximum data address (e.g., MAX entity), a minimum data address (e.g., MIN entity), and an indicator (e.g., W entity).
- the MAX entity can store a maximum data address accessed by a corresponding program segment.
- the MIN entity can store a minimum data address accessed by a corresponding program segment.
- the W entity can identify whether a location between the maximum data address and the minimum data address has been modified (e.g., accessed by a write instruction) by the corresponding program segment.
- the software implementation can also generate a write log and write log pointer in virtual memory that is analogous to the write log 420 and write log pointer 430 of FIG. 4 .
- the write log can be generated using, for example, global variables or a pointer structure.
- FIG. 10 shows an example process 1000 for generating the structures of FIG. 9 .
- process 1000 will be described with respect to a system that performs the process 1000 .
- a speculative read or write instruction can be emulated using regular read or write instructions (e.g., load or store instructions).
- the speculative read or write instructions can be regular read or write instructions that are augmented with a sequence of ordinary instructions (e.g., checking code).
- the checking code can be used to determine 1020 if a read or write instruction is speculative. If the read or write instruction is not speculative (“No” branch of step 1020 ), then the read or write instruction is executed as usual (e.g., executed as a regular read or write instruction) and the system executes 1010 a next instruction.
- the system compares an effective address (e.g., ADDR of FIG. 10 ) of the speculative read or write instruction with a maximum data address and minimum data address. For example, the system retrieves 1030 MAX, MIN, and W entities from a data structure using the address for the pointer and the program segment number. If the effective address is greater than the maximum data address (“Yes” branch of step 1040 ), then the system stores 1050 the effective address as the maximum data address. Otherwise (“No” branch of step 1040 ), if the effective address is less than the minimum data address (“Yes” branch of step 1055 ), then the system stores 1060 the effective address as the minimum data address.
- the system determines whether or not the speculative read or write instruction is a speculative write instruction. If the speculative read or write instruction is not a speculative write instruction (“No” branch of step 1070 ), then the system executes 1010 a next instruction. If the speculative read or write instruction is a speculative write instruction (“Yes” branch of step 1070 ), then the system sets 1080 an indicator (e.g., a “W” bit) to identify a speculative write instruction. In addition, the system stores 1090 the write, or the effective address and the data associated with the write instruction, in a write log (e.g., a write log analogous to write log 420 , but implemented in virtual memory).
- the effective address and the data associated with the write log can be stored in a write log entry pointed to by a write log pointer, and the write log pointer is incremented to point to a next free write log entry. Then, the system returns to step 1010 .
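The checking code of process 1000 can be sketched in C roughly as follows. The data structures are a simplified stand-in for the FIG. 9 elements and the virtual-memory write log; all names and the fixed sizes are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEGMENTS 4
#define LOG_MAX   128

/* One element of a FIG. 9 row: the MAX, MIN, and W entities for one
 * program segment. Illustrative field names. */
typedef struct {
    uintptr_t max_addr;  /* MAX entity */
    uintptr_t min_addr;  /* MIN entity */
    bool      written;   /* W entity  */
} seg_element_t;

typedef struct { uintptr_t addr; uint8_t data; } log_entry_t;

static seg_element_t elements[NSEGMENTS];
static log_entry_t   write_log[LOG_MAX];
static int           write_log_ptr;  /* next free write log entry */

static void init_segment(int seg)
{
    elements[seg].max_addr = 0;
    elements[seg].min_addr = UINTPTR_MAX;
    elements[seg].written  = false;
}

/* Checking code for one speculative access (steps 1030-1090): widen the
 * segment's [MIN, MAX] range, and for a write, set the W entity and
 * append the address/data pair to the write log. */
static void speculative_access(int seg, uintptr_t addr,
                               bool is_write, uint8_t data)
{
    seg_element_t *e = &elements[seg];
    if (addr > e->max_addr) e->max_addr = addr;  /* steps 1040-1050 */
    if (addr < e->min_addr) e->min_addr = addr;  /* steps 1055-1060 */
    if (is_write) {                              /* step 1070 */
        e->written = true;                       /* step 1080 */
        assert(write_log_ptr < LOG_MAX);
        write_log[write_log_ptr].addr = addr;    /* step 1090 */
        write_log[write_log_ptr].data = data;
        write_log_ptr++;                         /* point to next free entry */
    }
}
```

A non-speculative read or write would bypass this code entirely, which is how the emulation keeps overhead confined to the marked instructions.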
- the system compares a first precise range of locations with a second precise range of locations to determine whether a range of locations accessed by a program overlaps with the ranges of locations accessed by other program segments.
- the system selects 1100 a first pointer (e.g., a first pointer row, pointed to by PTR 1 , in the matrix of FIG. 9 ).
- the system selects 1110 a first element for the first pointer, and retrieves the MAX, MIN, and W entities corresponding to the first element.
- the system selects 1115 a next element for the first pointer, and retrieves the MAX, MIN, and W entities corresponding to the next element. If the first pointer or the next element does not exist, then the process terminates.
- the system compares the maximum data address (e.g., MAX entity) and the minimum data address (e.g., MIN entity) from the first element with the maximum data address (e.g. MAX entity) and minimum data address (e.g., MIN entity) from the next element.
- the system determines if there is a harmful overlap. In particular, if the indicator in the first element is set, or the indicator in the next element is set; then a speculative write instruction has accessed the first precise range of locations corresponding to the first program segment, or the second precise range of locations corresponding to the second program segment, respectively. If the maximum data address from the first element is less than the maximum data address from the next element; and the maximum data address from the first element is greater than the minimum data address from the next element; then the first precise range of locations overlaps with the second precise range of locations. The system determines (“Yes” branch of step 1120 ) a harmful overlap, and the system identifies 1140 a miss-speculation.
- the system determines if there are more entries for the pointer.
- the system compares each element associated with a pointer with all of the other elements associated with the pointer, in this manner. For example, assume that a row of a current pointer includes elements A, B, and C. The system compares pairs A with B, A with C, and B with C. If there are more entries for the pointer (“Yes” branch of step 1130 ), then the system returns to step 1115 .
- the system determines and stores 1135 the values (e.g., MAX entity, MIN entity, and W entity) in the SUM element for the row.
- the system determines the maximum of all the MAX entities (e.g., highest maximum data address) and the minimum of all the MIN entities (e.g., lowest minimum data address) in the row. For example, assume that a row includes elements A, B, and C. The determined maximum is the highest MAX value of A, B, and C; and the determined minimum will be the lowest MIN value of A, B, and C. The determined maximum and minimum values are stored in the MAX and MIN entities of the SUM element of the row.
- the system computes the W entity as the logical OR operation of the W entities of all the elements in the row. For example, assume that a row includes elements A, B, and C.
- the W entity of the SUM element is computed using the expression W(SUM) = W(A) OR W(B) OR W(C), where:
- W(x) represents the W entity of element x.
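The SUM computation of step 1135 can be sketched as a single pass over a row's elements; the element_t type and the sum_row name are illustrative, not from the patent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* MAX, MIN, and W entities of one element (illustrative names). */
typedef struct {
    uintptr_t max_addr;
    uintptr_t min_addr;
    bool      written;
} element_t;

/* Step 1135: the SUM element of a row holds the highest MAX, the lowest
 * MIN, and the logical OR of the W entities of all elements in the row,
 * e.g., W(SUM) = W(A) OR W(B) OR W(C). */
static element_t sum_row(const element_t *row, int n)
{
    element_t sum = { 0, UINTPTR_MAX, false };
    for (int i = 0; i < n; i++) {
        if (row[i].max_addr > sum.max_addr) sum.max_addr = row[i].max_addr;
        if (row[i].min_addr < sum.min_addr) sum.min_addr = row[i].min_addr;
        sum.written = sum.written || row[i].written;
    }
    return sum;
}
```

The SUM elements then summarize each processor, processor core, or thread so the cross-processor comparison only needs one range per row.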
- step 1105 If there are more pointers to compare (“Yes” branch of step 1105 ), then the system returns to step 1110 . If there are not more pointers to compare (“No” branch of step 1105 ), then the system compares the SUM elements. In particular, the system selects 1145 a first SUM element, and retrieves MAX, MIN, and W entities corresponding to the first SUM element. The system selects 1150 a next SUM element, and retrieves MAX, MIN, and W entities corresponding to the next SUM element.
- the system uses these entities to compare a first precise range of locations with a second precise range of locations to determine whether a range of locations accessed by a processor, processor core, or thread overlaps with the ranges of locations accessed by other processors, processor cores, or threads.
- the system determines miss-speculations using a process (e.g., steps 1155 and 1165 ) analogous to the process described previously with reference to steps 1120 and 1130 . In particular, if the system does not identify a harmful overlap (“No” branch of step 1155 ), the system determines if the system has compared all SUM elements. If the system has compared all SUM elements (“Yes” branch of step 1165 ), the system stops 1170 . Otherwise (“No” branch of step 1165 ), the system returns to step 1150 . The SUM elements are compared in a manner analogous to the comparison of the row elements described previously.
- the system restores the memory content of locations speculative write instructions have accessed in a manner analogous to that described above with reference to FIG. 7 .
- however, the system uses computer instructions instead of data structures implemented in hardware.
- pointer structures can be used to implement the data structures, write log, and write log pointer.
- the system uses the write log and a back-track stack to restore the memory content of locations speculative write instructions have accessed.
- miss-speculations can also be reduced in the process for speculative throughput computing.
- FIG. 12 illustrates an example process 1200 for reducing miss-speculations.
- process 1200 will be described with respect to a system that performs the process 1200 .
- the system generates a schedule of program segments for a program to reduce a number of miss-speculations.
- the system derives 1210 a data dependence graph.
- the system determines 1220 an execution order of the program segments from the data dependence graph.
- the system executes 1230 the program segments according to the execution order.
- the system compares 1240 the program segment executions to identify dependencies.
- FIG. 13 illustrates an example data dependence graph 1300 .
- the data dependence graph can be derived from a computer program. For example, assume that a computer program includes the following code:
- a matrix element with index variables i and j (e.g., A[i,j]) is the sum of four neighbor matrix elements (e.g., A[i−1,j]+A[i+1,j]+A[i,j−1]+A[i,j+1]).
- if the iterations are enumerated in the order that they are executed (e.g., a first iteration is represented by state 1, and a second iteration by state 2), the resulting dependencies between iterations can be derived, as illustrated in FIG. 13 .
- for example, the fifth enumeration (e.g., represented by state 5) does not depend on the second enumeration (e.g., represented by state 2; A[1,2]), but the fifth enumeration depends on the first enumeration (A[1,1]).
- the arrows in the state graph illustrate the dependencies.
- the program segments to be executed in parallel can be formed using each of the iterations. Because there are dependencies between almost all of the consecutive iterations (e.g., 4 depends on 3, 2, and 1), uncovering parallelism can be difficult.
- the data dependence graph allows a system to uncover parallelism, for example, in the loop. For example, program segments with enumeration order 4, 7, 10, and 13 can be executed in parallel.
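The parallelism uncovered above can be made concrete for the stencil example: with 1-based loop indices and row-major enumeration order (i−1)·N + j, all iterations on an anti-diagonal i+j = constant touch disjoint elements and are mutually independent. A minimal sketch, assuming a 4×4 iteration space:

```c
#include <assert.h>

#define N 4  /* assumed 4x4 iteration space, matching the example */

/* The stencil A[i,j] = A[i-1,j]+A[i+1,j]+A[i,j-1]+A[i,j+1] creates
 * dependencies between neighboring iterations, but iterations on the
 * same anti-diagonal (constant i+j) are independent of one another.
 * Enumeration order follows the loop nest: iteration (i,j) is number
 * (i-1)*N + j, with 1-based i and j. */
static int wavefront_members(int diag, int out[N])
{
    int count = 0;
    for (int i = 1; i <= N; i++) {
        int j = diag - i;
        if (j >= 1 && j <= N)
            out[count++] = (i - 1) * N + j;  /* enumeration order */
    }
    return count;
}
```

For the diagonal i+j = 5 this yields enumeration orders 4, 7, 10, and 13, matching the program segments the description identifies as executable in parallel.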
- the system augments the program with trigger points that delimit program segments.
- the trigger points increment a variable that records the enumeration order of the program segment.
- the program loop (code illustrated above) can be augmented with a trigger point (e.g., “trigger;”) as in the following example code:
- A[i,j] = A[i−1,j]+A[i+1,j]+A[i,j−1]+A[i,j+1]; end;
- Each iteration in the program loop includes a call to the trigger point.
- the trigger point can increment a variable that records the enumeration order of the program segment.
- the system also augments the program with markings of read and write instructions that potentially cause dependencies. The markings identify the instructions as speculative instructions.
- a trap is generated each time a speculative read or write instruction is executed.
- a software routine (e.g., a trap handler) records the effective address of the speculative read or write instruction in a file.
- the enumeration order of the program segment is also recorded in the file.
- Post-processing of the file is used to derive 1210 the data dependence graph, and the system determines 1220 an execution order of the program segments from the data dependence graph.
- the system executes 1230 the program segments according to the execution order.
- the system can use a programming construct to execute program segments in parallel.
- an example program loop that uses the construct includes seven “parallel_for” constructs.
- the system uses the construct to execute 1230 the program segments according to the execution order.
- the second “parallel_for” construct allows the system to execute a set of program segments (2,1) and (1,2) in parallel.
- consecutive “parallel_for” constructs are executed sequentially.
- the third “parallel_for” construct allows the system to execute program segments (3,1), (2,2) and (1,3) in parallel, after the system executes program segments (2,1) and (1,2) in parallel.
- the system compares 1240 the program segment executions to identify dependencies.
- the system compares 1240 the program segment executions to identify dependencies, after all of the program segments have been executed (e.g., at a “commit” point in the program).
- the system analyzes 210 data dependence of program segments of a program, and the system profiles 220 the program.
- the system collects statistics for use in selecting a speculative method (e.g., a process for speculative throughput computing).
- the statistics include a number of instructions executed in program segment i (N i ), where i is an integer greater than zero; an average number of cycles used by an instruction (CPI i ); a number of speculative read instructions in program segment i (R i ); and a number of speculative write instructions in program segment i (W i ).
- the system predicts a speedup gain from speculative execution.
- the system determines an execution time of a sequential execution of the program (T sequential ).
- the system derives the speedup gain by applying an analytical model.
- the analytical model can include execution times represented by an equation:
- T_exec(K) = Max[ T_segment-start(K) + N_i·CPI_i + R_i·R_cost(K) + W_i·W_cost(K) + P_roll-back·Roll-back_cost(K) + (1 − P_roll-back)·C_cost(K) ], where:
- K identifies one of a plurality of speculative methods; Max is a maximum function; T segment-start models a start-up cost of initiating a speculative thread; R cost models a cost of a speculative read instruction; W cost models a cost of a speculative write instruction; P roll-back is a probability of a miss-speculation; Roll-back cost models a cost of a roll-back; and C cost models a cost of a commit.
- the expression inside the maximum function (Max) is an estimated execution time for program segment i.
- the maximum function determines an execution time of a slowest program segment for a speculative method K. Therefore, the analytical model includes execution times of a plurality of speculative methods, where each speculative method's execution time equals the execution time of its slowest program segment.
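The analytical model can be sketched directly from the equation above. The struct layouts are illustrative, and the numeric values used for testing are made up, not values from the specification.

```c
#include <assert.h>

/* Per-segment statistics from profiling (step 220). */
typedef struct {
    double n_i;       /* N_i: instructions executed in segment i */
    double cpi_i;     /* CPI_i: average cycles per instruction */
    double r_i, w_i;  /* R_i, W_i: speculative reads / writes */
} seg_stats_t;

/* Per-method cost parameters for speculative method K. */
typedef struct {
    double t_segment_start;  /* start-up cost of a speculative thread */
    double r_cost, w_cost;   /* cost of a speculative read / write */
    double p_roll_back;      /* probability of a miss-speculation */
    double roll_back_cost;   /* cost of a roll-back */
    double c_cost;           /* cost of a commit */
} method_costs_t;

/* T_exec(K): the maximum over all segments of the estimated
 * per-segment execution time (the slowest segment dominates). */
static double t_exec(const seg_stats_t *segs, int n,
                     const method_costs_t *k)
{
    double max = 0.0;
    for (int i = 0; i < n; i++) {
        double t = k->t_segment_start
                 + segs[i].n_i * segs[i].cpi_i
                 + segs[i].r_i * k->r_cost
                 + segs[i].w_i * k->w_cost
                 + k->p_roll_back * k->roll_back_cost
                 + (1.0 - k->p_roll_back) * k->c_cost;
        if (t > max) max = t;
    }
    return max;
}
```

Each term mirrors one term of the equation, so the inner expression is the estimated execution time for program segment i under method K.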
- Using T_exec(K) from the analytical model, the system can predict a speedup gain for each speculative method K (e.g., as the ratio of T_sequential to T_exec(K)).
- the system selects 230 one or more of a plurality of speculative methods (e.g., processes described with reference to FIGS. 3-13 ) to use for translation using the speedup gains for each speculative method. For example, the system selects the speculative method with the highest speedup gain. In some implementations, the system selects sequential execution if sequential execution has a lower execution time than the speculative methods (e.g., T sequential ⁇ T exec (K), for all K).
- the system selects a combination of speculative methods. For example, the system can use a combination of parts of one or more speculative methods of the plurality of speculative methods. As another example, the system can use a combination of one or more speculative methods of the plurality of speculative methods executed sequentially. As yet another example, the system can use a combination of one or more speculative methods of the plurality of speculative methods executed in parallel.
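Selection in step 230 can be sketched as picking the strategy with the lowest predicted execution time, which is equivalent to the highest speedup gain since all speedup gains share the same T_sequential numerator. The function below is an illustrative sketch; using index 0 for sequential execution is an assumed convention, not from the patent.

```c
#include <assert.h>

/* Return 0 to select sequential execution, or k (1..n_methods) to
 * select speculative method k, whichever has the lowest predicted
 * execution time. t_exec_k[k-1] holds T_exec(k) for method k. */
static int select_method(double t_sequential,
                         const double *t_exec_k, int n_methods)
{
    int best = 0;                  /* 0 == sequential execution */
    double best_t = t_sequential;
    for (int k = 0; k < n_methods; k++) {
        if (t_exec_k[k] < best_t) {
            best_t = t_exec_k[k];
            best = k + 1;          /* speculative method k+1 */
        }
    }
    return best;
}
```

When T_sequential < T_exec(K) for all K, the function returns 0, matching the description's fallback to sequential execution.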
- the process 200 can be included in a module that can be integrated with a compiler.
- a combination of one or more steps of process 200 (e.g., combinations of steps 210 , 220 , and 230 ) can be included in modules, and the modules can be integrated with compilers.
- a first compiler can include a module that can analyze data dependence of the program (e.g., step 210 ).
- a second compiler can include a module that can perform the other steps of process 200 (e.g., steps 220 and 230 ).
- an apparatus (e.g., multiprocessor computing system 100 , multicore computing system, or multi-threaded computing system) can perform speculative execution (e.g., according to some or all of the processes described with reference to FIGS. 3-13 ).
- a system that includes the apparatus and the compiler can perform speculative execution (e.g., according to some or all of the methods described with reference to FIGS. 3-13 ).
- Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
- Embodiments of the subject matter described in this specification can be implemented as one or more computer program products, e.g., one or more modules of computer program instructions encoded on a tangible program carrier for execution by, or to control the operation of, data processing apparatus.
- the tangible program carrier can be a propagated signal or a computer-readable medium.
- the propagated signal is an artificially generated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a computer.
- the computer-readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination of one or more of them.
- the term “data processing apparatus” encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers.
- the apparatus can include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
- a computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
- a computer program does not necessarily correspond to a file in a file system.
- a program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
- a computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
- the processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output.
- the processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
- processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer.
- a processor will receive instructions and data from a read-only memory or a random access memory or both.
- the essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data.
- a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks.
- a computer need not have such devices.
- a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, to name just a few.
- Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
- the processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
- embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer.
- Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
Abstract
Systems, methods, and apparatuses including computer program products for speculative throughput computing are disclosed. Speculative throughput computing is used to translate a program to execute on a plurality of processors, processor cores, or threads.
Description
- This application claims priority to U.S. Provisional Application Ser. No. 60/897,969, for “System, methods, and business ideas for speculative execution of program segments in multiprocessors,” filed on Jan. 30, 2007, which provisional patent application is incorporated by reference herein in its entirety.
- This subject matter generally relates to throughput computing.
- A program (e.g., computer application) can be partitioned into a plurality of program segments. For example, a program can be partitioned into program segments P1, P2, . . . , PN, where N is a number of program segments. Conventional computing systems can execute the program segments one after another in an enumerated order. For example, a single processor computing system can execute P1 before executing P2, P2 before executing P3, and PN-1 before executing PN. Executing program segments in this order respects sequential semantics (e.g., a program segment with a lower enumeration order y writes to a memory location before a program segment with a higher enumeration order x reads from the memory location, where x>y).
- For example, a first program segment can have an enumeration order i, and a second program segment can have an enumeration order j, where i<j. The first program segment and the second program segment can be executed in parallel without violating sequential semantics if the program segments do not access the same memory locations. Furthermore, sequential semantics is not violated if the first program segment does not write to a memory location after the second program segment reads from the memory location.
- Multiprocessor, multicore, or multithreading computing systems can execute program segments in parallel (e.g., executing program segments at substantially the same time) on a plurality of processors, processor cores, or threads. Executing program segments in parallel that were not originally designed to execute in parallel can be referred to as “speculative execution.”
- Conventional compilers can partition a program into program segments by determining which program segments access the same memory locations. Due to limitations of conventional analysis methods, or because accessed memory locations are unknown at a time of compiling, many programs cannot be partitioned by conventional compilers to allow for parallel execution of the program segments.
- For example, some conventional analysis methods execute write instructions of program segments at temporary memory locations. These conventional analysis methods create execution overhead associated with using the temporary memory locations (e.g., storing and moving data from the temporary memory locations). Other conventional analysis methods use centralized data structures to store original data of memory locations where write instructions of program segments write, so that the original data may be restored. Updating the centralized structure can cause excessive overhead, especially if the write log is implemented in software using a dedicated data structure. Furthermore, if a miss-speculation occurs (e.g., when a program segment with an enumeration order i has written to a location that a program segment with an enumeration order j has already read, where i<j), program segments with an enumeration order higher (e.g., greater) than j halt and redo their executions. Halting and redoing executions causes execution overhead that can make speculative execution inefficient.
- Furthermore, typical hardware and software implementations of conventional analysis methods use complex mechanisms and are inefficient. For example, typical software implementations create execution overhead because they use extra instructions, and cause poor memory system performance by lowering the memory locality, which can result in cache misses.
- Some implementations monitor fixed regions of memory (e.g., a fixed range of one or more consecutive memory locations) to track if the region has been modified. A region size that is too large can result in false determinations of violations of sequential semantics. For example, a program segment may be forced to halt and redo its execution if it accesses the same region as another program segment with a lower enumeration order, even if none of the program segments accesses the same location. Alternatively, a region size that is too small increases the overhead of monitoring read and write instructions.
- In addition, conventional profiling methods (e.g., test executions of programs to determine properties of a program to predict gains from speculative execution of the programs) assume a single method for speculative execution. Methods for speculative execution are also referred to as “speculative methods” or “processes for speculative throughput computing.” Furthermore, conventional dependence analyzers often are not able to determine whether the program segments may be executed in parallel.
- Systems, methods, and apparatuses including computer program products for speculative throughput computing are disclosed. Speculative throughput computing is used to translate a program to execute on a plurality of processors, processor cores, or threads.
- Particular embodiments of the subject matter described in this specification can be implemented to realize one or more of the following advantages. An advantage of speculative throughput computing is that it reduces execution overheads by using speculative read and write computer instructions.
- An additional advantage of speculative throughput computing is that it reduces execution overhead related to commit operations by assuming that a speculation is valid, and by restoring content of speculatively modified memory locations using a decentralized scheme.
- An additional advantage of speculative throughput computing is that it determines speculative parallelism by executing program segments out of sequential order.
- An additional advantage of speculative throughput computing is that it increases the accuracy of determining violations of sequential semantics and reduces execution overhead by maintaining a precise range of locations that have been speculatively accessed.
- An additional advantage of speculative throughput computing is that it increases speedup gains from speculative execution by selecting one of a plurality of speculative methods.
- An additional advantage of speculative throughput computing is that by increasing speedup gains, speculative throughput computing lowers an operating frequency of a system and reduces energy consumption.
-
FIG. 1 illustrates an example multiprocessor computing system. -
FIG. 2 shows an example process for translating a program to execute on a plurality of processors, processor cores, or threads. -
FIG. 3 shows an example process for speculative throughput computing. -
FIG. 4 illustrates structures for an example hardware implementation of the process for speculative throughput computing of FIG. 3. -
FIG. 5 shows an example process for generating the structures of FIG. 4. -
FIG. 6 shows an example process for a commit operation using the structures of FIG. 4. -
FIG. 7 shows an example process for a roll-back operation using the structures of FIG. 4 and a back-track stack. -
FIG. 8 illustrates an example back-track stack. -
FIG. 9 illustrates structures for an example software implementation of the process for speculative throughput computing of FIG. 3. -
FIG. 10 shows an example process for generating the structures of FIG. 9. -
FIG. 11 shows an example process for a commit operation using the structures of FIG. 9. -
FIG. 12 shows an example process for a roll-back operation using the structures of FIG. 9 and a back-track stack. -
FIG. 13 illustrates an example data dependence graph. -
FIG. 1 illustrates an example multiprocessor computing system 100. The multiprocessor computing system 100 includes processors (e.g., processors 131 and 132); cache memory; an interconnect 140; and a memory 110. The processors and the cache memory are coupled to the interconnect 140 (e.g., a bus, or a crossbar switch). The interconnect 140 allows cache memory to send a request for additional memory to memory 110 or other cache memory. - In some implementations, additional hierarchies of memory can be used in the
multiprocessor computing system 100. For example, the memory 110 can be a secondary cache that is coupled to additional memory. In some implementations, the cache memory includes local memory that can be accessed by a processor coupled to the local memory. For example, a read or write instruction that is executed by the processor 131 can access local memory coupled to the processor 132 by invoking a software routine that sends a signal to the processor 132. The signal can invoke a software routine executed by the processor 132, where the processor 132 accesses the local memory and returns a value to the processor 131 by sending a signal to the processor 131. - Coherence (e.g., cache coherence) is maintained in the cache memory. In some implementations, a write-invalidate cache coherence mechanism can be used to maintain cache coherence. The write-invalidate cache coherence mechanism can invalidate a block of memory (e.g., contiguous locations of memory) in a cache memory when a processor coupled to another cache memory writes to the block of memory. In some implementations, a write-update cache coherence mechanism can be used to maintain cache coherence. In particular, a block of memory is updated when a processor coupled to another cache modifies the block of memory. In some implementations, a distribution protocol of invalidate and update requests can be one-to-all (e.g., snoopy cache protocols). In some implementations, the distribution protocol of invalidate and update requests can be one-to-one (e.g., directory-based protocols).
- In some implementations, one or more of the processors can include a plurality of independent processor cores (e.g., a multi-core processor). For example, a dual-core processor includes two processor cores, and a quad-core processor includes four processor cores. The processor cores can execute a plurality of threads (e.g., program segments created by forks or splits of a program, or threads of execution) in parallel.
-
FIG. 2 shows an example process 200 for translating a program to execute on a plurality of processors, processor cores, or threads. For example, a computer application written in a high-level programming language (e.g., C, C++, Fortran, or Java) can be translated into a machine language program that is executed on a processor. For convenience, the process 200 will be described with respect to a system that performs the process 200. - The system analyzes 210 data dependence of program segments of a program. In particular, the system determines whether program segments of the program that can be sequentially executed (e.g., one after another) on a single-processor computing system can be executed in parallel (e.g., on a plurality of processors in a multiprocessor computing system, on a plurality of processor cores on a single processor, or on a plurality of threads).
- The system determines one or more methods to speculatively execute read and write instructions of program segments in parallel for use in the translation. In particular, the system profiles 220 the program. During a profiling pass, the system executes the program to collect statistics. Some examples of statistics include but are not limited to: a number of instructions executed in a program segment, an average number of cycles needed by each instruction, and a number of speculative reads and writes in a program segment. The system then selects 230 one or more speculative methods to use for translation. In some implementations, the system selects a combination of speculative methods.
-
FIG. 3 shows an example process 300 for speculative throughput computing. For convenience, the process 300 will be described with respect to a system that performs the process 300. In some implementations, the system generates 310 a precise range of locations that speculative read or write instructions have accessed. The range of locations can be precise because the highest location (e.g., a maximum data address in the range) has been accessed by a speculative read or write instruction. Furthermore, the lowest location (e.g., a minimum data address in the range) has been accessed by a speculative read or write instruction. - In some implementations, the system generates a precise range of locations that speculative read or write instructions have accessed for each processor in a multiprocessor, processor core in a multi-core processor, or thread. Other implementations are possible.
- If a speculative write instruction has accessed a first precise range of locations corresponding to a first program segment or a second precise range of locations corresponding to a second program segment, the system can compare 320 the first precise range of locations with the second precise range of locations. In particular, the system can determine if a speculative execution of the program segments in parallel respects sequential semantics. For example, the first program segment and the second program segment can be executed in parallel. The first program segment can have an enumeration order less than an enumeration order of the second program segment. If the second program segment reads from a memory location before the first program segment writes to the memory location, then sequential semantics has been violated.
- Sequential semantics can be violated if the ranges of locations overlap. For example, if the first precise range of locations overlaps with the second precise range of locations, then the first program segment and second program segment may have accessed the same memory location. If the first precise range of locations overlaps with the second precise range of locations and a location in the first precise range or the second precise range has been modified, the system identifies 330 a miss-speculation (e.g., a speculation that may not conform to sequential semantics).
- If a miss-speculation is identified, the system restores 340 the memory content of locations speculative write instructions have accessed. Implementations of the
process 300 for speculative throughput computing will be described in further detail with respect to FIGS. 4-13. -
FIG. 4 illustrates structures 400 for an example hardware implementation of the process for speculative throughput computing of FIG. 3. The structures, in hardware, include a data structure 410 (e.g., an access matrix table), a write log 420, and a write log pointer 430. In some implementations, each processor is extended with the structures (e.g., data structure 410, write log 420, and write log pointer 430). In some implementations, each processor core in a processor or thread is extended with the structures. - In some implementations, the
data structure 410 is a table with a number of entries. Each entry can include fields. Some examples of fields include but are not limited to: an instruction address field 411, a data validity field 412, a maximum data address field 413, a minimum data address field 414, and an indicator field 415. The instruction address field 411 (e.g., “TAG” field) can store an instruction address (e.g., a 64-bit address that identifies a location of an instruction in memory). The data validity field 412 (e.g., “V” field) can store, for example, a single bit that indicates whether the data stored in an entry is valid. The maximum data address field 413 (e.g., “MAX” field) and the minimum data address field 414 (e.g., “MIN” field) can store data addresses. In particular, a maximum data address is stored in the maximum data address field 413 and a minimum data address is stored in the minimum data address field 414. The maximum data address field 413 and minimum data address field 414 can be used to define a precise range of locations, as described above with reference to FIG. 3. The indicator field 415 (e.g., “RW” field) can store an indicator (e.g., a single bit) to indicate whether the instruction corresponding to the entry is a read instruction or a write instruction. - In some implementations, the
write log 420 can be a table with entries that can store addresses and data associated with write instructions. In some implementations, a register can be used to store the write log pointer 430. In particular, the register can store an index of a next free entry in the write log 420. - The data associated with the write instructions can include an old value (e.g., a value of a location before a write instruction is executed), a new value (e.g., a value a write instruction will write to the location), and one or more status bits. For example, if a write instruction is executed, a value (e.g., new value) is written to a memory location. The original value (e.g., value in the memory location before the write instruction was executed) is stored as the old value.
- The size of the old values and new values (e.g., number of bits that are used to store the value) can be selected for different architectures. For example, a 32-bit architecture can include old values and new values that are 32-bit values. As another example, a 64-bit architecture can include old values and new values that are 64-bit values.
- The status bits can indicate whether the data stored in the old value and new value is valid. The status bits can also be used when deriving the old value of one or more memory locations. The number of status bits can depend on the size of the values (e.g., old values, and new values). For example, each entry in the write log can include one status bit per addressable unit (e.g., a status bit for each 8-bit value, or byte). In some implementations, unused bits or bytes in a log entry are filled with placeholder bits or bytes. For example, the placeholder bits or bytes can be the same bit or byte values used to fill the unused bits or bytes in both the old value and the new value.
-
FIG. 5 illustrates an example process 500 for generating the structures of FIG. 4. For convenience, the process 500 will be described with respect to a system that performs the process 500. - The system can distinguish between two types of memory instructions: regular read and write instructions (e.g., read and write instructions that are not speculative) and speculative read and write instructions. The system determines 501 if a read or write instruction is speculative. If the read or write instruction is not speculative (“No” branch of step 501), the read or write instruction is executed as usual (e.g., executed as a regular read or write instruction). Then, the system executes 513 a next instruction.
- If the read or write instruction is speculative (“Yes” branch of step 501), a speculative read or write instruction is executed. The system locates 502 the speculative read or write instruction in the data structure (e.g.,
data structure 410 of FIG. 4) for a corresponding processor, processor core, or thread. In some implementations, the data structure can be a hash structure and can use address mapping methods, such as, for example, fully-associative, direct-mapped, or set-associative address mapping methods to locate instructions. - If the execution of the speculative read or write instruction is a first execution of the speculative read or write instruction (e.g., a match is not located in the data structure; “No” branch of step 503), then the system stores 504 the speculative read or write instruction. In particular, the system compares the instruction address of the speculative read or write instruction to the TAG fields of all of the entries in the data structure. If the execution of the speculative read or write instruction is a first execution of the speculative read or write instruction (e.g., the instruction address does not match any of the TAG fields), then the speculative read or write instruction is stored in the data structure. The system can allocate a new entry or use an entry that was evicted earlier.
- If the execution of the speculative read or write instruction is not the first execution (e.g., a match is located in the data structure; “Yes” branch of step 503), the system compares an effective address of the speculative read or write instruction (e.g., a memory address that the speculative read or write instruction will access) with a maximum data address and a minimum data address. In particular, the system retrieves 505 values from the MAX field and MIN field in the data structure. If the effective address (e.g., ADDR of
FIG. 5) is greater than the maximum data address (“Yes” branch of step 506), then the system stores 507 the effective address as the maximum data address. Otherwise (“No” branch of step 506), if the effective address is less than the minimum data address (“Yes” branch of step 508), then the system stores 509 the effective address as the minimum data address. - The system determines whether or not the speculative read or write instruction is a speculative write instruction. If the speculative read or write instruction is not a speculative write instruction (“No” branch of step 510), then the system performs
step 513. If the speculative read or write instruction is a speculative write instruction (“Yes” branch of step 510), then the system sets 511 an indicator (e.g., a bit in the “RW” field) to identify a speculative write instruction. In addition, the system stores 512 the write, or the effective address and the data associated with the write instruction, in the write log (e.g., write log 420). In particular, the effective address and the data associated with the write instruction can be stored in a write log entry pointed to by the write log pointer (e.g., write log pointer 430 of FIG. 4), and the write log pointer is incremented to point to a next free write log entry (e.g., by incrementing the value in the register). Then, the system performs step 513. - Referring to
FIG. 6, the system compares a first precise range of locations with a second precise range of locations to determine whether a range of locations accessed by one processor, processor core, or thread overlaps with the ranges of locations accessed by other processors, processor cores, or threads. The system determines if there are entries in a first data structure that have not been compared. If there are no more entries (“No” branch of step 602), then the system stops 610. If there are more entries (“Yes” branch of step 602), then the system retrieves 603 a maximum data address (e.g., MAX), a minimum data address (e.g., MIN), and an indicator (e.g., RW) from the data structure for a first processor, processor core, or thread. - In addition, the system determines if there are entries in a second data structure that have not been compared to the entry determined in
step 602. If there are no more entries in the second data structure (“No” branch of step 604), the system returns to step 602. If there are more entries (“Yes” branch of step 604), the system retrieves 605 a maximum data address (e.g., N.MAX), a minimum data address (e.g., N.MIN), and an indicator (e.g., N.RW) in the data structure for a second processor, processor core, or thread. The system compares the maximum data address and the minimum data address from the first processor, processor core, or thread with the maximum data address and minimum data address from the second processor, processor core, or thread. - In particular, if the indicator in the data structure for the first processor, processor core, or thread is set, or the indicator in the data structure for a second processor, processor core, or thread is set (“Yes” branch of step 606); then a speculative write instruction has accessed the first precise range of locations corresponding to a first program segment, or the second precise range of locations corresponding to a second program segment, respectively. If the indicators are not set (“No” branch of step 606), then the system returns to step 604.
- If the maximum data address from the first processor is less than the maximum data address from the second processor, processor core, or thread; and the maximum data address from the first processor is greater than the minimum data address from the second processor (“Yes” branch of step 607); then the first precise range of locations overlaps with the second precise range of locations, and the system identifies 609 a miss-speculation. Otherwise (“No” branch of step 607), the system performs
step 608. - If the minimum data address from the first processor is greater than the minimum data address from the second processor, processor core, or thread; and the minimum data address from the first processor is less than the maximum data address from the second processor (“Yes” branch of step 608); then the first precise range of locations overlaps with the second precise range of locations, and the system identifies 609 a miss-speculation. Otherwise (“No” branch of step 608), the system returns to step 604. The system compares each entry for a processor, processor core, or thread with each entry in all of the other processors, processor cores, or threads, in this manner.
- If a miss-speculation is identified, the system restores memory content of locations speculative write instructions have accessed.
FIG. 7 shows an example process 700 for a roll-back operation using the data structures of FIG. 4 and a back-track stack. In particular, FIG. 7 shows an example process for restoring memory content of a single location. The system can use the process 700 to restore memory content of all locations speculative write instructions have accessed. In some implementations, the process 700 is applied to one addressable unit (e.g., a byte) at a time. If the width of the values in the write log is larger than the addressable unit, the process 700 is applied for each addressable unit in the log entry (e.g., one unit at a time). - The system determines a processor, processor core, or thread that wrote a final value to a location in memory. The system stores a value of the location and an address of the location. For example, the system can store 720 the value of the location in an “Actual” register and the address of the location (e.g., current address) in a “Current address” register. The value in the Actual register is compared 730 with all of the write log entries in all of the write logs. If the current address is contained in a write log entry, a new value corresponding to the current address is equal to the value of the location (e.g., value in the Actual register), and the new value is not equal to a corresponding old value, then the system determines (“Yes” branch of step 740) a matching entry. The system stores the matching entry in a data structure to book-keep the matching entries (e.g., back-track stack 800 of
FIG. 8). - Referring to
FIG. 8, the back-track stack 800 can include a number of entries. In some implementations, the number of entries can be less than or equal to a total number of entries in all of the write logs in the system. The back-track stack can be coupled to an interconnect (e.g., interconnect 140 of FIG. 1) and can be accessed by one or more processors, processor cores, or threads in the system. Also, each entry in the back-track stack 800 can include a current “Actual” field 810 and a “Path” field 820. The value in the Path field 820 can identify a particular processor, processor core, or thread. Each entry in the back-track stack can also include a number of fields 830 equal to the number of processors, processor cores, or threads. Each field can store an entry number that identifies the write log entry corresponding to the processor, processor core, or thread for the matching entries. In some implementations, the back-track stack can be a pointer structure. A pointer 840 keeps track of a next free entry in the back-track stack 800. - Returning to
FIG. 7, matching entries are stored 780 in a back-track stack (e.g., back-track stack 800). The value in the Actual register is stored in the Actual field, and write log entry numbers for all matching entries are stored in a corresponding field of the processor, processor core, or thread of a matching entry.
- The system determines a next value. The next value is an old value stored in the current entry in the write log (e.g., Actual.old). The next value is stored 795 in the Actual register. If there are matching entries remaining (e.g., the system determines another matching entry; “Yes” branch of step 740), the system returns to step 780.
- If there is not another matching entry (“No” branch of step 740), the system determines if all entries have been checked. In particular, the system checks whether all write log entries containing the current address have the checked bit set, or if the old values and new values for the current address are equal. If all entries have been checked (“Yes” branch of step 750), then the system restores 760 a memory value and terminates 710. In particular, the system stores the Actual value in the location, thereby restoring the original value of the location.
- If all entries have not been checked (e.g., determining that a write log entry containing the current address does not have a checked bit set, and the old values and new values for the current address are not equal; “No” branch of step 750), the system returns, or back-tracks, 770 to the previous entry in the back-track stack that included multiple matching entries (e.g., more than one of the processor, processor core, or thread fields are non-empty). The last entry used is pointed to by a pointer (e.g.,
pointer 840 of FIG. 8). The system clears the checked bit associated with the current address in the entry stored in the processor, processor core, or thread number field of the processor number in the Path field; the processor, processor core, or thread number field; and the Path field. If there is another non-empty processor number field in the entry, the entry in this field becomes the current entry. The checked bit in the current entry is marked 790, and the corresponding processor, processor core, or thread number is stored in the Path field. If there are no more matching entries in the processor, processor core, or thread number fields, the system returns to the previous entry in the back-track stack (e.g., the system returns to step 770). - The process for speculative throughput computing can be implemented in software. For example, the access matrix table can be implemented as a data structure in virtual memory that can be accessed by regular read and write instructions. The speculative read and write instructions can be emulated using a sequence of regular computer instructions that access the data structure in virtual memory (e.g., global variables, and pointer structures).
-
FIG. 9 illustrates data structures 900 in virtual memory for an example software implementation of the process for speculative throughput computing. For example, FIG. 9 illustrates a matrix with rows that are pointed to by pointer variables. For each pointer variable (e.g., PTR1, PTR2, . . . PTRM; where M>0), a row can be allocated in the data structure 900 at the time a computer program (e.g., a computer program written in C programming language) is compiled. Each pointer can be associated with a number of elements equal to a number of program segments (e.g., P1, P2, . . . PN; where N>0). A “SUM” element can also be associated with each pointer.
- The software implementation can also generate a write log and write log pointer in virtual memory that is analogous to the
write log 420 and write log pointer 430 of FIG. 4. The write log can be generated using, for example, global variables or a pointer structure. -
FIG. 10 shows an example process 1000 for generating the structures of FIG. 9. For convenience, process 1000 will be described with respect to a system that performs the process 1000. - In a software implementation, a speculative read or write instruction can be emulated using regular read or write instructions (e.g., load or store instructions). For example, the speculative read or write instructions can be regular read or write instructions that are augmented with a sequence of ordinary instructions (e.g., checking code). The checking code can be used to determine 1020 if a read or write instruction is speculative. If the read or write instruction is not speculative (“No” branch of step 1020), then the read or write instruction is executed as usual (e.g., executed as a regular read or write instruction) and the system executes 1010 a next instruction.
- If the read or write instruction is speculative (“Yes” branch of step 1020), then the system compares an effective address (e.g., ADDR of
FIG. 10) of the speculative read or write instruction with a maximum data address and minimum data address. For example, the system retrieves 1030 MAX, MIN, and W entities from a data structure using the address for the pointer and the program segment number. If the effective address is greater than the maximum data address (“Yes” branch of step 1040), then the system stores 1050 the effective address as the maximum data address. Otherwise (“No” branch of step 1040), if the effective address is less than the minimum data address (“Yes” branch of step 1055), then the system stores 1060 the effective address as the minimum data address. - The system determines whether or not the speculative read or write instruction is a speculative write instruction. If the speculative read or write instruction is not a speculative write instruction (“No” branch of step 1070), then the system executes 1010 a next instruction. If the speculative read or write instruction is a speculative write instruction (“Yes” branch of step 1070), then the system sets 1080 an indicator (e.g., a “W” bit) to identify a speculative write instruction. In addition, the
system stores 1090 the write, or the effective address and the data associated with the write instruction, in a write log (e.g., a write log analogous to write log 420, but implemented in virtual memory). In particular, the effective address and the data associated with the write instruction can be stored in a write log entry pointed to by a write log pointer, and the write log pointer is incremented to point to a next free write log entry. Then, the system returns to step 1010. - Referring to
FIG. 11, the system compares a first precise range of locations with a second precise range of locations to determine whether a range of locations accessed by a program segment overlaps with the ranges of locations accessed by other program segments. The system selects 1100 a first pointer (e.g., a first pointer row, pointed to by PTR1, in the matrix of FIG. 9). The system selects 1110 a first element for the first pointer, and retrieves the MAX, MIN, and W entities corresponding to the first element. The system then selects 1115 a next element for the first pointer, and retrieves the MAX, MIN, and W entities corresponding to the next element. If the first pointer or the next element does not exist, then the process terminates. The system compares the maximum data address (e.g., MAX entity) and the minimum data address (e.g., MIN entity) from the first element with the maximum data address (e.g., MAX entity) and minimum data address (e.g., MIN entity) from the next element. - The system determines if there is a harmful overlap. In particular, if the indicator in the first element is set, or the indicator in the next element is set; then a speculative write instruction has accessed the first precise range of locations corresponding to the first program segment, or the second precise range of locations corresponding to the second program segment, respectively. If the maximum data address from the first element is less than the maximum data address from the next element; and the maximum data address from the first element is greater than the minimum data address from the next element; then the first precise range of locations overlaps with the second precise range of locations. The system determines (“Yes” branch of step 1120) a harmful overlap, and the system identifies 1140 a miss-speculation.
If the minimum data address from the first element is greater than the minimum data address from the next element; and the minimum data address from the first element is less than the maximum data address from the next element; then the first precise range of locations overlaps with the second precise range of locations. The system determines a harmful overlap (“Yes” branch of step 1120), and the system identifies 1140 a miss-speculation.
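The range comparison above can be sketched in Python. This is an illustrative sketch, not the patent's implementation: the function name and tuple layout are assumptions, and the interval test is the standard intersection check covering the two symmetric conditions enumerated above.

```python
def harmful_overlap(first, second):
    """Each element is a (MIN, MAX, W) tuple: a precise range of data
    addresses plus an indicator W that is set when a speculative write
    accessed the range."""
    min1, max1, w1 = first
    min2, max2, w2 = second
    # No harm unless at least one of the two ranges was speculatively written.
    if not (w1 or w2):
        return False
    # Standard interval-intersection test over the two [MIN, MAX] ranges.
    return min1 <= max2 and min2 <= max1
```

If `harmful_overlap` returns True for a pair of elements, the system would identify a miss-speculation (step 1140).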
- If the system does not determine a harmful overlap (“No” branch of step 1120), the system determines if there are more entries for the pointer. The system compares each element associated with a pointer with all of the other elements associated with the pointer, in this manner. For example, assume that a row of a current pointer includes elements A, B, and C. The system compares pairs A with B, A with C, and B with C. If there are more entries for the pointer (“Yes” branch of step 1130), then the system returns to step 1115.
- If there are not more entries for the pointer (“No” branch of step 1130), then the system determines and
stores 1135 the values (e.g., MAX entity, MIN entity, and W entity) in the SUM element for the row. In particular, the system determines the maximum of all the MAX entities (e.g., the highest maximum data address) and the minimum of all the MIN entities (e.g., the lowest minimum data address) in the row. For example, assume that a row includes elements A, B, and C. The determined maximum is the highest MAX value of A, B, and C; and the determined minimum is the lowest MIN value of A, B, and C. The determined maximum and minimum values are stored in the MAX and MIN entities of the SUM element of the row. - The system computes the W entity as the logical OR of the W entities of all the elements in the row. For example, assume that a row includes elements A, B, and C. The W entity of the SUM element is computed using the expression:
-
W(A) OR W(B) OR W(C), where - W(x) represents the W entity of element x.
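The per-row summary just described (highest MAX, lowest MIN, OR of the W entities) might be sketched as follows; the function name and tuple layout are illustrative, not from the patent.

```python
def summarize_row(elements):
    # elements: the (MIN, MAX, W) entities of one pointer row.
    # The SUM element keeps the lowest MIN, the highest MAX, and the
    # logical OR of all W indicators, as described above.
    mins, maxs, ws = zip(*elements)
    return (min(mins), max(maxs), any(ws))
```

The SUM elements computed this way are then compared pairwise, just as the per-element ranges were.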
- If there are more pointers to compare (“Yes” branch of step 1105), then the system returns to step 1110. If there are not more pointers to compare (“No” branch of step 1105), then the system compares the SUM elements. In particular, the system selects 1145 a first SUM element, and retrieves MAX, MIN, and W entities corresponding to the first SUM element. The system selects 1150 a next SUM element, and retrieves MAX, MIN, and W entities corresponding to the next SUM element. The system uses these entities to compare a first precise range of locations with a second precise range of locations to determine whether a range of locations accessed by a processor, processor core, or thread overlaps with the ranges of locations accessed by other processors, processor cores, or threads.
- The system determines miss-speculations using a process (e.g., steps 1155 and 1165) analogous to the process described previously with reference to
steps. - If a miss-speculation is identified, the system restores the memory content of locations that speculative write instructions have accessed, in a manner analogous to that described above with reference to
FIG. 7. In some implementations, however, the system uses computer instructions instead of hardware data structures. For example, pointer structures can be used to implement the data structures, the write log, and the write log pointer. The system uses the write log and a back-track stack to restore the memory content of locations that speculative write instructions have accessed. - The example hardware and software implementations of the process for speculative throughput computing, described previously, illustrate processes that identify miss-speculations. In some implementations, miss-speculations can also be reduced in the process for speculative throughput computing.
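The software roll-back mechanism just described (a write log used as a back-track stack) might be sketched as below. This is an illustrative assumption, not the patent's log layout: the class name, the use of a Python dict as memory, and logging the previous contents of each written location are all sketch choices.

```python
class WriteLog:
    """Software write log: each speculative write records the effective
    address and the data about to be overwritten, so a miss-speculation
    can be rolled back in reverse (LIFO) order."""

    def __init__(self):
        self.entries = []  # back-track stack of (address, old_value)

    def speculative_write(self, memory, address, value):
        # Log the previous contents (None if the location was unwritten),
        # then perform the speculative write.
        self.entries.append((address, memory.get(address)))
        memory[address] = value

    def roll_back(self, memory):
        # Pop in LIFO order so the oldest logged value is restored last.
        while self.entries:
            address, old = self.entries.pop()
            if old is None:
                memory.pop(address, None)
            else:
                memory[address] = old
```

On a miss-speculation, `roll_back` restores memory to its pre-speculation state; on a successful commit, the log is simply discarded.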
-
FIG. 12 illustrates an example process 1200 for reducing miss-speculations. For convenience, process 1200 will be described with respect to a system that performs the process 1200. The system generates a schedule of program segments for a program to reduce a number of miss-speculations. In particular, the system derives 1210 a data dependence graph. The system determines 1220 an execution order of the program segments from the data dependence graph. The system executes 1230 the program segments according to the execution order. The system compares 1240 the program segment executions to identify dependencies. -
FIG. 13 illustrates an example data dependence graph 1300. The data dependence graph can be derived from a computer program. For example, assume that a computer program includes the following code: -
while condition do
  for i=1 to 4 do
    for j=1 to 4 do
    begin
      A[i,j] := A[i−1,j]+A[i+1,j]+A[i,j−1]+A[i,j+1];
    end;

- In each iteration of the loop, a matrix element with index variables i and j (e.g., A[i,j]) is the sum of its four neighbor matrix elements (e.g., A[i−1,j]+A[i+1,j]+A[i,j−1]+A[i,j+1]). If the iterations are enumerated in the order that they are executed (e.g., a first iteration is represented by state 1, and a second iteration by state 2), the resulting dependencies between iterations are derived, as illustrated in FIG. 13. For example, in the first enumeration (e.g., represented by state 1), i=1 and j=1, so A[1,1] is derived. As another example, in the second enumeration (e.g., represented by state 2), i=1 and j=2, so A[1,2] is derived. As a further example, in the fifth enumeration (e.g., represented by state 5), i=2 and j=1, so A[2,1] is derived. The fifth enumeration derives A[2,1]=A[1,1]+A[3,1]+A[2,0]+A[2,2]. The fifth enumeration does not depend on the second enumeration (A[1,2]), but it does depend on the first enumeration (A[1,1]). The arrows in the state graph illustrate the dependencies. - The program segments to be executed in parallel can be formed using each of the iterations. Because there are dependencies between almost all of the consecutive iterations (e.g., iteration 4 depends on iterations 3, 2, and 1), uncovering parallelism can be difficult. The data dependence graph allows a system to uncover parallelism, for example, in the loop. For example, program segments with
enumeration order - In order to derive the data dependence graph, the system augments the program with trigger points that delimit program segments. In some implementations, the trigger points increment a variable that enumerates the enumeration order of the program segment. For example, the program loop (code illustrated above) can be augmented with a trigger point (e.g., “trigger;”) as in the following example code:
-
while condition do
  for i=1 to 4 do
    for j=1 to 4 do
    begin
      trigger;
      A[i,j] := A[i−1,j]+A[i+1,j]+A[i,j−1]+A[i,j+1];
    end;

- Each iteration in the program loop includes a call to the trigger point. The trigger point can increment a variable that enumerates the enumeration order of the program segment. The system also augments the program with markings of read and write instructions that potentially cause dependencies. The markings identify the instructions as speculative instructions.
- If the program executes sequentially, a trap is generated each time a speculative read or write instruction is executed. A software routine (e.g., a trap handler) records the address and type (e.g., read or write) of the speculative instruction in a file. The enumeration order of the program segment is also recorded in the file. Post-processing of the file is used to derive 1210 the data dependence graph, and the system determines 1220 an execution order of the program segments from the data dependence graph.
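For the stencil loop above, the post-processing that turns recorded accesses into a data dependence graph could be sketched as follows. The function name and the in-memory log are illustrative (the patent records the accesses to a file); the sketch adds an edge from iteration u to iteration v when v reads a location that an earlier iteration u wrote.

```python
def derive_dependence_graph(n=4):
    # Enumerate the loop iterations in execution order; iteration (i, j)
    # writes A[i,j] and reads its four neighbors.
    order = [(i, j) for i in range(1, n + 1) for j in range(1, n + 1)]
    writes = {}   # address -> enumeration number (state) that wrote it
    edges = set()
    for state, (i, j) in enumerate(order, start=1):
        reads = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
        for addr in reads:
            if addr in writes:
                # Flow dependence: this state reads what an earlier wrote.
                edges.add((writes[addr], state))
        writes[(i, j)] = state
    return edges
```

As in the description above, state 5 (i=2, j=1) depends on state 1 (A[1,1]) but not on state 2 (A[1,2]).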
- The system executes 1230 the program segments according to the execution order. In particular, the system can use a programming construct to execute program segments in parallel. For example, a construct “parallel_for (i,j)=(x1,y1; x2,y2; . . . ; xn,yn)” can be interpreted so that program segments with index variables (x1,y1), (x2,y2), . . . , and (xn,yn) execute speculatively in parallel. As a further example, an example program loop that uses the construct includes the code:
-
while condition do
begin
  parallel_for (i,j) = (1,1)
    A[i,j] := A[i−1,j]+A[i+1,j]+A[i,j−1]+A[i,j+1];
  parallel_for (i,j) = (2,1; 1,2)
    A[i,j] := A[i−1,j]+A[i+1,j]+A[i,j−1]+A[i,j+1];
  parallel_for (i,j) = (3,1; 2,2; 1,3)
    A[i,j] := A[i−1,j]+A[i+1,j]+A[i,j−1]+A[i,j+1];
  parallel_for (i,j) = (4,1; 3,2; 2,3; 1,4)
    A[i,j] := A[i−1,j]+A[i+1,j]+A[i,j−1]+A[i,j+1];
  parallel_for (i,j) = (4,2; 3,3; 2,4)
    A[i,j] := A[i−1,j]+A[i+1,j]+A[i,j−1]+A[i,j+1];
  parallel_for (i,j) = (4,3; 3,4)
    A[i,j] := A[i−1,j]+A[i+1,j]+A[i,j−1]+A[i,j+1];
  parallel_for (i,j) = (4,4)
    A[i,j] := A[i−1,j]+A[i+1,j]+A[i,j−1]+A[i,j+1];
  commit;
end;

- The program loop includes seven "parallel_for" constructs. The system uses the constructs to execute 1230 the program segments according to the execution order. For example, the second "parallel_for" construct allows the system to execute the set of program segments (2,1) and (1,2) in parallel. Consecutive "parallel_for" constructs, however, are executed sequentially with respect to one another. For example, the third "parallel_for" construct allows the system to execute program segments (3,1), (2,2), and (1,3) in parallel, after the system executes program segments (2,1) and (1,2) in parallel.
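The seven "parallel_for" groups above are the anti-diagonals of the 4×4 iteration space: within one diagonal no iteration reads another's output. A sketch of how such a schedule might be generated (the function name is illustrative, and grouping by i+j is an observation about this particular loop, not a method the patent states):

```python
def wavefront_schedule(n=4):
    # Group iterations (i, j) by anti-diagonal i + j; each group can run
    # as one speculative "parallel_for", and groups run in order.
    groups = {}
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            groups.setdefault(i + j, []).append((i, j))
    return [groups[d] for d in sorted(groups)]
```

For n=4 this yields seven groups, matching the seven "parallel_for" constructs in the example loop.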
- The system compares 1240 the program segment executions to identify dependencies.
- Determining whether or not a program segment in a later “parallel_for” construct with a higher enumeration order causes a dependency with a program segment in an earlier “parallel_for” construct with a lower enumeration order may be difficult. In some implementations, the system compares 1240 the program segment executions to identify dependencies, after all of the program segments have been executed (e.g., at a “commit” point in the program).
- Returning to
FIG. 2 , the system analyzes 210 data dependence of program segments of a program, and the system profiles 220 the program. In particular, the system collects statistics for use in selecting a speculative method (e.g., a process for speculative throughput computing). In some implementations, the statistics include a number of instructions executed in program segment i (Ni), where i is an integer greater than zero; an average number of cycles used by an instruction (CPIi); a number of speculative read instructions in program segment i (Ri); and a number of speculative write instructions in program segment i (Wi). - The system predicts a speedup gain from speculative execution. In particular, the system determines an execution time of a sequential execution of the program (Tsequential). The system derives the speedup gain by applying an analytical model. For example, the analytical model can include execution times represented by an equation:
-
Texec(K) = Max[Tsegment-start(K) + Ni·CPIi + Ri·Rcost(K) + Wi·Wcost(K) + Proll-back·Roll-backcost(K) + (1−Proll-back)·Ccost(K)], where: - K identifies one of a plurality of speculative methods; Max is a maximum function; Tsegment-start models a start-up cost of initiating a speculative thread; Rcost models a cost of a speculative read instruction; Wcost models a cost of a speculative write instruction; Proll-back is a probability of a miss-speculation; Roll-backcost models a cost of a roll-back; and Ccost models a cost of a commit.
- The expression inside the maximum function (Max) is an estimated execution time for program segment i. The maximum function determines the execution time of the slowest program segment for a speculative method K. Therefore, the analytical model includes an execution time for each of a plurality of speculative methods, and the execution time of each speculative method equals the execution time of its slowest program segment. Applying Texec(K) from the analytical model, the system can predict a speedup gain using an equation:
-
Speedup(K) = Tsequential/Texec(K). - The system then selects 230 one or more of a plurality of speculative methods (e.g., processes described with reference to
FIGS. 3-13) to use for translation, using the speedup gains for each speculative method. For example, the system selects the speculative method with the highest speedup gain. In some implementations, the system selects sequential execution if sequential execution has a lower execution time than the speculative methods (e.g., Tsequential < Texec(K), for all K). - In some implementations, the system selects a combination of speculative methods. For example, the system can use a combination of parts of one or more speculative methods of the plurality of speculative methods. As another example, the system can use a combination of one or more speculative methods of the plurality of speculative methods executed sequentially. As yet another example, the system can use a combination of one or more speculative methods of the plurality of speculative methods executed in parallel.
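The selection step can be sketched by evaluating the analytical model for each candidate method. The per-segment tuples and dictionary keys below are illustrative names for the quantities Ni, CPIi, Ri, Wi and the per-method cost parameters, not identifiers from the patent.

```python
def t_exec(segments, cost):
    # segments: per-segment statistics (N_i, CPI_i, R_i, W_i).
    # cost: the model parameters for one speculative method K.
    # Texec(K) is the execution time of the slowest program segment.
    return max(
        cost["segment_start"]
        + n * cpi
        + r * cost["read"]
        + w * cost["write"]
        + cost["p_rollback"] * cost["rollback"]
        + (1 - cost["p_rollback"]) * cost["commit"]
        for (n, cpi, r, w) in segments
    )

def select_method(t_sequential, segments, methods):
    # Predict Speedup(K) = T_sequential / T_exec(K) for every method K
    # and pick the fastest; fall back to sequential execution if no
    # speculative method beats it.
    best, best_time = "sequential", t_sequential
    for name, cost in methods.items():
        t = t_exec(segments, cost)
        if t < best_time:
            best, best_time = name, t
    return best
```

Comparing Texec(K) directly against Tsequential is equivalent to selecting the method with Speedup(K) > 1 and the highest speedup gain.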
- In some implementations, the
process 200 can be included in a module that can be integrated with a compiler. In some implementations, a combination of one or more steps of process 200 can be included in the compiler. In some implementations, an apparatus (e.g., multiprocessor computing system 100, a multicore computing system, or a multi-threaded computing system) can perform speculative execution (e.g., according to some or all of the processes described with reference to FIGS. 3-13). In some implementations, a system that includes the apparatus and the compiler can perform speculative execution (e.g., according to some or all of the methods described with reference to FIGS. 3-13). - Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer program products, e.g., one or more modules of computer program instructions encoded on a tangible program carrier for execution by, or to control the operation of, data processing apparatus. The tangible program carrier can be a propagated signal or a computer-readable medium. The propagated signal is an artificially generated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a computer. The computer-readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination of one or more of them.
- The term “data processing apparatus” encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
- A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
- The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
- Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, to name just a few.
- Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
- To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
- While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
- Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
- Particular embodiments of the subject matter described in this specification have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.
Claims (17)
1. A method comprising:
translating a program to execute on a plurality of processors, processor cores, or threads including:
analyzing data dependence of program segments of the program;
collecting statistics that include at least one of:
a number of instructions executed in a program segment,
an average number of cycles used by an instruction, and
a number of speculative reads and writes in a program segment; and
predicting a speedup gain from speculative execution including:
determining an execution time of a sequential execution of the program, and
deriving the speedup gain by applying an analytical model.
2. The method of claim 1 , where the analytical model includes execution times of a plurality of speculative methods, and the execution times of the plurality of speculative methods equal the execution times of a slowest program segment.
3. The method of claim 2 , where translating the program to execute on a plurality of processors, processor cores, or threads further comprises:
selecting a selected speculative method from the plurality of speculative methods using the speedup gain.
4. The method of claim 3 , further comprising:
executing the program using the selected speculative method.
5. The method of claim 2 , where the execution times are represented by the equation Texec(K)=Max[Tsegment-start(K)+Ni·CPIi+Ri·Rcost(K)+Wi·Wcost(K)+Proll-back·Roll-backcost(K)+(1−Proll-back)·Ccost(K)], where:
K identifies one of a plurality of speculative methods;
Max is a maximum function;
Tsegment-start models a start-up cost of initiating a speculative thread;
Rcost models a cost of a speculative read instruction;
Wcost models a cost of a speculative write instruction;
Proll-back is a probability of a miss-speculation;
Roll-backcost models a cost of a roll-back;
Ccost models a cost of a commit;
Ni is a number of instructions executed in program segment i;
CPIi is an average number of cycles used by an instruction;
Ri is a number of speculative read instructions in program segment i; and
Wi is a number of speculative write instructions in program segment i.
6. The method of claim 5 , where the speedup gain is represented by the equation Speedup(K)=Tsequential/Texec(K), and Tsequential is the execution time of the sequential execution of the program.
7. The method of claim 3 , where the plurality of speculative methods comprises speculative methods that include:
a combination of parts of one or more speculative methods of the plurality of speculative methods.
8. The method of claim 3 , where the plurality of speculative methods comprises speculative methods that include:
a combination of one or more speculative methods of the plurality of speculative methods executed sequentially.
9. The method of claim 3 , where the plurality of speculative methods comprises speculative methods that include:
a combination of one or more speculative methods of the plurality of speculative methods executed in parallel.
10. A computer program product, encoded on a computer-readable medium, operable to cause a data processing apparatus to:
translate a program to execute on a plurality of processors, processor cores, or threads including:
analyzing data dependence of program segments of the program;
collecting statistics that include at least one of:
a number of instructions executed in a program segment,
an average number of cycles used by an instruction, and
a number of speculative reads and writes in a program segment; and
predicting a speedup gain from speculative execution including:
determining an execution time of a sequential execution of the program, and
deriving the speedup gain by applying an analytical model.
11. The computer program product of claim 10 , where the analytical model includes execution times of a plurality of speculative methods, and the execution times of the plurality of speculative methods equal the execution times of a slowest program segment.
12. The computer program product of claim 11 , where translating the program to execute on a plurality of processors, processor cores, or threads further comprises:
selecting a selected speculative method from the plurality of speculative methods using the speedup gain.
13. The computer program product of claim 12 , further comprising:
executing the program using the selected speculative method.
14. A system comprising:
one or more processors or processor cores;
a computer-readable medium coupled to the one or more processors or processor cores and having instructions contained thereon, which, when executed by the one or more processors or processor cores, causes the one or more processors or processor cores to perform the operations of:
translating a program to execute on a plurality of processors, processor cores, or threads including:
analyzing data dependence of program segments of the program;
collecting statistics that include at least one of:
a number of instructions executed in a program segment,
an average number of cycles used by an instruction, and
a number of speculative reads and writes in a program segment; and
predicting a speedup gain from speculative execution including:
determining an execution time of a sequential execution of the program, and
deriving the speedup gain by applying an analytical model.
15. The system of claim 14 , where the analytical model includes execution times of a plurality of speculative methods, and the execution times of the plurality of speculative methods equal the execution times of a slowest program segment.
16. The system of claim 15 , where translating the program to execute on a plurality of processors, processor cores, or threads further comprises:
selecting a selected speculative method from the plurality of speculative methods using the speedup gain.
17. The system of claim 16 , further comprising:
executing the program using the selected speculative method.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/022,814 US20080184011A1 (en) | 2007-01-30 | 2008-01-30 | Speculative Throughput Computing |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US89796907P | 2007-01-30 | 2007-01-30 | |
US12/022,814 US20080184011A1 (en) | 2007-01-30 | 2008-01-30 | Speculative Throughput Computing |
Publications (1)
Publication Number | Publication Date |
---|---|
US20080184011A1 true US20080184011A1 (en) | 2008-07-31 |
Family
ID=39301097
Family Applications (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/022,713 Expired - Fee Related US7681015B2 (en) | 2007-01-30 | 2008-01-30 | Generating and comparing memory access ranges for speculative throughput computing |
US12/022,814 Abandoned US20080184011A1 (en) | 2007-01-30 | 2008-01-30 | Speculative Throughput Computing |
US12/022,767 Abandoned US20080184012A1 (en) | 2007-01-30 | 2008-01-30 | Speculative Throughput Computing |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/022,713 Expired - Fee Related US7681015B2 (en) | 2007-01-30 | 2008-01-30 | Generating and comparing memory access ranges for speculative throughput computing |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/022,767 Abandoned US20080184012A1 (en) | 2007-01-30 | 2008-01-30 | Speculative Throughput Computing |
Country Status (4)
Country | Link |
---|---|
US (3) | US7681015B2 (en) |
EP (1) | EP2115583A2 (en) |
CN (1) | CN101611380A (en) |
WO (1) | WO2008092883A2 (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080184018A1 (en) * | 2007-01-30 | 2008-07-31 | Nema Labs Ab | Speculative Throughput Computing |
US20100106949A1 (en) * | 2008-10-24 | 2010-04-29 | International Business Machines Corporation | Source code processing method, system and program |
US20110145318A1 (en) * | 2009-12-15 | 2011-06-16 | International Business Machines Corporation | Interactive analytics processing |
US20110145366A1 (en) * | 2009-12-15 | 2011-06-16 | International Business Machines Corporation | Concurrent execution of request processing and analytics of requests |
US20110145429A1 (en) * | 2009-12-15 | 2011-06-16 | International Business Machines Corporation | Multi-granular stream processing |
CN102214141A (en) * | 2011-07-02 | 2011-10-12 | 中国矿业大学 | Real-time stack-based program slicing method |
EP2738681A1 (en) * | 2011-07-27 | 2014-06-04 | Fujitsu Limited | Electronic device, device access method, and program |
WO2016105786A1 (en) * | 2014-12-24 | 2016-06-30 | Intel Corporation | Systems, apparatuses, and methods for data speculation execution |
WO2016105800A1 (en) * | 2014-12-24 | 2016-06-30 | Intel Corporation | Systems, apparatuses, and methods for data speculation execution |
WO2016105802A1 (en) * | 2014-12-24 | 2016-06-30 | Intel Corporation | Systems, apparatuses, and methods for data speculation execution |
CN107003850A (en) * | 2014-12-24 | 2017-08-01 | 英特尔公司 | The systems, devices and methods performed for data-speculative |
US10061589B2 (en) | 2014-12-24 | 2018-08-28 | Intel Corporation | Systems, apparatuses, and methods for data speculation execution |
US10387156B2 (en) | 2014-12-24 | 2019-08-20 | Intel Corporation | Systems, apparatuses, and methods for data speculation execution |
US10387158B2 (en) | 2014-12-24 | 2019-08-20 | Intel Corporation | Systems, apparatuses, and methods for data speculation execution |
US11974157B2 (en) | 2020-10-29 | 2024-04-30 | Honda Motor Co., Ltd. | Information processing device, moving object, computer-readable storage medium, and information processing method |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE102008045767A1 (en) * | 2008-09-04 | 2010-03-11 | Continental Teves Ag & Co. Ohg | Microprocessor with pipeline bubble detector |
US8930920B2 (en) * | 2012-12-31 | 2015-01-06 | Oracle International Corporation | Self-optimizing interpreter and snapshot compilation |
US9086873B2 (en) * | 2013-03-15 | 2015-07-21 | Intel Corporation | Methods and apparatus to compile instructions for a vector of instruction pointers processor architecture |
US9658963B2 (en) * | 2014-12-23 | 2017-05-23 | Intel Corporation | Speculative reads in buffered memory |
CN109213575B (en) * | 2017-06-30 | 2024-04-05 | 北京忆恒创源科技股份有限公司 | Method for running program by single processor |
CN107729990B (en) * | 2017-07-20 | 2021-06-08 | 上海寒武纪信息科技有限公司 | Apparatus and method for performing forward operations in support of discrete data representations |
US11403394B2 (en) * | 2019-09-17 | 2022-08-02 | International Business Machines Corporation | Preventing selective events of a computing environment |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA2433379A1 (en) * | 2003-06-25 | 2004-12-25 | Ibm Canada Limited - Ibm Canada Limitee | Modulo scheduling of multiple instruction chains |
- 2008
- 2008-01-30 US US12/022,713 patent/US7681015B2/en not_active Expired - Fee Related
- 2008-01-30 US US12/022,814 patent/US20080184011A1/en not_active Abandoned
- 2008-01-30 WO PCT/EP2008/051098 patent/WO2008092883A2/en active Application Filing
- 2008-01-30 US US12/022,767 patent/US20080184012A1/en not_active Abandoned
- 2008-01-30 CN CNA2008800031604A patent/CN101611380A/en active Pending
- 2008-01-30 EP EP08708413A patent/EP2115583A2/en not_active Withdrawn
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5864692A (en) * | 1996-12-16 | 1999-01-26 | Hewlett-Packard Company | Method and apparatus for protecting memory-mapped devices from side effects of speculative instructions |
US5918005A (en) * | 1997-03-25 | 1999-06-29 | International Business Machines Corporation | Apparatus region-based detection of interference among reordered memory operations in a processor |
US5931957A (en) * | 1997-03-31 | 1999-08-03 | International Business Machines Corporation | Support for out-of-order execution of loads and stores in a processor |
US6345351B1 (en) * | 1999-11-12 | 2002-02-05 | Telefonaktiebolaget Lm Ericsson (Publ) | Maintenance of speculative state of parallel executed jobs in an information processing system |
US6889315B2 (en) * | 1999-12-17 | 2005-05-03 | Fujitsu Limited | Processor and method of controlling the same |
US6484254B1 (en) * | 1999-12-30 | 2002-11-19 | Intel Corporation | Method, apparatus, and system for maintaining processor ordering by checking load addresses of unretired load instructions against snooping store addresses |
US20040154012A1 (en) * | 2003-01-31 | 2004-08-05 | Hong Wang | Safe store for speculative helper threads |
US20060294326A1 (en) * | 2005-06-23 | 2006-12-28 | Jacobson Quinn A | Primitives to enhance thread-level speculation |
US20080184012A1 (en) * | 2007-01-30 | 2008-07-31 | Nema Labs Ab | Speculative Throughput Computing |
US20080184018A1 (en) * | 2007-01-30 | 2008-07-31 | Nema Labs Ab | Speculative Throughput Computing |
Cited By (35)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080184012A1 (en) * | 2007-01-30 | 2008-07-31 | Nema Labs Ab | Speculative Throughput Computing |
US7681015B2 (en) | 2007-01-30 | 2010-03-16 | Nema Labs Ab | Generating and comparing memory access ranges for speculative throughput computing |
US20080184018A1 (en) * | 2007-01-30 | 2008-07-31 | Nema Labs Ab | Speculative Throughput Computing |
EP2352087A1 (en) * | 2008-10-24 | 2011-08-03 | International Business Machines Corporation | Source code processing method, system, and program |
US20100106949A1 (en) * | 2008-10-24 | 2010-04-29 | International Business Machines Corporation | Source code processing method, system and program |
US8595712B2 (en) | 2008-10-24 | 2013-11-26 | International Business Machines Corporation | Source code processing method, system and program |
US8407679B2 (en) | 2008-10-24 | 2013-03-26 | International Business Machines Corporation | Source code processing method, system and program |
EP2352087A4 (en) * | 2008-10-24 | 2012-08-08 | Ibm | Source code processing method, system, and program |
US8819183B2 (en) | 2009-12-15 | 2014-08-26 | International Business Machines Corporation | Concurrent execution of request processing and analytics of requests |
WO2011072979A1 (en) * | 2009-12-15 | 2011-06-23 | International Business Machines Corporation | Concurrent execution of request processing and analytics of requests |
GB2488941A (en) * | 2009-12-15 | 2012-09-12 | Ibm | Concurrent execution of request processing and analytics of requests |
US20110145429A1 (en) * | 2009-12-15 | 2011-06-16 | International Business Machines Corporation | Multi-granular stream processing |
US20110145366A1 (en) * | 2009-12-15 | 2011-06-16 | International Business Machines Corporation | Concurrent execution of request processing and analytics of requests |
US20110145318A1 (en) * | 2009-12-15 | 2011-06-16 | International Business Machines Corporation | Interactive analytics processing |
US8874638B2 (en) | 2009-12-15 | 2014-10-28 | International Business Machines Corporation | Interactive analytics processing |
US8892762B2 (en) | 2009-12-15 | 2014-11-18 | International Business Machines Corporation | Multi-granular stream processing |
CN102214141A (en) * | 2011-07-02 | 2011-10-12 | 中国矿业大学 | Real-time stack-based program slicing method |
EP2738681A1 (en) * | 2011-07-27 | 2014-06-04 | Fujitsu Limited | Electronic device, device access method, and program |
EP2738681A4 (en) * | 2011-07-27 | 2014-07-30 | Fujitsu Ltd | Electronic device, device access method, and program |
WO2016105802A1 (en) * | 2014-12-24 | 2016-06-30 | Intel Corporation | Systems, apparatuses, and methods for data speculation execution |
US10061583B2 (en) | 2014-12-24 | 2018-08-28 | Intel Corporation | Systems, apparatuses, and methods for data speculation execution |
WO2016105786A1 (en) * | 2014-12-24 | 2016-06-30 | Intel Corporation | Systems, apparatuses, and methods for data speculation execution |
CN107003850A (en) * | 2014-12-24 | 2017-08-01 | Intel Corporation | Systems, apparatuses, and methods for data speculation execution |
KR20170098803A (en) * | 2014-12-24 | 2017-08-30 | 인텔 코포레이션 | Systems, apparatuses, and methods for data speculation execution |
US9785442B2 (en) | 2014-12-24 | 2017-10-10 | Intel Corporation | Systems, apparatuses, and methods for data speculation execution |
JP2017539008A (en) * | 2014-12-24 | 2017-12-28 | インテル・コーポレーション | System, apparatus and method for executing data speculation |
WO2016105800A1 (en) * | 2014-12-24 | 2016-06-30 | Intel Corporation | Systems, apparatuses, and methods for data speculation execution |
US10061589B2 (en) | 2014-12-24 | 2018-08-28 | Intel Corporation | Systems, apparatuses, and methods for data speculation execution |
US20190121644A1 (en) * | 2014-12-24 | 2019-04-25 | Intel Corporation | Systems, apparatuses, and methods for data speculation execution |
US10303525B2 (en) | 2014-12-24 | 2019-05-28 | Intel Corporation | Systems, apparatuses, and methods for data speculation execution |
US10387156B2 (en) | 2014-12-24 | 2019-08-20 | Intel Corporation | Systems, apparatuses, and methods for data speculation execution |
US10387158B2 (en) | 2014-12-24 | 2019-08-20 | Intel Corporation | Systems, apparatuses, and methods for data speculation execution |
US10942744B2 (en) | 2014-12-24 | 2021-03-09 | Intel Corporation | Systems, apparatuses, and methods for data speculation execution |
KR102453594B1 (en) * | 2014-12-24 | 2022-10-12 | 인텔 코포레이션 | Systems, apparatuses, and methods for data speculation execution |
US11974157B2 (en) | 2020-10-29 | 2024-04-30 | Honda Motor Co., Ltd. | Information processing device, moving object, computer-readable storage medium, and information processing method |
Also Published As
Publication number | Publication date |
---|---|
WO2008092883A3 (en) | 2008-10-02 |
EP2115583A2 (en) | 2009-11-11 |
US20080184018A1 (en) | 2008-07-31 |
WO2008092883A2 (en) | 2008-08-07 |
CN101611380A (en) | 2009-12-23 |
US7681015B2 (en) | 2010-03-16 |
US20080184012A1 (en) | 2008-07-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7681015B2 (en) | Generating and comparing memory access ranges for speculative throughput computing | |
US9798528B2 (en) | Software solution for cooperative memory-side and processor-side data prefetching | |
US10394714B2 (en) | System and method for false sharing prediction | |
US8359587B2 (en) | Runtime profitability control for speculative automatic parallelization | |
US7730470B2 (en) | Binary code instrumentation to reduce effective memory latency | |
US8677337B2 (en) | Static profitability control for speculative automatic parallelization | |
US20110154289A1 (en) | Optimization of an application program | |
Liu et al. | Pinpointing data locality bottlenecks with low overhead | |
Lepak et al. | Silent stores and store value locality | |
US20170242772A1 (en) | System and Method for Detecting False Sharing | |
Huber et al. | Worst‐case execution time analysis‐driven object cache design | |
Chen et al. | Data access history cache and associated data prefetching mechanisms | |
Guo et al. | Mira: A program-behavior-guided far memory system | |
US8612952B2 (en) | Performance optimization based on data accesses during critical sections | |
Nikoleris et al. | Directed statistical warming through time traveling | |
Sato et al. | An accurate simulator of cache-line conflicts to exploit the underlying cache performance | |
Huber et al. | WCET driven design space exploration of an object cache | |
Pinto et al. | A highly efficient, thread-safe software cache implementation for tightly-coupled multicore clusters | |
Chen et al. | A lightweight hybrid hardware/software approach for object-relative memory profiling | |
Hu et al. | CBANA: A Lightweight, Efficient, and Flexible Cache Behavior Analysis Framework | |
Lakhdar et al. | Toward Modeling Cache-Miss Ratio for Dense-Data-Access-Based Optimization | |
Schönauer | Impact of garbage collection policies on load stalls on AArch64 in OpenJ9 | |
Roelandts et al. | Scalar Vector Runahead | |
CHOUDHARY | Hardware Data Prefetchers for Shakti I-Class Processor | |
Thulasidasan et al. | A Scalable Analytical Memory Model for CPU Performance Prediction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: NEMA LABS AB, SWEDEN

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:BUSCK, ALEXANDER;ENGBOM, MIKAEL;STENSTROM, PER;AND OTHERS;REEL/FRAME:020484/0846

Effective date: 20080130
STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |