
WO2007133101A1 - Floating point addition for different floating point formats - Google Patents

Floating point addition for different floating point formats Download PDF

Info

Publication number
WO2007133101A1
Authority
WO
WIPO (PCT)
Prior art keywords
logic
operand
processor
operands
floating point
Prior art date
Application number
PCT/RU2006/000236
Other languages
French (fr)
Inventor
Alexey Yurievich Sivtsov
Valery Yakovlevich Gorshtein
Original Assignee
Intel Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intel Corporation filed Critical Intel Corporation
Priority to DE112006003875T priority Critical patent/DE112006003875T5/en
Priority to CN200680054583.XA priority patent/CN101438232B/en
Priority to US10/589,448 priority patent/US20080133895A1/en
Priority to PCT/RU2006/000236 priority patent/WO2007133101A1/en
Publication of WO2007133101A1 publication Critical patent/WO2007133101A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00: Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/483: Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F2207/00: Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F2207/38: Indexing scheme relating to groups G06F7/38 - G06F7/575
    • G06F2207/3804: Details
    • G06F2207/3808: Details concerning the type of numbers or the way they are handled
    • G06F2207/3812: Devices capable of handling different types of numbers
    • G06F2207/382: Reconfigurable for different fixed word lengths
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00: Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/483: Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
    • G06F7/485: Adding; Subtracting

Definitions

  • the present disclosure generally relates to the field of electronics. More particularly, an embodiment of the invention relates to techniques to perform floating point addition within a computer system.
  • Floating point representations of numbers may be used to provide efficiency when performing arithmetic operations on real numbers.
  • differing floating point representation formats may be utilized.
  • real numbers may be represented as a single precision floating point number, a double precision floating point number, or a double-extended precision floating point number.
  • processors or computer systems may include more than one floating point adder to operate on numbers having different floating point formats. Having different floating point adders for different floating point formats may cause more die area on a processor to be consumed, as well as additional power.
  • FIGs. 1, 8 and 9 illustrate block diagrams of embodiments of computing systems, which may be utilized to implement various embodiments discussed herein.
  • Fig. 2 illustrates a block diagram of portions of a processor core, according to an embodiment of the invention.
  • Figs. 3a-b and 4 illustrate block diagrams of portions of a floating point adder, according to various embodiments of the invention.
  • Figs. 5 and 6 illustrate operand formats in accordance with various embodiments of the invention.
  • Fig. 7 illustrates a flow diagram of an embodiment of a method in accordance with an embodiment of the invention.
  • Fig. 1 illustrates a block diagram of a computing system 100, according to an embodiment of the invention.
  • the system 100 may include one or more processors 102-1 through 102-N (generally referred to herein as "processors 102" or "processor 102").
  • the processors 102 may communicate via an interconnection or bus 104.
  • Each processor may include various components some of which are only discussed with reference to processor 102-1 for clarity. Accordingly, each of the remaining processors 102-2 through 102-N may include the same or similar components discussed with reference to the processor 102-1.
  • the processor 102-1 may include one or more processor cores 106-1 through 106-M (referred to herein as "cores 106," or more generally as "core 106"), a cache 108 (which may be a shared cache or a private cache in various embodiments), and/or a router 110.
  • the processor cores 106 may be implemented on a single integrated circuit (IC) chip.
  • the chip may include one or more shared and/or private caches (such as cache 108), buses or interconnections (such as a bus or interconnection 112), memory controllers (such as those discussed with reference to Figs. 8 and 9), or other components.
  • the router 110 may be used to communicate between various components of the processor 102-1 and/or system 100.
  • the processor 102-1 may include more than one router 110.
  • the multitude of routers (110) may be in communication to enable data routing between various components inside or outside of the processor 102-1.
  • the cache 108 may store data (e.g., including instructions) that are utilized by one or more components of the processor 102-1, such as the cores 106.
  • the cache 108 may locally cache data stored in a memory 114 for faster access by the components of the processor 102.
  • the memory 114 may be in communication with the processors 102 via the interconnection 104.
  • the cache 108 (that may be shared) may include a mid-level cache and/or a last level cache (LLC).
  • each of the cores 106 may include a level 1 (L1) cache (116-1) (generally referred to herein as "L1 cache 116").
  • Fig. 2 illustrates a block diagram of portions of a processor core 106, according to an embodiment of the invention.
  • the arrows shown in Fig. 2 illustrate the flow direction of instructions through the core 106.
  • One or more processor cores may be implemented on a single integrated circuit chip (or die) such as discussed with reference to Fig. 1.
  • the chip may include one or more shared and/or private caches (e.g., cache 108 of Fig. 1), interconnections (e.g., interconnections 104 and/or 112 of Fig. 1), memory controllers, or other components.
  • the processor core 106 may include a fetch unit 202 to fetch instructions for execution by the core 106.
  • the instructions may be fetched from any storage devices such as the memory 114 and/or the memory devices discussed with reference to Figs. 8 and 9.
  • the core 106 may also include a decode unit 204 to decode the fetched instruction.
  • the decode unit 204 may decode the fetched instruction into a plurality of uops (micro-operations).
  • the core 106 may include a schedule unit 206.
  • the schedule unit 206 may perform various operations associated with storing decoded instructions (e.g., received from the decode unit 204) until the instructions are ready for dispatch, e.g., until all source values of a decoded instruction become available.
  • the schedule unit 206 may schedule and/or issue (or dispatch) decoded instructions to an execution unit 208 for execution.
  • the execution unit 208 may execute the dispatched instructions after they are decoded (e.g., by the decode unit 204) and dispatched (e.g., by the schedule unit 206).
  • the execution unit 208 may include more than one execution unit, such as a memory execution unit, an integer execution unit, a floating-point execution unit, or other execution units.
  • the execution unit 208 may also perform various arithmetic operations such as addition, subtraction, multiplication, and/or division, and may include one or more arithmetic logic units (ALUs).
  • a co-processor (not shown) may perform various arithmetic operations in conjunction with the execution unit 208.
  • the execution unit 208 may include a floating point (FP) adder 209 to perform addition, subtraction, comparison, and/or format conversion of floating point numbers that may be represented in varying floating point representation formats.
  • the floating point numbers being added and/or subtracted may have any format, e.g., including a single precision, a double precision, and/or a double-extended precision floating point number format (such as those discussed with reference to Figs. 5 and 6).
  • the adder 209 may operate in response to a single instruction, multiple data (SIMD) instruction.
  • an SIMD instruction may cause identical operations to be performed on multiple pieces of data at the same time, e.g., in parallel.
  • the SIMD instruction may correspond to streaming SIMD extensions (SSE) or other forms of streaming SIMD implementations (such as streaming SIMD extensions 2 (SSE2)). Further details regarding various embodiments of the adder 209 will be further discussed herein, e.g., with reference to Figs. 3-7.
  • the execution unit 208 may include more than one floating point adder 209. Further, the execution unit 208 may execute instructions out-of-order.
  • the processor core 106 may be an out-of-order processor core in one embodiment.
  • the core 106 may also include a retirement unit 210.
  • the retirement unit 210 may retire executed instructions after they are committed. In an embodiment, retirement of the executed instructions may result in processor state being committed from the execution of the instructions, physical registers used by the instructions being de-allocated, etc.
  • the core 106 may additionally include a trace cache or microcode read-only memory (uROM) 212 to store microcode and/or traces of instructions that have been fetched (e.g., by the fetch unit 202).
  • the microcode stored in the uROM 212 may be used to configure various hardware components of the core 106.
  • the microcode stored in the uROM 212 may be loaded from another component in communication with the processor core 106, such as a computer-readable medium or other storage device discussed with reference to Figs. 8 and 9.
  • the core 106 may also include a bus unit 220 to enable communication between components of the processor core 106 and other components (such as the components discussed with reference to Fig. 1) via one or more buses (e.g., buses 104 and/or 112).
  • the core 106 may also include one or more registers 222 to store data accessed by various components of the core 106.
  • Fig. 3a-b illustrates a block diagram of portions of a floating point adder (209), according to an embodiment of the invention.
  • the floating point adder 209 of Fig. 3a-b may be the same or similar to the floating point adder 209 discussed with reference to Fig. 2.
  • the widths of various signal paths of the adder 209 are shown in Fig. 3a-b in accordance with an embodiment of the invention.
  • the adder 209 may include an exponent path 302 and a mantissa path 304 to perform various operations corresponding to addition (or subtraction) of two operands 306 and 308.
  • the adder 209 may include various portions including an alignment portion 305.
  • the alignment portion 305 may include an operand formatting logic 310, e.g., to modify one or more of the operands 306 and 308 from a first format (such as those shown in Fig. 5) into a second format (such as those shown in Fig. 6).
  • the exponent path 302 may receive an opcode 312 that corresponds to an arithmetic operation (such as an addition or subtraction).
  • a logic 314 may determine (e.g., look up from a table or storage unit that includes predefined data) an exponent corresponding to opcode 312 for a conversion instruction.
  • a conversion instruction may operate on a single operand (e.g., operand 308), while another operand (e.g., operand 306) is supplied with a zero value.
  • a predefined exponent from 314 may be used to calculate a resultant exponent and align data if it is needed, as will be discussed further with reference to Fig. 3a-b.
  • a multiplexer 316 may receive and select one of the exponents from the logic 314 and an exponent corresponding to one of the operands (e.g., operand 306). In an embodiment, the multiplexer 316 may select one of its inputs based on the opcode 312.
  • An exponent difference logic 318 may receive and compare the selected exponent from the multiplexer 316 and an exponent corresponding to the operand 308.
  • the logic 318 may generate one or more signals based on the result of the comparison (which may be one or more subtraction operations in an embodiment) and provide the generated signals (such as subtraction results and carry outs) to various components of the adder 209, for example, to provide mantissa alignment, such as discussed and illustrated with reference to Fig. 3a-b.
  • the mantissa path 304 may include logics 320 and 322 to receive the formatted operands from the logic 310 and swap (or extract a portion of) the mantissas, e.g., based on carry-out signals generated from the exponent difference computation by the logic 318. Alignment of the mantissas corresponding to the operand with smaller exponent is performed using rotators (e.g., logics 324 and 326) and mask generators (e.g., mask generators 336 and 338).
  • one or more of the signals generated by the logic 318 may be used for determining shift code alignment, e.g., to enable the mantissas corresponding to the operand with smaller exponent to be cycle shifted right by rotators 324 and 326.
  • the shift code signals provided by the logic 318 to logics 324 and 326 may be five bits wide.
  • the shift code (and/or carryout) signals provided to logics 324 and 326 may be the same in an embodiment.
  • the logics 320 and 322 may provide the mantissas of operands 306 and 308 with larger exponents to inverters 328 and 330 and multiplexers 332 and 334.
  • mask generators 336 and 338 may generate masks based on shift code signals from the logic 318 to enable the shifting of one or more bits of the outputs of logics 324 and 326.
  • the outputs of logics 324 and 326 may be shifted left by one bit by logics 340 and 342, respectively.
  • an operand analyzer 344 may analyze the operands 306 and 308, and generate one or more signals to enable shifting in logics 340 and 342 if one of the operands is denormal.
  • Logic 346 logically combines (e.g., by using an AND operation) the outputs of the mask generator 336 and logic 340.
  • logic 348 logically combines (e.g., by using an AND operation) the outputs of the mask generator 338 and logic 342.
  • a multiplexing logic 350 receives the outputs of the logics 346 and 348 and provides a signal to one of the inputs of each of the adders 352 and 354 which are in an addition portion 355 of the adder 209. Additional details regarding an embodiment of the logics 346, 348, and 350 are further discussed with reference to Fig. 4.
  • the adders 352 and 354 may also receive an input signal from the multiplexers 332 and 334.
  • the multiplexers 332 and 334 may select one of their inputs based on signals 356 (Compl_Hi) and 358 (Compl_Lo) which may be generated based on opcode (312) and signs of operands (e.g., operands 306 and 308) to provide a (true) subtraction (e.g., subtraction of operands with the same signs or the addition of operands with different signs) or a (true) addition operation (e.g., subtraction of operands with different signs or the addition of operands with the same signs) by an opcode decoder logic (not shown).
  • the adders 352 and 354 may receive aligned and non-aligned mantissas, e.g., through multiplexers 332 and 334 which may in turn provide the non-inverted or inverted (for example by inverters 328 and 330) mantissas selected by logics 320 and 322, respectively.
  • the adders 352 and 354 also receive carry in signals.
  • adder 354 receives the signal 358 as its carry in signal, e.g., to provide full two's complementing for true subtraction cases.
  • the adder 352 receives as its carry in signal a carry out signal 360 from the adder 354 or the signal 356 that may be provided through a multiplexer 362 based on the precision format of the opcode 312.
  • the outputs of the adders 352 and 354 are provided to inverters 364 and 366, and multiplexers 368 and 370.
  • the multiplexers 368 and 370 may select one of their input signals as output based on a selection signal generated by the adder 352 and a multiplexer 371, respectively.
  • since the mantissa of the operand with the larger or equal exponent may be two's complemented for the true subtraction cases, the result of the addition may be negative and can be two's complemented.
  • the two's complementing may be performed by inversion of results of adders 352 and 354 and adding of a binary one ("1") using a rounder hardware (e.g., logic 397).
  • the exponent path 302 of the addition portion 355 may also include a logic 372 to generate a limiter shift value for normalization, e.g., because the adder 209 may support gradual underflow.
  • the outputs of multiplexers 368 and 370 and the logic 372 are provided to a normalization portion 373 (of the adder 209) including the leading zero detection (LZD) logics 374 and 376.
  • the logics 374 and 376 may determine shift codes for normalization, e.g., by detecting the leading zeros in the results of the addition that are provided by the adders 352 and 354 through the multiplexers 368 and 370.
  • the output signals from the logics 374 and 376 may be provided to logics 378 and 380, together with the output signals from the multiplexers 368 and 370.
  • the logics 378 and 380 may perform cycle shifts left based on the outputs of logics 374 and 376 to provide normalization on the addition results.
  • the outputs of the logics 374 and 376 may be provided to an exponent adjustment logic 382 and mask generators 384 and 386.
  • the mask generators 384 and 386 may generate masks based on shift code signals from the logics 374 and 376 to enable normalization of the outputs of logics 378 and 380 by logics 388 and 390, respectively.
  • the logics 388 and 390 may logically combine their inputs (e.g., by utilizing a logic AND operation) such as discussed with reference to logics 346 and 348.
  • the output signals from the logics 388 and 390 may be selected by multiplexing logic 392 (e.g., such as discussed with reference to logic 350 in an embodiment) to provide an output to the rounding portion 393 of the adder 209.
  • the logic 392 may provide guard and/or round bits to the rounding portion 393.
  • logics 394 and 395 may compute sticky bits, e.g., by logically combining (for example through a logic OR operation) the shifted out bits provided by the logics 346 and 348 as will be further discussed with reference to Fig. 4.
  • the logic 396 may combine the outputs of the logics 394 and 395 to provide two sticky bits for two single-precision operands, and a single sticky bit for double-precision and double-extended precision operands.
  • the output signals from the logics 396 and 392 are provided to the rounder logic 397 to perform rounding of the addition (or subtraction) of the mantissas.
  • a logic 398 may receive the exponent from the logic 382 and modify (or fix) the exponent for round up situations, e.g., by adding a one if round up occurs. Moreover, the logic 382 may adjust the exponent (e.g., received from logic 318) by the shift code for normalization (e.g., provided by the logics 374 and 376). In an embodiment, the logic 382 may subtract the shift codes received from logics 374 and 376 from the larger exponent provided by the logic 318. Hence, in one embodiment, the larger exponent provided by the logic 318 may be corrected for normalization (e.g., by logic 382) and round up situations (e.g., by logic 398).
  • the mantissa path 304 may include two separate paths to process the most significant (MS) 32 bits and least significant (LS) 36 bits of operands 306 and 308.
  • a first MS 32-bit path (e.g., including logics 320, 324, 352, and/or 378) may operate on a first set of data (e.g., a pair of single precision floating point mantissas such as discussed with reference to the operand 602 of Fig. 6), while a second LS 36-bit path (e.g., including logics 322, 326, 354, and/or 380) may operate on a second set of data (which may be a different pair of single precision floating point mantissas).
  • a combination of the first and second paths may be used to operate on double precision or double extended precision operands (e.g., operands 630 and/or 650 of Fig. 6).
  • logics 350 and 392 may enable combination of signals between these two mantissa paths.
  • Fig. 4 illustrates a block diagram of further details of portions of the adder 209 of Fig. 3a-b, according to an embodiment of the invention.
  • signals 402-410 that are generated by the logic 318 may be provided to multiplexers 412-416.
  • the inputs to the multiplexers 412-416 may be selected by signals that are generated based on precision format of the opcode 312.
  • the outputs of the multiplexers 412, 414, and 416 are provided to the logics 320, logics 336 and 324, and logics 338 and 326, respectively.
  • signal 402 may correspond to a shift code for alignment of MS 32-bits for the single precision case;
  • signal 404 may correspond to a shift code for alignment for the double precision or double extended precision cases;
  • signal 406 may correspond to a shift code for alignment of LS 36-bits for the single precision case;
  • signal 408 may correspond to a carry-out signal from exponent difference of second pair of single precision data;
  • signal 410 may correspond to a carry-out signal from exponent difference of double precision or double extended precision data.
  • the logic 346 may include AND gates 424 and 426 to combine the outputs of the logic 340 and 336.
  • the logic 348 may include AND gates 428 and 430 to combine the outputs of the logic 338 and 342.
  • one of the inputs to the gates 426 and 430 may be inverted such as shown in Fig. 4.
  • an OR gate 434 may combine the outputs of the gates 426 and 428 (e.g., by logically OR-ing the outputs of the gates 426 and 428).
  • the logic 350 may include multiplexers 436-440.
  • the inputs of the multiplexers 436-440 may be selected by a signal 442 which is generated by a logic (e.g., based on precision format of the opcode 312 and on signals from 318 of Fig. 3a-b) to indicate how an aligned mantissa is combined with signals from a storage unit 441 (which may be a hardware register in an embodiment) and logics 424, 434, and 428.
  • the multiplexers 436-440 may receive a signal from the storage unit 441 (e.g., including all zero's) to fill the first 32 bits of the output of the logic 350 with zeros for the case when exponent difference is more than 32 bits or fill the first 64 bits of the output of the logic 350 with zeros for the case when exponent difference is more than 64 bits.
  • the logic 350 may provide the outputs of the multiplexers 436-440 in the 68 bit format 444 (which, in one embodiment, includes a most significant (MS) 32-bit portion 446, a middle 32-bit portion 448, and a least significant (LS) 32-bit portion 450) to the adders 352 and 354 such as illustrated in Fig. 4.
  • portions 446, 448, and 450 may be provided in accordance with one or more of the following: • If the opcode 312 corresponds to a single precision format and the exponent difference (318) of the second pair of single precision operands (306 and 308) is less than 24, then portion 446 may be supplied by logic 424 through logic 436. A similar situation may be applied to the first pair of operands (306, 308) also; namely, portion 448 may be supplied by logic 428 through logic 438. Moreover, in an embodiment (such as discussed with reference to Figs. 5 and 6), the first pair may correspond to x0 and y0, while the second pair may correspond to x1 and y1.
  • portion 446 may be supplied by logic
  • portion 448 may be supplied by logic 434.
  • portion 446 may be supplied by storage unit 441 and portion 448 may be supplied by logic 424.
  • portion 446 may be supplied from storage unit 441
  • portion 448 may be supplied by storage unit 441
  • portion 450 may be supplied by logic 424.
  • Fig. 5 illustrates sample operand formats 500 for operands 306 and 308 of Fig. 3a-b, in accordance with an embodiment of the invention.
  • Fig. 6 illustrates formatted floating point adder operand formats 600 corresponding to the formats 500 of Fig. 5, after the operands of Fig. 5 are formatted by the logic 310 of Fig. 3a-b. Width of each field of the operands shown in Figs. 5 and 6 is illustrated in accordance with some embodiments of the invention.
  • a single precision operand 502 (which may represent two single precision floating point numbers in an embodiment) may include sign fields 504 and 506, exponent fields 508 and 510, and mantissa fields 512 and 514.
  • a double precision operand 520 may include a sign field 522, an exponent field 524, and a mantissa field 526.
  • a double-extended precision operand 530 may include a sign field 532, an exponent field 534, a J field 536 (which may indicate whether the mantissa is normalized), and a mantissa field 538.
  • a J bit (536) may correspond to the integer part of a mantissa which may be hidden in single precision and double precision formats. Further, the J bit may be set to zero for denormals.
  • a single precision operand 602 may include a sign field 604 (which may correspond to the sign field 504), exponent fields 606 and 608 (which may correspond to fields 508 and 510 in an embodiment), a zero field 610 (which may correspond to the sign field 506), overflow fields 612 and 614 (e.g., to indicate an overflow condition in a path of the adder 209), J fields 616 and 618 (e.g., to indicate that the corresponding floating point number is normal), and mantissa fields 620 and 622 (which may correspond to fields 512 and 514 in an embodiment).
  • a double precision operand 630 may include a sign field 632 (which may correspond to field 522 in an embodiment), an exponent field 634 (which may correspond to field 524 in an embodiment), an overflow field 636 (e.g., to indicate an overflow condition), a J field 638 (e.g., to indicate that the corresponding floating point number is normal), and a mantissa field 640 (which may correspond to the field 526 in an embodiment).
  • a double-extended precision operand 650 may include a sign field 652 (which may correspond to the field 532 in an embodiment), an exponent field 654 (which may correspond to the field 534 in an embodiment), an overflow field 656 (e.g., to indicate an overflow condition), a J field 658 (e.g., to indicate that the corresponding floating point number is normal), and a mantissa field 660 (which may correspond to the field 538 in an embodiment).
  • other fields of the operands 602, 630, and 650 may be unused (e.g., have all zeros).
  • the logic 310 may format the operands 502, 520, and 530 into the operands 602, 630, and 650, respectively.
  • Fig. 7 illustrates a flow diagram of an embodiment of a method 700 to add and/or subtract floating point numbers, in accordance with an embodiment of the invention.
  • the floating point numbers being added and/or subtracted may be represented in varying floating point representation formats, for example, two single precision, double precision, and/or double-extended precision floating point numbers such as those discussed with reference to Figs. 5 and 6.
  • various components discussed with reference to Figs. 1-6 and 8-9 may be utilized to perform one or more of the operations discussed with reference to Fig. 7.
  • the method 700 may be used to add and/or subtract floating point numbers stored (and/or read) from a storage unit such as the cache 108, cache 116, memory 114, and/or registers 222.
  • the adder 209 may receive the opcode 312 and the operands 306-308.
  • the logic 310 may format the operands 306-308 such as discussed with reference to Fig. 3a-b.
  • the logic 318 may compare the exponents at an operation 706, such as discussed with reference to Fig. 3a-b.
  • the mantissas of the formatted operands may be aligned at an operation 708 by the alignment portion 305.
  • the aligned mantissas may be combined (e.g., added or subtracted) such as discussed with reference to the addition portion 355 of Fig. 3a-b.
  • Fig. 8 illustrates a block diagram of a computing system 800 in accordance with an embodiment of the invention.
  • the computing system 800 may include one or more central processing unit(s) (CPUs) 802 or processors that communicate via an interconnection network (or bus) 804.
  • the processors 802 may include a general purpose processor, a network processor (that processes data communicated over a computer network 803), or other types of a processor (including a reduced instruction set computer (RISC) processor or a complex instruction set computer (CISC)). Moreover, the processors 802 may have a single or multiple core design. The processors 802 with a multiple core design may integrate different types of processor cores on the same integrated circuit (IC) die. Also, the processors 802 with a multiple core design may be implemented as symmetrical or asymmetrical multiprocessors. In an embodiment, one or more of the processors 802 may be the same or similar to the processors 102 of Fig. 1.
  • one or more of the processors 802 may include one or more of the cores 106 (e.g., including the adder 209) and/or cache 108. Also, the operations discussed with reference to Figs. 1-7 may be performed by one or more components of the system 800.
  • a chipset 806 may also communicate with the interconnection network 804.
  • the chipset 806 may include a memory control hub (MCH) 808.
  • the MCH 808 may include a memory controller 810 that communicates with the memory 114.
  • the memory 114 may store data, including sequences of instructions that are executed by the CPU 802, or any other device included in the computing system 800.
  • the memory 114 may include one or more volatile storage (or memory) devices such as random access memory (RAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), static RAM (SRAM), or other types of storage devices.
  • Nonvolatile memory may also be utilized such as a hard disk. Additional devices may communicate via the interconnection network 804, such as multiple CPUs and/or multiple system memories.
  • the MCH 808 may also include a graphics interface 814 that communicates with a graphics accelerator 816.
  • the graphics interface 814 may communicate with the graphics accelerator 816 via an accelerated graphics port (AGP).
  • a display (such as a flat panel display) may communicate with the graphics interface 814 through, for example, a signal converter that translates a digital representation of an image stored in a storage device such as video memory or system memory into display signals that are interpreted and displayed by the display.
  • the display signals produced by the display device may pass through various control devices before being interpreted by and subsequently displayed on the display.
  • a hub interface 818 may allow the MCH 808 and an input/output control hub (ICH) 820 to communicate.
  • the ICH 820 may provide an interface to I/O devices that communicate with the computing system 800.
  • the ICH 820 may communicate with a bus 822 through a peripheral bridge (or controller) 824, such as a peripheral component interconnect (PCI) bridge, a universal serial bus (USB) controller, or other types of peripheral bridges or controllers.
  • the bridge 824 may provide a data path between the CPU 802 and peripheral devices. Other types of topologies may be utilized.
  • multiple buses may communicate with the ICH 820, e.g., through multiple bridges or controllers.
  • peripherals in communication with the ICH 820 may include, in various embodiments of the invention, integrated drive electronics (IDE) or small computer system interface (SCSI) hard drive(s), USB port(s), a keyboard, a mouse, parallel port(s), serial port(s), floppy disk drive(s), digital output support (e.g., digital video interface (DVI)), or other devices.
  • the bus 822 may communicate with an audio device 826, one or more disk drive(s) 828, and a network interface device 830 (which is in communication with the computer network 803). Other devices may communicate via the bus 822. Also, various components (such as the network interface device 830) may communicate with the MCH 808 in some embodiments of the invention. In addition, the processor 802 and the MCH 808 may be combined to form a single chip. Furthermore, the graphics accelerator 816 may be included within the MCH 808 in other embodiments of the invention.
  • nonvolatile memory may include one or more of the following: read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically EPROM (EEPROM), a disk drive (e.g., 828), a floppy disk, a compact disk ROM (CD-ROM), a digital versatile disk (DVD), flash memory, a magneto-optical disk, or other types of nonvolatile machine-readable media that are capable of storing electronic data (e.g., including instructions).
  • Fig. 9 illustrates a computing system 900 that is arranged in a point-to-point (PtP) configuration. In particular, Fig. 9 shows a system where processors, memory, and input/output devices are interconnected by a number of point-to-point interfaces. The operations discussed with reference to Figs. 1-8 may be performed by one or more components of the system 900.
  • the system 900 may include several processors, of which only two, processors 902 and 904 are shown for clarity.
  • the processors 902 and 904 may each include a local memory controller hub (MCH) 906 and 908 to enable communication with memories 910 and 912.
  • the memories 910 and/or 912 may store various data such as those discussed with reference to the memory 114 of Fig. 8.
  • the processors 902 and 904 may be one of the processors 802 discussed with reference to Fig. 8.
  • the processors 902 and 904 may exchange data via a point-to-point (PtP) interface 914 using PtP interface circuits 916 and 918, respectively.
  • the processors 902 and 904 may each exchange data with a chipset 920 via individual PtP interfaces 922 and 924 using point-to-point interface circuits 926, 928, 930, and 932.
  • the chipset 920 may further exchange data with a high-performance graphics circuit 934 via a high-performance graphics interface 936, e.g., using a PtP interface circuit 937.
  • At least one embodiment of the invention may be provided within the processors 902 and 904.
  • one or more of the cores 106 e.g., including the adder 209 and/or cache 108 of Fig. 1 may be located within the processors 902 and 904.
  • Other embodiments of the invention may exist in other circuits, logic units, or devices within the system 900 of Fig. 9.
  • other embodiments of the invention may be distributed throughout several circuits, logic units, or devices illustrated in Fig. 9.
  • the chipset 920 may communicate with a bus 940 using a PtP interface circuit 941.
  • the bus 940 may have one or more devices that communicate with it, such as a bus bridge 942 and I/O devices 943.
  • the bus bridge 942 may communicate with other devices such as a keyboard/mouse 945, communication devices 946 (such as modems, network interface devices, or other communication devices that may communicate with the computer network 803), an audio I/O device, and/or a data storage device 948.
  • the data storage device 948 may store code 949 that may be executed by the processors 902 and/or 904.
  • the operations discussed herein, e.g., with reference to Figs. 1-9 may be implemented as hardware (e.g., circuitry), software, firmware, microcode, or combinations thereof, which may be provided as a computer program product, e.g., including a machine-readable or computer-readable medium having stored thereon instructions (or software procedures) used to program a computer to perform a process discussed herein.
  • the term "logic" may include, by way of example, software, hardware, or combinations of software and hardware.
  • the machine-readable medium may include a storage device such as those discussed with respect to Figs. 1-9.
  • Such computer-readable media may be downloaded as a computer program product, wherein the program may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a bus, a modem, or a network connection).
  • Coupled may mean that two or more elements are in direct physical or electrical contact.
  • coupled may also mean that two or more elements may not be in direct contact with each other, but may still cooperate or interact with each other.

Landscapes

  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Computing Systems (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Nonlinear Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Advance Control (AREA)
  • Executing Machine-Instructions (AREA)

Abstract

Methods and apparatus to perform floating point addition are described. In one embodiment, a plurality of operands (306, 308) are formatted (310) into a common format (602, 630, 650) and combined (e.g., added or subtracted).

Description

FLOATING POINT ADDITION FOR DIFFERENT FLOATING POINT FORMATS
BACKGROUND
The present disclosure generally relates to the field of electronics. More particularly, an embodiment of the invention relates to techniques to perform floating point addition within a computer system.
Floating point representations of numbers may be used to provide efficiency when performing arithmetic operations on real numbers. Depending on precision requirements, differing floating point representation formats may be utilized. For example, real numbers may be represented as a single precision floating point number, a double precision floating point number, or a double-extended precision floating number.
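For readers who want a concrete picture of the three formats named above, the sketch below lists the commonly used IEEE 754 field widths and decodes the fields of one single-precision value. The exact widths, table contents, and C names are illustrative assumptions; this paragraph of the patent does not fix them.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Illustrative field widths (IEEE 754 single and double, and the 80-bit
 * double-extended format with its explicit integer/J bit); the patent text
 * above does not mandate these exact widths. */
static const struct { const char *name; int sign, exp, mantissa; } kFormats[] = {
    { "single precision",          1,  8, 23 },
    { "double precision",          1, 11, 52 },
    { "double-extended precision", 1, 15, 64 },   /* explicit J bit in the mantissa */
};

int main(void) {
    for (size_t i = 0; i < sizeof kFormats / sizeof kFormats[0]; ++i)
        printf("%-28s sign=%d exponent=%d mantissa=%d\n",
               kFormats[i].name, kFormats[i].sign, kFormats[i].exp, kFormats[i].mantissa);

    /* Decode the fields of one single-precision value to show the layout. */
    float f = 1.5f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);                 /* reinterpret the float as raw bits */
    unsigned sign     = bits >> 31;
    unsigned exponent = (bits >> 23) & 0xFFu;       /* biased by 127 */
    unsigned mantissa = bits & 0x7FFFFFu;           /* hidden leading 1 for normal numbers */
    printf("1.5f -> sign=%u exponent=%u (unbiased %d) mantissa=0x%06X\n",
           sign, exponent, (int)exponent - 127, mantissa);
    return 0;
}
```

The explicit integer bit of the double-extended format is consistent with the J-field discussion around Figs. 5 and 6 later in this document, where the integer part is hidden in the single and double precision formats.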
To increase computational efficiency, some processors or computer systems may include more than one floating point adder to operate on numbers having different floating point formats. Having different floating point adders for different floating point formats may cause more die area on a processor to be consumed, as well as additional power.
BRIEF DESCRIPTION OF THE DRAWINGS The detailed description is provided with reference to the accompanying figures.
In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items.
Figs. 1, 8 and 9 illustrate block diagrams of embodiments of computing systems, which may be utilized to implement various embodiments discussed herein.
Fig. 2 illustrates a block diagram of portions of a processor core, according to an embodiment of the invention. Figs. 3a-b and 4 illustrate block diagrams of portions of a floating point adder, according to various embodiments of the invention.
Figs. 5 and 6 illustrate operand formats in accordance with various embodiments of the invention.
Fig. 7 illustrates a flow diagram of an embodiment of a method in accordance with an embodiment of the invention.
DETAILED DESCRIPTION In the following description, numerous specific details are set forth in order to provide a thorough understanding of various embodiments. However, some embodiments may be practiced without the specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to obscure the particular embodiments. Various aspects of embodiments of the invention may be performed using various means, such as integrated semiconductor circuits ("hardware"), computer-readable instructions organized into one or more programs ("software") or some combination of hardware and software. For the purposes of this disclosure, reference to "logic" shall mean either hardware, software, or some combination thereof. Some of the embodiments discussed herein may provide efficient mechanisms for adding floating point numbers. In one embodiment, the same logic may be used for addition and/or subtraction. For example, addition of floating point numbers with opposite signs may correspond to a subtraction operation. Further, in an embodiment, the same floating point adder logic may be used for addition (and/or subtraction) of floating point numbers that are represented in varying floating point representation formats, for example, as a single precision, a double precision, and/or a double-extended precision floating point number. Additionally, such a floating point adder may be utilized in a processor core, such as the processor cores discussed with reference to Figs. 1-9. More particularly, Fig. 1 illustrates a block diagram of a computing system 100, according to an embodiment of the invention. The system 100 may include one or more processors 102-1 through 102-N (generally referred to herein as "processors 102" or "processor 102"). The processors 102 may communicate via an interconnection or bus 104. Each processor may include various components some of which are only discussed with reference to processor 102-1 for clarity. Accordingly, each of the remaining processors 102-2 through 102-N may include the same or similar components discussed with reference to the processor 102-1.
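As a software illustration of the point that one adder can serve both operations, the following sketch (hypothetical function name, not the patent's opcode decoder) shows how an add/subtract opcode combined with the operand signs reduces to either a magnitude addition or a magnitude subtraction, the distinction the description later refers to as a "true addition" versus a "true subtraction".

```c
#include <stdbool.h>
#include <stdio.h>

/* Minimal sketch under stated assumptions: an add of operands with opposite
 * signs, or a subtract of operands with the same sign, reduces to a magnitude
 * subtraction ("true subtraction"); otherwise the magnitudes are simply added
 * ("true addition"). */
static bool is_true_subtraction(bool opcode_is_sub, bool sign_a, bool sign_b) {
    bool effective_sign_b = sign_b ^ opcode_is_sub;  /* subtraction flips b's sign */
    return sign_a != effective_sign_b;               /* differing effective signs  */
}

int main(void) {
    printf("%d\n", is_true_subtraction(false, 0, 1)); /* a + (-b) -> 1 (true subtraction) */
    printf("%d\n", is_true_subtraction(true,  0, 0)); /* a - (+b) -> 1 (true subtraction) */
    printf("%d\n", is_true_subtraction(false, 0, 0)); /* a + (+b) -> 0 (true addition)    */
    return 0;
}
```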
In an embodiment, the processor 102-1 may include one or more processor cores 106-1 through 106-M (referred to herein as "cores 106," or more generally as "core 106"), a cache 108 (which may be a shared cache or a private cache in various embodiments), and/or a router 110. The processor cores 106 may be implemented on a single integrated circuit (IC) chip. Moreover, the chip may include one or more shared and/or private caches (such as cache 108), buses or interconnections (such as a bus or interconnection 112), memory controllers (such as those discussed with reference to Figs. 8 and 9), or other components.
In one embodiment, the router 110 may be used to communicate between various components of the processor 102-1 and/or system 100. Moreover, the processor 102-1 may include more than one router 110. Furthermore, the multitude of routers (110) may be in communication to enable data routing between various components inside or outside of the processor 102-1.
The cache 108 may store data (e.g., including instructions) that are utilized by one or more components of the processor 102-1, such as the cores 106. For example, the cache 108 may locally cache data stored in a memory 114 for faster access by the components of the processor 102. As shown in Fig. 1, the memory 114 may be in communication with the processors 102 via the interconnection 104. In an embodiment, the cache 108 (that may be shared) may include a mid-level cache and/or a last level cache (LLC). Also, each of the cores 106 may include a level 1 (L1) cache (116-1) (generally referred to herein as "L1 cache 116"). Various components of the processor 102-1 may communicate with the cache 108 directly, through a bus (e.g., the bus 112), and/or a memory controller or hub.
Fig. 2 illustrates a block diagram of portions of a processor core 106, according to an embodiment of the invention. In one embodiment, the arrows shown in Fig. 2 illustrate the flow direction of instructions through the core 106. One or more processor cores (such as the processor core 106) may be implemented on a single integrated circuit chip (or die) such as discussed with reference to Fig. 1. Moreover, the chip may include one or more shared and/or private caches (e.g., cache 108 of Fig. 1), interconnections (e.g., interconnections 104 and/or 112 of Fig. 1), memory controllers, or other components.
As illustrated in Fig. 2, the processor core 106 may include a fetch unit 202 to fetch instructions for execution by the core 106. The instructions may be fetched from any storage devices such as the memory 114 and/or the memory devices discussed with reference to Figs. 8 and 9. The core 106 may also include a decode unit 204 to decode the fetched instruction. For instance, the decode unit 204 may decode the fetched instruction into a plurality of uops (micro-operations). Additionally, the core 106 may include a schedule unit 206. The schedule unit 206 may perform various operations associated with storing decoded instructions (e.g., received from the decode unit 204) until the instructions are ready for dispatch, e.g., until all source values of a decoded instruction become available. In one embodiment, the schedule unit 206 may schedule and/or issue (or dispatch) decoded instructions to an execution unit 208 for execution. The execution unit 208 may execute the dispatched instructions after they are decoded (e.g., by the decode unit 204) and dispatched (e.g., by the schedule unit 206). In an embodiment, the execution unit 208 may include more than one execution unit, such as a memory execution unit, an integer execution unit, a floating-point execution unit, or other execution units. The execution unit 208 may also perform various arithmetic operations such as addition, subtraction, multiplication, and/or division, and may include one or more arithmetic logic units (ALUs). In an embodiment, a co-processor (not shown) may perform various arithmetic operations in conjunction with the execution unit 208.
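The dispatch rule described for the schedule unit 206 (hold a decoded uop until all of its source values are available) can be pictured with the small sketch below; the data structures and names are hypothetical and only illustrate the readiness check, not the patent's hardware scheduler.

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy sketch: a decoded uop is held until every one of its source values is
 * available, then it may be issued to an execution unit. */
#define MAX_SRC 2

struct uop {
    const char *name;
    int num_src;
    bool src_ready[MAX_SRC];
};

static bool ready_for_dispatch(const struct uop *u) {
    for (int i = 0; i < u->num_src; ++i)
        if (!u->src_ready[i]) return false;   /* a source value is still pending */
    return true;
}

int main(void) {
    struct uop add = {"fadd", 2, {true, false}};
    printf("%s ready: %d\n", add.name, ready_for_dispatch(&add));  /* 0: waiting on a source */
    add.src_ready[1] = true;
    printf("%s ready: %d\n", add.name, ready_for_dispatch(&add));  /* 1: can be dispatched   */
    return 0;
}
```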
As shown in Fig. 2, the execution unit 208 may include a floating point (FP) adder 209 to perform addition, subtraction, comparison, and/or format conversion of floating point numbers that may be represented in varying floating point representation formats. In one embodiment, the floating point numbers being added and/or subtracted may have any format, e.g., including a single precision, a double precision, and/or a double-extended precision floating point number format (such as those discussed with reference to Figs. 5 and 6). In an embodiment, the adder 209 may operate in response to a single instruction, multiple data (SIMD) instruction. Generally, an SIMD instruction may cause identical operations to be performed on multiple pieces of data at the same time, e.g., in parallel. Moreover, in accordance with at least one instruction set architecture, the SIMD instruction may correspond to streaming SIMD extensions (SSE) or other forms of streaming SIMD implementations (such as streaming SIMD extensions 2 (SSE2)). Further details regarding various embodiments of the adder 209 will be further discussed herein, e.g., with reference to Figs. 3-7. Also, the execution unit 208 may include more than one floating point adder 209. Further, the execution unit 208 may execute instructions out-of-order. Hence, the processor core 106 may be an out-of-order processor core in one embodiment. The core 106 may also include a retirement unit 210. The retirement unit 210 may retire executed instructions after they are committed. In an embodiment, retirement of the executed instructions may result in processor state being committed from the execution of the instructions, physical registers used by the instructions being de-allocated, etc.
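To illustrate the SIMD behavior mentioned above, here is a minimal scalar model of a two-lane packed single-precision add, i.e., the same operation applied to each pair of elements. The lane count of two matches the two single-precision pairs the adder is later described as handling, but the function and names are illustrative rather than the patent's hardware or any particular SSE intrinsic.

```c
#include <stdio.h>

/* Illustrative only: the packed semantics of a two-lane single-precision SIMD
 * add.  Real SSE code would use compiler intrinsics; this scalar loop just
 * shows the behavior (identical operation on each lane, in effect in parallel). */
#define LANES 2

static void packed_add(const float *a, const float *b, float *out) {
    for (int i = 0; i < LANES; ++i)
        out[i] = a[i] + b[i];   /* identical operation on each pair of elements */
}

int main(void) {
    float a[LANES] = {1.5f, -2.25f};
    float b[LANES] = {0.5f,  4.0f};
    float r[LANES];
    packed_add(a, b, r);
    printf("%g %g\n", r[0], r[1]);   /* 2 1.75 */
    return 0;
}
```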
The core 106 may additionally include a trace cache or microcode read-only memory (uROM) 212 to store microcode and/or traces of instructions that have been fetched (e.g., by the fetch unit 202). The microcode stored in the uROM 212 may be used to configure various hardware components of the core 106. In an embodiment, the microcode stored in the uROM 212 may be loaded from another component in communication with the processor core 106, such as a computer-readable medium or other storage device discussed with reference to Figs. 8 and 9. The core 106 may also include a bus unit 220 to enable communication between components of the processor core 106 and other components (such as the components discussed with reference to Fig. 1) via one or more buses (e.g., buses 104 and/or 112). The core 106 may also include one or more registers 222 to store data accessed by various components of the core 106.
Fig. 3a-b illustrates a block diagram of portions of a floating point adder (209), according to an embodiment of the invention. The floating point adder 209 of Fig. 3a-b may be the same or similar to the floating point adder 209 discussed with reference to Fig. 2. The widths of various signal paths of the adder 209 are shown in Fig. 3a-b in accordance with an embodiment of the invention. Also, as illustrated in Fig. 3a-b, the adder 209 may include an exponent path 302 and a mantissa path 304 to perform various operations corresponding to addition (or subtraction) of two operands 306 and 308.
As shown in Fig. 3a-b, the adder 209 may include various portions including an alignment portion 305. The alignment portion 305 may include an operand formatting logic 310, e.g., to modify one or more of the operands 306 and 308 from a first format (such as those shown in Fig. 5) into a second format (such as those shown in Fig. 6). The exponent path 302 may receive an opcode 312 that corresponds to an arithmetic operation (such as an addition or subtraction). A logic 314 may determine (e.g., look up from a table or storage unit that includes predefined data) an exponent corresponding to opcode 312 for a conversion instruction. Generally, a conversion instruction may operate on a single operand (e.g., operand 308), while another operand (e.g., operand 306) is supplied with a zero value. Hence, in one embodiment, a predefined exponent from 314 may be used to calculate a resultant exponent and align data if it is needed, as will be discussed further with reference to Fig. 3a-b. To this end, a multiplexer 316 may receive and select one of the exponents from the logic 314 and an exponent corresponding to one of the operands (e.g., operand 306). In an embodiment, the multiplexer 316 may select one of its inputs based on the opcode 312. An exponent difference logic 318 may receive and compare the selected exponent from the multiplexer 316 and an exponent corresponding to the operand 308. The logic 318 may generate one or more signals based on the result of the comparison (which may be one or more subtraction operations in an embodiment) and provide the generated signals (such as subtraction results and carry outs) to various components of the adder 209, for example, to provide mantissa alignment, such as discussed and illustrated with reference to Fig. 3a-b.
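A software analogue of the exponent-path bookkeeping may help here. The sketch below (hypothetical struct and function names, not the gate-level logic 318) subtracts the biased exponents, uses the sign of the difference in place of a hardware carry-out to decide which operand is smaller, and reports the alignment shift together with the exponent the result will initially carry.

```c
#include <stdio.h>

/* Sketch of the exponent-difference step: compare the two biased exponents,
 * report which operand is smaller and by how many bit positions its mantissa
 * must be aligned. */
struct exp_cmp {
    int shift;        /* right-shift amount for the smaller operand's mantissa */
    int b_is_smaller; /* 1 if operand b has the smaller exponent               */
    int larger_exp;   /* exponent that the result will (initially) carry       */
};

static struct exp_cmp compare_exponents(int exp_a, int exp_b) {
    struct exp_cmp c;
    c.b_is_smaller = exp_b < exp_a;               /* plays the role of a carry-out */
    c.shift      = c.b_is_smaller ? exp_a - exp_b : exp_b - exp_a;
    c.larger_exp = c.b_is_smaller ? exp_a : exp_b;
    return c;
}

int main(void) {
    struct exp_cmp c = compare_exponents(130, 127);
    printf("shift=%d b_is_smaller=%d larger_exp=%d\n", c.shift, c.b_is_smaller, c.larger_exp);
    return 0;
}
```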
The mantissa path 304 may include logics 320 and 322 to receive the formatted operands from the logic 310 and swap (or extract a portion of) the mantissas, e.g., based on carry-out signals generated from the exponent difference computation by the logic 318. Alignment of the mantissas corresponding to the operand with smaller exponent is performed using rotators (e.g., logics 324 and 326) and mask generators (e.g., mask generators 336 and 338). In an embodiment, one or more of the signals generated by the logic 318 may be used for determining shift code alignment, e.g., to enable the mantissas corresponding to the operand with smaller exponent to be cycle shifted right by rotators 324 and 326. Also, in one embodiment, the shift code signals provided by the logic 318 to logics 324 and 326 may be five bits wide. For double precision and double-extended precision operands, the shift code (and/or carryout) signals provided to logics 324 and 326 may be the same in an embodiment. Moreover, the logics 320 and 322 may provide the mantissas of operands 306 and 308 with larger exponents to inverters 328 and 330 and multiplexers 332 and 334. Moreover, mask generators 336 and 338 may generate masks based on shift code signals from the logic 318 to enable the shifting of one or more bits of the outputs of logics 324 and 326.
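The rotate-and-mask alignment can be sketched in software as follows, assuming a 32-bit lane for simplicity (the patent's lanes are 32 and 36 bits wide): a cyclic right rotation followed by an AND with a mask is equivalent to a logical right shift, and the complementary mask recovers exactly the bits that were shifted out, which is what later feeds the sticky-bit computation. Names and widths below are illustrative.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of rotate-then-mask alignment on a 32-bit lane. */
static uint32_t rotr32(uint32_t x, unsigned shift) {
    shift &= 31;
    return shift ? (x >> shift) | (x << (32 - shift)) : x;
}

static uint32_t align(uint32_t mantissa, unsigned shift,
                      uint32_t *shifted_out /* wrapped bits, for sticky */) {
    uint32_t rotated   = rotr32(mantissa, shift);
    uint32_t keep_mask = (shift >= 32) ? 0 : (0xFFFFFFFFu >> shift);
    *shifted_out = rotated & ~keep_mask;   /* bits that fell off the right end */
    return rotated & keep_mask;            /* properly aligned mantissa        */
}

int main(void) {
    uint32_t out, aligned = align(0x00C00005u, 3, &out);
    printf("aligned=0x%08X shifted_out=0x%08X\n", aligned, out);
    return 0;
}
```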
As shown in Fig. 3a-b, the outputs of logics 324 and 326 may be shifted left by one bit by logics 340 and 342, respectively. In particular, an operand analyzer 344 may analyze the operands 306 and 308, and generate one or more signals to enable shifting in logics 340 and 342 if one of the operands is denormal. Logic 346 logically combines (e.g., by using an AND operation) the outputs of the mask generator 336 and logic 340. Similarly, logic 348 logically combines (e.g., by using an AND operation) the outputs of the mask generator 338 and logic 342. A multiplexing logic 350 receives the outputs of the logics 346 and 348 and provides a signal to one of the inputs of each of the adders 352 and 354 which are in an addition portion 355 of the adder 209. Additional details regarding an embodiment of the logics 346, 348, and 350 are further discussed with reference to Fig. 4.
As illustrated in Fig. 3a-b, the adders 352 and 354 may also receive an input signal from the multiplexers 332 and 334. The multiplexers 332 and 334 may select one of their inputs based on signals 356 (Compl_Hi) and 358 (Compl_Lo) which may be generated based on opcode (312) and signs of operands (e.g., operands 306 and 308) to provide a (true) subtraction (e.g., subtraction of operands with the same signs or the addition of operands with different signs) or a (true) addition operation (e.g., subtraction of operands with different signs or the addition of operands with the same signs) by an opcode decoder logic (not shown). Accordingly, the adders 352 and 354 may receive aligned and non-aligned mantissas, e.g., through multiplexers 332 and 334 which may in turn provide the non-inverted or inverted (for example by inverters 328 and 330) mantissas selected by logics 320 and 322, respectively. The adders 352 and 354 also receive carry in signals. For example, adder 354 receives the signal 358 as its carry in signal, e.g., to provide full two's complementing for true subtraction cases. The adder 352 receives as its carry in signal a carry out signal 360 from the adder 354 or the signal 356 that may be provided through a multiplexer 362 based on the precision format of the opcode 312. The outputs of the adders 352 and 354 are provided to inverters 364 and 366, and multiplexers 368 and 370. The multiplexers 368 and 370 may select one of their input signals as output based on a selection signal generated by the adder 352 and a multiplexer 371, respectively. In an embodiment, since the mantissa of the operand with the larger or equal exponent may be two's complemented for the true subtraction cases, the result of the addition may be negative and can be two's complemented. The two's complementing may be performed by inversion of the results of adders 352 and 354 and adding of a binary one ("1") using a rounder hardware (e.g., logic 397).
The exponent path 302 of the addition portion 355 may also include a logic 372 to generate a limiter shift value for normalization, e.g., because the adder 209 may support gradual underflow. The outputs of multiplexers 368 and 370 and the logic 372 are provided to a normalization portion 373 (of the adder 209) including the leading zero detection (LZD) logics 374 and 376. More particularly, the logics 374 and 376 may determine shift codes for normalization, e.g., by detecting the leading zeros in the results of the addition that are provided by the adders 352 and 354 through the multiplexers 368 and 370. The output signals from the logics 374 and 376 may be provided to logics 378 and 380, together with the output signals from the multiplexers 368 and 370. The logics 378 and 380 may perform cycle shifts left based on the outputs of logics 374 and 376 to provide normalization on the addition results. As shown in Fig. 3a-b, the outputs of the logics 374 and 376 may be provided to an exponent adjustment logic 382 and mask generators 384 and 386. The mask generators 384 and 386 may generate masks based on shift code signals from the logics 374 and 376 to enable normalization of the outputs of logics 378 and 380 by logics 388 and 390, respectively. In an embodiment, the logics 388 and 390 may logically combine their inputs (e.g., by utilizing a logic AND operation) such as discussed with reference to logics 346 and 348.
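As a rough software stand-in for the add/subtract core (adders 352/354 plus the surrounding inverters and multiplexers; the function below is illustrative, not gate-accurate): for a true subtraction the aligned mantissa is inverted and a carry-in of one completes the two's complement, and a raw result that comes out negative is complemented back.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

/* Sketch of the magnitude add/subtract performed on aligned mantissas.  For a
 * true subtraction the second operand is inverted and a carry-in of 1 is
 * injected (two's complement); a negative raw result is complemented back. */
static uint64_t mantissa_addsub(uint64_t m_larger_exp, uint64_t m_smaller_aligned,
                                bool true_subtraction, bool *result_negative) {
    uint64_t addend   = true_subtraction ? ~m_smaller_aligned : m_smaller_aligned;
    uint64_t carry_in = true_subtraction ? 1u : 0u;
    uint64_t sum = m_larger_exp + addend + carry_in;   /* wraps modulo 2^64 */

    *result_negative = true_subtraction && (m_smaller_aligned > m_larger_exp);
    if (*result_negative)
        sum = ~sum + 1u;                               /* re-complement a negative sum */
    return sum;
}

int main(void) {
    bool neg;
    printf("%llu\n", (unsigned long long)mantissa_addsub(5, 9, true, &neg));  /* 4, neg=1 */
    printf("%llu\n", (unsigned long long)mantissa_addsub(9, 5, true, &neg));  /* 4, neg=0 */
    printf("%llu\n", (unsigned long long)mantissa_addsub(9, 5, false, &neg)); /* 14       */
    return 0;
}
```

The normalization step can be sketched similarly; the following assumes a 32-bit sum and an illustrative minimum exponent, and mimics the limiter of logic 372 by never shifting further than the exponent allows (gradual underflow).

```c
#include <stdint.h>
#include <stdio.h>

#define EXP_MIN 1          /* illustrative minimum biased exponent */

/* Count leading zeros of the raw sum, shift left by that amount (capped by the
 * underflow limiter), and subtract the applied shift from the exponent. */
static uint32_t normalize(uint32_t sum, int *exponent) {
    if (sum == 0) return 0;                          /* nothing to normalize */
    int lz = 0;
    while (!((sum << lz) & 0x80000000u)) lz++;       /* leading-zero count */
    int limit = *exponent - EXP_MIN;                 /* limiter for gradual underflow */
    int shift = lz < limit ? lz : limit;
    if (shift < 0) shift = 0;
    *exponent -= shift;
    return sum << shift;
}

int main(void) {
    int e = 20;
    uint32_t m = normalize(0x00012345u, &e);
    printf("m=0x%08X e=%d\n", m, e);
    return 0;
}
```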
The output signals from the logics 388 and 390 may be selected by multiplexing logic 392 (e.g., such as discussed with reference to logic 350 in an embodiment) to provide an output to the rounding portion 393 of the adder 209. In accordance with one embodiment, the logic 392 may provide guard and/or round bits to the rounding portion 393. In an embodiment, in the addition portion 355 of the adder 209, logics 394 and
395 may compute sticky bits, e.g., by logically combining (for example, through a logic OR operation) the shifted-out bits provided by the logics 346 and 348, as will be further discussed with reference to Fig. 4. In turn, the logic 396 may combine the outputs of the logics 394 and 395 to provide two sticky bits for two single-precision operands, and a single sticky bit for double-precision and double-extended precision operands. The output signals from the logics 396 and 392 are provided to the rounder logic 397 to perform rounding of the addition (or subtraction) of the mantissas. Additionally, a logic 398 may receive the exponent from the logic 382 and modify (or fix) the exponent for round-up situations, e.g., by adding a one if a round up occurs. Moreover, the logic 382 may adjust the exponent (e.g., received from the logic 318) by the shift code for normalization (e.g., provided by the logics 374 and 376). In an embodiment, the logic 382 may subtract the shift codes received from the logics 374 and 376 from the larger exponent provided by the logic 318. Hence, in one embodiment, the larger exponent provided by the logic 318 may be corrected for normalization (e.g., by logic 382) and for round-up situations (e.g., by logic 398).
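A simplified software view of the sticky-bit computation and the rounding step may look as follows (round-to-nearest-even is used only as an example rounding mode, and the names, bit positions, and 24-bit mantissa width are illustrative assumptions rather than details of the rounder logic 397):

def round_mantissa(mantissa, guard, rnd, shifted_out_bits, exponent, prec=24):
    # Sticky bit: OR-reduction of everything shifted out below the round bit.
    sticky = 1 if shifted_out_bits else 0
    lsb = mantissa & 1
    # Round to nearest even: round up when the discarded part exceeds half an
    # ulp, or equals half an ulp and the kept LSB is odd.
    if guard and (rnd or sticky or lsb):
        mantissa += 1
        if mantissa >> prec:        # rounding overflowed the mantissa
            mantissa >>= 1
            exponent += 1           # exponent fix-up for the round-up case
    return mantissa, exponent, sticky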
In one embodiment (such as illustrated in Fig. 3a-b), the mantissa path 304 may include two separate paths to process the most significant (MS) 32 bits and least significant (LS) 36 bits of operands 306 and 308. For example, a first MS 32-bit path (e.g., including logics 320, 324, 352, and/or 378) may operate on a first set of data (e.g., a pair of single precision floating point mantissas such as discussed with reference to the operand 602 of Fig. 6), while a second LS 36-bit path (e.g., including logics 322, 326, 354, and/or 380) may operate on a second set of data (which may be a different pair of single precision floating point mantissas). Hence, two pairs of single precision mantissas may be processed in these two paths independently. Also, a combination of the first and second paths may be used to operate on double precision or double extended precision operands (e.g., operands 630 and/or 650 of Fig. 6). As shown in Fig. 3a-b, logics 350 and 392 may enable combination of signals between these two mantissa paths.
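The idea of reusing one wide datapath either as two independent single precision lanes or as a single wider lane can be illustrated behaviourally (this sketch uses the host floating point unit via Python's struct module and says nothing about the internal MS 32-bit / LS 36-bit split; the packing order and names are assumptions):

import struct

def add_packed(op1, op2, precision):
    # Reinterpret the same 8-byte operands either as two packed single
    # precision lanes (added independently) or as one double precision value.
    if precision == "single":
        a_lo, a_hi = struct.unpack("<2f", op1)
        b_lo, b_hi = struct.unpack("<2f", op2)
        return struct.pack("<2f", a_lo + b_lo, a_hi + b_hi)
    (a,) = struct.unpack("<d", op1)
    (b,) = struct.unpack("<d", op2)
    return struct.pack("<d", a + b)

# e.g. add_packed(struct.pack("<2f", 1.5, 2.5), struct.pack("<2f", 0.25, 0.5), "single")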
Fig. 4 illustrates a block diagram of further details of portions of the adder 209 of Fig. 3a-b, according to an embodiment of the invention. As shown in Fig. 4, signals 402-410 that are generated by the logic 318 may be provided to multiplexers 412-416. The inputs to the multiplexers 412-416 may be selected by signals that are generated based on the precision format of the opcode 312. The outputs of the multiplexers 412, 414, and 416 are provided to the logic 320, the logics 336 and 324, and the logics 338 and 326, respectively. In various embodiments, signal 402 may correspond to a shift code for alignment of the MS 32 bits for the single precision case; signal 404 may correspond to a shift code for alignment for the double precision or double extended precision cases; signal 406 may correspond to a shift code for alignment of the LS 36 bits for the single precision case; signal 408 may correspond to a carry-out signal from the exponent difference of the second pair of single precision data; and signal 410 may correspond to a carry-out signal from the exponent difference of double precision or double extended precision data.
As shown in Fig. 4, in an embodiment, the logic 346 may include AND gates 424 and 426 to combine the outputs of the logics 340 and 336. Similarly, the logic 348 may include AND gates 428 and 430 to combine the outputs of the logics 338 and 342. Also, one of the inputs to the gates 426 and 430 may be inverted, such as shown in Fig. 4. Furthermore, an OR gate 434 may combine the outputs of the gates 426 and 428 (e.g., by logically OR-ing the outputs of the gates 426 and 428). Additionally, the logic 350 may include multiplexers 436-440. As shown, the inputs of the multiplexers 436-440 may be selected by a signal 442 which is generated by a logic (e.g., based on the precision format of the opcode 312 and on signals from the logic 318 of Fig. 3) to indicate how an aligned mantissa is combined with signals from a storage unit 441 (which may be a hardware register in an embodiment) and the logics 424, 434, and 428. Additionally, the multiplexers 436-440 may receive a signal from the storage unit 441 (e.g., including all zeros) to fill the first 32 bits of the output of the logic 350 with zeros when the exponent difference is more than 32 bits, or to fill the first 64 bits of the output of the logic 350 with zeros when the exponent difference is more than 64 bits. The logic 350 may provide the outputs of the multiplexers 436-440 in the 68-bit format 444 (which, in one embodiment, includes a most significant (MS) 32-bit portion 446, a middle 32-bit portion 448, and a least significant (LS) 32-bit portion 450) to the adders 352 and 354, such as illustrated in Fig. 4.
In various embodiments, portions 446, 448, and 450 may be provided in accordance with one or more of the following (see also the sketch following this list): • If the opcode 312 corresponds to a single precision format and the exponent difference (318) of the second pair of single precision operands (306 and 308) is less than 24, then portion 446 may be supplied by logic 424 through logic 436. A similar situation may also apply to the first pair of operands (306, 308); namely, portion 448 may be supplied by logic 428 through logic 438. Moreover, in an embodiment (such as discussed with reference to Figs. 5 and 6), each of the operands (306 and 308) may include two single precision numbers (e.g., op1={x1,x0} and op2={y1,y0}, where "{}" indicates concatenation). In such an embodiment, the first pair may correspond to x0 and y0, while the second pair may correspond to x1 and y1.
• If the opcode 312 corresponds to double or double extended precision formats and exponent difference (318) is less than 32, then portion 446 may be supplied by logic
424 and portion 448 may be supplied by logic 434.
• If the opcode 312 corresponds to double or double extended precision formats and the exponent difference (318) is less than 64 and more than 32, then portion 446 may be supplied by the storage unit 441 and portion 448 may be supplied by logic 424.
• If the opcode 312 corresponds to double or double extended precision formats and the exponent difference (318) is more than 64, then portion 446 may be supplied by the storage unit 441, portion 448 may be supplied by the storage unit 441, and portion 450 may be supplied by logic 424.
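The sketch below summarizes the selection just listed (the thresholds come from the list above; the handling of the exact boundary values, the single precision case with a larger exponent difference, and all parameter names are simplifying assumptions):

def select_portions(precision, exp_diff, hi_kept, hi_or_lo, lo_kept, zeros=0):
    # hi_kept  ~ output of the MS AND stage (logic 424)
    # lo_kept  ~ output of the LS AND stage (logic 428)
    # hi_or_lo ~ OR of the two (logic 434), for a mantissa straddling both halves
    # zeros    ~ zero fill from the storage unit (441)
    ms = mid = ls = zeros
    if precision == "single":
        if exp_diff < 24:
            ms, mid = hi_kept, lo_kept      # two independent single precision lanes
    elif exp_diff < 32:
        ms, mid = hi_kept, hi_or_lo
    elif exp_diff < 64:
        ms, mid = zeros, hi_kept            # aligned bits moved down one portion
    else:
        ms, mid, ls = zeros, zeros, hi_kept # shifted past 64 bits: zero fill above
    return ms, mid, ls                      # portions 446, 448, 450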
Fig. 5 illustrates sample operand formats 500 for the operands 306 and 308 of Fig. 3a-b, in accordance with an embodiment of the invention. Fig. 6 illustrates formatted floating point adder operand formats 600 corresponding to the formats 500 of Fig. 5, after the operands of Fig. 5 are formatted by the logic 310 of Fig. 3a-b. The width of each field of the operands shown in Figs. 5 and 6 is illustrated in accordance with some embodiments of the invention.
Referring to Fig. 5, a single precision operand 502 (which may represent two single precision floating point numbers in an embodiment) may include sign fields 504 and 506, exponent fields 508 and 510, and mantissa fields 512 and 514. Also, a double precision operand 520 may include a sign field 522, an exponent field 524, and a mantissa field 526. Furthermore, a double-extended precision operand 530 may include a sign field 532, an exponent field 534, a J field 536 (which may indicate whether the mantissa is normalized), and a mantissa field 538. Generally, a J bit (536) may correspond to the integer part of a mantissa, which may be hidden in the single precision and double precision formats. Further, the J bit may be set to zero for denormals.
Referring to Fig. 6, a single precision operand 602 may include a sign field 604 (which may correspond to the sign field 504), exponent fields 606 and 608 (which may correspond to fields 508 and 510 in an embodiment), a zero field 610 (which may correspond to the sign field 506), overflow fields 612 and 614 (e.g., to indicate an overflow condition in a path of the adder 209), J fields 616 and 618 (e.g., to indicate that the corresponding floating point number is normal), and mantissa fields 620 and 622 (which may correspond to fields 512 and 514 in an embodiment). Also, a double precision operand 630 may include a sign field 632 (which may correspond to field 522 in an embodiment), an exponent field 634 (which may correspond to field 524 in an embodiment), an overflow field 636 (e.g., to indicate an overflow condition), a J field 638 (e.g., to indicate that the corresponding floating point number is normal), and a mantissa field 640 (which may correspond to the field 526 in an embodiment). Furthermore, a double-extended precision operand 650 may include a sign field 652 (which may correspond to the field 532 in an embodiment), an exponent field 654 (which may correspond to the field 534 in an embodiment), an overflow field 656 (e.g., to indicate an overflow condition), a J field 658 (e.g., to indicate that the corresponding floating point number is normal), and a mantissa field 660 (which may correspond to the field 538 in an embodiment). As shown in Fig. 6, other fields of the operands 602, 630, and 650 may be unused (e.g., have all zeros). In an embodiment, the logic 310 may format the operands 502, 520, and 530 into the operands 602, 630, and 650, respectively.
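The formatting step that makes the implicit integer (J) bit explicit can be illustrated for a double precision input as follows (the 1/11/52 field split is the standard IEEE-754 double layout; the exact placement of the overflow and padding fields shown in Fig. 6 is not reproduced, and the function name is an assumption):

def format_double(bits):
    # Unpack an IEEE-754 double given as a 64-bit integer and prepend the
    # hidden integer bit (J), which is 0 for zero/denormal inputs.
    sign = (bits >> 63) & 0x1
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    j_bit = 0 if exponent == 0 else 1
    mantissa = (j_bit << 52) | fraction
    return sign, exponent, j_bit, mantissa

# e.g. format_double(0x3FF0000000000000) -> (0, 1023, 1, 1 << 52)  # the value 1.0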
Fig. 7 illustrates a flow diagram of a method 700 to add and/or subtract floating point numbers, in accordance with an embodiment of the invention. In one embodiment, the floating point numbers being added and/or subtracted may be represented in varying floating point representation formats, for example, as two single precision, double precision, and/or double-extended precision floating point numbers such as those discussed with reference to Figs. 5 and 6. In an embodiment, various components discussed with reference to Figs. 1-6 and 8-9 may be utilized to perform one or more of the operations discussed with reference to Fig. 7. For example, the method 700 may be used to add and/or subtract floating point numbers stored in (and/or read from) a storage unit such as the cache 108, cache 116, memory 114, and/or registers 222.
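Before the individual operations are walked through below, the overall flow (compare exponents, align, add, normalize, round) can be sketched in a few lines of Python for two positive normal numbers given as exponent/mantissa pairs with an explicit integer bit; signs, true subtraction, denormals, special values, and the multi-precision packing of the real datapath are deliberately left out of this assumed, simplified model:

def fp_add_magnitudes(exp_a, mant_a, exp_b, mant_b, prec=24):
    # Ensure operand A has the larger (or equal) exponent.
    if (exp_a, mant_a) < (exp_b, mant_b):
        exp_a, mant_a, exp_b, mant_b = exp_b, mant_b, exp_a, mant_a
    diff = exp_a - exp_b
    mant_a <<= 3                            # keep guard/round/sticky positions
    mant_b <<= 3
    if diff:                                # align the smaller operand
        lost = mant_b & ((1 << diff) - 1)
        mant_b = (mant_b >> diff) | (1 if lost else 0)   # fold lost bits into sticky
    total, exp = mant_a + mant_b, exp_a     # add the aligned mantissas
    if total >> (prec + 3):                 # normalize: carry out shifts right once
        total = (total >> 1) | (total & 1)
        exp += 1
    mant, grs, lsb = total >> 3, total & 0b111, (total >> 3) & 1
    if grs > 0b100 or (grs == 0b100 and lsb):   # round to nearest even
        mant += 1
        if mant >> prec:                    # rounding overflowed the mantissa
            mant >>= 1
            exp += 1
    return exp, mant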
Referring to Figs. 1-7, at an operation 702, the adder 209 may receive the opcode 312 and the operands 306-308. At an operation 704, the logic 310 may format the operands 306-308 such as discussed with reference to Fig. 3a-b. The logic 318 may compare the exponents at an operation 706, such as discussed with reference to Fig. 3a-b. The mantissas of the formatted operands may be aligned at an operation 708 by the alignment portion 305. At an operation 710, the aligned mantissas may be combined (e.g., added or subtracted) such as discussed with reference to the addition portion 355 of Fig. 3a-b. The results of the addition portion 355 of the adder 209 may be normalized by the normalization portion 373 at an operation 712. The results from the normalization portion 373 of the adder 209 may then be rounded at an operation 714, e.g., by the rounding portion 393 such as discussed with reference to Fig. 3a-b.
Fig. 8 illustrates a block diagram of a computing system 800 in accordance with an embodiment of the invention. The computing system 800 may include one or more central processing unit(s) (CPUs) 802 or processors that communicate via an interconnection network (or bus) 804. The processors 802 may include a general purpose processor, a network processor (that processes data communicated over a computer network 803), or other types of processors (including a reduced instruction set computer (RISC) processor or a complex instruction set computer (CISC) processor). Moreover, the processors 802 may have a single or multiple core design. The processors 802 with a multiple core design may integrate different types of processor cores on the same integrated circuit (IC) die. Also, the processors 802 with a multiple core design may be implemented as symmetrical or asymmetrical multiprocessors. In an embodiment, one or more of the processors 802 may be the same as or similar to the processors 102 of Fig. 1. For example, one or more of the processors 802 may include one or more of the cores 106 (e.g., including the adder 209) and/or the cache 108. Also, the operations discussed with reference to Figs. 1-7 may be performed by one or more components of the system 800.
A chipset 806 may also communicate with the interconnection network 804. The chipset 806 may include a memory control hub (MCH) 808. The MCH 808 may include a memory controller 810 that communicates with the memory 114. The memory 114 may store data, including sequences of instructions that are executed by the CPU 802, or any other device included in the computing system 800. In one embodiment of the invention, the memory 114 may include one or more volatile storage (or memory) devices such as random access memory (RAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), static RAM (SRAM), or other types of storage devices. Nonvolatile memory may also be utilized such as a hard disk. Additional devices may communicate via the interconnection network 804, such as multiple CPUs and/or multiple system memories.
The MCH 808 may also include a graphics interface 814 that communicates with a graphics accelerator 816. In one embodiment of the invention, the graphics interface 814 may communicate with the graphics accelerator 816 via an accelerated graphics port (AGP). In an embodiment of the invention, a display (such as a flat panel display) may communicate with the graphics interface 814 through, for example, a signal converter that translates a digital representation of an image stored in a storage device such as video memory or system memory into display signals that are interpreted and displayed by the display. The display signals produced by the display device may pass through various control devices before being interpreted by and subsequently displayed on the display.
A hub interface 818 may allow the MCH 808 and an input/output control hub (ICH) 820 to communicate. The ICH 820 may provide an interface to I/O devices that communicate with the computing system 800. The ICH 820 may communicate with a bus 822 through a peripheral bridge (or controller) 824, such as a peripheral component interconnect (PCI) bridge, a universal serial bus (USB) controller, or other types of peripheral bridges or controllers. The bridge 824 may provide a data path between the CPU 802 and peripheral devices. Other types of topologies may be utilized. Also, multiple buses may communicate with the ICH 820, e.g., through multiple bridges or controllers. Moreover, other peripherals in communication with the ICH 820 may include, in various embodiments of the invention, integrated drive electronics (IDE) or small computer system interface (SCSI) hard drive(s), USB port(s), a keyboard, a mouse, parallel port(s), serial port(s), floppy disk drive(s), digital output support (e.g., digital video interface (DVI)), or other devices.
The bus 822 may communicate with an audio device 826, one or more disk drive(s) 828, and a network interface device 830 (which is in communication with the computer network 803). Other devices may communicate via the bus 822. Also, various components (such as the network interface device 830) may communicate with the MCH 808 in some embodiments of the invention. In addition, the processor 802 and the MCH 808 may be combined to form a single chip. Furthermore, the graphics accelerator 816 may be included within the MCH 808 in other embodiments of the invention.
Furthermore, the computing system 800 may include volatile and/or nonvolatile memory (or storage). For example, nonvolatile memory may include one or more of the following: read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically EPROM (EEPROM), a disk drive (e.g., 828), a floppy disk, a compact disk ROM (CD-ROM), a digital versatile disk (DVD), flash memory, a magneto-optical disk, or other types of nonvolatile machine-readable media that are capable of storing electronic data (e.g., including instructions).
Fig. 9 illustrates a computing system 900 that is arranged in a point-to-point
(PtP) configuration, according to an embodiment of the invention. In particular, Fig. 9 shows a system where processors, memory, and input/output devices are interconnected by a number of point-to-point interfaces. The operations discussed with reference to Figs. 1-8 may be performed by one or more components of the system 900.
As illustrated in Fig. 9, the system 900 may include several processors, of which only two, processors 902 and 904 are shown for clarity. The processors 902 and 904 may each include a local memory controller hub (MCH) 906 and 908 to enable communication with memories 910 and 912. The memories 910 and/or 912 may store various data such as those discussed with reference to the memory 114 of Fig. 8.
In an embodiment, the processors 902 and 904 may be one of the processors 802 discussed with reference to Fig. 8. The processors 902 and 904 may exchange data via a point-to-point (PtP) interface 914 using PtP interface circuits 916 and 918, respectively. Also, the processors 902 and 904 may each exchange data with a chipset 920 via individual PtP interfaces 922 and 924 using point-to-point interface circuits 926, 928, 930, and 932. The chipset 920 may further exchange data with a high-performance graphics circuit 934 via a high-performance graphics interface 936, e.g., using a PtP interface circuit 937.
At least one embodiment of the invention may be provided within the processors 902 and 904. For example, one or more of the cores 106 (e.g., including the adder 209) and/or cache 108 of Fig. 1 may be located within the processors 902 and 904. Other embodiments of the invention, however, may exist in other circuits, logic units, or devices within the system 900 of Fig. 9. Furthermore, other embodiments of the invention may be distributed throughout several circuits, logic units, or devices illustrated in Fig. 9.
The chipset 920 may communicate with a bus 940 using a PtP interface circuit 941. The bus 940 may have one or more devices that communicate with it, such as a bus bridge 942 and I/O devices 943. Via a bus 944, the bus bridge 942 may communicate with other devices such as a keyboard/mouse 945, communication devices 946 (such as modems, network interface devices, or other communication devices that may communicate with the computer network 803), an audio I/O device 947, and/or a data storage device 948. The data storage device 948 may store code 949 that may be executed by the processors 902 and/or 904.
In various embodiments of the invention, the operations discussed herein, e.g., with reference to Figs. 1-9, may be implemented as hardware (e.g., circuitry), software, firmware, microcode, or combinations thereof, which may be provided as a computer program product, e.g., including a machine-readable or computer-readable medium having stored thereon instructions (or software procedures) used to program a computer to perform a process discussed herein. Also, the term "logic" may include, by way of example, software, hardware, or combinations of software and hardware. The machine-readable medium may include a storage device such as those discussed with respect to Figs. 1-9. Additionally, such computer-readable media may be downloaded as a computer program product, wherein the program may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a bus, a modem, or a network connection). Accordingly, herein, a carrier wave shall be regarded as comprising a machine-readable medium.
Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least an implementation. The appearances of the phrase "in one embodiment" in various places in the specification may or may not be all referring to the same embodiment.
Also, in the description and claims, the terms "coupled" and "connected," along with their derivatives, may be used. In some embodiments of the invention, "connected" may be used to indicate that two or more elements are in direct physical or electrical contact with each other. "Coupled" may mean that two or more elements are in direct physical or electrical contact. However, "coupled" may also mean that two or more elements may not be in direct contact with each other, but may still cooperate or interact with each other.
Thus, although embodiments of the invention have been described in language specific to structural features and/or methodological acts, it is to be understood that claimed subject matter may not be limited to the specific features or acts described. Rather, the specific features and acts are disclosed as sample forms of implementing the claimed subject matter.

Claims

What is claimed is:
1. A processor comprising: a first logic (310) to convert a first operand (306, 308) from a first format (502, 520, 530) into a second format (602, 630, 650); and a second logic (352, 354) to combine a portion of the converted first operand with a portion of a second operand that is in the second format.
2. The processor of claim 1, further comprising a third logic (318) to compare a first exponent corresponding to the first operand with a second exponent of the second operand.
3. The processor of claim 1, wherein the second logic combines a plurality of single precision operands in a same path (304) as a double precision or a double-extended precision operand.
4. The processor of claim 1, further comprising a third logic (310) to convert the second operand from a third format into the second format.
5. The processor of claim 1, wherein the second logic combines the portion of the converted first operand and the portion of the second operand by an addition operation or a subtraction operation.
6. The processor of claim 1, further comprising a third logic (397) to round results of the combination by the second logic.
7. The processor of claim 1, further comprising a third logic (344) to analyze a portion of the converted first operand and the second operand to determine whether one of the first or second operands corresponds to a denormal operand.
8. The processor of claim 1, further comprising one or more processor cores (106), wherein at least some of the one or more processor cores comprise one or more of the first logic or the second logic.
9. The processor of claim 8, wherein at least one of the one or more processor cores (106), the first logic, and the second logic are on a same die.
10. A method comprising: modifying (704) a plurality of operands into a same format; and combining (710) a plurality of mantissas corresponding to the modified plurality of operands.
11. The method of claim 10, further comprising comparing (706) portions of the modified plurality of operands.
12. The method of claim 10, further comprising aligning (708) portions of the plurality of mantissas.
13. The method of claim 10, further comprising normalizing (712) results of the combination of the plurality of mantissas.
14. The method of claim 10, further comprising rounding (714) results of the combination of the plurality of mantissas.
15. A system comprising: a memory (108, 114, 116) to store data; a first logic (202) to fetch an opcode (312), a first operand (306), and a second operand (308) from the memory; a second logic (310) to modify the first operand and the second operand into a same format; and a third logic (324, 326) to align one of the first or second operands in accordance with a comparison (318) of a first exponent corresponding to the first operand and a second exponent corresponding to the second operand.
16. The system of claim 15, further comprising a fourth logic (352, 354) to combine a portion of the first operand and a portion of the second operand.
17. The system of claim 15, further comprising a fourth logic (344) to analyze a portion of the first operand and the second operand to determine whether one of the first or second operands corresponds to a denormal operand.
18. The system of claim 15, wherein the memory comprises one or more of a level 1 cache, a mid-level cache, or a last level cache.
19. The system of claim 15, further comprising a plurality of processor cores
(106) to access the data stored in the memory.
20. The system of claim 15, further comprising an audio device (947).
PCT/RU2006/000236 2006-05-16 2006-05-16 Floating point addition for different floating point formats WO2007133101A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
DE112006003875T DE112006003875T5 (en) 2006-05-16 2006-05-16 Floating point addition for different floating point formats
CN200680054583.XA CN101438232B (en) 2006-05-16 2006-05-16 The floating add of different floating-point format
US10/589,448 US20080133895A1 (en) 2006-05-16 2006-05-16 Floating Point Addition
PCT/RU2006/000236 WO2007133101A1 (en) 2006-05-16 2006-05-16 Floating point addition for different floating point formats

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/RU2006/000236 WO2007133101A1 (en) 2006-05-16 2006-05-16 Floating point addition for different floating point formats

Publications (1)

Publication Number Publication Date
WO2007133101A1 true WO2007133101A1 (en) 2007-11-22

Family

ID=37890158

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/RU2006/000236 WO2007133101A1 (en) 2006-05-16 2006-05-16 Floating point addition for different floating point formats

Country Status (4)

Country Link
US (1) US20080133895A1 (en)
CN (1) CN101438232B (en)
DE (1) DE112006003875T5 (en)
WO (1) WO2007133101A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013155744A1 (en) * 2012-04-20 2013-10-24 Huawei Technologies Co., Ltd. System and method for signal processing in digital signal processors
WO2015096001A1 (en) * 2013-12-23 2015-07-02 Intel Corporation System-on-a-chip (soc) including hybrid processor cores

Families Citing this family (42)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8515052B2 (en) 2007-12-17 2013-08-20 Wai Wu Parallel signal processing system and method
CN101916182B (en) * 2009-09-09 2014-08-20 威盛电子股份有限公司 Transmission of fast floating point result using non-architected data format
GB201111035D0 (en) * 2011-06-29 2011-08-10 Advanced Risc Mach Ltd Floating point adder
WO2013100783A1 (en) 2011-12-29 2013-07-04 Intel Corporation Method and system for control signalling in a data path module
US10331583B2 (en) 2013-09-26 2019-06-25 Intel Corporation Executing distributed memory operations using processing elements connected by distributed channels
US9582248B2 (en) * 2014-09-26 2017-02-28 Arm Limited Standalone floating-point conversion unit
US10402168B2 (en) 2016-10-01 2019-09-03 Intel Corporation Low energy consumption mantissa multiplication for floating point multiply-add operations
CN106557299B (en) * 2016-11-30 2019-08-30 上海兆芯集成电路有限公司 Floating-point operation number calculating method and the device for using the method
US10061579B1 (en) * 2016-12-02 2018-08-28 Intel Corporation Distributed double-precision floating-point addition
US10031752B1 (en) 2016-12-02 2018-07-24 Intel Corporation Distributed double-precision floating-point addition
US10474375B2 (en) 2016-12-30 2019-11-12 Intel Corporation Runtime address disambiguation in acceleration hardware
US10416999B2 (en) 2016-12-30 2019-09-17 Intel Corporation Processors, methods, and systems with a configurable spatial accelerator
US10558575B2 (en) 2016-12-30 2020-02-11 Intel Corporation Processors, methods, and systems with a configurable spatial accelerator
US10572376B2 (en) 2016-12-30 2020-02-25 Intel Corporation Memory ordering in acceleration hardware
US10467183B2 (en) 2017-07-01 2019-11-05 Intel Corporation Processors and methods for pipelined runtime services in a spatial array
US10515046B2 (en) 2017-07-01 2019-12-24 Intel Corporation Processors, methods, and systems with a configurable spatial accelerator
US10469397B2 (en) 2017-07-01 2019-11-05 Intel Corporation Processors and methods with configurable network-based dataflow operator circuits
US10387319B2 (en) 2017-07-01 2019-08-20 Intel Corporation Processors, methods, and systems for a configurable spatial accelerator with memory system performance, power reduction, and atomics support features
US10445451B2 (en) 2017-07-01 2019-10-15 Intel Corporation Processors, methods, and systems for a configurable spatial accelerator with performance, correctness, and power reduction features
US10515049B1 (en) 2017-07-01 2019-12-24 Intel Corporation Memory circuits and methods for distributed memory hazard detection and error recovery
US10445234B2 (en) 2017-07-01 2019-10-15 Intel Corporation Processors, methods, and systems for a configurable spatial accelerator with transactional and replay features
US10496574B2 (en) 2017-09-28 2019-12-03 Intel Corporation Processors, methods, and systems for a memory fence in a configurable spatial accelerator
US11086816B2 (en) 2017-09-28 2021-08-10 Intel Corporation Processors, methods, and systems for debugging a configurable spatial accelerator
US10445098B2 (en) 2017-09-30 2019-10-15 Intel Corporation Processors and methods for privileged configuration in a spatial array
US10380063B2 (en) 2017-09-30 2019-08-13 Intel Corporation Processors, methods, and systems with a configurable spatial accelerator having a sequencer dataflow operator
US10445250B2 (en) 2017-12-30 2019-10-15 Intel Corporation Apparatus, methods, and systems with a configurable spatial accelerator
US10565134B2 (en) 2017-12-30 2020-02-18 Intel Corporation Apparatus, methods, and systems for multicast in a configurable spatial accelerator
US10417175B2 (en) 2017-12-30 2019-09-17 Intel Corporation Apparatus, methods, and systems for memory consistency in a configurable spatial accelerator
US10564980B2 (en) 2018-04-03 2020-02-18 Intel Corporation Apparatus, methods, and systems for conditional queues in a configurable spatial accelerator
US11307873B2 (en) 2018-04-03 2022-04-19 Intel Corporation Apparatus, methods, and systems for unstructured data flow in a configurable spatial accelerator with predicate propagation and merging
US11200186B2 (en) 2018-06-30 2021-12-14 Intel Corporation Apparatuses, methods, and systems for operations in a configurable spatial accelerator
US10891240B2 (en) 2018-06-30 2021-01-12 Intel Corporation Apparatus, methods, and systems for low latency communication in a configurable spatial accelerator
US10853073B2 (en) 2018-06-30 2020-12-01 Intel Corporation Apparatuses, methods, and systems for conditional operations in a configurable spatial accelerator
US10459866B1 (en) 2018-06-30 2019-10-29 Intel Corporation Apparatuses, methods, and systems for integrated control and data processing in a configurable spatial accelerator
US10678724B1 (en) 2018-12-29 2020-06-09 Intel Corporation Apparatuses, methods, and systems for in-network storage in a configurable spatial accelerator
US10965536B2 (en) 2019-03-30 2021-03-30 Intel Corporation Methods and apparatus to insert buffers in a dataflow graph
US11029927B2 (en) 2019-03-30 2021-06-08 Intel Corporation Methods and apparatus to detect and annotate backedges in a dataflow graph
US10915471B2 (en) 2019-03-30 2021-02-09 Intel Corporation Apparatuses, methods, and systems for memory interface circuit allocation in a configurable spatial accelerator
US10817291B2 (en) 2019-03-30 2020-10-27 Intel Corporation Apparatuses, methods, and systems for swizzle operations in a configurable spatial accelerator
US11037050B2 (en) 2019-06-29 2021-06-15 Intel Corporation Apparatuses, methods, and systems for memory interface circuit arbitration in a configurable spatial accelerator
US11907713B2 (en) 2019-12-28 2024-02-20 Intel Corporation Apparatuses, methods, and systems for fused operations using sign modification in a processing element of a configurable spatial accelerator
US12086080B2 (en) 2020-09-26 2024-09-10 Intel Corporation Apparatuses, methods, and systems for a configurable accelerator having dataflow execution circuits

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4644490A (en) * 1983-04-11 1987-02-17 Hitachi, Ltd. Floating point data adder
US5808926A (en) * 1995-06-01 1998-09-15 Sun Microsystems, Inc. Floating point addition methods and apparatus
US20030065698A1 (en) * 2001-09-28 2003-04-03 Ford Richard L. Operand conversion optimization

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5940311A (en) * 1996-04-30 1999-08-17 Texas Instruments Incorporated Immediate floating-point operand reformatting in a microprocessor
US6493817B1 (en) * 1999-05-21 2002-12-10 Hewlett-Packard Company Floating-point unit which utilizes standard MAC units for performing SIMD operations
US6829627B2 (en) * 2001-01-18 2004-12-07 International Business Machines Corporation Floating point unit for multiple data architectures
US6889241B2 (en) * 2001-06-04 2005-05-03 Intel Corporation Floating point adder

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4644490A (en) * 1983-04-11 1987-02-17 Hitachi, Ltd. Floating point data adder
US5808926A (en) * 1995-06-01 1998-09-15 Sun Microsystems, Inc. Floating point addition methods and apparatus
US20030065698A1 (en) * 2001-09-28 2003-04-03 Ford Richard L. Operand conversion optimization

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013155744A1 (en) * 2012-04-20 2013-10-24 Huawei Technologies Co., Ltd. System and method for signal processing in digital signal processors
US9274750B2 (en) 2012-04-20 2016-03-01 Futurewei Technologies, Inc. System and method for signal processing in digital signal processors
WO2015096001A1 (en) * 2013-12-23 2015-07-02 Intel Corporation System-on-a-chip (soc) including hybrid processor cores

Also Published As

Publication number Publication date
US20080133895A1 (en) 2008-06-05
CN101438232B (en) 2015-10-21
CN101438232A (en) 2009-05-20
DE112006003875T5 (en) 2009-06-18

Similar Documents

Publication Publication Date Title
US20080133895A1 (en) Floating Point Addition
CN109643228B (en) Low energy mantissa multiplication for floating point multiply-add operations
KR101566257B1 (en) Reducing power consumption in a fused multiply-add (fma) unit responsive to input data values
US8447800B2 (en) Mode-based multiply-add recoding for denormal operands
US8577948B2 (en) Split path multiply accumulate unit
US9104474B2 (en) Variable precision floating point multiply-add circuit
US9110713B2 (en) Microarchitecture for floating point fused multiply-add with exponent scaling
US8103858B2 (en) Efficient parallel floating point exception handling in a processor
US8239440B2 (en) Processor which implements fused and unfused multiply-add instructions in a pipelined manner
US8769249B2 (en) Instructions with floating point control override
US8606840B2 (en) Apparatus and method for floating-point fused multiply add
US8838665B2 (en) Fast condition code generation for arithmetic logic unit
US11226791B2 (en) Arithmetic processing device and method of controlling arithmetic processing device that enables suppression of size of device
US8918446B2 (en) Reducing power consumption in multi-precision floating point multipliers
US9274752B2 (en) Leading change anticipator logic
Tsen et al. A combined decimal and binary floating-point multiplier
US7747669B2 (en) Rounding of binary integers
US10289386B2 (en) Iterative division with reduced latency
US9128759B2 (en) Decimal multi-precision overflow and tininess detection
US20240354057A1 (en) Processor circuitry to perform a fused multiply-add

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 10589448

Country of ref document: US

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 06835781

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 200680054583.X

Country of ref document: CN

RET De translation (de og part 6b)

Ref document number: 112006003875

Country of ref document: DE

Date of ref document: 20090618

Kind code of ref document: P

122 Ep: pct application non-entry in european phase

Ref document number: 06835781

Country of ref document: EP

Kind code of ref document: A1

REG Reference to national code

Ref country code: DE

Ref legal event code: 8607